* [PATCH] drm/i915: Use exponential backoff for wait_for()
@ 2017-11-21 15:24 Chris Wilson
2017-11-21 16:29 ` Sagar Arun Kamble
` (9 more replies)
0 siblings, 10 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 15:24 UTC (permalink / raw)
To: intel-gfx
Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
start with a small sleep and exponentially increase the sleep on each
cycle.
A good example of a beneficiary is the GuC MMIO communication channel.
Typically we expect (and so spin) for 10us for a quick response, but this
doesn't cover everything and so sometimes we fall back to the millisecond+
sleep. This incurs a significant delay in time-critical operations like
preemption (igt/gem_exec_latency), which can be improved significantly by
using a small sleep after the spin fails.
We've made this suggestion many times, but had little experimental data
to support adding the complexity.
References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: John Harrison <John.C.Harrison@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
---
drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 69aab324aaa1..c1ea9a009eb4 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -50,6 +50,7 @@
*/
#define _wait_for(COND, US, W) ({ \
unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
+ long wait__ = 1; \
int ret__; \
might_sleep(); \
for (;;) { \
@@ -62,7 +63,9 @@
ret__ = -ETIMEDOUT; \
break; \
} \
- usleep_range((W), (W) * 2); \
+ usleep_range(wait__, wait__ * 2); \
+ if (wait__ < (W)) \
+ wait__ <<= 1; \
} \
ret__; \
})
--
2.15.0
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
@ 2017-11-21 16:29 ` Sagar Arun Kamble
2017-11-21 16:33 ` Chris Wilson
2017-11-21 16:50 ` ✓ Fi.CI.BAT: success for " Patchwork
` (8 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-21 16:29 UTC (permalink / raw)
To: Chris Wilson, intel-gfx
On 11/21/2017 8:54 PM, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
>
> A good example of a beneficiary is the guc mmio communication channel.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
>
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
>
> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 69aab324aaa1..c1ea9a009eb4 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -50,6 +50,7 @@
> */
> #define _wait_for(COND, US, W) ({ \
> unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> + long wait__ = 1; \
> int ret__; \
> might_sleep(); \
> for (;;) { \
> @@ -62,7 +63,9 @@
> ret__ = -ETIMEDOUT; \
> break; \
> } \
> - usleep_range((W), (W) * 2); \
> + usleep_range(wait__, wait__ * 2); \
> + if (wait__ < (W)) \
This should be "wait__ < (US)" ?
> + wait__ <<= 1; \
> } \
> ret__; \
> })
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 16:29 ` Sagar Arun Kamble
@ 2017-11-21 16:33 ` Chris Wilson
2017-11-21 16:49 ` Sagar Arun Kamble
0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 16:33 UTC (permalink / raw)
To: Sagar Arun Kamble, intel-gfx
Quoting Sagar Arun Kamble (2017-11-21 16:29:57)
>
>
> On 11/21/2017 8:54 PM, Chris Wilson wrote:
> > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > start with a small sleep and exponentially increase the sleep on each
> > cycle.
> >
> > A good example of a beneficiary is the guc mmio communication channel.
> > Typically we expect (and so spin) for 10us for a quick response, but this
> > doesn't cover everything and so sometimes we fallback to the millisecond+
> > sleep. This incurs a significant delay in time-critical operations like
> > preemption (igt/gem_exec_latency), which can be improved significantly by
> > using a small sleep after the spin fails.
> >
> > We've made this suggestion many times, but had little experimental data
> > to support adding the complexity.
> >
> > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: John Harrison <John.C.Harrison@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > index 69aab324aaa1..c1ea9a009eb4 100644
> > --- a/drivers/gpu/drm/i915/intel_drv.h
> > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > @@ -50,6 +50,7 @@
> > */
> > #define _wait_for(COND, US, W) ({ \
> > unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> > + long wait__ = 1; \
> > int ret__; \
> > might_sleep(); \
> > for (;;) { \
> > @@ -62,7 +63,9 @@
> > ret__ = -ETIMEDOUT; \
> > break; \
> > } \
> > - usleep_range((W), (W) * 2); \
> > + usleep_range(wait__, wait__ * 2); \
> > + if (wait__ < (W)) \
> This should be "wait__ < (US)" ?
US is the total timeout, W is the sleep, now used to provide a cap
(minimum frequency of polling).
-Chris
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 16:33 ` Chris Wilson
@ 2017-11-21 16:49 ` Sagar Arun Kamble
2017-11-21 20:58 ` Chris Wilson
0 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-21 16:49 UTC (permalink / raw)
To: Chris Wilson, intel-gfx
On 11/21/2017 10:03 PM, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-21 16:29:57)
>>
>> On 11/21/2017 8:54 PM, Chris Wilson wrote:
>>> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
>>> start with a small sleep and exponentially increase the sleep on each
>>> cycle.
>>>
>>> A good example of a beneficiary is the guc mmio communication channel.
>>> Typically we expect (and so spin) for 10us for a quick response, but this
>>> doesn't cover everything and so sometimes we fallback to the millisecond+
>>> sleep. This incurs a significant delay in time-critical operations like
>>> preemption (igt/gem_exec_latency), which can be improved significantly by
>>> using a small sleep after the spin fails.
>>>
>>> We've made this suggestion many times, but had little experimental data
>>> to support adding the complexity.
>>>
>>> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: John Harrison <John.C.Harrison@intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
>>> index 69aab324aaa1..c1ea9a009eb4 100644
>>> --- a/drivers/gpu/drm/i915/intel_drv.h
>>> +++ b/drivers/gpu/drm/i915/intel_drv.h
>>> @@ -50,6 +50,7 @@
>>> */
>>> #define _wait_for(COND, US, W) ({ \
>>> unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
>>> + long wait__ = 1; \
>>> int ret__; \
>>> might_sleep(); \
>>> for (;;) { \
>>> @@ -62,7 +63,9 @@
>>> ret__ = -ETIMEDOUT; \
>>> break; \
>>> } \
>>> - usleep_range((W), (W) * 2); \
>>> + usleep_range(wait__, wait__ * 2); \
>>> + if (wait__ < (W)) \
>> This should be "wait__ < (US)" ?
> US is the total timeout, W is the sleep, now used to provide a cap
> (minimum frequency of polling).
Oh, right. I thought we wanted to exponentially run down the total
timeout. That might not be the correct approach.
Hybrid polling might be useful here, but we would need to know precisely,
for each use of wait_for, the expected wait time.
Then we could sleep for half the total time and then poll :)
Should users needing faster completion use wait_for_atomic with a higher
timeout in us?
> -Chris
* ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for()
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
2017-11-21 16:29 ` Sagar Arun Kamble
@ 2017-11-21 16:50 ` Patchwork
2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
` (7 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 16:50 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Use exponential backoff for wait_for()
URL : https://patchwork.freedesktop.org/series/34174/
State : success
== Summary ==
Series 34174v1 drm/i915: Use exponential backoff for wait_for()
https://patchwork.freedesktop.org/api/1.0/series/34174/revisions/1/mbox/
Test chamelium:
        Subgroup dp-crc-fast:
                fail       -> PASS       (fi-kbl-7500u) fdo#103163
Test gem_exec_reloc:
        Subgroup basic-gtt-active:
                pass       -> FAIL       (fi-gdg-551) fdo#102582 +3
Test kms_cursor_legacy:
        Subgroup basic-busy-flip-before-cursor-legacy:
                pass       -> FAIL       (fi-gdg-551) fdo#102618
fdo#103163 https://bugs.freedesktop.org/show_bug.cgi?id=103163
fdo#102582 https://bugs.freedesktop.org/show_bug.cgi?id=102582
fdo#102618 https://bugs.freedesktop.org/show_bug.cgi?id=102618
fi-bdw-5557u total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:446s
fi-bdw-gvtdvm total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:447s
fi-blb-e6850 total:289 pass:223 dwarn:1 dfail:0 fail:0 skip:65 time:382s
fi-bsw-n3050 total:289 pass:243 dwarn:0 dfail:0 fail:0 skip:46 time:544s
fi-bxt-dsi total:289 pass:259 dwarn:0 dfail:0 fail:0 skip:30 time:505s
fi-bxt-j4205 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:510s
fi-byt-j1900 total:289 pass:254 dwarn:0 dfail:0 fail:0 skip:35 time:490s
fi-byt-n2820 total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:487s
fi-cfl-s2 total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:613s
fi-elk-e7500 total:289 pass:229 dwarn:0 dfail:0 fail:0 skip:60 time:425s
fi-gdg-551 total:289 pass:173 dwarn:1 dfail:0 fail:6 skip:109 time:275s
fi-glk-1 total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:544s
fi-hsw-4770 total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:427s
fi-hsw-4770r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:434s
fi-ilk-650 total:289 pass:228 dwarn:0 dfail:0 fail:0 skip:61 time:431s
fi-ivb-3520m total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:471s
fi-ivb-3770 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:461s
fi-kbl-7500u total:289 pass:264 dwarn:1 dfail:0 fail:0 skip:24 time:482s
fi-kbl-7560u total:289 pass:270 dwarn:0 dfail:0 fail:0 skip:19 time:531s
fi-kbl-7567u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:476s
fi-kbl-r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:534s
fi-pnv-d510 total:289 pass:222 dwarn:1 dfail:0 fail:0 skip:66 time:576s
fi-skl-6260u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:447s
fi-skl-6600u total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:548s
fi-skl-6700hq total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:568s
fi-skl-6700k total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:510s
fi-skl-6770hq total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:501s
fi-skl-gvtdvm total:289 pass:266 dwarn:0 dfail:0 fail:0 skip:23 time:459s
fi-snb-2520m total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:553s
fi-snb-2600 total:289 pass:249 dwarn:0 dfail:0 fail:0 skip:40 time:416s
Blacklisted hosts:
fi-cnl-y total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:555s
fi-glk-dsi total:289 pass:258 dwarn:0 dfail:0 fail:1 skip:30 time:504s
fi-bwr-2160 failed to collect. IGT log at Patchwork_7223/fi-bwr-2160/igt.log
f71044153bec6b5fbde984769d97b38fccf06722 drm-tip: 2017y-11m-21d-10h-48m-53s UTC integration manifest
151e3f64cfea drm/i915: Use exponential backoff for wait_for()
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7223/
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
2017-11-21 16:29 ` Sagar Arun Kamble
2017-11-21 16:50 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2017-11-21 17:00 ` Tvrtko Ursulin
2017-11-21 17:11 ` Chris Wilson
2017-11-21 17:36 ` [PATCH v2] " Chris Wilson
` (6 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Tvrtko Ursulin @ 2017-11-21 17:00 UTC (permalink / raw)
To: Chris Wilson, intel-gfx
On 21/11/2017 15:24, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
>
> A good example of a beneficiary is the guc mmio communication channel.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
>
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
>
> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 69aab324aaa1..c1ea9a009eb4 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -50,6 +50,7 @@
> */
> #define _wait_for(COND, US, W) ({ \
> unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> + long wait__ = 1; \
> int ret__; \
> might_sleep(); \
> for (;;) { \
> @@ -62,7 +63,9 @@
> ret__ = -ETIMEDOUT; \
> break; \
> } \
> - usleep_range((W), (W) * 2); \
> + usleep_range(wait__, wait__ * 2); \
> + if (wait__ < (W)) \
> + wait__ <<= 1; \
> } \
> ret__; \
> })
>
I would start the period at 10us since a) <10us is not recommended for
the usleep family, and b) most callers specify ms timeouts, so a <10us
poll is perhaps overkill.
Latency-sensitive callers like __intel_wait_for_register_us can be
tweaked at the call site to provide what they want.
For the actual GuC MMIO send, it sounds like it should pass 20us to
__intel_wait_for_register_us (referring to John's explanation email) to
cover 99% of the cases. And then the remaining 1% could be fine with a
10us delay?
Otherwise we are effectively making _wait_for partially busy-loop, or
whatever the inefficiency of <10us usleep is. I mean, it makes no
practical difference to make a handful of quick loops there, but it feels
a bit inelegant.
Regards,
Tvrtko
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
@ 2017-11-21 17:11 ` Chris Wilson
2017-11-21 17:29 ` Ville Syrjälä
0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 17:11 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx
Quoting Tvrtko Ursulin (2017-11-21 17:00:00)
>
> On 21/11/2017 15:24, Chris Wilson wrote:
> > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > start with a small sleep and exponentially increase the sleep on each
> > cycle.
> >
> > A good example of a beneficiary is the guc mmio communication channel.
> > Typically we expect (and so spin) for 10us for a quick response, but this
> > doesn't cover everything and so sometimes we fallback to the millisecond+
> > sleep. This incurs a significant delay in time-critical operations like
> > preemption (igt/gem_exec_latency), which can be improved significantly by
> > using a small sleep after the spin fails.
> >
> > We've made this suggestion many times, but had little experimental data
> > to support adding the complexity.
> >
> > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: John Harrison <John.C.Harrison@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > index 69aab324aaa1..c1ea9a009eb4 100644
> > --- a/drivers/gpu/drm/i915/intel_drv.h
> > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > @@ -50,6 +50,7 @@
> > */
> > #define _wait_for(COND, US, W) ({ \
> > unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> > + long wait__ = 1; \
> > int ret__; \
> > might_sleep(); \
> > for (;;) { \
> > @@ -62,7 +63,9 @@
> > ret__ = -ETIMEDOUT; \
> > break; \
> > } \
> > - usleep_range((W), (W) * 2); \
> > + usleep_range(wait__, wait__ * 2); \
> > + if (wait__ < (W)) \
> > + wait__ <<= 1; \
> > } \
> > ret__; \
> > })
> >
>
> I would start the period at 10us since a) <10us is not recommended for
> usleep family, b) most callers specify ms timeouts so <10us poll is
> perhaps an overkill.
Don't forget the majority of the callers are now via wait_for_register
and so have that 10us poll prior to the sleeping wait_for().
If there's an ARCH_USLEEP_MIN_WHATNOT that would be useful.
> Latency sensitive callers like __intel_wait_for_register_us can be
> tweaked at the call site to provide what they want.
>
> For the actual guc mmio send it sounds like it should pass in 20us to
> __intel_wait_for_register_us (referring to John's explanation email) to
> cover 99% of the cases. And then the remaining 1% could be fine with a
> 10us delay?
That it fixed that was a side-effect ;) It just happened to be something
that I could measure the latency of in userspace. I'd rather we have
something generic that does have a demonstrable improvement.
> Otherwise we are effectively making _wait_for partially busy looping, or
> whatever the inefficiency in <10us usleep is. I mean, it makes no
> practical difference to make a handful of quick loops there but it feels
> a bit inelegant.
We already do use the hybrid busy-looping for wait_for_register (and
everybody is meant to be using wait_for_register, the exceptions should
be rare and well justified). The purpose of referencing 1758b90e38f5
("drm/i915: Use a hybrid scheme for fast register waits") was to remind
ourselves of the scheme and its benefits.
-Chris
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 17:11 ` Chris Wilson
@ 2017-11-21 17:29 ` Ville Syrjälä
2017-11-21 20:40 ` Chris Wilson
0 siblings, 1 reply; 28+ messages in thread
From: Ville Syrjälä @ 2017-11-21 17:29 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
On Tue, Nov 21, 2017 at 05:11:29PM +0000, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-11-21 17:00:00)
> >
> > On 21/11/2017 15:24, Chris Wilson wrote:
> > > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > > start with a small sleep and exponentially increase the sleep on each
> > > cycle.
> > >
> > > A good example of a beneficiary is the guc mmio communication channel.
> > > Typically we expect (and so spin) for 10us for a quick response, but this
> > > doesn't cover everything and so sometimes we fallback to the millisecond+
> > > sleep. This incurs a significant delay in time-critical operations like
> > > preemption (igt/gem_exec_latency), which can be improved significantly by
> > > using a small sleep after the spin fails.
> > >
> > > We've made this suggestion many times, but had little experimental data
> > > to support adding the complexity.
> > >
> > > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: John Harrison <John.C.Harrison@intel.com>
> > > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > > ---
> > > drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > > index 69aab324aaa1..c1ea9a009eb4 100644
> > > --- a/drivers/gpu/drm/i915/intel_drv.h
> > > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > > @@ -50,6 +50,7 @@
> > > */
> > > #define _wait_for(COND, US, W) ({ \
> > > unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> > > + long wait__ = 1; \
> > > int ret__; \
> > > might_sleep(); \
> > > for (;;) { \
> > > @@ -62,7 +63,9 @@
> > > ret__ = -ETIMEDOUT; \
> > > break; \
> > > } \
> > > - usleep_range((W), (W) * 2); \
> > > + usleep_range(wait__, wait__ * 2); \
> > > + if (wait__ < (W)) \
> > > + wait__ <<= 1; \
> > > } \
> > > ret__; \
> > > })
> > >
> >
> > I would start the period at 10us since a) <10us is not recommended for
> > usleep family, b) most callers specify ms timeouts so <10us poll is
> > perhaps an overkill.
>
> Don't forget the majority of the callers are now via wait_for_register
> and so have that 10us poll prior to the sleeping wait_for().
>
> If there's an ARCH_USLEEP_MIN_WHATNOT that would be useful.
>
> > Latency sensitive callers like __intel_wait_for_register_us can be
> > tweaked at the call site to provide what they want.
> >
> > For the actual guc mmio send it sounds like it should pass in 20us to
> > __intel_wait_for_register_us (referring to John's explanation email) to
> > cover 99% of the cases. And then the remaining 1% could be fine with a
> > 10us delay?
>
> That it fixed that was a side-effect ;) It just happened to be something
> that I could measure the latency of in userspace. I'd rather we have
> something generic that does have a demonstrable improvement.
>
> > Otherwise we are effectively making _wait_for partially busy looping, or
> > whatever the inefficiency in <10us usleep is. I mean, it makes no
> > practical difference to make a handful of quick loops there but it feels
> > a bit inelegant.
>
> We already do use the hybrid busy-looping for wait_for_register (and
> everybody is meant to be using wait_for_register, the exceptions should
> be rare and well justified).
We do have to bear 96dabe99cae8 ("drm/i915: Avoid busy-spinning on
VLV_GLTC_PW_STATUS mmio") in mind. Touching wait_for() might break
that again. Probably the correct choice for these exceptions would
be to specify a minimum polling interval explicitly.
The other cases that come to mind are the wait_for_pipe_off() type of
things which generally measure several milliseconds. Faster polling
won't really help those cases that much except if the pipe already
happens to be close to vblank when we shut it down. But as long as
the max polling interval gets capped to a few milliseconds I guess
the exponential backoff shouldn't hurt us either.
--
Ville Syrjälä
Intel OTC
* [PATCH v2] drm/i915: Use exponential backoff for wait_for()
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (2 preceding siblings ...)
2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
@ 2017-11-21 17:36 ` Chris Wilson
2017-11-21 17:41 ` ✓ Fi.CI.IGT: success for " Patchwork
` (5 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 17:36 UTC (permalink / raw)
To: intel-gfx
Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
start with a small sleep and exponentially increase the sleep on each
cycle.
A good example of a beneficiary is the GuC MMIO communication channel.
Typically we expect (and so spin) for 10us for a quick response, but this
doesn't cover everything and so sometimes we fall back to the millisecond+
sleep. This incurs a significant delay in time-critical operations like
preemption (igt/gem_exec_latency), which can be improved significantly by
using a small sleep after the spin fails.
We've made this suggestion many times, but had little experimental data
to support adding the complexity.
v2: Bump the minimum usleep to 10us on advice of
Documentation/timers/timers-howto.txt (Tvrtko)
References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: John Harrison <John.C.Harrison@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
---
drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 69aab324aaa1..3e115599c2e6 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -50,6 +50,7 @@
*/
#define _wait_for(COND, US, W) ({ \
unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
+ long wait__ = 10; /* recommended minimum for usleep */ \
int ret__; \
might_sleep(); \
for (;;) { \
@@ -62,7 +63,9 @@
ret__ = -ETIMEDOUT; \
break; \
} \
- usleep_range((W), (W) * 2); \
+ usleep_range(wait__, wait__ * 2); \
+ if (wait__ < (W)) \
+ wait__ <<= 1; \
} \
ret__; \
})
--
2.15.0
* ✓ Fi.CI.IGT: success for drm/i915: Use exponential backoff for wait_for()
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (3 preceding siblings ...)
2017-11-21 17:36 ` [PATCH v2] " Chris Wilson
@ 2017-11-21 17:41 ` Patchwork
2017-11-21 18:03 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2) Patchwork
` (4 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 17:41 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Use exponential backoff for wait_for()
URL : https://patchwork.freedesktop.org/series/34174/
State : success
== Summary ==
Test drv_selftest:
        Subgroup mock_sanitycheck:
                pass       -> DMESG-WARN (shard-snb) fdo#103717
Test kms_setmode:
        Subgroup basic:
                pass       -> FAIL       (shard-hsw) fdo#99912
Test gem_exec_reloc:
        Subgroup basic-write-cpu-noreloc:
                incomplete -> PASS       (shard-snb)
fdo#103717 https://bugs.freedesktop.org/show_bug.cgi?id=103717
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
shard-hsw total:2585 pass:1473 dwarn:2 dfail:1 fail:10 skip:1099 time:9427s
shard-snb total:2585 pass:1258 dwarn:2 dfail:1 fail:11 skip:1313 time:7929s
Blacklisted hosts:
shard-apl total:2555 pass:1599 dwarn:2 dfail:0 fail:22 skip:930 time:12175s
shard-kbl total:2481 pass:1644 dwarn:5 dfail:1 fail:21 skip:807 time:9917s
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7223/shards.html
* ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2)
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (4 preceding siblings ...)
2017-11-21 17:41 ` ✓ Fi.CI.IGT: success for " Patchwork
@ 2017-11-21 18:03 ` Patchwork
2017-11-21 19:07 ` ✓ Fi.CI.IGT: " Patchwork
` (3 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 18:03 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Use exponential backoff for wait_for() (rev2)
URL : https://patchwork.freedesktop.org/series/34174/
State : success
== Summary ==
Series 34174v2 drm/i915: Use exponential backoff for wait_for()
https://patchwork.freedesktop.org/api/1.0/series/34174/revisions/2/mbox/
Test chamelium:
        Subgroup dp-crc-fast:
                fail       -> PASS       (fi-kbl-7500u) fdo#103163
fdo#103163 https://bugs.freedesktop.org/show_bug.cgi?id=103163
fi-bdw-5557u total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:446s
fi-bdw-gvtdvm total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:450s
fi-blb-e6850 total:289 pass:223 dwarn:1 dfail:0 fail:0 skip:65 time:379s
fi-bsw-n3050 total:289 pass:243 dwarn:0 dfail:0 fail:0 skip:46 time:524s
fi-bwr-2160 total:289 pass:183 dwarn:0 dfail:0 fail:0 skip:106 time:277s
fi-bxt-dsi total:289 pass:259 dwarn:0 dfail:0 fail:0 skip:30 time:503s
fi-bxt-j4205 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:509s
fi-byt-j1900 total:289 pass:254 dwarn:0 dfail:0 fail:0 skip:35 time:486s
fi-byt-n2820 total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:483s
fi-cfl-s2 total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:606s
fi-elk-e7500 total:289 pass:229 dwarn:0 dfail:0 fail:0 skip:60 time:419s
fi-gdg-551 total:289 pass:178 dwarn:1 dfail:0 fail:1 skip:109 time:264s
fi-glk-1 total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:540s
fi-hsw-4770 total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:424s
fi-hsw-4770r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:434s
fi-ilk-650 total:289 pass:228 dwarn:0 dfail:0 fail:0 skip:61 time:430s
fi-ivb-3520m total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:482s
fi-ivb-3770 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:460s
fi-kbl-7500u total:289 pass:264 dwarn:1 dfail:0 fail:0 skip:24 time:487s
fi-kbl-7560u total:289 pass:270 dwarn:0 dfail:0 fail:0 skip:19 time:531s
fi-kbl-7567u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:482s
fi-kbl-r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:534s
fi-pnv-d510 total:289 pass:222 dwarn:1 dfail:0 fail:0 skip:66 time:579s
fi-skl-6260u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:446s
fi-skl-6600u total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:542s
fi-skl-6700hq total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:567s
fi-skl-6700k total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:512s
fi-skl-6770hq total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:504s
fi-skl-gvtdvm total:289 pass:266 dwarn:0 dfail:0 fail:0 skip:23 time:453s
fi-snb-2520m total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:549s
fi-snb-2600 total:289 pass:249 dwarn:0 dfail:0 fail:0 skip:40 time:418s
Blacklisted hosts:
fi-cnl-y total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:549s
fi-glk-dsi total:32 pass:22 dwarn:0 dfail:0 fail:0 skip:9
f71044153bec6b5fbde984769d97b38fccf06722 drm-tip: 2017y-11m-21d-10h-48m-53s UTC integration manifest
6ef981ecbe0b drm/i915: Use exponential backoff for wait_for()
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7224/
* ✓ Fi.CI.IGT: success for drm/i915: Use exponential backoff for wait_for() (rev2)
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (5 preceding siblings ...)
2017-11-21 18:03 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2) Patchwork
@ 2017-11-21 19:07 ` Patchwork
2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (2 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 19:07 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Use exponential backoff for wait_for() (rev2)
URL : https://patchwork.freedesktop.org/series/34174/
State : success
== Summary ==
Test drv_selftest:
Subgroup mock_sanitycheck:
pass -> DMESG-WARN (shard-snb) fdo#103717
dmesg-warn -> PASS (shard-hsw) fdo#103719
Test gem_exec_reloc:
Subgroup basic-write-cpu-noreloc:
incomplete -> PASS (shard-snb)
Test kms_setmode:
Subgroup basic:
pass -> FAIL (shard-hsw) fdo#99912
fdo#103717 https://bugs.freedesktop.org/show_bug.cgi?id=103717
fdo#103719 https://bugs.freedesktop.org/show_bug.cgi?id=103719
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
shard-hsw total:2585 pass:1474 dwarn:1 dfail:1 fail:10 skip:1099 time:9419s
shard-snb total:2585 pass:1258 dwarn:2 dfail:1 fail:11 skip:1313 time:8011s
Blacklisted hosts:
shard-apl total:2565 pass:1604 dwarn:2 dfail:0 fail:22 skip:936 time:12844s
shard-kbl total:2500 pass:1656 dwarn:5 dfail:1 fail:25 skip:811 time:10126s
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7224/shards.html
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 17:29 ` Ville Syrjälä
@ 2017-11-21 20:40 ` Chris Wilson
0 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 20:40 UTC (permalink / raw)
To: Ville Syrjälä; +Cc: intel-gfx
Quoting Ville Syrjälä (2017-11-21 17:29:43)
> On Tue, Nov 21, 2017 at 05:11:29PM +0000, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2017-11-21 17:00:00)
> > >
> > > On 21/11/2017 15:24, Chris Wilson wrote:
> > > > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > > > start with a small sleep and exponentially increase the sleep on each
> > > > cycle.
> > > >
> > > > A good example of a beneficiary is the guc mmio communication channel.
> > > > Typically we expect (and so spin) for 10us for a quick response, but this
> > > > doesn't cover everything and so sometimes we fallback to the millisecond+
> > > > sleep. This incurs a significant delay in time-critical operations like
> > > > preemption (igt/gem_exec_latency), which can be improved significantly by
> > > > using a small sleep after the spin fails.
> > > >
> > > > We've made this suggestion many times, but had little experimental data
> > > > to support adding the complexity.
> > > >
> > > > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > Cc: John Harrison <John.C.Harrison@intel.com>
> > > > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > > > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > > > ---
> > > > drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > > > index 69aab324aaa1..c1ea9a009eb4 100644
> > > > --- a/drivers/gpu/drm/i915/intel_drv.h
> > > > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > > > @@ -50,6 +50,7 @@
> > > > */
> > > > #define _wait_for(COND, US, W) ({ \
> > > > unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> > > > + long wait__ = 1; \
> > > > int ret__; \
> > > > might_sleep(); \
> > > > for (;;) { \
> > > > @@ -62,7 +63,9 @@
> > > > ret__ = -ETIMEDOUT; \
> > > > break; \
> > > > } \
> > > > - usleep_range((W), (W) * 2); \
> > > > + usleep_range(wait__, wait__ * 2); \
> > > > + if (wait__ < (W)) \
> > > > + wait__ <<= 1; \
> > > > } \
> > > > ret__; \
> > > > })
> > > >
> > >
> > > I would start the period at 10us since a) <10us is not recommended for
> > > usleep family, b) most callers specify ms timeouts so <10us poll is
> > > perhaps an overkill.
> >
> > Don't forget the majority of the callers are now via wait_for_register
> > and so have that 10us poll prior to the sleeping wait_for().
> >
> > If there's an ARCH_USLEEP_MIN_WHATNOT that would be useful.
> >
> > > Latency sensitive callers like __intel_wait_for_register_us can be
> > > tweaked at the call site to provide what they want.
> > >
> > > For the actual guc mmio send it sounds like it should pass in 20us to
> > > __intel_wait_for_register_us (referring to John's explanation email) to
> > > cover 99% of the cases. And then the remaining 1% could be fine with a
> > > 10us delay?
> >
> > That it fixed that was a side-effect ;) It just happened to be something
> > that I could measure the latency of in userspace. I'd rather we have
> > something generic that does have a demonstrable improvement.
> >
> > > Otherwise we are effectively making _wait_for partially busy looping, or
> > > whatever the inefficiency in <10us usleep is. I mean, it makes no
> > > practical difference to make a handful of quick loops there but it feels
> > > a bit inelegant.
> >
> > We already do use the hybrid busy-looping for wait_for_register (and
> > everybody is meant to be using wait_for_register, the exceptions should
> > be rare and well justified).
>
> We do have to bear 96dabe99cae8 ("drm/i915: Avoid busy-spinning on
> VLV_GLTC_PW_STATUS mmio") in mind. Touching wait_for() might break
> that again. Probably the correct choice for these exceptions would
> be to specify a minimum polling interval explicitly.
Fortunately, that was triggered by BAT:
https://bugs.freedesktop.org/show_bug.cgi?id=100718#c4
> Close to 100% reproducing rate on BYT-n2820 and BSW-n3050 with
> DRM-Tip:
> igt@pm_rpm@gem-execbuf-stress
> igt@pm_rpm@gem-execbuf-stress-extra-wait
Hopefully, the timing hasn't changed significantly since then and we can
rely on current CI to trip over it if the usleep interval is too small.
Thanks for the reminder!
> The other cases that come to mind are the wait_for_pipe_off() type of
> things which generally measure several milliseconds. Faster polling
> won't really help those cases that much except if the pipe already
> happens to be close to vblank when we shut it down. But as long as
> the max polling interval gets capped to a few milliseconds I guess
> the exponential backoff shouldn't hurt us either.
Indeed. Starting at 10us, it'll take us 8 cycles (total of 2.5ms) to hit
the timeout, so 7 more loops than required. Likely in the noise, but
(Wmin, Wmax) is an easy improvement.
-Chris
* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
2017-11-21 16:49 ` Sagar Arun Kamble
@ 2017-11-21 20:58 ` Chris Wilson
0 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 20:58 UTC (permalink / raw)
To: Sagar Arun Kamble, intel-gfx
Quoting Sagar Arun Kamble (2017-11-21 16:49:37)
>
>
> On 11/21/2017 10:03 PM, Chris Wilson wrote:
> > Quoting Sagar Arun Kamble (2017-11-21 16:29:57)
> >>
> >> On 11/21/2017 8:54 PM, Chris Wilson wrote:
> >>> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> >>> start with a small sleep and exponentially increase the sleep on each
> >>> cycle.
> >>>
> >>> A good example of a beneficiary is the guc mmio communication channel.
> >>> Typically we expect (and so spin) for 10us for a quick response, but this
> >>> doesn't cover everything and so sometimes we fallback to the millisecond+
> >>> sleep. This incurs a significant delay in time-critical operations like
> >>> preemption (igt/gem_exec_latency), which can be improved significantly by
> >>> using a small sleep after the spin fails.
> >>>
> >>> We've made this suggestion many times, but had little experimental data
> >>> to support adding the complexity.
> >>>
> >>> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> Cc: John Harrison <John.C.Harrison@intel.com>
> >>> Cc: Michał Winiarski <michal.winiarski@intel.com>
> >>> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> >>> ---
> >>> drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> >>> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> >>> index 69aab324aaa1..c1ea9a009eb4 100644
> >>> --- a/drivers/gpu/drm/i915/intel_drv.h
> >>> +++ b/drivers/gpu/drm/i915/intel_drv.h
> >>> @@ -50,6 +50,7 @@
> >>> */
> >>> #define _wait_for(COND, US, W) ({ \
> >>> unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> >>> + long wait__ = 1; \
> >>> int ret__; \
> >>> might_sleep(); \
> >>> for (;;) { \
> >>> @@ -62,7 +63,9 @@
> >>> ret__ = -ETIMEDOUT; \
> >>> break; \
> >>> } \
> >>> - usleep_range((W), (W) * 2); \
> >>> + usleep_range(wait__, wait__ * 2); \
> >>> + if (wait__ < (W)) \
> >> This should be "wait__ < (US)" ?
> > US is the total timeout, W is the sleep, now used to provide a cap
> > (minimum frequency of polling).
> Oh, right. I thought we wanted to exponentially run down the total
> timeout; that might not be the correct approach.
> Hybrid polling might be useful here, but we would need to know precisely,
> for each use of wait_for, the expected wait time.
> Then we could sleep half the total time and then poll :)
> Should users needing faster completion use wait_for_atomic with a higher
> timeout in us?
The majority of users are expected to use intel_wait_for_register*() so
that we don't have large numbers of this large macro expansion in the
code. There we have put some basic attempt at spin/poll. Doing the wait
for half the expected period then spin would be fun -- except that we
have no idea what the expected period is in most cases :)
If you did have an expected period, then usleep(expected_period >>= 1)
would be a good heuristic indeed.
if (expected) {
	do {
		int tmp;

		if (COND) ...
		tmp = expected >> 1;
		usleep_range(tmp, expected);
		expected = tmp;
	} while (expected);
}
(then the current wait_for loops)
But we would also need to feed that knowledge in from somewhere. (Not a
bad idea, just needs a use case to start the ball rolling. Wait for vblank
might be one?)
Instead, our assumption is that mmio completes and is acked immediately,
and we back off from that assumption when it fails.
-Chris
* [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (6 preceding siblings ...)
2017-11-21 19:07 ` ✓ Fi.CI.IGT: " Patchwork
@ 2017-11-21 20:59 ` Chris Wilson
2017-11-22 7:41 ` Sagar Arun Kamble
2017-11-24 12:37 ` Michał Winiarski
2017-11-21 21:32 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3) Patchwork
2017-11-21 22:41 ` ✗ Fi.CI.IGT: warning " Patchwork
9 siblings, 2 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 20:59 UTC (permalink / raw)
To: intel-gfx
Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
start with a small sleep and exponentially increase the sleep on each
cycle.
A good example of a beneficiary is the guc mmio communication channel.
Typically we expect (and so spin) for 10us for a quick response, but this
doesn't cover everything and so sometimes we fall back to the millisecond+
sleep. This incurs a significant delay in time-critical operations like
preemption (igt/gem_exec_latency), which can be improved significantly by
using a small sleep after the spin fails.
We've made this suggestion many times, but had little experimental data
to support adding the complexity.
v2: Bump the minimum usleep to 10us on the advice of
Documentation/timers/timers-howto.txt (Tvrtko)
v3: Specify a min, max range for the usleep intervals -- some code may
crucially depend upon the sleep pattern and so want to specify it.
References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: John Harrison <John.C.Harrison@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
---
drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
drivers/gpu/drm/i915/intel_pm.c | 2 +-
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 635a96fcd788..c00441a3d649 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -48,8 +48,9 @@
* having timed out, since the timeout could be due to preemption or similar and
* we've never had a chance to check the condition before the timeout.
*/
-#define _wait_for(COND, US, W) ({ \
+#define _wait_for(COND, US, Wmin, Wmax) ({ \
unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
+ long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
int ret__; \
might_sleep(); \
for (;;) { \
@@ -62,12 +63,14 @@
ret__ = -ETIMEDOUT; \
break; \
} \
- usleep_range((W), (W) * 2); \
+ usleep_range(wait__, wait__ * 2); \
+ if (wait__ < (Wmax)) \
+ wait__ <<= 1; \
} \
ret__; \
})
-#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 1000)
+#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000)
/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
@@ -116,7 +119,7 @@
int ret__; \
BUILD_BUG_ON(!__builtin_constant_p(US)); \
if ((US) > 10) \
- ret__ = _wait_for((COND), (US), 10); \
+ ret__ = _wait_for((COND), (US), 10, 10); \
else \
ret__ = _wait_for_atomic((COND), (US), 0); \
ret__; \
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index e445ec174831..f07f14ae198d 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -9294,7 +9294,7 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
ret = 0;
goto out;
}
- ret = _wait_for(COND, timeout_base_ms * 1000, 10);
+ ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
if (!ret)
goto out;
--
2.15.0
* ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3)
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (7 preceding siblings ...)
2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
@ 2017-11-21 21:32 ` Patchwork
2017-11-21 22:41 ` ✗ Fi.CI.IGT: warning " Patchwork
9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 21:32 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Use exponential backoff for wait_for() (rev3)
URL : https://patchwork.freedesktop.org/series/34174/
State : success
== Summary ==
Series 34174v3 drm/i915: Use exponential backoff for wait_for()
https://patchwork.freedesktop.org/api/1.0/series/34174/revisions/3/mbox/
Test kms_pipe_crc_basic:
Subgroup suspend-read-crc-pipe-b:
pass -> INCOMPLETE (fi-snb-2520m) fdo#103713
fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
fi-bdw-5557u total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:459s
fi-bdw-gvtdvm total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:453s
fi-blb-e6850 total:289 pass:223 dwarn:1 dfail:0 fail:0 skip:65 time:380s
fi-bsw-n3050 total:289 pass:243 dwarn:0 dfail:0 fail:0 skip:46 time:534s
fi-bwr-2160 total:289 pass:183 dwarn:0 dfail:0 fail:0 skip:106 time:279s
fi-bxt-dsi total:289 pass:259 dwarn:0 dfail:0 fail:0 skip:30 time:501s
fi-bxt-j4205 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:497s
fi-byt-j1900 total:289 pass:254 dwarn:0 dfail:0 fail:0 skip:35 time:490s
fi-byt-n2820 total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:490s
fi-cfl-s2 total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:604s
fi-elk-e7500 total:289 pass:229 dwarn:0 dfail:0 fail:0 skip:60 time:424s
fi-gdg-551 total:289 pass:178 dwarn:1 dfail:0 fail:1 skip:109 time:263s
fi-glk-1 total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:541s
fi-hsw-4770r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:432s
fi-ilk-650 total:289 pass:228 dwarn:0 dfail:0 fail:0 skip:61 time:427s
fi-ivb-3520m total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:475s
fi-ivb-3770 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:457s
fi-kbl-7500u total:289 pass:264 dwarn:1 dfail:0 fail:0 skip:24 time:477s
fi-kbl-7560u total:289 pass:270 dwarn:0 dfail:0 fail:0 skip:19 time:528s
fi-kbl-7567u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:476s
fi-kbl-r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:531s
fi-pnv-d510 total:289 pass:222 dwarn:1 dfail:0 fail:0 skip:66 time:580s
fi-skl-6260u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:451s
fi-skl-6600u total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:545s
fi-skl-6700hq total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:567s
fi-skl-6700k total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:513s
fi-skl-6770hq total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:495s
fi-skl-gvtdvm total:289 pass:266 dwarn:0 dfail:0 fail:0 skip:23 time:454s
fi-snb-2520m total:246 pass:212 dwarn:0 dfail:0 fail:0 skip:33
fi-snb-2600 total:289 pass:249 dwarn:0 dfail:0 fail:0 skip:40 time:418s
Blacklisted hosts:
fi-glk-dsi total:289 pass:149 dwarn:0 dfail:10 fail:2 skip:128
9cbb5ad289f8f874e048120d6ac3862b2f36a8ac drm-tip: 2017y-11m-21d-19h-39m-42s UTC integration manifest
0a446ff35dd9 drm/i915: Use exponential backoff for wait_for()
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7228/
* ✗ Fi.CI.IGT: warning for drm/i915: Use exponential backoff for wait_for() (rev3)
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
` (8 preceding siblings ...)
2017-11-21 21:32 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3) Patchwork
@ 2017-11-21 22:41 ` Patchwork
9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 22:41 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915: Use exponential backoff for wait_for() (rev3)
URL : https://patchwork.freedesktop.org/series/34174/
State : warning
== Summary ==
Test kms_flip:
Subgroup modeset-vs-vblank-race-interruptible:
pass -> FAIL (shard-hsw) fdo#103060
Subgroup flip-vs-panning-vs-hang:
pass -> DMESG-WARN (shard-snb) fdo#103821
Test kms_setmode:
Subgroup basic:
pass -> FAIL (shard-hsw) fdo#99912
Test perf:
Subgroup polling:
fail -> PASS (shard-hsw) fdo#102252
Test drv_suspend:
Subgroup sysfs-reader:
pass -> SKIP (shard-snb)
Subgroup fence-restore-untiled:
pass -> SKIP (shard-hsw)
Test kms_atomic_transition:
Subgroup plane-all-transition-nonblocking-fencing:
pass -> SKIP (shard-hsw)
Test kms_chv_cursor_fail:
Subgroup pipe-b-128x128-left-edge:
pass -> SKIP (shard-hsw)
fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
fdo#103821 https://bugs.freedesktop.org/show_bug.cgi?id=103821
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252
shard-hsw total:2585 pass:1470 dwarn:1 dfail:1 fail:11 skip:1102 time:9426s
shard-snb total:2585 pass:1255 dwarn:2 dfail:1 fail:13 skip:1314 time:8045s
Blacklisted hosts:
shard-apl total:2563 pass:1598 dwarn:1 dfail:0 fail:23 skip:939 time:13086s
shard-kbl total:2535 pass:1683 dwarn:4 dfail:0 fail:25 skip:821 time:10218s
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7228/shards.html
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
@ 2017-11-22 7:41 ` Sagar Arun Kamble
2017-11-22 9:36 ` Chris Wilson
2017-11-24 12:37 ` Michał Winiarski
1 sibling, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-22 7:41 UTC (permalink / raw)
To: Chris Wilson, intel-gfx
On 11/22/2017 2:29 AM, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
>
> A good example of a beneficiary is the guc mmio communication channel.
As Tvrtko said, for the current GuC communication (guc_send_mmio) we
will need to update the fast timeout of __intel_wait_for_register to
20us. The improvement this patch proposes through wait_for will
certainly be seen once we switch over to GuC CT. Maybe specifying "GuC
CT channel" here is apt.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
>
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
>
> v2: Bump the minimum usleep to 10us on the advice of
> Documentation/timers/timers-howto.txt (Tvrtko)
> v3: Specify a min, max range for the usleep intervals -- some code may
> crucially depend upon the sleep pattern and so want to specify it.
>
> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
> drivers/gpu/drm/i915/intel_pm.c | 2 +-
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 635a96fcd788..c00441a3d649 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -48,8 +48,9 @@
> * having timed out, since the timeout could be due to preemption or similar and
> * we've never had a chance to check the condition before the timeout.
> */
> -#define _wait_for(COND, US, W) ({ \
> +#define _wait_for(COND, US, Wmin, Wmax) ({ \
> unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> + long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
> int ret__; \
> might_sleep(); \
> for (;;) { \
> @@ -62,12 +63,14 @@
> ret__ = -ETIMEDOUT; \
> break; \
> } \
> - usleep_range((W), (W) * 2); \
> + usleep_range(wait__, wait__ * 2); \
> + if (wait__ < (Wmax)) \
> + wait__ <<= 1; \
I think we need to keep track of the total time we have waited, else we
might wait for longer than necessary.
For example, wait_for_us(COND, 900) with this approach might actually
lead to a sleep of 1270us.
> } \
> ret__; \
> })
>
> -#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 1000)
> +#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000)
>
> /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
> #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
> @@ -116,7 +119,7 @@
> int ret__; \
> BUILD_BUG_ON(!__builtin_constant_p(US)); \
> if ((US) > 10) \
> - ret__ = _wait_for((COND), (US), 10); \
> + ret__ = _wait_for((COND), (US), 10, 10); \
> else \
> ret__ = _wait_for_atomic((COND), (US), 0); \
> ret__; \
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index e445ec174831..f07f14ae198d 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -9294,7 +9294,7 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
> ret = 0;
> goto out;
> }
> - ret = _wait_for(COND, timeout_base_ms * 1000, 10);
> + ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
> if (!ret)
> goto out;
>
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-22 7:41 ` Sagar Arun Kamble
@ 2017-11-22 9:36 ` Chris Wilson
2017-11-22 9:47 ` Michal Wajdeczko
2017-11-22 10:03 ` Sagar Arun Kamble
0 siblings, 2 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-22 9:36 UTC (permalink / raw)
To: Sagar Arun Kamble, intel-gfx
Quoting Sagar Arun Kamble (2017-11-22 07:41:02)
>
>
> On 11/22/2017 2:29 AM, Chris Wilson wrote:
> > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > start with a small sleep and exponentially increase the sleep on each
> > cycle.
> >
> > A good example of a beneficiary is the guc mmio communication channel.
> As Tvrtko said, for the current GuC communication (guc_send_mmio) we
> will need to update fast timeout of
> __intel_wait_for_register to 20us. Improvement this patch proposes
> through wait_for will
> certainly be seen once we switch over to GuC CT. May be specifying "GuC
> CT channel" here is apt.
guc mmio comm falls off the fastpath hitting the sleeping wait_for, and
*is* improved by this patch. As far as the latency experienced by
gem_exec_latency, there is no difference between a 10us sleep and
spinning for 20us. Changing the spin length to 20us!!! is
something that you should talk to the guc about, at that point we really
need an asynchronous communication channel (ct is still being used
synchronously).
Changing the intel_guc_send_mmio fast timeout is a different
conversation, that is of more dubious merit because of this patch (i.e.
if we can achieve the same latency with a sleep should we spin at all?).
> > Typically we expect (and so spin) for 10us for a quick response, but this
> > doesn't cover everything and so sometimes we fallback to the millisecond+
> > sleep. This incurs a significant delay in time-critical operations like
> > preemption (igt/gem_exec_latency), which can be improved significantly by
> > using a small sleep after the spin fails.
> >
> > We've made this suggestion many times, but had little experimental data
> > to support adding the complexity.
> >
> > v2: Bump the minimum usleep to 10us on the advice of
> > Documentation/timers/timers-howto.txt (Tvrtko)
> > v3: Specify a min, max range for the usleep intervals -- some code may
> > crucially depend upon the sleep pattern and so want to specify it.
> >
> > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: John Harrison <John.C.Harrison@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
> > drivers/gpu/drm/i915/intel_pm.c | 2 +-
> > 2 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > index 635a96fcd788..c00441a3d649 100644
> > --- a/drivers/gpu/drm/i915/intel_drv.h
> > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > @@ -48,8 +48,9 @@
> > * having timed out, since the timeout could be due to preemption or similar and
> > * we've never had a chance to check the condition before the timeout.
> > */
> > -#define _wait_for(COND, US, W) ({ \
> > +#define _wait_for(COND, US, Wmin, Wmax) ({ \
> > unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> > + long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
> > int ret__; \
> > might_sleep(); \
> > for (;;) { \
> > @@ -62,12 +63,14 @@
> > ret__ = -ETIMEDOUT; \
> > break; \
> > } \
> > - usleep_range((W), (W) * 2); \
> > + usleep_range(wait__, wait__ * 2); \
> > + if (wait__ < (Wmax)) \
> > + wait__ <<= 1; \
> I think we need to keep track of total time we have waited else we might
> wait for longer than necessary.
This is not a precise wait, it's a sleep with a rough upper bound.
Sleeps by their very nature are very rough, you are giving up the
processor and telling it not to wake you before a certain time. We are
not asking to be woken at that time, just not before.
> For e.g. for wait_for_us(COND, 900) this approach might actually lead to
> sleep of 1270us.
The numbers are irrelevant, they are merely round numbers from once upon
a time using msleep(1).
-Chris
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-22 9:36 ` Chris Wilson
@ 2017-11-22 9:47 ` Michal Wajdeczko
2017-11-22 10:03 ` Sagar Arun Kamble
1 sibling, 0 replies; 28+ messages in thread
From: Michal Wajdeczko @ 2017-11-22 9:47 UTC (permalink / raw)
To: Sagar Arun Kamble, intel-gfx, Chris Wilson
On Wed, 22 Nov 2017 10:36:15 +0100, Chris Wilson
<chris@chris-wilson.co.uk> wrote:
> Quoting Sagar Arun Kamble (2017-11-22 07:41:02)
>>
>>
>> On 11/22/2017 2:29 AM, Chris Wilson wrote:
>> > Instead of sleeping for a fixed 1ms (roughly, depending on timer
>> slack),
>> > start with a small sleep and exponentially increase the sleep on each
>> > cycle.
>> >
>> > A good example of a beneficiary is the guc mmio communication channel.
>> As Tvrtko said, for the current GuC communication (guc_send_mmio) we
>> will need to update fast timeout of
>> __intel_wait_for_register to 20us. The improvement this patch proposes
>> through wait_for will certainly be seen once we switch over to GuC CT.
>> Maybe specifying "GuC CT channel" here would be apt.
>
> guc mmio comm falls off the fastpath hitting the sleeping wait_for, and
> *is* improved by this patch. As far as the latency experienced by
> gem_exec_latency, there is no difference between a 10us sleep and
> spinning for 20us. Changing the spin length to 20us!!! is
> something that you should talk to the guc about, at that point we really
> need an asynchronous communication channel (ct is still being used
> synchronously).
FYI: an updated series with asynchronous CT will be posted soon; the delay
is due to planned changes to the command format in the GuC firmware.
Michal
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-22 9:36 ` Chris Wilson
2017-11-22 9:47 ` Michal Wajdeczko
@ 2017-11-22 10:03 ` Sagar Arun Kamble
1 sibling, 0 replies; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-22 10:03 UTC (permalink / raw)
To: Chris Wilson, intel-gfx
On 11/22/2017 3:06 PM, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-22 07:41:02)
>>
>> On 11/22/2017 2:29 AM, Chris Wilson wrote:
>>> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
>>> start with a small sleep and exponentially increase the sleep on each
>>> cycle.
>>>
>>> A good example of a beneficiary is the guc mmio communication channel.
>> As Tvrtko said, for the current GuC communication (guc_send_mmio) we
>> will need to update fast timeout of
>> __intel_wait_for_register to 20us. The improvement this patch proposes
>> through wait_for will certainly be seen once we switch over to GuC CT.
>> Maybe specifying "GuC CT channel" here would be apt.
> guc mmio comm falls off the fastpath hitting the sleeping wait_for, and
> *is* improved by this patch.
Thanks for clarification. I overlooked the sleeping wait of 10ms in
_intel_wait_for_register.
> As far as the latency experienced by
> gem_exec_latency, there is no difference between a 10us sleep and
> spinning for 20us. Changing the spin length to 20us!!! is
> something that you should talk to the guc about, at that point we really
> need an asynchronous communication channel (ct is still being used
> synchronously).
>
> Changing the intel_guc_send_mmio fast timeout is a different
> conversation, that is of more dubious merit because of this patch (i.e.
> if we can achieve the same latency with a sleep should we spin at all?).
Agree.
>>> Typically we expect (and so spin) for 10us for a quick response, but this
>>> doesn't cover everything and so sometimes we fallback to the millisecond+
>>> sleep. This incurs a significant delay in time-critical operations like
>>> preemption (igt/gem_exec_latency), which can be improved significantly by
>>> using a small sleep after the spin fails.
>>>
>>> We've made this suggestion many times, but had little experimental data
>>> to support adding the complexity.
>>>
>>> v2: Bump the minimum usleep to 10us on advice of
>>> Documentation/timers/timers-howto.txt (Tvrtko)
>>> v3: Specify min, max range for usleep intervals -- some code may
>>> crucially depend upon the sleep pattern and so want to specify it.
>>>
>>> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: John Harrison <John.C.Harrison@intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Reviewed-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>>> ---
>>> drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
>>> drivers/gpu/drm/i915/intel_pm.c | 2 +-
>>> 2 files changed, 8 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
>>> index 635a96fcd788..c00441a3d649 100644
>>> --- a/drivers/gpu/drm/i915/intel_drv.h
>>> +++ b/drivers/gpu/drm/i915/intel_drv.h
>>> @@ -48,8 +48,9 @@
>>> * having timed out, since the timeout could be due to preemption or similar and
>>> * we've never had a chance to check the condition before the timeout.
>>> */
>>> -#define _wait_for(COND, US, W) ({ \
>>> +#define _wait_for(COND, US, Wmin, Wmax) ({ \
>>> unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
>>> + long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
>>> int ret__; \
>>> might_sleep(); \
>>> for (;;) { \
>>> @@ -62,12 +63,14 @@
>>> ret__ = -ETIMEDOUT; \
>>> break; \
>>> } \
>>> - usleep_range((W), (W) * 2); \
>>> + usleep_range(wait__, wait__ * 2); \
>>> + if (wait__ < (Wmax)) \
>>> + wait__ <<= 1; \
>> I think we need to keep track of total time we have waited else we might
>> wait for longer than necessary.
> This is not a precise wait, it's a sleep with a rough upper bound.
> Sleeps by their very nature are very rough, you are giving up the
> processor and telling it not to wake you before a certain time. We are
> not asking to be woken at that time, just not before.
>
>> E.g. for wait_for_us(COND, 900) this approach might actually lead to a
>> sleep of 1270us.
> The numbers are irrelevant, they are merely round numbers from once upon
> a time using msleep(1).
Ok. Thanks for clarification.
> -Chris
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
2017-11-22 7:41 ` Sagar Arun Kamble
@ 2017-11-24 12:37 ` Michał Winiarski
2017-11-24 14:12 ` Chris Wilson
1 sibling, 1 reply; 28+ messages in thread
From: Michał Winiarski @ 2017-11-24 12:37 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
On Tue, Nov 21, 2017 at 08:59:51PM +0000, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
>
> A good example of a beneficiary is the guc mmio communication channel.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
>
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
>
> v2: Bump the minimum usleep to 10us on advice of
> Documentation/timers/timers-howto.txt (Tvrtko)
> v3: Specify min, max range for usleep intervals -- some code may
> crucially depend upon the sleep pattern and so want to specify it.
Since we see the effects for GuC preemption, let's gather some evidence.
(SKL)
intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
drm-tip:
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 44 | |
16 -> 31 : 1088 | |
32 -> 63 : 832 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 12 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 29899 |********* |
2048 -> 4095 : 131033 |****************************************|
exponential backoff:
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 58 | |
16 -> 31 : 2009 | |
32 -> 63 : 153615 |****************************************|
64 -> 127 : 4360 |* |
128 -> 255 : 14 | |
256 -> 511 : 4236 |* |
512 -> 1023 : 4 | |
s/10us/100us in __intel_wait_for_register_fw fast_timeout (without exponential
backoff):
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 946 | |
16 -> 31 : 159803 |****************************************|
32 -> 63 : 3440 | |
64 -> 127 : 33 | |
128 -> 255 : 1 | |
256 -> 511 : 74 | |
same as the previous one, except with lower GPU clock, let's say max == hw_min:
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 26 | |
32 -> 63 : 160411 |****************************************|
64 -> 127 : 3402 | |
128 -> 255 : 12 | |
256 -> 511 : 50 | |
512 -> 1023 : 3 | |
1024 -> 2047 : 1 | |
2048 -> 4095 : 2 | |
We're waaay better with exponential backoff compared with current drm-tip, and
we're almost as good as with spinning for a longer period. Unfortunately with
GuC the delay depends on platform (and/or GPU clock?), so we're still seeing the
long tail without exponential backoff if we manage to miss the fast timeout.
> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
-Michał
> ---
> drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
> drivers/gpu/drm/i915/intel_pm.c | 2 +-
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 635a96fcd788..c00441a3d649 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -48,8 +48,9 @@
> * having timed out, since the timeout could be due to preemption or similar and
> * we've never had a chance to check the condition before the timeout.
> */
> -#define _wait_for(COND, US, W) ({ \
> +#define _wait_for(COND, US, Wmin, Wmax) ({ \
> unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1; \
> + long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
> int ret__; \
> might_sleep(); \
> for (;;) { \
> @@ -62,12 +63,14 @@
> ret__ = -ETIMEDOUT; \
> break; \
> } \
> - usleep_range((W), (W) * 2); \
> + usleep_range(wait__, wait__ * 2); \
> + if (wait__ < (Wmax)) \
> + wait__ <<= 1; \
> } \
> ret__; \
> })
>
> -#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 1000)
> +#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000)
>
> /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
> #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
> @@ -116,7 +119,7 @@
> int ret__; \
> BUILD_BUG_ON(!__builtin_constant_p(US)); \
> if ((US) > 10) \
> - ret__ = _wait_for((COND), (US), 10); \
> + ret__ = _wait_for((COND), (US), 10, 10); \
> else \
> ret__ = _wait_for_atomic((COND), (US), 0); \
> ret__; \
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index e445ec174831..f07f14ae198d 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -9294,7 +9294,7 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
> ret = 0;
> goto out;
> }
> - ret = _wait_for(COND, timeout_base_ms * 1000, 10);
> + ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
> if (!ret)
> goto out;
>
> --
> 2.15.0
>
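For reference, the sleep sequences the v3 macro produces for its two users can be sketched in userspace (a Python model of the `wait__ <<= 1` update in the hunk above; the helper name is invented):

```python
def sleep_pattern(wmin, wmax, n):
    """First n minimum sleep lengths: start at Wmin and double after
    each sleep while the current value is still below Wmax."""
    out, wait = [], wmin
    for _ in range(n):
        out.append(wait)
        if wait < wmax:
            wait <<= 1
    return out

# wait_for(): _wait_for(COND, MS * 1000, 10, 1000)
print(sleep_pattern(10, 1000, 9))  # 10, 20, ..., 640, then capped at 1280
# wait_for_us() slow path: _wait_for(COND, US, 10, 10) -- constant 10us
print(sleep_pattern(10, 10, 4))
```

Note that because the doubling test (`wait__ < Wmax`) runs before the shift, the sleep settles just under 2*Wmax (1280us for the 10/1000 case) rather than at exactly Wmax.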
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-24 12:37 ` Michał Winiarski
@ 2017-11-24 14:12 ` Chris Wilson
2017-11-30 3:04 ` John Harrison
0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2017-11-24 14:12 UTC (permalink / raw)
To: Michał Winiarski; +Cc: intel-gfx
Quoting Michał Winiarski (2017-11-24 12:37:56)
> Since we see the effects for GuC preemption, let's gather some evidence.
>
> (SKL)
> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>
> drm-tip:
> usecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 44 | |
> 16 -> 31 : 1088 | |
> 32 -> 63 : 832 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 12 | |
> 512 -> 1023 : 0 | |
> 1024 -> 2047 : 29899 |********* |
> 2048 -> 4095 : 131033 |****************************************|
Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
could create a tracepoint/kprobe that would output a histogram for each
waiter (filterable ofc). Benefit? Just thinking of tuning the
spin/sleep, in which case overall metrics are best
(intel_wait_for_register needs to be optimised for the typical case). I
am wondering if we could tune the spin period down to 5us, 2us? And then
have the 10us sleep.
We would also need a typical workload to run, it's profile-guided
optimisation after all. Hmm.
-Chris
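The bcc-style log2 histograms quoted in this thread are easy to reproduce for ad-hoc latency measurements; a minimal userspace sketch (Python, with invented helper names):

```python
from collections import Counter

def log2_buckets(samples_us):
    """Histogram samples into power-of-two buckets, keyed by exponent
    (bucket 3 covers 8..15us, bucket 11 covers 2048..4095us, etc.)."""
    buckets = Counter()
    for s in samples_us:
        buckets[max(s, 1).bit_length() - 1] += 1
    return dict(buckets)

def print_hist(buckets):
    """Print buckets in the '<lo> -> <hi> : count' style used above."""
    for b in sorted(buckets):
        lo = 0 if b == 0 else 1 << b
        hi = (1 << (b + 1)) - 1
        print(f"{lo:7d} -> {hi:<7d}: {buckets[b]}")

print_hist(log2_buckets([12, 30, 45, 1800, 3000]))
```

A kernel tracepoint/kprobe feeding the same bucketing (as Chris suggests) would let the spin/sleep tuning be profile-guided per waiter.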
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-24 14:12 ` Chris Wilson
@ 2017-11-30 3:04 ` John Harrison
2017-11-30 6:19 ` Sagar Arun Kamble
0 siblings, 1 reply; 28+ messages in thread
From: John Harrison @ 2017-11-30 3:04 UTC (permalink / raw)
To: Chris Wilson, Michał Winiarski; +Cc: intel-gfx
On 11/24/2017 6:12 AM, Chris Wilson wrote:
> Quoting Michał Winiarski (2017-11-24 12:37:56)
>> Since we see the effects for GuC preemption, let's gather some evidence.
>>
>> (SKL)
>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>
>> drm-tip:
>> usecs : count distribution
>> 0 -> 1 : 0 | |
>> 2 -> 3 : 0 | |
>> 4 -> 7 : 0 | |
>> 8 -> 15 : 44 | |
>> 16 -> 31 : 1088 | |
>> 32 -> 63 : 832 | |
>> 64 -> 127 : 0 | |
>> 128 -> 255 : 0 | |
>> 256 -> 511 : 12 | |
>> 512 -> 1023 : 0 | |
>> 1024 -> 2047 : 29899 |********* |
>> 2048 -> 4095 : 131033 |****************************************|
> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
> could create a tracepoint/kprobe that would output a histogram for each
> waiter (filterable ofc). Benefit? Just thinking of tuning the
> spin/sleep, in which case overall metrics are best
> (intel_wait_for_register needs to be optimised for the typical case). I
> am wondering if we could tune the spin period down to 5us, 2us? And then
> have the 10us sleep.
>
> We would also need a typical workload to run, it's profile-guided
> optimisation after all. Hmm.
> -Chris
It took me a while to get back to this but I've now had chance to run
with this exponential backoff scheme on the original system that showed
the problem. It was a slightly messy back port due to the customer tree
being much older than current nightly. I'm pretty sure I got it correct
though. However, I'm not sure what the recommendation is for the two
timeout values. Using the default of '10, 10' in the patch, I still get
lots of very long delays. I have to up the Wmin value to at least 140 to
get a stall free result. Which is plausible given that the big spike in
the results of any fast version is at 110-150us. Also of note is that a
Wmin between 10 and 110 actually makes things worse. Changing Wmax has
no effect.
In the following table, 'original' is the original driver before any
changes and 'retry loop' is the version using the first workaround of
just running the busy poll wait in a 10x loop. The other columns are
using the backoff patch with the given Wmin/Wmax values. Note that the
times are bucketed to 10us up to 500us and then in 500us lumps
thereafter. The value listed is the lower limit, i.e. there were no
times of <10us measured. Each case was run for 1000 samples.
Time Original 10/10 50/10 100/10 110/10 130/10 140/10 RetryLoop
10us: 2 2 2 2 2 2 2 2
30us: 1 1 1 1 1
50us: 1
70us: 14 63 56 64 63 61
80us: 8 41 52 44 46 41
90us: 6 24 10 28 12 17
100us: 2 4 20 16 17 17 22
110us: 13 21 14 13 11
120us: 6 366 633 636 660 650
130us: 2 2 46 125 95 86 95
140us: 3 2 16 18 32 46 48
150us: 210 3 12 13 37 32 31
160us: 322 1 18 10 14 12 17
170us: 157 4 5 5 3 5 2
180us: 62 11 3 1 2 1 1
190us: 32 212 1 1 2
200us: 27 266 1 1
210us: 16 181 1
220us: 16 51 1
230us: 10 43 4
240us: 12 22 62 1
250us: 4 12 112 3
260us: 3 13 73 8
270us: 5 12 12 8 2
280us: 4 7 12 5 1
290us: 9 4
300us: 1 3 9 1 1
310us: 2 3 5 1 1
320us: 1 4 2 3
330us: 1 5 1
340us: 1 2 1
350us: 2 1
360us: 2 1
370us: 2 2
380us: 1
390us: 2 1 2 1
410us: 1
420us: 3
430us: 2 2 1
440us: 2 1
450us: 4
460us: 3 1
470us: 3 1
480us: 2 2
490us: 1
500us: 19 13 17
1000us: 249 22 30 11
1500us: 393 4 4 2 1
2000us: 132 7 8 8 2 1 1
2500us: 63 4 4 6 1 1 1
3000us: 59 9 7 6 1
3500us: 34 2 1 1
4000us: 17 9 4 1
4500us: 8 2 1 1
5000us: 7 1 2
5500us: 7 2 1
6000us: 4 2 1 1
6500us: 3 1
7000us: 6 2 1
7500us: 4 1 1
8000us: 5 1
8500us: 1 1
9000us: 2
9500us: 2 1
>10000us: 3 1
John.
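John's finding that a Wmin between 10 and 110 actually makes things worse is consistent with a simple model (a Python sketch with an invented helper name; real usleep_range() may sleep up to twice each step, so these are lower bounds): if the condition typically becomes true roughly 120us after the spin gives up, the earliest possible wake depends sharply on Wmin.

```python
def min_elapsed_until(target_us, wmin, wmax):
    """Minimum total sleep before the waiter can next poll the
    condition, for an event that becomes true target_us after the
    spin gives up (model only; ignores scheduling latency)."""
    elapsed, wait = 0, wmin
    while elapsed < target_us:
        elapsed += wait
        if wait < wmax:
            wait <<= 1
    return elapsed

# Event at ~120us: Wmin=110 overshoots badly (110 then +220 = 330us),
# while Wmin=140 wakes in one sleep.
for wmin in (10, 50, 110, 140):
    print(wmin, min_elapsed_until(120, wmin, 1000))
```

This reproduces the shape of the table: small Wmin lands a poll near 150us, Wmin just below the event time doubles straight past it, and Wmin=140 covers the 120-150us spike in a single sleep.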
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-30 3:04 ` John Harrison
@ 2017-11-30 6:19 ` Sagar Arun Kamble
2017-11-30 7:15 ` John Harrison
0 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-30 6:19 UTC (permalink / raw)
To: John Harrison, Chris Wilson, Michał Winiarski; +Cc: intel-gfx
On 11/30/2017 8:34 AM, John Harrison wrote:
> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>
>>> (SKL)
>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>
>>> drm-tip:
>>> usecs : count distribution
>>> 0 -> 1 : 0 | |
>>> 2 -> 3 : 0 | |
>>> 4 -> 7 : 0 | |
>>> 8 -> 15 : 44 | |
>>> 16 -> 31 : 1088 | |
>>> 32 -> 63 : 832 | |
>>> 64 -> 127 : 0 | |
>>> 128 -> 255 : 0 | |
>>> 256 -> 511 : 12 | |
>>> 512 -> 1023 : 0 | |
>>> 1024 -> 2047 : 29899 |********* |
>>> 2048 -> 4095 : 131033 |****************************************|
>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>> could create a tracepoint/kprobe that would output a histogram for each
>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>> spin/sleep, in which case overall metrics are best
>> (intel_wait_for_register needs to be optimised for the typical case). I
>> am wondering if we could tune the spin period down to 5us, 2us? And then
>> have the 10us sleep.
>>
>> We would also need a typical workload to run, it's profile-guided
>> optimisation after all. Hmm.
>> -Chris
>
> It took me a while to get back to this but I've now had chance to run
> with this exponential backoff scheme on the original system that
> showed the problem. It was a slightly messy back port due to the
> customer tree being much older than current nightly. I'm pretty sure I
> got it correct though. However, I'm not sure what the recommendation
> is for the two timeout values. Using the default of '10, 10' in the
> patch, I still get lots of very long delays.
The recommended settings currently are Wmin=10, Wmax=10 for wait_for_us
and Wmin=10, Wmax=1000 for wait_for.
Exponential backoff inside wait_for is more helpful when the wait_for_us
spin preceding it is short.
Setting Wmax less than Wmin effectively changes the backoff strategy to
just linear waits of Wmin.
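The degenerate case described here is easy to see in a model of the `wait__ <<= 1` update (a Python sketch with an invented helper name):

```python
def sleeps(wmin, wmax, n):
    """First n minimum sleep lengths produced by the backoff loop."""
    out, wait = [], wmin
    for _ in range(n):
        out.append(wait)
        if wait < wmax:   # doubling never fires once wait >= Wmax
            wait <<= 1
    return out

print(sleeps(10, 1000, 6))  # exponential: 10, 20, 40, 80, 160, 320
print(sleeps(140, 10, 4))   # Wmax < Wmin: fixed linear 140us waits
```

With Wmax below Wmin the doubling condition is never true, so every sleep stays at Wmin, which is exactly the linear behaviour noted above.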
> I have to up the Wmin value to at least 140 to get a stall free
> result. Which is plausible given that the big spike in the results of
> any fast version is at 110-150us. Also of note is that a Wmin between
> 10 and 110 actually makes things worse. Changing Wmax has no effect.
>
> In the following table, 'original' is the original driver before any
> changes and 'retry loop' is the version using the first workaround of
> just running the busy poll wait in a 10x loop. The other columns are
> using the backoff patch with the given Wmin/Wmax values. Note that the
> times are bucketed to 10us up to 500us and then in 500us lumps
> thereafter. The value listed is the lower limit, i.e. there were no
> times of <10us measured. Each case was run for 1000 samples.
>
The setting below, as in current nightly, will suit this workload and,
as you have found, will also likely complete most waits in <150us.
If many samples had fallen beyond 160us but under 300us, we might have
needed to change Wmin to perhaps 15 or 20 to ensure the exponential
rise caps out around 300us.
wait_for_us(10, 10)
wait_for()
#define wait_for _wait_for(10, 1000)
> Time Original 10/10 50/10 100/10 110/10 130/10 140/10 RetryLoop
> 10us: 2 2 2 2 2 2 2 2
> 30us: 1 1 1 1 1
> 50us: 1
> 70us: 14 63 56 64 63 61
> 80us: 8 41 52 44 46 41
> 90us: 6 24 10 28 12 17
> 100us: 2 4 20 16 17 17 22
> 110us: 13 21 14 13 11
> 120us: 6 366 633 636 660 650
> 130us: 2 2 46 125 95 86 95
> 140us: 3 2 16 18 32 46 48
> 150us: 210 3 12 13 37 32 31
> 160us: 322 1 18 10 14 12 17
> 170us: 157 4 5 5 3 5 2
> 180us: 62 11 3 1 2 1 1
> 190us: 32 212 1 1 2
> 200us: 27 266 1 1
> 210us: 16 181 1
> 220us: 16 51 1
> 230us: 10 43 4
> 240us: 12 22 62 1
> 250us: 4 12 112 3
> 260us: 3 13 73 8
> 270us: 5 12 12 8 2
> 280us: 4 7 12 5 1
> 290us: 9 4
> 300us: 1 3 9 1 1
> 310us: 2 3 5 1 1
> 320us: 1 4 2 3
> 330us: 1 5 1
> 340us: 1 2 1
> 350us: 2 1
> 360us: 2 1
> 370us: 2 2
> 380us: 1
> 390us: 2 1 2 1
> 410us: 1
> 420us: 3
> 430us: 2 2 1
> 440us: 2 1
> 450us: 4
> 460us: 3 1
> 470us: 3 1
> 480us: 2 2
> 490us: 1
> 500us: 19 13 17
> 1000us: 249 22 30 11
> 1500us: 393 4 4 2 1
> 2000us: 132 7 8 8 2 1 1
> 2500us: 63 4 4 6 1 1 1
> 3000us: 59 9 7 6 1
> 3500us: 34 2 1 1
> 4000us: 17 9 4 1
> 4500us: 8 2 1 1
> 5000us: 7 1 2
> 5500us: 7 2 1
> 6000us: 4 2 1 1
> 6500us: 3 1
> 7000us: 6 2 1
> 7500us: 4 1 1
> 8000us: 5 1
> 8500us: 1 1
> 9000us: 2
> 9500us: 2 1
> >10000us: 3 1
>
>
> John.
>
>
>
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-30 6:19 ` Sagar Arun Kamble
@ 2017-11-30 7:15 ` John Harrison
2017-11-30 7:55 ` Sagar Arun Kamble
0 siblings, 1 reply; 28+ messages in thread
From: John Harrison @ 2017-11-30 7:15 UTC (permalink / raw)
To: Sagar Arun Kamble, Chris Wilson, Michał Winiarski; +Cc: intel-gfx
On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
> On 11/30/2017 8:34 AM, John Harrison wrote:
>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>
>>>> (SKL)
>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>
>>>> drm-tip:
>>>> usecs : count distribution
>>>> 0 -> 1 : 0 | |
>>>> 2 -> 3 : 0 | |
>>>> 4 -> 7 : 0 | |
>>>> 8 -> 15 : 44 | |
>>>> 16 -> 31 : 1088 | |
>>>> 32 -> 63 : 832 | |
>>>> 64 -> 127 : 0 | |
>>>> 128 -> 255 : 0 | |
>>>> 256 -> 511 : 12 | |
>>>> 512 -> 1023 : 0 | |
>>>> 1024 -> 2047 : 29899 |********* |
>>>> 2048 -> 4095 : 131033 |****************************************|
>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>> could create a tracepoint/kprobe that would output a histogram for each
>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>> spin/sleep, in which case overall metrics are best
>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>> have the 10us sleep.
>>>
>>> We would also need a typical workload to run, it's profile-guided
>>> optimisation after all. Hmm.
>>> -Chris
>>
>> It took me a while to get back to this but I've now had chance to run
>> with this exponential backoff scheme on the original system that
>> showed the problem. It was a slightly messy back port due to the
>> customer tree being much older than current nightly. I'm pretty sure
>> I got it correct though. However, I'm not sure what the
>> recommendation is for the two timeout values. Using the default of
>> '10, 10' in the patch, I still get lots of very long delays.
> Recommended setting currently is Wmin=10, Wmax=10 for wait_for_us and
> Wmin=10, Wmax=1000 for wait_for.
>
> Exponential backoff is more helpful inside wait_for if wait_for_us
> prior to wait_for is smaller.
> Setting Wmax less than Wmin is effectively changing the backoff
> strategy to just linear waits of Wmin.
>> I have to up the Wmin value to at least 140 to get a stall free
>> result. Which is plausible given that the big spike in the results of
>> any fast version is at 110-150us. Also of note is that a Wmin between
>> 10 and 110 actually makes things worse. Changing Wmax has no effect.
>>
>> In the following table, 'original' is the original driver before any
>> changes and 'retry loop' is the version using the first workaround of
>> just running the busy poll wait in a 10x loop. The other columns are
>> using the backoff patch with the given Wmin/Wmax values. Note that
>> the times are bucketed to 10us up to 500us and then in 500us lumps
>> thereafter. The value listed is the lower limit, i.e. there were no
>> times of <10us measured. Each case was run for 1000 samples.
>>
> Below setting like in current nightly will suit this workload and as
> you have found this will also likely complete most waits in <150us.
> If many samples had been beyond 160us and less than 300us we might
> have been needed to change Wmin to may be 15 or 20 to ensure the
> exponential rise caps around 300us.
>
> wait_for_us(10, 10)
> wait_for()
>
> #define wait_for _wait_for(10, 1000)
>
But as shown in the table, a setting of 10/10 does not work well for
this workload. The best results possible are a large spike of waits in
the 120-130us bucket with a small tail out to 150us. Whereas, the 10/10
setting produces a spike from 150-170us with the tail extending to 240us
and an appreciable number of samples stretching all the way out to the
1-10ms range. A regular delay of multiple milliseconds is not acceptable
when this path is supposed to be a low latency pre-emption to switch to
some super high priority time critical task. And as noted, I did try a
bunch of different settings for Wmax but nothing seemed to make much of
a difference. E.g. 10/10 vs 10/1000 produced pretty much identical
results. Hence it didn't seem worth including those in the table.
>> Time Original 10/10 50/10 100/10 110/10 130/10 140/10 RetryLoop
>> 10us: 2 2 2 2 2 2 2 2
>> 30us: 1 1 1 1 1
>> 50us: 1
>> 70us: 14 63 56 64 63 61
>> 80us: 8 41 52 44 46 41
>> 90us: 6 24 10 28 12 17
>> 100us: 2 4 20 16 17 17 22
>> 110us: 13 21 14 13 11
>> 120us: 6 366 633 636 660 650
>> 130us: 2 2 46 125 95 86 95
>> 140us: 3 2 16 18 32 46 48
>> 150us: 210 3 12 13 37 32 31
>> 160us: 322 1 18 10 14 12 17
>> 170us: 157 4 5 5 3 5 2
>> 180us: 62 11 3 1 2 1 1
>> 190us: 32 212 1 1 2
>> 200us: 27 266 1 1
>> 210us: 16 181 1
>> 220us: 16 51 1
>> 230us: 10 43 4
>> 240us: 12 22 62 1
>> 250us: 4 12 112 3
>> 260us: 3 13 73 8
>> 270us: 5 12 12 8 2
>> 280us: 4 7 12 5 1
>> 290us: 9 4
>> 300us: 1 3 9 1 1
>> 310us: 2 3 5 1 1
>> 320us: 1 4 2 3
>> 330us: 1 5 1
>> 340us: 1 2 1
>> 350us: 2 1
>> 360us: 2 1
>> 370us: 2 2
>> 380us: 1
>> 390us: 2 1 2 1
>> 410us: 1
>> 420us: 3
>> 430us: 2 2 1
>> 440us: 2 1
>> 450us: 4
>> 460us: 3 1
>> 470us: 3 1
>> 480us: 2 2
>> 490us: 1
>> 500us: 19 13 17
>> 1000us: 249 22 30 11
>> 1500us: 393 4 4 2 1
>> 2000us: 132 7 8 8 2 1 1
>> 2500us: 63 4 4 6 1 1 1
>> 3000us: 59 9 7 6 1
>> 3500us: 34 2 1 1
>> 4000us: 17 9 4 1
>> 4500us: 8 2 1 1
>> 5000us: 7 1 2
>> 5500us: 7 2 1
>> 6000us: 4 2 1 1
>> 6500us: 3 1
>> 7000us: 6 2 1
>> 7500us: 4 1 1
>> 8000us: 5 1
>> 8500us: 1 1
>> 9000us: 2
>> 9500us: 2 1
>> >10000us: 3 1
>>
>>
>> John.
>>
>>
>>
>
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-30 7:15 ` John Harrison
@ 2017-11-30 7:55 ` Sagar Arun Kamble
2017-12-04 21:51 ` John Harrison
0 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-30 7:55 UTC (permalink / raw)
To: John Harrison, Chris Wilson, Michał Winiarski; +Cc: intel-gfx
On 11/30/2017 12:45 PM, John Harrison wrote:
> On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
>> On 11/30/2017 8:34 AM, John Harrison wrote:
>>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>>
>>>>> (SKL)
>>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>>
>>>>> drm-tip:
>>>>> usecs : count distribution
>>>>> 0 -> 1 : 0 | |
>>>>> 2 -> 3 : 0 | |
>>>>> 4 -> 7 : 0 | |
>>>>> 8 -> 15 : 44 | |
>>>>> 16 -> 31 : 1088 | |
>>>>> 32 -> 63 : 832 | |
>>>>> 64 -> 127 : 0 | |
>>>>> 128 -> 255 : 0 | |
>>>>> 256 -> 511 : 12 | |
>>>>> 512 -> 1023 : 0 | |
>>>>> 1024 -> 2047 : 29899 |********* |
>>>>> 2048 -> 4095 : 131033 |****************************************|
>>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>>> could create a tracepoint/kprobe that would output a histogram for each
>>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>>> spin/sleep, in which case overall metrics are best
>>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>>> have the 10us sleep.
>>>>
>>>> We would also need a typical workload to run, it's profile-guided
>>>> optimisation after all. Hmm.
>>>> -Chris
>>>
>>> It took me a while to get back to this but I've now had a chance to
>>> run with this exponential backoff scheme on the original system that
>>> showed the problem. It was a slightly messy back port due to the
>>> customer tree being much older than current nightly. I'm pretty sure
>>> I got it correct though. However, I'm not sure what the
>>> recommendation is for the two timeout values. Using the default of
>>> '10, 10' in the patch, I still get lots of very long delays.
>> Recommended setting currently is Wmin=10, Wmax=10 for wait_for_us and
>> Wmin=10, Wmax=1000 for wait_for.
>>
>> Exponential backoff is most helpful inside wait_for when the wait_for_us
>> spin that precedes it is short.
>> Setting Wmax less than Wmin is effectively changing the backoff
>> strategy to just linear waits of Wmin.
>>> I have to up the Wmin value to at least 140 to get a stall free
>>> result. Which is plausible given that the big spike in the results
>>> of any fast version is at 110-150us. Also of note is that a Wmin
>>> between 10 and 110 actually makes things worse. Changing Wmax has no
>>> effect.
>>>
>>> In the following table, 'original' is the original driver before any
>>> changes and 'retry loop' is the version using the first workaround
>>> of just running the busy poll wait in a 10x loop. The other columns
>>> are using the backoff patch with the given Wmin/Wmax values. Note
>>> that the times are bucketed to 10us up to 500us and then in 500us
>>> lumps thereafter. The value listed is the lower limit, i.e. there
>>> were no times of <10us measured. Each case was run for 1000 samples.
>>>
>> The setting below, as in current nightly, will suit this workload, and
>> as you have found it will also likely complete most waits in <150us.
>> If many samples had been beyond 160us and less than 300us, we might
>> have needed to change Wmin to maybe 15 or 20 to ensure the
>> exponential rise caps around 300us.
>>
>> wait_for_us(10, 10)
>> wait_for()
>>
>> #define wait_for _wait_for(10, 1000)
>>
> But as shown in the table, a setting of 10/10 does not work well for
> this workload. The best results possible are a large spike of waits in
> the 120-130us bucket with a small tail out to 150us. Whereas, the
> 10/10 setting produces a spike from 150-170us with the tail extending
> to 240us and an appreciable number of samples stretching all the way
> out to the 1-10ms range. A regular delay of multiple milliseconds is
> not acceptable when this path is supposed to be a low latency
> pre-emption to switch to some super high priority time critical task.
> And as noted, I did try a bunch of different settings for Wmax but
> nothing seemed to make much of a difference. E.g. 10/10 vs 10/1000
> produced pretty much identical results. Hence it didn't seem worth
> including those in the table.
>
Wmin = 10us leads to a total delay of 150us in 3 loops (this might be
too tight to catch most conditions).
Wmin = 25us leads to a total delay of 175us in 3 loops.
Since most conditions are likely to complete around 140us-160us, it looks
like a Wmin of 25 to 30 (25,1000 or 30,1000) will suit this workload, but
since this is profile-guided optimisation I am wondering where the
optimal Wmin point is.
This wait is very time-critical. An exponential rise might not be a good
strategy at higher wait times.
usleep_range might also be adding extra latency.
Maybe we should do exponential backoff only for waits with US >= 1000,
and periodic backoff with a 50us period for US < 1000?
>>> Time      Original  10/10  50/10  100/10  110/10  130/10  140/10  RetryLoop
>>> 10us:     2 2 2 2 2 2 2 2
>>> 30us:     1 1 1 1 1
>>> 50us:     1
>>> 70us:     14 63 56 64 63 61
>>> 80us:     8 41 52 44 46 41
>>> 90us:     6 24 10 28 12 17
>>> 100us:    2 4 20 16 17 17 22
>>> 110us:    13 21 14 13 11
>>> 120us:    6 366 633 636 660 650
>>> 130us:    2 2 46 125 95 86 95
>>> 140us:    3 2 16 18 32 46 48
>>> 150us:    210 3 12 13 37 32 31
>>> 160us:    322 1 18 10 14 12 17
>>> 170us:    157 4 5 5 3 5 2
>>> 180us:    62 11 3 1 2 1 1
>>> 190us:    32 212 1 1 2
>>> 200us:    27 266 1 1
>>> 210us:    16 181 1
>>> 220us:    16 51 1
>>> 230us:    10 43 4
>>> 240us:    12 22 62 1
>>> 250us:    4 12 112 3
>>> 260us:    3 13 73 8
>>> 270us:    5 12 12 8 2
>>> 280us:    4 7 12 5 1
>>> 290us:    9 4
>>> 300us:    1 3 9 1 1
>>> 310us:    2 3 5 1 1
>>> 320us:    1 4 2 3
>>> 330us:    1 5 1
>>> 340us:    1 2 1
>>> 350us:    2 1
>>> 360us:    2 1
>>> 370us:    2 2
>>> 380us:    1
>>> 390us:    2 1 2 1
>>> 410us:    1
>>> 420us:    3
>>> 430us:    2 2 1
>>> 440us:    2 1
>>> 450us:    4
>>> 460us:    3 1
>>> 470us:    3 1
>>> 480us:    2 2
>>> 490us:    1
>>> 500us:    19 13 17
>>> 1000us:   249 22 30 11
>>> 1500us:   393 4 4 2 1
>>> 2000us:   132 7 8 8 2 1 1
>>> 2500us:   63 4 4 6 1 1 1
>>> 3000us:   59 9 7 6 1
>>> 3500us:   34 2 1 1
>>> 4000us:   17 9 4 1
>>> 4500us:   8 2 1 1
>>> 5000us:   7 1 2
>>> 5500us:   7 2 1
>>> 6000us:   4 2 1 1
>>> 6500us:   3 1
>>> 7000us:   6 2 1
>>> 7500us:   4 1 1
>>> 8000us:   5 1
>>> 8500us:   1 1
>>> 9000us:   2
>>> 9500us:   2 1
>>> >10000us: 3 1
>>>
>>>
>>> John.
>>>
>>>
>>>
>>
>
* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
2017-11-30 7:55 ` Sagar Arun Kamble
@ 2017-12-04 21:51 ` John Harrison
0 siblings, 0 replies; 28+ messages in thread
From: John Harrison @ 2017-12-04 21:51 UTC (permalink / raw)
To: Sagar Arun Kamble, Chris Wilson, Michał Winiarski; +Cc: intel-gfx
On 11/29/2017 11:55 PM, Sagar Arun Kamble wrote:
> On 11/30/2017 12:45 PM, John Harrison wrote:
>> On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
>>> On 11/30/2017 8:34 AM, John Harrison wrote:
>>>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>>>
>>>>>> (SKL)
>>>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>>>
>>>>>> drm-tip:
>>>>>> usecs : count distribution
>>>>>> 0 -> 1 : 0 | |
>>>>>> 2 -> 3 : 0 | |
>>>>>> 4 -> 7 : 0 | |
>>>>>> 8 -> 15 : 44 | |
>>>>>> 16 -> 31 : 1088 | |
>>>>>> 32 -> 63 : 832 | |
>>>>>> 64 -> 127 : 0 | |
>>>>>> 128 -> 255 : 0 | |
>>>>>> 256 -> 511 : 12 | |
>>>>>> 512 -> 1023 : 0 | |
>>>>>> 1024 -> 2047 : 29899 |********* |
>>>>>> 2048 -> 4095 : 131033 |****************************************|
>>>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>>>> could create a tracepoint/kprobe that would output a histogram for each
>>>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>>>> spin/sleep, in which case overall metrics are best
>>>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>>>> have the 10us sleep.
>>>>>
>>>>> We would also need a typical workload to run, it's profile-guided
>>>>> optimisation after all. Hmm.
>>>>> -Chris
>>>>
>>>> It took me a while to get back to this but I've now had a chance to
>>>> run with this exponential backoff scheme on the original system
>>>> that showed the problem. It was a slightly messy back port due to
>>>> the customer tree being much older than current nightly. I'm pretty
>>>> sure I got it correct though. However, I'm not sure what the
>>>> recommendation is for the two timeout values. Using the default of
>>>> '10, 10' in the patch, I still get lots of very long delays.
>>> Recommended setting currently is Wmin=10, Wmax=10 for wait_for_us
>>> and Wmin=10, Wmax=1000 for wait_for.
>>>
>>> Exponential backoff is most helpful inside wait_for when the wait_for_us
>>> spin that precedes it is short.
>>> Setting Wmax less than Wmin is effectively changing the backoff
>>> strategy to just linear waits of Wmin.
>>>> I have to up the Wmin value to at least 140 to get a stall free
>>>> result. Which is plausible given that the big spike in the results
>>>> of any fast version is at 110-150us. Also of note is that a Wmin
>>>> between 10 and 110 actually makes things worse. Changing Wmax has
>>>> no effect.
>>>>
>>>> In the following table, 'original' is the original driver before
>>>> any changes and 'retry loop' is the version using the first
>>>> workaround of just running the busy poll wait in a 10x loop. The
>>>> other columns are using the backoff patch with the given Wmin/Wmax
>>>> values. Note that the times are bucketed to 10us up to 500us and
>>>> then in 500us lumps thereafter. The value listed is the lower
>>>> limit, i.e. there were no times of <10us measured. Each case was
>>>> run for 1000 samples.
>>>>
>>> The setting below, as in current nightly, will suit this workload, and
>>> as you have found it will also likely complete most waits in <150us.
>>> If many samples had been beyond 160us and less than 300us, we might
>>> have needed to change Wmin to maybe 15 or 20 to ensure the
>>> exponential rise caps around 300us.
>>>
>>> wait_for_us(10, 10)
>>> wait_for()
>>>
>>> #define wait_for _wait_for(10, 1000)
>>>
>> But as shown in the table, a setting of 10/10 does not work well for
>> this workload. The best results possible are a large spike of waits
>> in the 120-130us bucket with a small tail out to 150us. Whereas, the
>> 10/10 setting produces a spike from 150-170us with the tail extending
>> to 240us and an appreciable number of samples stretching all the way
>> out to the 1-10ms range. A regular delay of multiple milliseconds is
>> not acceptable when this path is supposed to be a low latency
>> pre-emption to switch to some super high priority time critical task.
>> And as noted, I did try a bunch of different settings for Wmax but
>> nothing seemed to make much of a difference. E.g. 10/10 vs 10/1000
>> produced pretty much identical results. Hence it didn't seem worth
>> including those in the table.
>>
> Wmin = 10us leads to a total delay of 150us in 3 loops (this might be
> too tight to catch most conditions).
> Wmin = 25us leads to a total delay of 175us in 3 loops.
>
> Since most conditions are likely to complete around 140us-160us, it
> looks like a Wmin of 25 to 30 (25,1000 or 30,1000) will suit this
> workload, but since this is profile-guided optimisation I am wondering
> where the optimal Wmin point is.
>
> This wait is very time-critical. An exponential rise might not be a
> good strategy at higher wait times.
> usleep_range might also be adding extra latency.
>
> Maybe we should do exponential backoff only for waits with US >= 1000,
> and periodic backoff with a 50us period for US < 1000?
>
The results I am seeing do not correspond. First off, it seems I get
different results depending upon the context. That is in the context of
the pre-emption GuC send action command I get the results previously
posted. If I just run usleep_range(x, y) in loop 1000 times from the
context of dumping a debugfs file, I get something very different.
Basically, the minimum sleep time is 110-120us irrespective of the
values of X and Y. Pushing X and Y beyond 120 seems to make it complete
in Y+10-20us. E.g. u_r(100,200) completes in 210-230us for 80% of
samples. On the other hand, I don't get anywhere near so many samples in
the millisecond range as when called in the send action code path.
However, it sounds like the underlying issue might actually be a
back-port merge problem. The customer tree in question is actually a
combination of a 4.1 base kernel with a 4.11 DRM dropped on top. As
noted in a separate thread, this tree also has a problem with the
mutex_lock() call stalling even when the mutex is very definitely not
acquired (using mutex_trylock() eliminates the stall completely).
Apparently the back port process did hit a bunch of conflicts in the
base kernel's scheduling code. So there is a strong possibility that all
the issues we are seeing in that tree are an artifact of a merge issue.
So I think it is probably safe to ignore the results I am seeing in
terms of what the best upstream solution should be.
>>>> Time      Original  10/10  50/10  100/10  110/10  130/10  140/10  RetryLoop
>>>> 10us:     2 2 2 2 2 2 2 2
>>>> 30us:     1 1 1 1 1
>>>> 50us:     1
>>>> 70us:     14 63 56 64 63 61
>>>> 80us:     8 41 52 44 46 41
>>>> 90us:     6 24 10 28 12 17
>>>> 100us:    2 4 20 16 17 17 22
>>>> 110us:    13 21 14 13 11
>>>> 120us:    6 366 633 636 660 650
>>>> 130us:    2 2 46 125 95 86 95
>>>> 140us:    3 2 16 18 32 46 48
>>>> 150us:    210 3 12 13 37 32 31
>>>> 160us:    322 1 18 10 14 12 17
>>>> 170us:    157 4 5 5 3 5 2
>>>> 180us:    62 11 3 1 2 1 1
>>>> 190us:    32 212 1 1 2
>>>> 200us:    27 266 1 1
>>>> 210us:    16 181 1
>>>> 220us:    16 51 1
>>>> 230us:    10 43 4
>>>> 240us:    12 22 62 1
>>>> 250us:    4 12 112 3
>>>> 260us:    3 13 73 8
>>>> 270us:    5 12 12 8 2
>>>> 280us:    4 7 12 5 1
>>>> 290us:    9 4
>>>> 300us:    1 3 9 1 1
>>>> 310us:    2 3 5 1 1
>>>> 320us:    1 4 2 3
>>>> 330us:    1 5 1
>>>> 340us:    1 2 1
>>>> 350us:    2 1
>>>> 360us:    2 1
>>>> 370us:    2 2
>>>> 380us:    1
>>>> 390us:    2 1 2 1
>>>> 410us:    1
>>>> 420us:    3
>>>> 430us:    2 2 1
>>>> 440us:    2 1
>>>> 450us:    4
>>>> 460us:    3 1
>>>> 470us:    3 1
>>>> 480us:    2 2
>>>> 490us:    1
>>>> 500us:    19 13 17
>>>> 1000us:   249 22 30 11
>>>> 1500us:   393 4 4 2 1
>>>> 2000us:   132 7 8 8 2 1 1
>>>> 2500us:   63 4 4 6 1 1 1
>>>> 3000us:   59 9 7 6 1
>>>> 3500us:   34 2 1 1
>>>> 4000us:   17 9 4 1
>>>> 4500us:   8 2 1 1
>>>> 5000us:   7 1 2
>>>> 5500us:   7 2 1
>>>> 6000us:   4 2 1 1
>>>> 6500us:   3 1
>>>> 7000us:   6 2 1
>>>> 7500us:   4 1 1
>>>> 8000us:   5 1
>>>> 8500us:   1 1
>>>> 9000us:   2
>>>> 9500us:   2 1
>>>> >10000us: 3 1
>>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>
>>>
>>
>
end of thread, other threads:[~2017-12-04 21:51 UTC | newest]
Thread overview: 28+ messages
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
2017-11-21 16:29 ` Sagar Arun Kamble
2017-11-21 16:33 ` Chris Wilson
2017-11-21 16:49 ` Sagar Arun Kamble
2017-11-21 20:58 ` Chris Wilson
2017-11-21 16:50 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
2017-11-21 17:11 ` Chris Wilson
2017-11-21 17:29 ` Ville Syrjälä
2017-11-21 20:40 ` Chris Wilson
2017-11-21 17:36 ` [PATCH v2] " Chris Wilson
2017-11-21 17:41 ` ✓ Fi.CI.IGT: success for " Patchwork
2017-11-21 18:03 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2) Patchwork
2017-11-21 19:07 ` ✓ Fi.CI.IGT: " Patchwork
2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
2017-11-22 7:41 ` Sagar Arun Kamble
2017-11-22 9:36 ` Chris Wilson
2017-11-22 9:47 ` Michal Wajdeczko
2017-11-22 10:03 ` Sagar Arun Kamble
2017-11-24 12:37 ` Michał Winiarski
2017-11-24 14:12 ` Chris Wilson
2017-11-30 3:04 ` John Harrison
2017-11-30 6:19 ` Sagar Arun Kamble
2017-11-30 7:15 ` John Harrison
2017-11-30 7:55 ` Sagar Arun Kamble
2017-12-04 21:51 ` John Harrison
2017-11-21 21:32 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3) Patchwork
2017-11-21 22:41 ` ✗ Fi.CI.IGT: warning " Patchwork