* [PATCH] drm/i915: Use exponential backoff for wait_for()
@ 2017-11-21 15:24 Chris Wilson
  2017-11-21 16:29 ` Sagar Arun Kamble
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 15:24 UTC (permalink / raw)
  To: intel-gfx

Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
start with a small sleep and exponentially increase the sleep on each
cycle.

A good example of a beneficiary is the guc mmio communication channel.
Typically we expect (and so spin) for 10us for a quick response, but this
doesn't cover everything and so sometimes we fallback to the millisecond+
sleep. This incurs a significant delay in time-critical operations like
preemption (igt/gem_exec_latency), which can be improved significantly by
using a small sleep after the spin fails.

We've made this suggestion many times, but had little experimental data
to support adding the complexity.

References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: John Harrison <John.C.Harrison@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 69aab324aaa1..c1ea9a009eb4 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -50,6 +50,7 @@
  */
 #define _wait_for(COND, US, W) ({ \
 	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
+	long wait__ = 1;						\
 	int ret__;							\
 	might_sleep();							\
 	for (;;) {							\
@@ -62,7 +63,9 @@
 			ret__ = -ETIMEDOUT;				\
 			break;						\
 		}							\
-		usleep_range((W), (W) * 2);				\
+		usleep_range(wait__, wait__ * 2);			\
+		if (wait__ < (W))					\
+			wait__ <<= 1;					\
 	}								\
 	ret__;								\
 })
-- 
2.15.0
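
For illustration, a minimal user-space sketch of the claimed latency win
(hypothetical timings; a stand-in for the macro, not the macro itself).
A response arriving ~50us after the initial busy-spin gives up is caught
within ~63us by the backoff, versus roughly a full millisecond with the
old fixed sleep:

#include <stdio.h>

int main(void)
{
	const long ready_after_us = 50;	/* hypothetical response time */
	long slept = 0, wait = 1;

	/* old behaviour: a single fixed sleep of ~1000us */
	printf("fixed sleep: woken after ~%d us\n", 1000);

	/* new behaviour: sleep 1, 2, 4, ... us until the condition is met */
	while (slept < ready_after_us) {
		slept += wait;
		if (wait < 1000)	/* W caps the per-cycle sleep */
			wait <<= 1;
	}
	printf("exponential backoff: woken after ~%ld us\n", slept);
	return 0;
}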


* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
@ 2017-11-21 16:29 ` Sagar Arun Kamble
  2017-11-21 16:33   ` Chris Wilson
  2017-11-21 16:50 ` ✓ Fi.CI.BAT: success for " Patchwork
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-21 16:29 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 11/21/2017 8:54 PM, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
>
> A good example of a beneficiary is the guc mmio communication channel.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
>
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
>
> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 69aab324aaa1..c1ea9a009eb4 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -50,6 +50,7 @@
>    */
>   #define _wait_for(COND, US, W) ({ \
>   	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
> +	long wait__ = 1;						\
>   	int ret__;							\
>   	might_sleep();							\
>   	for (;;) {							\
> @@ -62,7 +63,9 @@
>   			ret__ = -ETIMEDOUT;				\
>   			break;						\
>   		}							\
> -		usleep_range((W), (W) * 2);				\
> +		usleep_range(wait__, wait__ * 2);			\
> +		if (wait__ < (W))					\
This should be "wait__ < (US)" ?
> +			wait__ <<= 1;					\
>   	}								\
>   	ret__;								\
>   })


* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 16:29 ` Sagar Arun Kamble
@ 2017-11-21 16:33   ` Chris Wilson
  2017-11-21 16:49     ` Sagar Arun Kamble
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 16:33 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx

Quoting Sagar Arun Kamble (2017-11-21 16:29:57)
> 
> 
> On 11/21/2017 8:54 PM, Chris Wilson wrote:
> > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > start with a small sleep and exponentially increase the sleep on each
> > cycle.
> >
> > A good example of a beneficiary is the guc mmio communication channel.
> > Typically we expect (and so spin) for 10us for a quick response, but this
> > doesn't cover everything and so sometimes we fallback to the millisecond+
> > sleep. This incurs a significant delay in time-critical operations like
> > preemption (igt/gem_exec_latency), which can be improved significantly by
> > using a small sleep after the spin fails.
> >
> > We've made this suggestion many times, but had little experimental data
> > to support adding the complexity.
> >
> > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: John Harrison <John.C.Harrison@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > index 69aab324aaa1..c1ea9a009eb4 100644
> > --- a/drivers/gpu/drm/i915/intel_drv.h
> > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > @@ -50,6 +50,7 @@
> >    */
> >   #define _wait_for(COND, US, W) ({ \
> >       unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
> > +     long wait__ = 1;                                                \
> >       int ret__;                                                      \
> >       might_sleep();                                                  \
> >       for (;;) {                                                      \
> > @@ -62,7 +63,9 @@
> >                       ret__ = -ETIMEDOUT;                             \
> >                       break;                                          \
> >               }                                                       \
> > -             usleep_range((W), (W) * 2);                             \
> > +             usleep_range(wait__, wait__ * 2);                       \
> > +             if (wait__ < (W))                                       \
> This should be "wait__ < (US)" ?

US is the total timeout, W is the sleep, now used to provide a cap
(minimum frequency of polling).
-Chris
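
Put differently, W bounds the individual polling interval, not the wait as
a whole. A tiny user-space illustration of the v1 semantics (assuming the
condition never becomes true within the 10ms budget of wait_for(COND, 10),
so US = 10000 and W = 1000):

#include <stdio.h>

int main(void)
{
	const long total_us = 10 * 1000;	/* US: overall timeout */
	const long cap_us = 1000;		/* W: per-cycle sleep cap */
	long elapsed = 0, wait = 1;

	while (elapsed < total_us) {
		printf("sleep %ld us (elapsed %ld us)\n", wait, elapsed);
		elapsed += wait;
		if (wait < cap_us)
			wait <<= 1;
	}
	return 0;
}

The printed intervals ramp 1, 2, 4, ... and settle at 1024us once the cap
is exceeded; the total budget is still governed by US alone.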

* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 16:33   ` Chris Wilson
@ 2017-11-21 16:49     ` Sagar Arun Kamble
  2017-11-21 20:58       ` Chris Wilson
  0 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-21 16:49 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 11/21/2017 10:03 PM, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-21 16:29:57)
>>
>> On 11/21/2017 8:54 PM, Chris Wilson wrote:
>>> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
>>> start with a small sleep and exponentially increase the sleep on each
>>> cycle.
>>>
>>> A good example of a beneficiary is the guc mmio communication channel.
>>> Typically we expect (and so spin) for 10us for a quick response, but this
>>> doesn't cover everything and so sometimes we fallback to the millisecond+
>>> sleep. This incurs a significant delay in time-critical operations like
>>> preemption (igt/gem_exec_latency), which can be improved significantly by
>>> using a small sleep after the spin fails.
>>>
>>> We've made this suggestion many times, but had little experimental data
>>> to support adding the complexity.
>>>
>>> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: John Harrison <John.C.Harrison@intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
>>> index 69aab324aaa1..c1ea9a009eb4 100644
>>> --- a/drivers/gpu/drm/i915/intel_drv.h
>>> +++ b/drivers/gpu/drm/i915/intel_drv.h
>>> @@ -50,6 +50,7 @@
>>>     */
>>>    #define _wait_for(COND, US, W) ({ \
>>>        unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
>>> +     long wait__ = 1;                                                \
>>>        int ret__;                                                      \
>>>        might_sleep();                                                  \
>>>        for (;;) {                                                      \
>>> @@ -62,7 +63,9 @@
>>>                        ret__ = -ETIMEDOUT;                             \
>>>                        break;                                          \
>>>                }                                                       \
>>> -             usleep_range((W), (W) * 2);                             \
>>> +             usleep_range(wait__, wait__ * 2);                       \
>>> +             if (wait__ < (W))                                       \
>> This should be "wait__ < (US)" ?
> US is the total timeout, W is the sleep, now used to provide a cap
> (minimum frequency of polling).
Oh, right. I thought we wanted to exponentially run down the total
timeout. That might not be the correct approach.
Hybrid polling might be useful here, but we would need to know the
expected wait time precisely for each use of wait_for.
Then we could sleep half the total time and then poll :)
Should users needing faster completion use wait_for_atomic with a higher
timeout in us?
> -Chris


* ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for()
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
  2017-11-21 16:29 ` Sagar Arun Kamble
@ 2017-11-21 16:50 ` Patchwork
  2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 16:50 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Use exponential backoff for wait_for()
URL   : https://patchwork.freedesktop.org/series/34174/
State : success

== Summary ==

Series 34174v1 drm/i915: Use exponential backoff for wait_for()
https://patchwork.freedesktop.org/api/1.0/series/34174/revisions/1/mbox/

Test chamelium:
        Subgroup dp-crc-fast:
                fail       -> PASS       (fi-kbl-7500u) fdo#103163
Test gem_exec_reloc:
        Subgroup basic-gtt-active:
                pass       -> FAIL       (fi-gdg-551) fdo#102582 +3
Test kms_cursor_legacy:
        Subgroup basic-busy-flip-before-cursor-legacy:
                pass       -> FAIL       (fi-gdg-551) fdo#102618

fdo#103163 https://bugs.freedesktop.org/show_bug.cgi?id=103163
fdo#102582 https://bugs.freedesktop.org/show_bug.cgi?id=102582
fdo#102618 https://bugs.freedesktop.org/show_bug.cgi?id=102618

fi-bdw-5557u     total:289  pass:268  dwarn:0   dfail:0   fail:0   skip:21  time:446s
fi-bdw-gvtdvm    total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:447s
fi-blb-e6850     total:289  pass:223  dwarn:1   dfail:0   fail:0   skip:65  time:382s
fi-bsw-n3050     total:289  pass:243  dwarn:0   dfail:0   fail:0   skip:46  time:544s
fi-bxt-dsi       total:289  pass:259  dwarn:0   dfail:0   fail:0   skip:30  time:505s
fi-bxt-j4205     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:510s
fi-byt-j1900     total:289  pass:254  dwarn:0   dfail:0   fail:0   skip:35  time:490s
fi-byt-n2820     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:487s
fi-cfl-s2        total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:613s
fi-elk-e7500     total:289  pass:229  dwarn:0   dfail:0   fail:0   skip:60  time:425s
fi-gdg-551       total:289  pass:173  dwarn:1   dfail:0   fail:6   skip:109 time:275s
fi-glk-1         total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:544s
fi-hsw-4770      total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:427s
fi-hsw-4770r     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:434s
fi-ilk-650       total:289  pass:228  dwarn:0   dfail:0   fail:0   skip:61  time:431s
fi-ivb-3520m     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:471s
fi-ivb-3770      total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:461s
fi-kbl-7500u     total:289  pass:264  dwarn:1   dfail:0   fail:0   skip:24  time:482s
fi-kbl-7560u     total:289  pass:270  dwarn:0   dfail:0   fail:0   skip:19  time:531s
fi-kbl-7567u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:476s
fi-kbl-r         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:534s
fi-pnv-d510      total:289  pass:222  dwarn:1   dfail:0   fail:0   skip:66  time:576s
fi-skl-6260u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:447s
fi-skl-6600u     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:548s
fi-skl-6700hq    total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:568s
fi-skl-6700k     total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:510s
fi-skl-6770hq    total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:501s
fi-skl-gvtdvm    total:289  pass:266  dwarn:0   dfail:0   fail:0   skip:23  time:459s
fi-snb-2520m     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:553s
fi-snb-2600      total:289  pass:249  dwarn:0   dfail:0   fail:0   skip:40  time:416s
Blacklisted hosts:
fi-cnl-y         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:555s
fi-glk-dsi       total:289  pass:258  dwarn:0   dfail:0   fail:1   skip:30  time:504s
fi-bwr-2160 failed to collect. IGT log at Patchwork_7223/fi-bwr-2160/igt.log

f71044153bec6b5fbde984769d97b38fccf06722 drm-tip: 2017y-11m-21d-10h-48m-53s UTC integration manifest
151e3f64cfea drm/i915: Use exponential backoff for wait_for()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7223/

* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
  2017-11-21 16:29 ` Sagar Arun Kamble
  2017-11-21 16:50 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2017-11-21 17:00 ` Tvrtko Ursulin
  2017-11-21 17:11   ` Chris Wilson
  2017-11-21 17:36 ` [PATCH v2] " Chris Wilson
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Tvrtko Ursulin @ 2017-11-21 17:00 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 21/11/2017 15:24, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
> 
> A good example of a beneficiary is the guc mmio communication channel.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
> 
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
> 
> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 69aab324aaa1..c1ea9a009eb4 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -50,6 +50,7 @@
>    */
>   #define _wait_for(COND, US, W) ({ \
>   	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
> +	long wait__ = 1;						\
>   	int ret__;							\
>   	might_sleep();							\
>   	for (;;) {							\
> @@ -62,7 +63,9 @@
>   			ret__ = -ETIMEDOUT;				\
>   			break;						\
>   		}							\
> -		usleep_range((W), (W) * 2);				\
> +		usleep_range(wait__, wait__ * 2);			\
> +		if (wait__ < (W))					\
> +			wait__ <<= 1;					\
>   	}								\
>   	ret__;								\
>   })
> 

I would start the period at 10us since a) <10us is not recommended for 
usleep family, b) most callers specify ms timeouts so <10us poll is 
perhaps an overkill.

Latency sensitive callers like __intel_wait_for_register_us can be 
tweaked at the call site to provide what they want.

For the actual guc mmio send it sounds like it should pass in 20us to 
__intel_wait_for_register_us (referring to John's explanation email) to 
cover 99% of the cases. And then the remaining 1% could be fine with a 
10us delay?

Otherwise we are effectively making _wait_for partially busy looping, or 
whatever the inefficiency in <10us usleep is. I mean, it makes no 
practical difference to make a handful of quick loops there but it feels 
a bit inelegant.

Regards,

Tvrtko

* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
@ 2017-11-21 17:11   ` Chris Wilson
  2017-11-21 17:29     ` Ville Syrjälä
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 17:11 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2017-11-21 17:00:00)
> 
> On 21/11/2017 15:24, Chris Wilson wrote:
> > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > start with a small sleep and exponentially increase the sleep on each
> > cycle.
> > 
> > A good example of a beneficiary is the guc mmio communication channel.
> > Typically we expect (and so spin) for 10us for a quick response, but this
> > doesn't cover everything and so sometimes we fallback to the millisecond+
> > sleep. This incurs a significant delay in time-critical operations like
> > preemption (igt/gem_exec_latency), which can be improved significantly by
> > using a small sleep after the spin fails.
> > 
> > We've made this suggestion many times, but had little experimental data
> > to support adding the complexity.
> > 
> > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: John Harrison <John.C.Harrison@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > index 69aab324aaa1..c1ea9a009eb4 100644
> > --- a/drivers/gpu/drm/i915/intel_drv.h
> > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > @@ -50,6 +50,7 @@
> >    */
> >   #define _wait_for(COND, US, W) ({ \
> >       unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
> > +     long wait__ = 1;                                                \
> >       int ret__;                                                      \
> >       might_sleep();                                                  \
> >       for (;;) {                                                      \
> > @@ -62,7 +63,9 @@
> >                       ret__ = -ETIMEDOUT;                             \
> >                       break;                                          \
> >               }                                                       \
> > -             usleep_range((W), (W) * 2);                             \
> > +             usleep_range(wait__, wait__ * 2);                       \
> > +             if (wait__ < (W))                                       \
> > +                     wait__ <<= 1;                                   \
> >       }                                                               \
> >       ret__;                                                          \
> >   })
> > 
> 
> I would start the period at 10us since a) <10us is not recommended for 
> usleep family, b) most callers specify ms timeouts so <10us poll is 
> perhaps an overkill.

Don't forget the majority of the callers are now via wait_for_register
and so have that 10us poll prior to the sleeping wait_for().

If there's an ARCH_USLEEP_MIN_WHATNOT that would be useful.

> Latency sensitive callers like __intel_wait_for_register_us can be 
> tweaked at the call site to provide what they want.
> 
> For the actual guc mmio send it sounds like it should pass in 20us to 
> __intel_wait_for_register_us (referring to John's explanation email) to 
> cover 99% of the cases. And then the remaining 1% could be fine with a 
> 10us delay?

That it fixed that was a side-effect ;) It just happened to be something
that I could measure the latency of in userspace. I'd rather we have
something generic that does have a demonstrable improvement.

> Otherwise we are effectively making _wait_for partially busy looping, or 
> whatever the inefficiency in <10us usleep is. I mean, it makes no 
> practical difference to make a handful of quick loops there but it feels 
> a bit inelegant.

We already do use the hybrid busy-looping for wait_for_register (and
everybody is meant to be using wait_for_register, the exceptions should
be rare and well justified). The purpose of referencing 1758b90e38f5
("drm/i915: Use a hybrid scheme for fast register waits") was to remind
ourselves of the scheme and its benefits.
-Chris

* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 17:11   ` Chris Wilson
@ 2017-11-21 17:29     ` Ville Syrjälä
  2017-11-21 20:40       ` Chris Wilson
  0 siblings, 1 reply; 28+ messages in thread
From: Ville Syrjälä @ 2017-11-21 17:29 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Tue, Nov 21, 2017 at 05:11:29PM +0000, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-11-21 17:00:00)
> > 
> > On 21/11/2017 15:24, Chris Wilson wrote:
> > > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > > start with a small sleep and exponentially increase the sleep on each
> > > cycle.
> > > 
> > > A good example of a beneficiary is the guc mmio communication channel.
> > > Typically we expect (and so spin) for 10us for a quick response, but this
> > > doesn't cover everything and so sometimes we fallback to the millisecond+
> > > sleep. This incurs a significant delay in time-critical operations like
> > > preemption (igt/gem_exec_latency), which can be improved significantly by
> > > using a small sleep after the spin fails.
> > > 
> > > We've made this suggestion many times, but had little experimental data
> > > to support adding the complexity.
> > > 
> > > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: John Harrison <John.C.Harrison@intel.com>
> > > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> > >   1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > > index 69aab324aaa1..c1ea9a009eb4 100644
> > > --- a/drivers/gpu/drm/i915/intel_drv.h
> > > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > > @@ -50,6 +50,7 @@
> > >    */
> > >   #define _wait_for(COND, US, W) ({ \
> > >       unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
> > > +     long wait__ = 1;                                                \
> > >       int ret__;                                                      \
> > >       might_sleep();                                                  \
> > >       for (;;) {                                                      \
> > > @@ -62,7 +63,9 @@
> > >                       ret__ = -ETIMEDOUT;                             \
> > >                       break;                                          \
> > >               }                                                       \
> > > -             usleep_range((W), (W) * 2);                             \
> > > +             usleep_range(wait__, wait__ * 2);                       \
> > > +             if (wait__ < (W))                                       \
> > > +                     wait__ <<= 1;                                   \
> > >       }                                                               \
> > >       ret__;                                                          \
> > >   })
> > > 
> > 
> > I would start the period at 10us since a) <10us is not recommended for 
> > usleep family, b) most callers specify ms timeouts so <10us poll is 
> > perhaps an overkill.
> 
> Don't forget the majority of the callers are now via wait_for_register
> and so have that 10us poll prior to the sleeping wait_for().
> 
> If there's an ARCH_USLEEP_MIN_WHATNOT that would be useful.
> 
> > Latency sensitive callers like __intel_wait_for_register_us can be 
> > tweaked at the call site to provide what they want.
> > 
> > For the actual guc mmio send it sounds like it should pass in 20us to 
> > __intel_wait_for_register_us (referring to John's explanation email) to 
> > cover 99% of the cases. And then the remaining 1% could be fine with a 
> > 10us delay?
> 
> That it fixed that was a side-effect ;) It just happened to be something
> that I could measure the latency of in userspace. I'd rather we have
> something generic that does have a demonstrable improvement.
> 
> > Otherwise we are effectively making _wait_for partially busy looping, or 
> > whatever the inefficiency in <10us usleep is. I mean, it makes no 
> > practical difference to make a handful of quick loops there but it feels 
> > a bit inelegant.
> 
> We already do use the hybrid busy-looping for wait_for_register (and
> everybody is meant to be using wait_for_register, the exceptions should
> be rare and well justified).

We do have to bear 96dabe99cae8 ("drm/i915: Avoid busy-spinning on
VLV_GLTC_PW_STATUS mmio") in mind. Touching wait_for() might break
that again. Probably the correct choice for these exceptions would
be to specify a minimum polling interval explicitly.

The other cases that come to mind are the wait_for_pipe_off() type of
things which generally measure several milliseconds. Faster polling
won't really help those cases that much except if the pipe already
happens to be close to vblank when we shut it down. But as long as
the max polling interval gets capped to a few milliseconds I guess
the exponential backoff shouldn't hurt us either.

-- 
Ville Syrjälä
Intel OTC

* [PATCH v2] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (2 preceding siblings ...)
  2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
@ 2017-11-21 17:36 ` Chris Wilson
  2017-11-21 17:41 ` ✓ Fi.CI.IGT: success for " Patchwork
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 17:36 UTC (permalink / raw)
  To: intel-gfx

Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
start with a small sleep and exponentially increase the sleep on each
cycle.

A good example of a beneficiary is the guc mmio communication channel.
Typically we expect (and so spin) for 10us for a quick response, but this
doesn't cover everything and so sometimes we fallback to the millisecond+
sleep. This incurs a significant delay in time-critical operations like
preemption (igt/gem_exec_latency), which can be improved significantly by
using a small sleep after the spin fails.

We've made this suggestion many times, but had little experimental data
to support adding the complexity.

v2: Bump the minimum usleep to 10us on the advice of
Documentation/timers/timers-howto.txt (Tvrtko)

References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: John Harrison <John.C.Harrison@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 69aab324aaa1..3e115599c2e6 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -50,6 +50,7 @@
  */
 #define _wait_for(COND, US, W) ({ \
 	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
+	long wait__ = 10; /* recommended minimum for usleep */		\
 	int ret__;							\
 	might_sleep();							\
 	for (;;) {							\
@@ -62,7 +63,9 @@
 			ret__ = -ETIMEDOUT;				\
 			break;						\
 		}							\
-		usleep_range((W), (W) * 2);				\
+		usleep_range(wait__, wait__ * 2);			\
+		if (wait__ < (W))					\
+			wait__ <<= 1;					\
 	}								\
 	ret__;								\
 })
-- 
2.15.0


* ✓ Fi.CI.IGT: success for drm/i915: Use exponential backoff for wait_for()
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (3 preceding siblings ...)
  2017-11-21 17:36 ` [PATCH v2] " Chris Wilson
@ 2017-11-21 17:41 ` Patchwork
  2017-11-21 18:03 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2) Patchwork
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 17:41 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Use exponential backoff for wait_for()
URL   : https://patchwork.freedesktop.org/series/34174/
State : success

== Summary ==

Test drv_selftest:
        Subgroup mock_sanitycheck:
                pass       -> DMESG-WARN (shard-snb) fdo#103717
Test kms_setmode:
        Subgroup basic:
                pass       -> FAIL       (shard-hsw) fdo#99912
Test gem_exec_reloc:
        Subgroup basic-write-cpu-noreloc:
                incomplete -> PASS       (shard-snb)

fdo#103717 https://bugs.freedesktop.org/show_bug.cgi?id=103717
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912

shard-hsw        total:2585 pass:1473 dwarn:2   dfail:1   fail:10  skip:1099 time:9427s
shard-snb        total:2585 pass:1258 dwarn:2   dfail:1   fail:11  skip:1313 time:7929s
Blacklisted hosts:
shard-apl        total:2555 pass:1599 dwarn:2   dfail:0   fail:22  skip:930 time:12175s
shard-kbl        total:2481 pass:1644 dwarn:5   dfail:1   fail:21  skip:807 time:9917s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7223/shards.html

* ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2)
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (4 preceding siblings ...)
  2017-11-21 17:41 ` ✓ Fi.CI.IGT: success for " Patchwork
@ 2017-11-21 18:03 ` Patchwork
  2017-11-21 19:07 ` ✓ Fi.CI.IGT: " Patchwork
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 18:03 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Use exponential backoff for wait_for() (rev2)
URL   : https://patchwork.freedesktop.org/series/34174/
State : success

== Summary ==

Series 34174v2 drm/i915: Use exponential backoff for wait_for()
https://patchwork.freedesktop.org/api/1.0/series/34174/revisions/2/mbox/

Test chamelium:
        Subgroup dp-crc-fast:
                fail       -> PASS       (fi-kbl-7500u) fdo#103163

fdo#103163 https://bugs.freedesktop.org/show_bug.cgi?id=103163

fi-bdw-5557u     total:289  pass:268  dwarn:0   dfail:0   fail:0   skip:21  time:446s
fi-bdw-gvtdvm    total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:450s
fi-blb-e6850     total:289  pass:223  dwarn:1   dfail:0   fail:0   skip:65  time:379s
fi-bsw-n3050     total:289  pass:243  dwarn:0   dfail:0   fail:0   skip:46  time:524s
fi-bwr-2160      total:289  pass:183  dwarn:0   dfail:0   fail:0   skip:106 time:277s
fi-bxt-dsi       total:289  pass:259  dwarn:0   dfail:0   fail:0   skip:30  time:503s
fi-bxt-j4205     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:509s
fi-byt-j1900     total:289  pass:254  dwarn:0   dfail:0   fail:0   skip:35  time:486s
fi-byt-n2820     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:483s
fi-cfl-s2        total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:606s
fi-elk-e7500     total:289  pass:229  dwarn:0   dfail:0   fail:0   skip:60  time:419s
fi-gdg-551       total:289  pass:178  dwarn:1   dfail:0   fail:1   skip:109 time:264s
fi-glk-1         total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:540s
fi-hsw-4770      total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:424s
fi-hsw-4770r     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:434s
fi-ilk-650       total:289  pass:228  dwarn:0   dfail:0   fail:0   skip:61  time:430s
fi-ivb-3520m     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:482s
fi-ivb-3770      total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:460s
fi-kbl-7500u     total:289  pass:264  dwarn:1   dfail:0   fail:0   skip:24  time:487s
fi-kbl-7560u     total:289  pass:270  dwarn:0   dfail:0   fail:0   skip:19  time:531s
fi-kbl-7567u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:482s
fi-kbl-r         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:534s
fi-pnv-d510      total:289  pass:222  dwarn:1   dfail:0   fail:0   skip:66  time:579s
fi-skl-6260u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:446s
fi-skl-6600u     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:542s
fi-skl-6700hq    total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:567s
fi-skl-6700k     total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:512s
fi-skl-6770hq    total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:504s
fi-skl-gvtdvm    total:289  pass:266  dwarn:0   dfail:0   fail:0   skip:23  time:453s
fi-snb-2520m     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:549s
fi-snb-2600      total:289  pass:249  dwarn:0   dfail:0   fail:0   skip:40  time:418s
Blacklisted hosts:
fi-cnl-y         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:549s
fi-glk-dsi       total:32   pass:22   dwarn:0   dfail:0   fail:0   skip:9  

f71044153bec6b5fbde984769d97b38fccf06722 drm-tip: 2017y-11m-21d-10h-48m-53s UTC integration manifest
6ef981ecbe0b drm/i915: Use exponential backoff for wait_for()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7224/

* ✓ Fi.CI.IGT: success for drm/i915: Use exponential backoff for wait_for() (rev2)
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (5 preceding siblings ...)
  2017-11-21 18:03 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2) Patchwork
@ 2017-11-21 19:07 ` Patchwork
  2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 19:07 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Use exponential backoff for wait_for() (rev2)
URL   : https://patchwork.freedesktop.org/series/34174/
State : success

== Summary ==

Test drv_selftest:
        Subgroup mock_sanitycheck:
                pass       -> DMESG-WARN (shard-snb) fdo#103717
                dmesg-warn -> PASS       (shard-hsw) fdo#103719
Test gem_exec_reloc:
        Subgroup basic-write-cpu-noreloc:
                incomplete -> PASS       (shard-snb)
Test kms_setmode:
        Subgroup basic:
                pass       -> FAIL       (shard-hsw) fdo#99912

fdo#103717 https://bugs.freedesktop.org/show_bug.cgi?id=103717
fdo#103719 https://bugs.freedesktop.org/show_bug.cgi?id=103719
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912

shard-hsw        total:2585 pass:1474 dwarn:1   dfail:1   fail:10  skip:1099 time:9419s
shard-snb        total:2585 pass:1258 dwarn:2   dfail:1   fail:11  skip:1313 time:8011s
Blacklisted hosts:
shard-apl        total:2565 pass:1604 dwarn:2   dfail:0   fail:22  skip:936 time:12844s
shard-kbl        total:2500 pass:1656 dwarn:5   dfail:1   fail:25  skip:811 time:10126s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7224/shards.html

* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 17:29     ` Ville Syrjälä
@ 2017-11-21 20:40       ` Chris Wilson
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 20:40 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx

Quoting Ville Syrjälä (2017-11-21 17:29:43)
> On Tue, Nov 21, 2017 at 05:11:29PM +0000, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2017-11-21 17:00:00)
> > > 
> > > On 21/11/2017 15:24, Chris Wilson wrote:
> > > > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > > > start with a small sleep and exponentially increase the sleep on each
> > > > cycle.
> > > > 
> > > > A good example of a beneficiary is the guc mmio communication channel.
> > > > Typically we expect (and so spin) for 10us for a quick response, but this
> > > > doesn't cover everything and so sometimes we fallback to the millisecond+
> > > > sleep. This incurs a significant delay in time-critical operations like
> > > > preemption (igt/gem_exec_latency), which can be improved significantly by
> > > > using a small sleep after the spin fails.
> > > > 
> > > > We've made this suggestion many times, but had little experimental data
> > > > to support adding the complexity.
> > > > 
> > > > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > Cc: John Harrison <John.C.Harrison@intel.com>
> > > > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > > > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > > > ---
> > > >   drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> > > >   1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > > > index 69aab324aaa1..c1ea9a009eb4 100644
> > > > --- a/drivers/gpu/drm/i915/intel_drv.h
> > > > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > > > @@ -50,6 +50,7 @@
> > > >    */
> > > >   #define _wait_for(COND, US, W) ({ \
> > > >       unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
> > > > +     long wait__ = 1;                                                \
> > > >       int ret__;                                                      \
> > > >       might_sleep();                                                  \
> > > >       for (;;) {                                                      \
> > > > @@ -62,7 +63,9 @@
> > > >                       ret__ = -ETIMEDOUT;                             \
> > > >                       break;                                          \
> > > >               }                                                       \
> > > > -             usleep_range((W), (W) * 2);                             \
> > > > +             usleep_range(wait__, wait__ * 2);                       \
> > > > +             if (wait__ < (W))                                       \
> > > > +                     wait__ <<= 1;                                   \
> > > >       }                                                               \
> > > >       ret__;                                                          \
> > > >   })
> > > > 
> > > 
> > > I would start the period at 10us since a) <10us is not recommended for 
> > > usleep family, b) most callers specify ms timeouts so <10us poll is 
> > > perhaps an overkill.
> > 
> > Don't forget the majority of the callers are now via wait_for_register
> > and so have that 10us poll prior to the sleeping wait_for().
> > 
> > If there's an ARCH_USLEEP_MIN_WHATNOT that would be useful.
> > 
> > > Latency sensitive callers like __intel_wait_for_register_us can be 
> > > tweaked at the call site to provide what they want.
> > > 
> > > For the actual guc mmio send it sounds like it should pass in 20us to 
> > > __intel_wait_for_register_us (referring to John's explanation email) to 
> > > cover 99% of the cases. And then the remaining 1% could be fine with a 
> > > 10us delay?
> > 
> > That it fixed that was a side-effect ;) It just happened to be something
> > that I could measure the latency of in userspace. I'd rather we have
> > something generic that does have a demonstrable improvement.
> > 
> > > Otherwise we are effectively making _wait_for partially busy looping, or 
> > > whatever the inefficiency in <10us usleep is. I mean, it makes no 
> > > practical difference to make a handful of quick loops there but it feels 
> > > a bit inelegant.
> > 
> > We already do use the hybrid busy-looping for wait_for_register (and
> > everybody is meant to be using wait_for_register, the exceptions should
> > be rare and well justified).
> 
> We do have to bear 96dabe99cae8 ("drm/i915: Avoid busy-spinning on
> VLV_GLTC_PW_STATUS mmio") in mind. Touching wait_for() might break
> that again. Probably the correct choice for these exceptions would
> be to specify a minimum polling interval explicitly.

Fortunately, that was triggered by BAT:

https://bugs.freedesktop.org/show_bug.cgi?id=100718#c4
> Close to 100% reproducing rate on BYT-n2820 and BSW-n3050 with
> DRM-Tip:
> igt@pm_rpm@gem-execbuf-stress
> igt@pm_rpm@gem-execbuf-stress-extra-wait

Hopefully, the timing hasn't changed significantly since then and we can
rely on current CI to trip over it if the usleep interval is too small.

Thanks for the reminder!

> The other cases that come to mind are the wait_for_pipe_off() type of
> things which generally measure several milliseconds. Faster polling
> won't really help those cases that much except if the pipe already
> happens to be close to vblank when we shut it down. But as long as
> the max polling interval gets capped to a few milliseconds I guess
> the exponential backoff shouldn't hurt us either.

Indeed. Starting at 10us, it'll take us 8 cycles (total of 2.5ms) to hit
the timeout, so 7 more loops than required. Likely in the noise, but
(Wmin, Wmax) is an easy improvement.
-Chris
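
For reference, that arithmetic checks out with a throwaway user-space loop
(assuming the 10us start and the ~1ms cap discussed above):

#include <stdio.h>

int main(void)
{
	long wait = 10, total = 0;
	int cycles = 0;

	while (wait < 1000) {	/* double until the cap is reached */
		total += wait;
		cycles++;
		wait <<= 1;
	}
	total += wait;		/* plus the first capped sleep (1280us) */
	cycles++;
	printf("%d cycles, %ld us total\n", cycles, total);	/* 8, 2550 */
	return 0;
}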

* Re: [PATCH] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 16:49     ` Sagar Arun Kamble
@ 2017-11-21 20:58       ` Chris Wilson
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 20:58 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx

Quoting Sagar Arun Kamble (2017-11-21 16:49:37)
> 
> 
> On 11/21/2017 10:03 PM, Chris Wilson wrote:
> > Quoting Sagar Arun Kamble (2017-11-21 16:29:57)
> >>
> >> On 11/21/2017 8:54 PM, Chris Wilson wrote:
> >>> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> >>> start with a small sleep and exponentially increase the sleep on each
> >>> cycle.
> >>>
> >>> A good example of a beneficiary is the guc mmio communication channel.
> >>> Typically we expect (and so spin) for 10us for a quick response, but this
> >>> doesn't cover everything and so sometimes we fallback to the millisecond+
> >>> sleep. This incurs a significant delay in time-critical operations like
> >>> preemption (igt/gem_exec_latency), which can be improved significantly by
> >>> using a small sleep after the spin fails.
> >>>
> >>> We've made this suggestion many times, but had little experimental data
> >>> to support adding the complexity.
> >>>
> >>> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> Cc: John Harrison <John.C.Harrison@intel.com>
> >>> Cc: Michał Winiarski <michal.winiarski@intel.com>
> >>> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/intel_drv.h | 5 ++++-
> >>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> >>> index 69aab324aaa1..c1ea9a009eb4 100644
> >>> --- a/drivers/gpu/drm/i915/intel_drv.h
> >>> +++ b/drivers/gpu/drm/i915/intel_drv.h
> >>> @@ -50,6 +50,7 @@
> >>>     */
> >>>    #define _wait_for(COND, US, W) ({ \
> >>>        unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
> >>> +     long wait__ = 1;                                                \
> >>>        int ret__;                                                      \
> >>>        might_sleep();                                                  \
> >>>        for (;;) {                                                      \
> >>> @@ -62,7 +63,9 @@
> >>>                        ret__ = -ETIMEDOUT;                             \
> >>>                        break;                                          \
> >>>                }                                                       \
> >>> -             usleep_range((W), (W) * 2);                             \
> >>> +             usleep_range(wait__, wait__ * 2);                       \
> >>> +             if (wait__ < (W))                                       \
> >> This should be "wait__ < (US)" ?
> > US is the total timeout, W is the sleep, now used to provide a cap
> > (minimum frequency of polling).
> Oh, right. I thought we wanted to exponentially run down the total
> timeout. That might not be the correct approach.
> Hybrid polling might be useful here, but we would need to know the
> expected wait time precisely for each use of wait_for.
> Then we could sleep half the total time and then poll :)
> Should users needing faster completion use wait_for_atomic with a higher
> timeout in us?

The majority of users are expected to use intel_wait_for_register*() so
that we don't have large numbers of this large macro expansion in the
code. There we have put some basic attempt at spin/poll. Doing the wait
for half the expected period then spin would be fun -- except that we
have no idea what the expected period is in most cases :)

If you did have an expected period, then usleep(expected_period >>= 1)
would be a good heuristic indeed.

if (expected) {
	do {
		int tmp;

		if (COND) ...

		tmp = expected >> 1;
		usleep_range(tmp, expected);
		expected = tmp;
	} while (expected);
}
(then the current wait_for loops)

But we would also need to feed that knowledge in from somewhere. (Not a
bad idea, just needs a use case to start the ball rolling. Wait for vblank
might be one?)

Instead our assumption is that mmio completes and is acked immediately,
and backoff from that assumption when it fails.
-Chris
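
A fleshed-out (user-space, illustrative) version of the halving heuristic
sketched above; the expected period, the fake condition and the timings
are all made up, and usleep() stands in for usleep_range():

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

/* stand-in for COND: pretend the hardware responds after ~600us */
static int cond_met(long start)
{
	return now_us() - start >= 600;
}

int main(void)
{
	long start = now_us();
	long expected = 800;	/* hypothetical expected wait, in us */

	/* sleep roughly half the remaining expected period each cycle... */
	while (expected > 1 && !cond_met(start)) {
		long tmp = expected >> 1;

		usleep(tmp);	/* usleep_range(tmp, expected) in the kernel */
		expected = tmp;
	}

	/* ...then fall through to the ordinary poll/backoff loop */
	while (!cond_met(start))
		usleep(10);

	printf("condition met after ~%ld us\n", now_us() - start);
	return 0;
}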

* [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (6 preceding siblings ...)
  2017-11-21 19:07 ` ✓ Fi.CI.IGT: " Patchwork
@ 2017-11-21 20:59 ` Chris Wilson
  2017-11-22  7:41   ` Sagar Arun Kamble
  2017-11-24 12:37   ` Michał Winiarski
  2017-11-21 21:32 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3) Patchwork
  2017-11-21 22:41 ` ✗ Fi.CI.IGT: warning " Patchwork
  9 siblings, 2 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-21 20:59 UTC (permalink / raw)
  To: intel-gfx

Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
start with a small sleep and exponentially increase the sleep on each
cycle.

A good example of a beneficiary is the guc mmio communication channel.
Typically we expect (and so spin) for 10us for a quick response, but this
doesn't cover everything and so sometimes we fallback to the millisecond+
sleep. This incurs a significant delay in time-critical operations like
preemption (igt/gem_exec_latency), which can be improved significantly by
using a small sleep after the spin fails.

We've made this suggestion many times, but had little experimental data
to support adding the complexity.

v2: Bump the minimum usleep to 10us on the advice of
Documentation/timers/timers-howto.txt (Tvrtko)
v3: Specify min, max range for usleep intervals -- some code may
crucially depend upon, and so want to specify, the sleep pattern.

References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: John Harrison <John.C.Harrison@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
 drivers/gpu/drm/i915/intel_pm.c  |  2 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 635a96fcd788..c00441a3d649 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -48,8 +48,9 @@
  * having timed out, since the timeout could be due to preemption or similar and
  * we've never had a chance to check the condition before the timeout.
  */
-#define _wait_for(COND, US, W) ({ \
+#define _wait_for(COND, US, Wmin, Wmax) ({ \
 	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
+	long wait__ = (Wmin); /* recommended min for usleep is 10 us */	\
 	int ret__;							\
 	might_sleep();							\
 	for (;;) {							\
@@ -62,12 +63,14 @@
 			ret__ = -ETIMEDOUT;				\
 			break;						\
 		}							\
-		usleep_range((W), (W) * 2);				\
+		usleep_range(wait__, wait__ * 2);			\
+		if (wait__ < (Wmax))					\
+			wait__ <<= 1;					\
 	}								\
 	ret__;								\
 })
 
-#define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
+#define wait_for(COND, MS)	_wait_for((COND), (MS) * 1000, 10, 1000)
 
 /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
 #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
@@ -116,7 +119,7 @@
 	int ret__; \
 	BUILD_BUG_ON(!__builtin_constant_p(US)); \
 	if ((US) > 10) \
-		ret__ = _wait_for((COND), (US), 10); \
+		ret__ = _wait_for((COND), (US), 10, 10); \
 	else \
 		ret__ = _wait_for_atomic((COND), (US), 0); \
 	ret__; \
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index e445ec174831..f07f14ae198d 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -9294,7 +9294,7 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
 		ret = 0;
 		goto out;
 	}
-	ret = _wait_for(COND, timeout_base_ms * 1000, 10);
+	ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
 	if (!ret)
 		goto out;
 
-- 
2.15.0
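
To show what the Wmin/Wmax split buys, here is a small user-space mock of
just the interval update (not driver code; the fixed-interval values are
hypothetical). Equal bounds give a constant polling interval for waits
that must not poll faster than the hardware tolerates, while Wmin < Wmax
gives the exponential ramp:

#include <stdio.h>

/* mock of the v3 interval update: only grow while below the cap */
static long next_interval(long wait, long wmax)
{
	return wait < wmax ? wait << 1 : wait;
}

static void show(const char *label, long wmin, long wmax)
{
	long wait = wmin;
	int i;

	printf("%s:", label);
	for (i = 0; i < 8; i++) {
		printf(" %ld", wait);
		wait = next_interval(wait, wmax);
	}
	printf(" (us)\n");
}

int main(void)
{
	show("wait_for(COND, MS), Wmin=10 Wmax=1000", 10, 1000);
	show("fixed interval, Wmin=Wmax=250", 250, 250);
	return 0;
}

Note the ramp's last step may overshoot Wmax slightly (640 -> 1280 here),
exactly as in the macro, since the cap only stops further doubling.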


* ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3)
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (7 preceding siblings ...)
  2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
@ 2017-11-21 21:32 ` Patchwork
  2017-11-21 22:41 ` ✗ Fi.CI.IGT: warning " Patchwork
  9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 21:32 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Use exponential backoff for wait_for() (rev3)
URL   : https://patchwork.freedesktop.org/series/34174/
State : success

== Summary ==

Series 34174v3 drm/i915: Use exponential backoff for wait_for()
https://patchwork.freedesktop.org/api/1.0/series/34174/revisions/3/mbox/

Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-b:
                pass       -> INCOMPLETE (fi-snb-2520m) fdo#103713

fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713

fi-bdw-5557u     total:289  pass:268  dwarn:0   dfail:0   fail:0   skip:21  time:459s
fi-bdw-gvtdvm    total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:453s
fi-blb-e6850     total:289  pass:223  dwarn:1   dfail:0   fail:0   skip:65  time:380s
fi-bsw-n3050     total:289  pass:243  dwarn:0   dfail:0   fail:0   skip:46  time:534s
fi-bwr-2160      total:289  pass:183  dwarn:0   dfail:0   fail:0   skip:106 time:279s
fi-bxt-dsi       total:289  pass:259  dwarn:0   dfail:0   fail:0   skip:30  time:501s
fi-bxt-j4205     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:497s
fi-byt-j1900     total:289  pass:254  dwarn:0   dfail:0   fail:0   skip:35  time:490s
fi-byt-n2820     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:490s
fi-cfl-s2        total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:604s
fi-elk-e7500     total:289  pass:229  dwarn:0   dfail:0   fail:0   skip:60  time:424s
fi-gdg-551       total:289  pass:178  dwarn:1   dfail:0   fail:1   skip:109 time:263s
fi-glk-1         total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:541s
fi-hsw-4770r     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:432s
fi-ilk-650       total:289  pass:228  dwarn:0   dfail:0   fail:0   skip:61  time:427s
fi-ivb-3520m     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:475s
fi-ivb-3770      total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:457s
fi-kbl-7500u     total:289  pass:264  dwarn:1   dfail:0   fail:0   skip:24  time:477s
fi-kbl-7560u     total:289  pass:270  dwarn:0   dfail:0   fail:0   skip:19  time:528s
fi-kbl-7567u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:476s
fi-kbl-r         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:531s
fi-pnv-d510      total:289  pass:222  dwarn:1   dfail:0   fail:0   skip:66  time:580s
fi-skl-6260u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:451s
fi-skl-6600u     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:545s
fi-skl-6700hq    total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:567s
fi-skl-6700k     total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:513s
fi-skl-6770hq    total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:495s
fi-skl-gvtdvm    total:289  pass:266  dwarn:0   dfail:0   fail:0   skip:23  time:454s
fi-snb-2520m     total:246  pass:212  dwarn:0   dfail:0   fail:0   skip:33 
fi-snb-2600      total:289  pass:249  dwarn:0   dfail:0   fail:0   skip:40  time:418s
Blacklisted hosts:
fi-glk-dsi       total:289  pass:149  dwarn:0   dfail:10  fail:2   skip:128

9cbb5ad289f8f874e048120d6ac3862b2f36a8ac drm-tip: 2017y-11m-21d-19h-39m-42s UTC integration manifest
0a446ff35dd9 drm/i915: Use exponential backoff for wait_for()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7228/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* ✗ Fi.CI.IGT: warning for drm/i915: Use exponential backoff for wait_for() (rev3)
  2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
                   ` (8 preceding siblings ...)
  2017-11-21 21:32 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3) Patchwork
@ 2017-11-21 22:41 ` Patchwork
  9 siblings, 0 replies; 28+ messages in thread
From: Patchwork @ 2017-11-21 22:41 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Use exponential backoff for wait_for() (rev3)
URL   : https://patchwork.freedesktop.org/series/34174/
State : warning

== Summary ==

Test kms_flip:
        Subgroup modeset-vs-vblank-race-interruptible:
                pass       -> FAIL       (shard-hsw) fdo#103060
        Subgroup flip-vs-panning-vs-hang:
                pass       -> DMESG-WARN (shard-snb) fdo#103821
Test kms_setmode:
        Subgroup basic:
                pass       -> FAIL       (shard-hsw) fdo#99912
Test perf:
        Subgroup polling:
                fail       -> PASS       (shard-hsw) fdo#102252
Test drv_suspend:
        Subgroup sysfs-reader:
                pass       -> SKIP       (shard-snb)
        Subgroup fence-restore-untiled:
                pass       -> SKIP       (shard-hsw)
Test kms_atomic_transition:
        Subgroup plane-all-transition-nonblocking-fencing:
                pass       -> SKIP       (shard-hsw)
Test kms_chv_cursor_fail:
        Subgroup pipe-b-128x128-left-edge:
                pass       -> SKIP       (shard-hsw)

fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
fdo#103821 https://bugs.freedesktop.org/show_bug.cgi?id=103821
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252

shard-hsw        total:2585 pass:1470 dwarn:1   dfail:1   fail:11  skip:1102 time:9426s
shard-snb        total:2585 pass:1255 dwarn:2   dfail:1   fail:13  skip:1314 time:8045s
Blacklisted hosts:
shard-apl        total:2563 pass:1598 dwarn:1   dfail:0   fail:23  skip:939 time:13086s
shard-kbl        total:2535 pass:1683 dwarn:4   dfail:0   fail:25  skip:821 time:10218s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7228/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
@ 2017-11-22  7:41   ` Sagar Arun Kamble
  2017-11-22  9:36     ` Chris Wilson
  2017-11-24 12:37   ` Michał Winiarski
  1 sibling, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-22  7:41 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 11/22/2017 2:29 AM, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
>
> A good example of a beneficiary is the guc mmio communication channel.
As Tvrtko said, for the current GuC communication (guc_send_mmio) we
will need to update the fast timeout of __intel_wait_for_register to
20us. The improvement this patch proposes through wait_for will
certainly be seen once we switch over to GuC CT. Maybe specifying "GuC
CT channel" here is apt.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
>
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
>
> v2: Bump the minimum usleep to 10us on advice of
> Documentation/timers/timers-howto.txt (Tvrtko)
> v3: Specify min, max range for usleep intervals -- some code may
> crucially depend upon and so want to specify the sleep pattern.
>
> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
>   drivers/gpu/drm/i915/intel_pm.c  |  2 +-
>   2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 635a96fcd788..c00441a3d649 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -48,8 +48,9 @@
>    * having timed out, since the timeout could be due to preemption or similar and
>    * we've never had a chance to check the condition before the timeout.
>    */
> -#define _wait_for(COND, US, W) ({ \
> +#define _wait_for(COND, US, Wmin, Wmax) ({ \
>   	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
> +	long wait__ = (Wmin); /* recommended min for usleep is 10 us */	\
>   	int ret__;							\
>   	might_sleep();							\
>   	for (;;) {							\
> @@ -62,12 +63,14 @@
>   			ret__ = -ETIMEDOUT;				\
>   			break;						\
>   		}							\
> -		usleep_range((W), (W) * 2);				\
> +		usleep_range(wait__, wait__ * 2);			\
> +		if (wait__ < (Wmax))					\
> +			wait__ <<= 1;					\
I think we need to keep track of the total time we have waited, else we
might wait for longer than necessary.
E.g. for wait_for_us(COND, 900) this approach might actually lead to a
sleep of 1270us.
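(The 1270us figure presumably being the exponential worst case, where COND
only becomes true during the final sleep:

        10 + 20 + 40 + 80 + 160 + 320 = 630us, still short of the 900us budget
        at the timeout check, so one further 640us sleep takes the total to 1270us.)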
>   	}								\
>   	ret__;								\
>   })
>   
> -#define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
> +#define wait_for(COND, MS)	_wait_for((COND), (MS) * 1000, 10, 1000)
>   
>   /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
>   #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
> @@ -116,7 +119,7 @@
>   	int ret__; \
>   	BUILD_BUG_ON(!__builtin_constant_p(US)); \
>   	if ((US) > 10) \
> -		ret__ = _wait_for((COND), (US), 10); \
> +		ret__ = _wait_for((COND), (US), 10, 10); \
>   	else \
>   		ret__ = _wait_for_atomic((COND), (US), 0); \
>   	ret__; \
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index e445ec174831..f07f14ae198d 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -9294,7 +9294,7 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
>   		ret = 0;
>   		goto out;
>   	}
> -	ret = _wait_for(COND, timeout_base_ms * 1000, 10);
> +	ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
>   	if (!ret)
>   		goto out;
>   

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-22  7:41   ` Sagar Arun Kamble
@ 2017-11-22  9:36     ` Chris Wilson
  2017-11-22  9:47       ` Michal Wajdeczko
  2017-11-22 10:03       ` Sagar Arun Kamble
  0 siblings, 2 replies; 28+ messages in thread
From: Chris Wilson @ 2017-11-22  9:36 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx

Quoting Sagar Arun Kamble (2017-11-22 07:41:02)
> 
> 
> On 11/22/2017 2:29 AM, Chris Wilson wrote:
> > Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> > start with a small sleep and exponentially increase the sleep on each
> > cycle.
> >
> > A good example of a beneficiary is the guc mmio communication channel.
> As Tvrtko said, for the current GuC communication (guc_send_mmio) we 
> will need to update fast timeout of
> __intel_wait_for_register to 20us. Improvement this patch proposes 
> through wait_for will
> certainly be seen once we switch over to GuC CT. May be specifying "GuC 
> CT channel" here is apt.

guc mmio comm falls off the fastpath hitting the sleeping wait_for, and
*is* improved by this patch. As far as the latency experienced by
gem_exec_latency is concerned, there is no difference between a 10us
sleep and spinning for 20us.  Changing the spin length to 20us!!! is
something that you should talk to the guc about, at that point we really
need an asynchronous communication channel (ct is still being used
synchronously).

Changing the intel_guc_send_mmio fast timeout is a different
conversation, that is of more dubious merit because of this patch (i.e.
if we can achieve the same latency with a sleep should we spin at all?).

> > Typically we expect (and so spin) for 10us for a quick response, but this
> > doesn't cover everything and so sometimes we fallback to the millisecond+
> > sleep. This incurs a significant delay in time-critical operations like
> > preemption (igt/gem_exec_latency), which can be improved significantly by
> > using a small sleep after the spin fails.
> >
> > We've made this suggestion many times, but had little experimental data
> > to support adding the complexity.
> >
> > v2: Bump the minimum usleep to 10us on advice of
> > Documentation/timers/timers-howto.txt (Tvrtko)
> > v3: Specify min, max range for usleep intervals -- some code may
> > crucially depend upon and so want to specify the sleep pattern.
> >
> > References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: John Harrison <John.C.Harrison@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
> >   drivers/gpu/drm/i915/intel_pm.c  |  2 +-
> >   2 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > index 635a96fcd788..c00441a3d649 100644
> > --- a/drivers/gpu/drm/i915/intel_drv.h
> > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > @@ -48,8 +48,9 @@
> >    * having timed out, since the timeout could be due to preemption or similar and
> >    * we've never had a chance to check the condition before the timeout.
> >    */
> > -#define _wait_for(COND, US, W) ({ \
> > +#define _wait_for(COND, US, Wmin, Wmax) ({ \
> >       unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
> > +     long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
> >       int ret__;                                                      \
> >       might_sleep();                                                  \
> >       for (;;) {                                                      \
> > @@ -62,12 +63,14 @@
> >                       ret__ = -ETIMEDOUT;                             \
> >                       break;                                          \
> >               }                                                       \
> > -             usleep_range((W), (W) * 2);                             \
> > +             usleep_range(wait__, wait__ * 2);                       \
> > +             if (wait__ < (Wmax))                                    \
> > +                     wait__ <<= 1;                                   \
> I think we need to keep track of total time we have waited else we might 
> wait for longer than necessary.

This is not a precise wait, it's a sleep with a rough upper bound.
Sleeps by their very nature are very rough, you are giving up the
processor and telling it not to wake you before a certain time. We are
not asking to be woken at that time, just not before.

> For e.g. for wait_for_us(COND, 900) this approach might actually lead to 
> sleep of 1270us.

The numbers are irrelevant, they are merely round numbers from once upon
a time using msleep(1).
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-22  9:36     ` Chris Wilson
@ 2017-11-22  9:47       ` Michal Wajdeczko
  2017-11-22 10:03       ` Sagar Arun Kamble
  1 sibling, 0 replies; 28+ messages in thread
From: Michal Wajdeczko @ 2017-11-22  9:47 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx, Chris Wilson

On Wed, 22 Nov 2017 10:36:15 +0100, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Sagar Arun Kamble (2017-11-22 07:41:02)
>>
>>
>> On 11/22/2017 2:29 AM, Chris Wilson wrote:
>> > Instead of sleeping for a fixed 1ms (roughly, depending on timer  
>> slack),
>> > start with a small sleep and exponentially increase the sleep on each
>> > cycle.
>> >
>> > A good example of a beneficiary is the guc mmio communication channel.
>> As Tvrtko said, for the current GuC communication (guc_send_mmio) we
>> will need to update fast timeout of
>> __intel_wait_for_register to 20us. Improvement this patch proposes
>> through wait_for will
>> certainly be seen once we switch over to GuC CT. May be specifying "GuC
>> CT channel" here is apt.
>
> guc mmio comm falls off the fastpath hitting the sleeping wait_for, and
> *is* improved by this patch. As far as the latency experienced by
> gem_exec_latency, there is no difference between a 10us sleep and
> spinning for 20us.  Changing the spin length to 20us!!! is
> something that you should talk to the guc about, at that point we really
> need an asynchronous communication channel (ct is still being used
> synchronously).

FYI: an updated series with asynchronous CT will be posted soon; the delay
is due to planned changes to the command format in the GuC firmware.

Michal
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-22  9:36     ` Chris Wilson
  2017-11-22  9:47       ` Michal Wajdeczko
@ 2017-11-22 10:03       ` Sagar Arun Kamble
  1 sibling, 0 replies; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-22 10:03 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 11/22/2017 3:06 PM, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-22 07:41:02)
>>
>> On 11/22/2017 2:29 AM, Chris Wilson wrote:
>>> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
>>> start with a small sleep and exponentially increase the sleep on each
>>> cycle.
>>>
>>> A good example of a beneficiary is the guc mmio communication channel.
>> As Tvrtko said, for the current GuC communication (guc_send_mmio) we
>> will need to update fast timeout of
>> __intel_wait_for_register to 20us. Improvement this patch proposes
>> through wait_for will
>> certainly be seen once we switch over to GuC CT. May be specifying "GuC
>> CT channel" here is apt.
> guc mmio comm falls off the fastpath hitting the sleeping wait_for, and
> *is* improved by this patch.
Thanks for the clarification. I overlooked the sleeping wait of 10ms in
__intel_wait_for_register.
>   As far as the latency experienced by
> gem_exec_latency, there is no difference between a 10us sleep and
> spinning for 20us.  Changing the spin length to 20us!!! is
> something that you should talk to the guc about, at that point we really
> need an asynchronous communication channel (ct is still being used
> synchronously).
>
> Changing the intel_guc_send_mmio fast timeout is a different
> conversation, that is of more dubious merit because of this patch (i.e.
> if we can achieve the same latency with a sleep should we spin at all?).
Agree.
>>> Typically we expect (and so spin) for 10us for a quick response, but this
>>> doesn't cover everything and so sometimes we fallback to the millisecond+
>>> sleep. This incurs a significant delay in time-critical operations like
>>> preemption (igt/gem_exec_latency), which can be improved significantly by
>>> using a small sleep after the spin fails.
>>>
>>> We've made this suggestion many times, but had little experimental data
>>> to support adding the complexity.
>>>
>>> v2: Bump the minimum usleep to 10us on advice of
>>> Documentation/timers/timers-howto.txt (Tvrtko)
>>> v3: Specify min, max range for usleep intervals -- some code may
>>> crucially depend upon and so want to specify the sleep pattern.
>>>
>>> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: John Harrison <John.C.Harrison@intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Reviewed-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
>>>    drivers/gpu/drm/i915/intel_pm.c  |  2 +-
>>>    2 files changed, 8 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
>>> index 635a96fcd788..c00441a3d649 100644
>>> --- a/drivers/gpu/drm/i915/intel_drv.h
>>> +++ b/drivers/gpu/drm/i915/intel_drv.h
>>> @@ -48,8 +48,9 @@
>>>     * having timed out, since the timeout could be due to preemption or similar and
>>>     * we've never had a chance to check the condition before the timeout.
>>>     */
>>> -#define _wait_for(COND, US, W) ({ \
>>> +#define _wait_for(COND, US, Wmin, Wmax) ({ \
>>>        unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;   \
>>> +     long wait__ = (Wmin); /* recommended min for usleep is 10 us */ \
>>>        int ret__;                                                      \
>>>        might_sleep();                                                  \
>>>        for (;;) {                                                      \
>>> @@ -62,12 +63,14 @@
>>>                        ret__ = -ETIMEDOUT;                             \
>>>                        break;                                          \
>>>                }                                                       \
>>> -             usleep_range((W), (W) * 2);                             \
>>> +             usleep_range(wait__, wait__ * 2);                       \
>>> +             if (wait__ < (Wmax))                                    \
>>> +                     wait__ <<= 1;                                   \
>> I think we need to keep track of total time we have waited else we might
>> wait for longer than necessary.
> This is not a precise wait, it's a sleep with a rough upper bound.
> Sleeps by their very nature are very rough, you are giving up the
> processor and telling it not to wake you before a certain time. We are
> not asking to be woken at that time, just not before.
>
>> For e.g. for wait_for_us(COND, 900) this approach might actually lead to
>> sleep of 1270us.
> The numbers are irrelevant, they are merely round numbers from once upon
> a time using msleep(1).
Ok. Thanks for clarification.
> -Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
  2017-11-22  7:41   ` Sagar Arun Kamble
@ 2017-11-24 12:37   ` Michał Winiarski
  2017-11-24 14:12     ` Chris Wilson
  1 sibling, 1 reply; 28+ messages in thread
From: Michał Winiarski @ 2017-11-24 12:37 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Tue, Nov 21, 2017 at 08:59:51PM +0000, Chris Wilson wrote:
> Instead of sleeping for a fixed 1ms (roughly, depending on timer slack),
> start with a small sleep and exponentially increase the sleep on each
> cycle.
> 
> A good example of a beneficiary is the guc mmio communication channel.
> Typically we expect (and so spin) for 10us for a quick response, but this
> doesn't cover everything and so sometimes we fallback to the millisecond+
> sleep. This incurs a significant delay in time-critical operations like
> preemption (igt/gem_exec_latency), which can be improved significantly by
> using a small sleep after the spin fails.
> 
> We've made this suggestion many times, but had little experimental data
> to support adding the complexity.
> 
> v2: Bump the minimum usleep to 10us on advice of
> Documentation/timers/timers-howto.txt (Tvrtko)
> v3: Specify min, max range for usleep intervals -- some code may
> crucially depend upon and so want to specify the sleep pattern.

Since we see the effects for GuC preemption, let's gather some evidence.

(SKL)
intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'

drm-tip:
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 44       |                                        |
        16 -> 31         : 1088     |                                        |
        32 -> 63         : 832      |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 12       |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 29899    |*********                               |
      2048 -> 4095       : 131033   |****************************************|

exponential backoff:
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 58       |                                        |
        16 -> 31         : 2009     |                                        |
        32 -> 63         : 153615   |****************************************|
        64 -> 127        : 4360     |*                                       |
       128 -> 255        : 14       |                                        |
       256 -> 511        : 4236     |*                                       |
       512 -> 1023       : 4        |                                        |

s/10us/100us in __intel_wait_for_register_fw fast_timeout (without exponential
backoff):
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 946      |                                        |
        16 -> 31         : 159803   |****************************************|
        32 -> 63         : 3440     |                                        |
        64 -> 127        : 33       |                                        |
       128 -> 255        : 1        |                                        |
       256 -> 511        : 74       |                                        |

same as the previous one, except with lower GPU clock, let's say max == hw_min:
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 26       |                                        |
        32 -> 63         : 160411   |****************************************|
        64 -> 127        : 3402     |                                        |
       128 -> 255        : 12       |                                        |
       256 -> 511        : 50       |                                        |
       512 -> 1023       : 3        |                                        |
      1024 -> 2047       : 1        |                                        |
      2048 -> 4095       : 2        |                                        |

We're waaay better with exponential backoff compared with current drm-tip, and
we're almost as good as with spinning for a longer period. Unfortunately with
GuC the delay depends on platform (and/or GPU clock?), so we're still seeing the
long tail without exponential backoff if we manage to miss the fast timeout.

> References: 1758b90e38f5 ("drm/i915: Use a hybrid scheme for fast register waits")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>

Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>

-Michał

> ---
>  drivers/gpu/drm/i915/intel_drv.h | 11 +++++++----
>  drivers/gpu/drm/i915/intel_pm.c  |  2 +-
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 635a96fcd788..c00441a3d649 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -48,8 +48,9 @@
>   * having timed out, since the timeout could be due to preemption or similar and
>   * we've never had a chance to check the condition before the timeout.
>   */
> -#define _wait_for(COND, US, W) ({ \
> +#define _wait_for(COND, US, Wmin, Wmax) ({ \
>  	unsigned long timeout__ = jiffies + usecs_to_jiffies(US) + 1;	\
> +	long wait__ = (Wmin); /* recommended min for usleep is 10 us */	\
>  	int ret__;							\
>  	might_sleep();							\
>  	for (;;) {							\
> @@ -62,12 +63,14 @@
>  			ret__ = -ETIMEDOUT;				\
>  			break;						\
>  		}							\
> -		usleep_range((W), (W) * 2);				\
> +		usleep_range(wait__, wait__ * 2);			\
> +		if (wait__ < (Wmax))					\
> +			wait__ <<= 1;					\
>  	}								\
>  	ret__;								\
>  })
>  
> -#define wait_for(COND, MS)	  	_wait_for((COND), (MS) * 1000, 1000)
> +#define wait_for(COND, MS)	_wait_for((COND), (MS) * 1000, 10, 1000)
>  
>  /* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */
>  #if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT)
> @@ -116,7 +119,7 @@
>  	int ret__; \
>  	BUILD_BUG_ON(!__builtin_constant_p(US)); \
>  	if ((US) > 10) \
> -		ret__ = _wait_for((COND), (US), 10); \
> +		ret__ = _wait_for((COND), (US), 10, 10); \
>  	else \
>  		ret__ = _wait_for_atomic((COND), (US), 0); \
>  	ret__; \
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index e445ec174831..f07f14ae198d 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -9294,7 +9294,7 @@ int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request,
>  		ret = 0;
>  		goto out;
>  	}
> -	ret = _wait_for(COND, timeout_base_ms * 1000, 10);
> +	ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
>  	if (!ret)
>  		goto out;
>  
> -- 
> 2.15.0
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-24 12:37   ` Michał Winiarski
@ 2017-11-24 14:12     ` Chris Wilson
  2017-11-30  3:04       ` John Harrison
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Wilson @ 2017-11-24 14:12 UTC (permalink / raw)
  To: Michał Winiarski; +Cc: intel-gfx

Quoting Michał Winiarski (2017-11-24 12:37:56)
> Since we see the effects for GuC preemption, let's gather some evidence.
> 
> (SKL)
> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
> 
> drm-tip:
>      usecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 44       |                                        |
>         16 -> 31         : 1088     |                                        |
>         32 -> 63         : 832      |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 12       |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 29899    |*********                               |
>       2048 -> 4095       : 131033   |****************************************|

Such pretty graphs. Reminds me of the bpf hist output; I wonder if we
could create a tracepoint/kprobe that would output a histogram for each
waiter (filterable ofc). Benefit? Just thinking of tuning the
spin/sleep, in which case overall metrics are best
(intel_wait_for_register needs to be optimised for the typical case). I
am wondering if we could tune the spin period down to 5us, 2us? And then
have the 10us sleep.
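
As a rough sketch of the split I mean, reusing the helpers from this patch
(the 2us spin is just one of the values floated above, not a proposal):

        /* busy-poll briefly for the typical fast case, then sleep with backoff */
        #define __hybrid_wait_sketch(COND, TIMEOUT_US) ({                   \
                int err__ = _wait_for_atomic((COND), 2, 0); /* ~2us spin */  \
                if (err__)                                                   \
                        err__ = _wait_for((COND), (TIMEOUT_US), 10, 1000);   \
                err__;                                                       \
        })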

We would also need a typical workload to run, it's profile-guided
optimisation after all. Hmm.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-24 14:12     ` Chris Wilson
@ 2017-11-30  3:04       ` John Harrison
  2017-11-30  6:19         ` Sagar Arun Kamble
  0 siblings, 1 reply; 28+ messages in thread
From: John Harrison @ 2017-11-30  3:04 UTC (permalink / raw)
  To: Chris Wilson, Michał Winiarski; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 9960 bytes --]

On 11/24/2017 6:12 AM, Chris Wilson wrote:
> Quoting Michał Winiarski (2017-11-24 12:37:56)
>> Since we see the effects for GuC preemption, let's gather some evidence.
>>
>> (SKL)
>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>
>> drm-tip:
>>       usecs               : count     distribution
>>           0 -> 1          : 0        |                                        |
>>           2 -> 3          : 0        |                                        |
>>           4 -> 7          : 0        |                                        |
>>           8 -> 15         : 44       |                                        |
>>          16 -> 31         : 1088     |                                        |
>>          32 -> 63         : 832      |                                        |
>>          64 -> 127        : 0        |                                        |
>>         128 -> 255        : 0        |                                        |
>>         256 -> 511        : 12       |                                        |
>>         512 -> 1023       : 0        |                                        |
>>        1024 -> 2047       : 29899    |*********                               |
>>        2048 -> 4095       : 131033   |****************************************|
> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
> could create a tracepoint/kprobe that would output a histogram for each
> waiter (filterable ofc). Benefit? Just thinking of tuning the
> spin/sleep, in which case overall metrics are best
> (intel_wait_for_register needs to be optimised for the typical case). I
> am wondering if we could tune the spin period down to 5us, 2us? And then
> have the 10us sleep.
>
> We would also need a typical workload to run, it's profile-guided
> optimisation after all. Hmm.
> -Chris

It took me a while to get back to this but I've now had the chance to run
with this exponential backoff scheme on the original system that showed
the problem. It was a slightly messy backport due to the customer tree
being much older than current nightly. I'm pretty sure I got it correct
though. However, I'm not sure what the recommendation is for the two
timeout values. Using the default of '10, 10' in the patch, I still get
lots of very long delays. I have to up the Wmin value to at least 140 to
get a stall-free result. Which is plausible given that the big spike in
the results of any fast version is at 110-150us. Also of note is that a
Wmin between 10 and 110 actually makes things worse. Changing Wmax has
no effect.
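
For reference, "upping Wmin" in my backport just means bumping the third
argument to _wait_for() for the wait in question, along the lines of
(values from this experiment only, not a proposal for nightly):

        #define wait_for(COND, MS)      _wait_for((COND), (MS) * 1000, 140, 1000)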

In the following table, 'original' is the original driver before any 
changes and 'retry loop' is the version using the first workaround of 
just running the busy poll wait in a 10x loop. The other columns are 
using the backoff patch with the given Wmin/Wmax values. Note that the 
times are bucketed to 10us up to 500us and then in 500us lumps 
thereafter. The value listed is the lower limit, i.e. there were no 
times of <10us measured. Each case was run for 1000 samples.

     Time        Original    10/10     50/10    100/10    110/10    130/10    140/10  RetryLoop
     10us:          2         2         2         2 2         2         
2         2
     30us:                              1         1 1         1         1
     50us:                              1
     70us:                             14        63 56        64        
63        61
     80us:                              8        41 52        44        
46        41
     90us:                              6        24 10        28        
12        17
    100us:                    2         4        20 16        17        
17        22
    110us:                                       13 21        14        
13        11
    120us:                              6       366 633       636       
660       650
    130us:                    2         2        46 125        95        
86        95
    140us:                    3         2        16 18        32        
46        48
    150us:                  210         3        12 13        37        
32        31
    160us:                  322         1        18 10        14        
12        17
    170us:                  157         4         5 5         3         
5         2
    180us:                   62        11         3 1         2         
1         1
    190us:                   32       212 1                   1         2
    200us:                   27       266 1                   1
    210us:                   16 
181                                                 1
    220us:                   16 51                                       1
    230us:                   10        43         4
    240us:                   12        22        62         1
    250us:                    4        12       112         3
    260us:                    3        13        73         8
    270us:                    5        12        12 8         2
    280us:                    4         7        12 5         1
    290us:                              9         4
    300us:                    1         3         9 1         1
    310us:                    2         3         5 1         1
    320us:                    1         4         2         3
    330us:                    1         5         1
    340us:                    1 2                   1
    350us:                              2         1
    360us:                              2         1
    370us:                    2                   2
    380us:                                        1
    390us:                    2         1         2         1
    410us:                    1
    420us:                    3
    430us:                    2         2         1
    440us:                    2         1
    450us:                              4
    460us:                    3         1
    470us:                              3         1
    480us:                    2                   2
    490us:                                        1
    500us:                   19        13        17
   1000us:        249        22        30        11
   1500us:        393         4         4         2         1
   2000us:        132         7         8         8 2         
1                   1
   2500us:         63         4         4         6 1         1         1
   3000us:         59         9         7         6         1
   3500us:         34         2 1                             1
   4000us:         17         9         4         1
   4500us:          8         2         1         1
   5000us:          7         1         2
   5500us:          7         2                   1
   6000us:          4         2         1         1
   6500us:          3                             1
   7000us:          6         2                   1
   7500us:          4         1                             1
   8000us:          5                             1
   8500us:                    1         1
   9000us:          2
   9500us:          2         1
 >10000us:          3                             1


John.


[-- Attachment #1.2: Type: text/html, Size: 12023 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-30  3:04       ` John Harrison
@ 2017-11-30  6:19         ` Sagar Arun Kamble
  2017-11-30  7:15           ` John Harrison
  0 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-30  6:19 UTC (permalink / raw)
  To: John Harrison, Chris Wilson, Michał Winiarski; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 11041 bytes --]



On 11/30/2017 8:34 AM, John Harrison wrote:
> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>
>>> (SKL)
>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>
>>> drm-tip:
>>>       usecs               : count     distribution
>>>           0 -> 1          : 0        |                                        |
>>>           2 -> 3          : 0        |                                        |
>>>           4 -> 7          : 0        |                                        |
>>>           8 -> 15         : 44       |                                        |
>>>          16 -> 31         : 1088     |                                        |
>>>          32 -> 63         : 832      |                                        |
>>>          64 -> 127        : 0        |                                        |
>>>         128 -> 255        : 0        |                                        |
>>>         256 -> 511        : 12       |                                        |
>>>         512 -> 1023       : 0        |                                        |
>>>        1024 -> 2047       : 29899    |*********                               |
>>>        2048 -> 4095       : 131033   |****************************************|
>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>> could create a tracepoint/kprobe that would output a histogram for each
>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>> spin/sleep, in which case overall metrics are best
>> (intel_wait_for_register needs to be optimised for the typical case). I
>> am wondering if we could tune the spin period down to 5us, 2us? And then
>> have the 10us sleep.
>>
>> We would also need a typical workload to run, it's profile-guided
>> optimisation after all. Hmm.
>> -Chris
>
> It took me a while to get back to this but I've now had chance to run 
> with this exponential backoff scheme on the original system that 
> showed the problem. It was a slightly messy back port due to the 
> customer tree being much older than current nightly. I'm pretty sure I 
> got it correct though. However, I'm not sure what the recommendation 
> is for the two timeout values. Using the default of '10, 10' in the 
> patch, I still get lots of very long delays. 
The recommended settings currently are Wmin=10, Wmax=10 for wait_for_us and
Wmin=10, Wmax=1000 for wait_for.

Exponential backoff helps most inside wait_for when the wait_for_us that
precedes it is small.
Setting Wmax less than Wmin effectively changes the backoff strategy to
just plain linear waits of Wmin.
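
To spell out the difference (following the macro in the patch, and assuming
COND never becomes true before the timeout):

        Wmin=10, Wmax=10   -> every cycle sleeps usleep_range(10, 20), i.e. linear;
        Wmin=10, Wmax=1000 -> the sleep floor grows 10, 20, 40, 80, 160, 320,
                              640us and then stays at 1280us per cycle (each
                              cycle being usleep_range(wait__, wait__ * 2)).
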
> I have to up the Wmin value to at least 140 to get a stall free 
> result. Which is plausible given that the big spike in the results of 
> any fast version is at 110-150us. Also of note is that a Wmin between 
> 10 and 110 actually makes things worse. Changing Wmax has no effect.
>
> In the following table, 'original' is the original driver before any 
> changes and 'retry loop' is the version using the first workaround of 
> just running the busy poll wait in a 10x loop. The other columns are 
> using the backoff patch with the given Wmin/Wmax values. Note that the 
> times are bucketed to 10us up to 500us and then in 500us lumps 
> thereafter. The value listed is the lower limit, i.e. there were no 
> times of <10us measured. Each case was run for 1000 samples.
>
The setting below, as in current nightly, will suit this workload, and as you
have found it will also likely complete most waits in <150us.
If many samples had been beyond 160us and less than 300us, we might have
needed to change Wmin to maybe 15 or 20 to ensure the
exponential rise caps out around 300us.

wait_for_us(10, 10)
wait_for()

#define wait_for _wait_for(10, 1000)

>     Time        Original    10/10     50/10    100/10    110/10    130/10    140/10  RetryLoop
>     10us:          2         2         2         2 2         2         
> 2         2
>     30us:                              1         1 1         1         1
>     50us:                              1
>     70us:                             14        63 56        64        
> 63        61
>     80us:                              8        41 52        44        
> 46        41
>     90us:                              6        24 10        28        
> 12        17
>    100us:                    2         4        20 16        17        
> 17        22
>    110us:                                       13 21        14        
> 13        11
>    120us:                              6       366 633       636       
> 660       650
>    130us:                    2         2        46 125        
> 95        86        95
>    140us:                    3         2        16 18        32        
> 46        48
>    150us:                  210         3        12 13        37        
> 32        31
>    160us:                  322         1        18 10        14        
> 12        17
>    170us:                  157         4         5 5         3         
> 5         2
>    180us:                   62        11         3 1         2         
> 1         1
>    190us:                   32       212 1                   1         2
>    200us:                   27       266 1                   1
>    210us:                   16 
> 181                                                 1
>    220us:                   16 51                                       1
>    230us:                   10        43         4
>    240us:                   12        22        62         1
>    250us:                    4        12       112         3
>    260us:                    3        13        73         8
>    270us:                    5        12        12 8         2
>    280us:                    4         7        12 5         1
>    290us:                              9         4
>    300us:                    1         3         9 1         1
>    310us:                    2         3         5 1         1
>    320us:                    1         4         2         3
>    330us:                    1         5         1
>    340us:                    1 2                   1
>    350us:                              2         1
>    360us:                              2         1
>    370us:                    2                   2
>    380us:                                        1
>    390us:                    2         1         2         1
>    410us:                    1
>    420us:                    3
>    430us:                    2         2         1
>    440us:                    2         1
>    450us:                              4
>    460us:                    3         1
>    470us:                              3         1
>    480us:                    2                   2
>    490us:                                        1
>    500us:                   19        13        17
>   1000us:        249        22        30        11
>   1500us:        393         4         4         2         1
>   2000us:        132         7         8         8 2         
> 1                   1
>   2500us:         63         4         4         6 1         1         1
>   3000us:         59         9         7         6         1
>   3500us:         34         2 1                             1
>   4000us:         17         9         4         1
>   4500us:          8         2         1         1
>   5000us:          7         1         2
>   5500us:          7         2                   1
>   6000us:          4         2         1         1
>   6500us:          3                             1
>   7000us:          6         2                   1
>   7500us:          4         1                             1
>   8000us:          5                             1
>   8500us:                    1         1
>   9000us:          2
>   9500us:          2         1
> >10000us:          3                             1
>
>
> John.
>
>
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[-- Attachment #1.2: Type: text/html, Size: 14094 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-30  6:19         ` Sagar Arun Kamble
@ 2017-11-30  7:15           ` John Harrison
  2017-11-30  7:55             ` Sagar Arun Kamble
  0 siblings, 1 reply; 28+ messages in thread
From: John Harrison @ 2017-11-30  7:15 UTC (permalink / raw)
  To: Sagar Arun Kamble, Chris Wilson, Michał Winiarski; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 11888 bytes --]

On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
> On 11/30/2017 8:34 AM, John Harrison wrote:
>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>
>>>> (SKL)
>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>
>>>> drm-tip:
>>>>       usecs               : count     distribution
>>>>           0 -> 1          : 0        |                                        |
>>>>           2 -> 3          : 0        |                                        |
>>>>           4 -> 7          : 0        |                                        |
>>>>           8 -> 15         : 44       |                                        |
>>>>          16 -> 31         : 1088     |                                        |
>>>>          32 -> 63         : 832      |                                        |
>>>>          64 -> 127        : 0        |                                        |
>>>>         128 -> 255        : 0        |                                        |
>>>>         256 -> 511        : 12       |                                        |
>>>>         512 -> 1023       : 0        |                                        |
>>>>        1024 -> 2047       : 29899    |*********                               |
>>>>        2048 -> 4095       : 131033   |****************************************|
>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>> could create a tracepoint/kprobe that would output a histogram for each
>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>> spin/sleep, in which case overall metrics are best
>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>> have the 10us sleep.
>>>
>>> We would also need a typical workload to run, it's profile-guided
>>> optimisation after all. Hmm.
>>> -Chris
>>
>> It took me a while to get back to this but I've now had chance to run 
>> with this exponential backoff scheme on the original system that 
>> showed the problem. It was a slightly messy back port due to the 
>> customer tree being much older than current nightly. I'm pretty sure 
>> I got it correct though. However, I'm not sure what the 
>> recommendation is for the two timeout values. Using the default of 
>> '10, 10' in the patch, I still get lots of very long delays. 
> Recommended setting currently is Wmin=10, Wmax=10 for wait_for_us and 
> Wmin=10, Wmax=1000 for wait_for.
>
> Exponential backoff is more helpful inside wait_for if wait_for_us 
> prior to wait_for is smaller.
> Setting Wmax less than Wmin is effectively changing the backoff 
> strategy to just linear waits of Wmin.
>> I have to up the Wmin value to at least 140 to get a stall free 
>> result. Which is plausible given that the big spike in the results of 
>> any fast version is at 110-150us. Also of note is that a Wmin between 
>> 10 and 110 actually makes things worse. Changing Wmax has no effect.
>>
>> In the following table, 'original' is the original driver before any 
>> changes and 'retry loop' is the version using the first workaround of 
>> just running the busy poll wait in a 10x loop. The other columns are 
>> using the backoff patch with the given Wmin/Wmax values. Note that 
>> the times are bucketed to 10us up to 500us and then in 500us lumps 
>> thereafter. The value listed is the lower limit, i.e. there were no 
>> times of <10us measured. Each case was run for 1000 samples.
>>
> Below setting like in current nightly will suit this workload and as 
> you have found this will also likely complete most waits in <150us.
> If many samples had been beyond 160us and less than 300us we might 
> have been needed to change Wmin to may be 15 or 20 to ensure the
> exponential rise caps around 300us.
>
> wait_for_us(10, 10)
> wait_for()
>
> #define wait_for _wait_for(10, 1000)
>
But as shown in the table, a setting of 10/10 does not work well for
this workload. The best results possible are a large spike of waits in
the 120-130us bucket with a small tail out to 150us. Whereas the 10/10
setting produces a spike from 150-170us with the tail extending to 240us
and an appreciable number of samples stretching all the way out to the
1-10ms range. A regular delay of multiple milliseconds is not acceptable
when this path is supposed to be a low-latency preemption to switch to
some super-high-priority, time-critical task. And as noted, I did try a
bunch of different settings for Wmax but nothing seemed to make much of
a difference. E.g. 10/10 vs 10/1000 produced pretty much identical
results. Hence it didn't seem worth including those in the table.


>>     Time        Original    10/10     50/10    100/10    110/10    130/10    140/10  RetryLoop
>>     10us:          2         2         2         2 2         
>> 2         2         2
>>     30us:                              1         1 1         1         1
>>     50us:                              1
>>     70us:                             14        63 56        
>> 64        63        61
>>     80us:                              8        41 52        
>> 44        46        41
>>     90us:                              6        24 10        
>> 28        12        17
>>    100us:                    2         4        20 16        
>> 17        17        22
>>    110us:                                       13 21        
>> 14        13        11
>>    120us:                              6       366 633       
>> 636       660       650
>>    130us:                    2         2        46 125        
>> 95        86        95
>>    140us:                    3         2        16 18        
>> 32        46        48
>>    150us:                  210         3        12 13        37        32        31
>>    160us:                  322         1        18 10        14        12        17
>>    170us:                  157         4         5 5         3         5         2
>>    180us:                   62        11         3 1         2         1         1
>>    190us:                   32       212 1                   1         2
>>    200us:                   27       266 1                   1
>>    210us:                   16 181                                                 1
>>    220us:                   16 51                                       1
>>    230us:                   10        43         4
>>    240us:                   12        22        62 1
>>    250us:                    4        12       112 3
>>    260us:                    3        13        73 8
>>    270us:                    5        12        12 8         2
>>    280us:                    4         7        12 5         1
>>    290us:                              9         4
>>    300us:                    1         3         9 1         1
>>    310us:                    2         3         5 1         1
>>    320us:                    1         4         2 3
>>    330us:                    1         5         1
>>    340us:                    1 2                   1
>>    350us:                              2         1
>>    360us:                              2         1
>>    370us:                    2                   2
>>    380us:                                        1
>>    390us:                    2         1         2 1
>>    410us:                    1
>>    420us:                    3
>>    430us:                    2         2         1
>>    440us:                    2         1
>>    450us:                              4
>>    460us:                    3         1
>>    470us:                              3         1
>>    480us:                    2                   2
>>    490us:                                        1
>>    500us:                   19        13        17
>>   1000us:        249        22        30        11
>>   1500us:        393         4         4         2 1
>>   2000us:        132         7         8         8 2         1                   1
>>   2500us:         63         4         4         6 1         1         1
>>   3000us:         59         9         7         6 1
>>   3500us:         34         2 1                             1
>>   4000us:         17         9         4         1
>>   4500us:          8         2         1         1
>>   5000us:          7         1         2
>>   5500us:          7         2                   1
>>   6000us:          4         2         1         1
>>   6500us:          3                             1
>>   7000us:          6         2                   1
>>   7500us:          4         1 1
>>   8000us:          5                             1
>>   8500us:                    1         1
>>   9000us:          2
>>   9500us:          2         1
>> >10000us:          3                             1
>>
>>
>> John.
>>
>>
>>
>



_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-30  7:15           ` John Harrison
@ 2017-11-30  7:55             ` Sagar Arun Kamble
  2017-12-04 21:51               ` John Harrison
  0 siblings, 1 reply; 28+ messages in thread
From: Sagar Arun Kamble @ 2017-11-30  7:55 UTC (permalink / raw)
  To: John Harrison, Chris Wilson, Michał Winiarski; +Cc: intel-gfx





On 11/30/2017 12:45 PM, John Harrison wrote:
> On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
>> On 11/30/2017 8:34 AM, John Harrison wrote:
>>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>>
>>>>> (SKL)
>>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>>
>>>>> drm-tip:
>>>>>       usecs               : count     distribution
>>>>>           0 -> 1          : 0        |                                        |
>>>>>           2 -> 3          : 0        |                                        |
>>>>>           4 -> 7          : 0        |                                        |
>>>>>           8 -> 15         : 44       |                                        |
>>>>>          16 -> 31         : 1088     |                                        |
>>>>>          32 -> 63         : 832      |                                        |
>>>>>          64 -> 127        : 0        |                                        |
>>>>>         128 -> 255        : 0        |                                        |
>>>>>         256 -> 511        : 12       |                                        |
>>>>>         512 -> 1023       : 0        |                                        |
>>>>>        1024 -> 2047       : 29899    |*********                               |
>>>>>        2048 -> 4095       : 131033   |****************************************|
>>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>>> could create a tracepoint/kprobe that would output a histogram for each
>>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>>> spin/sleep, in which case overall metrics are best
>>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>>> have the 10us sleep.
>>>>
>>>> We would also need a typical workload to run, it's profile-guided
>>>> optimisation after all. Hmm.
>>>> -Chris
>>>
>>> It took me a while to get back to this but I've now had chance to 
>>> run with this exponential backoff scheme on the original system that 
>>> showed the problem. It was a slightly messy back port due to the 
>>> customer tree being much older than current nightly. I'm pretty sure 
>>> I got it correct though. However, I'm not sure what the 
>>> recommendation is for the two timeout values. Using the default of 
>>> '10, 10' in the patch, I still get lots of very long delays. 
>> Recommended setting currently is Wmin=10, Wmax=10 for wait_for_us and 
>> Wmin=10, Wmax=1000 for wait_for.
>>
>> Exponential backoff is more helpful inside wait_for if wait_for_us 
>> prior to wait_for is smaller.
>> Setting Wmax less than Wmin is effectively changing the backoff 
>> strategy to just linear waits of Wmin.
>>> I have to up the Wmin value to at least 140 to get a stall free 
>>> result. Which is plausible given that the big spike in the results 
>>> of any fast version is at 110-150us. Also of note is that a Wmin 
>>> between 10 and 110 actually makes things worse. Changing Wmax has no 
>>> effect.
>>>
>>> In the following table, 'original' is the original driver before any 
>>> changes and 'retry loop' is the version using the first workaround 
>>> of just running the busy poll wait in a 10x loop. The other columns 
>>> are using the backoff patch with the given Wmin/Wmax values. Note 
>>> that the times are bucketed to 10us up to 500us and then in 500us 
>>> lumps thereafter. The value listed is the lower limit, i.e. there 
>>> were no times of <10us measured. Each case was run for 1000 samples.
>>>
>> The setting below, like in current nightly, will suit this workload and, 
>> as you have found, will also likely complete most waits in <150us.
>> If many samples had been beyond 160us and less than 300us we might 
>> have needed to change Wmin to maybe 15 or 20 to ensure the
>> exponential rise caps out around 300us.
>>
>> wait_for_us(10, 10)
>> wait_for()
>>
>> #define wait_for _wait_for(10, 1000)
>>
> But as shown in the table, a setting of 10/10 does not work well for 
> this workload. The best results possible are a large spike of waits in 
> the 120-130us bucket with a small tail out to 150us. Whereas, the 
> 10/10 setting produces a spike from 150-170us with the tail extending 
> to 240us and an appreciable number of samples stretching all the way 
> out to the 1-10ms range. A regular delay of multiple milliseconds is 
> not acceptable when this path is supposed to be a low latency 
> pre-emption to switch to some super high priority time critical task. 
> And as noted, I did try a bunch of different settings for Wmax but 
> nothing seemed to make much of a difference. E.g. 10/10 vs 10/1000 
> produced pretty much identical results. Hence it didn't seem worth 
> including those in the table.
>
Wmin = 10us leads to a total delay of 150us within 3 loops (this might be 
too tight to catch most conditions).
Wmin = 25us leads to a total delay of 175us within 3 loops.

Since most conditions are likely to complete around 140us-160us, it looks 
like a Wmin of 25 to 30 (25,1000 or 30,1000) will suit this workload, but 
since this is profile-guided driver optimization I am wondering about the 
optimal Wmin point.

This wait is very time critical. An exponential rise might not be a good 
strategy at higher wait times.
usleep_range might also be adding extra latency.

Maybe we should do this exponential backoff only for waits with US >= 1000 
and do periodic backoff with a 50us period for US < 1000?
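
A very rough sketch of what that hybrid policy could look like (names and
structure are made up here for illustration; this is not a tested patch):

static unsigned long next_wait_us(unsigned long timeout_us,
				  unsigned long cur_wait_us,
				  unsigned long wmax_us)
{
	/* Periodic backoff for sub-millisecond timeouts: poll every 50us. */
	if (timeout_us < 1000)
		return 50;

	/* Exponential backoff for longer timeouts, capped at Wmax. */
	cur_wait_us *= 2;
	return cur_wait_us < wmax_us ? cur_wait_us : wmax_us;
}

The caller would pass the total timeout US, the sleep used on the previous
cycle and Wmax, and sleep with usleep_range(next, 2 * next) as before.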

>>> [snip: latency table, identical to the one quoted earlier in the thread]
>>>
>>>
>>> John.
>>>
>>>
>>>
>>
>



_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] drm/i915: Use exponential backoff for wait_for()
  2017-11-30  7:55             ` Sagar Arun Kamble
@ 2017-12-04 21:51               ` John Harrison
  0 siblings, 0 replies; 28+ messages in thread
From: John Harrison @ 2017-12-04 21:51 UTC (permalink / raw)
  To: Sagar Arun Kamble, Chris Wilson, Michał Winiarski; +Cc: intel-gfx



On 11/29/2017 11:55 PM, Sagar Arun Kamble wrote:
> On 11/30/2017 12:45 PM, John Harrison wrote:
>> On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
>>> On 11/30/2017 8:34 AM, John Harrison wrote:
>>>> On 11/24/2017 6:12 AM, Chris Wilson wrote:
>>>>> Quoting Michał Winiarski (2017-11-24 12:37:56)
>>>>>> Since we see the effects for GuC preemption, let's gather some evidence.
>>>>>>
>>>>>> (SKL)
>>>>>> intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
>>>>>>
>>>>>> drm-tip:
>>>>>>       usecs               : count     distribution
>>>>>>           0 -> 1          : 0        |                                        |
>>>>>>           2 -> 3          : 0        |                                        |
>>>>>>           4 -> 7          : 0        |                                        |
>>>>>>           8 -> 15         : 44       |                                        |
>>>>>>          16 -> 31         : 1088     |                                        |
>>>>>>          32 -> 63         : 832      |                                        |
>>>>>>          64 -> 127        : 0        |                                        |
>>>>>>         128 -> 255        : 0        |                                        |
>>>>>>         256 -> 511        : 12       |                                        |
>>>>>>         512 -> 1023       : 0        |                                        |
>>>>>>        1024 -> 2047       : 29899    |*********                               |
>>>>>>        2048 -> 4095       : 131033   |****************************************|
>>>>> Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
>>>>> could create a tracepoint/kprobe that would output a histogram for each
>>>>> waiter (filterable ofc). Benefit? Just thinking of tuning the
>>>>> spin/sleep, in which case overall metrics are best
>>>>> (intel_wait_for_register needs to be optimised for the typical case). I
>>>>> am wondering if we could tune the spin period down to 5us, 2us? And then
>>>>> have the 10us sleep.
>>>>>
>>>>> We would also need a typical workload to run, it's profile-guided
>>>>> optimisation after all. Hmm.
>>>>> -Chris
>>>>
>>>> It took me a while to get back to this but I've now had chance to 
>>>> run with this exponential backoff scheme on the original system 
>>>> that showed the problem. It was a slightly messy back port due to 
>>>> the customer tree being much older than current nightly. I'm pretty 
>>>> sure I got it correct though. However, I'm not sure what the 
>>>> recommendation is for the two timeout values. Using the default of 
>>>> '10, 10' in the patch, I still get lots of very long delays. 
>>> Recommended setting currently is Wmin=10, Wmax=10 for wait_for_us 
>>> and Wmin=10, Wmax=1000 for wait_for.
>>>
>>> Exponential backoff is more helpful inside wait_for if wait_for_us 
>>> prior to wait_for is smaller.
>>> Setting Wmax less than Wmin is effectively changing the backoff 
>>> strategy to just linear waits of Wmin.
>>>> I have to up the Wmin value to at least 140 to get a stall free 
>>>> result. Which is plausible given that the big spike in the results 
>>>> of any fast version is at 110-150us. Also of note is that a Wmin 
>>>> between 10 and 110 actually makes things worse. Changing Wmax has 
>>>> no effect.
>>>>
>>>> In the following table, 'original' is the original driver before 
>>>> any changes and 'retry loop' is the version using the first 
>>>> workaround of just running the busy poll wait in a 10x loop. The 
>>>> other columns are using the backoff patch with the given Wmin/Wmax 
>>>> values. Note that the times are bucketed to 10us up to 500us and 
>>>> then in 500us lumps thereafter. The value listed is the lower 
>>>> limit, i.e. there were no times of <10us measured. Each case was 
>>>> run for 1000 samples.
>>>>
>>> The setting below, like in current nightly, will suit this workload and, 
>>> as you have found, will also likely complete most waits in <150us.
>>> If many samples had been beyond 160us and less than 300us we might 
>>> have needed to change Wmin to maybe 15 or 20 to ensure the
>>> exponential rise caps out around 300us.
>>>
>>> wait_for_us(10, 10)
>>> wait_for()
>>>
>>> #define wait_for _wait_for(10, 1000)
>>>
>> But as shown in the table, a setting of 10/10 does not work well for 
>> this workload. The best results possible are a large spike of waits 
>> in the 120-130us bucket with a small tail out to 150us. Whereas, the 
>> 10/10 setting produces a spike from 150-170us with the tail extending 
>> to 240us and an appreciable number of samples stretching all the way 
>> out to the 1-10ms range. A regular delay of multiple milliseconds is 
>> not acceptable when this path is supposed to be a low latency 
>> pre-emption to switch to some super high priority time critical task. 
>> And as noted, I did try a bunch of different settings for Wmax but 
>> nothing seemed to make much of a difference. E.g. 10/10 vs 10/1000 
>> produced pretty much identical results. Hence it didn't seem worth 
>> including those in the table.
>>
> Wmin = 10us leads to a total delay of 150us within 3 loops (this might be 
> too tight to catch most conditions).
> Wmin = 25us leads to a total delay of 175us within 3 loops.
>
> Since most conditions are likely to complete around 140us-160us, it looks 
> like a Wmin of 25 to 30 (25,1000 or 30,1000) will suit this workload, but 
> since this is profile-guided driver optimization I am wondering about the 
> optimal Wmin point.
>
> This wait is very time critical. An exponential rise might not be a good 
> strategy at higher wait times.
> usleep_range might also be adding extra latency.
>
> Maybe we should do this exponential backoff only for waits with US >= 1000 
> and do periodic backoff with a 50us period for US < 1000?
>

The results I am seeing do not correspond with that analysis. First off, it 
seems I get different results depending upon the context. That is, in the 
context of the pre-emption GuC send-action command I get the results 
previously posted. If I just run usleep_range(x, y) in a loop 1000 times 
from the context of dumping a debugfs file, I get something very different. 
Basically, the minimum sleep time is 110-120us irrespective of the values 
of X and Y. Pushing X and Y beyond 120 seems to make it complete in 
Y+10-20us. E.g. usleep_range(100, 200) completes in 210-230us for 80% of 
samples. On the other hand, I don't get anywhere near as many samples in 
the millisecond range as when called in the send-action code path.
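
The measurement loop itself was essentially just timing usleep_range()
calls; something of this shape (sketched here, with the bucketing and the
debugfs plumbing omitted):

/* Needs <linux/ktime.h>, <linux/delay.h> and <linux/printk.h>. */
static void measure_usleep_range(unsigned long x, unsigned long y)
{
	int i;

	for (i = 0; i < 1000; i++) {
		ktime_t t0 = ktime_get();

		usleep_range(x, y);
		/* Elapsed time in us, to be bucketed as in the table above. */
		pr_info("usleep_range(%lu, %lu) took %lldus\n",
			x, y, ktime_us_delta(ktime_get(), t0));
	}
}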

However, it sounds like the underlying issue might actually be a 
back-port merge problem. The customer tree in question is actually a 
combination of a 4.1 base kernel with a 4.11 DRM dropped on top. As 
noted in a separate thread, this tree also has a problem with the 
mutex_lock() call stalling even when the mutex is very definitely not 
acquired (using mutex_trylock() eliminates the stall completely). 
Apparently the back port process did hit a bunch of conflicts in the 
base kernel's scheduling code. So there is a strong possibility that all 
the issues we are seeing in that tree are an artifact of a merge issue.
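
(For clarity, the mutex_trylock() workaround is nothing more than the
obvious pattern below, shown with a placeholder lock name:)

	/* Workaround sketch: avoid the stalling mutex_lock() in that tree. */
	while (!mutex_trylock(&the_lock))
		cpu_relax();
	/* ... guarded work ... */
	mutex_unlock(&the_lock);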

So I think it is probably safe to ignore the results I am seeing in 
terms of what the best upstream solution should be.



>>>> [snip: latency table, identical to the one quoted earlier in the thread]
>>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>
>>>
>>
>



_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-12-04 21:51 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-21 15:24 [PATCH] drm/i915: Use exponential backoff for wait_for() Chris Wilson
2017-11-21 16:29 ` Sagar Arun Kamble
2017-11-21 16:33   ` Chris Wilson
2017-11-21 16:49     ` Sagar Arun Kamble
2017-11-21 20:58       ` Chris Wilson
2017-11-21 16:50 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-11-21 17:00 ` [PATCH] " Tvrtko Ursulin
2017-11-21 17:11   ` Chris Wilson
2017-11-21 17:29     ` Ville Syrjälä
2017-11-21 20:40       ` Chris Wilson
2017-11-21 17:36 ` [PATCH v2] " Chris Wilson
2017-11-21 17:41 ` ✓ Fi.CI.IGT: success for " Patchwork
2017-11-21 18:03 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev2) Patchwork
2017-11-21 19:07 ` ✓ Fi.CI.IGT: " Patchwork
2017-11-21 20:59 ` [PATCH v3] drm/i915: Use exponential backoff for wait_for() Chris Wilson
2017-11-22  7:41   ` Sagar Arun Kamble
2017-11-22  9:36     ` Chris Wilson
2017-11-22  9:47       ` Michal Wajdeczko
2017-11-22 10:03       ` Sagar Arun Kamble
2017-11-24 12:37   ` Michał Winiarski
2017-11-24 14:12     ` Chris Wilson
2017-11-30  3:04       ` John Harrison
2017-11-30  6:19         ` Sagar Arun Kamble
2017-11-30  7:15           ` John Harrison
2017-11-30  7:55             ` Sagar Arun Kamble
2017-12-04 21:51               ` John Harrison
2017-11-21 21:32 ` ✓ Fi.CI.BAT: success for drm/i915: Use exponential backoff for wait_for() (rev3) Patchwork
2017-11-21 22:41 ` ✗ Fi.CI.IGT: warning " Patchwork
