All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] arm: spin one more cycle in timer-based delays
@ 2016-11-15 13:36 Mason
  2016-11-18 12:06 ` Will Deacon
  0 siblings, 1 reply; 17+ messages in thread
From: Mason @ 2016-11-15 13:36 UTC (permalink / raw)
  To: linux-arm-kernel

When polling a tick counter in a busy loop, one might sample the
counter just *before* it is updated, and then again just *after*
it is updated. In that case, while mathematically v2-v1 equals 1,
only epsilon has really passed.

So, if a caller requests an N-cycle delay, we spin until v2-v1
is strictly greater than N to avoid these random corner cases.

Signed-off-by: Mason <slash.tmp@free.fr>
---
 arch/arm/lib/delay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 69aad80a3af4..3f1cd15ca102 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
 
-	while ((get_cycles() - start) < cycles)
+	while ((get_cycles() - start) <= cycles)
 		cpu_relax();
 }
 
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-15 13:36 [PATCH] arm: spin one more cycle in timer-based delays Mason
@ 2016-11-18 12:06 ` Will Deacon
  2016-11-18 12:24   ` Mason
  2016-11-18 12:54   ` Russell King - ARM Linux
  0 siblings, 2 replies; 17+ messages in thread
From: Will Deacon @ 2016-11-18 12:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
> When polling a tick counter in a busy loop, one might sample the
> counter just *before* it is updated, and then again just *after*
> it is updated. In that case, while mathematically v2-v1 equals 1,
> only epsilon has really passed.
> 
> So, if a caller requests an N-cycle delay, we spin until v2-v1
> is strictly greater than N to avoid these random corner cases.
> 
> Signed-off-by: Mason <slash.tmp@free.fr>
> ---
>  arch/arm/lib/delay.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
> index 69aad80a3af4..3f1cd15ca102 100644
> --- a/arch/arm/lib/delay.c
> +++ b/arch/arm/lib/delay.c
> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>  {
>  	cycles_t start = get_cycles();
>  
> -	while ((get_cycles() - start) < cycles)
> +	while ((get_cycles() - start) <= cycles)
>  		cpu_relax();
>  }

I thought a bit about this last night. It is well known that the delay
routines are not guaranteed to be accurate, and I don't *think* that's
limited to over-spinning, so arguably this isn't a bug. However, taking
the approach that "drivers should figure it out" is also unhelpful,
because the frequency of the underlying counter isn't generally known
and so drivers can't simply adjust the delay parameter.

If you want to go ahead with this change, I do think that you should fix
the other architectures for consistency (particularly arm64, which is
likely to share drivers with arch/arm/). However, given that I don't
think you've seen a failure in practice, it might be a hard sell to get
the patches picked up, especially given the deliberately vague guarantees
offered by the delay loop.

Will

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 12:06 ` Will Deacon
@ 2016-11-18 12:24   ` Mason
  2016-11-18 12:54   ` Russell King - ARM Linux
  1 sibling, 0 replies; 17+ messages in thread
From: Mason @ 2016-11-18 12:24 UTC (permalink / raw)
  To: linux-arm-kernel

On 18/11/2016 13:06, Will Deacon wrote:

> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>
>> When polling a tick counter in a busy loop, one might sample the
>> counter just *before* it is updated, and then again just *after*
>> it is updated. In that case, while mathematically v2-v1 equals 1,
>> only epsilon has really passed.
>>
>> So, if a caller requests an N-cycle delay, we spin until v2-v1
>> is strictly greater than N to avoid these random corner cases.
>>
>> Signed-off-by: Mason <slash.tmp@free.fr>
>> ---
>>  arch/arm/lib/delay.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>> index 69aad80a3af4..3f1cd15ca102 100644
>> --- a/arch/arm/lib/delay.c
>> +++ b/arch/arm/lib/delay.c
>> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>>  {
>>  	cycles_t start = get_cycles();
>>  
>> -	while ((get_cycles() - start) < cycles)
>> +	while ((get_cycles() - start) <= cycles)
>>  		cpu_relax();
>>  }
> 
> I thought a bit about this last night. It is well known that the delay
> routines are not guaranteed to be accurate, and I don't *think* that's
> limited to over-spinning, so arguably this isn't a bug. However, taking
> the approach that "drivers should figure it out" is also unhelpful,
> because the frequency of the underlying counter isn't generally known
> and so drivers can't simply adjust the delay parameter.
> 
> If you want to go ahead with this change, I do think that you should fix
> the other architectures for consistency (particularly arm64, which is
> likely to share drivers with arch/arm/). However, given that I don't
> think you've seen a failure in practice, it might be a hard sell to get
> the patches picked up, especially given the deliberately vague guarantees
> offered by the delay loop.

Hello Will,

Thanks for the in-depth analysis.

This is, conceptually, the first patch in a 2-patch series, and perhaps
the intent would have been clearer had I posted the series, along with
full rationale in the cover letter. I'll do that next week, so everyone
can judge with all the information in hand.

Regards.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 12:06 ` Will Deacon
  2016-11-18 12:24   ` Mason
@ 2016-11-18 12:54   ` Russell King - ARM Linux
  2016-11-18 14:18     ` Mason
  2016-11-18 17:13     ` Doug Anderson
  1 sibling, 2 replies; 17+ messages in thread
From: Russell King - ARM Linux @ 2016-11-18 12:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
> > When polling a tick counter in a busy loop, one might sample the
> > counter just *before* it is updated, and then again just *after*
> > it is updated. In that case, while mathematically v2-v1 equals 1,
> > only epsilon has really passed.
> > 
> > So, if a caller requests an N-cycle delay, we spin until v2-v1
> > is strictly greater than N to avoid these random corner cases.
> > 
> > Signed-off-by: Mason <slash.tmp@free.fr>
> > ---
> >  arch/arm/lib/delay.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
> > index 69aad80a3af4..3f1cd15ca102 100644
> > --- a/arch/arm/lib/delay.c
> > +++ b/arch/arm/lib/delay.c
> > @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
> >  {
> >  	cycles_t start = get_cycles();
> >  
> > -	while ((get_cycles() - start) < cycles)
> > +	while ((get_cycles() - start) <= cycles)
> >  		cpu_relax();
> >  }
> 
> I thought a bit about this last night. It is well known that the delay
> routines are not guaranteed to be accurate, and I don't *think* that's
> limited to over-spinning, so arguably this isn't a bug. However, taking
> the approach that "drivers should figure it out" is also unhelpful,
> because the frequency of the underlying counter isn't generally known
> and so drivers can't simply adjust the delay parameter.

I don't think this change makes any sense whatsoever.  udelay() is
inaccurate, period.  It _will_ give delays shorter than requested
for many reasons, many of which can't be fixed.

Having a super-accurate version just encourages people to write broken
drivers which assume (eg) that udelay(10) will give _at least_ a 10us
delay.  However, there is no such guarantee.

So, having udelay() for timers return slightly short is actually a good
thing - it causes people not to make the mistake to be soo accurate
with their delay specifications.

So, NAK on this change.  udelay is not super-accurate.

Reference (and this is the most important one):

  http://lists.openwall.net/linux-kernel/2011/01/12/372

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 12:54   ` Russell King - ARM Linux
@ 2016-11-18 14:18     ` Mason
  2016-11-18 14:42       ` Peter Zijlstra
                         ` (2 more replies)
  2016-11-18 17:13     ` Doug Anderson
  1 sibling, 3 replies; 17+ messages in thread
From: Mason @ 2016-11-18 14:18 UTC (permalink / raw)
  To: linux-arm-kernel

On 18/11/2016 13:54, Russell King - ARM Linux wrote:
> On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
>> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>>> When polling a tick counter in a busy loop, one might sample the
>>> counter just *before* it is updated, and then again just *after*
>>> it is updated. In that case, while mathematically v2-v1 equals 1,
>>> only epsilon has really passed.
>>>
>>> So, if a caller requests an N-cycle delay, we spin until v2-v1
>>> is strictly greater than N to avoid these random corner cases.
>>>
>>> Signed-off-by: Mason <slash.tmp@free.fr>
>>> ---
>>>  arch/arm/lib/delay.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>>> index 69aad80a3af4..3f1cd15ca102 100644
>>> --- a/arch/arm/lib/delay.c
>>> +++ b/arch/arm/lib/delay.c
>>> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>>>  {
>>>  	cycles_t start = get_cycles();
>>>  
>>> -	while ((get_cycles() - start) < cycles)
>>> +	while ((get_cycles() - start) <= cycles)
>>>  		cpu_relax();
>>>  }
>>
>> I thought a bit about this last night. It is well known that the delay
>> routines are not guaranteed to be accurate, and I don't *think* that's
>> limited to over-spinning, so arguably this isn't a bug. However, taking
>> the approach that "drivers should figure it out" is also unhelpful,
>> because the frequency of the underlying counter isn't generally known
>> and so drivers can't simply adjust the delay parameter.
> 
> I don't think this change makes any sense whatsoever.  udelay() is
> inaccurate, period.  It _will_ give delays shorter than requested
> for many reasons, many of which can't be fixed.

If I understand correctly, udelay is, in fact, a wrapper around
several possible implementations. (Run-time decision on arch/arm)

1) The "historical" software loop-based method, which is calibrated
during init.

2) A hardware solution, based on an actual timer, ticking away
at a constant frequency.

3) others?

Solution 1 (which probably dates back to Linux 0.1) suffers from
the inaccuracies you point out, because of interrupt latency
during calibration, DVFS, etc.

On the other hand, it is simple enough to implement solution 2
in a way that guarantees that spin_for_n_cycles(n) spins
/at least/ for n cycles.

On platforms using timer-based delays, there can indeed exist
a function guaranteeing a minimal delay.


> Having a super-accurate version just encourages people to write broken
> drivers which assume (eg) that udelay(10) will give _at least_ a 10us
> delay.  However, there is no such guarantee.

You say that calling udelay(10) expecting at least 10 ?s is broken.
This fails the principle of least astonishment in a pretty big way.
(If it returns after 9.9 ?s, I suppose it could be OK. But for
shorter delays, the relative error grows.)

A lot of drivers implement a spec that says
"do this, wait 1 ?s, do that".

Driver writers would typically write
do_this(); udelay(1); do_that();

Do you suggest they should write udelay(2); ?
But then, it's not so easy to follow the spec because everything
has been translated to a different number.
Hide everything behind a macro?

Note that people have been using ndelay too.
(I count 182 in drivers, 288 overall.)

drivers/cpufreq/s3c2440-cpufreq.c:      ndelay(20);

I don't think the author expects this to return immediately.


> So, having udelay() for timers return slightly short is actually a good
> thing - it causes people not to make the mistake to be soo accurate
> with their delay specifications.

I disagree that having an implementation which introduces
hard-to-track-down random heisenbugs is a good thing.


> So, NAK on this change.  udelay is not super-accurate.

usleep_range() fixed this issue recently.
6c5e9059692567740a4ee51530dffe51a4b9584d
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d

Do you suggest driver writers should use usleep_range()
instead of udelay?
(When does usleep_range start making sense? Around 50/100 ?s
and above?)

> Reference (and this is the most important one):
> 
>   http://lists.openwall.net/linux-kernel/2011/01/12/372


Regards.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 14:18     ` Mason
@ 2016-11-18 14:42       ` Peter Zijlstra
  2016-11-18 17:51       ` Mason
  2016-11-19  7:17       ` Afzal Mohammed
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2016-11-18 14:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Nov 18, 2016 at 03:18:58PM +0100, Mason wrote:
> If I understand correctly, udelay is, in fact, a wrapper around
> several possible implementations. (Run-time decision on arch/arm)
> 
> 1) The "historical" software loop-based method, which is calibrated
> during init.
> 
> 2) A hardware solution, based on an actual timer, ticking away
> at a constant frequency.

I'd say clock, a timer would imply interrupts or somesuch, which is of
course not useful for udelay.

x86 with TSC uses this, although recent chips have grown an timed-mwait
instruction and that too can be used.

The TSC based ones (including the mwait) try very hard to not return
early.

> 3) others?

On x86 there's a variant that times itself based on io port accesses
which have a 'known' slowness.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 12:54   ` Russell King - ARM Linux
  2016-11-18 14:18     ` Mason
@ 2016-11-18 17:13     ` Doug Anderson
  2016-11-18 17:32       ` Russell King - ARM Linux
  1 sibling, 1 reply; 17+ messages in thread
From: Doug Anderson @ 2016-11-18 17:13 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Nov 18, 2016 at 4:54 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
>> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>> > When polling a tick counter in a busy loop, one might sample the
>> > counter just *before* it is updated, and then again just *after*
>> > it is updated. In that case, while mathematically v2-v1 equals 1,
>> > only epsilon has really passed.
>> >
>> > So, if a caller requests an N-cycle delay, we spin until v2-v1
>> > is strictly greater than N to avoid these random corner cases.
>> >
>> > Signed-off-by: Mason <slash.tmp@free.fr>
>> > ---
>> >  arch/arm/lib/delay.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>> > index 69aad80a3af4..3f1cd15ca102 100644
>> > --- a/arch/arm/lib/delay.c
>> > +++ b/arch/arm/lib/delay.c
>> > @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>> >  {
>> >     cycles_t start = get_cycles();
>> >
>> > -   while ((get_cycles() - start) < cycles)
>> > +   while ((get_cycles() - start) <= cycles)
>> >             cpu_relax();
>> >  }
>>
>> I thought a bit about this last night. It is well known that the delay
>> routines are not guaranteed to be accurate, and I don't *think* that's
>> limited to over-spinning, so arguably this isn't a bug. However, taking
>> the approach that "drivers should figure it out" is also unhelpful,
>> because the frequency of the underlying counter isn't generally known
>> and so drivers can't simply adjust the delay parameter.
>
> I don't think this change makes any sense whatsoever.  udelay() is
> inaccurate, period.  It _will_ give delays shorter than requested
> for many reasons, many of which can't be fixed.
>
> Having a super-accurate version just encourages people to write broken
> drivers which assume (eg) that udelay(10) will give _at least_ a 10us
> delay.  However, there is no such guarantee.
>
> So, having udelay() for timers return slightly short is actually a good
> thing - it causes people not to make the mistake to be soo accurate
> with their delay specifications.
>
> So, NAK on this change.  udelay is not super-accurate.
>
> Reference (and this is the most important one):
>
>   http://lists.openwall.net/linux-kernel/2011/01/12/372

Mason's change adds no complexity to the code and seems like a
generally good idea.

As per John Stultz [1] there is agreement that udelay() really
shouldn't delay less than the requested amount.  In fact, we even have
a test committed to the tree: "kernel/time/test_udelay.c".  As your
reference says, maybe 1% isn't enough to throw a hissyfit about, but
when the change is so simple and adds no complexity then it seems like
a worthwhile thing to do.  Why make things inaccurate for no reason?

[1] http://www.gossamer-threads.com/lists/linux/kernel/1918249


For what it's worth, Mason's change is:

Reviewed-by: Douglas Anderson <dianders@chromium.org>

-Doug

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 17:13     ` Doug Anderson
@ 2016-11-18 17:32       ` Russell King - ARM Linux
  0 siblings, 0 replies; 17+ messages in thread
From: Russell King - ARM Linux @ 2016-11-18 17:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Nov 18, 2016 at 09:13:18AM -0800, Doug Anderson wrote:
> Mason's change adds no complexity to the code and seems like a
> generally good idea.
> 
> As per John Stultz [1] there is agreement that udelay() really
> shouldn't delay less than the requested amount.

There is NO such agreement, read the reference that I gave, which is
a statement from Linus who clearly says that if you're stupid enough
to want udelay() to be accurate to within 5% then you're doing
something wrong.

FACT: software-based udelay()s are _always_ short by the very nature
of the way the delay loop is calibrated.

It's not something that's going to get fixed either - read the full
thread from my reference, and you'll see that Linus has said that
udelay() being short is not something anyone should care about.
That's at odds from John, and Linus' opinion rather trumps everyone
elses - it's Linus' kernel, his is the ultimate last say on anything.

>  In fact, we even have
> a test committed to the tree: "kernel/time/test_udelay.c".  As your
> reference says, maybe 1% isn't enough to throw a hissyfit about, but
> when the change is so simple and adds no complexity then it seems like
> a worthwhile thing to do.

Maybe, but my point is that it's just encouraging this fiction that
udelay() never delays shorter than the requested delay.

Also, given that this is architecture code, it's my decision to make,
not yours.  So while you can supply any attributation you like, I'm
saying no at the moment - unless someone can show that it causes
_significant_ errors.

In other words, if it causes significantly short or long delays that
the reasonable inflation of delay value that drivers are expected to
make becomes a problem, then yes, it should be fixed.  If it's merely
"lets stop udelay() returning slightly early" then no, I'm not
interested because it's encouraging the fiction.

And... if a data sheet says "needs a delay of 2us" and you put in the
driver udelay(2) then you're doing it wrong, and you need to read
Linus' mails on this subject, such as the one I've provided a link
to... that udelay() must be _at least_ udelay(3), if not 4.

The best thing when using udelay() is to remember the most important
point - udelay() is very _approximate_.

Adding Linus in case he wishes to add to the discussion.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 14:18     ` Mason
  2016-11-18 14:42       ` Peter Zijlstra
@ 2016-11-18 17:51       ` Mason
  2016-11-19  7:17       ` Afzal Mohammed
  2 siblings, 0 replies; 17+ messages in thread
From: Mason @ 2016-11-18 17:51 UTC (permalink / raw)
  To: linux-arm-kernel

On 18/11/2016 15:18, Mason wrote:
> On 18/11/2016 13:54, Russell King - ARM Linux wrote:
>> On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
>>> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>>>> When polling a tick counter in a busy loop, one might sample the
>>>> counter just *before* it is updated, and then again just *after*
>>>> it is updated. In that case, while mathematically v2-v1 equals 1,
>>>> only epsilon has really passed.
>>>>
>>>> So, if a caller requests an N-cycle delay, we spin until v2-v1
>>>> is strictly greater than N to avoid these random corner cases.
>>>>
>>>> Signed-off-by: Mason <slash.tmp@free.fr>
>>>> ---
>>>>  arch/arm/lib/delay.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>>>> index 69aad80a3af4..3f1cd15ca102 100644
>>>> --- a/arch/arm/lib/delay.c
>>>> +++ b/arch/arm/lib/delay.c
>>>> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>>>>  {
>>>>  	cycles_t start = get_cycles();
>>>>  
>>>> -	while ((get_cycles() - start) < cycles)
>>>> +	while ((get_cycles() - start) <= cycles)
>>>>  		cpu_relax();
>>>>  }
>>>
>>> I thought a bit about this last night. It is well known that the delay
>>> routines are not guaranteed to be accurate, and I don't *think* that's
>>> limited to over-spinning, so arguably this isn't a bug. However, taking
>>> the approach that "drivers should figure it out" is also unhelpful,
>>> because the frequency of the underlying counter isn't generally known
>>> and so drivers can't simply adjust the delay parameter.
>>
>> I don't think this change makes any sense whatsoever.  udelay() is
>> inaccurate, period.  It _will_ give delays shorter than requested
>> for many reasons, many of which can't be fixed.
> 
> If I understand correctly, udelay is, in fact, a wrapper around
> several possible implementations. (Run-time decision on arch/arm)
> 
> 1) The "historical" software loop-based method, which is calibrated
> during init.
> 
> 2) A hardware solution, based on an actual timer, ticking away
> at a constant frequency.
> 
> 3) others?
> 
> Solution 1 (which probably dates back to Linux 0.1) suffers from
> the inaccuracies you point out, because of interrupt latency
> during calibration, DVFS, etc.
> 
> On the other hand, it is simple enough to implement solution 2
> in a way that guarantees that spin_for_n_cycles(n) spins
> /at least/ for n cycles.
> 
> On platforms using timer-based delays, there can indeed exist
> a function guaranteeing a minimal delay.
> 
> 
>> Having a super-accurate version just encourages people to write broken
>> drivers which assume (eg) that udelay(10) will give _at least_ a 10us
>> delay.  However, there is no such guarantee.
> 
> You say that calling udelay(10) expecting at least 10 ?s is broken.
> This fails the principle of least astonishment in a pretty big way.
> (If it returns after 9.9 ?s, I suppose it could be OK. But for
> shorter delays, the relative error grows.)
> 
> A lot of drivers implement a spec that says
> "do this, wait 1 ?s, do that".
> 
> Driver writers would typically write
> do_this(); udelay(1); do_that();
> 
> Do you suggest they should write udelay(2); ?
> But then, it's not so easy to follow the spec because everything
> has been translated to a different number.
> Hide everything behind a macro?
> 
> Note that people have been using ndelay too.
> (I count 182 in drivers, 288 overall.)
> 
> drivers/cpufreq/s3c2440-cpufreq.c:      ndelay(20);
> 
> I don't think the author expects this to return immediately.
> 
> 
>> So, having udelay() for timers return slightly short is actually a good
>> thing - it causes people not to make the mistake to be soo accurate
>> with their delay specifications.
> 
> I disagree that having an implementation which introduces
> hard-to-track-down random heisenbugs is a good thing.
> 
> 
>> So, NAK on this change.  udelay is not super-accurate.
> 
> usleep_range() fixed this issue recently.
> 6c5e9059692567740a4ee51530dffe51a4b9584d
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d
> 
> Do you suggest driver writers should use usleep_range()
> instead of udelay?
> (When does usleep_range start making sense? Around 50/100 ?s
> and above?)
> 
>> Reference (and this is the most important one):
>>
>>   http://lists.openwall.net/linux-kernel/2011/01/12/372

I forgot to explain how I came to work in this area of the code.

When my NAND controller's driver probes, it scans the chips
for bad blocks, which generates 65000 calls to ndelay(100);
(for 1 GB of NAND). For larger sizes of NAND, the number can
climb to 500k calls.

Currently, on arch/arm, ndelay is rounded *up* to the
nearest microsecond. So this might generate a total
delay of up to 500 ms, when 50 ms would be sufficient.

So I locally defined ndelay as
#define NDELAY_MULT	((UL(2199)    * HZ) >> 11)
#define ndelay(n)	__const_udelay((n) * NDELAY_MULT)

I use a 27 MHz tick counter (37 ns period).
ndelay() was resolving to __timer_delay(2)
which might delay only a single tick, so 37 ns.

So the NAND framework occasionally (falsely) flags a block
as bad (random).

There are two sources of error in the timer-based code:
1) rounding down in __timer_const_udelay => loses 1 cycle
2) spinning one cycle too few => loses 1 cycle

With these two errors corrected, I am able to init the
NAND driver faster, with no blocks incorrectly marked
as bad.

Regards.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-18 14:18     ` Mason
  2016-11-18 14:42       ` Peter Zijlstra
  2016-11-18 17:51       ` Mason
@ 2016-11-19  7:17       ` Afzal Mohammed
  2016-11-19 11:03         ` Russell King - ARM Linux
  2 siblings, 1 reply; 17+ messages in thread
From: Afzal Mohammed @ 2016-11-19  7:17 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Mason,

On Fri, Nov 18, 2016 at 03:18:58PM +0100, Mason wrote:
> On 18/11/2016 13:54, Russell King - ARM Linux wrote:

> > So, NAK on this change.  udelay is not super-accurate.
> 
> usleep_range() fixed this issue recently.
> 6c5e9059692567740a4ee51530dffe51a4b9584d
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d

But the above "timers: Fix usleep_range() in the context of
wake_up_process()" is to avoid wakeup causing premature return than
about being precise, no ?

With conflicting opinion on delay/sleep fn's from the players, the one
in gallery would get confused.

But Linus has mentioned udelay as not meant to be precise, okay ?

Regards
afzal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-19  7:17       ` Afzal Mohammed
@ 2016-11-19 11:03         ` Russell King - ARM Linux
  2016-11-19 18:29           ` Mason
                             ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Russell King - ARM Linux @ 2016-11-19 11:03 UTC (permalink / raw)
  To: linux-arm-kernel

Linus, please see the comment and patch at the bottom of this mail.
Thanks.

On Sat, Nov 19, 2016 at 12:47:02PM +0530, Afzal Mohammed wrote:
> Hi Mason,
> 
> On Fri, Nov 18, 2016 at 03:18:58PM +0100, Mason wrote:
> > On 18/11/2016 13:54, Russell King - ARM Linux wrote:
> 
> > > So, NAK on this change.  udelay is not super-accurate.
> > 
> > usleep_range() fixed this issue recently.
> > 6c5e9059692567740a4ee51530dffe51a4b9584d
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d
> 
> But the above "timers: Fix usleep_range() in the context of
> wake_up_process()" is to avoid wakeup causing premature return than
> about being precise, no ?

usleep*() is different from udelay().  usleep*() is not based on looping
a certain number of times, and doesn't involve calibration of such a
loop.  usleep*() is based on the scheduler, which has tighter
requirements laid down in POSIX amongst other standards, such as "not
timing out before this specified time" (due to things like select(),
poll(), etc.)  udelay() is purely a kernel thing, unspecified by any
standard.

> With conflicting opinion on delay/sleep fn's from the players, the one
> in gallery would get confused.
> 
> But Linus has mentioned udelay as not meant to be precise, okay ?

Exactly - and the reason for that (as I've explained several times in
the past) the "standard" software delay loop calibrated against the
timer interrupt is _always_ going to be short.

I explain why this is in the message to which Linus replied:

  http://lists.openwall.net/linux-kernel/2011/01/09/56

A consequence of the formula that I give in (2) in that mail is that
the higher the HZ value, the more error in the resulting value of
loops_per_jiffy, and the shorter udelay(1) than 1us will be, since
"timer_interrupt_usec" is a constant but "usec_per_jiffy" reduces.

So folk need to put the idea that "udelay(1) will always be at least
1us" out of their minds - that's simply not guaranteed by the kernel.
Linus' reply also indicates that we don't care if it's out by 5%,
and probably more than that too.

If someone can show that our timer-based udelay() produces an error
more than 5%, then I'll apply the patch.  What I don't want to do is
to apply the patch because someone thinks that udelay() should not
return early.  Applying it in that case has the effect of re-inforcing
what is an incorrect assumption, leading to people writing buggy drivers
that have delays which are too finely "tuned" - which may work with a
timer-based udelay() but cause failures with a loop-based udelay().

This is all about ensuring that driver authors do the right thing.

Linus, how about we add something like this to linux/delay.h to document
this fact?

 include/linux/delay.h | 12 +++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/delay.h b/include/linux/delay.h
index a6ecb34cf547..2ecb3c46b20a 100644
--- a/include/linux/delay.h
+++ b/include/linux/delay.h
@@ -5,6 +5,18 @@
  * Copyright (C) 1993 Linus Torvalds
  *
  * Delay routines, using a pre-computed "loops_per_jiffy" value.
+ *
+ * Please note that ndelay(), udelay() and mdelay() may return early for
+ * several reasons:
+ *  1. computed loops_per_jiffy too low (due to the time taken to
+ *     execute the timer interrupt.)
+ *  2. cache behaviour affecting the time it takes to execute the
+ *     loop function.
+ *  3. CPU clock rate changes.
+ * As a result, delays should always be over-stated.
+ *
+ * Please see this thread:
+ *   http://lists.openwall.net/linux-kernel/2011/01/09/56
  */
 
 #include <linux/kernel.h>


-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-19 11:03         ` Russell King - ARM Linux
@ 2016-11-19 18:29           ` Mason
  2016-11-20 19:18             ` Doug Anderson
  2016-11-20  6:15           ` Afzal Mohammed
  2016-11-20 19:15           ` Doug Anderson
  2 siblings, 1 reply; 17+ messages in thread
From: Mason @ 2016-11-19 18:29 UTC (permalink / raw)
  To: linux-arm-kernel

On 19/11/2016 12:03, Russell King - ARM Linux wrote:

> Linus, please see the comment and patch at the bottom of this mail.
> Thanks.
> 
> On Sat, Nov 19, 2016 at 12:47:02PM +0530, Afzal Mohammed wrote:
>> Hi Mason,
>>
>> On Fri, Nov 18, 2016 at 03:18:58PM +0100, Mason wrote:
>>> On 18/11/2016 13:54, Russell King - ARM Linux wrote:
>>
>>>> So, NAK on this change.  udelay is not super-accurate.
>>>
>>> usleep_range() fixed this issue recently.
>>> 6c5e9059692567740a4ee51530dffe51a4b9584d
>>> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d
>>
>> But the above "timers: Fix usleep_range() in the context of
>> wake_up_process()" is to avoid wakeup causing premature return than
>> about being precise, no ?
> 
> usleep*() is different from udelay().  usleep*() is not based on looping
> a certain number of times, and doesn't involve calibration of such a
> loop.

You keep saying that udelay is loop-based. That's merely the fall-back,
when nothing better is available. As you know, there are several better
arch-specific options available (peterz described some on x86).

On my platform, udelay polls a platform tick counter. (In fact, you
were the one to recommend that solution to me years ago). This tick
counter is tied to a high-precision crystal, the error of which is
measured in parts per million. The memory bus to this device offers
some guarantees for the access latency.

IIUC, arm and arm64 even have an architected counter that is
guaranteed to tick at constant frequency, and is accessible
without leaving the CPU core.


> usleep*() is based on the scheduler, which has tighter
> requirements laid down in POSIX amongst other standards, such as "not
> timing out before this specified time" (due to things like select(),
> poll(), etc.)  udelay() is purely a kernel thing, unspecified by any
> standard.

The "problem" with udelay is that the same API exists on every single
kernel ever written, and users have implicit expectations about the
implementation. (Principle of least astonishment)


>> With conflicting opinion on delay/sleep fn's from the players, the one
>> in gallery would get confused.
>>
>> But Linus has mentioned udelay as not meant to be precise, okay ?
> 
> Exactly - and the reason for that (as I've explained several times in
> the past) the "standard" software delay loop calibrated against the
> timer interrupt is _always_ going to be short.

OK, so loop-based delays are known to be short. Would you or Linus
accept a patch that adds a X% cushion *in the implementation* ?

You are saying "people shouldn't expect udelay(10) to delay at least
10 ?s, thus they should write udelay(10+N)".

Why not hide that implementation detail inside the implementation,
so as not to force the pessimization on every other implementation
behind the udelay/ndelay wrapper?

void loop_based_udelay(long us) {
  spin_for_some_us(us + us/8);
}


> I explain why this is in the message to which Linus replied:
> 
>   http://lists.openwall.net/linux-kernel/2011/01/09/56
> 
> A consequence of the formula that I give in (2) in that mail is that
> the higher the HZ value, the more error in the resulting value of
> loops_per_jiffy, and the shorter udelay(1) than 1us will be, since
> "timer_interrupt_usec" is a constant but "usec_per_jiffy" reduces.
> 
> So folk need to put the idea that "udelay(1) will always be at least
> 1us" out of their minds - that's simply not guaranteed by the kernel.
> Linus' reply also indicates that we don't care if it's out by 5%,
> and probably more than that too.
> 
> If someone can show that our timer-based udelay() produces an error
> more than 5%, then I'll apply the patch.

I gave an example where ndelay had a 60% error (37 instead of 100 ns).

If one is using a 1 MHz clock for timer-based delays, udelay(1)
will randomly return immediately (100% error).


> What I don't want to do is
> to apply the patch because someone thinks that udelay() should not
> return early.  Applying it in that case has the effect of re-inforcing
> what is an incorrect assumption, leading to people writing buggy drivers
> that have delays which are too finely "tuned" - which may work with a
> timer-based udelay() but cause failures with a loop-based udelay().

Again, I think it would be much smarter to hide this quirk within
the loop-based delay implementation.

Why is it unacceptable to "fix" the API, instead of fixing the
expectations of driver writers?


> This is all about ensuring that driver authors do the right thing.

What is the rationale behind the devm managed resources?
Driver writers were getting things wrong, and the kernel
provided a framework to help them with the hard part.

Likewise, instead of setting them up to fail with a quirky
delay routine, why not help them by writing an implementation
to match their expectation?


> Linus, how about we add something like this to linux/delay.h to document
> this fact?
> 
>  include/linux/delay.h | 12 +++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/delay.h b/include/linux/delay.h
> index a6ecb34cf547..2ecb3c46b20a 100644
> --- a/include/linux/delay.h
> +++ b/include/linux/delay.h
> @@ -5,6 +5,18 @@
>   * Copyright (C) 1993 Linus Torvalds
>   *
>   * Delay routines, using a pre-computed "loops_per_jiffy" value.
> + *
> + * Please note that ndelay(), udelay() and mdelay() may return early for
> + * several reasons:
> + *  1. computed loops_per_jiffy too low (due to the time taken to
> + *     execute the timer interrupt.)
> + *  2. cache behaviour affecting the time it takes to execute the
> + *     loop function.
> + *  3. CPU clock rate changes.
> + * As a result, delays should always be over-stated.
> + *
> + * Please see this thread:
> + *   http://lists.openwall.net/linux-kernel/2011/01/09/56

(None of the reasons you stated affect a tick-counter-based delay.)

If one wants a 100 ns delay, what should one write?
If one wants a 1 ?s delay, what should one write?
If one wants a 100 ?s delay, what should one write?
Is the relative error constant?
Is there a constant component in the error, independent of requested delay?

It seems to me you're saying "on platform A, udelay has 50%
error, so driver writers should write udelay(N+N/2);"

The code is now unnecessarily pessimized on hundreds of platform
that don't have that behavior.

Why not fix the implementation of that platform's udelay to
add the padding itself? That way, other platforms are not
hampered by that platform's ineptitude.

Regards.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-19 11:03         ` Russell King - ARM Linux
  2016-11-19 18:29           ` Mason
@ 2016-11-20  6:15           ` Afzal Mohammed
  2016-11-20 19:15           ` Doug Anderson
  2 siblings, 0 replies; 17+ messages in thread
From: Afzal Mohammed @ 2016-11-20  6:15 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On Sat, Nov 19, 2016 at 11:03:01AM +0000, Russell King - ARM Linux wrote:

> > > 6c5e9059692567740a4ee51530dffe51a4b9584d
> > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d
> > 
> > But the above "timers: Fix usleep_range() in the context of
> > wake_up_process()" is to avoid wakeup causing premature return than
> > about being precise, no ?

> usleep*() is based on the scheduler, which has tighter requirements
> laid down in POSIX amongst other standards, such as "not timing out
> before this specified time"

Thanks for educating on the POSIX linkage.

When the above mentioned patch was flying, tglx initially mentioned
that there is no gurantee that usleep_range() will never return in less
time than the specified minimum, but later he committed the change to
tip-bot keeping the message saying otherwise. That was confusing, and
me being a wayfarer, didn't have the courage to ask.

Regards
afzal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-19 11:03         ` Russell King - ARM Linux
  2016-11-19 18:29           ` Mason
  2016-11-20  6:15           ` Afzal Mohammed
@ 2016-11-20 19:15           ` Doug Anderson
  2 siblings, 0 replies; 17+ messages in thread
From: Doug Anderson @ 2016-11-20 19:15 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Sat, Nov 19, 2016 at 3:03 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> Linus, how about we add something like this to linux/delay.h to document
> this fact?
>
>  include/linux/delay.h | 12 +++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/include/linux/delay.h b/include/linux/delay.h
> index a6ecb34cf547..2ecb3c46b20a 100644
> --- a/include/linux/delay.h
> +++ b/include/linux/delay.h
> @@ -5,6 +5,18 @@
>   * Copyright (C) 1993 Linus Torvalds
>   *
>   * Delay routines, using a pre-computed "loops_per_jiffy" value.
> + *
> + * Please note that ndelay(), udelay() and mdelay() may return early for
> + * several reasons:
> + *  1. computed loops_per_jiffy too low (due to the time taken to
> + *     execute the timer interrupt.)
> + *  2. cache behaviour affecting the time it takes to execute the
> + *     loop function.
> + *  3. CPU clock rate changes.
> + * As a result, delays should always be over-stated.
> + *
> + * Please see this thread:
> + *   http://lists.openwall.net/linux-kernel/2011/01/09/56
>   */

Any reason not to land Mason's patch _and_ land your comment change.
They you eliminate the arbitrary/pointless inaccuracy but still
reinforce that delays are not accurate.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-19 18:29           ` Mason
@ 2016-11-20 19:18             ` Doug Anderson
  2016-11-20 19:44               ` Russell King - ARM Linux
  0 siblings, 1 reply; 17+ messages in thread
From: Doug Anderson @ 2016-11-20 19:18 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Sat, Nov 19, 2016 at 10:29 AM, Mason <slash.tmp@free.fr> wrote:
>> Exactly - and the reason for that (as I've explained several times in
>> the past) the "standard" software delay loop calibrated against the
>> timer interrupt is _always_ going to be short.
>
> OK, so loop-based delays are known to be short. Would you or Linus
> accept a patch that adds a X% cushion *in the implementation* ?
>
> You are saying "people shouldn't expect udelay(10) to delay at least
> 10 ?s, thus they should write udelay(10+N)".
>
> Why not hide that implementation detail inside the implementation,
> so as not to force the pessimization on every other implementation
> behind the udelay/ndelay wrapper?
>
> void loop_based_udelay(long us) {
>   spin_for_some_us(us + us/8);
> }

IMHO this would be an extremely sane thing to do.

Then, if you happened to be on a platform where udelay _was_ accurate
then you could eliminate the pointless overhead.  ...and also if you
were writing code to a datasheet and it said "delay for 50 us between
A and B" you could actually write udelay(50) instead of everyone
needing to know that they should write "udelay(53)".

-Doug

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-20 19:18             ` Doug Anderson
@ 2016-11-20 19:44               ` Russell King - ARM Linux
  2016-11-20 20:00                 ` Mason
  0 siblings, 1 reply; 17+ messages in thread
From: Russell King - ARM Linux @ 2016-11-20 19:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Nov 20, 2016 at 11:18:48AM -0800, Doug Anderson wrote:
> Hi,
> 
> On Sat, Nov 19, 2016 at 10:29 AM, Mason <slash.tmp@free.fr> wrote:
> >> Exactly - and the reason for that (as I've explained several times in
> >> the past) the "standard" software delay loop calibrated against the
> >> timer interrupt is _always_ going to be short.
> >
> > OK, so loop-based delays are known to be short. Would you or Linus
> > accept a patch that adds a X% cushion *in the implementation* ?
> >
> > You are saying "people shouldn't expect udelay(10) to delay at least
> > 10 ?s, thus they should write udelay(10+N)".
> >
> > Why not hide that implementation detail inside the implementation,
> > so as not to force the pessimization on every other implementation
> > behind the udelay/ndelay wrapper?

Try sending Linus a patch for it.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] arm: spin one more cycle in timer-based delays
  2016-11-20 19:44               ` Russell King - ARM Linux
@ 2016-11-20 20:00                 ` Mason
  0 siblings, 0 replies; 17+ messages in thread
From: Mason @ 2016-11-20 20:00 UTC (permalink / raw)
  To: linux-arm-kernel

On 20/11/2016 20:44, Russell King - ARM Linux wrote:

> On Sun, Nov 20, 2016 at 11:18:48AM -0800, Doug Anderson wrote:
>
>> On Sat, Nov 19, 2016 at 10:29 AM, Mason <slash.tmp@free.fr> wrote:
>>
>>> OK, so loop-based delays are known to be short. Would you or Linus
>>> accept a patch that adds a X% cushion *in the implementation* ?
>>>
>>> You are saying "people shouldn't expect udelay(10) to delay at least
>>> 10 ?s, thus they should write udelay(10+N)".
>>>
>>> Why not hide that implementation detail inside the implementation,
>>> so as not to force the pessimization on every other implementation
>>> behind the udelay/ndelay wrapper?
> 
> Try sending Linus a patch for it.

OK, I will.

BTW, any reason why you replied to me via Doug's reply?
Are my messages not reaching your server?

Regards.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-11-20 20:00 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-15 13:36 [PATCH] arm: spin one more cycle in timer-based delays Mason
2016-11-18 12:06 ` Will Deacon
2016-11-18 12:24   ` Mason
2016-11-18 12:54   ` Russell King - ARM Linux
2016-11-18 14:18     ` Mason
2016-11-18 14:42       ` Peter Zijlstra
2016-11-18 17:51       ` Mason
2016-11-19  7:17       ` Afzal Mohammed
2016-11-19 11:03         ` Russell King - ARM Linux
2016-11-19 18:29           ` Mason
2016-11-20 19:18             ` Doug Anderson
2016-11-20 19:44               ` Russell King - ARM Linux
2016-11-20 20:00                 ` Mason
2016-11-20  6:15           ` Afzal Mohammed
2016-11-20 19:15           ` Doug Anderson
2016-11-18 17:13     ` Doug Anderson
2016-11-18 17:32       ` Russell King - ARM Linux

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.