[PATCH] arm: spin one more cycle in timer-based delays

From: slash.tmp@free.fr (Mason)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH] arm: spin one more cycle in timer-based delays
Date: Fri, 18 Nov 2016 18:51:19 +0100	[thread overview]
Message-ID: <582F3F97.2010701@free.fr> (raw)
In-Reply-To: <582F0DD2.3030805@free.fr>

On 18/11/2016 15:18, Mason wrote:
> On 18/11/2016 13:54, Russell King - ARM Linux wrote:
>> On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
>>> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>>>> When polling a tick counter in a busy loop, one might sample the
>>>> counter just *before* it is updated, and then again just *after*
>>>> it is updated. In that case, while mathematically v2-v1 equals 1,
>>>> only epsilon has really passed.
>>>>
>>>> So, if a caller requests an N-cycle delay, we spin until v2-v1
>>>> is strictly greater than N to avoid these random corner cases.
>>>>
>>>> Signed-off-by: Mason <slash.tmp@free.fr>
>>>> ---
>>>>  arch/arm/lib/delay.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>>>> index 69aad80a3af4..3f1cd15ca102 100644
>>>> --- a/arch/arm/lib/delay.c
>>>> +++ b/arch/arm/lib/delay.c
>>>> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>>>>  {
>>>>  	cycles_t start = get_cycles();
>>>>  
>>>> -	while ((get_cycles() - start) < cycles)
>>>> +	while ((get_cycles() - start) <= cycles)
>>>>  		cpu_relax();
>>>>  }
>>>
>>> I thought a bit about this last night. It is well known that the delay
>>> routines are not guaranteed to be accurate, and I don't *think* that's
>>> limited to over-spinning, so arguably this isn't a bug. However, taking
>>> the approach that "drivers should figure it out" is also unhelpful,
>>> because the frequency of the underlying counter isn't generally known
>>> and so drivers can't simply adjust the delay parameter.
>>
>> I don't think this change makes any sense whatsoever.  udelay() is
>> inaccurate, period.  It _will_ give delays shorter than requested
>> for many reasons, many of which can't be fixed.
> 
> If I understand correctly, udelay is, in fact, a wrapper around
> several possible implementations. (Run-time decision on arch/arm)
> 
> 1) The "historical" software loop-based method, which is calibrated
> during init.
> 
> 2) A hardware solution, based on an actual timer, ticking away
> at a constant frequency.
> 
> 3) others?
> 
> Solution 1 (which probably dates back to Linux 0.1) suffers from
> the inaccuracies you point out, because of interrupt latency
> during calibration, DVFS, etc.
> 
> On the other hand, it is simple enough to implement solution 2
> in a way that guarantees that spin_for_n_cycles(n) spins
> /at least/ for n cycles.
> 
> On platforms using timer-based delays, there can indeed exist
> a function guaranteeing a minimal delay.
> 
> 
>> Having a super-accurate version just encourages people to write broken
>> drivers which assume (eg) that udelay(10) will give _at least_ a 10us
>> delay.  However, there is no such guarantee.
> 
> You say that calling udelay(10) expecting at least 10 ?s is broken.
> This fails the principle of least astonishment in a pretty big way.
> (If it returns after 9.9 ?s, I suppose it could be OK. But for
> shorter delays, the relative error grows.)
> 
> A lot of drivers implement a spec that says
> "do this, wait 1 ?s, do that".
> 
> Driver writers would typically write
> do_this(); udelay(1); do_that();
> 
> Do you suggest they should write udelay(2); ?
> But then, it's not so easy to follow the spec because everything
> has been translated to a different number.
> Hide everything behind a macro?
> 
> Note that people have been using ndelay too.
> (I count 182 in drivers, 288 overall.)
> 
> drivers/cpufreq/s3c2440-cpufreq.c:      ndelay(20);
> 
> I don't think the author expects this to return immediately.
> 
> 
>> So, having udelay() for timers return slightly short is actually a good
>> thing - it causes people not to make the mistake to be soo accurate
>> with their delay specifications.
> 
> I disagree that having an implementation which introduces
> hard-to-track-down random heisenbugs is a good thing.
> 
> 
>> So, NAK on this change.  udelay is not super-accurate.
> 
> usleep_range() fixed this issue recently.
> 6c5e9059692567740a4ee51530dffe51a4b9584d
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d
> 
> Do you suggest driver writers should use usleep_range()
> instead of udelay?
> (When does usleep_range start making sense? Around 50/100 ?s
> and above?)
> 
>> Reference (and this is the most important one):
>>
>>   http://lists.openwall.net/linux-kernel/2011/01/12/372

I forgot to explain how I came to work in this area of the code.

When my NAND controller's driver probes, it scans the chips
for bad blocks, which generates 65000 calls to ndelay(100);
(for 1 GB of NAND). For larger sizes of NAND, the number can
climb to 500k calls.

Currently, on arch/arm, ndelay is rounded *up* to the
nearest microsecond. So this might generate a total
delay of up to 500 ms, when 50 ms would be sufficient.

So I locally defined ndelay as
#define NDELAY_MULT	((UL(2199)    * HZ) >> 11)
#define ndelay(n)	__const_udelay((n) * NDELAY_MULT)

I use a 27 MHz tick counter (37 ns period).
ndelay() was resolving to __timer_delay(2)
which might delay only a single tick, so 37 ns.

So the NAND framework occasionally (falsely) flags a block
as bad (random).

There are two sources of error in the timer-based code:
1) rounding down in __timer_const_udelay => loses 1 cycle
2) spinning one cycle too few => loses 1 cycle

With these two errors corrected, I am able to init the
NAND driver faster, with no blocks incorrectly marked
as bad.

Regards.