linux-kernel.vger.kernel.org archive mirror
* [RFC] Improving udelay/ndelay on platforms where that is possible
@ 2017-10-31 16:15 Marc Gonzalez
  2017-10-31 16:44 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Marc Gonzalez @ 2017-10-31 16:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre,
	Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann,
	Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd,
	Mason

Hello BDFL,

I am writing to you directly because Russell has bluntly stated (paraphrased)
"Send your patches to Linus; it's his kernel, and he has the final say." which
is his diplomatic way of telling me to fsck off.

Basically, I want to improve the accuracy of clock-based delays, in order to
reduce the boot-time delay in the NAND framework. I will send a patch, but
I wanted to have a discussion about the design first.

The key points under discussion below are

a) guarantee of no under-delays
b) availability of ndelay for arm32


Below is the long-winded rationale.

1) Delays and sleeps (clocks, timers, alarms, etc)

Every operating system kernel (that I know of) offers some way for a kernel thread to
wait for a specified amount of time (expressed in seconds, milliseconds, microseconds,
or even nanoseconds).

For relatively small amounts of time, such a primitive is typically implemented as
a busy-wait spin-loop. Let us call it delay().

Such a primitive cannot guarantee that the calling thread will be delayed *exactly*
the amount of time specified, because there are many sources of inaccuracy:

* the timer's period may not divide the requested amount
* sampling the timer value may have variable latency
* the thread might be preempted for whatever reason
* the timer itself may have varying frequencies
* etc

Therefore, users are accustomed to having delays be longer (within a reasonable margin).
However, very few users would expect delays to be *shorter* than requested.

As you have stated yourself, the vast majority of code in Linux is driver code.

Typical driver writers (of which I am one) are not experts on Linux internals, and may
sometimes need some guidance, either from another programmer or a maintainer (whose time
is a rare resource), or from clear APIs (either well-documented, or just "natural" and
"obvious" interfaces).

A typical driver writer has some HW spec in front of them, which e.g. states:

* poke register A
* wait 1 microsecond for the dust to settle
* poke register B

which most programmers would translate to:

	poke(reg_A, val_A);
	udelay(1);
	poke(reg_B, val_B);


Given a similar example, Russell has stated:

> And if a data sheet says "needs a delay of 2us" and you put in the
> driver udelay(2) then you're doing it wrong, and you need to read
> Linus' mails on this subject, such as the one I've provided a link
> to... that udelay() must be _at least_ udelay(3), if not 4.

I see two key points in this reply.

i) There seems to be an implicit agreement that it is BAD for the *actual*
delay to be less than what the data sheet specifies, leading to...

ii) It is the responsibility of the *driver writer* to figure out how much
of a cushion they need, in order to guarantee a minimal delay.


Obviously, I agree with the first point.

The second point is troubling. It means driver writers are required to be
aware of the quirkiness of Linux internals.

And because drivers are supposed to work on all platforms, are we saying
that driver writers should be aware of the quirkiness for ALL platforms?

For example, assume that for short delays, such as 1 µs:

* amd64 has  5% relative error, 10 ns absolute error
* arm32 has 10% relative error,  1 µs absolute error
* alpha has  3% relative error,  3 µs absolute error

Would the driver writer then need to write udelay(4); ?

In my opinion, it makes more sense to hide the complexity and quirkiness
of udelay inside each platform's implementation. Which brings me to...



2) Different implementations of udelay and ndelay

On arm32, it is possible to set up udelay() to be clock-based.
In that case, udelay simply polls a constant-frequency tick-counter.
For example, on my puny little platform, there is a 27 MHz crystal oscillator
(37 ns period) which is wired to a tick counter mapped on the system bus.
The latency to sample the register is within [10-200] ns.

This implies two things:

i) it is possible to guarantee a minimum delay (in quanta of 37 ns)

ii) the sampling error is limited to ~250 ns


[NB: some platforms are even "better", and move the tick counter inside the
CPU block (e.g. ARM architected timer) to achieve a lower sampling error.]


There is one minor trap when handling tick counter sampling:
Assume we are waiting for one period to elapse. Consider the timeline below,
where x marks the cycle when the tick counter register is incremented.


-------x----------x----------x----------x----------x----->  time
                 ^ ^           ^
                 A B           C

If the execution flow leads to sampling at times A and B, then B-A equals 1,
yet only a tiny amount of time has elapsed. To guarantee that at least one
period has elapsed, we must wait until the arithmetic difference is 2
(i.e. one more than required). In the example, until time C.

In other words, if we need to delay for N cycles, the correct code should be:

	t0 = readl(tick_address);
	/* wait for N+1 observed increments => at least N full periods */
	while (readl(tick_address) - t0 <= N)
		/* spin */ ;



There is another source of "error" (in the sense that udelay might spin too little)
caused by the conversion from µs to cycles -- which rounds down.

Consider arch/arm/include/asm/delay.h

loops = (loops_per_jiffy * delay_us * UDELAY_MULT) >> 31
where UDELAY_MULT = 2147 * HZ + 483648 * HZ / 1000000


Consider a platform with a 27 MHz clock, and HZ=300
The proper conversion is trivially

	loops = delay_us * 27

Thus, for a 1 microsecond delay, loops equals 27.

But when using UDELAY_MULT = 644245 (rounded down from 644245.0944),
loops equals 26.

This "off-by-one" error is systematic over the entire range of allowed
delay_us input (1 to 2000), so it is easy to fix, by adding 1 to the result.



3) Why does all this even matter?

At boot, the NAND framework scans the NAND chips for bad blocks;
this operation generates approximately 10^5 calls to ndelay(100);
which cause a 100 ms delay, because ndelay is implemented as a
call to the nearest udelay (rounded up).

My current NAND chips are tiny (2 x 512 MB), but with larger chips,
the number of calls to ndelay would climb to 10^6 and the delay
increase to 1 second, which is starting to be a problem.

One solution is to implement ndelay, but ndelay is more prone to
under-delays, and thus a prerequisite is fixing under-delays.


Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez
@ 2017-10-31 16:44 ` Linus Torvalds
  2017-10-31 16:56   ` Russell King - ARM Linux
  2017-10-31 16:46 ` Russell King - ARM Linux
  2017-11-01 17:53 ` Alan Cox
  2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2017-10-31 16:44 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre,
	Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann,
	Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd,
	Mason

On Tue, Oct 31, 2017 at 9:15 AM, Marc Gonzalez
<marc_gonzalez@sigmadesigns.com> wrote:
>
> On arm32, it is possible to set up udelay() to be clock-based.

I'm not sure why this is discussed as some kind of generic problem.

Just do what x86-64 already does. We use the TSC, and ndelay() does
the right thing too.

I've not checked precision, but it should be close to the TSC update
frequency. So we're talking tens of megahertz.

Does it work on all PC's? No. It's only half-way reliable on the ones
where the TSC frequency doesn't change. So there are no *guarantees*.

               Linus


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez
  2017-10-31 16:44 ` Linus Torvalds
@ 2017-10-31 16:46 ` Russell King - ARM Linux
  2017-11-01 17:53 ` Alan Cox
  2 siblings, 0 replies; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-10-31 16:46 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd,
	Mason

On Tue, Oct 31, 2017 at 05:15:34PM +0100, Marc Gonzalez wrote:
> I am writing to you directly because Russell has bluntly stated (paraphrased)
> "Send your patches to Linus; it's his kernel, and he has the final say." which
> is his diplomatic way of telling me to fsck off.

... after I've already pointed you at archived emails from Linus
himself giving you all the information you need, and me stating the
reasons for it being the way it is.

Why can you not accept what has already been said?  Why do you need
to continually re-open old discussions only to be told the same thing
as has already been said?  Why this "if I don't get the answer I want,
I'm going to keep on trying until I do" attitude?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-10-31 16:44 ` Linus Torvalds
@ 2017-10-31 16:56   ` Russell King - ARM Linux
  2017-10-31 17:45     ` Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-10-31 16:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Gonzalez, Mark Rutland, Mason, Jonathan Austin,
	Arnd Bergmann, Nicolas Pitre, Peter Zijlstra, Stephen Boyd,
	Michael Turquette, Kevin Hilman, Will Deacon, LKML,
	Steven Rostedt, Douglas Anderson, John Stultz, Thomas Gleixner,
	Ingo Molnar, Linux ARM

On Tue, Oct 31, 2017 at 09:44:20AM -0700, Linus Torvalds wrote:
> On Tue, Oct 31, 2017 at 9:15 AM, Marc Gonzalez
> <marc_gonzalez@sigmadesigns.com> wrote:
> >
> > On arm32, it is possible to set up udelay() to be clock-based.
> 
> I'm not sure why this is discussed as some kind of generic problem.

Hi Linus,

Marc is stating something that's incorrect there.  On ARM32, we don't
have a TSC, and we aren't guaranteed to have a timer usable for delays.
Where there is a suitable timer, it can be used for delays.

However, where there isn't a timer, we fall back to using the software
loop, and that's where the problem lies.  For example, some platforms
have a relatively slow timer (32kHz).

This centres around the discussion we had previously:

  http://lists.openwall.net/linux-kernel/2011/01/12/372

where errors of 5% are very much in the "don't care" category for
udelay() - so if udelay() returns early by 5%, we don't care.

Marc, however, does want to care about udelay() etc returning early,
because he wants his NAND driver to be performant, despite using
udelay()/ndelay() in critical paths.  So his argument is that udelay()
and ndelay() should be accurate, and certainly not return early.

Marc's motivation here is to try and force me to "fix" the ARM and
generic loops_per_jiffy code so that these functions provide accurate
delays.  I've said no, based on what we discussed back in 2011.

There's the issue too of the timer-based delay code possibly returning
one timer tick early, which at reasonable timer resolutions is pretty
small.  I don't want to do that, as it encourages people to use data-
sheet values in udelay() and ndelay() without any cushioning, and so
the drivers end up breaking if we fall back to using the software loop
delays - which we know for a fact will return early.



* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-10-31 16:56   ` Russell King - ARM Linux
@ 2017-10-31 17:45     ` Linus Torvalds
  2017-10-31 17:58       ` Linus Torvalds
  2017-11-01  0:23       ` Doug Anderson
  0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2017-10-31 17:45 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Marc Gonzalez, Mark Rutland, Mason, Jonathan Austin,
	Arnd Bergmann, Nicolas Pitre, Peter Zijlstra, Stephen Boyd,
	Michael Turquette, Kevin Hilman, Will Deacon, LKML,
	Steven Rostedt, Douglas Anderson, John Stultz, Thomas Gleixner,
	Ingo Molnar, Linux ARM

On Tue, Oct 31, 2017 at 9:56 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
>
> Marc is stating something that's incorrect there.  On ARM32, we don't
> have a TSC, and we aren't guaranteed to have a timer usable for delays.
> Where there is a suitable timer, it can be used for delays.
>
> However, where there isn't a timer, we fall back to using the software
> loop, and that's where the problem lies.  For example, some platforms
> have a relatively slow timer (32kHz).

Right.

So that is actually the basic issue: there is no way for us to really
_ever_ give any kind of guarantees about the behavior of
udelay/ndelay() in every general case.

We can't even guarantee some kind of "at least" behavior, because on
some platforms there is no reasonable stable clock at all.

We can give good results in certain _particular_ cases, but not in
some kind of blanket "we will always do well" way.

Traditionally, we used to obviously do the bogo-loop, but it depends
on processor frequency, which can (and does) change even outside SW
control, never mind things like interrupts etc.

On lots of platforms, we can generally do platform-specific clocks. On
modern x86, as mentioned, the TSC is stable and fairly high frequency
(it isn't really the gigahertz frequency that it reports - reading it
takes time, and even ignoring that, the implementation is actually not
a true adder at the reported frequency, but it is generally tens to
hundreds of megahertz, so you should get something that is close to
the "tens of nanoseconds" resolution).

But on others we can't even get *close* to that kind of behavior, and
if the clock is something like a 32kHz timer that you mention, you
obviously aren't going to get even microsecond resolution, much less
nanoseconds.

You can (and on x86 we do) calibrate a faster non-architected clock
against a slow clock, but all the faster clocks tend to have that
frequency shifting issue.

So then you tend to be forced to simply rely on platform-specific
hacks if you really need something more precise. Most people don't,
which is why most people just use udelay() and friends.

In particular, several drivers end up depending not on an explicit
clock at all, but on the IO fabric itself. For a driver for a
particular piece of hardware, that is often the sanest way to do
really short timing: if you know you are on a PCI bus and you know
your own hardware, you can often do things like "reading the status
register takes 6 bus cycles, which is 200 nsec". Things like that are
very hacky, but for a driver that is looking at times in the usec
range, it's often the best you can do.

Don't get me wrong. I think

 (a) platform code could try to make their udelay/ndelay() be as good
as it can be on a particular platform

 (b) we could maybe export some interface to give estimated errors so
that drivers could then try to correct for them depending on just how
much they care.

so I'm certainly not _opposed_ to trying to improve on
udelay/ndelay(). It's just that for the generic case, we know we're
never going to be very good, and the error (both absolute and
relative) can be pretty damn big.

One of the issues has historically been that because so few people
care, and because there are probably more platforms than there are
cases that care deeply, even that (a) thing is actually fairly hard to
do. On the x86 side, for example, I doubt that most core kernel
developers even have access to platforms that have unstable TSC's any
more. I certainly don't. I complained to Intel for many many _years_,
but they finally did fix it, and now it's been a long time since I
cared.

That's why I actually would encourage driver writers that really care
deeply about delays to look at ways to get those delays from their own
hardware (ie exactly that "read the status register three times" kind
of model). It sounds hacky, but it couples the timing constraint with
the piece of hardware that actually depends on it, which means that
you don't get the nasty kinds of "worry about each platform"
complications.

I realize that this is not what people want to hear. In a perfect
world, we'd just make "ndelay()" work and give the right behavior, and
have some strictly bounded error.

It's just that it's really fundamentally hard in the general case,
even if it sounds like it should be pretty trivial in most
_particular_ cases.

So I'm very much open to udelay improvements, and if somebody sends
patches for particular platforms to do particularly well on that
platform, I think we should merge them. But ...

                Linus


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-10-31 17:45     ` Linus Torvalds
@ 2017-10-31 17:58       ` Linus Torvalds
  2017-11-01  0:23       ` Doug Anderson
  1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2017-10-31 17:58 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Marc Gonzalez, Mark Rutland, Mason, Jonathan Austin,
	Arnd Bergmann, Nicolas Pitre, Peter Zijlstra, Stephen Boyd,
	Michael Turquette, Kevin Hilman, Will Deacon, LKML,
	Steven Rostedt, Douglas Anderson, John Stultz, Thomas Gleixner,
	Ingo Molnar, Linux ARM

On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'm very much open to udelay improvements, and if somebody sends
> patches for particular platforms to do particularly well on that
> platform, I think we should merge them. But ...

Side note: this is obviously a negative feedback circle, where because
we know we can't do all that great in general, few platforms really
bother to do as well as they perhaps could, since nobody should
generally care that deeply due to the generic problem..

Again, if a driver that has tight timing requirements can just
generate the timings from its own hardware, that tends to
JustWorks(tm).

It was what people used to do traditionally at least on the PC/AT
platform. None of this "udelay()" crud, you knew that doing a single
PIO read took one usec, so you'd just delay that way. Then the
hardware moved from ISA to EISA, and everything sped up, but that was
ok, because now it needed a shorter delay anyway...

Doing things like a status register read is often a good way
(sometimes the _only_ way) to make sure preceding writes have been
flushed all the way to the device anyway, so it can have those kinds
of incidental advantages.

Of course, all of this is relevant only for delays on the scale of
individual device accesses. But that's also when udelay/ndelay() tends
to have the most problems.

               Linus


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-10-31 17:45     ` Linus Torvalds
  2017-10-31 17:58       ` Linus Torvalds
@ 2017-11-01  0:23       ` Doug Anderson
  2017-11-01  9:26         ` Russell King - ARM Linux
  1 sibling, 1 reply; 41+ messages in thread
From: Doug Anderson @ 2017-11-01  0:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Russell King - ARM Linux, Marc Gonzalez, Mark Rutland, Mason,
	Jonathan Austin, Arnd Bergmann, Nicolas Pitre, Peter Zijlstra,
	Stephen Boyd, Michael Turquette, Kevin Hilman, Will Deacon, LKML,
	Steven Rostedt, John Stultz, Thomas Gleixner, Ingo Molnar,
	Linux ARM

Hi,

On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I'm very much open to udelay improvements, and if somebody sends
> patches for particular platforms to do particularly well on that
> platform, I think we should merge them. But ...

If I'm reading this all correctly, this sounds like you'd be willing
to merge <https://patchwork.kernel.org/patch/9429841/>.  This makes
udelay() guaranteed not to underrun on arm32 platforms.

-Doug


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01  0:23       ` Doug Anderson
@ 2017-11-01  9:26         ` Russell King - ARM Linux
  2017-11-01 15:53           ` Doug Anderson
  2017-11-01 19:28           ` Marc Gonzalez
  0 siblings, 2 replies; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-01  9:26 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Linus Torvalds, Mark Rutland, Jonathan Austin, Arnd Bergmann,
	Mason, Peter Zijlstra, Will Deacon, Michael Turquette,
	Nicolas Pitre, Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman,
	John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM,
	Marc Gonzalez

On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > So I'm very much open to udelay improvements, and if somebody sends
> > patches for particular platforms to do particularly well on that
> > platform, I think we should merge them. But ...
> 
> If I'm reading this all correctly, this sounds like you'd be willing
> to merge <https://patchwork.kernel.org/patch/9429841/>.  This makes
> udelay() guaranteed not to underrun on arm32 platforms.

That's a mis-representation again.  It stops a timer-based udelay()
possibly underrunning by one tick if we are close to the start of
a count increment.  However, it does nothing for the loops_per_jiffy
udelay(), which can still underrun.

My argument against merging that patch is that with it merged, we get
(as you say) a udelay() that doesn't underrun _when using a timer_
but when we end up using the loops_per_jiffy udelay(), we're back to
the old problem.

My opinion is that's bad, because it encourages people to write drivers
that rely on udelay() having "good" behaviour, which it is not guaranteed
to have.  So, they'll specify a delay period of exactly what they want,
and their drivers will then fail when running on systems that aren't
using a timer-based udelay().

If we want udelay() to have this behaviour, it needs to _always_ have
this behaviour irrespective of the implementation.  So that means
the loops_per_jiffy version also needs to be fixed in the same way,
which IMHO is impossible.



* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01  9:26         ` Russell King - ARM Linux
@ 2017-11-01 15:53           ` Doug Anderson
  2017-12-07 12:31             ` Pavel Machek
  2017-11-01 19:28           ` Marc Gonzalez
  1 sibling, 1 reply; 41+ messages in thread
From: Doug Anderson @ 2017-11-01 15:53 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Linus Torvalds, Mark Rutland, Jonathan Austin, Arnd Bergmann,
	Mason, Peter Zijlstra, Will Deacon, Michael Turquette,
	Nicolas Pitre, Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman,
	John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM,
	Marc Gonzalez

Hi,

On Wed, Nov 1, 2017 at 2:26 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
>> Hi,
>>
>> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> > So I'm very much open to udelay improvements, and if somebody sends
>> > patches for particular platforms to do particularly well on that
>> > platform, I think we should merge them. But ...
>>
>> If I'm reading this all correctly, this sounds like you'd be willing
>> to merge <https://patchwork.kernel.org/patch/9429841/>.  This makes
>> udelay() guaranteed not to underrun on arm32 platforms.
>
> That's a mis-representation again.  It stops a timer-based udelay()
> possibly underrunning by one tick if we are close to the start of
> a count increment.  However, it does nothing for the loops_per_jiffy
> udelay(), which can still underrun.
>
> My argument against merging that patch is that with it merged, we get
> (as you say) a udelay() that doesn't underrun _when using a timer_
> but when we end up using the loops_per_jiffy udelay(), we're back to
> the old problem.
>
> My opinion is that's bad, because it encourages people to write drivers
> that rely on udelay() having "good" behaviour, which it is not guaranteed
> to have.  So, they'll specify a delay period of exactly what they want,
> and their drivers will then fail when running on systems that aren't
> using a timer-based udelay().

IMHO the current udelay is broken in an off-by-one way and it's easy
to fix.  Intentionally leaving a bug in the code seems silly.  This
seems to be what Linus is saying with his statement that "(a) platform
code could try to make their udelay/ndelay() be as good as it can be
on a particular platform".

So no matter the rest of the discussions, we should land that.  If you
disagree then I'm happy to re-post that patch straight to Linus later
this week since it sounds as if he'd take it.


> If we want udelay() to have this behaviour, it needs to _always_ have
> this behaviour irrespective of the implementation.  So that means
> the loops_per_jiffy version also needs to be fixed in the same way,
> which IMHO is impossible.

As Linus indicates, if there is a way to code things up that doesn't
rely on udelay then that should be preferred.  However, there may be
cases where this is exceedingly difficult.  If you're writing a driver
at a high enough level that will work on a lot of underlying platforms
(AKA it's platform-agnostic) then you can't necessarily rely on timing
an individual hardware read.  Since you're writing high-level
platform-agnostic code, presumably implementing a 1 us delay in a
generic way is equally difficult to making the platform-agnostic
udelay() reliable.


IMHO it would be OK to put in a requirement in a driver saying that it
will only function properly on hardware that has a udelay() that is
guaranteed to never return early.  As Linus says: "I doubt that most
core kernel developers even have access to platforms that have unstable
TSC's any more."  Presumably all those old platforms aren't suddenly going to be
attached to new devices unless those new devices are connected by a
PCI, ISA, USB, etc. bus.  Drivers for components connected by
non-external busses seem like they don't need to take into account the
quirks of really ancient hardware.

Yes, I know there are still some arm32 chips that aren't that old and
that don't have a CP15-based timer.  We should make sure we don't
change existing drivers and frameworks in a way that will break those
boards.  If that means we need to figure out how to add an API, as
Linus says, to indicate how accurate udelay is then that might be a
solution.  Another would be to come up with some clever solution on
affected boards.  Most arm32 boards I'm aware of have other
(non-CP15-based) timers.  ...if they don't and these are real boards
that are actually using a driver relying on udelay() then perhaps they
could add a new board-specific udelay() implementation that delayed by
reading a specific hardware register with known timing.


Said another way: if we're writing a high level NAND driver and we
can't find a better way than udelay() to ensure timing requirements,
then the driver should use udelay() and document the fact that it must
not underrun (ideally it could even test for it at runtime).  If that
NAND driver will never be used on platforms with an unreliable
udelay() then we don't need to worry about it.  If we find a platform
where we need this NAND driver, we should find a way to implement a
udelay() that will, at least, never underestimate.


-Doug


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez
  2017-10-31 16:44 ` Linus Torvalds
  2017-10-31 16:46 ` Russell King - ARM Linux
@ 2017-11-01 17:53 ` Alan Cox
  2017-11-01 19:03   ` Marc Gonzalez
  2 siblings, 1 reply; 41+ messages in thread
From: Alan Cox @ 2017-11-01 17:53 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

On Tue, 31 Oct 2017 17:15:34 +0100
> Therefore, users are accustomed to having delays be longer (within a reasonable margin).
> However, very few users would expect delays to be *shorter* than requested.

If your udelay can be under by 10% then just bump the number by 10%.
However at that level most hardware isn't that predictable anyway because
the fabric between the CPU core and the device isn't some clunky
serialized link. Writes get delayed, they can bunch together, busses do
posting and queueing.

Then there is virtualisation 8)

> A typical driver writer has some HW spec in front of them, which e.g. states:
> 
> * poke register A
> * wait 1 microsecond for the dust to settle
> * poke register B

Rarely because of posting. It's usually

	write
	while(read() != READY);
	write

and even when you've got a legacy device with timeouts its

	write
	read
	delay
	write

and for sub 1ms delays I suspect the read and bus latency actually add a
randomization sufficient that it's not much of an optimization to worry
about an accurate ndelay().

> This "off-by-one" error is systematic over the entire range of allowed
> delay_us input (1 to 2000), so it is easy to fix, by adding 1 to the result.

And that + 1 might be worth adding, but really there isn't a lot of
modern hardware that has a bus that behaves like software folks imagine,
and everything has percentage errors factored into published numbers.

> 3) Why does all this even matter?
> 
> At boot, the NAND framework scans the NAND chips for bad blocks;
> this operation generates approximately 10^5 calls to ndelay(100);
> which cause a 100 ms delay, because ndelay is implemented as a
> call to the nearest udelay (rounded up).

So why aren't you doing that on both NANDs in parallel and asynchronous
to other parts of boot ? If you start scanning at early boot time do you
need the bad block list before mounting / - or are you stuck with a
single threaded CPU and PIO ?

For that matter, given the bad blocks don't randomly change, why not
cache them?
 
> My current NAND chips are tiny (2 x 512 MB) but with larger chips,
> the number of calls to ndelay would climb to 10^6 and the delay
> increase to 1 second, which is starting to be a problem.
> 
> One solution is to implement ndelay, but ndelay is more prone to
> under-delays, and thus a prerequisite is fixing under-delays.

For ndelay you probably have to make it platform specific or just use
udelay if not. We do have a few cases we wanted 400ns delays in the PC
world (ATA) but not many.

Alan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 17:53 ` Alan Cox
@ 2017-11-01 19:03   ` Marc Gonzalez
  2017-11-01 19:09     ` Linus Torvalds
                       ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-01 19:03 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

On 01/11/2017 18:53, Alan Cox wrote:

> On Tue, 31 Oct 2017 17:15:34 +0100
>
>> Therefore, users are accustomed to having delays be longer (within a reasonable margin).
>> However, very few users would expect delays to be *shorter* than requested.
> 
> If your udelay can be under by 10% then just bump the number by 10%.

Except it's not *quite* that simple.
Error has both an absolute and a relative component.
So the actual value matters, and it's not always a constant.

For example:
http://elixir.free-electrons.com/linux/latest/source/drivers/mtd/nand/nand_base.c#L814

> However at that level most hardware isn't that predictable anyway because
> the fabric between the CPU core and the device isn't some clunky
> serialized link. Writes get delayed, they can bunch together, busses do
> posting and queueing.

Are you talking about the actual delay operation, or the pokes around it?

> Then there is virtualisation 8)
> 
>> A typical driver writer has some HW spec in front of them, which e.g. states:
>>
>> * poke register A
>> * wait 1 microsecond for the dust to settle
>> * poke register B
> 
> Rarely because of posting. It's usually
> 
> 	write
> 	while(read() != READY);
> 	write
> 
> and even when you've got a legacy device with timeouts it's
> 
> 	write
> 	read
> 	delay
> 	write
> 
> and for sub 1ms delays I suspect the read and bus latency actually add a
> randomization sufficient that it's not much of an optimization to worry
> about an accurate ndelay().

I don't think "accurate" is the proper term.
Over-delays are fine, under-delays are problematic.

>> This "off-by-one" error is systematic over the entire range of allowed
>> delay_us input (1 to 2000), so it is easy to fix, by adding 1 to the result.
> 
> And that + 1 might be worth adding but really there isn't a lot of
> modern hardware that has a bus that behaves like software folks imagine
> and everything has percentage errors factored into published numbers.

I guess I'm a software folk, but the designer of the system bus sits
across the desk from me, and we do talk often.

>> 3) Why does all this even matter?
>>
>> At boot, the NAND framework scans the NAND chips for bad blocks;
>> this operation generates approximately 10^5 calls to ndelay(100);
>> which cause a 100 ms delay, because ndelay is implemented as a
>> call to the nearest udelay (rounded up).
> 
> So why aren't you doing that on both NANDs in parallel and asynchronous
> to other parts of boot ? If you start scanning at early boot time do you
> need the bad block list before mounting / - or are you stuck with a
> single threaded CPU and PIO ?

There might be some low(ish) hanging fruit to improve the performance
of the NAND framework, such as multi-page reads/writes. But the NAND
controller on my SoC muxes access to the two NAND chips, so no parallel
access, and this requires PIO.

> For that matter given the bad blocks don't randomly change why not cache
> them ?

That's a good question, I'll ask the NAND framework maintainer.
Store them where, by the way? On the NAND chip itself?

>> My current NAND chips are tiny (2 x 512 MB) but with larger chips,
>> the number of calls to ndelay would climb to 10^6 and the delay
>> increase to 1 second, which is starting to be a problem.
>>
>> One solution is to implement ndelay, but ndelay is more prone to
>> under-delays, and thus a prerequisite is fixing under-delays.
> 
> For ndelay you probably have to make it platform specific or just use
> udelay if not. We do have a few cases we wanted 400ns delays in the PC
> world (ATA) but not many.

By default, ndelay is implemented in terms of udelay.

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:03   ` Marc Gonzalez
@ 2017-11-01 19:09     ` Linus Torvalds
  2017-11-01 19:17       ` Linus Torvalds
  2017-11-01 19:38       ` Marc Gonzalez
  2017-11-01 19:36     ` Alan Cox
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2017-11-01 19:09 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

On Wed, Nov 1, 2017 at 12:03 PM, Marc Gonzalez
<marc_gonzalez@sigmadesigns.com> wrote:
>
> By default, ndelay is implemented in terms of udelay.

That's very much *NOT* the case.

Yes, there is a *fallback* for when somebody doesn't do ndelay() at
all, but that doesn't make it the default.

It's just a "the architecture didn't implement ndelay at all, we'll
work around it".

So stop this idiocy already. About half of what I've seen in this
thread has been pure garbage.

                       Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:09     ` Linus Torvalds
@ 2017-11-01 19:17       ` Linus Torvalds
  2017-11-01 19:38       ` Marc Gonzalez
  1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2017-11-01 19:17 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

On Wed, Nov 1, 2017 at 12:09 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yes, there is a *fallback* for when somebody doesn't do ndelay() at
> all, but that doesn't make it the default.
>
> It's just a "the architecture didn't implement ndelay at all, we'll
> work around it".

Side  note: I will continue to claim that anybody who thinks they need
ndelay() needs to rethink their position anyway, due to all the issues
I did already bring up.

So I'd say that pretty much 100% of all ndelay() users are just wrong.
If you are doing a driver that needs that kind of delay granularity,
you had better already know exactly how long it takes to read the
status register, and just do that instead.

And if you don't know how long it takes to read a status register, you
have no business thinking that you need ndelay(). You probably ended
up reading one piece of hardware doc ("the sample-and-hold register
takes 200ns to settle") without then reading the other low-level
hardware documentation that talks about the timings of the chip
accesses from the host side.

                 Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01  9:26         ` Russell King - ARM Linux
  2017-11-01 15:53           ` Doug Anderson
@ 2017-11-01 19:28           ` Marc Gonzalez
  2017-11-01 20:30             ` Russell King - ARM Linux
  1 sibling, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-01 19:28 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Doug Anderson, Linus Torvalds, Mark Rutland, Jonathan Austin,
	Arnd Bergmann, Peter Zijlstra, Will Deacon, Michael Turquette,
	Nicolas Pitre, Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman,
	John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM, Mason

On 01/11/2017 10:26, Russell King - ARM Linux wrote:

> On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
>
>> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds wrote:
>>
>>> So I'm very much open to udelay improvements, and if somebody sends
>>> patches for particular platforms to do particularly well on that
>>> platform, I think we should merge them. But ...
>>
>> If I'm reading this all correctly, this sounds like you'd be willing
>> to merge <https://patchwork.kernel.org/patch/9429841/>.  This makes
>> udelay() guaranteed not to underrun on arm32 platforms.
> 
> That's a mis-representation again.  It stops a timer-based udelay()
> possibly underrunning by one tick if we are close to the start of
> a count increment.  However, it does nothing for the loops_per_jiffy
> udelay(), which can still underrun.

It is correct that improving the clock-based implementation does strictly
nothing for the loop-based implementation.

Is it possible to derive an upper bound on the amount of under-run when
using the loop-based delay on arm32?

> My argument against merging that patch is that with it merged, we get
> (as you say) a udelay() that doesn't underrun _when using a timer_
> but when we end up using the loops_per_jiffy udelay(), we're back to
> the old problem.
> 
> My opinion is that's bad, because it encourages people to write drivers
> that rely on udelay() having "good" behaviour, which it is not guaranteed
> to have.  So, they'll specify a delay period of exactly what they want,
> and their drivers will then fail when running on systems that aren't
> using a timer-based udelay().
> 
> If we want udelay() to have this behaviour, it needs to _always_ have
> this behaviour irrespective of the implementation.  So that means
> the loops_per_jiffy version also needs to be fixed in the same way,
> which IMHO is impossible.

Let's say some piece of HW absolutely, positively, unequivocally,
uncompromisingly, requires a strict minimum of 10 microseconds
elapsing between operations A and B.

You say a driver writer must not write udelay(10);
They have to take into account the possibility of under-delay.
How much additional delay should they add?
10%? 20%? 50%? A percentage + a fixed quantity?

If there is an actual rule, then it could be incorporated in the
loop-based implementation?

If it is impossible to say (as Linus hinted for some platforms)
then this means there is no way to guarantee a minimal delay?

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:03   ` Marc Gonzalez
  2017-11-01 19:09     ` Linus Torvalds
@ 2017-11-01 19:36     ` Alan Cox
  2017-11-01 19:39     ` Thomas Gleixner
  2017-11-01 19:48     ` Baruch Siach
  3 siblings, 0 replies; 41+ messages in thread
From: Alan Cox @ 2017-11-01 19:36 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

> > serialized link. Writes get delayed, they can bunch together, busses do
> > posting and queueing.  
> 
> Are you talking about the actual delay operation, or the pokes around it?

All of it. A write from the CPU, except on the lowest-end ones, isn't
necessarily going to occur in strict sequential order as expressed by
the program. The write even with cache bypassing goes to whatever bus
interface unit is involved and then the signal goes on (eventually) to
the device. The chances are the state machine for the external bus isn't
the same as the internal one and already does things like posting so the
CPU isn't stalled for the write to complete on the slow external bus.

All your bus clocks have errors, even more so if spread spectrum is in
use to reduce the noise. For some bus encodings, transferring the data
takes a subtly different amount of time according to the value.

The reality for most machines is that writing to a device rather more
resembles one of those games where your ball descends at an unpredictable
rate bouncing off pillars to score points, than is implied by the
diagrams.

> I don't think "accurate" is the proper term.
> Over-delays are fine, under-delays are problematic.

Not often. The people who characterized the silicon will have added a
safety margin to. They are always quoting +/- something if you dig in the
small print or ask.

> > For that matter given the bad blocks don't randomly change why not cache
> > them ?  
> 
> That's a good question, I'll ask the NAND framework maintainer.
> Store them where, by the way? On the NAND chip itself?

I guess the ideal case would be to store it with magic numbers on one of
a certain number of blocks (in case the default one to hold it is itself
a bad block) ?
> 
> >> My current NAND chips are tiny (2 x 512 MB) but with larger chips,
> >> the number of calls to ndelay would climb to 10^6 and the delay
> >> increase to 1 second, which is starting to be a problem.

And this is another reason I think that worrying about ndelay is not the
answer. If you've got multiple threads you can bring it up
asynchronously, if you've got multiple buses (ok you don't) then you can
halve or quarter the time needed. If you've got them cached you can
nearly eliminate it.

Compared with those, scraping a few percent by fine-tuning ndelay doesn't
look like such a good return?

Alan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:09     ` Linus Torvalds
  2017-11-01 19:17       ` Linus Torvalds
@ 2017-11-01 19:38       ` Marc Gonzalez
  2017-11-15 12:51         ` Marc Gonzalez
  1 sibling, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-01 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

On 01/11/2017 20:09, Linus Torvalds wrote:

> On Wed, Nov 1, 2017 at 12:03 PM, Marc Gonzalez wrote:
>
>> By default, ndelay is implemented in terms of udelay.
> 
> That's very much *NOT* the case.
> 
> Yes, there is a *fallback* for when somebody doesn't do ndelay() at
> all, but that doesn't make it the default.
> 
> It's just a "the architecture didn't implement ndelay at all, we'll
> work around it".

Yes, sorry, I wrote "default" when I meant "fallback".
(arm32 currently does not define a specific ndelay implementation.)

> So stop this idiocy already. About half of what I've seen in this
> thread has been pure garbage.

OK, I'll just send my patch, and then crawl back under my rock.

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:03   ` Marc Gonzalez
  2017-11-01 19:09     ` Linus Torvalds
  2017-11-01 19:36     ` Alan Cox
@ 2017-11-01 19:39     ` Thomas Gleixner
  2017-11-01 19:48     ` Baruch Siach
  3 siblings, 0 replies; 41+ messages in thread
From: Thomas Gleixner @ 2017-11-01 19:39 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Alan Cox, Linus Torvalds, LKML, Linux ARM, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

On Wed, 1 Nov 2017, Marc Gonzalez wrote:

> On 01/11/2017 18:53, Alan Cox wrote:
> 
> > On Tue, 31 Oct 2017 17:15:34 +0100
> >
> >> Therefore, users are accustomed to having delays be longer (within a reasonable margin).
> >> However, very few users would expect delays to be *shorter* than requested.
> > 
> > If your udelay can be under by 10% then just bump the number by 10%.
> 
> Except it's not *quite* that simple.
> Error has both an absolute and a relative component.
> So the actual value matters, and it's not always a constant.
> 
> For example:
> http://elixir.free-electrons.com/linux/latest/source/drivers/mtd/nand/nand_base.c#L814

That example is really pointless, because the only effect it has is to
delay the read of the ready/busy pin long enough to bridge the silly gap
between
	     |-----------------------	
CS	_____|

and
	-------|
R/B	       |_____________________

IIRC, the gap is in the low single digit nsec range, so it really does not
matter how long you delay here. In fact on most older CPUs, executing the 3
instructions, even if the delay returned immediately, was sufficient.

Any real NAND controller, not the stupid PIO based ones we had 15 years
ago, should not even make it necessary to handle that on the software
side. Simply because the controller should already take care of that and
not blindly reflect the status of the R/B pin in the status register. If
your hardware does that, then poke that HW dude on the desk next to yours
with a big cluestick.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:03   ` Marc Gonzalez
                       ` (2 preceding siblings ...)
  2017-11-01 19:39     ` Thomas Gleixner
@ 2017-11-01 19:48     ` Baruch Siach
  2017-11-02 16:12       ` Boris Brezillon
  3 siblings, 1 reply; 41+ messages in thread
From: Baruch Siach @ 2017-11-01 19:48 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Alan Cox, Mark Rutland, Mason, Russell King, Jonathan Austin,
	Arnd Bergmann, Nicolas Pitre, Peter Zijlstra, Stephen Boyd,
	Michael Turquette, Kevin Hilman, Will Deacon, LKML,
	Steven Rostedt, Douglas Anderson, John Stultz, Thomas Gleixner,
	Linus Torvalds, Ingo Molnar, Linux ARM, linux-mtd

Hi Marc,

On Wed, Nov 01, 2017 at 08:03:20PM +0100, Marc Gonzalez wrote:
> On 01/11/2017 18:53, Alan Cox wrote: 
> > For that matter given the bad blocks don't randomly change why not cache
> > them ?
> 
> That's a good question, I'll ask the NAND framework maintainer.
> Store them where, by the way? On the NAND chip itself?

Yes. In the bad block table (bbt). See drivers/mtd/nand/nand_bbt.c.

baruch

-- 
     http://baruch.siach.name/blog/                  ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:28           ` Marc Gonzalez
@ 2017-11-01 20:30             ` Russell King - ARM Linux
  0 siblings, 0 replies; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-01 20:30 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Doug Anderson, Linus Torvalds, Mark Rutland, Jonathan Austin,
	Arnd Bergmann, Peter Zijlstra, Will Deacon, Michael Turquette,
	Nicolas Pitre, Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman,
	John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM, Mason

On Wed, Nov 01, 2017 at 08:28:18PM +0100, Marc Gonzalez wrote:
> On 01/11/2017 10:26, Russell King - ARM Linux wrote:
> 
> > On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
> >
> >> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds wrote:
> >>
> >>> So I'm very much open to udelay improvements, and if somebody sends
> >>> patches for particular platforms to do particularly well on that
> >>> platform, I think we should merge them. But ...
> >>
> >> If I'm reading this all correctly, this sounds like you'd be willing
> >> to merge <https://patchwork.kernel.org/patch/9429841/>.  This makes
> >> udelay() guaranteed not to underrun on arm32 platforms.
> > 
> > That's a mis-representation again.  It stops a timer-based udelay()
> > possibly underrunning by one tick if we are close to the start of
> > a count increment.  However, it does nothing for the loops_per_jiffy
> > udelay(), which can still underrun.
> 
> It is correct that improving the clock-based implementation does strictly
> nothing for the loop-based implementation.
> 
> Is it possible to derive an upper bound on the amount of under-run when
> using the loop-based delay on arm32?

Not really.  If you read the archived thread via the URL that I gave you
when this was initially brought up, specifically the first email in
the thread, I give a full analysis of the loop-based delay and why it
gives short delays.

What the analysis says, specifically point (2) in my initial email,
is that the inaccuracy is dependent on two things:

1. the CPU speed
2. the time that the CPU has to spend executing the timer interrupt
   handler.

For any particular kernel, we could assume that there are a fixed
number of cycles to execute the timer interrupt handler.  Let's call
this t_timer.  The timer fires every t_period.

We measure how many loops we can do between two t_period interrupts,
and that includes t_timer each time.  So, the time that we're able
to measure via the delay loop is t_period - t_timer (which is the
equation I give in point (2) in that email).

If the CPU is clocked slowly, then t_timer gets larger, but t_period
remains the same.  So, the error gets bigger.  Conversely, the faster
the CPU, the smaller the error.

Further in that initial email, I give actual measured delay values
for one of my systems (I forget which) for delays from 1us up to 5ms
in 1, 2, 5 decade multiples.  Small delays are longer than desired
because of the overhead of computing the number of loops.  Long
delays are shorter because of the inaccuracies.

Note also that (1) in the original email indicates that the
loops_per_jiffy value is biased towards a smaller value rather than
a larger value - and that also adds to the "it's shorter than
requested" problem.

At the end of the email, I proposed an improvement to the ARM
implementation, which reduces the amount of underrun by correcting
the calculations to round up.  However, this adds four additional
instructions to the computation, which has the effect of making
small delays ever so slightly longer than they are already.
That said, in the last six years, ARM CPUs have become a lot
faster, so the effect of those four instructions is now reduced.

Now, if you read Linus' reply to my initial email, you'll see that
Linus stated very clearly that the error caused by that is in the
"don't care too much" (quoting Linus) category, which is what I've
been trying to tell you all this time.

The slight underrun that we get with the software loop is not
something we care about, we know that it happens and there's very
little motivation to fix it.

What does that mean?  Well, if we have a super-accurate udelay()
implementation based on timers, that's really great, it means we
can have accurate delays that won't expire before the time they're
supposed to.  I agree that's a good thing.  There's also a major
got-cha, which is if we have to fall back to using the software
based udelay(), then we immediately start loosing.  As can be
seen from the figures I've given, if you ask for a 1ms delay, it's
about 1% short, so 990us.

Do those 10us matter?  That depends on the driver, buses, and what
the delay is being used for - but the point is, with the software
loop, we _can't_ guarantee that if we ask for a 1ms delay, we'll
get at least a 1ms delay.

Now, what does this 1% equate to for a timer based delay?  If the
timer ticks at 10MHz, giving a 100ns resolution, and we miss
by one tick, that's a 100ns error.  If you now look at the
figures measured from the software udelay(), that's peanuts
compared to the error with the software loop.

If your timer ticks at 1MHz, then maybe it's a bigger problem as
the error would be 1us, but then do you really want to use a
1MHz counter to implement udelay() which suffers from not knowing
where in its count cycle you start - if you request 1us, and you
wait for two ticks, you could be waiting close to 2us.  Probably
not the best choice if you're trying to bitbang a serial bus since
it'll make everything twice as slow as using the software delay
loop!

It's really all about balances and tradeoffs.

> > If we want udelay() to have this behaviour, it needs to _always_ have
> > this behaviour irrespective of the implementation.  So that means
> > the loops_per_jiffy version also needs to be fixed in the same way,
> > which IMHO is impossible.
> 
> Let's say some piece of HW absolutely, positively, unequivocally,
> uncompromisingly, requires a strict minimum of 10 microseconds
> elapsing between operations A and B.
> 
> You say a driver writer must not write udelay(10);

Correct, because udelay() may return early.

> They have to take into account the possibility of under-delay.
> How much additional delay should they add?
> 10%? 20%? 50%? A percentage + a fixed quantity?
> 
> If there is an actual rule, then it could be incorporated in the
> loop-based implementation?

Well, there are two ways that udelay() gets used:

1. As a busy-wait for short intervals while waiting for hardware to
   produce an event, such as:

                /* wait */
                timeout = 1000;
                do {
                        tc_read(DP0_LTSTAT, &value);
                        udelay(1);
                } while ((!(value & LT_LOOPDONE)) && (--timeout));
                if (timeout == 0) {
                        dev_err(tc->dev, "Link training timeout!\n");

   Here, the "timeout" will be way over-estimated, so that a short
   udelay() has little effect, and to give the hardware more than
   enough time to respond.  The same is done in other drivers, eg:

        hdmi_phy_wait_i2c_done(hdmi, 1000);

static bool hdmi_phy_wait_i2c_done(struct dw_hdmi *hdmi, int msec)
{
        u32 val;

        while ((val = hdmi_readb(hdmi, HDMI_IH_I2CMPHY_STAT0) & 0x3) == 0) {
                if (msec-- == 0)
                        return false;
                udelay(1000);
        }
        hdmi_writeb(hdmi, val, HDMI_IH_I2CMPHY_STAT0);

   It probably doesn't take anywhere near 1 _second_ for the PHY to
   complete the write operation, but the point is to allow progress
   to be made if it did rather than the kernel just coming to a dead
   stop.

2. Drivers that use udelay() to produce bus timings, eg, i2c-algo-bit.c.
   The value of adap->udelay is the period of the clock.  There are
   a fairly small set of standard I2C bus maximum frequencies (100kHz,
   400kHz, we'll ignore the higher ones because they're not relevant
   here):

drivers/i2c/busses/i2c-hydra.c:     .udelay            = 5,
drivers/i2c/busses/i2c-versatile.c: .udelay    = 30,
drivers/i2c/busses/i2c-simtec.c:    pd->bit.udelay = 20;
drivers/i2c/busses/i2c-parport.c:   .udelay            = 10, /* ~50 kbps */
drivers/i2c/busses/i2c-parport.c:           adapter->algo_data.udelay = 50; /* ~10 kbps */
drivers/i2c/busses/i2c-acorn.c:     .udelay            = 80,
drivers/i2c/busses/i2c-via.c:       .udelay            = 5,
drivers/i2c/busses/i2c-parport-light.c:     .udelay            = 50,

   So we have 200kHz, 33kHz, 50kHz, 100kHz, 20kHz, 12kHz, etc.
   However, note that some of those (eg, parport) are writing to an
ISA bus, and ISA buses are comparatively slow, and will reduce
   the cycle time below that of the specified delay.

   These figures are probably arrived at by repeated test and
   measurement of the bus behaviour over a range of conditions, rather
   than any particular fixed "we need to inflate by a particular %age
   and add a fixed value".

   I've been there - I've had a parallel port bit-banging a serial
   protocol using a driver with udelay(), and if you want it to
   perform with the minimum of overhead, it's very much a case of
   "connect the 'scope, measure the behaviour, adjust the software
   to cater for the overheads while leaving a margin."

All in all, there is not any nice answer to "I want udelay() to be
accurate" or even "I want udelay() to return only after the minimum
specified time".  There are ways we can do it, but the point is that
the kernel _as a whole_ can not make either of those guarantees to
driver authors, and driver authors must not rely on these functions
to produce accurate delays.

I'm sorry that I can't give you exact figures to your question, but
I'm trying to give you the full understanding about why this is the
case, and why you should not make the assumptions you want to about
these functions.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:48     ` Baruch Siach
@ 2017-11-02 16:12       ` Boris Brezillon
  0 siblings, 0 replies; 41+ messages in thread
From: Boris Brezillon @ 2017-11-02 16:12 UTC (permalink / raw)
  To: Baruch Siach
  Cc: Marc Gonzalez, Mark Rutland, Alan Cox, Russell King,
	Jonathan Austin, Arnd Bergmann, Mason, Peter Zijlstra,
	Will Deacon, Michael Turquette, Nicolas Pitre, Stephen Boyd,
	LKML, Steven Rostedt, Douglas Anderson, Kevin Hilman,
	John Stultz, linux-mtd, Thomas Gleixner, Linus Torvalds,
	Ingo Molnar, Linux ARM

On Wed, 1 Nov 2017 21:48:22 +0200
Baruch Siach <baruch@tkos.co.il> wrote:

> Hi Marc,
> 
> On Wed, Nov 01, 2017 at 08:03:20PM +0100, Marc Gonzalez wrote:
> > On 01/11/2017 18:53, Alan Cox wrote:   
> > > For that matter given the bad blocks don't randomly change why not cache
> > > them ?  
> > 
> > That's a good question, I'll ask the NAND framework maintainer.
> > Store them where, by the way? On the NAND chip itself?  
> 
> Yes. In the bad block table (bbt). See drivers/mtd/nand/nand_bbt.c.

Yes, you can cache this information in a bad block table stored on the
flash. But still, the ndelay()/udelay() problem remains: reading the
out-of-band area of each eraseblock to re-create the bad block table is
just one use case. This ndelay() is something you can have on all kind
of read/write operations. As Thomas Gleixner stated, this is not really
needed on modern NAND controllers which take care of various
timing constraints internally, but still, we have some controllers
that are not that smart, and we have to support them.

I'm not concerned about performances here, and if I'm told that we
should turn all ndelay() calls into usleep_range() ones, then I'm
perfectly fine with that, but I need a guarantee that when I say "I
want to wait at least X ns/us", the function does not return before
that time has expired.

Not sure if that would work, but maybe we could create a wrapper like

void nand_ndelay(unsigned long nanoseconds)
{
	ktime_t end = ktime_add_ns(ktime_get(), nanoseconds);

	do {
		ndelay(nanoseconds);
	} while (ktime_before(ktime_get(), end));
}

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:38       ` Marc Gonzalez
@ 2017-11-15 12:51         ` Marc Gonzalez
  2017-11-15 13:13           ` Russell King - ARM Linux
  2017-11-15 18:45           ` Doug Anderson
  0 siblings, 2 replies; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-15 12:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On 01/11/2017 20:38, Marc Gonzalez wrote:

> OK, I'll just send my patch, and then crawl back under my rock.

Linus,

As promised, the patch is provided below. And as promised, I will
no longer bring this up on LKML.

FWIW, I have checked that the computed value matches the expected
value for all HZ and delay_us, and for a few clock frequencies,
using the following program:

$ cat delays.c
#include <stdio.h>
#define MEGA 1000000u
typedef unsigned int uint;
typedef unsigned long long u64;
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))

static const uint HZ_tab[] = { 100, 250, 300, 1000 };

static void check_cycle_count(uint freq, uint HZ, uint delay_us)
{
	uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
	uint lpj = DIV_ROUND_UP(freq, HZ);
	uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
	uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);

	if (computed != expected)
		printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
}

int main(void)
{
	uint idx, delay_us, freq;

	for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
		for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
			for (delay_us = 1; delay_us <= 2000; ++delay_us)
				check_cycle_count(freq, HZ_tab[idx], delay_us);

	return 0;
}



-- >8 --
Subject: [PATCH] ARM: Tweak clock-based udelay implementation

In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
Russell pointed out that loop-based delays may return early.

On the arm platform, delays may be either loop-based or clock-based.

This patch tweaks the clock-based implementation so that udelay(N)
is guaranteed to spin at least N microseconds.

Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
---
 arch/arm/lib/delay.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 2cef11884857..0a25712077ec 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -58,15 +58,15 @@ static void __timer_delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
 
-	while ((get_cycles() - start) < cycles)
+	while ((get_cycles() - start) <= cycles)
 		cpu_relax();
 }
 
 static void __timer_const_udelay(unsigned long xloops)
 {
-	unsigned long long loops = xloops;
-	loops *= arm_delay_ops.ticks_per_jiffy;
-	__timer_delay(loops >> UDELAY_SHIFT);
+	u64 tmp = (u64)xloops * arm_delay_ops.ticks_per_jiffy;
+	unsigned long cycles = tmp >> UDELAY_SHIFT;
+	__timer_delay(cycles + 1); /* Round up in 1 instruction */
 }
 
 static void __timer_udelay(unsigned long usecs)
@@ -92,7 +92,7 @@ void __init register_current_timer_delay(const struct delay_timer *timer)
 	if (!delay_calibrated && (!delay_res || (res < delay_res))) {
 		pr_info("Switching to timer-based delay loop, resolution %lluns\n", res);
 		delay_timer			= timer;
-		lpj_fine			= timer->freq / HZ;
+		lpj_fine			= DIV_ROUND_UP(timer->freq, HZ);
 		delay_res			= res;
 
 		/* cpufreq may scale loops_per_jiffy, so keep a private copy */
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-15 12:51         ` Marc Gonzalez
@ 2017-11-15 13:13           ` Russell King - ARM Linux
  2017-11-16 15:26             ` Marc Gonzalez
  2017-12-07 12:43             ` Pavel Machek
  2017-11-15 18:45           ` Doug Anderson
  1 sibling, 2 replies; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-15 13:13 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz,
	Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On Wed, Nov 15, 2017 at 01:51:54PM +0100, Marc Gonzalez wrote:
> On 01/11/2017 20:38, Marc Gonzalez wrote:
> 
> > OK, I'll just send my patch, and then crawl back under my rock.
> 
> Linus,
> 
> As promised, the patch is provided below. And as promised, I will
> no longer bring this up on LKML.
> 
> FWIW, I have checked that the computed value matches the expected
> value for all HZ and delay_us, and for a few clock frequencies,
> using the following program:
> 
> $ cat delays.c
> #include <stdio.h>
> #define MEGA 1000000u
> typedef unsigned int uint;
> typedef unsigned long long u64;
> #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> 
> static const uint HZ_tab[] = { 100, 250, 300, 1000 };
> 
> static void check_cycle_count(uint freq, uint HZ, uint delay_us)
> {
> 	uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
> 	uint lpj = DIV_ROUND_UP(freq, HZ);
> 	uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
> 	uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);
> 
> 	if (computed != expected)
> 		printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
> }
> 
> int main(void)
> {
> 	uint idx, delay_us, freq;
> 
> 	for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
> 		for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
> 			for (delay_us = 1; delay_us <= 2000; ++delay_us)
> 				check_cycle_count(freq, HZ_tab[idx], delay_us);
> 
> 	return 0;
> }
> 
> 
> 
> -- >8 --
> Subject: [PATCH] ARM: Tweak clock-based udelay implementation
> 
> In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
> Russell pointed out that loop-based delays may return early.
> 
> On the arm platform, delays may be either loop-based or clock-based.
> 
> This patch tweaks the clock-based implementation so that udelay(N)
> is guaranteed to spin at least N microseconds.

As I've already said, I don't want this, because it encourages people
to use too-small delays in driver code, and if we merge it then you
will look at your data sheet, decide it says "you need to wait 10us"
and write in your driver "udelay(10)" which will break on the loops
based delay.

udelay() needs to offer a consistent interface so that drivers know
what to expect no matter what the implementation is.  Making one
implementation conform to your ideas while leaving the other
implementations with other expectations is a recipe for bugs.

If you really want to do this, fix the loops_per_jiffy implementation
as well so that the consistency is maintained.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-15 12:51         ` Marc Gonzalez
  2017-11-15 13:13           ` Russell King - ARM Linux
@ 2017-11-15 18:45           ` Doug Anderson
  1 sibling, 0 replies; 41+ messages in thread
From: Doug Anderson @ 2017-11-15 18:45 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

Hi,

On Wed, Nov 15, 2017 at 4:51 AM, Marc Gonzalez
<marc_gonzalez@sigmadesigns.com> wrote:
> On 01/11/2017 20:38, Marc Gonzalez wrote:
>
>> OK, I'll just send my patch, and then crawl back under my rock.
>
> Linus,
>
> As promised, the patch is provided below. And as promised, I will
> no longer bring this up on LKML.
>
> FWIW, I have checked that the computed value matches the expected
> value for all HZ and delay_us, and for a few clock frequencies,
> using the following program:
>
> $ cat delays.c
> #include <stdio.h>
> #define MEGA 1000000u
> typedef unsigned int uint;
> typedef unsigned long long u64;
> #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
>
> static const uint HZ_tab[] = { 100, 250, 300, 1000 };
>
> static void check_cycle_count(uint freq, uint HZ, uint delay_us)
> {
>         uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
>         uint lpj = DIV_ROUND_UP(freq, HZ);
>         uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
>         uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);
>
>         if (computed != expected)
>                 printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
> }
>
> int main(void)
> {
>         uint idx, delay_us, freq;
>
>         for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
>                 for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
>                         for (delay_us = 1; delay_us <= 2000; ++delay_us)
>                                 check_cycle_count(freq, HZ_tab[idx], delay_us);
>
>         return 0;
> }
>
>
>
> -- >8 --
> Subject: [PATCH] ARM: Tweak clock-based udelay implementation
>
> In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
> Russell pointed out that loop-based delays may return early.
>
> On the arm platform, delays may be either loop-based or clock-based.
>
> This patch tweaks the clock-based implementation so that udelay(N)
> is guaranteed to spin at least N microseconds.
>
> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
> ---
>  arch/arm/lib/delay.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

As I have indicated in the past, I'm not a believer in the "don't fix
bug A because bug B is still there" argument.  From the statements
"platform code could try to make their udelay/ndelay() be as good as
it can be on a particular platform" and "I'm very much open to udelay
improvements, and if somebody sends patches for particular platforms
to do particularly well on that platform" it's my understanding that
this is consistent with Linus's opinion.  Since Marc's bugfix seems
good and valid:

Reviewed-by: Douglas Anderson <dianders@chromium.org>

Marc's bugfix would immediately be useful if you happened to know your
driver was only running on a system that was using a timer-based
udelay on ARM.

Marc's bugfix could also form the basis of future patches that
extended the udelay() API to somehow express the error, as Linus
suggested by saying "we could maybe export some interface to give
estimated errors so that drivers could then try to correct for them
depending on just how much they care".


-Doug

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-15 13:13           ` Russell King - ARM Linux
@ 2017-11-16 15:26             ` Marc Gonzalez
  2017-11-16 15:36               ` Russell King - ARM Linux
  2017-12-07 12:43             ` Pavel Machek
  1 sibling, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-16 15:26 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz,
	Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On 15/11/2017 14:13, Russell King - ARM Linux wrote:

> udelay() needs to offer a consistent interface so that drivers know
> what to expect no matter what the implementation is.  Making one
> implementation conform to your ideas while leaving the other
> implementations with other expectations is a recipe for bugs.
> 
> If you really want to do this, fix the loops_per_jiffy implementation
> as well so that the consistency is maintained.

Hello Russell,

It seems to me that, when using DFS, there's a serious issue with loop-based
delays. (IIRC, it was you who pointed this out a few years ago.)

If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
when the frequency changes.

But arch/arm/lib/delay-loop.S starts by loading the current value of
loops_per_jiffy, computes the number of times to loop, and then loops.
If the frequency increases when the core is in __loop_delay, the
delay will be much shorter than requested.

Is this a correct assessment of the situation?

(BTW, does arch/arm/lib/delay-loop.S load the per_cpu loops_per_jiffy
or the system-wide variable?)

Should loop-based delays be disabled when CPUFREQ is enabled?

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 15:26             ` Marc Gonzalez
@ 2017-11-16 15:36               ` Russell King - ARM Linux
  2017-11-16 15:47                 ` Marc Gonzalez
  0 siblings, 1 reply; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-16 15:36 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz,
	Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> 
> > udelay() needs to offer a consistent interface so that drivers know
> > what to expect no matter what the implementation is.  Making one
> > implementation conform to your ideas while leaving the other
> > implementations with other expectations is a recipe for bugs.
> > 
> > If you really want to do this, fix the loops_per_jiffy implementation
> > as well so that the consistency is maintained.
> 
> Hello Russell,
> 
> It seems to me that, when using DFS, there's a serious issue with loop-based
> delays. (IIRC, it was you who pointed this out a few years ago.)
> 
> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> when the frequency changes.
> 
> But arch/arm/lib/delay-loop.S starts by loading the current value of
> loops_per_jiffy, computes the number of times to loop, and then loops.
> If the frequency increases when the core is in __loop_delay, the
> delay will be much shorter than requested.
> 
> Is this a correct assessment of the situation?

Absolutely correct, and it's something that people are aware of, and
have already catered for while writing their drivers.

> (BTW, does arch/arm/lib/delay-loop.S load the per_cpu loops_per_jiffy
> or the system-wide variable?)
> 
> Should loop-based delays be disabled when CPUFREQ is enabled?

What about platforms (and there are those in the kernel today) which
have CPUFREQ enabled and also have no timer based delay registered?
These rely on using the delay loop mechanism today.

What this means is you can't just "turn off" loop-based delays just
because CPUFREQ is enabled, because that's going to cause regressions.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 15:36               ` Russell King - ARM Linux
@ 2017-11-16 15:47                 ` Marc Gonzalez
  2017-11-16 16:08                   ` Nicolas Pitre
  0 siblings, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-16 15:47 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz,
	Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
>>
>>> udelay() needs to offer a consistent interface so that drivers know
>>> what to expect no matter what the implementation is.  Making one
>>> implementation conform to your ideas while leaving the other
>>> implementations with other expectations is a recipe for bugs.
>>>
>>> If you really want to do this, fix the loops_per_jiffy implementation
>>> as well so that the consistency is maintained.
>>
>> Hello Russell,
>>
>> It seems to me that, when using DFS, there's a serious issue with loop-based
>> delays. (IIRC, it was you who pointed this out a few years ago.)
>>
>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
>> when the frequency changes.
>>
>> But arch/arm/lib/delay-loop.S starts by loading the current value of
>> loops_per_jiffy, computes the number of times to loop, and then loops.
>> If the frequency increases when the core is in __loop_delay, the
>> delay will be much shorter than requested.
>>
>> Is this a correct assessment of the situation?
> 
> Absolutely correct, and it's something that people are aware of, and
> have already catered for while writing their drivers.

In their cpufreq driver?
In "real" device drivers that happen to use delays?

On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
If the frequency increases at the beginning of __loop_delay, udelay(100)
would spin only 10 microseconds. This is likely to cause issues in
any driver using udelay.

How does one cater for that?

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 15:47                 ` Marc Gonzalez
@ 2017-11-16 16:08                   ` Nicolas Pitre
  2017-11-16 16:26                     ` Marc Gonzalez
  0 siblings, 1 reply; 41+ messages in thread
From: Nicolas Pitre @ 2017-11-16 16:08 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML,
	Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland,
	Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman,
	Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic,
	Mason

On Thu, 16 Nov 2017, Marc Gonzalez wrote:

> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> > On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>
> >>> udelay() needs to offer a consistent interface so that drivers know
> >>> what to expect no matter what the implementation is.  Making one
> >>> implementation conform to your ideas while leaving the other
> >>> implementations with other expectations is a recipe for bugs.
> >>>
> >>> If you really want to do this, fix the loops_per_jiffy implementation
> >>> as well so that the consistency is maintained.
> >>
> >> Hello Russell,
> >>
> >> It seems to me that, when using DFS, there's a serious issue with loop-based
> >> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>
> >> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >> when the frequency changes.
> >>
> >> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >> loops_per_jiffy, computes the number of times to loop, and then loops.
> >> If the frequency increases when the core is in __loop_delay, the
> >> delay will be much shorter than requested.
> >>
> >> Is this a correct assessment of the situation?
> > 
> > Absolutely correct, and it's something that people are aware of, and
> > have already catered for while writing their drivers.
> 
> In their cpufreq driver?
> In "real" device drivers that happen to use delays?
> 
> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> If the frequency increases at the beginning of __loop_delay, udelay(100)
> would spin only 10 microseconds. This is likely to cause issues in
> any driver using udelay.
> 
> How does one cater for that?

You make sure your delays are based on a stable hardware timer.
Most platforms nowadays should have a suitable timer source.


Nicolas

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 16:08                   ` Nicolas Pitre
@ 2017-11-16 16:26                     ` Marc Gonzalez
  2017-11-16 16:32                       ` Russell King - ARM Linux
  2017-11-16 16:47                       ` Nicolas Pitre
  0 siblings, 2 replies; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-16 16:26 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML,
	Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland,
	Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman,
	Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic,
	Mason

On 16/11/2017 17:08, Nicolas Pitre wrote:

> On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> 
>> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
>>>>
>>>>> udelay() needs to offer a consistent interface so that drivers know
>>>>> what to expect no matter what the implementation is.  Making one
>>>>> implementation conform to your ideas while leaving the other
>>>>> implementations with other expectations is a recipe for bugs.
>>>>>
>>>>> If you really want to do this, fix the loops_per_jiffy implementation
>>>>> as well so that the consistency is maintained.
>>>>
>>>> Hello Russell,
>>>>
>>>> It seems to me that, when using DFS, there's a serious issue with loop-based
>>>> delays. (IIRC, it was you who pointed this out a few years ago.)
>>>>
>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
>>>> when the frequency changes.
>>>>
>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
>>>> loops_per_jiffy, computes the number of times to loop, and then loops.
>>>> If the frequency increases when the core is in __loop_delay, the
>>>> delay will be much shorter than requested.
>>>>
>>>> Is this a correct assessment of the situation?
>>>
>>> Absolutely correct, and it's something that people are aware of, and
>>> have already catered for while writing their drivers.
>>
>> In their cpufreq driver?
>> In "real" device drivers that happen to use delays?
>>
>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
>> If the frequency increases at the beginning of __loop_delay, udelay(100)
>> would spin only 10 microseconds. This is likely to cause issues in
>> any driver using udelay.
>>
>> How does one cater for that?
> 
> You make sure your delays are based on a stable hardware timer.
> Most platforms nowadays should have a suitable timer source.

So you propose fixing loop-based delays by using clock-based delays,
is that correct? (That is indeed what I did on my platform.)

Russell stated that there are platforms using loop-based delays with
cpufreq enabled. I'm asking how they manage the brokenness.

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 16:26                     ` Marc Gonzalez
@ 2017-11-16 16:32                       ` Russell King - ARM Linux
  2017-11-16 16:42                         ` Marc Gonzalez
  2017-11-16 16:47                       ` Nicolas Pitre
  1 sibling, 1 reply; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-16 16:32 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Douglas Anderson, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote:
> On 16/11/2017 17:08, Nicolas Pitre wrote:
> 
> > On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> > 
> >> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> >>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>>>
> >>>>> udelay() needs to offer a consistent interface so that drivers know
> >>>>> what to expect no matter what the implementation is.  Making one
> >>>>> implementation conform to your ideas while leaving the other
> >>>>> implementations with other expectations is a recipe for bugs.
> >>>>>
> >>>>> If you really want to do this, fix the loops_per_jiffy implementation
> >>>>> as well so that the consistency is maintained.
> >>>>
> >>>> Hello Russell,
> >>>>
> >>>> It seems to me that, when using DFS, there's a serious issue with loop-based
> >>>> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>>>
> >>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >>>> when the frequency changes.
> >>>>
> >>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >>>> loops_per_jiffy, computes the number of times to loop, and then loops.
> >>>> If the frequency increases when the core is in __loop_delay, the
> >>>> delay will be much shorter than requested.
> >>>>
> >>>> Is this a correct assessment of the situation?
> >>>
> >>> Absolutely correct, and it's something that people are aware of, and
> >>> have already catered for while writing their drivers.
> >>
> >> In their cpufreq driver?
> >> In "real" device drivers that happen to use delays?
> >>
> >> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> >> If the frequency increases at the beginning of __loop_delay, udelay(100)
> >> would spin only 10 microseconds. This is likely to cause issues in
> >> any driver using udelay.
> >>
> >> How does one cater for that?
> > 
> > You make sure your delays are based on a stable hardware timer.
> > Most platforms nowadays should have a suitable timer source.
> 
> So you propose fixing loop-based delays by using clock-based delays,
> is that correct? (That is indeed what I did on my platform.)
> 
> Russell stated that there are platforms using loop-based delays with
> cpufreq enabled. I'm asking how they manage the brokenness.

Quite simply, they don't have such a wide range of frequencies that can
be selected.  They're on the order of 4x.  For example, the original
platform that cpufreq was developed on, a StrongARM-1110 board, can
practically range from 221MHz down to 59MHz.

BTW, your example above is incorrect.  If I'm not mistaken "120 MHz to
1.2 GHz." is only a 10x range, not a 100x range.  That would mean if
udelay(100) is initially executed while at 1.2GHz, and the frequency
drops to 120MHz during that delay, the delay could be anything between
just short of 100us to just short of 1ms.

For 10ms to come into it, you'd need to range from 1.2GHz down to
0.012GHz, iow, 12MHz.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 16:32                       ` Russell King - ARM Linux
@ 2017-11-16 16:42                         ` Marc Gonzalez
  2017-11-16 17:05                           ` Russell King - ARM Linux
  0 siblings, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-16 16:42 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Douglas Anderson, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On 16/11/2017 17:32, Russell King - ARM Linux wrote:
> On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote:
>> On 16/11/2017 17:08, Nicolas Pitre wrote:
>>
>>> On Thu, 16 Nov 2017, Marc Gonzalez wrote:
>>>
>>>> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
>>>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
>>>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
>>>>>>
>>>>>>> udelay() needs to offer a consistent interface so that drivers know
>>>>>>> what to expect no matter what the implementation is.  Making one
>>>>>>> implementation conform to your ideas while leaving the other
>>>>>>> implementations with other expectations is a recipe for bugs.
>>>>>>>
>>>>>>> If you really want to do this, fix the loops_per_jiffy implementation
>>>>>>> as well so that the consistency is maintained.
>>>>>>
>>>>>> Hello Russell,
>>>>>>
>>>>>> It seems to me that, when using DFS, there's a serious issue with loop-based
>>>>>> delays. (IIRC, it was you who pointed this out a few years ago.)
>>>>>>
>>>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
>>>>>> when the frequency changes.
>>>>>>
>>>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
>>>>>> loops_per_jiffy, computes the number of times to loop, and then loops.
>>>>>> If the frequency increases when the core is in __loop_delay, the
>>>>>> delay will be much shorter than requested.
>>>>>>
>>>>>> Is this a correct assessment of the situation?
>>>>>
>>>>> Absolutely correct, and it's something that people are aware of, and
>>>>> have already catered for while writing their drivers.
>>>>
>>>> In their cpufreq driver?
>>>> In "real" device drivers that happen to use delays?
>>>>
>>>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
>>>> If the frequency increases at the beginning of __loop_delay, udelay(100)
>>>> would spin only 10 microseconds. This is likely to cause issues in
>>>> any driver using udelay.
>>>>
>>>> How does one cater for that?
>>>
>>> You make sure your delays are based on a stable hardware timer.
>>> Most platforms nowadays should have a suitable timer source.
>>
>> So you propose fixing loop-based delays by using clock-based delays,
>> is that correct? (That is indeed what I did on my platform.)
>>
>> Russell stated that there are platforms using loop-based delays with
>> cpufreq enabled. I'm asking how they manage the brokenness.
> 
> Quite simply, they don't have such a wide range of frequencies that can
> be selected.  They're on the order of 4x.  For example, the original
> platform that cpufreq was developed on, a StrongARM-1110 board, can
> practically range from 221MHz down to 59MHz.

Requesting 100 µs and spinning only 25 µs is still a problem,
don't you agree?

> BTW, your example above is incorrect.

A 10x increase in frequency causes a request of 100 µs to spin
only 10 µs, as written above.

The problem is not when the frequency drops -- this makes the
delay longer. The problem is when the frequency increases,
which makes the delay shorter.

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 16:26                     ` Marc Gonzalez
  2017-11-16 16:32                       ` Russell King - ARM Linux
@ 2017-11-16 16:47                       ` Nicolas Pitre
  2017-11-16 16:51                         ` Marc Gonzalez
  1 sibling, 1 reply; 41+ messages in thread
From: Nicolas Pitre @ 2017-11-16 16:47 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML,
	Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland,
	Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman,
	Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic,
	Mason

On Thu, 16 Nov 2017, Marc Gonzalez wrote:

> On 16/11/2017 17:08, Nicolas Pitre wrote:
> 
> > On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> > 
> >> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> >>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>>>
> >>>>> udelay() needs to offer a consistent interface so that drivers know
> >>>>> what to expect no matter what the implementation is.  Making one
> >>>>> implementation conform to your ideas while leaving the other
> >>>>> implementations with other expectations is a recipe for bugs.
> >>>>>
> >>>>> If you really want to do this, fix the loops_per_jiffy implementation
> >>>>> as well so that the consistency is maintained.
> >>>>
> >>>> Hello Russell,
> >>>>
> >>>> It seems to me that, when using DFS, there's a serious issue with loop-based
> >>>> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>>>
> >>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >>>> when the frequency changes.
> >>>>
> >>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >>>> loops_per_jiffy, computes the number of times to loop, and then loops.
> >>>> If the frequency increases when the core is in __loop_delay, the
> >>>> delay will be much shorter than requested.
> >>>>
> >>>> Is this a correct assessment of the situation?
> >>>
> >>> Absolutely correct, and it's something that people are aware of, and
> >>> have already catered for while writing their drivers.
> >>
> >> In their cpufreq driver?
> >> In "real" device drivers that happen to use delays?
> >>
> >> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> >> If the frequency increases at the beginning of __loop_delay, udelay(100)
> >> would spin only 10 microseconds. This is likely to cause issues in
> >> any driver using udelay.
> >>
> >> How does one cater for that?
> > 
> > You make sure your delays are based on a stable hardware timer.
> > Most platforms nowadays should have a suitable timer source.
> 
> So you propose fixing loop-based delays by using clock-based delays,
> is that correct? (That is indeed what I did on my platform.)
> 
> Russell stated that there are platforms using loop-based delays with
> cpufreq enabled. I'm asking how they manage the brokenness.

Look at cpufreq_callback() in arch/arm/kernel/smp.c.


Nicolas


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 16:47                       ` Nicolas Pitre
@ 2017-11-16 16:51                         ` Marc Gonzalez
  2017-11-16 17:00                           ` Nicolas Pitre
  0 siblings, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-16 16:51 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML,
	Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland,
	Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman,
	Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic,
	Mason

On 16/11/2017 17:47, Nicolas Pitre wrote:

> Look at cpufreq_callback() in arch/arm/kernel/smp.c.

Are you pointing at the scaling of loops_per_jiffy done in that function?

As I wrote earlier:

If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
when the frequency changes.

But arch/arm/lib/delay-loop.S starts by loading the current value of
loops_per_jiffy, computes the number of times to loop, and then loops.
If the frequency increases when the core is in __loop_delay, the
delay will be much shorter than requested.


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 16:51                         ` Marc Gonzalez
@ 2017-11-16 17:00                           ` Nicolas Pitre
  0 siblings, 0 replies; 41+ messages in thread
From: Nicolas Pitre @ 2017-11-16 17:00 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML,
	Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland,
	Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman,
	Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic,
	Mason

On Thu, 16 Nov 2017, Marc Gonzalez wrote:

> On 16/11/2017 17:47, Nicolas Pitre wrote:
> 
> > Look at cpufreq_callback() in arch/arm/kernel/smp.c.
> 
> Are you pointing at the scaling of loops_per_jiffy done in that function?
> 
> As I wrote earlier:
> 
> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> when the frequency changes.
> 
> But arch/arm/lib/delay-loop.S starts by loading the current value of
> loops_per_jiffy, computes the number of times to loop, and then loops.
> If the frequency increases when the core is in __loop_delay, the
> delay will be much shorter than requested.

The callback is invoked with CPUFREQ_PRECHANGE before the actual 
frequency increase. If your CPU clock is per core, then you won't be in 
the middle of the delay loop when this happens, unless you change your 
core clock from an interrupt handler.

If your CPU clock is common to all cores then you are screwed. In that 
case the only way out is a hardware timer based delay.


Nicolas


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 16:42                         ` Marc Gonzalez
@ 2017-11-16 17:05                           ` Russell King - ARM Linux
  2017-11-16 21:05                             ` Marc Gonzalez
  0 siblings, 1 reply; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-16 17:05 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Douglas Anderson, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
> On 16/11/2017 17:32, Russell King - ARM Linux wrote:
> > On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote:
> >> On 16/11/2017 17:08, Nicolas Pitre wrote:
> >>
> >>> On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> >>>
> >>>> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> >>>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >>>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>>>>>
> >>>>>>> udelay() needs to offer a consistent interface so that drivers know
> >>>>>>> what to expect no matter what the implementation is.  Making one
> >>>>>>> implementation conform to your ideas while leaving the other
> >>>>>>> implementations with other expectations is a recipe for bugs.
> >>>>>>>
> >>>>>>> If you really want to do this, fix the loops_per_jiffy implementation
> >>>>>>> as well so that the consistency is maintained.
> >>>>>>
> >>>>>> Hello Russell,
> >>>>>>
> >>>>>> It seems to me that, when using DFS, there's a serious issue with loop-based
> >>>>>> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>>>>>
> >>>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >>>>>> when the frequency changes.
> >>>>>>
> >>>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >>>>>> loops_per_jiffy, computes the number of times to loop, and then loops.
> >>>>>> If the frequency increases when the core is in __loop_delay, the
> >>>>>> delay will be much shorter than requested.
> >>>>>>
> >>>>>> Is this a correct assessment of the situation?
> >>>>>
> >>>>> Absolutely correct, and it's something that people are aware of, and
> >>>>> have already catered for while writing their drivers.
> >>>>
> >>>> In their cpufreq driver?
> >>>> In "real" device drivers that happen to use delays?
> >>>>
> >>>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> >>>> If the frequency increases at the beginning of __loop_delay, udelay(100)
> >>>> would spin only 10 microseconds. This is likely to cause issues in
> >>>> any driver using udelay.
> >>>>
> >>>> How does one cater for that?
> >>>
> >>> You make sure your delays are based on a stable hardware timer.
> >>> Most platforms nowadays should have a suitable timer source.
> >>
> >> So you propose fixing loop-based delays by using clock-based delays,
> >> is that correct? (That is indeed what I did on my platform.)
> >>
> >> Russell stated that there are platforms using loop-based delays with
> >> cpufreq enabled. I'm asking how they manage the brokenness.
> > 
> > Quite simply, they don't have such a wide range of frequencies that can
> > be selected.  They're on the order of 4x.  For example, the original
> > platform that cpufreq was developed on, a StrongARM-1110 board, can
> > practically range from 221MHz down to 59MHz.
> 
> Requesting 100 µs and spinning only 25 µs is still a problem,
> don't you agree?

Which is why, as I've said *many* times already, drivers are written
with leeway on the delays.

I get the impression that we're just going around in circles, and what
you're trying to do is to get me to agree with your point of view.
That's not going to happen, because I know the history over about the
last /24/ years of kernel development (which is how long I've been
involved with the kernel.)  That's almost a quarter of a century!

I know how things were done years ago (which is relevant because we
still have support in the kernel for these systems), and I also know the
history of facilities like cpufreq - I was the one who took the work
that Erik Mouw and others involved with the LART project, and turned it
into something a little more generic.  The idea of dynamically scaling
the CPU frequency on ARM SoCs was something that the SoC manufacturer
had not even considered - it was innovative.

I know that udelay() can return short delays when used in a kernel with
cpufreq enabled, and I also know that's almost an impossible problem to
solve without going to a timer-based delay.

So, when you think that sending an email about a udelay() that can be
10x shorter might be somehow new information, and might convince people
that there's a problem, I'm afraid that it isn't really new information.
The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
has the ability to make udelay()s 4x shorter or 4x longer depending on
the direction of change.

We've discussed solutions in the past (probably 10 years ago) about
this, and what can be done, and the conclusion to that was, as Nicolas
has said, to switch to using a timer-based delay mechanism where
possible.  Where this is not possible, the platform is stuck with the
loops based delays, and their inherent variability and inaccuracy.

These platforms have been tested with such a setup over many years.
They work even with udelay() having this behaviour, because it's a
known issue and drivers cater for it in ways that I've already covered
in my many previous emails to you.

These issues are known.  They've been known for the last 15 odd years.

> > BTW, your example above is incorrect.
> 
> A 10x increase in frequency causes a request of 100 µs to spin
> only 10 µs, as written above.

Right, sorry, misread.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 17:05                           ` Russell King - ARM Linux
@ 2017-11-16 21:05                             ` Marc Gonzalez
  2017-11-16 22:15                               ` Doug Anderson
  0 siblings, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-16 21:05 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Douglas Anderson, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On 16/11/2017 18:05, Russell King - ARM Linux wrote:

> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
>
>> Requesting 100 µs and spinning only 25 µs is still a problem,
>> don't you agree?
> 
> Which is why, as I've said *many* times already, drivers are written
> with leeway on the delays.

A delay 75% too short is possible. Roger that.

> I get the impression that we're just going around in circles, and what
> you're trying to do is to get me to agree with your point of view.
> That's not going to happen, because I know the history over about the
> last /24/ years of kernel development (which is how long I've been
> involved with the kernel.)  That's almost a quarter of a century!
> 
> I know how things were done years ago (which is relevant because we
> still have support in the kernel for these systems), and I also know the
> history of facilities like cpufreq - I was the one who took the work
> that Erik Mouw and others involved with the LART project, and turned it
> into something a little more generic.  The idea of dynamically scaling
> the CPU frequency on ARM SoCs was something that the SoC manufacturer
> had not even considered - it was innovative.
> 
> I know that udelay() can return short delays when used in a kernel with
> cpufreq enabled, and I also know that's almost an impossible problem to
> solve without going to a timer-based delay.
> 
> So, when you think that sending an email about a udelay() that can be
> 10x shorter might be somehow new information, and might convince people
> that there's a problem, I'm afraid that it isn't really new information.
> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
> has the ability to make udelay()s 4x shorter or 4x longer depending on
> the direction of change.
> 
> We've discussed solutions in the past (probably 10 years ago) about
> this, and what can be done, and the conclusion to that was, as Nicolas
> has said, to switch to using a timer-based delay mechanism where
> possible.  Where this is not possible, the platform is stuck with the
> loops based delays, and their inherent variability and inaccuracy.
> 
> These platforms have been tested with such a setup over many years.
> They work even with udelay() having this behaviour, because it's a
> known issue and drivers cater for it in ways that I've already covered
> in my many previous emails to you.
> 
> These issues are known.  They've been known for the last 15 odd years.

So you've known for umpteen years that fixing loop-based delays is
intractable, yet you wrote:

> udelay() needs to offer a consistent interface so that drivers know
> what to expect no matter what the implementation is.  Making one
> implementation conform to your ideas while leaving the other
> implementations with other expectations is a recipe for bugs.
> 
> If you really want to do this, fix the loops_per_jiffy implementation
> as well so that the consistency is maintained.

In other words, "I'll consider your patch as soon as Hell freezes over".

Roger that. I'll drop the subject then.


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 21:05                             ` Marc Gonzalez
@ 2017-11-16 22:15                               ` Doug Anderson
  2017-11-16 23:22                                 ` Russell King - ARM Linux
  0 siblings, 1 reply; 41+ messages in thread
From: Doug Anderson @ 2017-11-16 22:15 UTC (permalink / raw)
  To: Marc Gonzalez
  Cc: Russell King - ARM Linux, Nicolas Pitre, Linus Torvalds,
	Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, John Stultz, Mark Rutland,
	Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman,
	Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic,
	Mason

Hi,

On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
<marc_gonzalez@sigmadesigns.com> wrote:
> On 16/11/2017 18:05, Russell King - ARM Linux wrote:
>
>> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
>>
>>> Requesting 100 µs and spinning only 25 µs is still a problem,
>>> don't you agree?
>>
>> Which is why, as I've said *many* times already, drivers are written
>> with leeway on the delays.
>
> A delay 75% too short is possible. Roger that.
>
>> I get the impression that we're just going around in circles, and what
>> you're trying to do is to get me to agree with your point of view.
>> That's not going to happen, because I know the history over about the
>> last /24/ years of kernel development (which is how long I've been
>> involved with the kernel.)  That's almost a quarter of a century!
>>
>> I know how things were done years ago (which is relevant because we
>> still have support in the kernel for these systems), and I also know the
>> history of facilities like cpufreq - I was the one who took the work
>> that Erik Mouw and others involved with the LART project, and turned it
>> into something a little more generic.  The idea of dynamically scaling
>> the CPU frequency on ARM SoCs was something that the SoC manufacturer
>> had not even considered - it was innovative.
>>
>> I know that udelay() can return short delays when used in a kernel with
>> cpufreq enabled, and I also know that's almost an impossible problem to
>> solve without going to a timer-based delay.
>>
>> So, when you think that sending an email about a udelay() that can be
>> 10x shorter might be somehow new information, and might convince people
>> that there's a problem, I'm afraid that it isn't really new information.
>> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
>> has the ability to make udelay()s 4x shorter or 4x longer depending on
>> the direction of change.
>>
>> We've discussed solutions in the past (probably 10 years ago) about
>> this, and what can be done, and the conclusion to that was, as Nicolas
>> has said, to switch to using a timer-based delay mechanism where
>> possible.  Where this is not possible, the platform is stuck with the
>> loops based delays, and their inherent variability and inaccuracy.
>>
>> These platforms have been tested with such a setup over many years.
>> They work even with udelay() having this behaviour, because it's a
>> known issue and drivers cater for it in ways that I've already covered
>> in my many previous emails to you.
>>
>> These issues are known.  They've been known for the last 15 odd years.
>
> So you've known for umpteen years that fixing loop-based delays is
> intractable, yet you wrote:
>
>> udelay() needs to offer a consistent interface so that drivers know
>> what to expect no matter what the implementation is.  Making one
>> implementation conform to your ideas while leaving the other
>> implementations with other expectations is a recipe for bugs.
>>
>> If you really want to do this, fix the loops_per_jiffy implementation
>> as well so that the consistency is maintained.
>
> In other words, "I'll consider your patch as soon as Hell freezes over".
>
> Roger that. I'll drop the subject then.

Presumably, though, you could introduce a new API like:

  udelay_atleast()

That was guaranteed to delay at least the given number of
microseconds.  Unlike the current udelay(), the new udelay_atleast()
wouldn't really try that hard to get a delay that's approximately the
one requested, it would just guarantee not to ever delay _less_ than
the amount requested.  Thus there would be some reasons to use one API
or the other and it would be up to the driver to pick one.  You
wouldn't regress any performance on old code since the old API would
work the same as it always did.


You could presumably implement udelay_atleast() something like I did
in an ancient patch I was involved in at http://crosreview.com/189885.
In this case you wouldn't modify the normal udelay() but you'd
actually add a new API.  (for the curious, we later picked back a
proper timer-based delay in http://crosreview.com/218961 so it's not
like we were stuck with the crappy delay for long)


Once you added udelay_atleast() you'd have a good reason to fixup the
timer-based udelay() since udelay_atleast() could just be a wrapper of
that function and you'd have the requested consistency.


-Doug


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 22:15                               ` Doug Anderson
@ 2017-11-16 23:22                                 ` Russell King - ARM Linux
  2017-11-20 17:38                                   ` Doug Anderson
  0 siblings, 1 reply; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-16 23:22 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Marc Gonzalez, Nicolas Pitre, Linus Torvalds, Alan Cox, LKML,
	Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote:
> Hi,
> 
> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
> <marc_gonzalez@sigmadesigns.com> wrote:
> > On 16/11/2017 18:05, Russell King - ARM Linux wrote:
> >
> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
> >>
> >>> Requesting 100 µs and spinning only 25 µs is still a problem,
> >>> don't you agree?
> >>
> >> Which is why, as I've said *many* times already, drivers are written
> >> with leeway on the delays.
> >
> > A delay 75% too short is possible. Roger that.
> >
> >> I get the impression that we're just going around in circles, and what
> >> you're trying to do is to get me to agree with your point of view.
> >> That's not going to happen, because I know the history over about the
> >> last /24/ years of kernel development (which is how long I've been
> >> involved with the kernel.)  That's almost a quarter of a century!
> >>
> >> I know how things were done years ago (which is relevant because we
> >> still have support in the kernel for these systems), and I also know the
> >> history of facilities like cpufreq - I was the one who took the work
> >> that Erik Mouw and others involved with the LART project, and turned it
> >> into something a little more generic.  The idea of dynamically scaling
> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer
> >> had not even considered - it was innovative.
> >>
> >> I know that udelay() can return short delays when used in a kernel with
> >> cpufreq enabled, and I also know that's almost an impossible problem to
> >> solve without going to a timer-based delay.
> >>
> >> So, when you think that sending an email about a udelay() that can be
> >> 10x shorter might be somehow new information, and might convince people
> >> that there's a problem, I'm afraid that it isn't really new information.
> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
> >> has the ability to make udelay()s 4x shorter or 4x longer depending on
> >> the direction of change.
> >>
> >> We've discussed solutions in the past (probably 10 years ago) about
> >> this, and what can be done, and the conclusion to that was, as Nicolas
> >> has said, to switch to using a timer-based delay mechanism where
> >> possible.  Where this is not possible, the platform is stuck with the
> >> loops based delays, and their inherent variability and inaccuracy.
> >>
> >> These platforms have been tested with such a setup over many years.
> >> They work even with udelay() having this behaviour, because it's a
> >> known issue and drivers cater for it in ways that I've already covered
> >> in my many previous emails to you.
> >>
> >> These issues are known.  They've been known for the last 15 odd years.
> >
> > So you've known for umpteen years that fixing loop-based delays is
> > intractable, yet you wrote:
> >
> >> udelay() needs to offer a consistent interface so that drivers know
> >> what to expect no matter what the implementation is.  Making one
> >> implementation conform to your ideas while leaving the other
> >> implementations with other expectations is a recipe for bugs.
> >>
> >> If you really want to do this, fix the loops_per_jiffy implementation
> >> as well so that the consistency is maintained.
> >
> > In other words, "I'll consider your patch as soon as Hell freezes over".
> >
> > Roger that. I'll drop the subject then.
> 
> Presumably, though, you could introduce a new API like:
> 
>   udelay_atleast()
> 
> That was guaranteed to delay at least the given number of
> microseconds.  Unlike the current udelay(), the new udelay_atleast()
> wouldn't really try that hard to get a delay that's approximately the
> one requested, it would just guarantee not to ever delay _less_ than
> the amount requested.

I look forward to reviewing your implementation.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-16 23:22                                 ` Russell King - ARM Linux
@ 2017-11-20 17:38                                   ` Doug Anderson
  2017-11-20 18:31                                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 41+ messages in thread
From: Doug Anderson @ 2017-11-20 17:38 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Marc Gonzalez, Nicolas Pitre, Linus Torvalds, Alan Cox, LKML,
	Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, John Stultz, Mark Rutland, Will Deacon,
	Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette,
	Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

Hi,

On Thu, Nov 16, 2017 at 3:22 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote:
>> Hi,
>>
>> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
>> <marc_gonzalez@sigmadesigns.com> wrote:
>> > On 16/11/2017 18:05, Russell King - ARM Linux wrote:
>> >
>> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
>> >>
>> >>> Requesting 100 µs and spinning only 25 µs is still a problem,
>> >>> don't you agree?
>> >>
>> >> Which is why, as I've said *many* times already, drivers are written
>> >> with leeway on the delays.
>> >
>> > A delay 75% too short is possible. Roger that.
>> >
>> >> I get the impression that we're just going around in circles, and what
>> >> you're trying to do is to get me to agree with your point of view.
>> >> That's not going to happen, because I know the history over about the
>> >> last /24/ years of kernel development (which is how long I've been
>> >> involved with the kernel.)  That's almost a quarter of a century!
>> >>
>> >> I know how things were done years ago (which is relevant because we
>> >> still have support in the kernel for these systems), and I also know the
>> >> history of facilities like cpufreq - I was the one who took the work
>> >> that Erik Mouw and others involved with the LART project, and turned it
>> >> into something a little more generic.  The idea of dynamically scaling
>> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer
>> >> had not even considered - it was innovative.
>> >>
>> >> I know that udelay() can return short delays when used in a kernel with
>> >> cpufreq enabled, and I also know that's almost an impossible problem to
>> >> solve without going to a timer-based delay.
>> >>
>> >> So, when you think that sending an email about a udelay() that can be
>> >> 10x shorter might be somehow new information, and might convince people
>> >> that there's a problem, I'm afraid that it isn't really new information.
>> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
>> >> has the ability to make udelay()s 4x shorter or 4x longer depending on
>> >> the direction of change.
>> >>
>> >> We've discussed solutions in the past (probably 10 years ago) about
>> >> this, and what can be done, and the conclusion to that was, as Nicolas
>> >> has said, to switch to using a timer-based delay mechanism where
>> >> possible.  Where this is not possible, the platform is stuck with the
>> >> loops based delays, and their inherent variability and inaccuracy.
>> >>
>> >> These platforms have been tested with such a setup over many years.
>> >> They work even with udelay() having this behaviour, because it's a
>> >> known issue and drivers cater for it in ways that I've already covered
>> >> in my many previous emails to you.
>> >>
>> >> These issues are known.  They've been known for the last 15 odd years.
>> >
>> > So you've known for umpteen years that fixing loop-based delays is
>> > intractable, yet you wrote:
>> >
>> >> udelay() needs to offer a consistent interface so that drivers know
>> >> what to expect no matter what the implementation is.  Making one
>> >> implementation conform to your ideas while leaving the other
>> >> implementations with other expectations is a recipe for bugs.
>> >>
>> >> If you really want to do this, fix the loops_per_jiffy implementation
>> >> as well so that the consistency is maintained.
>> >
>> > In other words, "I'll consider your patch as soon as Hell freezes over".
>> >
>> > Roger that. I'll drop the subject then.
>>
>> Presumably, though, you could introduce a new API like:
>>
>>   udelay_atleast()
>>
>> That was guaranteed to delay at least the given number of
>> microseconds.  Unlike the current udelay(), the new udelay_atleast()
>> wouldn't really try that hard to get a delay that's approximately the
>> one requested, it would just guarantee not to ever delay _less_ than
>> the amount requested.
>
> I look forward to reviewing your implementation.

It's unlikely I'll post a patch in the near term since this isn't
presenting me with a big problem right now.  Mostly I saw Marc's patch
and thought it would be a good patch to land and I knew this type of
thing had bitten me in the past.

One happy result of this whole discussion, though, is that you now
sound as if you'll be happy the next time someone brings this up since
you're looking forward to reviewing an implementation.  That's a nice
change from the original statement questioning why someone was asking
about this again.  :)

As I've expressed in the past, IMHO fixing the timer based delays is
still a sane thing to do even without also fixing all the loop-based
implementations.  As someone said earlier in this thread, all the old
platforms using loop-based delays are working great for the platforms
they support and there's no reason to mess with them, but that doesn't
mean we shouldn't fix the problems that are easy to fix.


-Doug


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-20 17:38                                   ` Doug Anderson
@ 2017-11-20 18:31                                     ` Russell King - ARM Linux
  0 siblings, 0 replies; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-20 18:31 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Nicolas Pitre, Alan Cox, Mason, Thibaud Cornic, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Peter Zijlstra, Stephen Boyd,
	Michael Turquette, Mark Rutland, Will Deacon, LKML,
	Steven Rostedt, John Stultz, Boris Brezillon, Thomas Gleixner,
	Linus Torvalds, Ingo Molnar, Linux ARM, Marc Gonzalez

On Mon, Nov 20, 2017 at 09:38:46AM -0800, Doug Anderson wrote:
> Hi,
> 
> On Thu, Nov 16, 2017 at 3:22 PM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
> >> <marc_gonzalez@sigmadesigns.com> wrote:
> >> > On 16/11/2017 18:05, Russell King - ARM Linux wrote:
> >> >
> >> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
> >> >>
> >> >>> Requesting 100 µs and spinning only 25 µs is still a problem,
> >> >>> don't you agree?
> >> >>
> >> >> Which is why, as I've said *many* times already, drivers are written
> >> >> with leeway on the delays.
> >> >
> >> > A delay 75% too short is possible. Roger that.
> >> >
> >> >> I get the impression that we're just going around in circles, and what
> >> >> you're trying to do is to get me to agree with your point of view.
> >> >> That's not going to happen, because I know the history over about the
> >> >> last /24/ years of kernel development (which is how long I've been
> >> >> involved with the kernel.)  That's almost a quarter of a century!
> >> >>
> >> >> I know how things were done years ago (which is relevant because we
> >> >> still have support in the kernel for these systems), and I also know the
> >> >> history of facilities like cpufreq - I was the one who took the work
> >> >> that Erik Mouw and others involved with the LART project, and turned it
> >> >> into something a little more generic.  The idea of dynamically scaling
> >> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer
> >> >> had not even considered - it was innovative.
> >> >>
> >> >> I know that udelay() can return short delays when used in a kernel with
> >> >> cpufreq enabled, and I also know that's almost an impossible problem to
> >> >> solve without going to a timer-based delay.
> >> >>
> >> >> So, when you think that sending an email about a udelay() that can be
> >> >> 10x shorter might be somehow new information, and might convince people
> >> >> that there's a problem, I'm afraid that it isn't really new information.
> >> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
> >> >> has the ability to make udelay()s 4x shorter or 4x longer depending on
> >> >> the direction of change.
> >> >>
> >> >> We've discussed solutions in the past (probably 10 years ago) about
> >> >> this, and what can be done, and the conclusion to that was, as Nicolas
> >> >> has said, to switch to using a timer-based delay mechanism where
> >> >> possible.  Where this is not possible, the platform is stuck with the
> >> >> loops based delays, and their inherent variability and inaccuracy.
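
The 100 µs → 25 µs figure quoted earlier in this exchange corresponds to a 4x cpufreq boost; a toy model (names and numbers illustrative) of why a loop count calibrated at one frequency scales this way:

```c
#include <stdint.h>

/*
 * A loop-based delay converts microseconds into iterations using a
 * loops-per-second figure measured at calibration time (f_cal, in Hz).
 * If cpufreq later raises the core clock to f_new, the same iteration
 * count completes in requested_us * f_cal / f_new microseconds --
 * e.g. 4x too short after a 4x frequency boost.
 */
static uint64_t actual_delay_us(uint64_t requested_us,
				uint64_t f_cal, uint64_t f_new)
{
	uint64_t loops = requested_us * f_cal / 1000000; /* calibrated count */
	return loops * 1000000 / f_new;                  /* real elapsed time */
}
```

This is the inherent variability referred to above: the loop count is fixed at calibration, while the time each iteration takes is not.
<imports>
</imports>
<test>
#include <assert.h>
int main(void)
{
	/* 100 us calibrated at 100 MHz, run at 400 MHz: 25 us. */
	assert(actual_delay_us(100, 100000000ull, 400000000ull) == 25);
	/* Same frequency: delay is as requested. */
	assert(actual_delay_us(100, 100000000ull, 100000000ull) == 100);
	return 0;
}
```
</test>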
> >> >>
> >> >> These platforms have been tested with such a setup over many years.
> >> >> They work even with udelay() having this behaviour, because it's a
> >> >> known issue and drivers cater for it in ways that I've already covered
> >> >> in my many previous emails to you.
> >> >>
> >> >> These issues are known.  They've been known for the last 15 odd years.
> >> >
> >> > So you've known for umpteen years that fixing loop-based delays is
> >> > intractable, yet you wrote:
> >> >
> >> >> udelay() needs to offer a consistent interface so that drivers know
> >> >> what to expect no matter what the implementation is.  Making one
> >> >> implementation conform to your ideas while leaving the other
> >> >> implementations with other expectations is a recipe for bugs.
> >> >>
> >> >> If you really want to do this, fix the loops_per_jiffy implementation
> >> >> as well so that the consistency is maintained.
> >> >
> >> > In other words, "I'll consider your patch as soon as Hell freezes over".
> >> >
> >> > Roger that. I'll drop the subject then.
> >>
> >> Presumably, though, you could introduce a new API like:
> >>
> >>   udelay_atleast()
> >>
> >> That was guaranteed to delay at least the given number of
> >> microseconds.  Unlike the current udelay(), the new udelay_atleast()
> >> wouldn't really try that hard to get a delay that's approximately the
> >> one requested, it would just guarantee not to ever delay _less_ than
> >> the amount requested.
> >
> > I look forward to reviewing your implementation.
> 
> It's unlikely I'll post a patch in the near term since this isn't
> presenting me with a big problem right now.  Mostly I saw Marc's patch
> and thought it would be a good patch to land and I knew this type of
> thing had bitten me in the past.
> 
> One happy result of this whole discussion, though, is that you now
> sound as if you'll be happy the next time someone brings this up since
> you're looking forward to reviewing an implementation.  That's a nice
> change from the original statement questioning why someone was asking
> about this again.  :)

What I'd be happy with, and what I've always been happy with, is what I've
stated: either we fix _all_ implementations or none of them.  We can't
have the situation where some implementations give one expectation and
others give something completely different.

That's always been my argument against _just_ fixing the timer-based
delays and ignoring the rest of the problem.

Nothing has changed about my position.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 15:53           ` Doug Anderson
@ 2017-12-07 12:31             ` Pavel Machek
  0 siblings, 0 replies; 41+ messages in thread
From: Pavel Machek @ 2017-12-07 12:31 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Russell King - ARM Linux, Linus Torvalds, Mark Rutland,
	Jonathan Austin, Arnd Bergmann, Mason, Peter Zijlstra,
	Will Deacon, Michael Turquette, Nicolas Pitre, Stephen Boyd,
	Steven Rostedt, LKML, Kevin Hilman, John Stultz, Thomas Gleixner,
	Ingo Molnar, Linux ARM, Marc Gonzalez

Hi!

> On Wed, Nov 1, 2017 at 2:26 AM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
> >> <torvalds@linux-foundation.org> wrote:
> >> > So I'm very much open to udelay improvements, and if somebody sends
> >> > patches for particular platforms to do particularly well on that
> >> > platform, I think we should merge them. But ...
> >>
> >> If I'm reading this all correctly, this sounds like you'd be willing
> >> to merge <https://patchwork.kernel.org/patch/9429841/>.  This makes
> >> udelay() guaranteed not to underrun on arm32 platforms.
> >
> > That's a mis-representation again.  It stops a timer-based udelay()
> > possibly underrunning by one tick if we are close to the start of
> > a count increment.  However, it does nothing for the loops_per_jiffy
> > udelay(), which can still underrun.
> >
> > My argument against merging that patch is that with it merged, we get
> > (as you say) a udelay() that doesn't underrun _when using a timer_
> > but when we end up using the loops_per_jiffy udelay(), we're back to
> > the old problem.
> >
> > My opinion is that's bad, because it encourages people to write drivers
> > that rely on udelay() having "good" behaviour, which it is not guaranteed
> > to have.  So, they'll specify a delay period of exactly what they want,
> > and their drivers will then fail when running on systems that aren't
> > using a timer-based udelay().
> 
> IMHO the current udelay is broken in an off-by-one way and it's easy
> to fix.  Intentionally leaving a bug in the code seems silly.  This
> seems to be what Linus is saying with his statement that "(a) platform
> code could try to make their udelay/ndelay() be as good as it can be
> on a particular platform".
> 
> So no matter the rest of the discussions, we should land that.  If you
> disagree then I'm happy to re-post that patch straight to Linus later
> this week since it sounds as if he'd take it.

Did this get fixed in any way? Russell having crazy arguments for
keeping kernel buggy should not be good enough reason to keep the
bugs...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
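
The one-tick fix discussed above (patch 9429841) can be sketched in a few lines; the helper name is ours, not the kernel's. A timer-based delay converts microseconds to timer cycles, but the first counter sample may land anywhere inside the current tick, so spinning for exactly that many ticks can return up to one tick early. Adding one cycle removes the underrun:

```c
#include <stdint.h>

/*
 * Convert a microsecond request to timer cycles for a timer running
 * at timer_hz.  Round up, then add one extra cycle to cover the
 * partial tick we may already be inside when the first sample of
 * the counter is taken.
 */
static uint64_t us_to_cycles_atleast(uint64_t us, uint64_t timer_hz)
{
	uint64_t cycles = (us * timer_hz + 999999) / 1000000; /* round up */
	return cycles + 1;                                    /* off-by-one fix */
}
```

As Russell notes above, this only helps the timer-based path; the loops_per_jiffy path can still underrun for unrelated (cpufreq) reasons.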


* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-15 13:13           ` Russell King - ARM Linux
  2017-11-16 15:26             ` Marc Gonzalez
@ 2017-12-07 12:43             ` Pavel Machek
  1 sibling, 0 replies; 41+ messages in thread
From: Pavel Machek @ 2017-12-07 12:43 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Marc Gonzalez, Linus Torvalds, Alan Cox, LKML, Linux ARM,
	Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland,
	Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman,
	Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic,
	Mason

> On Wed, Nov 15, 2017 at 01:51:54PM +0100, Marc Gonzalez wrote:
> > On 01/11/2017 20:38, Marc Gonzalez wrote:
> > 
> > > OK, I'll just send my patch, and then crawl back under my rock.
> > 
> > Linus,
> > 
> > As promised, the patch is provided below. And as promised, I will
> > no longer bring this up on LKML.
> > 
> > FWIW, I have checked that the computed value matches the expected
> > value for all HZ and delay_us, and for a few clock frequencies,
> > using the following program:
> > 
> > $ cat delays.c
> > #include <stdio.h>
> > #define MEGA 1000000u
> > typedef unsigned int uint;
> > typedef unsigned long long u64;
> > #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> > 
> > static const uint HZ_tab[] = { 100, 250, 300, 1000 };
> > 
> > static void check_cycle_count(uint freq, uint HZ, uint delay_us)
> > {
> > 	uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
> > 	uint lpj = DIV_ROUND_UP(freq, HZ);
> > 	uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
> > 	uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);
> > 
> > 	if (computed != expected)
> > 		printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
> > }
> > 
> > int main(void)
> > {
> > 	uint idx, delay_us, freq;
> > 
> > 	for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
> > 		for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
> > 			for (delay_us = 1; delay_us <= 2000; ++delay_us)
> > 				check_cycle_count(freq, HZ_tab[idx], delay_us);
> > 
> > 	return 0;
> > }
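
The UDELAY_MULT constant in the program above encodes 2^31 * HZ / 10^6 in integer arithmetic: 2^31 / 10^6 = 2147.483648, split into an integer part (2147 * HZ) and a fractional part (483648 * HZ / 10^6) so the constant fits without overflow. Then (lpj * us * UDELAY_MULT) >> 31 approximates lpj * us * HZ / 10^6 = us * freq / 10^6. A small check of that split (the helper name is ours):

```c
#include <stdint.h>

/* Rebuild the multiplier exactly as the quoted program does:
 * 2147 * HZ is the integer part of 2^31 * HZ / 10^6, and
 * 483648 * HZ / 10^6 recovers the fractional part. */
static uint64_t udelay_mult(uint64_t hz)
{
	return 2147 * hz + 483648 * hz / 1000000;
}
```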
> > 
> > 
> > 
> > -- >8 --
> > Subject: [PATCH] ARM: Tweak clock-based udelay implementation
> > 
> > In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
> > Russell pointed out that loop-based delays may return early.
> > 
> > On the arm platform, delays may be either loop-based or clock-based.
> > 
> > This patch tweaks the clock-based implementation so that udelay(N)
> > is guaranteed to spin at least N microseconds.
> 
> As I've already said, I don't want this, because it encourages people
> to use too-small delays in driver code, and if we merge it then you
> will look at your data sheet, decide it says "you need to wait 10us"
> and write in your driver "udelay(10)" which will break on the loops
> based delay.
> 
> udelay() needs to offer a consistent interface so that drivers know
> what to expect no matter what the implementation is.  Making one
> implementation conform to your ideas while leaving the other
> implementations with other expectations is a recipe for bugs.

udelay() needs to be consistent across platforms, and yes, udelay(10)
is expected to delay at least 10usec.

If that is not true on your platform, _fix your platform_. But it is
not valid to reject patches fixing other platforms, just because your
platform is broken.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


end of thread, other threads:[~2017-12-07 12:43 UTC | newest]

Thread overview: 41+ messages
2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez
2017-10-31 16:44 ` Linus Torvalds
2017-10-31 16:56   ` Russell King - ARM Linux
2017-10-31 17:45     ` Linus Torvalds
2017-10-31 17:58       ` Linus Torvalds
2017-11-01  0:23       ` Doug Anderson
2017-11-01  9:26         ` Russell King - ARM Linux
2017-11-01 15:53           ` Doug Anderson
2017-12-07 12:31             ` Pavel Machek
2017-11-01 19:28           ` Marc Gonzalez
2017-11-01 20:30             ` Russell King - ARM Linux
2017-10-31 16:46 ` Russell King - ARM Linux
2017-11-01 17:53 ` Alan Cox
2017-11-01 19:03   ` Marc Gonzalez
2017-11-01 19:09     ` Linus Torvalds
2017-11-01 19:17       ` Linus Torvalds
2017-11-01 19:38       ` Marc Gonzalez
2017-11-15 12:51         ` Marc Gonzalez
2017-11-15 13:13           ` Russell King - ARM Linux
2017-11-16 15:26             ` Marc Gonzalez
2017-11-16 15:36               ` Russell King - ARM Linux
2017-11-16 15:47                 ` Marc Gonzalez
2017-11-16 16:08                   ` Nicolas Pitre
2017-11-16 16:26                     ` Marc Gonzalez
2017-11-16 16:32                       ` Russell King - ARM Linux
2017-11-16 16:42                         ` Marc Gonzalez
2017-11-16 17:05                           ` Russell King - ARM Linux
2017-11-16 21:05                             ` Marc Gonzalez
2017-11-16 22:15                               ` Doug Anderson
2017-11-16 23:22                                 ` Russell King - ARM Linux
2017-11-20 17:38                                   ` Doug Anderson
2017-11-20 18:31                                     ` Russell King - ARM Linux
2017-11-16 16:47                       ` Nicolas Pitre
2017-11-16 16:51                         ` Marc Gonzalez
2017-11-16 17:00                           ` Nicolas Pitre
2017-12-07 12:43             ` Pavel Machek
2017-11-15 18:45           ` Doug Anderson
2017-11-01 19:36     ` Alan Cox
2017-11-01 19:39     ` Thomas Gleixner
2017-11-01 19:48     ` Baruch Siach
2017-11-02 16:12       ` Boris Brezillon
