* [RFC] Improving udelay/ndelay on platforms where that is possible
@ 2017-10-31 16:15 Marc Gonzalez
2017-10-31 16:44 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Marc Gonzalez @ 2017-10-31 16:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre,
Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann,
Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd,
Mason
Hello BDFL,
I am writing to you directly because Russell has bluntly stated (paraphrased)
"Send your patches to Linus; it's his kernel, and he has the final say." which
is his diplomatic way of telling me to fsck off.
Basically, I want to improve the accuracy of clock-based delays, in order to
improve the boot-time delay in the NAND framework. I will send a patch, but
I wanted to have a discussion about the design first.
The key points under discussion below are
a) guarantee of no under-delays
b) availability of ndelay for arm32
Below is the long-winded rationale.
1) Delays and sleeps (clocks, timers, alarms, etc)
Every operating system kernel (that I know of) offers some way for a kernel thread to
wait for a specified amount of time (expressed in seconds, milliseconds, microseconds,
or even nanoseconds).
For relatively small amounts of time, such a primitive is typically implemented as
a busy-wait spin-loop. Let us call it delay().
Such a primitive cannot guarantee that the calling thread will be delayed *exactly*
the amount of time specified, because there are many sources of inaccuracy:
* the timer's period may not divide the requested amount
* sampling the timer value may have variable latency
* the thread might be preempted for whatever reason
* the timer itself may have varying frequencies
* etc
Therefore, users are accustomed to having delays be longer (within a reasonable margin).
However, very few users would expect delays to be *shorter* than requested.
As you have stated yourself, the vast majority of code in Linux is driver code.
Typical driver writers (of which I am one) are not experts on Linux internals, and may
sometimes need some guidance, either from another programmer or a maintainer (whose time
is a rare resource), or from clear APIs (either well-documented, or just "natural" and
"obvious" interfaces).
A typical driver writer has some HW spec in front of them, which e.g. states:
* poke register A
* wait 1 microsecond for the dust to settle
* poke register B
which most programmers would translate to:
poke(reg_A, val_A);
udelay(1);
poke(reg_B, val_B);
Given a similar example, Russell has stated:
> And if a data sheet says "needs a delay of 2us" and you put in the
> driver udelay(2) then you're doing it wrong, and you need to read
> Linus' mails on this subject, such as the one I've provided a link
> to... that udelay() must be _at least_ udelay(3), if not 4.
I see two key points in this reply.
i) There seems to be an implicit agreement that it is BAD for the *actual*
delay to be less than what the data sheet specifies, leading to...
ii) It is the responsibility of the *driver writer* to figure out how much
of a cushion they need, in order to guarantee a minimal delay.
Obviously, I agree with the first point.
The second point is troubling. It means driver writers are required to be
aware of the quirkiness of Linux internals.
And because drivers are supposed to work on all platforms, are we saying
that driver writers should be aware of the quirkiness for ALL platforms?
For example, assume that for short delays, such as 1 µs:
* amd64 has 5% relative error, 10 ns absolute error
* arm32 has 10% relative error, 1 µs absolute error
* alpha has 3% relative error, 3 µs absolute error
Would the driver writer then need to write udelay(4)?
In my opinion, it makes more sense to hide the complexity and quirkiness
of udelay inside each platform's implementation. Which brings me to...
2) Different implementations of udelay and ndelay
On arm32, it is possible to set up udelay() to be clock-based.
In that case, udelay simply polls a constant-frequency tick-counter.
For example, on my puny little platform, there is a 27 MHz crystal oscillator
(37 ns period) which is wired to a tick counter mapped on the system bus.
The latency to sample the register is within [10-200] ns.
This implies two things:
i) it is possible to guarantee a minimum delay (in quanta of 37 ns)
ii) the sampling error is limited to ~250 ns
[NB: some platforms are even "better", and move the tick counter inside the
CPU block (e.g. ARM architected timer) to achieve a lower sampling error.]
There is one minor trap when handling tick counter sampling:
Assume we are waiting for one period to elapse. Consider the timeline below,
where x marks the cycle when the tick counter register is incremented.
-------x----------x----------x----------x----------x-----> time
                 ^ ^          ^
                 A B          C
If the execution flow leads to sampling at times A and B, then B-A equals 1,
yet only a tiny amount of time has elapsed. To guarantee that at least one
period has elapsed, we must wait until the arithmetic difference is 2
(i.e. one more than required). In the example, until time C.
In other words, if we need to delay for N cycles, the correct code should be:
t0 = readl(tick_address);
while (readl(tick_address) - t0 <= N)
/* spin */ ;
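The trap can be demonstrated with a small userspace simulation (a sketch, not the kernel implementation; tick(), spin() and PERIOD are illustrative names I made up). It models a free-running counter and shows that exiting on "diff < N" can return after almost no time, while the "diff <= N" form above always waits at least N full periods:

```c
#include <assert.h>

/* Simulated free-running counter: increments once every PERIOD time
 * units.  tick(t) is what readl(tick_address) would return at time t. */
#define PERIOD 37 /* e.g. one 27 MHz tick, ~37 ns */

static unsigned int tick(unsigned long t) { return t / PERIOD; }

/* Spin starting at real time t_start; return the real time elapsed when
 * the loop exits.  'strict' models the buggy "diff < n" exit condition,
 * otherwise the correct "diff <= n" form is used. */
static unsigned long spin(unsigned long t_start, unsigned int n, int strict)
{
    unsigned int t0 = tick(t_start);
    unsigned long t = t_start;

    /* advance simulated time by one unit per "poll" */
    while (strict ? (tick(t) - t0 < n) : (tick(t) - t0 <= n))
        t++;
    return t - t_start;
}
```

Starting just before an increment (time A in the diagram), the strict form exits after a single time unit; the corrected form cannot exit before N periods have truly elapsed.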
There is another source of "error" (in the sense that udelay might spin too little)
caused by the conversion from µs to cycles -- which rounds down.
Consider arch/arm/include/asm/delay.h
loops = (loops_per_jiffy * delay_us * UDELAY_MULT) >> 31
where UDELAY_MULT = 2147 * HZ + 483648 * HZ / 1000000
Consider a platform with a 27 MHz clock, and HZ=300
The proper conversion is trivially
loops = delay_us * 27
Thus, for a 1 microsecond delay, loops equals 27.
But when using UDELAY_MULT = 644245 (rounded down from 644245.0944)
loops equals 26.
This "off-by-one" error is systematic over the entire range of allowed
delay_us input (1 to 2000), so it is easy to fix, by adding 1 to the result.
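The off-by-one is easy to reproduce with the actual constants (a userspace sketch; loops() mirrors the arm32 formula quoted above, and the helper names are mine):

```c
#include <stdint.h>

#define HZ 300
/* 2147 * HZ + 483648 * HZ / 1000000, evaluated in integer arithmetic,
 * i.e. rounded down to 644245 (exact value: 644245.0944) */
#define UDELAY_MULT (2147 * HZ + 483648 * HZ / 1000000)

/* 27 MHz tick counter, HZ = 300 -> 90000 loops per jiffy */
static const uint64_t loops_per_jiffy = 27000000 / HZ;

/* The arm32 conversion from microseconds to loop iterations */
static uint64_t loops(uint64_t delay_us)
{
    return (loops_per_jiffy * delay_us * UDELAY_MULT) >> 31;
}
```

For delay_us = 1 this yields 26 instead of the exact 27, and the deficit stays below one loop across the whole 1..2000 range, so adding 1 to the result restores the guarantee.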
3) Why does all this even matter?
At boot, the NAND framework scans the NAND chips for bad blocks;
this operation generates approximately 10^5 calls to ndelay(100);
which cause a 100 ms delay, because ndelay is implemented as a
call to the nearest udelay (rounded up).
My current NAND chips are tiny (2 x 512 MB) but with larger chips,
the number of calls to ndelay would climb to 10^6 and the delay
increase to 1 second, which is starting to be a problem.
One solution is to implement ndelay, but ndelay is more prone to
under-delays, and thus a prerequisite is fixing under-delays.
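Assuming the fallback simply rounds nanoseconds up to whole microseconds before calling udelay (a sketch of the behaviour described above, not the kernel's exact code), the boot-time cost works out like this:

```c
#include <stdint.h>

/* Fallback sketch: ndelay(ns) rounds up to the nearest microsecond and
 * spins for that long instead. */
static uint64_t ndelay_cost_ns(uint64_t ns)
{
    uint64_t us = (ns + 999) / 1000; /* DIV_ROUND_UP(ns, 1000) */
    return us * 1000;                /* time actually spent, in ns */
}
```

Each ndelay(100) therefore burns a full 1000 ns: 10^5 calls cost 100 ms where a true ndelay would cost only 10 ms.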
Regards.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez
@ 2017-10-31 16:44 ` Linus Torvalds
2017-10-31 16:56 ` Russell King - ARM Linux
2017-10-31 16:46 ` Russell King - ARM Linux
2017-11-01 17:53 ` Alan Cox
2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2017-10-31 16:44 UTC (permalink / raw)
To: Marc Gonzalez
Cc: LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre,
Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann,
Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd,
Mason

On Tue, Oct 31, 2017 at 9:15 AM, Marc Gonzalez
<marc_gonzalez@sigmadesigns.com> wrote:
>
> On arm32, it is possible to set up udelay() to be clock-based.

I'm not sure why this is discussed as some kind of generic problem.

Just do what x86-64 already does. We use the TSC, and ndelay() does the
right thing too. I've not checked precision, but it should be close to
the TSC update frequency. So we're talking tens of megahertz.

Does it work on all PC's? No. It's only half-way reliable on the ones
where the TSC frequency doesn't change. So there are no *guarantees*.

Linus

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-10-31 16:44 ` Linus Torvalds
@ 2017-10-31 16:56 ` Russell King - ARM Linux
2017-10-31 17:45 ` Linus Torvalds
0 siblings, 1 reply; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-10-31 16:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Marc Gonzalez, Mark Rutland, Mason, Jonathan Austin, Arnd Bergmann,
Nicolas Pitre, Peter Zijlstra, Stephen Boyd, Michael Turquette,
Kevin Hilman, Will Deacon, LKML, Steven Rostedt, Douglas Anderson,
John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM

On Tue, Oct 31, 2017 at 09:44:20AM -0700, Linus Torvalds wrote:
> On Tue, Oct 31, 2017 at 9:15 AM, Marc Gonzalez
> <marc_gonzalez@sigmadesigns.com> wrote:
> >
> > On arm32, it is possible to set up udelay() to be clock-based.
>
> I'm not sure why this is discussed as some kind of generic problem.

Hi Linus,

Marc is stating something that's incorrect there. On ARM32, we don't
have a TSC, and we aren't guaranteed to have a timer usable for delays.
Where there is a suitable timer, it can be used for delays.

However, where there isn't a timer, we fall back to using the software
loop, and that's where the problem lies. For example, some platforms
have a relatively slow timer (32kHz).

This centres around the discussion we had previously:

http://lists.openwall.net/linux-kernel/2011/01/12/372

where errors of 5% are very much in the "don't care" category for
udelay() - so if udelay() returns early by 5%, we don't care.

Marc, however, does want to care about udelay() etc returning early,
because he wants his NAND driver to be performant, despite using
udelay()/ndelay() in critical paths. So his argument is that udelay()
and ndelay() should be accurate, and certainly not return early.

Marc's motivation here is to try and force me to "fix" the ARM and
generic loops_per_jiffy code so that these functions provide accurate
delays. I've said no, based on what we discussed back in 2011.
There's the issue too of the timer-based delay code possibly returning
one timer tick early, which at reasonable timer resolutions is pretty
small. I don't want to do that, as it encourages people to use
data-sheet values in udelay() and ndelay() without any cushioning, and
so the drivers end up breaking if we fall back to using the software
loop delays - which we know for a fact will return early.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-10-31 16:56 ` Russell King - ARM Linux
@ 2017-10-31 17:45 ` Linus Torvalds
2017-10-31 17:58 ` Linus Torvalds
2017-11-01 0:23 ` Doug Anderson
0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2017-10-31 17:45 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Marc Gonzalez, Mark Rutland, Mason, Jonathan Austin, Arnd Bergmann,
Nicolas Pitre, Peter Zijlstra, Stephen Boyd, Michael Turquette,
Kevin Hilman, Will Deacon, LKML, Steven Rostedt, Douglas Anderson,
John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM

On Tue, Oct 31, 2017 at 9:56 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
>
> Marc is stating something that's incorrect there. On ARM32, we don't
> have a TSC, and we aren't guaranteed to have a timer usable for delays.
> Where there is a suitable timer, it can be used for delays.
>
> However, where there isn't a timer, we fall back to using the software
> loop, and that's where the problem lies. For example, some platforms
> have a relatively slow timer (32kHz).

Right. So that is actually the basic issue: there is no way for us to
really _ever_ give any kind of guarantees about the behavior of
udelay/ndelay() in every general case.

We can't even guarantee some kind of "at least" behavior, because on
some platforms there is no reasonable stable clock at all.

We can give good results in certain _particular_ cases, but not in some
kind of blanket "we will always do well" way.

Traditionally, we used to obviously do the bogo-loop, but it depends on
processor frequency, which can (and does) change even outside SW
control, never mind things like interrupts etc.

On lots of platforms, we can generally do platform-specific clocks.
On modern x86, as mentioned, the TSC is stable and fairly high
frequency (it isn't really the gigahertz frequency that it reports -
reading it takes time, and even ignoring that, the implementation is
actually not a true adder at the reported frequency, but it is
generally tens or hundreds of megahertz, so you should get something
that is close to the "tens of nanoseconds" resolution).

But on others we can't even get *close* to that kind of behavior, and
if the clock is something like a 32kHz timer that you mention, you
obviously aren't going to get even microsecond resolution, much less
nanoseconds.

You can (and on x86 we do) calibrate a faster non-architected clock
against a slow clock, but all the faster clocks tend to have that
frequency shifting issue.

So then you tend to be forced to simply rely on platform-specific hacks
if you really need something more precise. Most people don't, which is
why most people just use udelay() and friends.

In particular, several drivers end up depending not on an explicit
clock at all, but on the IO fabric itself. For a driver for a
particular piece of hardware, that is often the sanest way to do really
short timing: if you know you are on a PCI bus and you know your own
hardware, you can often do things like "reading the status register
takes 6 bus cycles, which is 200 nsec".

Things like that are very hacky, but for a driver that is looking at
times in the usec range, it's often the best you can do.

Don't get me wrong. I think

(a) platform code could try to make their udelay/ndelay() be as good
as it can be on a particular platform

(b) we could maybe export some interface to give estimated errors so
that drivers could then try to correct for them depending on just how
much they care.

so I'm certainly not _opposed_ to trying to improve on udelay/ndelay().
It's just that for the generic case, we know we're never going to be
very good, and the error (both absolute and relative) can be pretty
damn big.
One of the issues has historically been that because so few people
care, and because there are probably more platforms than there are
cases that care deeply, even that (a) thing is actually fairly hard to
do.

On the x86 side, for example, I doubt that most core kernel developers
even have access to platforms that have unstable TSC's any more. I
certainly don't. I complained to Intel for many many _years_, but they
finally did fix it, and now it's been a long time since I cared.

That's why I actually would encourage driver writers that really care
deeply about delays to look at ways to get those delays from their own
hardware (ie exactly that "read the status register three times" kind
of model). It sounds hacky, but it couples the timing constraint with
the piece of hardware that actually depends on it, which means that you
don't get the nasty kinds of "worry about each platform" complications.

I realize that this is not what people want to hear. In a perfect
world, we'd just make "ndelay()" work and give the right behavior, and
have some strictly bounded error. It's just that it's really
fundamentally hard in the general case, even if it sounds like it
should be pretty trivial in most _particular_ cases.

So I'm very much open to udelay improvements, and if somebody sends
patches for particular platforms to do particularly well on that
platform, I think we should merge them. But ...

Linus

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-10-31 17:45 ` Linus Torvalds
@ 2017-10-31 17:58 ` Linus Torvalds
2017-11-01 0:23 ` Doug Anderson
1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2017-10-31 17:58 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Marc Gonzalez, Mark Rutland, Mason, Jonathan Austin, Arnd Bergmann,
Nicolas Pitre, Peter Zijlstra, Stephen Boyd, Michael Turquette,
Kevin Hilman, Will Deacon, LKML, Steven Rostedt, Douglas Anderson,
John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM

On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'm very much open to udelay improvements, and if somebody sends
> patches for particular platforms to do particularly well on that
> platform, I think we should merge them. But ...

Side note: this is obviously a negative feedback circle, where because
we know we can't do all that great in general, few platforms really
bother to do as well as they perhaps could, since nobody should
generally care that deeply due to the generic problem..

Again, if a driver that has tight timing requirements can just generate
the timings from its own hardware, that tends to JustWorks(tm). It was
what people used to do traditionally at least on the PC/AT platform.
None of this "udelay()" crud, you knew that doing a single PIO read
took one usec, so you'd just delay that way.

Then the hardware moved from ISA to EISA, and everything sped up, but
that was ok, because now it needed a shorter delay anyway...

Doing things like a status register read is often a good way (sometimes
the _only_ way) to make sure preceding writes have been flushed all the
way to the device anyway, so it can have those kinds of incidental
advantages.

Of course, all of this is relevant only for delays on the scale of
individual device accesses. But that's also when udelay/ndelay() tends
to have the most problems.

Linus

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-10-31 17:45 ` Linus Torvalds
2017-10-31 17:58 ` Linus Torvalds
@ 2017-11-01 0:23 ` Doug Anderson
2017-11-01 9:26 ` Russell King - ARM Linux
1 sibling, 1 reply; 41+ messages in thread
From: Doug Anderson @ 2017-11-01 0:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Russell King - ARM Linux, Marc Gonzalez, Mark Rutland, Mason,
Jonathan Austin, Arnd Bergmann, Nicolas Pitre, Peter Zijlstra,
Stephen Boyd, Michael Turquette, Kevin Hilman, Will Deacon, LKML,
Steven Rostedt, John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM

Hi,

On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I'm very much open to udelay improvements, and if somebody sends
> patches for particular platforms to do particularly well on that
> platform, I think we should merge them. But ...

If I'm reading this all correctly, this sounds like you'd be willing
to merge <https://patchwork.kernel.org/patch/9429841/>. This makes
udelay() guaranteed not to underrun on arm32 platforms.

-Doug

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-11-01 0:23 ` Doug Anderson
@ 2017-11-01 9:26 ` Russell King - ARM Linux
2017-11-01 15:53 ` Doug Anderson
2017-11-01 19:28 ` Marc Gonzalez
0 siblings, 2 replies; 41+ messages in thread
From: Russell King - ARM Linux @ 2017-11-01 9:26 UTC (permalink / raw)
To: Doug Anderson
Cc: Linus Torvalds, Mark Rutland, Jonathan Austin, Arnd Bergmann,
Mason, Peter Zijlstra, Will Deacon, Michael Turquette, Nicolas Pitre,
Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman, John Stultz,
Thomas Gleixner, Ingo Molnar, Linux ARM, Marc Gonzalez

On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
> Hi,
>
> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > So I'm very much open to udelay improvements, and if somebody sends
> > patches for particular platforms to do particularly well on that
> > platform, I think we should merge them. But ...
>
> If I'm reading this all correctly, this sounds like you'd be willing
> to merge <https://patchwork.kernel.org/patch/9429841/>. This makes
> udelay() guaranteed not to underrun on arm32 platforms.

That's a mis-representation again. It stops a timer-based udelay()
possibly underrunning by one tick if we are close to the start of
a count increment. However, it does nothing for the loops_per_jiffy
udelay(), which can still underrun.

My argument against merging that patch is that with it merged, we get
(as you say) a udelay() that doesn't underrun _when using a timer_
but when we end up using the loops_per_jiffy udelay(), we're back to
the old problem.

My opinion is that's bad, because it encourages people to write drivers
that rely on udelay() having "good" behaviour, which it is not
guaranteed to have. So, they'll specify a delay period of exactly what
they want, and their drivers will then fail when running on systems
that aren't using a timer-based udelay().
If we want udelay() to have this behaviour, it needs to _always_ have
this behaviour irrespective of the implementation. So that means
the loops_per_jiffy version also needs to be fixed in the same way,
which IMHO is impossible.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-11-01 9:26 ` Russell King - ARM Linux
@ 2017-11-01 15:53 ` Doug Anderson
2017-12-07 12:31 ` Pavel Machek
1 sibling, 1 reply; 41+ messages in thread
From: Doug Anderson @ 2017-11-01 15:53 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Linus Torvalds, Mark Rutland, Jonathan Austin, Arnd Bergmann,
Mason, Peter Zijlstra, Will Deacon, Michael Turquette, Nicolas Pitre,
Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman, John Stultz,
Thomas Gleixner, Ingo Molnar, Linux ARM, Marc Gonzalez

Hi,

On Wed, Nov 1, 2017 at 2:26 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
>> Hi,
>>
>> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> > So I'm very much open to udelay improvements, and if somebody sends
>> > patches for particular platforms to do particularly well on that
>> > platform, I think we should merge them. But ...
>>
>> If I'm reading this all correctly, this sounds like you'd be willing
>> to merge <https://patchwork.kernel.org/patch/9429841/>. This makes
>> udelay() guaranteed not to underrun on arm32 platforms.
>
> That's a mis-representation again. It stops a timer-based udelay()
> possibly underrunning by one tick if we are close to the start of
> a count increment. However, it does nothing for the loops_per_jiffy
> udelay(), which can still underrun.
>
> My argument against merging that patch is that with it merged, we get
> (as you say) a udelay() that doesn't underrun _when using a timer_
> but when we end up using the loops_per_jiffy udelay(), we're back to
> the old problem.
>
> My opinion is that's bad, because it encourages people to write drivers
> that rely on udelay() having "good" behaviour, which it is not
> guaranteed to have.
> So, they'll specify a delay period of exactly what they want,
> and their drivers will then fail when running on systems that aren't
> using a timer-based udelay().

IMHO the current udelay is broken in an off-by-one way and it's easy
to fix. Intentionally leaving a bug in the code seems silly. This
seems to be what Linus is saying with his statement that "(a) platform
code could try to make their udelay/ndelay() be as good as it can be
on a particular platform".

So no matter the rest of the discussions, we should land that. If you
disagree then I'm happy to re-post that patch straight to Linus later
this week since it sounds as if he'd take it.

> If we want udelay() to have this behaviour, it needs to _always_ have
> this behaviour irrespective of the implementation. So that means
> the loops_per_jiffy version also needs to be fixed in the same way,
> which IMHO is impossible.

As Linus indicates, if there is a way to code things up that doesn't
rely on udelay then that should be preferred. However, there may be
cases where this is exceedingly difficult. If you're writing a driver
at a high enough level that will work on a lot of underlying platforms
(AKA it's platform-agnostic) then you can't necessarily rely on timing
an individual hardware read. Since you're writing high-level
platform-agnostic code, presumably implementing a 1 us delay in a
generic way is equally difficult to making the platform-agnostic
udelay() reliable.

IMHO it would be OK to put in a requirement in a driver saying that it
will only function properly on hardware that has a udelay() that is
guaranteed to never return early. As Linus says, it's doubtful that
"most core kernel developers even have access to platforms that have
unstable TSC's any more." Presumably all those old platforms aren't
suddenly going to be attached to new devices unless those new devices
are connected by a PCI, ISA, USB, etc. bus.
Drivers for components connected by non-external busses seem like they
don't need to take into account the quirks of really ancient hardware.

Yes, I know there are still some arm32 chips that aren't that old and
that don't have a CP15-based timer. We should make sure we don't change
existing drivers and frameworks in a way that will break those boards.
If that means we need to figure out how to add an API, as Linus says,
to indicate how accurate udelay is then that might be a solution.
Another would be to come up with some clever solution on affected
boards. Most arm32 boards I'm aware of have other (non-CP15-based)
timers. ...if they don't and these are real boards that are actually
using a driver relying on udelay() then perhaps they could add a new
board-specific udelay() implementation that delayed by reading a
specific hardware register with known timing.

Said another way: if we're writing a high level NAND driver and we
can't find a better way than udelay() to ensure timing requirements,
then the driver should use udelay() and document the fact that it must
not underrun (ideally it could even test for it at runtime). If that
NAND driver will never be used on platforms with an unreliable udelay()
then we don't need to worry about it. If we find a platform where we
need this NAND driver, we should find a way to implement a udelay()
that will, at least, never underestimate.

-Doug

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-11-01 15:53 ` Doug Anderson
@ 2017-12-07 12:31 ` Pavel Machek
0 siblings, 0 replies; 41+ messages in thread
From: Pavel Machek @ 2017-12-07 12:31 UTC (permalink / raw)
To: Doug Anderson
Cc: Russell King - ARM Linux, Linus Torvalds, Mark Rutland,
Jonathan Austin, Arnd Bergmann, Mason, Peter Zijlstra, Will Deacon,
Michael Turquette, Nicolas Pitre, Stephen Boyd, Steven Rostedt, LKML,
Kevin Hilman, John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM,
Marc Gonzalez

Hi!

> On Wed, Nov 1, 2017 at 2:26 AM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds
> >> <torvalds@linux-foundation.org> wrote:
> >> > So I'm very much open to udelay improvements, and if somebody sends
> >> > patches for particular platforms to do particularly well on that
> >> > platform, I think we should merge them. But ...
> >>
> >> If I'm reading this all correctly, this sounds like you'd be willing
> >> to merge <https://patchwork.kernel.org/patch/9429841/>. This makes
> >> udelay() guaranteed not to underrun on arm32 platforms.
> >
> > That's a mis-representation again. It stops a timer-based udelay()
> > possibly underrunning by one tick if we are close to the start of
> > a count increment. However, it does nothing for the loops_per_jiffy
> > udelay(), which can still underrun.
> >
> > My argument against merging that patch is that with it merged, we get
> > (as you say) a udelay() that doesn't underrun _when using a timer_
> > but when we end up using the loops_per_jiffy udelay(), we're back to
> > the old problem.
> >
> > My opinion is that's bad, because it encourages people to write drivers
> > that rely on udelay() having "good" behaviour, which it is not
> > guaranteed to have.
> > So, they'll specify a delay period of exactly what they want,
> > and their drivers will then fail when running on systems that aren't
> > using a timer-based udelay().
>
> IMHO the current udelay is broken in an off-by-one way and it's easy
> to fix. Intentionally leaving a bug in the code seems silly. This
> seems to be what Linus is saying with his statement that "(a) platform
> code could try to make their udelay/ndelay() be as good as it can be
> on a particular platform".
>
> So no matter the rest of the discussions, we should land that. If you
> disagree then I'm happy to re-post that patch straight to Linus later
> this week since it sounds as if he'd take it.

Did this get fixed in any way?

Russell having crazy arguments for keeping kernel buggy should not be
good enough reason to keep the bugs...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
2017-11-01 9:26 ` Russell King - ARM Linux
2017-11-01 15:53 ` Doug Anderson
@ 2017-11-01 19:28 ` Marc Gonzalez
2017-11-01 20:30 ` Russell King - ARM Linux
1 sibling, 1 reply; 41+ messages in thread
From: Marc Gonzalez @ 2017-11-01 19:28 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Doug Anderson, Linus Torvalds, Mark Rutland, Jonathan Austin,
Arnd Bergmann, Peter Zijlstra, Will Deacon, Michael Turquette,
Nicolas Pitre, Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman,
John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM, Mason

On 01/11/2017 10:26, Russell King - ARM Linux wrote:

> On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote:
>
>> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds wrote:
>>
>>> So I'm very much open to udelay improvements, and if somebody sends
>>> patches for particular platforms to do particularly well on that
>>> platform, I think we should merge them. But ...
>>
>> If I'm reading this all correctly, this sounds like you'd be willing
>> to merge <https://patchwork.kernel.org/patch/9429841/>. This makes
>> udelay() guaranteed not to underrun on arm32 platforms.
>
> That's a mis-representation again. It stops a timer-based udelay()
> possibly underrunning by one tick if we are close to the start of
> a count increment. However, it does nothing for the loops_per_jiffy
> udelay(), which can still underrun.

It is correct that improving the clock-based implementation does
strictly nothing for the loop-based implementation.

Is it possible to derive an upper bound on the amount of under-run when
using the loop-based delay on arm32?

> My argument against merging that patch is that with it merged, we get
> (as you say) a udelay() that doesn't underrun _when using a timer_
> but when we end up using the loops_per_jiffy udelay(), we're back to
> the old problem.
>
> My opinion is that's bad, because it encourages people to write drivers
> that rely on udelay() having "good" behaviour, which it is not
> guaranteed to have. So, they'll specify a delay period of exactly what
> they want, and their drivers will then fail when running on systems
> that aren't using a timer-based udelay().
>
> If we want udelay() to have this behaviour, it needs to _always_ have
> this behaviour irrespective of the implementation. So that means
> the loops_per_jiffy version also needs to be fixed in the same way,
> which IMHO is impossible.

Let's say some piece of HW absolutely, positively, unequivocally,
uncompromisingly, requires a strict minimum of 10 microseconds elapsing
between operations A and B.

You say a driver writer must not write udelay(10); They have to take
into account the possibility of under-delay. How much additional delay
should they add? 10%? 20%? 50%? A percentage + a fixed quantity?

If there is an actual rule, then it could be incorporated in the
loop-based implementation?

If it is impossible to say (as Linus hinted for some platforms) then
this means there is no way to guarantee a minimal delay?

Regards.

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-01 19:28 ` Marc Gonzalez @ 2017-11-01 20:30 ` Russell King - ARM Linux 0 siblings, 0 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2017-11-01 20:30 UTC (permalink / raw) To: Marc Gonzalez Cc: Doug Anderson, Linus Torvalds, Mark Rutland, Jonathan Austin, Arnd Bergmann, Peter Zijlstra, Will Deacon, Michael Turquette, Nicolas Pitre, Stephen Boyd, Steven Rostedt, LKML, Kevin Hilman, John Stultz, Thomas Gleixner, Ingo Molnar, Linux ARM, Mason On Wed, Nov 01, 2017 at 08:28:18PM +0100, Marc Gonzalez wrote: > On 01/11/2017 10:26, Russell King - ARM Linux wrote: > > > On Tue, Oct 31, 2017 at 05:23:19PM -0700, Doug Anderson wrote: > > > >> On Tue, Oct 31, 2017 at 10:45 AM, Linus Torvalds wrote: > >> > >>> So I'm very much open to udelay improvements, and if somebody sends > >>> patches for particular platforms to do particularly well on that > >>> platform, I think we should merge them. But ... > >> > >> If I'm reading this all correctly, this sounds like you'd be willing > >> to merge <https://patchwork.kernel.org/patch/9429841/>. This makes > >> udelay() guaranteed not to underrun on arm32 platforms. > > > > That's a mis-representation again. It stops a timer-based udelay() > > possibly underrunning by one tick if we are close to the start of > > a count increment. However, it does nothing for the loops_per_jiffy > > udelay(), which can still underrun. > > It is correct that improving the clock-based implementation does strictly > nothing for the loop-based implementation. > > Is it possible to derive a higher bound on the amount of under-run when > using the loop-based delay on arm32? Not really. If you read the archived thread via URL that I gave you when this was initially brought up, specifically the first email in the thread, I give a full analysis of the loop-based delay, why it gives short delays. 
What the analysis says, specifically point (2) in my initial email, is that the inaccuracy is dependent on two things: 1. the CPU speed 2. the time that the CPU has to spend executing the timer interrupt handler. For any particular kernel, we could assume that there are a fixed number of cycles to execute the timer interrupt handler. Let's call this t_timer. The timer fires every t_period. We measure how many loops we can do between two t_period interrupts, and that includes t_timer each time. So, the time that we're able to measure via the delay loop is t_period - t_timer (which is the equation I give in point (2) in that email). If the CPU is clocked slowly, then t_timer gets larger, but t_period remains the same. So, the error gets bigger. Conversely, the faster the CPU, the smaller the error. Further in that initial email, I give actual measured delay values for one of my systems (I forget which) for delays from 1us up to 5ms in 1, 2, 5 decade multiples. Small delays are longer than desired because of the overhead of computing the number of loops. Long delays are shorter because of the inaccuracies. Note also that (1) in the original email indicates that the loops_per_jiffy value is biased towards a smaller value rather than a larger value - and that also adds to the "it's shorter than requested" problem. At the end of the email, I proposed an improvement to the ARM implementation, which reduces the amount of underrun by correcting the calculations to round up. However, this adds four additional instructions to the computation, which has the effect of making small delays ever so slightly longer than they are already. That said, in the last six years, ARM CPUs have become a lot faster, so the effect of those four instructions is now reduced. 
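The relationship in point (2) can be sketched with a toy calculation. All numbers fed into it are invented for illustration (nothing here is measured from a real board), but the model follows the analysis above: calibration sees t_period - t_timer, so every loop-based delay runs short by roughly t_timer / t_period.

```c
/* Toy model of the calibration error described above: the delay loop is
 * calibrated against t_period - t_timer rather than t_period, so every
 * loop-based delay runs short by roughly t_timer / t_period.
 * Inputs are illustrative, not measurements. */
static unsigned long calibrated_lpj(unsigned long cpu_hz, unsigned long hz,
                                    unsigned long timer_irq_cycles)
{
        unsigned long period_cycles = cpu_hz / hz;  /* cycles per jiffy */

        /* calibration only counts cycles left over after the timer IRQ */
        return period_cycles - timer_irq_cycles;
}

/* Real time actually spun, in microseconds, for a requested delay. */
static double effective_udelay(unsigned long cpu_hz, unsigned long hz,
                               unsigned long timer_irq_cycles,
                               unsigned long us)
{
        /* loop count computed from the (under-)calibrated lpj ... */
        double loops = (double)calibrated_lpj(cpu_hz, hz, timer_irq_cycles)
                       * hz * us / 1e6;

        /* ... converted back to the real time those loops take */
        return loops / cpu_hz * 1e6;
}
```

With a 1 GHz CPU, HZ=100 and an assumed 100000-cycle timer handler, udelay(1000) spins for about 990 us — 1% short; halve the CPU clock with the same handler and the shortfall doubles, matching the "slower CPU, bigger error" point above.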
Now, if you read Linus' reply to my initial email, you'll see that Linus stated very clearly that the error caused by that is in the "don't care too much" (quoting Linus) category, which is what I've been trying to tell you all this time. The slight underrun that we get with the software loop is not something we care about; we know that it happens, and there's very little motivation to fix it.

What does that mean? Well, if we have a super-accurate udelay() implementation based on timers, that's really great: it means we can have accurate delays that won't expire before the time they're supposed to. I agree that's a good thing. There's also a major gotcha, which is that if we have to fall back to using the software-based udelay(), then we immediately start losing. As can be seen from the figures I've given, if you ask for a 1ms delay, it's about 1% short, so 990us. Do those 10us matter? That depends on the driver, buses, and what the delay is being used for - but the point is, with the software loop, we _can't_ guarantee that if we ask for a 1ms delay, we'll get at least a 1ms delay.

Now, what does this 1% equate to for a timer-based delay? If the timer ticks at 10MHz, giving a 100ns resolution, and we miss by one tick, that's a 100ns error. If you now look at the figures I measured from the software udelay(), that's peanuts compared to the error with the software loop. If your timer ticks at 1MHz, then maybe it's a bigger problem, as the error would be 1us - but then do you really want to use a 1MHz counter to implement udelay(), which suffers from not knowing where in its count cycle you start? If you request 1us and you wait for two ticks, you could be waiting close to 2us. Probably not the best choice if you're trying to bitbang a serial bus, since it'll make everything twice as slow as using the software delay loop! It's really all about balances and tradeoffs.
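The tick-resolution trade-off can be bounded exactly for an idealized count-up timer (a simplified model, not the actual arch/arm/lib/delay.c code): waiting for one extra tick edge guarantees the minimum, at the cost of up to one full extra tick.

```c
/* Idealized count-up timer model: the request is rounded up to whole
 * ticks, and the start phase within a tick is unknown. */
static unsigned long long ticks_needed(unsigned long long ns,
                                       unsigned long long tick_ns)
{
        return (ns + tick_ns - 1) / tick_ns;              /* round up */
}

/* Waiting for (n + 1) edges guarantees at least n full ticks elapsed. */
static unsigned long long min_elapsed_ns(unsigned long long ns,
                                         unsigned long long tick_ns)
{
        return ticks_needed(ns, tick_ns) * tick_ns;       /* never below ns */
}

static unsigned long long max_elapsed_ns(unsigned long long ns,
                                         unsigned long long tick_ns)
{
        return (ticks_needed(ns, tick_ns) + 1) * tick_ns; /* worst phase */
}
```

At 10 MHz (tick_ns = 100) a 1us request lands in [1000, 1100] ns; at 1 MHz the same request lands in [1000, 2000] ns — the "close to 2us" case described above.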
> > If we want udelay() to have this behaviour, it needs to _always_ have > > this behaviour irrespective of the implementation. So that means > > the loops_per_jiffy version also needs to be fixed in the same way, > > which IMHO is impossible. > > Let's say some piece of HW absolutely, positively, unequivocally, > uncompromisingly, requires a strict minimum of 10 microseconds > elapsing between operations A and B. > > You say a driver writer must not write udelay(10); Correct, because udelay() may return early. > They have to take into account the possibility of under-delay. > How much additional delay should they add? > 10%? 20%? 50%? A percentage + a fixed quantity? > > If there is an actual rule, then it could be incorporated in the > loop-based implementation? Well, there are two ways that udelay() gets used: 1. As a busy-wait for short intervals while waiting for hardware to produce an event, such as: /* wait */ timeout = 1000; do { tc_read(DP0_LTSTAT, &value); udelay(1); } while ((!(value & LT_LOOPDONE)) && (--timeout)); if (timeout == 0) { dev_err(tc->dev, "Link training timeout!\n"); Here, the "timeout" will be way over-estimated, so that a short udelay() has little effect, and to give the hardware more than enough time to respond. The same is done in other drivers, eg: hdmi_phy_wait_i2c_done(hdmi, 1000); static bool hdmi_phy_wait_i2c_done(struct dw_hdmi *hdmi, int msec) { u32 val; while ((val = hdmi_readb(hdmi, HDMI_IH_I2CMPHY_STAT0) & 0x3) == 0) { if (msec-- == 0) return false; udelay(1000); } hdmi_writeb(hdmi, val, HDMI_IH_I2CMPHY_STAT0); It probably doesn't take anywhere near 1 _second_ for the PHY to complete the write operation, but the point is to allow progress to be made if it did rather than the kernel just coming to a dead stop. 2. Drivers that use udelay() to produce bus timings, eg, i2c-algo-bit.c. The value of adap->udelay is the period of the clock. 
There is a fairly small set of standard I2C bus maximum frequencies (100kHz, 400kHz; we'll ignore the higher ones because they're not relevant here):

drivers/i2c/busses/i2c-hydra.c:         .udelay = 5,
drivers/i2c/busses/i2c-versatile.c:     .udelay = 30,
drivers/i2c/busses/i2c-simtec.c:        pd->bit.udelay = 20;
drivers/i2c/busses/i2c-parport.c:       .udelay = 10,  /* ~50 kbps */
drivers/i2c/busses/i2c-parport.c:       adapter->algo_data.udelay = 50;  /* ~10 kbps */
drivers/i2c/busses/i2c-acorn.c:         .udelay = 80,
drivers/i2c/busses/i2c-via.c:           .udelay = 5,
drivers/i2c/busses/i2c-parport-light.c: .udelay = 50,

So we have 200kHz, 33kHz, 50kHz, 100kHz, 20kHz, 12kHz, etc. However, note that some of those (eg, parport) are writing to an ISA bus, and ISA buses are comparatively slow, and will stretch the cycle time beyond that implied by the specified delay.

These figures are probably arrived at by repeated test and measurement of the bus behaviour over a range of conditions, rather than any particular fixed "we need to inflate by a particular %age and add a fixed value". I've been there - I've had a parallel port bit-banging a serial protocol using a driver with udelay(), and if you want it to perform with the minimum of overhead, it's very much a case of "connect the 'scope, measure the behaviour, adjust the software to cater for the overheads while leaving a margin."

All in all, there is no nice answer to "I want udelay() to be accurate" or even "I want udelay() to return only after the minimum specified time". There are ways we can do it, but the point is that the kernel _as a whole_ cannot make either of those guarantees to driver authors, and driver authors must not rely on these functions to produce accurate delays. I'm sorry that I can't give you exact figures in answer to your question, but I'm trying to give you the full understanding of why this is the case, and why you should not make the assumptions you want to about these functions.
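Taking adap->udelay as the full clock period, as read above, the conversion to a nominal bus frequency is trivial; a sketch (the overhead parameter is a made-up illustration of per-cycle bus latency, not a measured ISA figure):

```c
/* Nominal bit-bang bus frequency implied by an i2c-algo-bit udelay
 * value, treating udelay as the full clock period in microseconds
 * (the reading used in the list above). */
static unsigned int nominal_khz(unsigned int udelay_us)
{
        return 1000 / udelay_us;        /* 1 MHz / period */
}

/* With per-cycle bus overhead added (hypothetical figure), the achieved
 * frequency drops below nominal -- the effect described for ISA. */
static unsigned int effective_khz(unsigned int udelay_us,
                                  unsigned int overhead_ns)
{
        return 1000000 / (udelay_us * 1000 + overhead_ns);
}
```

So .udelay = 5 gives a nominal 200kHz, .udelay = 30 gives 33kHz, and so on down the list; any bus overhead only pushes the real figure lower.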
-- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez 2017-10-31 16:44 ` Linus Torvalds @ 2017-10-31 16:46 ` Russell King - ARM Linux 2017-11-01 17:53 ` Alan Cox 2 siblings, 0 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2017-10-31 16:46 UTC (permalink / raw) To: Marc Gonzalez Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Mason On Tue, Oct 31, 2017 at 05:15:34PM +0100, Marc Gonzalez wrote: > I am writing to you directly because Russell has bluntly stated (paraphrased) > "Send your patches to Linus; it's his kernel, and he has the final say." which > is his diplomatic way of telling me to fsck off. ... after I've already pointed you at archived emails from Linus himself giving you all the information you need, and me stating the reasons for it being the way it is. Why can you not accept what has already been said? Why do you need to continually re-open old discussions only to be told the same thing as has already been said? Why this "if I don't get the answer I want, I'm going to keep on trying until I do" attitude? -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez 2017-10-31 16:44 ` Linus Torvalds 2017-10-31 16:46 ` Russell King - ARM Linux @ 2017-11-01 17:53 ` Alan Cox 2017-11-01 19:03 ` Marc Gonzalez 2 siblings, 1 reply; 41+ messages in thread From: Alan Cox @ 2017-11-01 17:53 UTC (permalink / raw) To: Marc Gonzalez Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Mason On Tue, 31 Oct 2017 17:15:34 +0100 > Therefore, users are accustomed to having delays be longer (within a reasonable margin). > However, very few users would expect delays to be *shorter* than requested. If your udelay can be under by 10% then just bump the number by 10%. However at that level most hardware isn't that predictable anyway because the fabric between the CPU core and the device isn't some clunky serialized link. Writes get delayed, they can bunch together, busses do posting and queueing. Then there is virtualisation 8) > A typical driver writer has some HW spec in front of them, which e.g. states: > > * poke register A > * wait 1 microsecond for the dust to settle > * poke register B Rarely because of posting. It's usually write while(read() != READY); write and even when you've got a legacy device with timeouts its write read delay write and for sub 1ms delays I suspect the read and bus latency actually add a randomization sufficient that it's not much of an optimization to worry about an accurate ndelay(). > This "off-by-one" error is systematic over the entire range of allowed > delay_us input (1 to 2000), so it is easy to fix, by adding 1 to the result. 
And that + 1 might be worth adding, but really there isn't a lot of modern hardware that has a bus that behaves like software folks imagine, and everything has percentage errors factored into published numbers.

> 3) Why does all this even matter?
>
> At boot, the NAND framework scans the NAND chips for bad blocks;
> this operation generates approximately 10^5 calls to ndelay(100);
> which cause a 100 ms delay, because ndelay is implemented as a
> call to the nearest udelay (rounded up).

So why aren't you doing that on both NANDs in parallel and asynchronously to other parts of boot? If you start scanning at early boot time, do you need the bad block list before mounting / - or are you stuck with a single-threaded CPU and PIO? For that matter, given the bad blocks don't randomly change, why not cache them?

> My current NAND chips are tiny (2 x 512 MB) but with larger chips,
> the number of calls to ndelay would climb to 10^6 and the delay
> increase to 1 second, which is starting to be a problem.
>
> One solution is to implement ndelay, but ndelay is more prone to
> under-delays, and thus a prerequisite is fixing under-delays.

For ndelay you probably have to make it platform-specific, or just use udelay if not. We do have a few cases where we wanted 400ns delays in the PC world (ATA), but not many.

Alan

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-01 17:53 ` Alan Cox @ 2017-11-01 19:03 ` Marc Gonzalez 2017-11-01 19:09 ` Linus Torvalds ` (3 more replies) 0 siblings, 4 replies; 41+ messages in thread From: Marc Gonzalez @ 2017-11-01 19:03 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Mason On 01/11/2017 18:53, Alan Cox wrote: > On Tue, 31 Oct 2017 17:15:34 +0100 > >> Therefore, users are accustomed to having delays be longer (within a reasonable margin). >> However, very few users would expect delays to be *shorter* than requested. > > If your udelay can be under by 10% then just bump the number by 10%. Except it's not *quite* that simple. Error has both an absolute and a relative component. So the actual value matters, and it's not always a constant. For example: http://elixir.free-electrons.com/linux/latest/source/drivers/mtd/nand/nand_base.c#L814 > However at that level most hardware isn't that predictable anyway because > the fabric between the CPU core and the device isn't some clunky > serialized link. Writes get delayed, they can bunch together, busses do > posting and queueing. Are you talking about the actual delay operation, or the pokes around it? > Then there is virtualisation 8) > >> A typical driver writer has some HW spec in front of them, which e.g. states: >> >> * poke register A >> * wait 1 microsecond for the dust to settle >> * poke register B > > Rarely because of posting. 
It's usually > > write > while(read() != READY); > write > > and even when you've got a legacy device with timeouts its > > write > read > delay > write > > and for sub 1ms delays I suspect the read and bus latency actually add a > randomization sufficient that it's not much of an optimization to worry > about an accurate ndelay(). I don't think "accurate" is the proper term. Over-delays are fine, under-delays are problematic. >> This "off-by-one" error is systematic over the entire range of allowed >> delay_us input (1 to 2000), so it is easy to fix, by adding 1 to the result. > > And that + 1 might be worth adding but really there isn't a lot of > modern hardware that has a bus that behaves like software folks imagine > and everything has percentage errors factored into published numbers. I guess I'm a software folk, but the designer of the system bus sits across my desk, and we do talk often. >> 3) Why does all this even matter? >> >> At boot, the NAND framework scans the NAND chips for bad blocks; >> this operation generates approximately 10^5 calls to ndelay(100); >> which cause a 100 ms delay, because ndelay is implemented as a >> call to the nearest udelay (rounded up). > > So why aren't you doing that on both NANDs in parallel and asynchronous > to other parts of boot ? If you start scanning at early boot time do you > need the bad block list before mounting / - or are you stuck with a > single threaded CPU and PIO ? There might be some low(ish) hanging fruit to improve the performance of the NAND framework, such as multi-page reads/writes. But the NAND controller on my SoC muxes access to the two NAND chips, so no parallel access, and this requires PIO. > For that matter given the bad blocks don't randomly change why not cache > them ? That's a good question, I'll ask the NAND framework maintainer. Store them where, by the way? On the NAND chip itself? 
>> My current NAND chips are tiny (2 x 512 MB) but with larger chips,
>> the number of calls to ndelay would climb to 10^6 and the delay
>> increase to 1 second, which is starting to be a problem.
>>
>> One solution is to implement ndelay, but ndelay is more prone to
>> under-delays, and thus a prerequisite is fixing under-delays.
>
> For ndelay you probably have to make it platform specific or just use
> udelay if not. We do have a few cases we wanted 400ns delays in the PC
> world (ATA) but not many.

By default, ndelay is implemented in terms of udelay.

Regards.

^ permalink raw reply [flat|nested] 41+ messages in thread
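The boot-time figure quoted above follows from simple arithmetic; a sketch of the cost of the round-up-to-udelay emulation described in the thread:

```c
/* Total cost, in microseconds, of n calls to ndelay(ns) when ndelay is
 * emulated by rounding the request up to a whole number of udelay(1)
 * microseconds -- the fallback behaviour discussed above. */
static unsigned long long emulated_cost_us(unsigned long long calls,
                                           unsigned long long ns)
{
        unsigned long long us_per_call = (ns + 999) / 1000;  /* round up */

        return calls * us_per_call;
}
```

10^5 calls of ndelay(100) cost 100000 us (100 ms) under emulation, versus 10 ms if a true 100 ns delay were available — a factor of ten, growing linearly with the number of blocks scanned.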
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-01 19:03 ` Marc Gonzalez @ 2017-11-01 19:09 ` Linus Torvalds 2017-11-01 19:17 ` Linus Torvalds 2017-11-01 19:38 ` Marc Gonzalez 2017-11-01 19:36 ` Alan Cox ` (2 subsequent siblings) 3 siblings, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2017-11-01 19:09 UTC (permalink / raw) To: Marc Gonzalez Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Mason On Wed, Nov 1, 2017 at 12:03 PM, Marc Gonzalez <marc_gonzalez@sigmadesigns.com> wrote: > > By default, ndelay is implemented in terms of udelay. That's very much *NOT* the case. Yes, there is a *fallback* for when somebody doesn't do ndelay() at all, but that doesn't make it the default. It's just a "the architecture didn't implement ndelay at all, we'll work around it". So stop this idiocy already. About half of what I've seen in this thread has been pure garbage. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-01 19:09 ` Linus Torvalds @ 2017-11-01 19:17 ` Linus Torvalds 2017-11-01 19:38 ` Marc Gonzalez 1 sibling, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2017-11-01 19:17 UTC (permalink / raw) To: Marc Gonzalez Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Mason On Wed, Nov 1, 2017 at 12:09 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Yes, there is a *fallback* for when somebody doesn't do ndelay() at > all, but that doesn't make it the default. > > It's just a "the architecture didn't implement ndelay at all, we'll > work around it". Side note: I will continue to claim that anybody who thinks they need ndelay() needs to rethink their position anyway, due to all the issues I did already bring up. So I'd say that pretty much 100% of all ndelay() users are just wrong. If you are doing a driver that needs that kind of delay granularity, you had better already know exactly how long it takes to read the status register, and just do that instead. And if you don't know how long it takes to read a status register, you have no business thinking that you need ndelay(). You probably ended up reading one piece of hardware doc ("the sample-and-hold register takes 200ns to settle") without then reading the other low-level hardware documentation that talks about the timings of the chip accesses from the host side. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-01 19:09 ` Linus Torvalds 2017-11-01 19:17 ` Linus Torvalds @ 2017-11-01 19:38 ` Marc Gonzalez 2017-11-15 12:51 ` Marc Gonzalez 1 sibling, 1 reply; 41+ messages in thread From: Marc Gonzalez @ 2017-11-01 19:38 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Mason On 01/11/2017 20:09, Linus Torvalds wrote: > On Wed, Nov 1, 2017 at 12:03 PM, Marc Gonzalez wrote: > >> By default, ndelay is implemented in terms of udelay. > > That's very much *NOT* the case. > > Yes, there is a *fallback* for when somebody doesn't do ndelay() at > all, but that doesn't make it the default. > > It's just a "the architecture didn't implement ndelay at all, we'll > work around it". Yes, sorry, I wrote "default" when I meant "fallback". (arm32 currently does not define a specific ndelay implementation.) > So stop this idiocy already. About half of what I've seen in this > thread has been pure garbage. OK, I'll just send my patch, and then crawl back under my rock. Regards. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-01 19:38 ` Marc Gonzalez @ 2017-11-15 12:51 ` Marc Gonzalez 2017-11-15 13:13 ` Russell King - ARM Linux 2017-11-15 18:45 ` Doug Anderson 0 siblings, 2 replies; 41+ messages in thread From: Marc Gonzalez @ 2017-11-15 12:51 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason

On 01/11/2017 20:38, Marc Gonzalez wrote:

> OK, I'll just send my patch, and then crawl back under my rock.

Linus,

As promised, the patch is provided below. And as promised, I will no longer bring this up on LKML.

FWIW, I have checked that the computed value matches the expected value for all HZ and delay_us, and for a few clock frequencies, using the following program:

$ cat delays.c
#include <stdio.h>
#define MEGA 1000000u
typedef unsigned int uint;
typedef unsigned long long u64;
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))

static const uint HZ_tab[] = { 100, 250, 300, 1000 };

static void check_cycle_count(uint freq, uint HZ, uint delay_us)
{
	uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
	uint lpj = DIV_ROUND_UP(freq, HZ);
	uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
	uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);

	if (computed != expected)
		printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
}

int main(void)
{
	uint idx, delay_us, freq;

	for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
		for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
			for (delay_us = 1; delay_us <= 2000; ++delay_us)
				check_cycle_count(freq, HZ_tab[idx], delay_us);

	return 0;
}

-- >8 --
Subject: [PATCH] ARM: Tweak clock-based udelay implementation

In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy") Russell pointed out that loop-based delays may return early.

On the arm platform, delays may be either loop-based or clock-based.

This patch tweaks the clock-based implementation so that udelay(N) is guaranteed to spin at least N microseconds.

Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
---
 arch/arm/lib/delay.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 2cef11884857..0a25712077ec 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -58,15 +58,15 @@ static void __timer_delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
 
-	while ((get_cycles() - start) < cycles)
+	while ((get_cycles() - start) <= cycles)
 		cpu_relax();
 }
 
 static void __timer_const_udelay(unsigned long xloops)
 {
-	unsigned long long loops = xloops;
-	loops *= arm_delay_ops.ticks_per_jiffy;
-	__timer_delay(loops >> UDELAY_SHIFT);
+	u64 tmp = (u64)xloops * arm_delay_ops.ticks_per_jiffy;
+	unsigned long cycles = tmp >> UDELAY_SHIFT;
+	__timer_delay(cycles + 1);	/* Round up in 1 instruction */
 }
 
 static void __timer_udelay(unsigned long usecs)
@@ -92,7 +92,7 @@ void __init register_current_timer_delay(const struct delay_timer *timer)
 	if (!delay_calibrated && (!delay_res || (res < delay_res))) {
 		pr_info("Switching to timer-based delay loop, resolution %lluns\n", res);
 		delay_timer = timer;
-		lpj_fine = timer->freq / HZ;
+		lpj_fine = DIV_ROUND_UP(timer->freq, HZ);
 		delay_res = res;
 
 		/* cpufreq may scale loops_per_jiffy, so keep a private copy */
-- 
2.15.0

^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-15 12:51 ` Marc Gonzalez @ 2017-11-15 13:13 ` Russell King - ARM Linux 2017-11-16 15:26 ` Marc Gonzalez 2017-12-07 12:43 ` Pavel Machek 2017-11-15 18:45 ` Doug Anderson 1 sibling, 2 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2017-11-15 13:13 UTC (permalink / raw) To: Marc Gonzalez Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Wed, Nov 15, 2017 at 01:51:54PM +0100, Marc Gonzalez wrote: > On 01/11/2017 20:38, Marc Gonzalez wrote: > > > OK, I'll just send my patch, and then crawl back under my rock. > > Linus, > > As promised, the patch is provided below. And as promised, I will > no longer bring this up on LKML. > > FWIW, I have checked that the computed value matches the expected > value for all HZ and delay_us, and for a few clock frequencies, > using the following program: > > $ cat delays.c > #include <stdio.h> > #define MEGA 1000000u > typedef unsigned int uint; > typedef unsigned long long u64; > #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d)) > > static const uint HZ_tab[] = { 100, 250, 300, 1000 }; > > static void check_cycle_count(uint freq, uint HZ, uint delay_us) > { > uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA); > uint lpj = DIV_ROUND_UP(freq, HZ); > uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1; > uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA); > > if (computed != expected) > printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected); > } > > int main(void) > { > uint idx, delay_us, freq; > > for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA) > for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx) > for (delay_us = 1; 
delay_us <= 2000; ++delay_us) > check_cycle_count(freq, HZ_tab[idx], delay_us); > > return 0; > } > > > > -- >8 -- > Subject: [PATCH] ARM: Tweak clock-based udelay implementation > > In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy") > Russell pointed out that loop-based delays may return early. > > On the arm platform, delays may be either loop-based or clock-based. > > This patch tweaks the clock-based implementation so that udelay(N) > is guaranteed to spin at least N microseconds. As I've already said, I don't want this, because it encourages people to use too-small delays in driver code, and if we merge it then you will look at your data sheet, decide it says "you need to wait 10us" and write in your driver "udelay(10)" which will break on the loops based delay. udelay() needs to offer a consistent interface so that drivers know what to expect no matter what the implementation is. Making one implementation conform to your ideas while leaving the other implementations with other expectations is a recipe for bugs. If you really want to do this, fix the loops_per_jiffy implementation as well so that the consistency is maintained. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-15 13:13 ` Russell King - ARM Linux @ 2017-11-16 15:26 ` Marc Gonzalez 2017-11-16 15:36 ` Russell King - ARM Linux 2017-12-07 12:43 ` Pavel Machek 1 sibling, 1 reply; 41+ messages in thread From: Marc Gonzalez @ 2017-11-16 15:26 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On 15/11/2017 14:13, Russell King - ARM Linux wrote: > udelay() needs to offer a consistent interface so that drivers know > what to expect no matter what the implementation is. Making one > implementation conform to your ideas while leaving the other > implementations with other expectations is a recipe for bugs. > > If you really want to do this, fix the loops_per_jiffy implementation > as well so that the consistency is maintained. Hello Russell, It seems to me that, when using DFS, there's a serious issue with loop-based delays. (IIRC, it was you who pointed this out a few years ago.) If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled when the frequency changes. But arch/arm/lib/delay-loop.S starts by loading the current value of loops_per_jiffy, computes the number of times to loop, and then loops. If the frequency increases when the core is in __loop_delay, the delay will be much shorter than requested. Is this a correct assessment of the situation? (BTW, does arch/arm/lib/delay-loop.S load the per_cpu loops_per_jiffy or the system-wide variable?) Should loop-based delays be disabled when CPUFREQ is enabled? Regards. ^ permalink raw reply [flat|nested] 41+ messages in thread
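The race asked about here can be quantified with a toy model (hypothetical frequencies; this mimics the loop count being computed once from loops_per_jiffy at entry, as delay-loop.S does, while the loop body runs at whatever the CPU clock has become):

```c
/* If the loop count was derived from loops_per_jiffy at freq_calc_hz,
 * but the loop body actually executes at freq_run_hz, the real delay
 * scales by freq_calc_hz / freq_run_hz.  Illustrative model only. */
static unsigned long actual_delay_us(unsigned long requested_us,
                                     unsigned long freq_calc_hz,
                                     unsigned long freq_run_hz)
{
        /* loop count is fixed at entry, proportional to freq_calc_hz */
        unsigned long long loops = (unsigned long long)requested_us
                                   * freq_calc_hz;

        /* elapsed time is those loops executed at freq_run_hz */
        return (unsigned long)(loops / freq_run_hz);
}
```

A udelay(100) whose loop count was computed at 250 MHz, executed after a switch to 1 GHz, spins for only 25 us — a 4x under-delay, far larger than the one-tick errors discussed for the timer-based path.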
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 15:26 ` Marc Gonzalez @ 2017-11-16 15:36 ` Russell King - ARM Linux 2017-11-16 15:47 ` Marc Gonzalez 0 siblings, 1 reply; 41+ messages in thread From: Russell King - ARM Linux @ 2017-11-16 15:36 UTC (permalink / raw) To: Marc Gonzalez Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: > On 15/11/2017 14:13, Russell King - ARM Linux wrote: > > > udelay() needs to offer a consistent interface so that drivers know > > what to expect no matter what the implementation is. Making one > > implementation conform to your ideas while leaving the other > > implementations with other expectations is a recipe for bugs. > > > > If you really want to do this, fix the loops_per_jiffy implementation > > as well so that the consistency is maintained. > > Hello Russell, > > It seems to me that, when using DFS, there's a serious issue with loop-based > delays. (IIRC, it was you who pointed this out a few years ago.) > > If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled > when the frequency changes. > > But arch/arm/lib/delay-loop.S starts by loading the current value of > loops_per_jiffy, computes the number of times to loop, and then loops. > If the frequency increases when the core is in __loop_delay, the > delay will be much shorter than requested. > > Is this a correct assessment of the situation? Absolutely correct, and it's something that people are aware of, and have already catered for while writing their drivers. > (BTW, does arch/arm/lib/delay-loop.S load the per_cpu loops_per_jiffy > or the system-wide variable?) 
> > Should loop-based delays be disabled when CPUFREQ is enabled?

What about platforms (and there are those in the kernel today) which
have CPUFREQ enabled and also have no timer based delay registered?
These rely on using the delay loop mechanism today.

What this means is you can't just "turn off" loop-based delays just
because CPUFREQ is enabled, because that's going to cause regressions.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 41+ messages in thread
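The failure mode under discussion, where the loop count is computed once from loops_per_jiffy at entry and then spun at whatever frequency the core happens to be running, can be illustrated with a small user-space simulation. This is a hypothetical model, not the kernel's actual delay-loop.S; it assumes one delay-loop iteration per CPU cycle for simplicity:

```c
#include <assert.h>

/* Model of a loop-calibrated delay: the loop count is derived from
 * the calibrated frequency once, when the delay starts, so the
 * wall-clock time actually spent scales inversely with whatever
 * frequency the core runs at while spinning. */
static unsigned long loops_for_udelay(unsigned long us, unsigned long freq_hz)
{
	/* Assumes one loop iteration per CPU cycle. */
	return (unsigned long)((unsigned long long)us * freq_hz / 1000000);
}

static unsigned long spun_us(unsigned long loops, unsigned long spin_freq_hz)
{
	return (unsigned long)((unsigned long long)loops * 1000000 / spin_freq_hz);
}
```

With the figures from this thread: a loop count computed at 120 MHz is 12000 iterations for udelay(100); if the core then ramps to 1.2 GHz, those 12000 iterations take only 10 µs.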
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 15:36 ` Russell King - ARM Linux @ 2017-11-16 15:47 ` Marc Gonzalez 2017-11-16 16:08 ` Nicolas Pitre 0 siblings, 1 reply; 41+ messages in thread From: Marc Gonzalez @ 2017-11-16 15:47 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On 16/11/2017 16:36, Russell King - ARM Linux wrote: > On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: >> On 15/11/2017 14:13, Russell King - ARM Linux wrote: >> >>> udelay() needs to offer a consistent interface so that drivers know >>> what to expect no matter what the implementation is. Making one >>> implementation conform to your ideas while leaving the other >>> implementations with other expectations is a recipe for bugs. >>> >>> If you really want to do this, fix the loops_per_jiffy implementation >>> as well so that the consistency is maintained. >> >> Hello Russell, >> >> It seems to me that, when using DFS, there's a serious issue with loop-based >> delays. (IIRC, it was you who pointed this out a few years ago.) >> >> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled >> when the frequency changes. >> >> But arch/arm/lib/delay-loop.S starts by loading the current value of >> loops_per_jiffy, computes the number of times to loop, and then loops. >> If the frequency increases when the core is in __loop_delay, the >> delay will be much shorter than requested. >> >> Is this a correct assessment of the situation? > > Absolutely correct, and it's something that people are aware of, and > have already catered for while writing their drivers. In their cpufreq driver? In "real" device drivers that happen to use delays? 
On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
If the frequency increases at the beginning of __loop_delay, udelay(100)
would spin only 10 microseconds. This is likely to cause issues in
any driver using udelay.

How does one cater for that?

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 15:47 ` Marc Gonzalez @ 2017-11-16 16:08 ` Nicolas Pitre 2017-11-16 16:26 ` Marc Gonzalez 0 siblings, 1 reply; 41+ messages in thread From: Nicolas Pitre @ 2017-11-16 16:08 UTC (permalink / raw) To: Marc Gonzalez Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Thu, 16 Nov 2017, Marc Gonzalez wrote: > On 16/11/2017 16:36, Russell King - ARM Linux wrote: > > On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: > >> On 15/11/2017 14:13, Russell King - ARM Linux wrote: > >> > >>> udelay() needs to offer a consistent interface so that drivers know > >>> what to expect no matter what the implementation is. Making one > >>> implementation conform to your ideas while leaving the other > >>> implementations with other expectations is a recipe for bugs. > >>> > >>> If you really want to do this, fix the loops_per_jiffy implementation > >>> as well so that the consistency is maintained. > >> > >> Hello Russell, > >> > >> It seems to me that, when using DFS, there's a serious issue with loop-based > >> delays. (IIRC, it was you who pointed this out a few years ago.) > >> > >> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled > >> when the frequency changes. > >> > >> But arch/arm/lib/delay-loop.S starts by loading the current value of > >> loops_per_jiffy, computes the number of times to loop, and then loops. > >> If the frequency increases when the core is in __loop_delay, the > >> delay will be much shorter than requested. > >> > >> Is this a correct assessment of the situation? 
> >
> > Absolutely correct, and it's something that people are aware of, and
> > have already catered for while writing their drivers.
>
> In their cpufreq driver?
> In "real" device drivers that happen to use delays?
>
> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> If the frequency increases at the beginning of __loop_delay, udelay(100)
> would spin only 10 microseconds. This is likely to cause issues in
> any driver using udelay.
>
> How does one cater for that?

You make sure your delays are based on a stable hardware timer.
Most platforms nowadays should have a suitable timer source.


Nicolas

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 16:08 ` Nicolas Pitre @ 2017-11-16 16:26 ` Marc Gonzalez 2017-11-16 16:32 ` Russell King - ARM Linux 2017-11-16 16:47 ` Nicolas Pitre 0 siblings, 2 replies; 41+ messages in thread From: Marc Gonzalez @ 2017-11-16 16:26 UTC (permalink / raw) To: Nicolas Pitre Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On 16/11/2017 17:08, Nicolas Pitre wrote: > On Thu, 16 Nov 2017, Marc Gonzalez wrote: > >> On 16/11/2017 16:36, Russell King - ARM Linux wrote: >>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: >>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote: >>>> >>>>> udelay() needs to offer a consistent interface so that drivers know >>>>> what to expect no matter what the implementation is. Making one >>>>> implementation conform to your ideas while leaving the other >>>>> implementations with other expectations is a recipe for bugs. >>>>> >>>>> If you really want to do this, fix the loops_per_jiffy implementation >>>>> as well so that the consistency is maintained. >>>> >>>> Hello Russell, >>>> >>>> It seems to me that, when using DFS, there's a serious issue with loop-based >>>> delays. (IIRC, it was you who pointed this out a few years ago.) >>>> >>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled >>>> when the frequency changes. >>>> >>>> But arch/arm/lib/delay-loop.S starts by loading the current value of >>>> loops_per_jiffy, computes the number of times to loop, and then loops. >>>> If the frequency increases when the core is in __loop_delay, the >>>> delay will be much shorter than requested. >>>> >>>> Is this a correct assessment of the situation? 
>>> >>> Absolutely correct, and it's something that people are aware of, and >>> have already catered for while writing their drivers. >> >> In their cpufreq driver? >> In "real" device drivers that happen to use delays? >> >> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz. >> If the frequency increases at the beginning of __loop_delay, udelay(100) >> would spin only 10 microseconds. This is likely to cause issues in >> any driver using udelay. >> >> How does one cater for that? > > You make sure your delays are based on a stable hardware timer. > Most platforms nowadays should have a suitable timer source. So you propose fixing loop-based delays by using clock-based delays, is that correct? (That is indeed what I did on my platform.) Russell stated that there are platforms using loop-based delays with cpufreq enabled. I'm asking how they manage the brokenness. Regards. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 16:26 ` Marc Gonzalez @ 2017-11-16 16:32 ` Russell King - ARM Linux 2017-11-16 16:42 ` Marc Gonzalez 2017-11-16 16:47 ` Nicolas Pitre 1 sibling, 1 reply; 41+ messages in thread From: Russell King - ARM Linux @ 2017-11-16 16:32 UTC (permalink / raw) To: Marc Gonzalez Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote: > On 16/11/2017 17:08, Nicolas Pitre wrote: > > > On Thu, 16 Nov 2017, Marc Gonzalez wrote: > > > >> On 16/11/2017 16:36, Russell King - ARM Linux wrote: > >>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: > >>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote: > >>>> > >>>>> udelay() needs to offer a consistent interface so that drivers know > >>>>> what to expect no matter what the implementation is. Making one > >>>>> implementation conform to your ideas while leaving the other > >>>>> implementations with other expectations is a recipe for bugs. > >>>>> > >>>>> If you really want to do this, fix the loops_per_jiffy implementation > >>>>> as well so that the consistency is maintained. > >>>> > >>>> Hello Russell, > >>>> > >>>> It seems to me that, when using DFS, there's a serious issue with loop-based > >>>> delays. (IIRC, it was you who pointed this out a few years ago.) > >>>> > >>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled > >>>> when the frequency changes. > >>>> > >>>> But arch/arm/lib/delay-loop.S starts by loading the current value of > >>>> loops_per_jiffy, computes the number of times to loop, and then loops. 
> >>>> If the frequency increases when the core is in __loop_delay, the > >>>> delay will be much shorter than requested. > >>>> > >>>> Is this a correct assessment of the situation? > >>> > >>> Absolutely correct, and it's something that people are aware of, and > >>> have already catered for while writing their drivers. > >> > >> In their cpufreq driver? > >> In "real" device drivers that happen to use delays? > >> > >> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz. > >> If the frequency increases at the beginning of __loop_delay, udelay(100) > >> would spin only 10 microseconds. This is likely to cause issues in > >> any driver using udelay. > >> > >> How does one cater for that? > > > > You make sure your delays are based on a stable hardware timer. > > Most platforms nowadays should have a suitable timer source. > > So you propose fixing loop-based delays by using clock-based delays, > is that correct? (That is indeed what I did on my platform.) > > Russell stated that there are platforms using loop-based delays with > cpufreq enabled. I'm asking how they manage the brokenness. Quite simply, they don't have such a wide range of frequencies that can be selected. They're on the order of 4x. For example, the original platform that cpufreq was developed on, a StrongARM-1110 board, can practically range from 221MHz down to 59MHz. BTW, your example above is incorrect. If I'm not mistaken "120 MHz to 1.2 GHz." is only a 10x range, not a 100x range. That would mean if udelay(100) is initially executed while at 1.2GHz, and the frequency drops to 120MHz during that delay, the delay could be anything between just short of 100us to just short of 1ms. For 10ms to come into it, you'd need to range from 1.2GHz down to 0.012GHz, iow, 12MHz. 
-- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 16:32 ` Russell King - ARM Linux @ 2017-11-16 16:42 ` Marc Gonzalez 2017-11-16 17:05 ` Russell King - ARM Linux 0 siblings, 1 reply; 41+ messages in thread From: Marc Gonzalez @ 2017-11-16 16:42 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On 16/11/2017 17:32, Russell King - ARM Linux wrote: > On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote: >> On 16/11/2017 17:08, Nicolas Pitre wrote: >> >>> On Thu, 16 Nov 2017, Marc Gonzalez wrote: >>> >>>> On 16/11/2017 16:36, Russell King - ARM Linux wrote: >>>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: >>>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote: >>>>>> >>>>>>> udelay() needs to offer a consistent interface so that drivers know >>>>>>> what to expect no matter what the implementation is. Making one >>>>>>> implementation conform to your ideas while leaving the other >>>>>>> implementations with other expectations is a recipe for bugs. >>>>>>> >>>>>>> If you really want to do this, fix the loops_per_jiffy implementation >>>>>>> as well so that the consistency is maintained. >>>>>> >>>>>> Hello Russell, >>>>>> >>>>>> It seems to me that, when using DFS, there's a serious issue with loop-based >>>>>> delays. (IIRC, it was you who pointed this out a few years ago.) >>>>>> >>>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled >>>>>> when the frequency changes. >>>>>> >>>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of >>>>>> loops_per_jiffy, computes the number of times to loop, and then loops. 
>>>>>> If the frequency increases when the core is in __loop_delay, the >>>>>> delay will be much shorter than requested. >>>>>> >>>>>> Is this a correct assessment of the situation? >>>>> >>>>> Absolutely correct, and it's something that people are aware of, and >>>>> have already catered for while writing their drivers. >>>> >>>> In their cpufreq driver? >>>> In "real" device drivers that happen to use delays? >>>> >>>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz. >>>> If the frequency increases at the beginning of __loop_delay, udelay(100) >>>> would spin only 10 microseconds. This is likely to cause issues in >>>> any driver using udelay. >>>> >>>> How does one cater for that? >>> >>> You make sure your delays are based on a stable hardware timer. >>> Most platforms nowadays should have a suitable timer source. >> >> So you propose fixing loop-based delays by using clock-based delays, >> is that correct? (That is indeed what I did on my platform.) >> >> Russell stated that there are platforms using loop-based delays with >> cpufreq enabled. I'm asking how they manage the brokenness. > > Quite simply, they don't have such a wide range of frequencies that can > be selected. They're on the order of 4x. For example, the original > platform that cpufreq was developed on, a StrongARM-1110 board, can > practically range from 221MHz down to 59MHz. Requesting 100 µs and spinning only 25 µs is still a problem, don't you agree? > BTW, your example above is incorrect. A 10x increase in frequency causes a request of 100 µs to spin only 10 µs, as written above. The problem is not when the frequency drops -- this makes the delay longer. The problem is when the frequency increases, which makes the delay shorter. Regards. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 16:42 ` Marc Gonzalez @ 2017-11-16 17:05 ` Russell King - ARM Linux 2017-11-16 21:05 ` Marc Gonzalez 0 siblings, 1 reply; 41+ messages in thread From: Russell King - ARM Linux @ 2017-11-16 17:05 UTC (permalink / raw) To: Marc Gonzalez Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote: > On 16/11/2017 17:32, Russell King - ARM Linux wrote: > > On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote: > >> On 16/11/2017 17:08, Nicolas Pitre wrote: > >> > >>> On Thu, 16 Nov 2017, Marc Gonzalez wrote: > >>> > >>>> On 16/11/2017 16:36, Russell King - ARM Linux wrote: > >>>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: > >>>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote: > >>>>>> > >>>>>>> udelay() needs to offer a consistent interface so that drivers know > >>>>>>> what to expect no matter what the implementation is. Making one > >>>>>>> implementation conform to your ideas while leaving the other > >>>>>>> implementations with other expectations is a recipe for bugs. > >>>>>>> > >>>>>>> If you really want to do this, fix the loops_per_jiffy implementation > >>>>>>> as well so that the consistency is maintained. > >>>>>> > >>>>>> Hello Russell, > >>>>>> > >>>>>> It seems to me that, when using DFS, there's a serious issue with loop-based > >>>>>> delays. (IIRC, it was you who pointed this out a few years ago.) > >>>>>> > >>>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled > >>>>>> when the frequency changes. 
> >>>>>>
> >>>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >>>>>> loops_per_jiffy, computes the number of times to loop, and then loops.
> >>>>>> If the frequency increases when the core is in __loop_delay, the
> >>>>>> delay will be much shorter than requested.
> >>>>>>
> >>>>>> Is this a correct assessment of the situation?
> >>>>>
> >>>>> Absolutely correct, and it's something that people are aware of, and
> >>>>> have already catered for while writing their drivers.
> >>>>
> >>>> In their cpufreq driver?
> >>>> In "real" device drivers that happen to use delays?
> >>>>
> >>>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> >>>> If the frequency increases at the beginning of __loop_delay, udelay(100)
> >>>> would spin only 10 microseconds. This is likely to cause issues in
> >>>> any driver using udelay.
> >>>>
> >>>> How does one cater for that?
> >>>
> >>> You make sure your delays are based on a stable hardware timer.
> >>> Most platforms nowadays should have a suitable timer source.
> >>
> >> So you propose fixing loop-based delays by using clock-based delays,
> >> is that correct? (That is indeed what I did on my platform.)
> >>
> >> Russell stated that there are platforms using loop-based delays with
> >> cpufreq enabled. I'm asking how they manage the brokenness.
> >
> > Quite simply, they don't have such a wide range of frequencies that can
> > be selected. They're on the order of 4x. For example, the original
> > platform that cpufreq was developed on, a StrongARM-1110 board, can
> > practically range from 221MHz down to 59MHz.
>
> Requesting 100 µs and spinning only 25 µs is still a problem,
> don't you agree?

Which is why, as I've said *many* times already, that drivers are
written with leeway on the delays.

I get the impression that we're just going around in circles, and what
you're trying to do is to get me to agree with your point of view.
That's not going to happen, because I know the history over about the last /24/ years of kernel development (which is how long I've been involved with the kernel.) That's almost a quarter of a century! I know how things were done years ago (which is relevant because we still have support in the kernel for these systems), and I also know the history of facilities like cpufreq - I was the one who took the work that Erik Mouw and others involved with the LART project, and turned it into something a little more generic. The idea of dynamically scaling the CPU frequency on ARM SoCs was something that the SoC manufacturer had not even considered - it was innovative. I know that udelay() can return short delays when used in a kernel with cpufreq enabled, and I also know that's almost an impossible problem to solve without going to a timer-based delay. So, when you think that sending an email about a udelay() that can be 10x shorter might be somehow new information, and might convince people that there's a problem, I'm afraid that it isn't really new information. The SA1110 cpufreq driver is dated 2001, and carries my copyright, and has the ability to make udelay()s 4x shorter or 4x longer depending on the direction of change. We've discussed solutions in the past (probably 10 years ago) about this, and what can be done, and the conclusion to that was, as Nicolas has said, to switch to using a timer-based delay mechanism where possible. Where this is not possible, the platform is stuck with the loops based delays, and their inherent variability and inaccuracy. These platforms have been tested with such a setup over many years. They work even with udelay() having this behaviour, because it's a known issue and drivers cater for it in ways that I've already covered in my many previous emails to you. These issues are known. They've been known for the last 15 odd years. > > BTW, your example above is incorrect. 
> > A 10x increase in frequency causes a request of 100 µs to spin > only 10 µs, as written above. Right, sorry, misread. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 17:05 ` Russell King - ARM Linux @ 2017-11-16 21:05 ` Marc Gonzalez 2017-11-16 22:15 ` Doug Anderson 0 siblings, 1 reply; 41+ messages in thread From: Marc Gonzalez @ 2017-11-16 21:05 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On 16/11/2017 18:05, Russell King - ARM Linux wrote: > On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote: > >> Requesting 100 µs and spinning only 25 µs is still a problem, >> don't you agree? > > Which is why, as I've said *many* times already, that drivers are written > with leaway on the delays. A delay 75% too short is possible. Roger that. > I get the impression that we're just going around in circles, and what > you're trying to do is to get me to agree with your point of view. > That's not going to happen, because I know the history over about the > last /24/ years of kernel development (which is how long I've been > involved with the kernel.) That's almost a quarter of a century! > > I know how things were done years ago (which is relevant because we > still have support in the kernel for these systems), and I also know the > history of facilities like cpufreq - I was the one who took the work > that Erik Mouw and others involved with the LART project, and turned it > into something a little more generic. The idea of dynamically scaling > the CPU frequency on ARM SoCs was something that the SoC manufacturer > had not even considered - it was innovative. 
> > I know that udelay() can return short delays when used in a kernel with > cpufreq enabled, and I also know that's almost an impossible problem to > solve without going to a timer-based delay. > > So, when you think that sending an email about a udelay() that can be > 10x shorter might be somehow new information, and might convince people > that there's a problem, I'm afraid that it isn't really new information. > The SA1110 cpufreq driver is dated 2001, and carries my copyright, and > has the ability to make udelay()s 4x shorter or 4x longer depending on > the direction of change. > > We've discussed solutions in the past (probably 10 years ago) about > this, and what can be done, and the conclusion to that was, as Nicolas > has said, to switch to using a timer-based delay mechanism where > possible. Where this is not possible, the platform is stuck with the > loops based delays, and their inherent variability and inaccuracy. > > These platforms have been tested with such a setup over many years. > They work even with udelay() having this behaviour, because it's a > known issue and drivers cater for it in ways that I've already covered > in my many previous emails to you. > > These issues are known. They've been known for the last 15 odd years. So you've known for umpteen years that fixing loop-based delays is intractable, yet you wrote: > udelay() needs to offer a consistent interface so that drivers know > what to expect no matter what the implementation is. Making one > implementation conform to your ideas while leaving the other > implementations with other expectations is a recipe for bugs. > > If you really want to do this, fix the loops_per_jiffy implementation > as well so that the consistency is maintained. In other words, "I'll consider your patch as soon as Hell freezes over". Roger that. I'll drop the subject then. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 21:05 ` Marc Gonzalez @ 2017-11-16 22:15 ` Doug Anderson 2017-11-16 23:22 ` Russell King - ARM Linux 0 siblings, 1 reply; 41+ messages in thread From: Doug Anderson @ 2017-11-16 22:15 UTC (permalink / raw) To: Marc Gonzalez Cc: Russell King - ARM Linux, Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason Hi, On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez <marc_gonzalez@sigmadesigns.com> wrote: > On 16/11/2017 18:05, Russell King - ARM Linux wrote: > >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote: >> >>> Requesting 100 µs and spinning only 25 µs is still a problem, >>> don't you agree? >> >> Which is why, as I've said *many* times already, that drivers are written >> with leaway on the delays. > > A delay 75% too short is possible. Roger that. > >> I get the impression that we're just going around in circles, and what >> you're trying to do is to get me to agree with your point of view. >> That's not going to happen, because I know the history over about the >> last /24/ years of kernel development (which is how long I've been >> involved with the kernel.) That's almost a quarter of a century! >> >> I know how things were done years ago (which is relevant because we >> still have support in the kernel for these systems), and I also know the >> history of facilities like cpufreq - I was the one who took the work >> that Erik Mouw and others involved with the LART project, and turned it >> into something a little more generic. The idea of dynamically scaling >> the CPU frequency on ARM SoCs was something that the SoC manufacturer >> had not even considered - it was innovative. 
>> >> I know that udelay() can return short delays when used in a kernel with >> cpufreq enabled, and I also know that's almost an impossible problem to >> solve without going to a timer-based delay. >> >> So, when you think that sending an email about a udelay() that can be >> 10x shorter might be somehow new information, and might convince people >> that there's a problem, I'm afraid that it isn't really new information. >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and >> has the ability to make udelay()s 4x shorter or 4x longer depending on >> the direction of change. >> >> We've discussed solutions in the past (probably 10 years ago) about >> this, and what can be done, and the conclusion to that was, as Nicolas >> has said, to switch to using a timer-based delay mechanism where >> possible. Where this is not possible, the platform is stuck with the >> loops based delays, and their inherent variability and inaccuracy. >> >> These platforms have been tested with such a setup over many years. >> They work even with udelay() having this behaviour, because it's a >> known issue and drivers cater for it in ways that I've already covered >> in my many previous emails to you. >> >> These issues are known. They've been known for the last 15 odd years. > > So you've known for umpteen years that fixing loop-based delays is > intractable, yet you wrote: > >> udelay() needs to offer a consistent interface so that drivers know >> what to expect no matter what the implementation is. Making one >> implementation conform to your ideas while leaving the other >> implementations with other expectations is a recipe for bugs. >> >> If you really want to do this, fix the loops_per_jiffy implementation >> as well so that the consistency is maintained. > > In other words, "I'll consider your patch as soon as Hell freezes over". > > Roger that. I'll drop the subject then. 
Presumably, though, you could introduce a new API like:

  udelay_atleast()

That was guaranteed to delay at least the given number of microseconds.
Unlike the current udelay(), the new udelay_atleast() wouldn't really
try that hard to get a delay that's approximately the one requested, it
would just guarantee not to ever delay _less_ than the amount requested.
Thus there would be some reasons to use one API or the other and it
would be up to the driver to pick one. You wouldn't regress any
performance on old code since the old API would work the same as it
always did.

You could presumably implement udelay_atleast() something like I did in
an ancient patch I was involved in at http://crosreview.com/189885. In
this case you wouldn't modify the normal udelay() but you'd actually add
a new API. (for the curious, we later picked back a proper timer-based
delay in http://crosreview.com/218961 so it's not like we were stuck
with the crappy delay for long)

Once you added udelay_atleast() you'd have a good reason to fixup the
timer-based udelay() since udelay_atleast() could just be a wrapper of
that function and you'd have the requested consistency.

-Doug

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 22:15 ` Doug Anderson @ 2017-11-16 23:22 ` Russell King - ARM Linux 2017-11-20 17:38 ` Doug Anderson 0 siblings, 1 reply; 41+ messages in thread From: Russell King - ARM Linux @ 2017-11-16 23:22 UTC (permalink / raw) To: Doug Anderson Cc: Marc Gonzalez, Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote: > Hi, > > On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez > <marc_gonzalez@sigmadesigns.com> wrote: > > On 16/11/2017 18:05, Russell King - ARM Linux wrote: > > > >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote: > >> > >>> Requesting 100 µs and spinning only 25 µs is still a problem, > >>> don't you agree? > >> > >> Which is why, as I've said *many* times already, that drivers are written > >> with leaway on the delays. > > > > A delay 75% too short is possible. Roger that. > > > >> I get the impression that we're just going around in circles, and what > >> you're trying to do is to get me to agree with your point of view. > >> That's not going to happen, because I know the history over about the > >> last /24/ years of kernel development (which is how long I've been > >> involved with the kernel.) That's almost a quarter of a century! > >> > >> I know how things were done years ago (which is relevant because we > >> still have support in the kernel for these systems), and I also know the > >> history of facilities like cpufreq - I was the one who took the work > >> that Erik Mouw and others involved with the LART project, and turned it > >> into something a little more generic. 
The idea of dynamically scaling > >> the CPU frequency on ARM SoCs was something that the SoC manufacturer > >> had not even considered - it was innovative. > >> > >> I know that udelay() can return short delays when used in a kernel with > >> cpufreq enabled, and I also know that's almost an impossible problem to > >> solve without going to a timer-based delay. > >> > >> So, when you think that sending an email about a udelay() that can be > >> 10x shorter might be somehow new information, and might convince people > >> that there's a problem, I'm afraid that it isn't really new information. > >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and > >> has the ability to make udelay()s 4x shorter or 4x longer depending on > >> the direction of change. > >> > >> We've discussed solutions in the past (probably 10 years ago) about > >> this, and what can be done, and the conclusion to that was, as Nicolas > >> has said, to switch to using a timer-based delay mechanism where > >> possible. Where this is not possible, the platform is stuck with the > >> loops based delays, and their inherent variability and inaccuracy. > >> > >> These platforms have been tested with such a setup over many years. > >> They work even with udelay() having this behaviour, because it's a > >> known issue and drivers cater for it in ways that I've already covered > >> in my many previous emails to you. > >> > >> These issues are known. They've been known for the last 15 odd years. > > > > So you've known for umpteen years that fixing loop-based delays is > > intractable, yet you wrote: > > > >> udelay() needs to offer a consistent interface so that drivers know > >> what to expect no matter what the implementation is. Making one > >> implementation conform to your ideas while leaving the other > >> implementations with other expectations is a recipe for bugs. 
> >> > >> If you really want to do this, fix the loops_per_jiffy implementation > >> as well so that the consistency is maintained. > > > > In other words, "I'll consider your patch as soon as Hell freezes over". > > > > Roger that. I'll drop the subject then. > > Presumably, though, you could introduce a new API like: > > udelay_atleast() > > That was guaranteed to delay at least the given number of > microseconds. Unlike the current udelay(), the new udelay_atleast() > wouldn't really try that hard to get a delay that's approximately the > one requested, it would just guarantee not to ever delay _less_ than > the amount requested. I look forward to reviewing your implementation. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 23:22 ` Russell King - ARM Linux @ 2017-11-20 17:38 ` Doug Anderson 2017-11-20 18:31 ` Russell King - ARM Linux 0 siblings, 1 reply; 41+ messages in thread From: Doug Anderson @ 2017-11-20 17:38 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Marc Gonzalez, Nicolas Pitre, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason Hi, On Thu, Nov 16, 2017 at 3:22 PM, Russell King - ARM Linux <linux@armlinux.org.uk> wrote: > On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote: >> Hi, >> >> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez >> <marc_gonzalez@sigmadesigns.com> wrote: >> > On 16/11/2017 18:05, Russell King - ARM Linux wrote: >> > >> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote: >> >> >> >>> Requesting 100 µs and spinning only 25 µs is still a problem, >> >>> don't you agree? >> >> >> >> Which is why, as I've said *many* times already, that drivers are written >> >> with leaway on the delays. >> > >> > A delay 75% too short is possible. Roger that. >> > >> >> I get the impression that we're just going around in circles, and what >> >> you're trying to do is to get me to agree with your point of view. >> >> That's not going to happen, because I know the history over about the >> >> last /24/ years of kernel development (which is how long I've been >> >> involved with the kernel.) That's almost a quarter of a century! 
>> >> >> >> I know how things were done years ago (which is relevant because we >> >> still have support in the kernel for these systems), and I also know the >> >> history of facilities like cpufreq - I was the one who took the work >> >> that Erik Mouw and others involved with the LART project, and turned it >> >> into something a little more generic. The idea of dynamically scaling >> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer >> >> had not even considered - it was innovative. >> >> >> >> I know that udelay() can return short delays when used in a kernel with >> >> cpufreq enabled, and I also know that's almost an impossible problem to >> >> solve without going to a timer-based delay. >> >> >> >> So, when you think that sending an email about a udelay() that can be >> >> 10x shorter might be somehow new information, and might convince people >> >> that there's a problem, I'm afraid that it isn't really new information. >> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and >> >> has the ability to make udelay()s 4x shorter or 4x longer depending on >> >> the direction of change. >> >> >> >> We've discussed solutions in the past (probably 10 years ago) about >> >> this, and what can be done, and the conclusion to that was, as Nicolas >> >> has said, to switch to using a timer-based delay mechanism where >> >> possible. Where this is not possible, the platform is stuck with the >> >> loops based delays, and their inherent variability and inaccuracy. >> >> >> >> These platforms have been tested with such a setup over many years. >> >> They work even with udelay() having this behaviour, because it's a >> >> known issue and drivers cater for it in ways that I've already covered >> >> in my many previous emails to you. >> >> >> >> These issues are known. They've been known for the last 15 odd years. 
>> > >> > So you've known for umpteen years that fixing loop-based delays is >> > intractable, yet you wrote: >> > >> >> udelay() needs to offer a consistent interface so that drivers know >> >> what to expect no matter what the implementation is. Making one >> >> implementation conform to your ideas while leaving the other >> >> implementations with other expectations is a recipe for bugs. >> >> >> >> If you really want to do this, fix the loops_per_jiffy implementation >> >> as well so that the consistency is maintained. >> > >> > In other words, "I'll consider your patch as soon as Hell freezes over". >> > >> > Roger that. I'll drop the subject then. >> >> Presumably, though, you could introduce a new API like: >> >> udelay_atleast() >> >> That was guaranteed to delay at least the given number of >> microseconds. Unlike the current udelay(), the new udelay_atleast() >> wouldn't really try that hard to get a delay that's approximately the >> one requested, it would just guarantee not to ever delay _less_ than >> the amount requested. > > I look forward to reviewing your implementation. It's unlikely I'll post a patch in the near term since this isn't presenting me with a big problem right now. Mostly I saw Marc's patch and thought it would be a good patch to land and I knew this type of thing had bitten me in the past. One happy result of this whole discussion, though, is that you now sound as if you'll be happy the next time someone brings this up since you're looking forward to reviewing an implementation. That's a nice change from the original statement questioning why someone was asking about this again. :) As I've expressed in the past, IMHO fixing the timer based delays is still a sane thing to do even without also fixing all the loop-based implementations. 
As someone said earlier in this thread, all the old platforms using loop-based delays are working great for the platforms they support and there's no reason to mess with them, but that doesn't mean we shouldn't fix the problems that are easy to fix. -Doug ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-20 17:38 ` Doug Anderson @ 2017-11-20 18:31 ` Russell King - ARM Linux 0 siblings, 0 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2017-11-20 18:31 UTC (permalink / raw) To: Doug Anderson Cc: Nicolas Pitre, Alan Cox, Mason, Thibaud Cornic, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Peter Zijlstra, Stephen Boyd, Michael Turquette, Mark Rutland, Will Deacon, LKML, Steven Rostedt, John Stultz, Boris Brezillon, Thomas Gleixner, Linus Torvalds, Ingo Molnar, Linux ARM, Marc Gonzalez On Mon, Nov 20, 2017 at 09:38:46AM -0800, Doug Anderson wrote: > Hi, > > On Thu, Nov 16, 2017 at 3:22 PM, Russell King - ARM Linux > <linux@armlinux.org.uk> wrote: > > On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote: > >> Hi, > >> > >> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez > >> <marc_gonzalez@sigmadesigns.com> wrote: > >> > On 16/11/2017 18:05, Russell King - ARM Linux wrote: > >> > > >> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote: > >> >> > >> >>> Requesting 100 µs and spinning only 25 µs is still a problem, > >> >>> don't you agree? > >> >> > >> >> Which is why, as I've said *many* times already, that drivers are written > >> >> with leaway on the delays. > >> > > >> > A delay 75% too short is possible. Roger that. > >> > > >> >> I get the impression that we're just going around in circles, and what > >> >> you're trying to do is to get me to agree with your point of view. > >> >> That's not going to happen, because I know the history over about the > >> >> last /24/ years of kernel development (which is how long I've been > >> >> involved with the kernel.) That's almost a quarter of a century! 
> >> >> > >> >> I know how things were done years ago (which is relevant because we > >> >> still have support in the kernel for these systems), and I also know the > >> >> history of facilities like cpufreq - I was the one who took the work > >> >> that Erik Mouw and others involved with the LART project, and turned it > >> >> into something a little more generic. The idea of dynamically scaling > >> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer > >> >> had not even considered - it was innovative. > >> >> > >> >> I know that udelay() can return short delays when used in a kernel with > >> >> cpufreq enabled, and I also know that's almost an impossible problem to > >> >> solve without going to a timer-based delay. > >> >> > >> >> So, when you think that sending an email about a udelay() that can be > >> >> 10x shorter might be somehow new information, and might convince people > >> >> that there's a problem, I'm afraid that it isn't really new information. > >> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and > >> >> has the ability to make udelay()s 4x shorter or 4x longer depending on > >> >> the direction of change. > >> >> > >> >> We've discussed solutions in the past (probably 10 years ago) about > >> >> this, and what can be done, and the conclusion to that was, as Nicolas > >> >> has said, to switch to using a timer-based delay mechanism where > >> >> possible. Where this is not possible, the platform is stuck with the > >> >> loops based delays, and their inherent variability and inaccuracy. > >> >> > >> >> These platforms have been tested with such a setup over many years. > >> >> They work even with udelay() having this behaviour, because it's a > >> >> known issue and drivers cater for it in ways that I've already covered > >> >> in my many previous emails to you. > >> >> > >> >> These issues are known. They've been known for the last 15 odd years. 
> >> > > >> > So you've known for umpteen years that fixing loop-based delays is > >> > intractable, yet you wrote: > >> > > >> >> udelay() needs to offer a consistent interface so that drivers know > >> >> what to expect no matter what the implementation is. Making one > >> >> implementation conform to your ideas while leaving the other > >> >> implementations with other expectations is a recipe for bugs. > >> >> > >> >> If you really want to do this, fix the loops_per_jiffy implementation > >> >> as well so that the consistency is maintained. > >> > > >> > In other words, "I'll consider your patch as soon as Hell freezes over". > >> > > >> > Roger that. I'll drop the subject then. > >> > >> Presumably, though, you could introduce a new API like: > >> > >> udelay_atleast() > >> > >> That was guaranteed to delay at least the given number of > >> microseconds. Unlike the current udelay(), the new udelay_atleast() > >> wouldn't really try that hard to get a delay that's approximately the > >> one requested, it would just guarantee not to ever delay _less_ than > >> the amount requested. > > > > I look forward to reviewing your implementation. > > It's unlikely I'll post a patch in the near term since this isn't > presenting me with a big problem right now. Mostly I saw Marc's patch > and thought it would be a good patch to land and I knew this type of > thing had bitten me in the past. > > One happy result of this whole discussion, though, is that you now > sound as if you'll be happy the next time someone brings this up since > you're looking forward to reviewing an implementation. That's a nice > change from the original statement questioning why someone was asking > about this again. :) What I'd be happy with, and what I've always been happy with is what I've stated: either we fix _all_ implementations or none of them. We can't have the situation where some implementations give one expectation and others give something completely different. 
That's always been my argument against _just_ fixing the timer-based delays and ignoring the rest of the problem. Nothing has changed about my position. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 16:26 ` Marc Gonzalez 2017-11-16 16:32 ` Russell King - ARM Linux @ 2017-11-16 16:47 ` Nicolas Pitre 2017-11-16 16:51 ` Marc Gonzalez 1 sibling, 1 reply; 41+ messages in thread From: Nicolas Pitre @ 2017-11-16 16:47 UTC (permalink / raw) To: Marc Gonzalez Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Thu, 16 Nov 2017, Marc Gonzalez wrote: > On 16/11/2017 17:08, Nicolas Pitre wrote: > > > On Thu, 16 Nov 2017, Marc Gonzalez wrote: > > > >> On 16/11/2017 16:36, Russell King - ARM Linux wrote: > >>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote: > >>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote: > >>>> > >>>>> udelay() needs to offer a consistent interface so that drivers know > >>>>> what to expect no matter what the implementation is. Making one > >>>>> implementation conform to your ideas while leaving the other > >>>>> implementations with other expectations is a recipe for bugs. > >>>>> > >>>>> If you really want to do this, fix the loops_per_jiffy implementation > >>>>> as well so that the consistency is maintained. > >>>> > >>>> Hello Russell, > >>>> > >>>> It seems to me that, when using DFS, there's a serious issue with loop-based > >>>> delays. (IIRC, it was you who pointed this out a few years ago.) > >>>> > >>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled > >>>> when the frequency changes. > >>>> > >>>> But arch/arm/lib/delay-loop.S starts by loading the current value of > >>>> loops_per_jiffy, computes the number of times to loop, and then loops. 
> >>>> If the frequency increases when the core is in __loop_delay, the > >>>> delay will be much shorter than requested. > >>>> > >>>> Is this a correct assessment of the situation? > >>> > >>> Absolutely correct, and it's something that people are aware of, and > >>> have already catered for while writing their drivers. > >> > >> In their cpufreq driver? > >> In "real" device drivers that happen to use delays? > >> > >> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz. > >> If the frequency increases at the beginning of __loop_delay, udelay(100) > >> would spin only 10 microseconds. This is likely to cause issues in > >> any driver using udelay. > >> > >> How does one cater for that? > > > > You make sure your delays are based on a stable hardware timer. > > Most platforms nowadays should have a suitable timer source. > > So you propose fixing loop-based delays by using clock-based delays, > is that correct? (That is indeed what I did on my platform.) > > Russell stated that there are platforms using loop-based delays with > cpufreq enabled. I'm asking how they manage the brokenness. Look at cpufreq_callback() in arch/arm/kernel/smp.c. Nicolas ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 16:47 ` Nicolas Pitre @ 2017-11-16 16:51 ` Marc Gonzalez 2017-11-16 17:00 ` Nicolas Pitre 0 siblings, 1 reply; 41+ messages in thread From: Marc Gonzalez @ 2017-11-16 16:51 UTC (permalink / raw) To: Nicolas Pitre Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On 16/11/2017 17:47, Nicolas Pitre wrote: > Look at cpufreq_callback() in arch/arm/kernel/smp.c. Are you pointing at the scaling of loops_per_jiffy done in that function? As I wrote earlier: If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled when the frequency changes. But arch/arm/lib/delay-loop.S starts by loading the current value of loops_per_jiffy, computes the number of times to loop, and then loops. If the frequency increases when the core is in __loop_delay, the delay will be much shorter than requested. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-16 16:51 ` Marc Gonzalez @ 2017-11-16 17:00 ` Nicolas Pitre 0 siblings, 0 replies; 41+ messages in thread From: Nicolas Pitre @ 2017-11-16 17:00 UTC (permalink / raw) To: Marc Gonzalez Cc: Russell King - ARM Linux, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason On Thu, 16 Nov 2017, Marc Gonzalez wrote: > On 16/11/2017 17:47, Nicolas Pitre wrote: > > > Look at cpufreq_callback() in arch/arm/kernel/smp.c. > > Are you pointing at the scaling of loops_per_jiffy done in that function? > > As I wrote earlier: > > If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled > when the frequency changes. > > But arch/arm/lib/delay-loop.S starts by loading the current value of > loops_per_jiffy, computes the number of times to loop, and then loops. > If the frequency increases when the core is in __loop_delay, the > delay will be much shorter than requested. The callback is invoked with CPUFREQ_PRECHANGE before the actual frequency increase. If your CPU clock is per core, then you won't be in the middle of the delay loop when this happens, unless you change your core clock from an interrupt handler. If your CPU clock is common to all cores then you are screwed. In that case the only way out is a hardware timer based delay. Nicolas ^ permalink raw reply [flat|nested] 41+ messages in thread
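The race Marc and Nicolas discuss can be illustrated with a toy model (a sketch, not the actual delay-loop code; the frequencies are the ones from Marc's 120 MHz to 1.2 GHz example): the iteration count is derived from loops_per_jiffy once, at entry, so if the core speeds up while the loop is running, the spin finishes early.

```c
/* Toy model of the loop-based udelay race: the iteration count is
 * sized from loops_per_jiffy once, on entry, so a frequency jump
 * while the loop is running shortens the real delay. Frequencies
 * here are illustrative, expressed in MHz for simplicity. */
static unsigned int actual_delay_us(unsigned int requested_us,
                                    unsigned int mhz_at_entry,
                                    unsigned int mhz_during_loop)
{
    /* iterations computed against the frequency seen at entry... */
    unsigned long loops = (unsigned long)requested_us * mhz_at_entry;
    /* ...but each iteration retires at the (possibly faster) new clock */
    return (unsigned int)(loops / mhz_during_loop);
}
```

A 10x frequency ramp mid-loop turns a requested 100 µs into roughly 10 µs, which is the figure quoted in the thread; the CPUFREQ_PRECHANGE rescaling of loops_per_jiffy cannot help a loop that has already loaded the old value.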
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-15 13:13 ` Russell King - ARM Linux 2017-11-16 15:26 ` Marc Gonzalez @ 2017-12-07 12:43 ` Pavel Machek 1 sibling, 0 replies; 41+ messages in thread From: Pavel Machek @ 2017-12-07 12:43 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Marc Gonzalez, Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason > On Wed, Nov 15, 2017 at 01:51:54PM +0100, Marc Gonzalez wrote: > > On 01/11/2017 20:38, Marc Gonzalez wrote: > > > > > OK, I'll just send my patch, and then crawl back under my rock. > > > > Linus, > > > > As promised, the patch is provided below. And as promised, I will > > no longer bring this up on LKML. > > > > FWIW, I have checked that the computed value matches the expected > > value for all HZ and delay_us, and for a few clock frequencies, > > using the following program: > > > > $ cat delays.c > > #include <stdio.h> > > #define MEGA 1000000u > > typedef unsigned int uint; > > typedef unsigned long long u64; > > #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d)) > > > > static const uint HZ_tab[] = { 100, 250, 300, 1000 }; > > > > static void check_cycle_count(uint freq, uint HZ, uint delay_us) > > { > > uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA); > > uint lpj = DIV_ROUND_UP(freq, HZ); > > uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1; > > uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA); > > > > if (computed != expected) > > printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected); > > } > > > > int main(void) > > { > > uint idx, delay_us, freq; > > > > for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA) > > for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; 
++idx) > > for (delay_us = 1; delay_us <= 2000; ++delay_us) > > check_cycle_count(freq, HZ_tab[idx], delay_us); > > > > return 0; > > } > > > > > > > > -- >8 -- > > Subject: [PATCH] ARM: Tweak clock-based udelay implementation > > > > In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy") > > Russell pointed out that loop-based delays may return early. > > > > On the arm platform, delays may be either loop-based or clock-based. > > > > This patch tweaks the clock-based implementation so that udelay(N) > > is guaranteed to spin at least N microseconds. > > As I've already said, I don't want this, because it encourages people > to use too-small delays in driver code, and if we merge it then you > will look at your data sheet, decide it says "you need to wait 10us" > and write in your driver "udelay(10)" which will break on the loops > based delay. > > udelay() needs to offer a consistent interface so that drivers know > what to expect no matter what the implementation is. Making one > implementation conform to your ideas while leaving the other > implementations with other expectations is a recipe for bugs. udelay() needs to be consistent across platforms, and yes, udelay(10) is expected to delay at least 10usec. If that is not true on your platform, _fix your platform_. But it is not valid to reject patches fixing other platforms, just because your platform is broken. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-15 12:51 ` Marc Gonzalez 2017-11-15 13:13 ` Russell King - ARM Linux @ 2017-11-15 18:45 ` Doug Anderson 1 sibling, 0 replies; 41+ messages in thread From: Doug Anderson @ 2017-11-15 18:45 UTC (permalink / raw) To: Marc Gonzalez Cc: Linus Torvalds, Alan Cox, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Boris Brezillon, Thibaud Cornic, Mason Hi, On Wed, Nov 15, 2017 at 4:51 AM, Marc Gonzalez <marc_gonzalez@sigmadesigns.com> wrote: > On 01/11/2017 20:38, Marc Gonzalez wrote: > >> OK, I'll just send my patch, and then crawl back under my rock. > > Linus, > > As promised, the patch is provided below. And as promised, I will > no longer bring this up on LKML. > > FWIW, I have checked that the computed value matches the expected > value for all HZ and delay_us, and for a few clock frequencies, > using the following program: > > $ cat delays.c > #include <stdio.h> > #define MEGA 1000000u > typedef unsigned int uint; > typedef unsigned long long u64; > #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d)) > > static const uint HZ_tab[] = { 100, 250, 300, 1000 }; > > static void check_cycle_count(uint freq, uint HZ, uint delay_us) > { > uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA); > uint lpj = DIV_ROUND_UP(freq, HZ); > uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1; > uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA); > > if (computed != expected) > printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected); > } > > int main(void) > { > uint idx, delay_us, freq; > > for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA) > for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx) > for (delay_us = 1; delay_us <= 2000; ++delay_us) > check_cycle_count(freq, 
HZ_tab[idx], delay_us); > > return 0; > } > > > > -- >8 -- > Subject: [PATCH] ARM: Tweak clock-based udelay implementation > > In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy") > Russell pointed out that loop-based delays may return early. > > On the arm platform, delays may be either loop-based or clock-based. > > This patch tweaks the clock-based implementation so that udelay(N) > is guaranteed to spin at least N microseconds. > > Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> > --- > arch/arm/lib/delay.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) As I have indicated in the past, I'm not a believer in the "don't fix bug A because bug B is still there" argument. From the statements "platform code could try to make their udelay/ndelay() be as good as it can be on a particular platform" and "I'm very much open to udelay improvements, and if somebody sends patches for particular platforms to do particularly well on that platform" it's my understanding that this is consistent with Linus's opinion. Since Marc's bugfix seems good and valid: Reviewed-by: Douglas Anderson <dianders@chromium.org> Marc's bugfix would immediately be useful if you happened to know your driver was only running on a system that was using a timer-based udelay on ARM. Marc's bugfix could also form the basis of future patches that extended the udelay() API to somehow express the error, as Linus suggested by saying "we could maybe export some interface to give estimated errors so that drivers could then try to correct for them depending on just how much they care". -Doug ^ permalink raw reply [flat|nested] 41+ messages in thread
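For the curious, the UDELAY_MULT constant in the quoted test program is a fixed-point encoding of 2^31 / 10^6 scaled by HZ (2^31 / 10^6 = 2147.483648, hence the 2147 and 483648 terms), which lets a division by 10^6 be replaced with a multiply and a 31-bit shift. A standalone restatement of that arithmetic (a sketch mirroring the quoted delays.c; the helper name is hypothetical):

```c
#include <stdint.h>

/* Fixed-point form of cycles = lpj * HZ * delay_us / 10^6:
 * UDELAY_MULT ~= HZ * 2^31 / 10^6, so multiplying and shifting
 * right by 31 avoids the division; the trailing +1 keeps the
 * rounding error on the "too long" side. */
static uint32_t cycles_for_delay(uint32_t lpj, uint32_t delay_us, uint32_t hz)
{
    uint32_t udelay_mult = 2147 * hz + 483648 * hz / 1000000;
    return (uint32_t)(((uint64_t)lpj * delay_us * udelay_mult) >> 31) + 1;
}
```

For example, with freq = 3 MHz and HZ = 100 (so lpj = 30000), a 1000 µs request yields the 3000 cycles that the quoted program's DIV_ROUND_UP reference computation expects.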
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible 2017-11-01 19:03 ` Marc Gonzalez 2017-11-01 19:09 ` Linus Torvalds @ 2017-11-01 19:36 ` Alan Cox 2017-11-01 19:39 ` Thomas Gleixner 2017-11-01 19:48 ` Baruch Siach 3 siblings, 0 replies; 41+ messages in thread From: Alan Cox @ 2017-11-01 19:36 UTC (permalink / raw) To: Marc Gonzalez Cc: Linus Torvalds, LKML, Linux ARM, Steven Rostedt, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, John Stultz, Douglas Anderson, Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin, Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette, Stephen Boyd, Mason > > serialized link. Writes get delayed, they can bunch together, busses do > > posting and queueing. > > Are you talking about the actual delay operation, or the pokes around it? All of it. A write from the CPU except on the lowest end ones isn't necessarily going to occur in strict sequential order as expressed by the program. The write even with cache bypassing goes to whatever bus interface unit is involved and then the signal goes on (eventually) to the device. The chances are the state machine for the external bus isn't the same as the internal one and already does things like posting so the CPU isn't stalled for the write to complete on the slow external bus. All your bus clocks have errors, even more so if spread spectrum is in use to reduce the noise. For some bus encoding transferring the data takes a subtly different amount of time according to the value. The reality for most machines is that writing to a device rather more resembles one of those games where your ball descends at an unpredictable rate bouncing off pillars to score points, than is implied by the diagrams. > I don't think "accurate" is the proper term. > Over-delays are fine, under-delays are problematic. Not often. The people who characterized the silicon will have added a safety margin too. They are always quoting +/- something if you dig in the small print or ask. 
> > For that matter given the bad blocks don't randomly change why not cache > > them ? > > That's a good question, I'll ask the NAND framework maintainer. > Store them where, by the way? On the NAND chip itself? I guess the ideal case would be to store it with magic numbers on one of a certain number of blocks (in case the default one to hold it is itself a bad block) ? > > >> My current NAND chips are tiny (2 x 512 MB) but with larger chips, > >> the number of calls to ndelay would climb to 10^6 and the delay > >> increase to 1 second, with is starting to be a problem. And this is another reason I think that worrying about ndelay is not the answer. If you've got multiple threads you can bring it up asynchronously, if you've got multiple buses (ok you don't) then you can halve or quarter the time needed. If you've got them cached you can nearly eliminate it. Compared with those scraping a few percent by fine tuning ndelay doesn't look such a good return ? Alan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:03 ` Marc Gonzalez
  2017-11-01 19:09   ` Linus Torvalds
  2017-11-01 19:36   ` Alan Cox
@ 2017-11-01 19:39   ` Thomas Gleixner
  2017-11-01 19:48   ` Baruch Siach
  3 siblings, 0 replies; 41+ messages in thread
From: Thomas Gleixner @ 2017-11-01 19:39 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Alan Cox, Linus Torvalds, LKML, Linux ARM, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, John Stultz, Douglas Anderson,
	Nicolas Pitre, Mark Rutland, Will Deacon, Jonathan Austin,
	Arnd Bergmann, Kevin Hilman, Russell King, Michael Turquette,
	Stephen Boyd, Mason

On Wed, 1 Nov 2017, Marc Gonzalez wrote:
> On 01/11/2017 18:53, Alan Cox wrote:
> > On Tue, 31 Oct 2017 17:15:34 +0100
> >
> >> Therefore, users are accustomed to having delays be longer (within a reasonable margin).
> >> However, very few users would expect delays to be *shorter* than requested.
> >
> > If your udelay can be under by 10% then just bump the number by 10%.
>
> Except it's not *quite* that simple.
> Error has both an absolute and a relative component.
> So the actual value matters, and it's not always a constant.
>
> For example:
> http://elixir.free-electrons.com/linux/latest/source/drivers/mtd/nand/nand_base.c#L814

That example is really pointless, because the only effect it has is to
delay the read of the ready/busy pin long enough to bridge the silly gap
between

        |-----------------------
CS _____|

and

    -------|
R/B        |_____________________

IIRC, the gap is in the low single-digit nsec range, so it really does not
matter how long you delay here. In fact, on most older CPUs, executing the
three instructions was sufficient even if the delay returned immediately.

Any real NAND controller, not the stupid PIO-based ones we had 15 years
ago, should not even make it necessary to handle that on the software
side. Simply because the controller should already take care of that and
not blindly reflect the status of the R/B pin in the status register.
If your hardware does that, then poke that HW dude on the desk next to
yours with a big cluestick.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:03 ` Marc Gonzalez
  ` (2 preceding siblings ...)
  2017-11-01 19:39 ` Thomas Gleixner
@ 2017-11-01 19:48 ` Baruch Siach
  2017-11-02 16:12   ` Boris Brezillon
  3 siblings, 1 reply; 41+ messages in thread
From: Baruch Siach @ 2017-11-01 19:48 UTC (permalink / raw)
To: Marc Gonzalez
Cc: Alan Cox, Mark Rutland, Mason, Russell King, Jonathan Austin,
	Arnd Bergmann, Nicolas Pitre, Peter Zijlstra, Stephen Boyd,
	Michael Turquette, Kevin Hilman, Will Deacon, LKML,
	Steven Rostedt, Douglas Anderson, John Stultz, Thomas Gleixner,
	Linus Torvalds, Ingo Molnar, Linux ARM, linux-mtd

Hi Marc,

On Wed, Nov 01, 2017 at 08:03:20PM +0100, Marc Gonzalez wrote:
> On 01/11/2017 18:53, Alan Cox wrote:
> > For that matter given the bad blocks don't randomly change why not cache
> > them ?
>
> That's a good question, I'll ask the NAND framework maintainer.
> Store them where, by the way? On the NAND chip itself?

Yes. In the bad block table (bbt). See drivers/mtd/nand/nand_bbt.c.

baruch

--
     http://baruch.siach.name/blog/                  ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC] Improving udelay/ndelay on platforms where that is possible
  2017-11-01 19:48 ` Baruch Siach
@ 2017-11-02 16:12   ` Boris Brezillon
  0 siblings, 0 replies; 41+ messages in thread
From: Boris Brezillon @ 2017-11-02 16:12 UTC (permalink / raw)
To: Baruch Siach
Cc: Marc Gonzalez, Mark Rutland, Alan Cox, Russell King,
	Jonathan Austin, Arnd Bergmann, Mason, Peter Zijlstra,
	Will Deacon, Michael Turquette, Nicolas Pitre, Stephen Boyd,
	LKML, Steven Rostedt, Douglas Anderson, Kevin Hilman,
	John Stultz, linux-mtd, Thomas Gleixner, Linus Torvalds,
	Ingo Molnar, Linux ARM

On Wed, 1 Nov 2017 21:48:22 +0200
Baruch Siach <baruch@tkos.co.il> wrote:

> Hi Marc,
>
> On Wed, Nov 01, 2017 at 08:03:20PM +0100, Marc Gonzalez wrote:
> > On 01/11/2017 18:53, Alan Cox wrote:
> > > For that matter given the bad blocks don't randomly change why not cache
> > > them ?
> >
> > That's a good question, I'll ask the NAND framework maintainer.
> > Store them where, by the way? On the NAND chip itself?
>
> Yes. In the bad block table (bbt). See drivers/mtd/nand/nand_bbt.c.

Yes, you can cache this information in a bad block table stored on the
flash. But still, the ndelay()/udelay() problem remains: reading the
out-of-band area of each eraseblock to re-create the bad block table is
just one use case. This ndelay() is something you can have on all kinds
of read/write operations.

As Thomas Gleixner stated, this is not really needed on modern NAND
controllers, which take care of various timing constraints internally,
but still, we have some controllers that are not that smart, and we have
to support them.

I'm not concerned about performance here, and if I'm told that we should
turn all ndelay() calls into usleep_range() ones, then I'm perfectly fine
with that, but I need a guarantee that when I say "I want to wait at
least X ns/us", the function does not return before that time has
expired.
Not sure if that would work, but maybe we could create a wrapper like:

void nand_ndelay(unsigned long nanoseconds)
{
	ktime_t end = ktime_add_ns(ktime_get(), nanoseconds);

	do {
		ndelay(nanoseconds);
	} while (ktime_before(ktime_get(), end));
}

^ permalink raw reply	[flat|nested] 41+ messages in thread
end of thread, other threads:[~2017-12-07 12:43 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31 16:15 [RFC] Improving udelay/ndelay on platforms where that is possible Marc Gonzalez
2017-10-31 16:44 ` Linus Torvalds
2017-10-31 16:56   ` Russell King - ARM Linux
2017-10-31 17:45     ` Linus Torvalds
2017-10-31 17:58       ` Linus Torvalds
2017-11-01  0:23         ` Doug Anderson
2017-11-01  9:26           ` Russell King - ARM Linux
2017-11-01 15:53             ` Doug Anderson
2017-12-07 12:31               ` Pavel Machek
2017-11-01 19:28           ` Marc Gonzalez
2017-11-01 20:30             ` Russell King - ARM Linux
2017-10-31 16:46 ` Russell King - ARM Linux
2017-11-01 17:53   ` Alan Cox
2017-11-01 19:03     ` Marc Gonzalez
2017-11-01 19:09       ` Linus Torvalds
2017-11-01 19:17         ` Linus Torvalds
2017-11-01 19:38         ` Marc Gonzalez
2017-11-15 12:51           ` Marc Gonzalez
2017-11-15 13:13             ` Russell King - ARM Linux
2017-11-16 15:26               ` Marc Gonzalez
2017-11-16 15:36                 ` Russell King - ARM Linux
2017-11-16 15:47                   ` Marc Gonzalez
2017-11-16 16:08                     ` Nicolas Pitre
2017-11-16 16:26                       ` Marc Gonzalez
2017-11-16 16:32                         ` Russell King - ARM Linux
2017-11-16 16:42                           ` Marc Gonzalez
2017-11-16 17:05                             ` Russell King - ARM Linux
2017-11-16 21:05                               ` Marc Gonzalez
2017-11-16 22:15                                 ` Doug Anderson
2017-11-16 23:22                                   ` Russell King - ARM Linux
2017-11-20 17:38                                     ` Doug Anderson
2017-11-20 18:31                                       ` Russell King - ARM Linux
2017-11-16 16:47                         ` Nicolas Pitre
2017-11-16 16:51                           ` Marc Gonzalez
2017-11-16 17:00                             ` Nicolas Pitre
2017-12-07 12:43                               ` Pavel Machek
2017-11-15 18:45               ` Doug Anderson
2017-11-01 19:36     ` Alan Cox
2017-11-01 19:39     ` Thomas Gleixner
2017-11-01 19:48     ` Baruch Siach
2017-11-02 16:12       ` Boris Brezillon