* [PATCH 0/2] Allwinner A64 timer workaround @ 2018-05-11 2:27 Samuel Holland 2018-05-11 2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland ` (3 more replies) 0 siblings, 4 replies; 30+ messages in thread From: Samuel Holland @ 2018-05-11 2:27 UTC (permalink / raw) To: Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier Cc: linux-arm-kernel, linux-kernel, linux-sunxi Hello, Several people (including me) have experienced extremely large system clock jumps on their A64-based devices, apparently due to the architectural timer going backward, which is interpreted by Linux as the timer wrapping around after 2^56 cycles. Investigation led to discovery of some obvious problems with this SoC's architectural timer, and this patch series introduces what I believe is the simplest workaround. More details are in the commit message for patch 1. Patch 2 simply enables the workaround in the device tree. Thanks, Samuel Samuel Holland (2): arm64: arch_timer: Workaround for Allwinner A64 timer instability arm64: dts: allwinner: a64: Enable A64 timer workaround arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 1 + drivers/clocksource/Kconfig | 11 ++++++++ drivers/clocksource/arm_arch_timer.c | 39 +++++++++++++++++++++++++++ 3 files changed, 51 insertions(+) -- 2.16.1 ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability 2018-05-11 2:27 [PATCH 0/2] Allwinner A64 timer workaround Samuel Holland @ 2018-05-11 2:27 ` Samuel Holland 2018-05-11 8:26 ` Maxime Ripard ` (2 more replies) 2018-05-11 2:27 ` [PATCH 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland ` (2 subsequent siblings) 3 siblings, 3 replies; 30+ messages in thread From: Samuel Holland @ 2018-05-11 2:27 UTC (permalink / raw) To: Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Samuel Holland The Allwinner A64 SoC is known [1] to have an unstable architectural timer, which manifests itself most obviously in the time jumping forward a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a timer frequency of 24 MHz, implying that the time went slightly backward (and this was interpreted by the kernel as it jumping forward and wrapping around past the epoch). Further investigation revealed instability in the low bits of CNTVCT at the point a high bit rolls over. This leads to power-of-two cycle forward and backward jumps. (Testing shows that forward jumps are about twice as likely as backward jumps.) Without trapping reads to CNTVCT, a userspace program is able to read it in a loop faster than it changes. A test program running on all 4 CPU cores that reported jumps larger than 100 ms was run for 13.6 hours and reported the following: Count | Event -------+--------------------------- 9940 | jumped backward 699ms 268 | jumped backward 1398ms 1 | jumped backward 2097ms 16020 | jumped forward 175ms 6443 | jumped forward 699ms 2976 | jumped forward 1398ms 9 | jumped forward 356516ms 9 | jumped forward 357215ms 4 | jumped forward 714430ms 1 | jumped forward 3578440ms This works out to a jump larger than 100 ms about every 5.5 seconds on each CPU core. The largest jump (almost an hour!) was the following sequence of reads: 0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000 Note that the middle bits don't necessarily all read as all zeroes or all ones during the anomalous behavior; however the low 11 bits checked by the function in this patch have never been observed with any other value. Also note that smaller jumps are much more common, with the smallest backward jumps of 2048 cycles observed over 400 times per second on each core. (Of course, this is partially due to lower bits rolling over more frequently.) Any one of these could have caused the 95 year time skip. Similar anomalies were observed while reading CNTPCT (after patching the kernel to allow reads from userspace). However, the jumps are much less frequent, and only small jumps were observed. The same program as before (except now reading CNTPCT) observed after 72 hours: Count | Event -------+--------------------------- 17 | jumped backward 699ms 52 | jumped forward 175ms 2831 | jumped forward 699ms 5 | jumped forward 1398ms ======================================================================== Because the CPU can read the CNTPCT/CNTVCT registers faster than they change, performing two reads of the register and comparing the high bits (like other workarounds) is not a workable solution. And because the timer can jump both forward and backward, no pair of reads can distinguish a good value from a bad one. The only way to guarantee a good value from consecutive reads would be to read _three_ times, and take the middle value iff the three values are 1) individually unique and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if an anomaly is detected. However, since there is a distinct pattern to the bad values, we can optimize the common case (2046/2048 of the time) to a single read by simply ignoring values that match the pattern. This still takes no more than 3 cycles in the worst case, and requires much less code. [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9 [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/ [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26 Signed-off-by: Samuel Holland <samuel@sholland.org> --- drivers/clocksource/Kconfig | 11 ++++++++++ drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index 8e8a09755d10..7a5d434dd30b 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921 The workaround will be dynamically enabled when an affected core is detected. +config SUN50I_A64_UNSTABLE_TIMER + bool "Workaround for Allwinner A64 timer instability" + default y + depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI + select ARM_ARCH_TIMER_OOL_WORKAROUND + help + This option enables a workaround for instability in the timer on + the Allwinner A64 SoC. The workaround will only be active if the + allwinner,sun50i-a64-unstable-timer property is found in the + timer node. + config ARM_GLOBAL_TIMER bool "Support for the ARM global timer" if COMPILE_TEST select TIMER_OF if OF diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c index 57cb2f00fc07..66ce13578c52 100644 --- a/drivers/clocksource/arm_arch_timer.c +++ b/drivers/clocksource/arm_arch_timer.c @@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void) } #endif +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER +/* + * The low bits of each register can transiently read as all ones or all zeroes + * when bit 11 or greater rolls over. Since the value can jump both backward + * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just + * ignore register values with all ones or zeros in the low bits. + */ +static u64 notrace sun50i_a64_read_cntpct_el0(void) +{ + u64 val; + + do { + val = read_sysreg(cntpct_el0); + } while (((val + 1) & GENMASK(10, 0)) <= 1); + + return val; +} + +static u64 notrace sun50i_a64_read_cntvct_el0(void) +{ + u64 val; + + do { + val = read_sysreg(cntvct_el0); + } while (((val + 1) & GENMASK(10, 0)) <= 1); + + return val; +} +#endif + #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround); EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround); @@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = { .read_cntvct_el0 = arm64_858921_read_cntvct_el0, }, #endif +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER + { + .match_type = ate_match_dt, + .id = "allwinner,sun50i-a64-unstable-timer", + .desc = "Allwinner A64 timer instability", + .read_cntpct_el0 = sun50i_a64_read_cntpct_el0, + .read_cntvct_el0 = sun50i_a64_read_cntvct_el0, + }, +#endif }; typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *, -- 2.16.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability 2018-05-11 2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland @ 2018-05-11 8:26 ` Maxime Ripard 2018-05-11 8:48 ` Marc Zyngier 2018-05-26 15:15 ` André Przywara 2 siblings, 0 replies; 30+ messages in thread From: Maxime Ripard @ 2018-05-11 8:26 UTC (permalink / raw) To: Samuel Holland Cc: Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier, linux-arm-kernel, linux-kernel, linux-sunxi [-- Attachment #1: Type: text/plain, Size: 4259 bytes --] On Thu, May 10, 2018 at 09:27:50PM -0500, Samuel Holland wrote: > The Allwinner A64 SoC is known [1] to have an unstable architectural > timer, which manifests itself most obviously in the time jumping forward > a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a > timer frequency of 24 MHz, implying that the time went slightly backward > (and this was interpreted by the kernel as it jumping forward and > wrapping around past the epoch). > > Further investigation revealed instability in the low bits of CNTVCT at > the point a high bit rolls over. This leads to power-of-two cycle > forward and backward jumps. (Testing shows that forward jumps are about > twice as likely as backward jumps.) > > Without trapping reads to CNTVCT, a userspace program is able to read it > in a loop faster than it changes. A test program running on all 4 CPU > cores that reported jumps larger than 100 ms was run for 13.6 hours and > reported the following: > > Count | Event > -------+--------------------------- > 9940 | jumped backward 699ms > 268 | jumped backward 1398ms > 1 | jumped backward 2097ms > 16020 | jumped forward 175ms > 6443 | jumped forward 699ms > 2976 | jumped forward 1398ms > 9 | jumped forward 356516ms > 9 | jumped forward 357215ms > 4 | jumped forward 714430ms > 1 | jumped forward 3578440ms > > This works out to a jump larger than 100 ms about every 5.5 seconds on > each CPU core. > > The largest jump (almost an hour!) was the following sequence of reads: > 0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000 > > Note that the middle bits don't necessarily all read as all zeroes or > all ones during the anomalous behavior; however the low 11 bits checked > by the function in this patch have never been observed with any other > value. > > Also note that smaller jumps are much more common, with the smallest > backward jumps of 2048 cycles observed over 400 times per second on each > core. (Of course, this is partially due to lower bits rolling over more > frequently.) Any one of these could have caused the 95 year time skip. > > Similar anomalies were observed while reading CNTPCT (after patching the > kernel to allow reads from userspace). However, the jumps are much less > frequent, and only small jumps were observed. The same program as before > (except now reading CNTPCT) observed after 72 hours: > > Count | Event > -------+--------------------------- > 17 | jumped backward 699ms > 52 | jumped forward 175ms > 2831 | jumped forward 699ms > 5 | jumped forward 1398ms > > ======================================================================== > > Because the CPU can read the CNTPCT/CNTVCT registers faster than they > change, performing two reads of the register and comparing the high bits > (like other workarounds) is not a workable solution. And because the > timer can jump both forward and backward, no pair of reads can > distinguish a good value from a bad one. The only way to guarantee a > good value from consecutive reads would be to read _three_ times, and > take the middle value iff the three values are 1) individually unique > and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if > an anomaly is detected. > > However, since there is a distinct pattern to the bad values, we can > optimize the common case (2046/2048 of the time) to a single read by > simply ignoring values that match the pattern. This still takes no more > than 3 cycles in the worst case, and requires much less code. That's an awesome commit log, thanks! For both patches: Acked-by: Maxime Ripard <maxime.ripard@bootlin.com> > [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9 > [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/ Sigh. So armbian knew about this for more than a year and had a fix, and didn't judge necessary to report it anywhere. That's some solid, responsible, development right there... Maxime -- Maxime Ripard, Bootlin (formerly Free Electrons) Embedded Linux and Kernel engineering https://bootlin.com [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability 2018-05-11 2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland 2018-05-11 8:26 ` Maxime Ripard @ 2018-05-11 8:48 ` Marc Zyngier 2018-05-11 15:08 ` Samuel Holland 2018-05-26 15:15 ` André Przywara 2 siblings, 1 reply; 30+ messages in thread From: Marc Zyngier @ 2018-05-11 8:48 UTC (permalink / raw) To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland [+Mark, who co-maintains the arch timer code with me] Hi Samuel, On 11/05/18 03:27, Samuel Holland wrote: > The Allwinner A64 SoC is known [1] to have an unstable architectural > timer, which manifests itself most obviously in the time jumping forward > a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a > timer frequency of 24 MHz, implying that the time went slightly backward > (and this was interpreted by the kernel as it jumping forward and > wrapping around past the epoch). > > Further investigation revealed instability in the low bits of CNTVCT at > the point a high bit rolls over. This leads to power-of-two cycle > forward and backward jumps. (Testing shows that forward jumps are about > twice as likely as backward jumps.) > > Without trapping reads to CNTVCT, a userspace program is able to read it > in a loop faster than it changes. A test program running on all 4 CPU > cores that reported jumps larger than 100 ms was run for 13.6 hours and > reported the following: > > Count | Event > -------+--------------------------- > 9940 | jumped backward 699ms > 268 | jumped backward 1398ms > 1 | jumped backward 2097ms > 16020 | jumped forward 175ms > 6443 | jumped forward 699ms > 2976 | jumped forward 1398ms > 9 | jumped forward 356516ms > 9 | jumped forward 357215ms > 4 | jumped forward 714430ms > 1 | jumped forward 3578440ms > > This works out to a jump larger than 100 ms about every 5.5 seconds on > each CPU core. > > The largest jump (almost an hour!) was the following sequence of reads: > 0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000 > > Note that the middle bits don't necessarily all read as all zeroes or > all ones during the anomalous behavior; however the low 11 bits checked > by the function in this patch have never been observed with any other > value. > > Also note that smaller jumps are much more common, with the smallest > backward jumps of 2048 cycles observed over 400 times per second on each > core. (Of course, this is partially due to lower bits rolling over more > frequently.) Any one of these could have caused the 95 year time skip. > > Similar anomalies were observed while reading CNTPCT (after patching the > kernel to allow reads from userspace). However, the jumps are much less > frequent, and only small jumps were observed. The same program as before > (except now reading CNTPCT) observed after 72 hours: > > Count | Event > -------+--------------------------- > 17 | jumped backward 699ms > 52 | jumped forward 175ms > 2831 | jumped forward 699ms > 5 | jumped forward 1398ms > > ======================================================================== > > Because the CPU can read the CNTPCT/CNTVCT registers faster than they > change, performing two reads of the register and comparing the high bits > (like other workarounds) is not a workable solution. And because the > timer can jump both forward and backward, no pair of reads can > distinguish a good value from a bad one. The only way to guarantee a > good value from consecutive reads would be to read _three_ times, and > take the middle value iff the three values are 1) individually unique > and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if > an anomaly is detected. > > However, since there is a distinct pattern to the bad values, we can > optimize the common case (2046/2048 of the time) to a single read by > simply ignoring values that match the pattern. This still takes no more > than 3 cycles in the worst case, and requires much less code. Thanks for the thorough description of the problem. A couple of questions: - Have the 000/7ff values of the bottom bits only been experimentally found? Or do you have more concrete evidence of this is what happens in the HW? - Do you have an official erratum number from the silicon vendor, so that Documentation/arm64/silicon-errata.txt can be kept up to date? > > [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9 > [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/ > [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26 > > Signed-off-by: Samuel Holland <samuel@sholland.org> > --- > drivers/clocksource/Kconfig | 11 ++++++++++ > drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++ > 2 files changed, 50 insertions(+) > > diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig > index 8e8a09755d10..7a5d434dd30b 100644 > --- a/drivers/clocksource/Kconfig > +++ b/drivers/clocksource/Kconfig > @@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921 > The workaround will be dynamically enabled when an affected > core is detected. > > +config SUN50I_A64_UNSTABLE_TIMER > + bool "Workaround for Allwinner A64 timer instability" > + default y > + depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI > + select ARM_ARCH_TIMER_OOL_WORKAROUND > + help > + This option enables a workaround for instability in the timer on > + the Allwinner A64 SoC. The workaround will only be active if the > + allwinner,sun50i-a64-unstable-timer property is found in the > + timer node. > + > config ARM_GLOBAL_TIMER > bool "Support for the ARM global timer" if COMPILE_TEST > select TIMER_OF if OF > diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c > index 57cb2f00fc07..66ce13578c52 100644 > --- a/drivers/clocksource/arm_arch_timer.c > +++ b/drivers/clocksource/arm_arch_timer.c > @@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void) > } > #endif > > +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER > +/* > + * The low bits of each register can transiently read as all ones or all zeroes > + * when bit 11 or greater rolls over. Since the value can jump both backward > + * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just > + * ignore register values with all ones or zeros in the low bits. > + */ > +static u64 notrace sun50i_a64_read_cntpct_el0(void) > +{ > + u64 val; > + > + do { > + val = read_sysreg(cntpct_el0); > + } while (((val + 1) & GENMASK(10, 0)) <= 1); Other workarounds of the same kind have a bounded loop. Have you done any investigation on how many loops you need at most to sort the timer? Depending on how many loops you need, it might be worth sticking a WFE here to just wait for something to happen instead of busy looping. > + > + return val; > +} > + > +static u64 notrace sun50i_a64_read_cntvct_el0(void) > +{ > + u64 val; > + > + do { > + val = read_sysreg(cntvct_el0); > + } while (((val + 1) & GENMASK(10, 0)) <= 1); > + > + return val; > +} > +#endif > + > #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND > DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround); > EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround); > @@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = { > .read_cntvct_el0 = arm64_858921_read_cntvct_el0, > }, > #endif > +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER > + { > + .match_type = ate_match_dt, > + .id = "allwinner,sun50i-a64-unstable-timer", > + .desc = "Allwinner A64 timer instability", > + .read_cntpct_el0 = sun50i_a64_read_cntpct_el0, > + .read_cntvct_el0 = sun50i_a64_read_cntvct_el0, > + }, > +#endif > }; > > typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *, > Thanks, M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability 2018-05-11 8:48 ` Marc Zyngier @ 2018-05-11 15:08 ` Samuel Holland 0 siblings, 0 replies; 30+ messages in thread From: Samuel Holland @ 2018-05-11 15:08 UTC (permalink / raw) To: Marc Zyngier, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 05/11/18 03:48, Marc Zyngier wrote: > [+Mark, who co-maintains the arch timer code with me] > > Hi Samuel, > > On 11/05/18 03:27, Samuel Holland wrote: >> The Allwinner A64 SoC is known [1] to have an unstable architectural >> timer, which manifests itself most obviously in the time jumping forward >> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a >> timer frequency of 24 MHz, implying that the time went slightly backward >> (and this was interpreted by the kernel as it jumping forward and >> wrapping around past the epoch). >> >> Further investigation revealed instability in the low bits of CNTVCT at >> the point a high bit rolls over. This leads to power-of-two cycle >> forward and backward jumps. (Testing shows that forward jumps are about >> twice as likely as backward jumps.) >> >> Without trapping reads to CNTVCT, a userspace program is able to read it >> in a loop faster than it changes. A test program running on all 4 CPU >> cores that reported jumps larger than 100 ms was run for 13.6 hours and >> reported the following: >> >> Count | Event >> -------+--------------------------- >> 9940 | jumped backward 699ms >> 268 | jumped backward 1398ms >> 1 | jumped backward 2097ms >> 16020 | jumped forward 175ms >> 6443 | jumped forward 699ms >> 2976 | jumped forward 1398ms >> 9 | jumped forward 356516ms >> 9 | jumped forward 357215ms >> 4 | jumped forward 714430ms >> 1 | jumped forward 3578440ms >> >> This works out to a jump larger than 100 ms about every 5.5 seconds on >> each CPU core. >> >> The largest jump (almost an hour!) was the following sequence of reads: >> 0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000 >> >> Note that the middle bits don't necessarily all read as all zeroes or >> all ones during the anomalous behavior; however the low 11 bits checked >> by the function in this patch have never been observed with any other >> value. >> >> Also note that smaller jumps are much more common, with the smallest >> backward jumps of 2048 cycles observed over 400 times per second on each >> core. (Of course, this is partially due to lower bits rolling over more >> frequently.) Any one of these could have caused the 95 year time skip. >> >> Similar anomalies were observed while reading CNTPCT (after patching the >> kernel to allow reads from userspace). However, the jumps are much less >> frequent, and only small jumps were observed. The same program as before >> (except now reading CNTPCT) observed after 72 hours: >> >> Count | Event >> -------+--------------------------- >> 17 | jumped backward 699ms >> 52 | jumped forward 175ms >> 2831 | jumped forward 699ms >> 5 | jumped forward 1398ms >> >> ======================================================================== >> >> Because the CPU can read the CNTPCT/CNTVCT registers faster than they >> change, performing two reads of the register and comparing the high bits >> (like other workarounds) is not a workable solution. And because the >> timer can jump both forward and backward, no pair of reads can >> distinguish a good value from a bad one. The only way to guarantee a >> good value from consecutive reads would be to read _three_ times, and >> take the middle value iff the three values are 1) individually unique >> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if >> an anomaly is detected. >> >> However, since there is a distinct pattern to the bad values, we can >> optimize the common case (2046/2048 of the time) to a single read by >> simply ignoring values that match the pattern. This still takes no more >> than 3 cycles in the worst case, and requires much less code. > > Thanks for the thorough description of the problem. A couple of questions: > > - Have the 000/7ff values of the bottom bits only been experimentally > found? Or do you have more concrete evidence of this is what happens in > the HW? Only experimentally found. Here's a sample of the larger jumps: CPU 3: 0x000000a5ebffffff → 0x000000a5edffffff (retry = 0x000000a5ec000000) CPU 1: 0x000000a5ebffffff → 0x000000a5edffffff (retry = 0x000000a5ec000000) CPU 1: 0x000000a5ecffffff > 0x000000a5ec000000 (retry = 0x000000a5ed000000) CPU 0: 0x000000a5eeffffff > 0x000000a5ee000000 (retry = 0x000000a5ef000000) CPU 2: 0x000000a5f0bfffff → 0x000000a5f0ffffff (retry = 0x000000a5f0c00000) CPU 0: 0x000000a5f0ffffff > 0x000000a5f0000000 (retry = 0x000000a5f1000000) CPU 2: 0x000000a5f3ffffff > 0x000000a5f2000000 (retry = 0x000000a5f4000000) CPU 0: 0x000000a5f6ffffff > 0x000000a5f6000000 (retry = 0x000000a5f7000000) CPU 1: 0x000000a5fbbfffff → 0x000000a5fbffffff (retry = 0x000000a5fbc00000) CPU 0: 0x000000a5fdffffff → 0x000000a5feffffff (retry = 0x000000a5fe000000) CPU 2: 0x000000a5ffffffff → 0x000000a7fe000000 (retry = 0x000000a600000000) CPU 3: 0x000000a5ffffffff → 0x000000a7fe000000 (retry = 0x000000a600000000) CPU 1: 0x000000a603bfffff → 0x000000a603ffffff (retry = 0x000000a603c00000) CPU 1: 0x000000a607bfffff → 0x000000a607ffffff (retry = 0x000000a607c00000) CPU 3: 0x000000a60cbfffff → 0x000000a60cffffff (retry = 0x000000a60cc00000) CPU 2: 0x000000a60cbfffff → 0x000000a60cffffff (retry = 0x000000a60cc00000) CPU 2: 0x000000a60dffffff → 0x000000a60effffff (retry = 0x000000a60e000000) CPU 0: 0x000000a60e3fffff → 0x000000a60e7fffff (retry = 0x000000a60e400000) CPU 1: 0x000000a60effffff > 0x000000a60e000000 (retry = 0x000000a60f000000) (Note that for the large jump, 0x000000a5ffffffff → 0x000000a7fe000000, not all of the low bits are zeroes during the transition.) And of the tiny jumps backward: CPU 3: 0x000000d2f010bfff > 0x000000d2f010b800 (retry = 0x000000d2f010b801) CPU 3: 0x000000d2f0110fff > 0x000000d2f0110800 (retry = 0x000000d2f0110801) CPU 3: 0x000000d2f0112fff > 0x000000d2f0112800 (retry = 0x000000d2f0112801) CPU 3: 0x000000d2f0119fff > 0x000000d2f0119800 (retry = 0x000000d2f0119801) CPU 3: 0x000000d2f011f7ff > 0x000000d2f011f000 (retry = 0x000000d2f011f800) CPU 3: 0x000000d2f01257ff > 0x000000d2f0125000 (retry = 0x000000d2f0125800) CPU 3: 0x000000d2f012b7ff > 0x000000d2f012b000 (retry = 0x000000d2f012b800) CPU 3: 0x000000d2f01317ff > 0x000000d2f0131000 (retry = 0x000000d2f0131800) CPU 2: 0x000000d2f00c17ff > 0x000000d2f00c1000 (retry = 0x000000d2f00c1800) CPU 2: 0x000000d2f0136fff > 0x000000d2f0136800 (retry = 0x000000d2f0136801) CPU 2: 0x000000d2f01377ff > 0x000000d2f0137000 (retry = 0x000000d2f0137800) CPU 2: 0x000000d2f013afff > 0x000000d2f013a800 (retry = 0x000000d2f013a801) CPU 2: 0x000000d2f01417ff > 0x000000d2f0141000 (retry = 0x000000d2f0141800) CPU 2: 0x000000d2f0142fff > 0x000000d2f0142800 (retry = 0x000000d2f0142801) CPU 2: 0x000000d2f01497ff > 0x000000d2f0149000 (retry = 0x000000d2f0149800) CPU 2: 0x000000d2f014b7ff > 0x000000d2f014b000 (retry = 0x000000d2f014b800) CPU 2: 0x000000d2f0150fff > 0x000000d2f0150800 (retry = 0x000000d2f0150801) CPU 2: 0x000000d2f01577ff > 0x000000d2f0157000 (retry = 0x000000d2f0157800) CPU 2: 0x000000d2f015d7ff > 0x000000d2f015d000 (retry = 0x000000d2f015d800) CPU 2: 0x000000d2f015f7ff > 0x000000d2f015f000 (retry = 0x000000d2f015f800) CPU 2: 0x000000d2f01617ff > 0x000000d2f0161000 (retry = 0x000000d2f0161800) CPU 2: 0x000000d2f0166fff > 0x000000d2f0166800 (retry = 0x000000d2f0166801) CPU 2: 0x000000d2f0168fff > 0x000000d2f0168800 (retry = 0x000000d2f0168801) CPU 2: 0x000000d2f016b7ff > 0x000000d2f016b000 (retry = 0x000000d2f016b800) I have close to 100 hours of data, and the "bottom 11 bits" value came from the observations that: 1) Jumps by 2048 cycles were the smallest jumps seen 2) Even when the anomalous values were not all ones/all zeroes, the bottom 11 bits _were_ all ones/all zeroes I'd be happy to run more tests if there's something else you think I should look for. > - Do you have an official erratum number from the silicon vendor, so > that Documentation/arm64/silicon-errata.txt can be kept up to date? No, unfortunately the only information we have from the vendor is in the form of BSP kernel code drops. There _is_ a workaround present in their official sources [1], but it looks to be for a different issue (one that I haven't experienced). [1] https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/e634a960316d/drivers/clocksource/arm_arch_timer.c#L272-L327 >> >> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9 >> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/ >> [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26 >> >> Signed-off-by: Samuel Holland <samuel@sholland.org> >> --- >> drivers/clocksource/Kconfig | 11 ++++++++++ >> drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++ >> 2 files changed, 50 insertions(+) >> >> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig >> index 8e8a09755d10..7a5d434dd30b 100644 >> --- a/drivers/clocksource/Kconfig >> +++ b/drivers/clocksource/Kconfig >> @@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921 >> The workaround will be dynamically enabled when an affected >> core is detected. >> >> +config SUN50I_A64_UNSTABLE_TIMER >> + bool "Workaround for Allwinner A64 timer instability" >> + default y >> + depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI >> + select ARM_ARCH_TIMER_OOL_WORKAROUND >> + help >> + This option enables a workaround for instability in the timer on >> + the Allwinner A64 SoC. The workaround will only be active if the >> + allwinner,sun50i-a64-unstable-timer property is found in the >> + timer node. >> + >> config ARM_GLOBAL_TIMER >> bool "Support for the ARM global timer" if COMPILE_TEST >> select TIMER_OF if OF >> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c >> index 57cb2f00fc07..66ce13578c52 100644 >> --- a/drivers/clocksource/arm_arch_timer.c >> +++ b/drivers/clocksource/arm_arch_timer.c >> @@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void) >> } >> #endif >> >> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER >> +/* >> + * The low bits of each register can transiently read as all ones or all zeroes >> + * when bit 11 or greater rolls over. Since the value can jump both backward >> + * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just >> + * ignore register values with all ones or zeros in the low bits. >> + */ >> +static u64 notrace sun50i_a64_read_cntpct_el0(void) >> +{ >> + u64 val; >> + >> + do { >> + val = read_sysreg(cntpct_el0); >> + } while (((val + 1) & GENMASK(10, 0)) <= 1); > > Other workarounds of the same kind have a bounded loop. Have you done > any investigation on how many loops you need at most to sort the timer? > Depending on how many loops you need, it might be worth sticking a WFE > here to just wait for something to happen instead of busy looping. The worst case delay inside the loop is two timer cycles at 24 MHz (83.3 ns). So the number of iterations depends on the CPU speed. For example: 0x???4ffe -> read once and return 0x???4fff -> read and loop (((val + 1) & GENMASK(10, 0)) == 0) 0x???5000 -> read and loop (((val + 1) & GENMASK(10, 0)) == 1) 0x???5001 -> read once and return Now of course this function could be preempted, in which case it most likely does even fewer reads. (The probability of "read 0x???4fff", "preempt", "read 0x???5fff" is pretty low.) I do not think it is possible to know if an 000/7ff value is valid without reading three unique values from the timer. So the latency entirely depends on the timer frequency, not the CPU speed. >> + >> + return val; >> +} >> + >> +static u64 notrace sun50i_a64_read_cntvct_el0(void) >> +{ >> + u64 val; >> + >> + do { >> + val = read_sysreg(cntvct_el0); >> + } while (((val + 1) & GENMASK(10, 0)) <= 1); >> + >> + return val; >> +} >> +#endif >> + >> #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND >> DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround); >> EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround); >> @@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = { >> .read_cntvct_el0 = arm64_858921_read_cntvct_el0, >> }, >> #endif >> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER >> + { >> + .match_type = ate_match_dt, >> + .id = "allwinner,sun50i-a64-unstable-timer", >> + .desc = "Allwinner A64 timer instability", >> + .read_cntpct_el0 = sun50i_a64_read_cntpct_el0, >> + .read_cntvct_el0 = sun50i_a64_read_cntvct_el0, >> + }, >> +#endif >> }; >> >> typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *, >> > > Thanks, > > M. > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability 2018-05-11 2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland 2018-05-11 8:26 ` Maxime Ripard 2018-05-11 8:48 ` Marc Zyngier @ 2018-05-26 15:15 ` André Przywara 2 siblings, 0 replies; 30+ messages in thread From: André Przywara @ 2018-05-26 15:15 UTC (permalink / raw) To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier Cc: linux-sunxi, linux-kernel, linux-arm-kernel, Mark Rutland On 05/11/2018 03:27 AM, Samuel Holland wrote: > The Allwinner A64 SoC is known [1] to have an unstable architectural > timer, which manifests itself most obviously in the time jumping forward > a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a > timer frequency of 24 MHz, implying that the time went slightly backward > (and this was interpreted by the kernel as it jumping forward and > wrapping around past the epoch). > > Further investigation revealed instability in the low bits of CNTVCT at > the point a high bit rolls over. This leads to power-of-two cycle > forward and backward jumps. (Testing shows that forward jumps are about > twice as likely as backward jumps.) > > Without trapping reads to CNTVCT, a userspace program is able to read it > in a loop faster than it changes. A test program running on all 4 CPU > cores that reported jumps larger than 100 ms was run for 13.6 hours and > reported the following: > > Count | Event > -------+--------------------------- > 9940 | jumped backward 699ms > 268 | jumped backward 1398ms > 1 | jumped backward 2097ms > 16020 | jumped forward 175ms > 6443 | jumped forward 699ms > 2976 | jumped forward 1398ms > 9 | jumped forward 356516ms > 9 | jumped forward 357215ms > 4 | jumped forward 714430ms > 1 | jumped forward 3578440ms > > This works out to a jump larger than 100 ms about every 5.5 seconds on > each CPU core. > > The largest jump (almost an hour!) was the following sequence of reads: > 0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000 > > Note that the middle bits don't necessarily all read as all zeroes or > all ones during the anomalous behavior; however the low 11 bits checked > by the function in this patch have never been observed with any other > value. > > Also note that smaller jumps are much more common, with the smallest > backward jumps of 2048 cycles observed over 400 times per second on each > core. (Of course, this is partially due to lower bits rolling over more > frequently.) Any one of these could have caused the 95 year time skip. > > Similar anomalies were observed while reading CNTPCT (after patching the > kernel to allow reads from userspace). However, the jumps are much less > frequent, and only small jumps were observed. The same program as before > (except now reading CNTPCT) observed after 72 hours: > > Count | Event > -------+--------------------------- > 17 | jumped backward 699ms > 52 | jumped forward 175ms > 2831 | jumped forward 699ms > 5 | jumped forward 1398ms > > ======================================================================== > > Because the CPU can read the CNTPCT/CNTVCT registers faster than they > change, performing two reads of the register and comparing the high bits > (like other workarounds) is not a workable solution. And because the > timer can jump both forward and backward, no pair of reads can > distinguish a good value from a bad one. The only way to guarantee a > good value from consecutive reads would be to read _three_ times, and > take the middle value iff the three values are 1) individually unique > and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if > an anomaly is detected. > > However, since there is a distinct pattern to the bad values, we can > optimize the common case (2046/2048 of the time) to a single read by > simply ignoring values that match the pattern. This still takes no more > than 3 cycles in the worst case, and requires much less code. Clever solution, and indeed much less costly than other workarounds. FWIW, I tested this on a Pine64 and can confirm that it works. I put a test program here: https://github.com/apritzel/pine64/blob/master/tools/test_timer.c This only checks for consecutive reads going backwards, but within a split second yells on an unpatched kernel: https://gist.github.com/apritzel/fc78ca6edb17be2024d5adfd35edb520 Applying the patch and adding the DT property fixed that for me. I second Marc's request for an upper bound on the loop. Question is just what we do when we reach the loop count limit? But regardless, given that this fixes this nasty issue: Tested-by: Andre Przywara <andre.przywara@arm.com> Cheers, Andre. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround 2018-05-11 2:27 [PATCH 0/2] Allwinner A64 timer workaround Samuel Holland 2018-05-11 2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland @ 2018-05-11 2:27 ` Samuel Holland 2018-05-11 9:24 ` [PATCH 0/2] Allwinner " Andre Przywara 2018-07-03 15:09 ` Marc Zyngier 3 siblings, 0 replies; 30+ messages in thread From: Samuel Holland @ 2018-05-11 2:27 UTC (permalink / raw) To: Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Samuel Holland As instability in the architectural timer has been observed on multiple devices using this SoC, inluding the Pine64 and the Orange Pi Win, enable the workaround in the SoC's device tree. Signed-off-by: Samuel Holland <samuel@sholland.org> --- arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi index 1b2ef28c42bd..5202b76e9684 100644 --- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi @@ -152,6 +152,7 @@ timer { compatible = "arm,armv8-timer"; + allwinner,sun50i-a64-unstable-timer; interrupts = <GIC_PPI 13 (GIC_CPU_MASK_SIMPLE(4) | IRQ_TYPE_LEVEL_HIGH)>, <GIC_PPI 14 -- 2.16.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-05-11 2:27 [PATCH 0/2] Allwinner A64 timer workaround Samuel Holland 2018-05-11 2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland 2018-05-11 2:27 ` [PATCH 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland @ 2018-05-11 9:24 ` Andre Przywara 2018-07-03 15:09 ` Marc Zyngier 3 siblings, 0 replies; 30+ messages in thread From: Andre Przywara @ 2018-05-11 9:24 UTC (permalink / raw) To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier Cc: linux-sunxi, linux-kernel, linux-arm-kernel Hi, On 11/05/18 03:27, Samuel Holland wrote: > Hello, > > Several people (including me) have experienced extremely large system > clock jumps on their A64-based devices, apparently due to the > architectural timer going backward, which is interpreted by Linux as > the timer wrapping around after 2^56 cycles. So I experienced this before, though I didn't see actual clock jumps, only that subsequent reads of CNTVCT_EL0, both directly via mrs and indirectly via CLOCK_MONOTONIC, from userland (triggered by a directed test) *sometimes* went backwards, with a number of '1's in the lower bits. But that didn't happen on every boot, and I was suspecting that some timer setup was missing on the hardware/firmware side. And later on I failed to reproduce it anymore. So do you see it on every boot, with recent U-Boot/ATF? Cheers, Andre. > Investigation led to discovery of some obvious problems with this SoC's > architectural timer, and this patch series introduces what I believe is > the simplest workaround. More details are in the commit message for > patch 1. Patch 2 simply enables the workaround in the device tree. > > Thanks, > Samuel > > Samuel Holland (2): > arm64: arch_timer: Workaround for Allwinner A64 timer instability > arm64: dts: allwinner: a64: Enable A64 timer workaround > > arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 1 + > drivers/clocksource/Kconfig | 11 ++++++++ > drivers/clocksource/arm_arch_timer.c | 39 +++++++++++++++++++++++++++ > 3 files changed, 51 insertions(+) > > -- > 2.16.1 > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-05-11 2:27 [PATCH 0/2] Allwinner A64 timer workaround Samuel Holland ` (2 preceding siblings ...) 2018-05-11 9:24 ` [PATCH 0/2] Allwinner " Andre Przywara @ 2018-07-03 15:09 ` Marc Zyngier 2018-07-03 18:42 ` Samuel Holland 3 siblings, 1 reply; 30+ messages in thread From: Marc Zyngier @ 2018-07-03 15:09 UTC (permalink / raw) To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi On 11/05/18 03:27, Samuel Holland wrote: > Hello, > > Several people (including me) have experienced extremely large system > clock jumps on their A64-based devices, apparently due to the > architectural timer going backward, which is interpreted by Linux as > the timer wrapping around after 2^56 cycles. > > Investigation led to discovery of some obvious problems with this SoC's > architectural timer, and this patch series introduces what I believe is > the simplest workaround. More details are in the commit message for > patch 1. Patch 2 simply enables the workaround in the device tree. What's the deal with this series? There was a couple of nits to address, and I was more or less expecting a v2. Thanks, M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-03 15:09 ` Marc Zyngier @ 2018-07-03 18:42 ` Samuel Holland 2018-07-04 8:16 ` Marc Zyngier ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Samuel Holland @ 2018-07-03 18:42 UTC (permalink / raw) To: Marc Zyngier, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 07/03/18 10:09, Marc Zyngier wrote: > On 11/05/18 03:27, Samuel Holland wrote: >> Hello, >> >> Several people (including me) have experienced extremely large system >> clock jumps on their A64-based devices, apparently due to the architectural >> timer going backward, which is interpreted by Linux as the timer wrapping >> around after 2^56 cycles. >> >> Investigation led to discovery of some obvious problems with this SoC's >> architectural timer, and this patch series introduces what I believe is >> the simplest workaround. More details are in the commit message for patch >> 1. Patch 2 simply enables the workaround in the device tree. > > What's the deal with this series? There was a couple of nits to address, and > I was more or less expecting a v2. I got reports that people were still occasionally having clock jumps after applying this series, so I wanted to attempt a more complete fix, but I haven't had time to do any deeper investigation. I think this series is still beneficial even if it's not a complete solution, so I'll come back with another patch on top of this if/once I get it fully fixed. I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz timer) ≈ 150 should be a conservative iteration limit? Also, does this make sense to CC to stable? Thanks, Samuel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-03 18:42 ` Samuel Holland @ 2018-07-04 8:16 ` Marc Zyngier 2018-07-04 8:19 ` Chen-Yu Tsai ` (3 more replies) 2018-07-04 8:41 ` Daniel Lezcano 2018-07-04 12:52 ` [linux-sunxi] " Andre Przywara 2 siblings, 4 replies; 30+ messages in thread From: Marc Zyngier @ 2018-07-04 8:16 UTC (permalink / raw) To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 03/07/18 19:42, Samuel Holland wrote: > On 07/03/18 10:09, Marc Zyngier wrote: >> On 11/05/18 03:27, Samuel Holland wrote: >>> Hello, >>> >>> Several people (including me) have experienced extremely large system >>> clock jumps on their A64-based devices, apparently due to the architectural >>> timer going backward, which is interpreted by Linux as the timer wrapping >>> around after 2^56 cycles. >>> >>> Investigation led to discovery of some obvious problems with this SoC's >>> architectural timer, and this patch series introduces what I believe is >>> the simplest workaround. More details are in the commit message for patch >>> 1. Patch 2 simply enables the workaround in the device tree. >> >> What's the deal with this series? There was a couple of nits to address, and >> I was more or less expecting a v2. > > I got reports that people were still occasionally having clock jumps after > applying this series, so I wanted to attempt a more complete fix, but I haven't > had time to do any deeper investigation. I think this series is still beneficial > even if it's not a complete solution, so I'll come back with another patch on > top of this if/once I get it fully fixed. > > I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz > timer) ≈ 150 should be a conservative iteration limit? Should be OK. Maxime: How do you want to deal with the documentation aspect? We need an erratum number, but AFAIU the concept hasn't made it into the silicom vendor's brain yet. Any chance you could come up with something that uniquely identifies this? > Also, does this make sense to CC to stable? Probably not, as the HW never worked, so it is not a regression. Thanks, M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 8:16 ` Marc Zyngier @ 2018-07-04 8:19 ` Chen-Yu Tsai 2018-07-04 8:23 ` Daniel Lezcano ` (2 subsequent siblings) 3 siblings, 0 replies; 30+ messages in thread From: Chen-Yu Tsai @ 2018-07-04 8:19 UTC (permalink / raw) To: Marc Zyngier Cc: Samuel Holland, Maxime Ripard, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On Wed, Jul 4, 2018 at 4:16 PM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 03/07/18 19:42, Samuel Holland wrote: >> On 07/03/18 10:09, Marc Zyngier wrote: >>> On 11/05/18 03:27, Samuel Holland wrote: >>>> Hello, >>>> >>>> Several people (including me) have experienced extremely large system >>>> clock jumps on their A64-based devices, apparently due to the architectural >>>> timer going backward, which is interpreted by Linux as the timer wrapping >>>> around after 2^56 cycles. >>>> >>>> Investigation led to discovery of some obvious problems with this SoC's >>>> architectural timer, and this patch series introduces what I believe is >>>> the simplest workaround. More details are in the commit message for patch >>>> 1. Patch 2 simply enables the workaround in the device tree. >>> >>> What's the deal with this series? There was a couple of nits to address, and >>> I was more or less expecting a v2. >> >> I got reports that people were still occasionally having clock jumps after >> applying this series, so I wanted to attempt a more complete fix, but I haven't >> had time to do any deeper investigation. I think this series is still beneficial >> even if it's not a complete solution, so I'll come back with another patch on >> top of this if/once I get it fully fixed. >> >> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz >> timer) ≈ 150 should be a conservative iteration limit? > > Should be OK. > > Maxime: How do you want to deal with the documentation aspect? We need > an erratum number, but AFAIU the concept hasn't made it into the silicom > vendor's brain yet. Any chance you could come up with something that > uniquely identifies this? > >> Also, does this make sense to CC to stable? > > Probably not, as the HW never worked, so it is not a regression. A64 support has been in for a few releases now. So while this is not fixing a regression, people will still benefit from it being in stable. ChenYu ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 8:16 ` Marc Zyngier 2018-07-04 8:19 ` Chen-Yu Tsai @ 2018-07-04 8:23 ` Daniel Lezcano 2018-07-04 8:39 ` Marc Zyngier 2018-07-04 8:41 ` Daniel Lezcano 2018-07-04 9:06 ` Maxime Ripard 3 siblings, 1 reply; 30+ messages in thread From: Daniel Lezcano @ 2018-07-04 8:23 UTC (permalink / raw) To: Marc Zyngier, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 04/07/2018 10:16, Marc Zyngier wrote: > On 03/07/18 19:42, Samuel Holland wrote: >> On 07/03/18 10:09, Marc Zyngier wrote: >>> On 11/05/18 03:27, Samuel Holland wrote: >>>> Hello, >>>> >>>> Several people (including me) have experienced extremely large system >>>> clock jumps on their A64-based devices, apparently due to the architectural >>>> timer going backward, which is interpreted by Linux as the timer wrapping >>>> around after 2^56 cycles. >>>> >>>> Investigation led to discovery of some obvious problems with this SoC's >>>> architectural timer, and this patch series introduces what I believe is >>>> the simplest workaround. More details are in the commit message for patch >>>> 1. Patch 2 simply enables the workaround in the device tree. >>> >>> What's the deal with this series? There was a couple of nits to address, and >>> I was more or less expecting a v2. >> >> I got reports that people were still occasionally having clock jumps after >> applying this series, so I wanted to attempt a more complete fix, but I haven't >> had time to do any deeper investigation. I think this series is still beneficial >> even if it's not a complete solution, so I'll come back with another patch on >> top of this if/once I get it fully fixed. >> >> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz >> timer) ≈ 150 should be a conservative iteration limit? > > Should be OK. > > Maxime: How do you want to deal with the documentation aspect? We need > an erratum number, but AFAIU the concept hasn't made it into the silicom > vendor's brain yet. Any chance you could come up with something that > uniquely identifies this? > >> Also, does this make sense to CC to stable? > > Probably not, as the HW never worked, so it is not a regression. If the patches fix a bug which already exist, it makes sense to propagated the fix back to the stable versions. -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 8:23 ` Daniel Lezcano @ 2018-07-04 8:39 ` Marc Zyngier 2018-07-04 10:00 ` Thomas Gleixner 0 siblings, 1 reply; 30+ messages in thread From: Marc Zyngier @ 2018-07-04 8:39 UTC (permalink / raw) To: Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 04/07/18 09:23, Daniel Lezcano wrote: > On 04/07/2018 10:16, Marc Zyngier wrote: >> On 03/07/18 19:42, Samuel Holland wrote: >>> On 07/03/18 10:09, Marc Zyngier wrote: >>>> On 11/05/18 03:27, Samuel Holland wrote: >>>>> Hello, >>>>> >>>>> Several people (including me) have experienced extremely large system >>>>> clock jumps on their A64-based devices, apparently due to the architectural >>>>> timer going backward, which is interpreted by Linux as the timer wrapping >>>>> around after 2^56 cycles. >>>>> >>>>> Investigation led to discovery of some obvious problems with this SoC's >>>>> architectural timer, and this patch series introduces what I believe is >>>>> the simplest workaround. More details are in the commit message for patch >>>>> 1. Patch 2 simply enables the workaround in the device tree. >>>> >>>> What's the deal with this series? There was a couple of nits to address, and >>>> I was more or less expecting a v2. >>> >>> I got reports that people were still occasionally having clock jumps after >>> applying this series, so I wanted to attempt a more complete fix, but I haven't >>> had time to do any deeper investigation. I think this series is still beneficial >>> even if it's not a complete solution, so I'll come back with another patch on >>> top of this if/once I get it fully fixed. >>> >>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz >>> timer) ≈ 150 should be a conservative iteration limit? >> >> Should be OK. >> >> Maxime: How do you want to deal with the documentation aspect? We need >> an erratum number, but AFAIU the concept hasn't made it into the silicom >> vendor's brain yet. Any chance you could come up with something that >> uniquely identifies this? >> >>> Also, does this make sense to CC to stable? >> >> Probably not, as the HW never worked, so it is not a regression. > > If the patches fix a bug which already exist, it makes sense to > propagated the fix back to the stable versions. That's your call, but I'm not supportive of that decision, specially as we have information from the person developing the workaround that this doesn't fully address the issue. M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 8:39 ` Marc Zyngier @ 2018-07-04 10:00 ` Thomas Gleixner 2018-07-04 13:08 ` [linux-sunxi] " Andre Przywara 0 siblings, 1 reply; 30+ messages in thread From: Thomas Gleixner @ 2018-07-04 10:00 UTC (permalink / raw) To: Marc Zyngier Cc: Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On Wed, 4 Jul 2018, Marc Zyngier wrote: > On 04/07/18 09:23, Daniel Lezcano wrote: > > > > If the patches fix a bug which already exist, it makes sense to > > propagated the fix back to the stable versions. > > That's your call, but I'm not supportive of that decision, specially as > we have information from the person developing the workaround that this > doesn't fully address the issue. The patches should not be applied at all. Simply because they don't fix the issue completely. From a quick glance at various links and information about this, this very much smells like the FSL_ERRATUM_A008585. Has that been tried? It looks way more robust than the magic 11 bit crystal ball logic. Thanks, tglx ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 10:00 ` Thomas Gleixner @ 2018-07-04 13:08 ` Andre Przywara 2018-07-04 14:31 ` Thomas Gleixner 0 siblings, 1 reply; 30+ messages in thread From: Andre Przywara @ 2018-07-04 13:08 UTC (permalink / raw) To: tglx, Marc Zyngier Cc: Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland Hi, On 04/07/18 11:00, Thomas Gleixner wrote: > On Wed, 4 Jul 2018, Marc Zyngier wrote: >> On 04/07/18 09:23, Daniel Lezcano wrote: >>> >>> If the patches fix a bug which already exist, it makes sense to >>> propagated the fix back to the stable versions. >> >> That's your call, but I'm not supportive of that decision, specially as >> we have information from the person developing the workaround that this >> doesn't fully address the issue. > > The patches should not be applied at all. Simply because they don't fix the > issue completely. > > From a quick glance at various links and information about this, this very > much smells like the FSL_ERRATUM_A008585. > Has that been tried? It looks way more robust than the magic 11 bit > crystal ball logic. The Freescale erratum is similar, but not identical [1]. It seems like the A64 is less variable, so we can use a cheaper workaround, which gets away with normally just one sysreg read. But then again the newer error reports may actually suggest otherwise ... And as it currently stands, the Freescale erratum has the drawback of relying on the CPU running much faster than the timer. The A64 can run at 24 MHz (for power savings, or possibly during DVFS transitions), which is the timer frequency. So subsequent counter reads will never return the same value and the workaround times out. Cheers, Andre. [1] https://lists.denx.de/pipermail/u-boot/2016-November/271836.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 13:08 ` [linux-sunxi] " Andre Przywara @ 2018-07-04 14:31 ` Thomas Gleixner 2018-07-04 14:44 ` Andre Przywara 0 siblings, 1 reply; 30+ messages in thread From: Thomas Gleixner @ 2018-07-04 14:31 UTC (permalink / raw) To: Andre Przywara Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On Wed, 4 Jul 2018, Andre Przywara wrote: > On 04/07/18 11:00, Thomas Gleixner wrote: > > On Wed, 4 Jul 2018, Marc Zyngier wrote: > >> On 04/07/18 09:23, Daniel Lezcano wrote: > >>> > >>> If the patches fix a bug which already exist, it makes sense to > >>> propagated the fix back to the stable versions. > >> > >> That's your call, but I'm not supportive of that decision, specially as > >> we have information from the person developing the workaround that this > >> doesn't fully address the issue. > > > > The patches should not be applied at all. Simply because they don't fix the > > issue completely. > > > > From a quick glance at various links and information about this, this very > > much smells like the FSL_ERRATUM_A008585. > > Has that been tried? It looks way more robust than the magic 11 bit > > crystal ball logic. > > The Freescale erratum is similar, but not identical [1]. > It seems like the A64 is less variable, so we can use a cheaper > workaround, which gets away with normally just one sysreg read. But then > again the newer error reports may actually suggest otherwise ... > > And as it currently stands, the Freescale erratum has the drawback of > relying on the CPU running much faster than the timer. The A64 can run > at 24 MHz (for power savings, or possibly during DVFS transitions), > which is the timer frequency. So subsequent counter reads will never > return the same value and the workaround times out. If that's the case then you need to find a different functional timer for time keeping. Having an erratic behaving timer for time keeping is not an option at all. Thanks, tglx ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 14:31 ` Thomas Gleixner @ 2018-07-04 14:44 ` Andre Przywara 2018-07-04 15:01 ` Marc Zyngier 2018-07-04 15:14 ` Thomas Gleixner 0 siblings, 2 replies; 30+ messages in thread From: Andre Przywara @ 2018-07-04 14:44 UTC (permalink / raw) To: Thomas Gleixner Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland Hi, On 04/07/18 15:31, Thomas Gleixner wrote: > On Wed, 4 Jul 2018, Andre Przywara wrote: >> On 04/07/18 11:00, Thomas Gleixner wrote: >>> On Wed, 4 Jul 2018, Marc Zyngier wrote: >>>> On 04/07/18 09:23, Daniel Lezcano wrote: >>>>> >>>>> If the patches fix a bug which already exist, it makes sense to >>>>> propagated the fix back to the stable versions. >>>> >>>> That's your call, but I'm not supportive of that decision, specially as >>>> we have information from the person developing the workaround that this >>>> doesn't fully address the issue. >>> >>> The patches should not be applied at all. Simply because they don't fix the >>> issue completely. >>> >>> From a quick glance at various links and information about this, this very >>> much smells like the FSL_ERRATUM_A008585. >>> Has that been tried? It looks way more robust than the magic 11 bit >>> crystal ball logic. >> >> The Freescale erratum is similar, but not identical [1]. >> It seems like the A64 is less variable, so we can use a cheaper >> workaround, which gets away with normally just one sysreg read. But then >> again the newer error reports may actually suggest otherwise ... >> >> And as it currently stands, the Freescale erratum has the drawback of >> relying on the CPU running much faster than the timer. The A64 can run >> at 24 MHz (for power savings, or possibly during DVFS transitions), >> which is the timer frequency. So subsequent counter reads will never >> return the same value and the workaround times out. > > If that's the case then you need to find a different functional timer for > time keeping. Having an erratic behaving timer for time keeping is not an > option at all. That's not an option on arm64. There are other usable time sources in the SoC, but the arch timer is somewhat mandatory for all practical purposes on arm64. We rely on it in some many places that it's not feasible to run without it. That's why we call it "architected" timer after all ;-) But I am quite confident that we can find a correct workaround. Maybe it's really the TVAL (the downcounter) write which is the culprit here, since the hardware actually writes "now() + TVAL" into the CVAL (upcounter) register. This internal counter access may be flawed as well. Cheers, Andre. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 14:44 ` Andre Przywara @ 2018-07-04 15:01 ` Marc Zyngier 2018-07-04 15:15 ` Andre Przywara 2018-07-04 15:23 ` Samuel Holland 2018-07-04 15:14 ` Thomas Gleixner 1 sibling, 2 replies; 30+ messages in thread From: Marc Zyngier @ 2018-07-04 15:01 UTC (permalink / raw) To: Andre Przywara Cc: Thomas Gleixner, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On Wed, 04 Jul 2018 15:44:36 +0100, Andre Przywara <andre.przywara@arm.com> wrote: > > Hi, > > On 04/07/18 15:31, Thomas Gleixner wrote: > > On Wed, 4 Jul 2018, Andre Przywara wrote: > >> On 04/07/18 11:00, Thomas Gleixner wrote: > >>> On Wed, 4 Jul 2018, Marc Zyngier wrote: > >>>> On 04/07/18 09:23, Daniel Lezcano wrote: > >>>>> > >>>>> If the patches fix a bug which already exist, it makes sense to > >>>>> propagated the fix back to the stable versions. > >>>> > >>>> That's your call, but I'm not supportive of that decision, specially as > >>>> we have information from the person developing the workaround that this > >>>> doesn't fully address the issue. > >>> > >>> The patches should not be applied at all. Simply because they don't fix the > >>> issue completely. > >>> > >>> From a quick glance at various links and information about this, this very > >>> much smells like the FSL_ERRATUM_A008585. > >>> Has that been tried? It looks way more robust than the magic 11 bit > >>> crystal ball logic. > >> > >> The Freescale erratum is similar, but not identical [1]. > >> It seems like the A64 is less variable, so we can use a cheaper > >> workaround, which gets away with normally just one sysreg read. But then > >> again the newer error reports may actually suggest otherwise ... > >> > >> And as it currently stands, the Freescale erratum has the drawback of > >> relying on the CPU running much faster than the timer. The A64 can run > >> at 24 MHz (for power savings, or possibly during DVFS transitions), > >> which is the timer frequency. So subsequent counter reads will never > >> return the same value and the workaround times out. > > > > If that's the case then you need to find a different functional timer for > > time keeping. Having an erratic behaving timer for time keeping is not an > > option at all. > > That's not an option on arm64. There are other usable time sources in > the SoC, but the arch timer is somewhat mandatory for all practical > purposes on arm64. We rely on it in some many places that it's not > feasible to run without it. That's why we call it "architected" timer > after all ;-) > But I am quite confident that we can find a correct workaround. Maybe > it's really the TVAL (the downcounter) write which is the culprit here, > since the hardware actually writes "now() + TVAL" into the CVAL > (upcounter) register. This internal counter access may be flawed as well. You got it backward: CVAL is not a counter at all. It is a Comparator. And TVAL has an implicit read from the counter, as it is defined as "CVAL - CNT" (i.e. the number of ticks until the timer expires). So it might be worth trying to handle TVAL entirely in SW. But this relies on being able to read the timer and get a number of correct values out of it. One possibility would be to sacrifice precision and always ignore some of the bottom bits, but this is always going to suck terribly. The alternative is burn that thing, and pretend it never existed. M. -- Jazz is not dead, it just smell funny. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 15:01 ` Marc Zyngier @ 2018-07-04 15:15 ` Andre Przywara 2018-07-04 15:30 ` Marc Zyngier 2018-07-04 15:23 ` Samuel Holland 1 sibling, 1 reply; 30+ messages in thread From: Andre Przywara @ 2018-07-04 15:15 UTC (permalink / raw) To: Marc Zyngier Cc: Thomas Gleixner, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland Hi, On 04/07/18 16:01, Marc Zyngier wrote: > On Wed, 04 Jul 2018 15:44:36 +0100, > Andre Przywara <andre.przywara@arm.com> wrote: >> >> Hi, >> >> On 04/07/18 15:31, Thomas Gleixner wrote: >>> On Wed, 4 Jul 2018, Andre Przywara wrote: >>>> On 04/07/18 11:00, Thomas Gleixner wrote: >>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote: >>>>>> On 04/07/18 09:23, Daniel Lezcano wrote: >>>>>>> >>>>>>> If the patches fix a bug which already exist, it makes sense to >>>>>>> propagated the fix back to the stable versions. >>>>>> >>>>>> That's your call, but I'm not supportive of that decision, specially as >>>>>> we have information from the person developing the workaround that this >>>>>> doesn't fully address the issue. >>>>> >>>>> The patches should not be applied at all. Simply because they don't fix the >>>>> issue completely. >>>>> >>>>> From a quick glance at various links and information about this, this very >>>>> much smells like the FSL_ERRATUM_A008585. >>>>> Has that been tried? It looks way more robust than the magic 11 bit >>>>> crystal ball logic. >>>> >>>> The Freescale erratum is similar, but not identical [1]. >>>> It seems like the A64 is less variable, so we can use a cheaper >>>> workaround, which gets away with normally just one sysreg read. But then >>>> again the newer error reports may actually suggest otherwise ... >>>> >>>> And as it currently stands, the Freescale erratum has the drawback of >>>> relying on the CPU running much faster than the timer. The A64 can run >>>> at 24 MHz (for power savings, or possibly during DVFS transitions), >>>> which is the timer frequency. So subsequent counter reads will never >>>> return the same value and the workaround times out. >>> >>> If that's the case then you need to find a different functional timer for >>> time keeping. Having an erratic behaving timer for time keeping is not an >>> option at all. >> >> That's not an option on arm64. There are other usable time sources in >> the SoC, but the arch timer is somewhat mandatory for all practical >> purposes on arm64. We rely on it in some many places that it's not >> feasible to run without it. That's why we call it "architected" timer >> after all ;-) >> But I am quite confident that we can find a correct workaround. Maybe >> it's really the TVAL (the downcounter) write which is the culprit here, >> since the hardware actually writes "now() + TVAL" into the CVAL >> (upcounter) register. This internal counter access may be flawed as well. > > You got it backward: CVAL is not a counter at all. It is a > Comparator. And TVAL has an implicit read from the counter, as it is > defined as "CVAL - CNT" (i.e. the number of ticks until the timer > expires). Yes, that's what I meant actually, sorry for the lousy wording. What I am actually more concerned about than reading (do we actually read TVAL?), is writing TVAL. The original BSP errata hack hints at this being a problem: https://github.com/longsleep/linux-pine64/blob/5b10a45ae8b0/drivers/clocksource/arm_arch_timer.c#L231-L244 > So it might be worth trying to handle TVAL entirely in SW. > > But this relies on being able to read the timer and get a number of > correct values out of it. One possibility would be to sacrifice > precision and always ignore some of the bottom bits, but this is > always going to suck terribly. > > The alternative is burn that thing, and pretend it never existed. Yes, that crossed my mind multiple times. Cheers, Andre. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 15:15 ` Andre Przywara @ 2018-07-04 15:30 ` Marc Zyngier 0 siblings, 0 replies; 30+ messages in thread From: Marc Zyngier @ 2018-07-04 15:30 UTC (permalink / raw) To: Andre Przywara Cc: Thomas Gleixner, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 04/07/18 16:15, Andre Przywara wrote: > Hi, > > On 04/07/18 16:01, Marc Zyngier wrote: >> On Wed, 04 Jul 2018 15:44:36 +0100, >> Andre Przywara <andre.przywara@arm.com> wrote: >>> >>> Hi, >>> >>> On 04/07/18 15:31, Thomas Gleixner wrote: >>>> On Wed, 4 Jul 2018, Andre Przywara wrote: >>>>> On 04/07/18 11:00, Thomas Gleixner wrote: >>>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote: >>>>>>> On 04/07/18 09:23, Daniel Lezcano wrote: >>>>>>>> >>>>>>>> If the patches fix a bug which already exist, it makes sense to >>>>>>>> propagated the fix back to the stable versions. >>>>>>> >>>>>>> That's your call, but I'm not supportive of that decision, specially as >>>>>>> we have information from the person developing the workaround that this >>>>>>> doesn't fully address the issue. >>>>>> >>>>>> The patches should not be applied at all. Simply because they don't fix the >>>>>> issue completely. >>>>>> >>>>>> From a quick glance at various links and information about this, this very >>>>>> much smells like the FSL_ERRATUM_A008585. >>>>>> Has that been tried? It looks way more robust than the magic 11 bit >>>>>> crystal ball logic. >>>>> >>>>> The Freescale erratum is similar, but not identical [1]. >>>>> It seems like the A64 is less variable, so we can use a cheaper >>>>> workaround, which gets away with normally just one sysreg read. But then >>>>> again the newer error reports may actually suggest otherwise ... >>>>> >>>>> And as it currently stands, the Freescale erratum has the drawback of >>>>> relying on the CPU running much faster than the timer. The A64 can run >>>>> at 24 MHz (for power savings, or possibly during DVFS transitions), >>>>> which is the timer frequency. So subsequent counter reads will never >>>>> return the same value and the workaround times out. >>>> >>>> If that's the case then you need to find a different functional timer for >>>> time keeping. Having an erratic behaving timer for time keeping is not an >>>> option at all. >>> >>> That's not an option on arm64. There are other usable time sources in >>> the SoC, but the arch timer is somewhat mandatory for all practical >>> purposes on arm64. We rely on it in some many places that it's not >>> feasible to run without it. That's why we call it "architected" timer >>> after all ;-) >>> But I am quite confident that we can find a correct workaround. Maybe >>> it's really the TVAL (the downcounter) write which is the culprit here, >>> since the hardware actually writes "now() + TVAL" into the CVAL >>> (upcounter) register. This internal counter access may be flawed as well. >> >> You got it backward: CVAL is not a counter at all. It is a >> Comparator. And TVAL has an implicit read from the counter, as it is >> defined as "CVAL - CNT" (i.e. the number of ticks until the timer >> expires). > > Yes, that's what I meant actually, sorry for the lousy wording. > > What I am actually more concerned about than reading (do we actually > read TVAL?), is writing TVAL. The original BSP errata hack hints at this > being a problem: > https://github.com/longsleep/linux-pine64/blob/5b10a45ae8b0/drivers/clocksource/arm_arch_timer.c#L231-L244 Right, and they only address the comparator, ignoring the counter. Braindead. I specially enjoy the "we should try to fix this" comment. >> So it might be worth trying to handle TVAL entirely in SW. Given the above, I think the above makes sense: - write TVAL: read CNT until stable, add the delta, write CVAL instead - read TVAL: read CNT until stable, substract CVAL, return the delta The low frequency problem remains. If it can't be solved, drop the arch timer from the DT (it is dead), and use a separate timer/counter. Simply not fit for purpose. M. -- Jazz is not dead. It just smells funny... ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 15:01 ` Marc Zyngier 2018-07-04 15:15 ` Andre Przywara @ 2018-07-04 15:23 ` Samuel Holland 1 sibling, 0 replies; 30+ messages in thread From: Samuel Holland @ 2018-07-04 15:23 UTC (permalink / raw) To: Marc Zyngier, Andre Przywara Cc: Thomas Gleixner, Daniel Lezcano, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 07/04/18 10:01, Marc Zyngier wrote: > On Wed, 04 Jul 2018 15:44:36, Andre Przywara <andre.przywara@arm.com> wrote: >> On 04/07/18 15:31, Thomas Gleixner wrote: >>> On Wed, 4 Jul 2018, Andre Przywara wrote: >>>> On 04/07/18 11:00, Thomas Gleixner wrote: >>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote: >>>>>> On 04/07/18 09:23, Daniel Lezcano wrote: >>>>>>> >>>>>>> If the patches fix a bug which already exist, it makes sense to >>>>>>> propagated the fix back to the stable versions. >>>>>> >>>>>> That's your call, but I'm not supportive of that decision, >>>>>> specially as we have information from the person developing the >>>>>> workaround that this doesn't fully address the issue. >>>>> >>>>> The patches should not be applied at all. Simply because they don't >>>>> fix the issue completely. >>>>> >>>>> From a quick glance at various links and information about this, >>>>> this very much smells like the FSL_ERRATUM_A008585. Has that been >>>>> tried? It looks way more robust than the magic 11 bit crystal ball >>>>> logic. >>>> >>>> The Freescale erratum is similar, but not identical [1]. It seems like >>>> the A64 is less variable, so we can use a cheaper workaround, which >>>> gets away with normally just one sysreg read. But then again the newer >>>> error reports may actually suggest otherwise ... >>>> >>>> And as it currently stands, the Freescale erratum has the drawback of >>>> relying on the CPU running much faster than the timer. The A64 can run >>>> at 24 MHz (for power savings, or possibly during DVFS transitions), >>>> which is the timer frequency. So subsequent counter reads will never >>>> return the same value and the workaround times out. >>> >>> If that's the case then you need to find a different functional timer for >>> time keeping. Having an erratic behaving timer for time keeping is not an >>> option at all. >> >> That's not an option on arm64. There are other usable time sources in the >> SoC, but the arch timer is somewhat mandatory for all practical purposes >> on arm64. We rely on it in some many places that it's not feasible to run >> without it. That's why we call it "architected" timer after all ;-) But I >> am quite confident that we can find a correct workaround. Maybe it's >> really the TVAL (the downcounter) write which is the culprit here, since >> the hardware actually writes "now() + TVAL" into the CVAL (upcounter) >> register. This internal counter access may be flawed as well. > > You got it backward: CVAL is not a counter at all. It is a Comparator. And > TVAL has an implicit read from the counter, as it is defined as "CVAL - CNT" > (i.e. the number of ticks until the timer expires). > > So it might be worth trying to handle TVAL entirely in SW. > > But this relies on being able to read the timer and get a number of correct > values out of it. One possibility would be to sacrifice precision and always > ignore some of the bottom bits, but this is always going to suck terribly. > > The alternative is burn that thing, and pretend it never existed. From the testing I have done, this patch series fully stabilizes reading CNTPCT and CNTVCT. So with the workaround, the timer *can* accurately count time. So merging this would be an improvement on the current situation. The system clock jumps might be explained by interaction with CVAL/TVAL, and that's the part I haven't investigated yet. As I mentioned before, and Andre just mentioned again, the BSP provided by the vendor has another workaround for writing the TVAL register. Hopefully, that's the missing piece which will fix the clock jumps. Thanks, Samuel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 14:44 ` Andre Przywara 2018-07-04 15:01 ` Marc Zyngier @ 2018-07-04 15:14 ` Thomas Gleixner 2018-07-04 15:43 ` Andre Przywara 1 sibling, 1 reply; 30+ messages in thread From: Thomas Gleixner @ 2018-07-04 15:14 UTC (permalink / raw) To: Andre Przywara Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On Wed, 4 Jul 2018, Andre Przywara wrote: > On 04/07/18 15:31, Thomas Gleixner wrote: > > If that's the case then you need to find a different functional timer for > > time keeping. Having an erratic behaving timer for time keeping is not an > > option at all. > > That's not an option on arm64. There are other usable time sources in > the SoC, but the arch timer is somewhat mandatory for all practical > purposes on arm64. We rely on it in some many places that it's not > feasible to run without it. That's why we call it "architected" timer > after all ;-) The argument that it has to be used just because someone defined it as 'architected' is bullshit and doesn't change the fact that it's broken and not usable for timekeeping. There is no wiggle room. Either it works or not, but works mostly is not an option. > But I am quite confident that we can find a correct workaround. Maybe > it's really the TVAL (the downcounter) write which is the culprit here, > since the hardware actually writes "now() + TVAL" into the CVAL > (upcounter) register. This internal counter access may be flawed as well. If the write to the event device is wreckaging the counter which provides time, then there is something seriously wrong either in the design or in that particular piece of silicon. Yet another proof for the theory that timers are implemented by janitors and that silicon/IP vendors have a competition running who can create the most subtly broken timers. Intel surely had a head start with that, but ARM is definitely catching up. Thanks, tglx ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 15:14 ` Thomas Gleixner @ 2018-07-04 15:43 ` Andre Przywara 2018-07-04 19:49 ` Thomas Gleixner 0 siblings, 1 reply; 30+ messages in thread From: Andre Przywara @ 2018-07-04 15:43 UTC (permalink / raw) To: Thomas Gleixner Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland Hi, On 04/07/18 16:14, Thomas Gleixner wrote: > On Wed, 4 Jul 2018, Andre Przywara wrote: >> On 04/07/18 15:31, Thomas Gleixner wrote: >>> If that's the case then you need to find a different functional timer for >>> time keeping. Having an erratic behaving timer for time keeping is not an >>> option at all. >> >> That's not an option on arm64. There are other usable time sources in >> the SoC, but the arch timer is somewhat mandatory for all practical >> purposes on arm64. We rely on it in some many places that it's not >> feasible to run without it. That's why we call it "architected" timer >> after all ;-) > > The argument that it has to be used just because someone defined it as > 'architected' is bullshit and doesn't change the fact that it's broken and > not usable for timekeeping. There is no wiggle room. Either it works or > not, but works mostly is not an option. The "architected" part of the arch timer is fine, it's just that eventually someone has to implement that at some point. And as you mention below, this is where Murphy's law is kicking in ;-) Especially for such seemingly simple tasks as connecting a counter in the "uncore" part of the chip (Allwinner SoC) to the counter register interface in the core (ARM Cortex-A53) [1]. Apparently the propagation is not really atomic for all bits here ... >> But I am quite confident that we can find a correct workaround. Maybe >> it's really the TVAL (the downcounter) write which is the culprit here, >> since the hardware actually writes "now() + TVAL" into the CVAL >> (upcounter) register. This internal counter access may be flawed as well. > > If the write to the event device is wreckaging the counter which provides > time, then there is something seriously wrong either in the design or in > that particular piece of silicon. Apologies, that was my lousy wording: There is one 64-bit comparison register (CVAL), which signals when the counter (an independent register) is greater or equal. TVAL is just a different *view* of that same relation. So this part is fine, it's really that the "strictly monotonic counter" nature of CNTPCT is not really observed by the chip. > Yet another proof for the theory that timers are implemented by janitors > and that silicon/IP vendors have a competition running who can create the > most subtly broken timers. Intel surely had a head start with that, but ARM > is definitely catching up. ARM is trying really hard to be actually better ;-) Cheers, Andre. [1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500g/BABIGHII.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 15:43 ` Andre Przywara @ 2018-07-04 19:49 ` Thomas Gleixner 0 siblings, 0 replies; 30+ messages in thread From: Thomas Gleixner @ 2018-07-04 19:49 UTC (permalink / raw) To: Andre Przywara Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On Wed, 4 Jul 2018, Andre Przywara wrote: > On 04/07/18 16:14, Thomas Gleixner wrote: > > On Wed, 4 Jul 2018, Andre Przywara wrote: > >> On 04/07/18 15:31, Thomas Gleixner wrote: > >>> If that's the case then you need to find a different functional timer for > >>> time keeping. Having an erratic behaving timer for time keeping is not an > >>> option at all. > >> > >> That's not an option on arm64. There are other usable time sources in > >> the SoC, but the arch timer is somewhat mandatory for all practical > >> purposes on arm64. We rely on it in some many places that it's not > >> feasible to run without it. That's why we call it "architected" timer > >> after all ;-) > > > > The argument that it has to be used just because someone defined it as > > 'architected' is bullshit and doesn't change the fact that it's broken and > > not usable for timekeeping. There is no wiggle room. Either it works or > > not, but works mostly is not an option. > > The "architected" part of the arch timer is fine, it's just that > eventually someone has to implement that at some point. And as you > mention below, this is where Murphy's law is kicking in ;-) Especially > for such seemingly simple tasks as connecting a counter in the "uncore" > part of the chip (Allwinner SoC) to the counter register interface in > the core (ARM Cortex-A53) [1]. Apparently the propagation is not really > atomic for all bits here ... I've immediately spotted the fail in that document: The Cortex-A53 processor does not include the system counter. This resides in the SoC. > >> But I am quite confident that we can find a correct workaround. Maybe > >> it's really the TVAL (the downcounter) write which is the culprit here, > >> since the hardware actually writes "now() + TVAL" into the CVAL > >> (upcounter) register. This internal counter access may be flawed as well. > > > > If the write to the event device is wreckaging the counter which provides > > time, then there is something seriously wrong either in the design or in > > that particular piece of silicon. > > Apologies, that was my lousy wording: There is one 64-bit comparison > register (CVAL), which signals when the counter (an independent > register) is greater or equal. TVAL is just a different *view* of that > same relation. So this part is fine, it's really that the "strictly > monotonic counter" nature of CNTPCT is not really observed by the chip. > > > Yet another proof for the theory that timers are implemented by janitors > > and that silicon/IP vendors have a competition running who can create the > > most subtly broken timers. Intel surely had a head start with that, but ARM > > is definitely catching up. > > ARM is trying really hard to be actually better ;-) Better in terms of subtle brokenness? I surely can do consulting for that. I've seen most of it in all colours, but I surely can come up with new even subtler ways to wreckage them. You know how to reach me. Thanks, tglx ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 8:16 ` Marc Zyngier 2018-07-04 8:19 ` Chen-Yu Tsai 2018-07-04 8:23 ` Daniel Lezcano @ 2018-07-04 8:41 ` Daniel Lezcano 2018-07-12 2:23 ` Samuel Holland 2018-07-04 9:06 ` Maxime Ripard 3 siblings, 1 reply; 30+ messages in thread From: Daniel Lezcano @ 2018-07-04 8:41 UTC (permalink / raw) To: Marc Zyngier, Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 04/07/2018 10:16, Marc Zyngier wrote: > On 03/07/18 19:42, Samuel Holland wrote: >> On 07/03/18 10:09, Marc Zyngier wrote: >>> On 11/05/18 03:27, Samuel Holland wrote: >>>> Hello, >>>> >>>> Several people (including me) have experienced extremely large system >>>> clock jumps on their A64-based devices, apparently due to the architectural >>>> timer going backward, which is interpreted by Linux as the timer wrapping >>>> around after 2^56 cycles. >>>> >>>> Investigation led to discovery of some obvious problems with this SoC's >>>> architectural timer, and this patch series introduces what I believe is >>>> the simplest workaround. More details are in the commit message for patch >>>> 1. Patch 2 simply enables the workaround in the device tree. >>> >>> What's the deal with this series? There was a couple of nits to address, and >>> I was more or less expecting a v2. >> >> I got reports that people were still occasionally having clock jumps after >> applying this series, so I wanted to attempt a more complete fix, but I haven't >> had time to do any deeper investigation. I think this series is still beneficial >> even if it's not a complete solution, so I'll come back with another patch on >> top of this if/once I get it fully fixed. >> >> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz >> timer) ≈ 150 should be a conservative iteration limit? > > Should be OK. > > Maxime: How do you want to deal with the documentation aspect? We need > an erratum number, but AFAIU the concept hasn't made it into the silicom > vendor's brain yet. Any chance you could come up with something that > uniquely identifies this? I went through the different pointers provided in the description but I did not find a clear statement that is a hardware issue or may be I missed it. Are we sure there isn't another subsystem responsible on this instability ? (eg PM or something else) >> Also, does this make sense to CC to stable? > > Probably not, as the HW never worked, so it is not a regression. > > Thanks, > > M. > -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 8:41 ` Daniel Lezcano @ 2018-07-12 2:23 ` Samuel Holland 0 siblings, 0 replies; 30+ messages in thread From: Samuel Holland @ 2018-07-12 2:23 UTC (permalink / raw) To: Daniel Lezcano, Marc Zyngier, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 07/04/18 03:41, Daniel Lezcano wrote: > On 04/07/2018 10:16, Marc Zyngier wrote: >> On 03/07/18 19:42, Samuel Holland wrote: >>> On 07/03/18 10:09, Marc Zyngier wrote: >>>> On 11/05/18 03:27, Samuel Holland wrote: >>>>> Hello, >>>>> >>>>> Several people (including me) have experienced extremely large system >>>>> clock jumps on their A64-based devices, apparently due to the architectural >>>>> timer going backward, which is interpreted by Linux as the timer wrapping >>>>> around after 2^56 cycles. >>>>> >>>>> Investigation led to discovery of some obvious problems with this SoC's >>>>> architectural timer, and this patch series introduces what I believe is >>>>> the simplest workaround. More details are in the commit message for patch >>>>> 1. Patch 2 simply enables the workaround in the device tree. >>>> >>>> What's the deal with this series? There was a couple of nits to address, and >>>> I was more or less expecting a v2. >>> >>> I got reports that people were still occasionally having clock jumps after >>> applying this series, so I wanted to attempt a more complete fix, but I haven't >>> had time to do any deeper investigation. I think this series is still beneficial >>> even if it's not a complete solution, so I'll come back with another patch on >>> top of this if/once I get it fully fixed. >>> >>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz >>> timer) ≈ 150 should be a conservative iteration limit? >> >> Should be OK. >> >> Maxime: How do you want to deal with the documentation aspect? We need >> an erratum number, but AFAIU the concept hasn't made it into the silicom >> vendor's brain yet. Any chance you could come up with something that >> uniquely identifies this? > > I went through the different pointers provided in the description but I > did not find a clear statement that is a hardware issue or may be I > missed it. > > Are we sure there isn't another subsystem responsible on this > instability ? (eg PM or something else) This issue has been observed on kernels with and without DVFS, across several Linux, U-Boot, and Trusted Firmware versions. It has not been observed on any other Allwinner SoC, including the A64's twin, the H5. In fact, this workaround was recently successfully used in U-Boot [1] to fix issues with an MMC driver that needed reliable numbers from CNTVCT. So while the vendor hasn't confirmed it (and I wouldn't count on that happening), everything I've seen points to it being a silicon bug, not a software issue. [1]: https://git.denx.de/?p=u-boot.git;a=commit;h=be0d217952222b2bd3ed071de9bb0c66d8cc80d9 >>> Also, does this make sense to CC to stable? >> >> Probably not, as the HW never worked, so it is not a regression. >> >> Thanks, >> >> M. >> > > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-04 8:16 ` Marc Zyngier ` (2 preceding siblings ...) 2018-07-04 8:41 ` Daniel Lezcano @ 2018-07-04 9:06 ` Maxime Ripard 3 siblings, 0 replies; 30+ messages in thread From: Maxime Ripard @ 2018-07-04 9:06 UTC (permalink / raw) To: Marc Zyngier Cc: Samuel Holland, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner, linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland [-- Attachment #1: Type: text/plain, Size: 2025 bytes --] On Wed, Jul 04, 2018 at 09:16:32AM +0100, Marc Zyngier wrote: > On 03/07/18 19:42, Samuel Holland wrote: > > On 07/03/18 10:09, Marc Zyngier wrote: > >> On 11/05/18 03:27, Samuel Holland wrote: > >>> Hello, > >>> > >>> Several people (including me) have experienced extremely large system > >>> clock jumps on their A64-based devices, apparently due to the architectural > >>> timer going backward, which is interpreted by Linux as the timer wrapping > >>> around after 2^56 cycles. > >>> > >>> Investigation led to discovery of some obvious problems with this SoC's > >>> architectural timer, and this patch series introduces what I believe is > >>> the simplest workaround. More details are in the commit message for patch > >>> 1. Patch 2 simply enables the workaround in the device tree. > >> > >> What's the deal with this series? There was a couple of nits to address, and > >> I was more or less expecting a v2. > > > > I got reports that people were still occasionally having clock jumps after > > applying this series, so I wanted to attempt a more complete fix, but I haven't > > had time to do any deeper investigation. I think this series is still beneficial > > even if it's not a complete solution, so I'll come back with another patch on > > top of this if/once I get it fully fixed. > > > > I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz > > timer) ≈ 150 should be a conservative iteration limit? > > Should be OK. > > Maxime: How do you want to deal with the documentation aspect? We need > an erratum number, but AFAIU the concept hasn't made it into the silicom > vendor's brain yet. Any chance you could come up with something that > uniquely identifies this? Yeah, I don't know how we can address that unfortunately. Or maybe we can call it timer-broken-1 ? It's as good as an ID than any other ID :) Maxime -- Maxime Ripard, Bootlin (formerly Free Electrons) Embedded Linux and Kernel engineering https://bootlin.com [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-03 18:42 ` Samuel Holland 2018-07-04 8:16 ` Marc Zyngier @ 2018-07-04 8:41 ` Daniel Lezcano 2018-07-04 12:52 ` [linux-sunxi] " Andre Przywara 2 siblings, 0 replies; 30+ messages in thread From: Daniel Lezcano @ 2018-07-04 8:41 UTC (permalink / raw) To: Samuel Holland, Marc Zyngier, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland On 03/07/2018 20:42, Samuel Holland wrote: > On 07/03/18 10:09, Marc Zyngier wrote: >> On 11/05/18 03:27, Samuel Holland wrote: >>> Hello, >>> >>> Several people (including me) have experienced extremely large system >>> clock jumps on their A64-based devices, apparently due to the architectural >>> timer going backward, which is interpreted by Linux as the timer wrapping >>> around after 2^56 cycles. >>> >>> Investigation led to discovery of some obvious problems with this SoC's >>> architectural timer, and this patch series introduces what I believe is >>> the simplest workaround. More details are in the commit message for patch >>> 1. Patch 2 simply enables the workaround in the device tree. >> >> What's the deal with this series? There was a couple of nits to address, and >> I was more or less expecting a v2. > > I got reports that people were still occasionally having clock jumps after > applying this series, so I wanted to attempt a more complete fix, but I haven't > had time to do any deeper investigation. I think this series is still beneficial > even if it's not a complete solution, so I'll come back with another patch on > top of this if/once I get it fully fixed. > > I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz > timer) ≈ 150 should be a conservative iteration limit? > > Also, does this make sense to CC to stable? I understand a partial fix is better than nothing but if you can narrow down the issue and provide patches to fix it in one shot that would be awesome. -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround 2018-07-03 18:42 ` Samuel Holland 2018-07-04 8:16 ` Marc Zyngier 2018-07-04 8:41 ` Daniel Lezcano @ 2018-07-04 12:52 ` Andre Przywara 2 siblings, 0 replies; 30+ messages in thread From: Andre Przywara @ 2018-07-04 12:52 UTC (permalink / raw) To: samuel, Marc Zyngier, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland Hi, On 03/07/18 19:42, Samuel Holland wrote: > On 07/03/18 10:09, Marc Zyngier wrote: >> On 11/05/18 03:27, Samuel Holland wrote: >>> Hello, >>> >>> Several people (including me) have experienced extremely large system >>> clock jumps on their A64-based devices, apparently due to the architectural >>> timer going backward, which is interpreted by Linux as the timer wrapping >>> around after 2^56 cycles. >>> >>> Investigation led to discovery of some obvious problems with this SoC's >>> architectural timer, and this patch series introduces what I believe is >>> the simplest workaround. More details are in the commit message for patch >>> 1. Patch 2 simply enables the workaround in the device tree. >> >> What's the deal with this series? There was a couple of nits to address, and >> I was more or less expecting a v2. > > I got reports that people were still occasionally having clock jumps after > applying this series, so I wanted to attempt a more complete fix, but I haven't > had time to do any deeper investigation. Looking at the FSL workaround, I see that they cover TVAL reads as well. Not sure entirely why, but maybe it's worth to follow this lead? Cheers, Andre. > I think this series is still beneficial > even if it's not a complete solution, so I'll come back with another patch on > top of this if/once I get it fully fixed. > > I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz > timer) ≈ 150 should be a conservative iteration limit? > > Also, does this make sense to CC to stable? > > Thanks, > Samuel > ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2018-07-12 2:31 UTC | newest] Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-05-11 2:27 [PATCH 0/2] Allwinner A64 timer workaround Samuel Holland 2018-05-11 2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland 2018-05-11 8:26 ` Maxime Ripard 2018-05-11 8:48 ` Marc Zyngier 2018-05-11 15:08 ` Samuel Holland 2018-05-26 15:15 ` André Przywara 2018-05-11 2:27 ` [PATCH 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland 2018-05-11 9:24 ` [PATCH 0/2] Allwinner " Andre Przywara 2018-07-03 15:09 ` Marc Zyngier 2018-07-03 18:42 ` Samuel Holland 2018-07-04 8:16 ` Marc Zyngier 2018-07-04 8:19 ` Chen-Yu Tsai 2018-07-04 8:23 ` Daniel Lezcano 2018-07-04 8:39 ` Marc Zyngier 2018-07-04 10:00 ` Thomas Gleixner 2018-07-04 13:08 ` [linux-sunxi] " Andre Przywara 2018-07-04 14:31 ` Thomas Gleixner 2018-07-04 14:44 ` Andre Przywara 2018-07-04 15:01 ` Marc Zyngier 2018-07-04 15:15 ` Andre Przywara 2018-07-04 15:30 ` Marc Zyngier 2018-07-04 15:23 ` Samuel Holland 2018-07-04 15:14 ` Thomas Gleixner 2018-07-04 15:43 ` Andre Przywara 2018-07-04 19:49 ` Thomas Gleixner 2018-07-04 8:41 ` Daniel Lezcano 2018-07-12 2:23 ` Samuel Holland 2018-07-04 9:06 ` Maxime Ripard 2018-07-04 8:41 ` Daniel Lezcano 2018-07-04 12:52 ` [linux-sunxi] " Andre Przywara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).