All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-05-11  2:27 ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11  2:27 UTC (permalink / raw)
  To: Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon,
	Daniel Lezcano, Thomas Gleixner, Marc Zyngier
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi

Hello,

Several people (including me) have experienced extremely large system
clock jumps on their A64-based devices, apparently due to the
architectural timer going backward, which is interpreted by Linux as
the timer wrapping around after 2^56 cycles.

Investigation led to discovery of some obvious problems with this SoC's
architectural timer, and this patch series introduces what I believe is
the simplest workaround. More details are in the commit message for
patch 1. Patch 2 simply enables the workaround in the device tree.

Thanks,
Samuel

Samuel Holland (2):
  arm64: arch_timer: Workaround for Allwinner A64 timer instability
  arm64: dts: allwinner: a64: Enable A64 timer workaround

 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi |  1 +
 drivers/clocksource/Kconfig                   | 11 ++++++++
 drivers/clocksource/arm_arch_timer.c          | 39 +++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

--
2.16.1

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-05-11  2:27 ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11  2:27 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

Several people (including me) have experienced extremely large system
clock jumps on their A64-based devices, apparently due to the
architectural timer going backward, which is interpreted by Linux as
the timer wrapping around after 2^56 cycles.

Investigation led to discovery of some obvious problems with this SoC's
architectural timer, and this patch series introduces what I believe is
the simplest workaround. More details are in the commit message for
patch 1. Patch 2 simply enables the workaround in the device tree.

Thanks,
Samuel

Samuel Holland (2):
  arm64: arch_timer: Workaround for Allwinner A64 timer instability
  arm64: dts: allwinner: a64: Enable A64 timer workaround

 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi |  1 +
 drivers/clocksource/Kconfig                   | 11 ++++++++
 drivers/clocksource/arm_arch_timer.c          | 39 +++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

--
2.16.1

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2018-05-11  2:27 ` Samuel Holland
@ 2018-05-11  2:27   ` Samuel Holland
  -1 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11  2:27 UTC (permalink / raw)
  To: Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon,
	Daniel Lezcano, Thomas Gleixner, Marc Zyngier
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Samuel Holland

The Allwinner A64 SoC is known [1] to have an unstable architectural
timer, which manifests itself most obviously in the time jumping forward
a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
timer frequency of 24 MHz, implying that the time went slightly backward
(and this was interpreted by the kernel as it jumping forward and
wrapping around past the epoch).

Further investigation revealed instability in the low bits of CNTVCT at
the point a high bit rolls over. This leads to power-of-two cycle
forward and backward jumps. (Testing shows that forward jumps are about
twice as likely as backward jumps.)

Without trapping reads to CNTVCT, a userspace program is able to read it
in a loop faster than it changes. A test program running on all 4 CPU
cores that reported jumps larger than 100 ms was run for 13.6 hours and
reported the following:

 Count | Event
-------+---------------------------
  9940 | jumped backward      699ms
   268 | jumped backward     1398ms
     1 | jumped backward     2097ms
 16020 | jumped forward       175ms
  6443 | jumped forward       699ms
  2976 | jumped forward      1398ms
     9 | jumped forward    356516ms
     9 | jumped forward    357215ms
     4 | jumped forward    714430ms
     1 | jumped forward   3578440ms

This works out to a jump larger than 100 ms about every 5.5 seconds on
each CPU core.

The largest jump (almost an hour!) was the following sequence of reads:
      0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000

Note that the middle bits don't necessarily all read as all zeroes or
all ones during the anomalous behavior; however the low 11 bits checked
by the function in this patch have never been observed with any other
value.

Also note that smaller jumps are much more common, with the smallest
backward jumps of 2048 cycles observed over 400 times per second on each
core. (Of course, this is partially due to lower bits rolling over more
frequently.) Any one of these could have caused the 95 year time skip.

Similar anomalies were observed while reading CNTPCT (after patching the
kernel to allow reads from userspace). However, the jumps are much less
frequent, and only small jumps were observed. The same program as before
(except now reading CNTPCT) observed after 72 hours:

 Count | Event
-------+---------------------------
    17 | jumped backward      699ms
    52 | jumped forward       175ms
  2831 | jumped forward       699ms
     5 | jumped forward      1398ms

========================================================================

Because the CPU can read the CNTPCT/CNTVCT registers faster than they
change, performing two reads of the register and comparing the high bits
(like other workarounds) is not a workable solution. And because the
timer can jump both forward and backward, no pair of reads can
distinguish a good value from a bad one. The only way to guarantee a
good value from consecutive reads would be to read _three_ times, and
take the middle value iff the three values are 1) individually unique
and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
an anomaly is detected.

However, since there is a distinct pattern to the bad values, we can
optimize the common case (2046/2048 of the time) to a single read by
simply ignoring values that match the pattern. This still takes no more
than 3 cycles in the worst case, and requires much less code.

[1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
[2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
[3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26

Signed-off-by: Samuel Holland <samuel@sholland.org>
---
 drivers/clocksource/Kconfig          | 11 ++++++++++
 drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 8e8a09755d10..7a5d434dd30b 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921
 	  The workaround will be dynamically enabled when an affected
 	  core is detected.
 
+config SUN50I_A64_UNSTABLE_TIMER
+	bool "Workaround for Allwinner A64 timer instability"
+	default y
+	depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI
+	select ARM_ARCH_TIMER_OOL_WORKAROUND
+	help
+	  This option enables a workaround for instability in the timer on
+	  the Allwinner A64 SoC. The workaround will only be active if the
+	  allwinner,sun50i-a64-unstable-timer property is found in the
+	  timer node.
+
 config ARM_GLOBAL_TIMER
 	bool "Support for the ARM global timer" if COMPILE_TEST
 	select TIMER_OF if OF
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 57cb2f00fc07..66ce13578c52 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
 }
 #endif
 
+#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
+/*
+ * The low bits of each register can transiently read as all ones or all zeroes
+ * when bit 11 or greater rolls over. Since the value can jump both backward
+ * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just
+ * ignore register values with all ones or zeros in the low bits.
+ */
+static u64 notrace sun50i_a64_read_cntpct_el0(void)
+{
+	u64 val;
+
+	do {
+		val = read_sysreg(cntpct_el0);
+	} while (((val + 1) & GENMASK(10, 0)) <= 1);
+
+	return val;
+}
+
+static u64 notrace sun50i_a64_read_cntvct_el0(void)
+{
+	u64 val;
+
+	do {
+		val = read_sysreg(cntvct_el0);
+	} while (((val + 1) & GENMASK(10, 0)) <= 1);
+
+	return val;
+}
+#endif
+
 #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND
 DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround);
 EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround);
@@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = {
 		.read_cntvct_el0 = arm64_858921_read_cntvct_el0,
 	},
 #endif
+#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
+	{
+		.match_type = ate_match_dt,
+		.id = "allwinner,sun50i-a64-unstable-timer",
+		.desc = "Allwinner A64 timer instability",
+		.read_cntpct_el0 = sun50i_a64_read_cntpct_el0,
+		.read_cntvct_el0 = sun50i_a64_read_cntvct_el0,
+	},
+#endif
 };
 
 typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *,
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
@ 2018-05-11  2:27   ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11  2:27 UTC (permalink / raw)
  To: linux-arm-kernel

The Allwinner A64 SoC is known [1] to have an unstable architectural
timer, which manifests itself most obviously in the time jumping forward
a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
timer frequency of 24 MHz, implying that the time went slightly backward
(and this was interpreted by the kernel as it jumping forward and
wrapping around past the epoch).

Further investigation revealed instability in the low bits of CNTVCT at
the point a high bit rolls over. This leads to power-of-two cycle
forward and backward jumps. (Testing shows that forward jumps are about
twice as likely as backward jumps.)

Without trapping reads to CNTVCT, a userspace program is able to read it
in a loop faster than it changes. A test program running on all 4 CPU
cores that reported jumps larger than 100 ms was run for 13.6 hours and
reported the following:

 Count | Event
-------+---------------------------
  9940 | jumped backward      699ms
   268 | jumped backward     1398ms
     1 | jumped backward     2097ms
 16020 | jumped forward       175ms
  6443 | jumped forward       699ms
  2976 | jumped forward      1398ms
     9 | jumped forward    356516ms
     9 | jumped forward    357215ms
     4 | jumped forward    714430ms
     1 | jumped forward   3578440ms

This works out to a jump larger than 100 ms about every 5.5 seconds on
each CPU core.

The largest jump (almost an hour!) was the following sequence of reads:
      0x0000007fffffffff ? 0x00000093feffffff ? 0x0000008000000000

Note that the middle bits don't necessarily all read as all zeroes or
all ones during the anomalous behavior; however the low 11 bits checked
by the function in this patch have never been observed with any other
value.

Also note that smaller jumps are much more common, with the smallest
backward jumps of 2048 cycles observed over 400 times per second on each
core. (Of course, this is partially due to lower bits rolling over more
frequently.) Any one of these could have caused the 95 year time skip.

Similar anomalies were observed while reading CNTPCT (after patching the
kernel to allow reads from userspace). However, the jumps are much less
frequent, and only small jumps were observed. The same program as before
(except now reading CNTPCT) observed after 72 hours:

 Count | Event
-------+---------------------------
    17 | jumped backward      699ms
    52 | jumped forward       175ms
  2831 | jumped forward       699ms
     5 | jumped forward      1398ms

========================================================================

Because the CPU can read the CNTPCT/CNTVCT registers faster than they
change, performing two reads of the register and comparing the high bits
(like other workarounds) is not a workable solution. And because the
timer can jump both forward and backward, no pair of reads can
distinguish a good value from a bad one. The only way to guarantee a
good value from consecutive reads would be to read _three_ times, and
take the middle value iff the three values are 1) individually unique
and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
an anomaly is detected.

However, since there is a distinct pattern to the bad values, we can
optimize the common case (2046/2048 of the time) to a single read by
simply ignoring values that match the pattern. This still takes no more
than 3 cycles in the worst case, and requires much less code.

[1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
[2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
[3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26

Signed-off-by: Samuel Holland <samuel@sholland.org>
---
 drivers/clocksource/Kconfig          | 11 ++++++++++
 drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 8e8a09755d10..7a5d434dd30b 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921
 	  The workaround will be dynamically enabled when an affected
 	  core is detected.
 
+config SUN50I_A64_UNSTABLE_TIMER
+	bool "Workaround for Allwinner A64 timer instability"
+	default y
+	depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI
+	select ARM_ARCH_TIMER_OOL_WORKAROUND
+	help
+	  This option enables a workaround for instability in the timer on
+	  the Allwinner A64 SoC. The workaround will only be active if the
+	  allwinner,sun50i-a64-unstable-timer property is found in the
+	  timer node.
+
 config ARM_GLOBAL_TIMER
 	bool "Support for the ARM global timer" if COMPILE_TEST
 	select TIMER_OF if OF
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 57cb2f00fc07..66ce13578c52 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
 }
 #endif
 
+#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
+/*
+ * The low bits of each register can transiently read as all ones or all zeroes
+ * when bit 11 or greater rolls over. Since the value can jump both backward
+ * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just
+ * ignore register values with all ones or zeros in the low bits.
+ */
+static u64 notrace sun50i_a64_read_cntpct_el0(void)
+{
+	u64 val;
+
+	do {
+		val = read_sysreg(cntpct_el0);
+	} while (((val + 1) & GENMASK(10, 0)) <= 1);
+
+	return val;
+}
+
+static u64 notrace sun50i_a64_read_cntvct_el0(void)
+{
+	u64 val;
+
+	do {
+		val = read_sysreg(cntvct_el0);
+	} while (((val + 1) & GENMASK(10, 0)) <= 1);
+
+	return val;
+}
+#endif
+
 #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND
 DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround);
 EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround);
@@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = {
 		.read_cntvct_el0 = arm64_858921_read_cntvct_el0,
 	},
 #endif
+#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
+	{
+		.match_type = ate_match_dt,
+		.id = "allwinner,sun50i-a64-unstable-timer",
+		.desc = "Allwinner A64 timer instability",
+		.read_cntpct_el0 = sun50i_a64_read_cntpct_el0,
+		.read_cntvct_el0 = sun50i_a64_read_cntvct_el0,
+	},
+#endif
 };
 
 typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *,
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround
  2018-05-11  2:27 ` Samuel Holland
@ 2018-05-11  2:27   ` Samuel Holland
  -1 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11  2:27 UTC (permalink / raw)
  To: Maxime Ripard, Chen-Yu Tsai, Catalin Marinas, Will Deacon,
	Daniel Lezcano, Thomas Gleixner, Marc Zyngier
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Samuel Holland

As instability in the architectural timer has been observed on multiple
devices using this SoC, inluding the Pine64 and the Orange Pi Win,
enable the workaround in the SoC's device tree.

Signed-off-by: Samuel Holland <samuel@sholland.org>
---
 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
index 1b2ef28c42bd..5202b76e9684 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
@@ -152,6 +152,7 @@
 
 	timer {
 		compatible = "arm,armv8-timer";
+		allwinner,sun50i-a64-unstable-timer;
 		interrupts = <GIC_PPI 13
 			(GIC_CPU_MASK_SIMPLE(4) | IRQ_TYPE_LEVEL_HIGH)>,
 			     <GIC_PPI 14
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround
@ 2018-05-11  2:27   ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11  2:27 UTC (permalink / raw)
  To: linux-arm-kernel

As instability in the architectural timer has been observed on multiple
devices using this SoC, inluding the Pine64 and the Orange Pi Win,
enable the workaround in the SoC's device tree.

Signed-off-by: Samuel Holland <samuel@sholland.org>
---
 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
index 1b2ef28c42bd..5202b76e9684 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
@@ -152,6 +152,7 @@
 
 	timer {
 		compatible = "arm,armv8-timer";
+		allwinner,sun50i-a64-unstable-timer;
 		interrupts = <GIC_PPI 13
 			(GIC_CPU_MASK_SIMPLE(4) | IRQ_TYPE_LEVEL_HIGH)>,
 			     <GIC_PPI 14
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2018-05-11  2:27   ` Samuel Holland
@ 2018-05-11  8:26     ` Maxime Ripard
  -1 siblings, 0 replies; 60+ messages in thread
From: Maxime Ripard @ 2018-05-11  8:26 UTC (permalink / raw)
  To: Samuel Holland
  Cc: Chen-Yu Tsai, Catalin Marinas, Will Deacon, Daniel Lezcano,
	Thomas Gleixner, Marc Zyngier, linux-arm-kernel, linux-kernel,
	linux-sunxi

[-- Attachment #1: Type: text/plain, Size: 4259 bytes --]

On Thu, May 10, 2018 at 09:27:50PM -0500, Samuel Holland wrote:
> The Allwinner A64 SoC is known [1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
> 
> Further investigation revealed instability in the low bits of CNTVCT at
> the point a high bit rolls over. This leads to power-of-two cycle
> forward and backward jumps. (Testing shows that forward jumps are about
> twice as likely as backward jumps.)
> 
> Without trapping reads to CNTVCT, a userspace program is able to read it
> in a loop faster than it changes. A test program running on all 4 CPU
> cores that reported jumps larger than 100 ms was run for 13.6 hours and
> reported the following:
> 
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
> 
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
> 
> The largest jump (almost an hour!) was the following sequence of reads:
>       0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
> 
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 11 bits checked
> by the function in this patch have never been observed with any other
> value.
> 
> Also note that smaller jumps are much more common, with the smallest
> backward jumps of 2048 cycles observed over 400 times per second on each
> core. (Of course, this is partially due to lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
> 
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the jumps are much less
> frequent, and only small jumps were observed. The same program as before
> (except now reading CNTPCT) observed after 72 hours:
> 
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
> 
> ========================================================================
> 
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value iff the three values are 1) individually unique
> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
> an anomaly is detected.
> 
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (2046/2048 of the time) to a single read by
> simply ignoring values that match the pattern. This still takes no more
> than 3 cycles in the worst case, and requires much less code.

That's an awesome commit log, thanks!

For both patches:
Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>

> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/

Sigh. So armbian knew about this for more than a year and had a fix,
and didn't judge necessary to report it anywhere. That's some solid,
responsible, development right there...

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
@ 2018-05-11  8:26     ` Maxime Ripard
  0 siblings, 0 replies; 60+ messages in thread
From: Maxime Ripard @ 2018-05-11  8:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 10, 2018 at 09:27:50PM -0500, Samuel Holland wrote:
> The Allwinner A64 SoC is known [1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
> 
> Further investigation revealed instability in the low bits of CNTVCT at
> the point a high bit rolls over. This leads to power-of-two cycle
> forward and backward jumps. (Testing shows that forward jumps are about
> twice as likely as backward jumps.)
> 
> Without trapping reads to CNTVCT, a userspace program is able to read it
> in a loop faster than it changes. A test program running on all 4 CPU
> cores that reported jumps larger than 100 ms was run for 13.6 hours and
> reported the following:
> 
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
> 
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
> 
> The largest jump (almost an hour!) was the following sequence of reads:
>       0x0000007fffffffff ? 0x00000093feffffff ? 0x0000008000000000
> 
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 11 bits checked
> by the function in this patch have never been observed with any other
> value.
> 
> Also note that smaller jumps are much more common, with the smallest
> backward jumps of 2048 cycles observed over 400 times per second on each
> core. (Of course, this is partially due to lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
> 
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the jumps are much less
> frequent, and only small jumps were observed. The same program as before
> (except now reading CNTPCT) observed after 72 hours:
> 
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
> 
> ========================================================================
> 
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value iff the three values are 1) individually unique
> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
> an anomaly is detected.
> 
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (2046/2048 of the time) to a single read by
> simply ignoring values that match the pattern. This still takes no more
> than 3 cycles in the worst case, and requires much less code.

That's an awesome commit log, thanks!

For both patches:
Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>

> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/

Sigh. So armbian knew about this for more than a year and had a fix,
and didn't judge necessary to report it anywhere. That's some solid,
responsible, development right there...

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20180511/3e662f74/attachment-0001.sig>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2018-05-11  2:27   ` Samuel Holland
@ 2018-05-11  8:48     ` Marc Zyngier
  -1 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-05-11  8:48 UTC (permalink / raw)
  To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas,
	Will Deacon, Daniel Lezcano, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

[+Mark, who co-maintains the arch timer code with me]

Hi Samuel,

On 11/05/18 03:27, Samuel Holland wrote:
> The Allwinner A64 SoC is known [1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
> 
> Further investigation revealed instability in the low bits of CNTVCT at
> the point a high bit rolls over. This leads to power-of-two cycle
> forward and backward jumps. (Testing shows that forward jumps are about
> twice as likely as backward jumps.)
> 
> Without trapping reads to CNTVCT, a userspace program is able to read it
> in a loop faster than it changes. A test program running on all 4 CPU
> cores that reported jumps larger than 100 ms was run for 13.6 hours and
> reported the following:
> 
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
> 
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
> 
> The largest jump (almost an hour!) was the following sequence of reads:
>       0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
> 
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 11 bits checked
> by the function in this patch have never been observed with any other
> value.
> 
> Also note that smaller jumps are much more common, with the smallest
> backward jumps of 2048 cycles observed over 400 times per second on each
> core. (Of course, this is partially due to lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
> 
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the jumps are much less
> frequent, and only small jumps were observed. The same program as before
> (except now reading CNTPCT) observed after 72 hours:
> 
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
> 
> ========================================================================
> 
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value iff the three values are 1) individually unique
> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
> an anomaly is detected.
> 
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (2046/2048 of the time) to a single read by
> simply ignoring values that match the pattern. This still takes no more
> than 3 cycles in the worst case, and requires much less code.

Thanks for the thorough description of the problem. A couple of questions:

- Have the 000/7ff values of the bottom bits only been experimentally
found? Or do you have more concrete evidence of this is what happens in
the HW?

- Do you have an official erratum number from the silicon vendor, so
that Documentation/arm64/silicon-errata.txt can be kept up to date?

> 
> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
> [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
> 
> Signed-off-by: Samuel Holland <samuel@sholland.org>
> ---
>  drivers/clocksource/Kconfig          | 11 ++++++++++
>  drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> index 8e8a09755d10..7a5d434dd30b 100644
> --- a/drivers/clocksource/Kconfig
> +++ b/drivers/clocksource/Kconfig
> @@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921
>  	  The workaround will be dynamically enabled when an affected
>  	  core is detected.
>  
> +config SUN50I_A64_UNSTABLE_TIMER
> +	bool "Workaround for Allwinner A64 timer instability"
> +	default y
> +	depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI
> +	select ARM_ARCH_TIMER_OOL_WORKAROUND
> +	help
> +	  This option enables a workaround for instability in the timer on
> +	  the Allwinner A64 SoC. The workaround will only be active if the
> +	  allwinner,sun50i-a64-unstable-timer property is found in the
> +	  timer node.
> +
>  config ARM_GLOBAL_TIMER
>  	bool "Support for the ARM global timer" if COMPILE_TEST
>  	select TIMER_OF if OF
> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index 57cb2f00fc07..66ce13578c52 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
> +/*
> + * The low bits of each register can transiently read as all ones or all zeroes
> + * when bit 11 or greater rolls over. Since the value can jump both backward
> + * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just
> + * ignore register values with all ones or zeros in the low bits.
> + */
> +static u64 notrace sun50i_a64_read_cntpct_el0(void)
> +{
> +	u64 val;
> +
> +	do {
> +		val = read_sysreg(cntpct_el0);
> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);

Other workarounds of the same kind have a bounded loop. Have you done
any investigation on how many loops you need at most to sort the timer?
Depending on how many loops you need, it might be worth sticking a WFE
here to just wait for something to happen instead of busy looping.

> +
> +	return val;
> +}
> +
> +static u64 notrace sun50i_a64_read_cntvct_el0(void)
> +{
> +	u64 val;
> +
> +	do {
> +		val = read_sysreg(cntvct_el0);
> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);
> +
> +	return val;
> +}
> +#endif
> +
>  #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND
>  DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround);
>  EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround);
> @@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = {
>  		.read_cntvct_el0 = arm64_858921_read_cntvct_el0,
>  	},
>  #endif
> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
> +	{
> +		.match_type = ate_match_dt,
> +		.id = "allwinner,sun50i-a64-unstable-timer",
> +		.desc = "Allwinner A64 timer instability",
> +		.read_cntpct_el0 = sun50i_a64_read_cntpct_el0,
> +		.read_cntvct_el0 = sun50i_a64_read_cntvct_el0,
> +	},
> +#endif
>  };
>  
>  typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *,
> 

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
@ 2018-05-11  8:48     ` Marc Zyngier
  0 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-05-11  8:48 UTC (permalink / raw)
  To: linux-arm-kernel

[+Mark, who co-maintains the arch timer code with me]

Hi Samuel,

On 11/05/18 03:27, Samuel Holland wrote:
> The Allwinner A64 SoC is known [1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
> 
> Further investigation revealed instability in the low bits of CNTVCT at
> the point a high bit rolls over. This leads to power-of-two cycle
> forward and backward jumps. (Testing shows that forward jumps are about
> twice as likely as backward jumps.)
> 
> Without trapping reads to CNTVCT, a userspace program is able to read it
> in a loop faster than it changes. A test program running on all 4 CPU
> cores that reported jumps larger than 100 ms was run for 13.6 hours and
> reported the following:
> 
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
> 
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
> 
> The largest jump (almost an hour!) was the following sequence of reads:
>       0x0000007fffffffff ? 0x00000093feffffff ? 0x0000008000000000
> 
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 11 bits checked
> by the function in this patch have never been observed with any other
> value.
> 
> Also note that smaller jumps are much more common, with the smallest
> backward jumps of 2048 cycles observed over 400 times per second on each
> core. (Of course, this is partially due to lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
> 
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the jumps are much less
> frequent, and only small jumps were observed. The same program as before
> (except now reading CNTPCT) observed after 72 hours:
> 
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
> 
> ========================================================================
> 
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value iff the three values are 1) individually unique
> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
> an anomaly is detected.
> 
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (2046/2048 of the time) to a single read by
> simply ignoring values that match the pattern. This still takes no more
> than 3 cycles in the worst case, and requires much less code.

Thanks for the thorough description of the problem. A couple of questions:

- Have the 000/7ff values of the bottom bits only been experimentally
found? Or do you have more concrete evidence of this is what happens in
the HW?

- Do you have an official erratum number from the silicon vendor, so
that Documentation/arm64/silicon-errata.txt can be kept up to date?

> 
> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
> [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
> 
> Signed-off-by: Samuel Holland <samuel@sholland.org>
> ---
>  drivers/clocksource/Kconfig          | 11 ++++++++++
>  drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> index 8e8a09755d10..7a5d434dd30b 100644
> --- a/drivers/clocksource/Kconfig
> +++ b/drivers/clocksource/Kconfig
> @@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921
>  	  The workaround will be dynamically enabled when an affected
>  	  core is detected.
>  
> +config SUN50I_A64_UNSTABLE_TIMER
> +	bool "Workaround for Allwinner A64 timer instability"
> +	default y
> +	depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI
> +	select ARM_ARCH_TIMER_OOL_WORKAROUND
> +	help
> +	  This option enables a workaround for instability in the timer on
> +	  the Allwinner A64 SoC. The workaround will only be active if the
> +	  allwinner,sun50i-a64-unstable-timer property is found in the
> +	  timer node.
> +
>  config ARM_GLOBAL_TIMER
>  	bool "Support for the ARM global timer" if COMPILE_TEST
>  	select TIMER_OF if OF
> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index 57cb2f00fc07..66ce13578c52 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
> +/*
> + * The low bits of each register can transiently read as all ones or all zeroes
> + * when bit 11 or greater rolls over. Since the value can jump both backward
> + * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just
> + * ignore register values with all ones or zeros in the low bits.
> + */
> +static u64 notrace sun50i_a64_read_cntpct_el0(void)
> +{
> +	u64 val;
> +
> +	do {
> +		val = read_sysreg(cntpct_el0);
> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);

Other workarounds of the same kind have a bounded loop. Have you done
any investigation on how many loops you need@most to sort the timer?
Depending on how many loops you need, it might be worth sticking a WFE
here to just wait for something to happen instead of busy looping.

> +
> +	return val;
> +}
> +
> +static u64 notrace sun50i_a64_read_cntvct_el0(void)
> +{
> +	u64 val;
> +
> +	do {
> +		val = read_sysreg(cntvct_el0);
> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);
> +
> +	return val;
> +}
> +#endif
> +
>  #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND
>  DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround);
>  EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround);
> @@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = {
>  		.read_cntvct_el0 = arm64_858921_read_cntvct_el0,
>  	},
>  #endif
> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
> +	{
> +		.match_type = ate_match_dt,
> +		.id = "allwinner,sun50i-a64-unstable-timer",
> +		.desc = "Allwinner A64 timer instability",
> +		.read_cntpct_el0 = sun50i_a64_read_cntpct_el0,
> +		.read_cntvct_el0 = sun50i_a64_read_cntvct_el0,
> +	},
> +#endif
>  };
>  
>  typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *,
> 

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-05-11  2:27 ` Samuel Holland
@ 2018-05-11  9:24   ` Andre Przywara
  -1 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-05-11  9:24 UTC (permalink / raw)
  To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas,
	Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier
  Cc: linux-sunxi, linux-kernel, linux-arm-kernel

Hi,

On 11/05/18 03:27, Samuel Holland wrote:
> Hello,
> 
> Several people (including me) have experienced extremely large system
> clock jumps on their A64-based devices, apparently due to the
> architectural timer going backward, which is interpreted by Linux as
> the timer wrapping around after 2^56 cycles.

So I experienced this before, though I didn't see actual clock jumps,
only that subsequent reads of CNTVCT_EL0, both directly via mrs and
indirectly via CLOCK_MONOTONIC, from userland (triggered by a directed
test) *sometimes* went backwards, with a number of '1's in the lower bits.
But that didn't happen on every boot, and I was suspecting that some
timer setup was missing on the hardware/firmware side. And later on I
failed to reproduce it anymore.
So do you see it on every boot, with recent U-Boot/ATF?

Cheers,
Andre.

> Investigation led to discovery of some obvious problems with this SoC's
> architectural timer, and this patch series introduces what I believe is
> the simplest workaround. More details are in the commit message for
> patch 1. Patch 2 simply enables the workaround in the device tree.
> 
> Thanks,
> Samuel
> 
> Samuel Holland (2):
>   arm64: arch_timer: Workaround for Allwinner A64 timer instability
>   arm64: dts: allwinner: a64: Enable A64 timer workaround
> 
>  arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi |  1 +
>  drivers/clocksource/Kconfig                   | 11 ++++++++
>  drivers/clocksource/arm_arch_timer.c          | 39 +++++++++++++++++++++++++++
>  3 files changed, 51 insertions(+)
> 
> --
> 2.16.1
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-05-11  9:24   ` Andre Przywara
  0 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-05-11  9:24 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On 11/05/18 03:27, Samuel Holland wrote:
> Hello,
> 
> Several people (including me) have experienced extremely large system
> clock jumps on their A64-based devices, apparently due to the
> architectural timer going backward, which is interpreted by Linux as
> the timer wrapping around after 2^56 cycles.

So I experienced this before, though I didn't see actual clock jumps,
only that subsequent reads of CNTVCT_EL0, both directly via mrs and
indirectly via CLOCK_MONOTONIC, from userland (triggered by a directed
test) *sometimes* went backwards, with a number of '1's in the lower bits.
But that didn't happen on every boot, and I was suspecting that some
timer setup was missing on the hardware/firmware side. And later on I
failed to reproduce it anymore.
So do you see it on every boot, with recent U-Boot/ATF?

Cheers,
Andre.

> Investigation led to discovery of some obvious problems with this SoC's
> architectural timer, and this patch series introduces what I believe is
> the simplest workaround. More details are in the commit message for
> patch 1. Patch 2 simply enables the workaround in the device tree.
> 
> Thanks,
> Samuel
> 
> Samuel Holland (2):
>   arm64: arch_timer: Workaround for Allwinner A64 timer instability
>   arm64: dts: allwinner: a64: Enable A64 timer workaround
> 
>  arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi |  1 +
>  drivers/clocksource/Kconfig                   | 11 ++++++++
>  drivers/clocksource/arm_arch_timer.c          | 39 +++++++++++++++++++++++++++
>  3 files changed, 51 insertions(+)
> 
> --
> 2.16.1
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2018-05-11  8:48     ` Marc Zyngier
@ 2018-05-11 15:08       ` Samuel Holland
  -1 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11 15:08 UTC (permalink / raw)
  To: Marc Zyngier, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas,
	Will Deacon, Daniel Lezcano, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 05/11/18 03:48, Marc Zyngier wrote:
> [+Mark, who co-maintains the arch timer code with me]
> 
> Hi Samuel,
> 
> On 11/05/18 03:27, Samuel Holland wrote:
>> The Allwinner A64 SoC is known [1] to have an unstable architectural
>> timer, which manifests itself most obviously in the time jumping forward
>> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
>> timer frequency of 24 MHz, implying that the time went slightly backward
>> (and this was interpreted by the kernel as it jumping forward and
>> wrapping around past the epoch).
>>
>> Further investigation revealed instability in the low bits of CNTVCT at
>> the point a high bit rolls over. This leads to power-of-two cycle
>> forward and backward jumps. (Testing shows that forward jumps are about
>> twice as likely as backward jumps.)
>>
>> Without trapping reads to CNTVCT, a userspace program is able to read it
>> in a loop faster than it changes. A test program running on all 4 CPU
>> cores that reported jumps larger than 100 ms was run for 13.6 hours and
>> reported the following:
>>
>>  Count | Event
>> -------+---------------------------
>>   9940 | jumped backward      699ms
>>    268 | jumped backward     1398ms
>>      1 | jumped backward     2097ms
>>  16020 | jumped forward       175ms
>>   6443 | jumped forward       699ms
>>   2976 | jumped forward      1398ms
>>      9 | jumped forward    356516ms
>>      9 | jumped forward    357215ms
>>      4 | jumped forward    714430ms
>>      1 | jumped forward   3578440ms
>>
>> This works out to a jump larger than 100 ms about every 5.5 seconds on
>> each CPU core.
>>
>> The largest jump (almost an hour!) was the following sequence of reads:
>>       0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
>>
>> Note that the middle bits don't necessarily all read as all zeroes or
>> all ones during the anomalous behavior; however the low 11 bits checked
>> by the function in this patch have never been observed with any other
>> value.
>>
>> Also note that smaller jumps are much more common, with the smallest
>> backward jumps of 2048 cycles observed over 400 times per second on each
>> core. (Of course, this is partially due to lower bits rolling over more
>> frequently.) Any one of these could have caused the 95 year time skip.
>>
>> Similar anomalies were observed while reading CNTPCT (after patching the
>> kernel to allow reads from userspace). However, the jumps are much less
>> frequent, and only small jumps were observed. The same program as before
>> (except now reading CNTPCT) observed after 72 hours:
>>
>>  Count | Event
>> -------+---------------------------
>>     17 | jumped backward      699ms
>>     52 | jumped forward       175ms
>>   2831 | jumped forward       699ms
>>      5 | jumped forward      1398ms
>>
>> ========================================================================
>>
>> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
>> change, performing two reads of the register and comparing the high bits
>> (like other workarounds) is not a workable solution. And because the
>> timer can jump both forward and backward, no pair of reads can
>> distinguish a good value from a bad one. The only way to guarantee a
>> good value from consecutive reads would be to read _three_ times, and
>> take the middle value iff the three values are 1) individually unique
>> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
>> an anomaly is detected.
>>
>> However, since there is a distinct pattern to the bad values, we can
>> optimize the common case (2046/2048 of the time) to a single read by
>> simply ignoring values that match the pattern. This still takes no more
>> than 3 cycles in the worst case, and requires much less code.
> 
> Thanks for the thorough description of the problem. A couple of questions:
> 
> - Have the 000/7ff values of the bottom bits only been experimentally
> found? Or do you have more concrete evidence of this is what happens in
> the HW?

Only experimentally found. Here's a sample of the larger jumps:

CPU 3: 0x000000a5ebffffff → 0x000000a5edffffff (retry = 0x000000a5ec000000)
CPU 1: 0x000000a5ebffffff → 0x000000a5edffffff (retry = 0x000000a5ec000000)
CPU 1: 0x000000a5ecffffff > 0x000000a5ec000000 (retry = 0x000000a5ed000000)
CPU 0: 0x000000a5eeffffff > 0x000000a5ee000000 (retry = 0x000000a5ef000000)
CPU 2: 0x000000a5f0bfffff → 0x000000a5f0ffffff (retry = 0x000000a5f0c00000)
CPU 0: 0x000000a5f0ffffff > 0x000000a5f0000000 (retry = 0x000000a5f1000000)
CPU 2: 0x000000a5f3ffffff > 0x000000a5f2000000 (retry = 0x000000a5f4000000)
CPU 0: 0x000000a5f6ffffff > 0x000000a5f6000000 (retry = 0x000000a5f7000000)
CPU 1: 0x000000a5fbbfffff → 0x000000a5fbffffff (retry = 0x000000a5fbc00000)
CPU 0: 0x000000a5fdffffff → 0x000000a5feffffff (retry = 0x000000a5fe000000)
CPU 2: 0x000000a5ffffffff → 0x000000a7fe000000 (retry = 0x000000a600000000)
CPU 3: 0x000000a5ffffffff → 0x000000a7fe000000 (retry = 0x000000a600000000)
CPU 1: 0x000000a603bfffff → 0x000000a603ffffff (retry = 0x000000a603c00000)
CPU 1: 0x000000a607bfffff → 0x000000a607ffffff (retry = 0x000000a607c00000)
CPU 3: 0x000000a60cbfffff → 0x000000a60cffffff (retry = 0x000000a60cc00000)
CPU 2: 0x000000a60cbfffff → 0x000000a60cffffff (retry = 0x000000a60cc00000)
CPU 2: 0x000000a60dffffff → 0x000000a60effffff (retry = 0x000000a60e000000)
CPU 0: 0x000000a60e3fffff → 0x000000a60e7fffff (retry = 0x000000a60e400000)
CPU 1: 0x000000a60effffff > 0x000000a60e000000 (retry = 0x000000a60f000000)

(Note that for the large jump, 0x000000a5ffffffff → 0x000000a7fe000000, not
all of the low bits are zeroes during the transition.)

And of the tiny jumps backward:

CPU 3: 0x000000d2f010bfff > 0x000000d2f010b800 (retry = 0x000000d2f010b801)
CPU 3: 0x000000d2f0110fff > 0x000000d2f0110800 (retry = 0x000000d2f0110801)
CPU 3: 0x000000d2f0112fff > 0x000000d2f0112800 (retry = 0x000000d2f0112801)
CPU 3: 0x000000d2f0119fff > 0x000000d2f0119800 (retry = 0x000000d2f0119801)
CPU 3: 0x000000d2f011f7ff > 0x000000d2f011f000 (retry = 0x000000d2f011f800)
CPU 3: 0x000000d2f01257ff > 0x000000d2f0125000 (retry = 0x000000d2f0125800)
CPU 3: 0x000000d2f012b7ff > 0x000000d2f012b000 (retry = 0x000000d2f012b800)
CPU 3: 0x000000d2f01317ff > 0x000000d2f0131000 (retry = 0x000000d2f0131800)
CPU 2: 0x000000d2f00c17ff > 0x000000d2f00c1000 (retry = 0x000000d2f00c1800)
CPU 2: 0x000000d2f0136fff > 0x000000d2f0136800 (retry = 0x000000d2f0136801)
CPU 2: 0x000000d2f01377ff > 0x000000d2f0137000 (retry = 0x000000d2f0137800)
CPU 2: 0x000000d2f013afff > 0x000000d2f013a800 (retry = 0x000000d2f013a801)
CPU 2: 0x000000d2f01417ff > 0x000000d2f0141000 (retry = 0x000000d2f0141800)
CPU 2: 0x000000d2f0142fff > 0x000000d2f0142800 (retry = 0x000000d2f0142801)
CPU 2: 0x000000d2f01497ff > 0x000000d2f0149000 (retry = 0x000000d2f0149800)
CPU 2: 0x000000d2f014b7ff > 0x000000d2f014b000 (retry = 0x000000d2f014b800)
CPU 2: 0x000000d2f0150fff > 0x000000d2f0150800 (retry = 0x000000d2f0150801)
CPU 2: 0x000000d2f01577ff > 0x000000d2f0157000 (retry = 0x000000d2f0157800)
CPU 2: 0x000000d2f015d7ff > 0x000000d2f015d000 (retry = 0x000000d2f015d800)
CPU 2: 0x000000d2f015f7ff > 0x000000d2f015f000 (retry = 0x000000d2f015f800)
CPU 2: 0x000000d2f01617ff > 0x000000d2f0161000 (retry = 0x000000d2f0161800)
CPU 2: 0x000000d2f0166fff > 0x000000d2f0166800 (retry = 0x000000d2f0166801)
CPU 2: 0x000000d2f0168fff > 0x000000d2f0168800 (retry = 0x000000d2f0168801)
CPU 2: 0x000000d2f016b7ff > 0x000000d2f016b000 (retry = 0x000000d2f016b800)

I have close to 100 hours of data, and the "bottom 11 bits" value came from
the observations that:
1) Jumps by 2048 cycles were the smallest jumps seen
2) Even when the anomalous values were not all ones/all zeroes, the bottom
   11 bits _were_ all ones/all zeroes

I'd be happy to run more tests if there's something else you think I should
look for.

> - Do you have an official erratum number from the silicon vendor, so
> that Documentation/arm64/silicon-errata.txt can be kept up to date?

No, unfortunately the only information we have from the vendor is in the
form of BSP kernel code drops. There _is_ a workaround present in their
official sources [1], but it looks to be for a different issue (one that I
haven't experienced).

[1]
https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/e634a960316d/drivers/clocksource/arm_arch_timer.c#L272-L327

>>
>> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
>> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
>> [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
>>
>> Signed-off-by: Samuel Holland <samuel@sholland.org>
>> ---
>>  drivers/clocksource/Kconfig          | 11 ++++++++++
>>  drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++
>>  2 files changed, 50 insertions(+)
>>
>> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
>> index 8e8a09755d10..7a5d434dd30b 100644
>> --- a/drivers/clocksource/Kconfig
>> +++ b/drivers/clocksource/Kconfig
>> @@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921
>>  	  The workaround will be dynamically enabled when an affected
>>  	  core is detected.
>>  
>> +config SUN50I_A64_UNSTABLE_TIMER
>> +	bool "Workaround for Allwinner A64 timer instability"
>> +	default y
>> +	depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI
>> +	select ARM_ARCH_TIMER_OOL_WORKAROUND
>> +	help
>> +	  This option enables a workaround for instability in the timer on
>> +	  the Allwinner A64 SoC. The workaround will only be active if the
>> +	  allwinner,sun50i-a64-unstable-timer property is found in the
>> +	  timer node.
>> +
>>  config ARM_GLOBAL_TIMER
>>  	bool "Support for the ARM global timer" if COMPILE_TEST
>>  	select TIMER_OF if OF
>> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
>> index 57cb2f00fc07..66ce13578c52 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
>>  }
>>  #endif
>>  
>> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
>> +/*
>> + * The low bits of each register can transiently read as all ones or all zeroes
>> + * when bit 11 or greater rolls over. Since the value can jump both backward
>> + * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just
>> + * ignore register values with all ones or zeros in the low bits.
>> + */
>> +static u64 notrace sun50i_a64_read_cntpct_el0(void)
>> +{
>> +	u64 val;
>> +
>> +	do {
>> +		val = read_sysreg(cntpct_el0);
>> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);
> 
> Other workarounds of the same kind have a bounded loop. Have you done
> any investigation on how many loops you need at most to sort the timer?
> Depending on how many loops you need, it might be worth sticking a WFE
> here to just wait for something to happen instead of busy looping.

The worst case delay inside the loop is two timer cycles at 24 MHz (83.3 ns).
So the number of iterations depends on the CPU speed. For example:

0x???4ffe -> read once and return
0x???4fff -> read and loop (((val + 1) & GENMASK(10, 0)) == 0)
0x???5000 -> read and loop (((val + 1) & GENMASK(10, 0)) == 1)
0x???5001 -> read once and return

Now of course this function could be preempted, in which case it most likely
does even fewer reads. (The probability of "read 0x???4fff", "preempt", "read
0x???5fff" is pretty low.)

I do not think it is possible to know if an 000/7ff value is valid without
reading three unique values from the timer. So the latency entirely depends
on the timer frequency, not the CPU speed.

>> +
>> +	return val;
>> +}
>> +
>> +static u64 notrace sun50i_a64_read_cntvct_el0(void)
>> +{
>> +	u64 val;
>> +
>> +	do {
>> +		val = read_sysreg(cntvct_el0);
>> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);
>> +
>> +	return val;
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND
>>  DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround);
>>  EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround);
>> @@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = {
>>  		.read_cntvct_el0 = arm64_858921_read_cntvct_el0,
>>  	},
>>  #endif
>> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
>> +	{
>> +		.match_type = ate_match_dt,
>> +		.id = "allwinner,sun50i-a64-unstable-timer",
>> +		.desc = "Allwinner A64 timer instability",
>> +		.read_cntpct_el0 = sun50i_a64_read_cntpct_el0,
>> +		.read_cntvct_el0 = sun50i_a64_read_cntvct_el0,
>> +	},
>> +#endif
>>  };
>>  
>>  typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *,
>>
> 
> Thanks,
> 
> 	M.
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
@ 2018-05-11 15:08       ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-05-11 15:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 05/11/18 03:48, Marc Zyngier wrote:
> [+Mark, who co-maintains the arch timer code with me]
> 
> Hi Samuel,
> 
> On 11/05/18 03:27, Samuel Holland wrote:
>> The Allwinner A64 SoC is known [1] to have an unstable architectural
>> timer, which manifests itself most obviously in the time jumping forward
>> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
>> timer frequency of 24 MHz, implying that the time went slightly backward
>> (and this was interpreted by the kernel as it jumping forward and
>> wrapping around past the epoch).
>>
>> Further investigation revealed instability in the low bits of CNTVCT at
>> the point a high bit rolls over. This leads to power-of-two cycle
>> forward and backward jumps. (Testing shows that forward jumps are about
>> twice as likely as backward jumps.)
>>
>> Without trapping reads to CNTVCT, a userspace program is able to read it
>> in a loop faster than it changes. A test program running on all 4 CPU
>> cores that reported jumps larger than 100 ms was run for 13.6 hours and
>> reported the following:
>>
>>  Count | Event
>> -------+---------------------------
>>   9940 | jumped backward      699ms
>>    268 | jumped backward     1398ms
>>      1 | jumped backward     2097ms
>>  16020 | jumped forward       175ms
>>   6443 | jumped forward       699ms
>>   2976 | jumped forward      1398ms
>>      9 | jumped forward    356516ms
>>      9 | jumped forward    357215ms
>>      4 | jumped forward    714430ms
>>      1 | jumped forward   3578440ms
>>
>> This works out to a jump larger than 100 ms about every 5.5 seconds on
>> each CPU core.
>>
>> The largest jump (almost an hour!) was the following sequence of reads:
>>       0x0000007fffffffff ? 0x00000093feffffff ? 0x0000008000000000
>>
>> Note that the middle bits don't necessarily all read as all zeroes or
>> all ones during the anomalous behavior; however the low 11 bits checked
>> by the function in this patch have never been observed with any other
>> value.
>>
>> Also note that smaller jumps are much more common, with the smallest
>> backward jumps of 2048 cycles observed over 400 times per second on each
>> core. (Of course, this is partially due to lower bits rolling over more
>> frequently.) Any one of these could have caused the 95 year time skip.
>>
>> Similar anomalies were observed while reading CNTPCT (after patching the
>> kernel to allow reads from userspace). However, the jumps are much less
>> frequent, and only small jumps were observed. The same program as before
>> (except now reading CNTPCT) observed after 72 hours:
>>
>>  Count | Event
>> -------+---------------------------
>>     17 | jumped backward      699ms
>>     52 | jumped forward       175ms
>>   2831 | jumped forward       699ms
>>      5 | jumped forward      1398ms
>>
>> ========================================================================
>>
>> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
>> change, performing two reads of the register and comparing the high bits
>> (like other workarounds) is not a workable solution. And because the
>> timer can jump both forward and backward, no pair of reads can
>> distinguish a good value from a bad one. The only way to guarantee a
>> good value from consecutive reads would be to read _three_ times, and
>> take the middle value iff the three values are 1) individually unique
>> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
>> an anomaly is detected.
>>
>> However, since there is a distinct pattern to the bad values, we can
>> optimize the common case (2046/2048 of the time) to a single read by
>> simply ignoring values that match the pattern. This still takes no more
>> than 3 cycles in the worst case, and requires much less code.
> 
> Thanks for the thorough description of the problem. A couple of questions:
> 
> - Have the 000/7ff values of the bottom bits only been experimentally
> found? Or do you have more concrete evidence of this is what happens in
> the HW?

Only experimentally found. Here's a sample of the larger jumps:

CPU 3: 0x000000a5ebffffff ? 0x000000a5edffffff (retry = 0x000000a5ec000000)
CPU 1: 0x000000a5ebffffff ? 0x000000a5edffffff (retry = 0x000000a5ec000000)
CPU 1: 0x000000a5ecffffff > 0x000000a5ec000000 (retry = 0x000000a5ed000000)
CPU 0: 0x000000a5eeffffff > 0x000000a5ee000000 (retry = 0x000000a5ef000000)
CPU 2: 0x000000a5f0bfffff ? 0x000000a5f0ffffff (retry = 0x000000a5f0c00000)
CPU 0: 0x000000a5f0ffffff > 0x000000a5f0000000 (retry = 0x000000a5f1000000)
CPU 2: 0x000000a5f3ffffff > 0x000000a5f2000000 (retry = 0x000000a5f4000000)
CPU 0: 0x000000a5f6ffffff > 0x000000a5f6000000 (retry = 0x000000a5f7000000)
CPU 1: 0x000000a5fbbfffff ? 0x000000a5fbffffff (retry = 0x000000a5fbc00000)
CPU 0: 0x000000a5fdffffff ? 0x000000a5feffffff (retry = 0x000000a5fe000000)
CPU 2: 0x000000a5ffffffff ? 0x000000a7fe000000 (retry = 0x000000a600000000)
CPU 3: 0x000000a5ffffffff ? 0x000000a7fe000000 (retry = 0x000000a600000000)
CPU 1: 0x000000a603bfffff ? 0x000000a603ffffff (retry = 0x000000a603c00000)
CPU 1: 0x000000a607bfffff ? 0x000000a607ffffff (retry = 0x000000a607c00000)
CPU 3: 0x000000a60cbfffff ? 0x000000a60cffffff (retry = 0x000000a60cc00000)
CPU 2: 0x000000a60cbfffff ? 0x000000a60cffffff (retry = 0x000000a60cc00000)
CPU 2: 0x000000a60dffffff ? 0x000000a60effffff (retry = 0x000000a60e000000)
CPU 0: 0x000000a60e3fffff ? 0x000000a60e7fffff (retry = 0x000000a60e400000)
CPU 1: 0x000000a60effffff > 0x000000a60e000000 (retry = 0x000000a60f000000)

(Note that for the large jump, 0x000000a5ffffffff ? 0x000000a7fe000000, not
all of the low bits are zeroes during the transition.)

And of the tiny jumps backward:

CPU 3: 0x000000d2f010bfff > 0x000000d2f010b800 (retry = 0x000000d2f010b801)
CPU 3: 0x000000d2f0110fff > 0x000000d2f0110800 (retry = 0x000000d2f0110801)
CPU 3: 0x000000d2f0112fff > 0x000000d2f0112800 (retry = 0x000000d2f0112801)
CPU 3: 0x000000d2f0119fff > 0x000000d2f0119800 (retry = 0x000000d2f0119801)
CPU 3: 0x000000d2f011f7ff > 0x000000d2f011f000 (retry = 0x000000d2f011f800)
CPU 3: 0x000000d2f01257ff > 0x000000d2f0125000 (retry = 0x000000d2f0125800)
CPU 3: 0x000000d2f012b7ff > 0x000000d2f012b000 (retry = 0x000000d2f012b800)
CPU 3: 0x000000d2f01317ff > 0x000000d2f0131000 (retry = 0x000000d2f0131800)
CPU 2: 0x000000d2f00c17ff > 0x000000d2f00c1000 (retry = 0x000000d2f00c1800)
CPU 2: 0x000000d2f0136fff > 0x000000d2f0136800 (retry = 0x000000d2f0136801)
CPU 2: 0x000000d2f01377ff > 0x000000d2f0137000 (retry = 0x000000d2f0137800)
CPU 2: 0x000000d2f013afff > 0x000000d2f013a800 (retry = 0x000000d2f013a801)
CPU 2: 0x000000d2f01417ff > 0x000000d2f0141000 (retry = 0x000000d2f0141800)
CPU 2: 0x000000d2f0142fff > 0x000000d2f0142800 (retry = 0x000000d2f0142801)
CPU 2: 0x000000d2f01497ff > 0x000000d2f0149000 (retry = 0x000000d2f0149800)
CPU 2: 0x000000d2f014b7ff > 0x000000d2f014b000 (retry = 0x000000d2f014b800)
CPU 2: 0x000000d2f0150fff > 0x000000d2f0150800 (retry = 0x000000d2f0150801)
CPU 2: 0x000000d2f01577ff > 0x000000d2f0157000 (retry = 0x000000d2f0157800)
CPU 2: 0x000000d2f015d7ff > 0x000000d2f015d000 (retry = 0x000000d2f015d800)
CPU 2: 0x000000d2f015f7ff > 0x000000d2f015f000 (retry = 0x000000d2f015f800)
CPU 2: 0x000000d2f01617ff > 0x000000d2f0161000 (retry = 0x000000d2f0161800)
CPU 2: 0x000000d2f0166fff > 0x000000d2f0166800 (retry = 0x000000d2f0166801)
CPU 2: 0x000000d2f0168fff > 0x000000d2f0168800 (retry = 0x000000d2f0168801)
CPU 2: 0x000000d2f016b7ff > 0x000000d2f016b000 (retry = 0x000000d2f016b800)

I have close to 100 hours of data, and the "bottom 11 bits" value came from
the observations that:
1) Jumps by 2048 cycles were the smallest jumps seen
2) Even when the anomalous values were not all ones/all zeroes, the bottom
   11 bits _were_ all ones/all zeroes

I'd be happy to run more tests if there's something else you think I should
look for.

> - Do you have an official erratum number from the silicon vendor, so
> that Documentation/arm64/silicon-errata.txt can be kept up to date?

No, unfortunately the only information we have from the vendor is in the
form of BSP kernel code drops. There _is_ a workaround present in their
official sources [1], but it looks to be for a different issue (one that I
haven't experienced).

[1]
https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/e634a960316d/drivers/clocksource/arm_arch_timer.c#L272-L327

>>
>> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
>> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
>> [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
>>
>> Signed-off-by: Samuel Holland <samuel@sholland.org>
>> ---
>>  drivers/clocksource/Kconfig          | 11 ++++++++++
>>  drivers/clocksource/arm_arch_timer.c | 39 ++++++++++++++++++++++++++++++++++++
>>  2 files changed, 50 insertions(+)
>>
>> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
>> index 8e8a09755d10..7a5d434dd30b 100644
>> --- a/drivers/clocksource/Kconfig
>> +++ b/drivers/clocksource/Kconfig
>> @@ -364,6 +364,17 @@ config ARM64_ERRATUM_858921
>>  	  The workaround will be dynamically enabled when an affected
>>  	  core is detected.
>>  
>> +config SUN50I_A64_UNSTABLE_TIMER
>> +	bool "Workaround for Allwinner A64 timer instability"
>> +	default y
>> +	depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI
>> +	select ARM_ARCH_TIMER_OOL_WORKAROUND
>> +	help
>> +	  This option enables a workaround for instability in the timer on
>> +	  the Allwinner A64 SoC. The workaround will only be active if the
>> +	  allwinner,sun50i-a64-unstable-timer property is found in the
>> +	  timer node.
>> +
>>  config ARM_GLOBAL_TIMER
>>  	bool "Support for the ARM global timer" if COMPILE_TEST
>>  	select TIMER_OF if OF
>> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
>> index 57cb2f00fc07..66ce13578c52 100644
>> --- a/drivers/clocksource/arm_arch_timer.c
>> +++ b/drivers/clocksource/arm_arch_timer.c
>> @@ -319,6 +319,36 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
>>  }
>>  #endif
>>  
>> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
>> +/*
>> + * The low bits of each register can transiently read as all ones or all zeroes
>> + * when bit 11 or greater rolls over. Since the value can jump both backward
>> + * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), it is simplest to just
>> + * ignore register values with all ones or zeros in the low bits.
>> + */
>> +static u64 notrace sun50i_a64_read_cntpct_el0(void)
>> +{
>> +	u64 val;
>> +
>> +	do {
>> +		val = read_sysreg(cntpct_el0);
>> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);
> 
> Other workarounds of the same kind have a bounded loop. Have you done
> any investigation on how many loops you need at most to sort the timer?
> Depending on how many loops you need, it might be worth sticking a WFE
> here to just wait for something to happen instead of busy looping.

The worst case delay inside the loop is two timer cycles at 24 MHz (83.3 ns).
So the number of iterations depends on the CPU speed. For example:

0x???4ffe -> read once and return
0x???4fff -> read and loop (((val + 1) & GENMASK(10, 0)) == 0)
0x???5000 -> read and loop (((val + 1) & GENMASK(10, 0)) == 1)
0x???5001 -> read once and return

Now of course this function could be preempted, in which case it most likely
does even fewer reads. (The probability of "read 0x???4fff", "preempt", "read
0x???5fff" is pretty low.)

I do not think it is possible to know if an 000/7ff value is valid without
reading three unique values from the timer. So the latency entirely depends
on the timer frequency, not the CPU speed.

>> +
>> +	return val;
>> +}
>> +
>> +static u64 notrace sun50i_a64_read_cntvct_el0(void)
>> +{
>> +	u64 val;
>> +
>> +	do {
>> +		val = read_sysreg(cntvct_el0);
>> +	} while (((val + 1) & GENMASK(10, 0)) <= 1);
>> +
>> +	return val;
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND
>>  DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround);
>>  EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround);
>> @@ -408,6 +438,15 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = {
>>  		.read_cntvct_el0 = arm64_858921_read_cntvct_el0,
>>  	},
>>  #endif
>> +#ifdef CONFIG_SUN50I_A64_UNSTABLE_TIMER
>> +	{
>> +		.match_type = ate_match_dt,
>> +		.id = "allwinner,sun50i-a64-unstable-timer",
>> +		.desc = "Allwinner A64 timer instability",
>> +		.read_cntpct_el0 = sun50i_a64_read_cntpct_el0,
>> +		.read_cntvct_el0 = sun50i_a64_read_cntvct_el0,
>> +	},
>> +#endif
>>  };
>>  
>>  typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *,
>>
> 
> Thanks,
> 
> 	M.
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2018-05-11  2:27   ` Samuel Holland
@ 2018-05-26 15:15     ` André Przywara
  -1 siblings, 0 replies; 60+ messages in thread
From: André Przywara @ 2018-05-26 15:15 UTC (permalink / raw)
  To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas,
	Will Deacon, Daniel Lezcano, Thomas Gleixner, Marc Zyngier
  Cc: linux-sunxi, linux-kernel, linux-arm-kernel, Mark Rutland

On 05/11/2018 03:27 AM, Samuel Holland wrote:
> The Allwinner A64 SoC is known [1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
> 
> Further investigation revealed instability in the low bits of CNTVCT at
> the point a high bit rolls over. This leads to power-of-two cycle
> forward and backward jumps. (Testing shows that forward jumps are about
> twice as likely as backward jumps.)
> 
> Without trapping reads to CNTVCT, a userspace program is able to read it
> in a loop faster than it changes. A test program running on all 4 CPU
> cores that reported jumps larger than 100 ms was run for 13.6 hours and
> reported the following:
> 
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
> 
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
> 
> The largest jump (almost an hour!) was the following sequence of reads:
>       0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
> 
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 11 bits checked
> by the function in this patch have never been observed with any other
> value.
> 
> Also note that smaller jumps are much more common, with the smallest
> backward jumps of 2048 cycles observed over 400 times per second on each
> core. (Of course, this is partially due to lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
> 
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the jumps are much less
> frequent, and only small jumps were observed. The same program as before
> (except now reading CNTPCT) observed after 72 hours:
> 
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
> 
> ========================================================================
> 
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value iff the three values are 1) individually unique
> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
> an anomaly is detected.
> 
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (2046/2048 of the time) to a single read by
> simply ignoring values that match the pattern. This still takes no more
> than 3 cycles in the worst case, and requires much less code.

Clever solution, and indeed much less costly than other workarounds.

FWIW, I tested this on a Pine64 and can confirm that it works.
I put a test program here:
https://github.com/apritzel/pine64/blob/master/tools/test_timer.c

This only checks for consecutive reads going backwards, but within a
split second yells on an unpatched kernel:
https://gist.github.com/apritzel/fc78ca6edb17be2024d5adfd35edb520

Applying the patch and adding the DT property fixed that for me.

I second Marc's request for an upper bound on the loop. Question is just
what we do when we reach the loop count limit?

But regardless, given that this fixes this nasty issue:

Tested-by: Andre Przywara <andre.przywara@arm.com>

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
@ 2018-05-26 15:15     ` André Przywara
  0 siblings, 0 replies; 60+ messages in thread
From: André Przywara @ 2018-05-26 15:15 UTC (permalink / raw)
  To: linux-arm-kernel

On 05/11/2018 03:27 AM, Samuel Holland wrote:
> The Allwinner A64 SoC is known [1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years [2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
> 
> Further investigation revealed instability in the low bits of CNTVCT at
> the point a high bit rolls over. This leads to power-of-two cycle
> forward and backward jumps. (Testing shows that forward jumps are about
> twice as likely as backward jumps.)
> 
> Without trapping reads to CNTVCT, a userspace program is able to read it
> in a loop faster than it changes. A test program running on all 4 CPU
> cores that reported jumps larger than 100 ms was run for 13.6 hours and
> reported the following:
> 
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
> 
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
> 
> The largest jump (almost an hour!) was the following sequence of reads:
>       0x0000007fffffffff ? 0x00000093feffffff ? 0x0000008000000000
> 
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 11 bits checked
> by the function in this patch have never been observed with any other
> value.
> 
> Also note that smaller jumps are much more common, with the smallest
> backward jumps of 2048 cycles observed over 400 times per second on each
> core. (Of course, this is partially due to lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
> 
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the jumps are much less
> frequent, and only small jumps were observed. The same program as before
> (except now reading CNTPCT) observed after 72 hours:
> 
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
> 
> ========================================================================
> 
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value iff the three values are 1) individually unique
> and 2) increasing. This takes at minimum 3 cycles (125 ns), or more if
> an anomaly is detected.
> 
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (2046/2048 of the time) to a single read by
> simply ignoring values that match the pattern. This still takes no more
> than 3 cycles in the worst case, and requires much less code.

Clever solution, and indeed much less costly than other workarounds.

FWIW, I tested this on a Pine64 and can confirm that it works.
I put a test program here:
https://github.com/apritzel/pine64/blob/master/tools/test_timer.c

This only checks for consecutive reads going backwards, but within a
split second yells on an unpatched kernel:
https://gist.github.com/apritzel/fc78ca6edb17be2024d5adfd35edb520

Applying the patch and adding the DT property fixed that for me.

I second Marc's request for an upper bound on the loop. Question is just
what we do when we reach the loop count limit?

But regardless, given that this fixes this nasty issue:

Tested-by: Andre Przywara <andre.przywara@arm.com>

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-05-11  2:27 ` Samuel Holland
@ 2018-07-03 15:09   ` Marc Zyngier
  -1 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-03 15:09 UTC (permalink / raw)
  To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas,
	Will Deacon, Daniel Lezcano, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi

On 11/05/18 03:27, Samuel Holland wrote:
> Hello,
> 
> Several people (including me) have experienced extremely large system
> clock jumps on their A64-based devices, apparently due to the
> architectural timer going backward, which is interpreted by Linux as
> the timer wrapping around after 2^56 cycles.
> 
> Investigation led to discovery of some obvious problems with this SoC's
> architectural timer, and this patch series introduces what I believe is
> the simplest workaround. More details are in the commit message for
> patch 1. Patch 2 simply enables the workaround in the device tree.

What's the deal with this series? There was a couple of nits to address,
and I was more or less expecting a v2.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-03 15:09   ` Marc Zyngier
  0 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-03 15:09 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/05/18 03:27, Samuel Holland wrote:
> Hello,
> 
> Several people (including me) have experienced extremely large system
> clock jumps on their A64-based devices, apparently due to the
> architectural timer going backward, which is interpreted by Linux as
> the timer wrapping around after 2^56 cycles.
> 
> Investigation led to discovery of some obvious problems with this SoC's
> architectural timer, and this patch series introduces what I believe is
> the simplest workaround. More details are in the commit message for
> patch 1. Patch 2 simply enables the workaround in the device tree.

What's the deal with this series? There was a couple of nits to address,
and I was more or less expecting a v2.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-03 15:09   ` Marc Zyngier
@ 2018-07-03 18:42     ` Samuel Holland
  -1 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-07-03 18:42 UTC (permalink / raw)
  To: Marc Zyngier, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas,
	Will Deacon, Daniel Lezcano, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 07/03/18 10:09, Marc Zyngier wrote:
> On 11/05/18 03:27, Samuel Holland wrote:
>> Hello,
>> 
>> Several people (including me) have experienced extremely large system
>> clock jumps on their A64-based devices, apparently due to the architectural
>> timer going backward, which is interpreted by Linux as the timer wrapping
>> around after 2^56 cycles.
>> 
>> Investigation led to discovery of some obvious problems with this SoC's 
>> architectural timer, and this patch series introduces what I believe is
>> the simplest workaround. More details are in the commit message for patch
>> 1. Patch 2 simply enables the workaround in the device tree.
> 
> What's the deal with this series? There was a couple of nits to address, and 
> I was more or less expecting a v2.

I got reports that people were still occasionally having clock jumps after
applying this series, so I wanted to attempt a more complete fix, but I haven't
had time to do any deeper investigation. I think this series is still beneficial
even if it's not a complete solution, so I'll come back with another patch on
top of this if/once I get it fully fixed.

I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
timer) ≈ 150 should be a conservative iteration limit?

Also, does this make sense to CC to stable?

Thanks,
Samuel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-03 18:42     ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-07-03 18:42 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/03/18 10:09, Marc Zyngier wrote:
> On 11/05/18 03:27, Samuel Holland wrote:
>> Hello,
>> 
>> Several people (including me) have experienced extremely large system
>> clock jumps on their A64-based devices, apparently due to the architectural
>> timer going backward, which is interpreted by Linux as the timer wrapping
>> around after 2^56 cycles.
>> 
>> Investigation led to discovery of some obvious problems with this SoC's 
>> architectural timer, and this patch series introduces what I believe is
>> the simplest workaround. More details are in the commit message for patch
>> 1. Patch 2 simply enables the workaround in the device tree.
> 
> What's the deal with this series? There was a couple of nits to address, and 
> I was more or less expecting a v2.

I got reports that people were still occasionally having clock jumps after
applying this series, so I wanted to attempt a more complete fix, but I haven't
had time to do any deeper investigation. I think this series is still beneficial
even if it's not a complete solution, so I'll come back with another patch on
top of this if/once I get it fully fixed.

I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
timer) ? 150 should be a conservative iteration limit?

Also, does this make sense to CC to stable?

Thanks,
Samuel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-03 18:42     ` Samuel Holland
@ 2018-07-04  8:16       ` Marc Zyngier
  -1 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04  8:16 UTC (permalink / raw)
  To: Samuel Holland, Maxime Ripard, Chen-Yu Tsai, Catalin Marinas,
	Will Deacon, Daniel Lezcano, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 03/07/18 19:42, Samuel Holland wrote:
> On 07/03/18 10:09, Marc Zyngier wrote:
>> On 11/05/18 03:27, Samuel Holland wrote:
>>> Hello,
>>>
>>> Several people (including me) have experienced extremely large system
>>> clock jumps on their A64-based devices, apparently due to the architectural
>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>> around after 2^56 cycles.
>>>
>>> Investigation led to discovery of some obvious problems with this SoC's 
>>> architectural timer, and this patch series introduces what I believe is
>>> the simplest workaround. More details are in the commit message for patch
>>> 1. Patch 2 simply enables the workaround in the device tree.
>>
>> What's the deal with this series? There was a couple of nits to address, and 
>> I was more or less expecting a v2.
> 
> I got reports that people were still occasionally having clock jumps after
> applying this series, so I wanted to attempt a more complete fix, but I haven't
> had time to do any deeper investigation. I think this series is still beneficial
> even if it's not a complete solution, so I'll come back with another patch on
> top of this if/once I get it fully fixed.
> 
> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> timer) ≈ 150 should be a conservative iteration limit?

Should be OK.

Maxime: How do you want to deal with the documentation aspect? We need
an erratum number, but AFAIU the concept hasn't made it into the silicom
vendor's brain yet. Any chance you could come up with something that
uniquely identifies this?

> Also, does this make sense to CC to stable?

Probably not, as the HW never worked, so it is not a regression.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04  8:16       ` Marc Zyngier
  0 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04  8:16 UTC (permalink / raw)
  To: linux-arm-kernel

On 03/07/18 19:42, Samuel Holland wrote:
> On 07/03/18 10:09, Marc Zyngier wrote:
>> On 11/05/18 03:27, Samuel Holland wrote:
>>> Hello,
>>>
>>> Several people (including me) have experienced extremely large system
>>> clock jumps on their A64-based devices, apparently due to the architectural
>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>> around after 2^56 cycles.
>>>
>>> Investigation led to discovery of some obvious problems with this SoC's 
>>> architectural timer, and this patch series introduces what I believe is
>>> the simplest workaround. More details are in the commit message for patch
>>> 1. Patch 2 simply enables the workaround in the device tree.
>>
>> What's the deal with this series? There was a couple of nits to address, and 
>> I was more or less expecting a v2.
> 
> I got reports that people were still occasionally having clock jumps after
> applying this series, so I wanted to attempt a more complete fix, but I haven't
> had time to do any deeper investigation. I think this series is still beneficial
> even if it's not a complete solution, so I'll come back with another patch on
> top of this if/once I get it fully fixed.
> 
> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> timer) ? 150 should be a conservative iteration limit?

Should be OK.

Maxime: How do you want to deal with the documentation aspect? We need
an erratum number, but AFAIU the concept hasn't made it into the silicom
vendor's brain yet. Any chance you could come up with something that
uniquely identifies this?

> Also, does this make sense to CC to stable?

Probably not, as the HW never worked, so it is not a regression.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04  8:16       ` Marc Zyngier
@ 2018-07-04  8:19         ` Chen-Yu Tsai
  -1 siblings, 0 replies; 60+ messages in thread
From: Chen-Yu Tsai @ 2018-07-04  8:19 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Samuel Holland, Maxime Ripard, Catalin Marinas, Will Deacon,
	Daniel Lezcano, Thomas Gleixner, linux-arm-kernel, linux-kernel,
	linux-sunxi, Mark Rutland

On Wed, Jul 4, 2018 at 4:16 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
>> On 07/03/18 10:09, Marc Zyngier wrote:
>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>> Hello,
>>>>
>>>> Several people (including me) have experienced extremely large system
>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>> around after 2^56 cycles.
>>>>
>>>> Investigation led to discovery of some obvious problems with this SoC's
>>>> architectural timer, and this patch series introduces what I believe is
>>>> the simplest workaround. More details are in the commit message for patch
>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>
>>> What's the deal with this series? There was a couple of nits to address, and
>>> I was more or less expecting a v2.
>>
>> I got reports that people were still occasionally having clock jumps after
>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>> had time to do any deeper investigation. I think this series is still beneficial
>> even if it's not a complete solution, so I'll come back with another patch on
>> top of this if/once I get it fully fixed.
>>
>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>> timer) ≈ 150 should be a conservative iteration limit?
>
> Should be OK.
>
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?
>
>> Also, does this make sense to CC to stable?
>
> Probably not, as the HW never worked, so it is not a regression.

A64 support has been in for a few releases now. So while this is not
fixing a regression, people will still benefit from it being in stable.

ChenYu

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04  8:19         ` Chen-Yu Tsai
  0 siblings, 0 replies; 60+ messages in thread
From: Chen-Yu Tsai @ 2018-07-04  8:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 4, 2018 at 4:16 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
>> On 07/03/18 10:09, Marc Zyngier wrote:
>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>> Hello,
>>>>
>>>> Several people (including me) have experienced extremely large system
>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>> around after 2^56 cycles.
>>>>
>>>> Investigation led to discovery of some obvious problems with this SoC's
>>>> architectural timer, and this patch series introduces what I believe is
>>>> the simplest workaround. More details are in the commit message for patch
>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>
>>> What's the deal with this series? There was a couple of nits to address, and
>>> I was more or less expecting a v2.
>>
>> I got reports that people were still occasionally having clock jumps after
>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>> had time to do any deeper investigation. I think this series is still beneficial
>> even if it's not a complete solution, so I'll come back with another patch on
>> top of this if/once I get it fully fixed.
>>
>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>> timer) ? 150 should be a conservative iteration limit?
>
> Should be OK.
>
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?
>
>> Also, does this make sense to CC to stable?
>
> Probably not, as the HW never worked, so it is not a regression.

A64 support has been in for a few releases now. So while this is not
fixing a regression, people will still benefit from it being in stable.

ChenYu

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04  8:16       ` Marc Zyngier
@ 2018-07-04  8:23         ` Daniel Lezcano
  -1 siblings, 0 replies; 60+ messages in thread
From: Daniel Lezcano @ 2018-07-04  8:23 UTC (permalink / raw)
  To: Marc Zyngier, Samuel Holland, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 04/07/2018 10:16, Marc Zyngier wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
>> On 07/03/18 10:09, Marc Zyngier wrote:
>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>> Hello,
>>>>
>>>> Several people (including me) have experienced extremely large system
>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>> around after 2^56 cycles.
>>>>
>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>> architectural timer, and this patch series introduces what I believe is
>>>> the simplest workaround. More details are in the commit message for patch
>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>
>>> What's the deal with this series? There was a couple of nits to address, and 
>>> I was more or less expecting a v2.
>>
>> I got reports that people were still occasionally having clock jumps after
>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>> had time to do any deeper investigation. I think this series is still beneficial
>> even if it's not a complete solution, so I'll come back with another patch on
>> top of this if/once I get it fully fixed.
>>
>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>> timer) ≈ 150 should be a conservative iteration limit?
> 
> Should be OK.
> 
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?
> 
>> Also, does this make sense to CC to stable?
> 
> Probably not, as the HW never worked, so it is not a regression.

If the patches fix a bug which already exist, it makes sense to
propagated the fix back to the stable versions.


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04  8:23         ` Daniel Lezcano
  0 siblings, 0 replies; 60+ messages in thread
From: Daniel Lezcano @ 2018-07-04  8:23 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/07/2018 10:16, Marc Zyngier wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
>> On 07/03/18 10:09, Marc Zyngier wrote:
>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>> Hello,
>>>>
>>>> Several people (including me) have experienced extremely large system
>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>> around after 2^56 cycles.
>>>>
>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>> architectural timer, and this patch series introduces what I believe is
>>>> the simplest workaround. More details are in the commit message for patch
>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>
>>> What's the deal with this series? There was a couple of nits to address, and 
>>> I was more or less expecting a v2.
>>
>> I got reports that people were still occasionally having clock jumps after
>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>> had time to do any deeper investigation. I think this series is still beneficial
>> even if it's not a complete solution, so I'll come back with another patch on
>> top of this if/once I get it fully fixed.
>>
>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>> timer) ? 150 should be a conservative iteration limit?
> 
> Should be OK.
> 
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?
> 
>> Also, does this make sense to CC to stable?
> 
> Probably not, as the HW never worked, so it is not a regression.

If the patches fix a bug which already exist, it makes sense to
propagated the fix back to the stable versions.


-- 
 <http://www.linaro.org/> Linaro.org ? Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04  8:23         ` Daniel Lezcano
@ 2018-07-04  8:39           ` Marc Zyngier
  -1 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04  8:39 UTC (permalink / raw)
  To: Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 04/07/18 09:23, Daniel Lezcano wrote:
> On 04/07/2018 10:16, Marc Zyngier wrote:
>> On 03/07/18 19:42, Samuel Holland wrote:
>>> On 07/03/18 10:09, Marc Zyngier wrote:
>>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>>> Hello,
>>>>>
>>>>> Several people (including me) have experienced extremely large system
>>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>>> around after 2^56 cycles.
>>>>>
>>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>>> architectural timer, and this patch series introduces what I believe is
>>>>> the simplest workaround. More details are in the commit message for patch
>>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>>
>>>> What's the deal with this series? There was a couple of nits to address, and 
>>>> I was more or less expecting a v2.
>>>
>>> I got reports that people were still occasionally having clock jumps after
>>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>>> had time to do any deeper investigation. I think this series is still beneficial
>>> even if it's not a complete solution, so I'll come back with another patch on
>>> top of this if/once I get it fully fixed.
>>>
>>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>>> timer) ≈ 150 should be a conservative iteration limit?
>>
>> Should be OK.
>>
>> Maxime: How do you want to deal with the documentation aspect? We need
>> an erratum number, but AFAIU the concept hasn't made it into the silicom
>> vendor's brain yet. Any chance you could come up with something that
>> uniquely identifies this?
>>
>>> Also, does this make sense to CC to stable?
>>
>> Probably not, as the HW never worked, so it is not a regression.
> 
> If the patches fix a bug which already exist, it makes sense to
> propagated the fix back to the stable versions.
That's your call, but I'm not supportive of that decision, specially as
we have information from the person developing the workaround that this
doesn't fully address the issue.

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04  8:39           ` Marc Zyngier
  0 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04  8:39 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/07/18 09:23, Daniel Lezcano wrote:
> On 04/07/2018 10:16, Marc Zyngier wrote:
>> On 03/07/18 19:42, Samuel Holland wrote:
>>> On 07/03/18 10:09, Marc Zyngier wrote:
>>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>>> Hello,
>>>>>
>>>>> Several people (including me) have experienced extremely large system
>>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>>> around after 2^56 cycles.
>>>>>
>>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>>> architectural timer, and this patch series introduces what I believe is
>>>>> the simplest workaround. More details are in the commit message for patch
>>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>>
>>>> What's the deal with this series? There was a couple of nits to address, and 
>>>> I was more or less expecting a v2.
>>>
>>> I got reports that people were still occasionally having clock jumps after
>>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>>> had time to do any deeper investigation. I think this series is still beneficial
>>> even if it's not a complete solution, so I'll come back with another patch on
>>> top of this if/once I get it fully fixed.
>>>
>>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>>> timer) ? 150 should be a conservative iteration limit?
>>
>> Should be OK.
>>
>> Maxime: How do you want to deal with the documentation aspect? We need
>> an erratum number, but AFAIU the concept hasn't made it into the silicom
>> vendor's brain yet. Any chance you could come up with something that
>> uniquely identifies this?
>>
>>> Also, does this make sense to CC to stable?
>>
>> Probably not, as the HW never worked, so it is not a regression.
> 
> If the patches fix a bug which already exist, it makes sense to
> propagated the fix back to the stable versions.
That's your call, but I'm not supportive of that decision, specially as
we have information from the person developing the workaround that this
doesn't fully address the issue.

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04  8:16       ` Marc Zyngier
@ 2018-07-04  8:41         ` Daniel Lezcano
  -1 siblings, 0 replies; 60+ messages in thread
From: Daniel Lezcano @ 2018-07-04  8:41 UTC (permalink / raw)
  To: Marc Zyngier, Samuel Holland, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 04/07/2018 10:16, Marc Zyngier wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
>> On 07/03/18 10:09, Marc Zyngier wrote:
>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>> Hello,
>>>>
>>>> Several people (including me) have experienced extremely large system
>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>> around after 2^56 cycles.
>>>>
>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>> architectural timer, and this patch series introduces what I believe is
>>>> the simplest workaround. More details are in the commit message for patch
>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>
>>> What's the deal with this series? There was a couple of nits to address, and 
>>> I was more or less expecting a v2.
>>
>> I got reports that people were still occasionally having clock jumps after
>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>> had time to do any deeper investigation. I think this series is still beneficial
>> even if it's not a complete solution, so I'll come back with another patch on
>> top of this if/once I get it fully fixed.
>>
>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>> timer) ≈ 150 should be a conservative iteration limit?
> 
> Should be OK.
> 
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?

I went through the different pointers provided in the description but I
did not find a clear statement that is a hardware issue or may be I
missed it.

Are we sure there isn't another subsystem responsible on this
instability ? (eg PM or something else)

>> Also, does this make sense to CC to stable?
> 
> Probably not, as the HW never worked, so it is not a regression.
> 
> Thanks,
> 
> 	M.
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04  8:41         ` Daniel Lezcano
  0 siblings, 0 replies; 60+ messages in thread
From: Daniel Lezcano @ 2018-07-04  8:41 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/07/2018 10:16, Marc Zyngier wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
>> On 07/03/18 10:09, Marc Zyngier wrote:
>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>> Hello,
>>>>
>>>> Several people (including me) have experienced extremely large system
>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>> around after 2^56 cycles.
>>>>
>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>> architectural timer, and this patch series introduces what I believe is
>>>> the simplest workaround. More details are in the commit message for patch
>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>
>>> What's the deal with this series? There was a couple of nits to address, and 
>>> I was more or less expecting a v2.
>>
>> I got reports that people were still occasionally having clock jumps after
>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>> had time to do any deeper investigation. I think this series is still beneficial
>> even if it's not a complete solution, so I'll come back with another patch on
>> top of this if/once I get it fully fixed.
>>
>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>> timer) ? 150 should be a conservative iteration limit?
> 
> Should be OK.
> 
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?

I went through the different pointers provided in the description but I
did not find a clear statement that is a hardware issue or may be I
missed it.

Are we sure there isn't another subsystem responsible on this
instability ? (eg PM or something else)

>> Also, does this make sense to CC to stable?
> 
> Probably not, as the HW never worked, so it is not a regression.
> 
> Thanks,
> 
> 	M.
> 


-- 
 <http://www.linaro.org/> Linaro.org ? Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-03 18:42     ` Samuel Holland
@ 2018-07-04  8:41       ` Daniel Lezcano
  -1 siblings, 0 replies; 60+ messages in thread
From: Daniel Lezcano @ 2018-07-04  8:41 UTC (permalink / raw)
  To: Samuel Holland, Marc Zyngier, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 03/07/2018 20:42, Samuel Holland wrote:
> On 07/03/18 10:09, Marc Zyngier wrote:
>> On 11/05/18 03:27, Samuel Holland wrote:
>>> Hello,
>>>
>>> Several people (including me) have experienced extremely large system
>>> clock jumps on their A64-based devices, apparently due to the architectural
>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>> around after 2^56 cycles.
>>>
>>> Investigation led to discovery of some obvious problems with this SoC's 
>>> architectural timer, and this patch series introduces what I believe is
>>> the simplest workaround. More details are in the commit message for patch
>>> 1. Patch 2 simply enables the workaround in the device tree.
>>
>> What's the deal with this series? There was a couple of nits to address, and 
>> I was more or less expecting a v2.
> 
> I got reports that people were still occasionally having clock jumps after
> applying this series, so I wanted to attempt a more complete fix, but I haven't
> had time to do any deeper investigation. I think this series is still beneficial
> even if it's not a complete solution, so I'll come back with another patch on
> top of this if/once I get it fully fixed.
> 
> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> timer) ≈ 150 should be a conservative iteration limit?
> 
> Also, does this make sense to CC to stable?

I understand a partial fix is better than nothing but if you can narrow
down the issue and provide patches to fix it in one shot that would be
awesome.





-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04  8:41       ` Daniel Lezcano
  0 siblings, 0 replies; 60+ messages in thread
From: Daniel Lezcano @ 2018-07-04  8:41 UTC (permalink / raw)
  To: linux-arm-kernel

On 03/07/2018 20:42, Samuel Holland wrote:
> On 07/03/18 10:09, Marc Zyngier wrote:
>> On 11/05/18 03:27, Samuel Holland wrote:
>>> Hello,
>>>
>>> Several people (including me) have experienced extremely large system
>>> clock jumps on their A64-based devices, apparently due to the architectural
>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>> around after 2^56 cycles.
>>>
>>> Investigation led to discovery of some obvious problems with this SoC's 
>>> architectural timer, and this patch series introduces what I believe is
>>> the simplest workaround. More details are in the commit message for patch
>>> 1. Patch 2 simply enables the workaround in the device tree.
>>
>> What's the deal with this series? There was a couple of nits to address, and 
>> I was more or less expecting a v2.
> 
> I got reports that people were still occasionally having clock jumps after
> applying this series, so I wanted to attempt a more complete fix, but I haven't
> had time to do any deeper investigation. I think this series is still beneficial
> even if it's not a complete solution, so I'll come back with another patch on
> top of this if/once I get it fully fixed.
> 
> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> timer) ? 150 should be a conservative iteration limit?
> 
> Also, does this make sense to CC to stable?

I understand a partial fix is better than nothing but if you can narrow
down the issue and provide patches to fix it in one shot that would be
awesome.





-- 
 <http://www.linaro.org/> Linaro.org ? Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04  8:16       ` Marc Zyngier
@ 2018-07-04  9:06         ` Maxime Ripard
  -1 siblings, 0 replies; 60+ messages in thread
From: Maxime Ripard @ 2018-07-04  9:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Samuel Holland, Chen-Yu Tsai, Catalin Marinas, Will Deacon,
	Daniel Lezcano, Thomas Gleixner, linux-arm-kernel, linux-kernel,
	linux-sunxi, Mark Rutland

[-- Attachment #1: Type: text/plain, Size: 2025 bytes --]

On Wed, Jul 04, 2018 at 09:16:32AM +0100, Marc Zyngier wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
> > On 07/03/18 10:09, Marc Zyngier wrote:
> >> On 11/05/18 03:27, Samuel Holland wrote:
> >>> Hello,
> >>>
> >>> Several people (including me) have experienced extremely large system
> >>> clock jumps on their A64-based devices, apparently due to the architectural
> >>> timer going backward, which is interpreted by Linux as the timer wrapping
> >>> around after 2^56 cycles.
> >>>
> >>> Investigation led to discovery of some obvious problems with this SoC's 
> >>> architectural timer, and this patch series introduces what I believe is
> >>> the simplest workaround. More details are in the commit message for patch
> >>> 1. Patch 2 simply enables the workaround in the device tree.
> >>
> >> What's the deal with this series? There was a couple of nits to address, and 
> >> I was more or less expecting a v2.
> > 
> > I got reports that people were still occasionally having clock jumps after
> > applying this series, so I wanted to attempt a more complete fix, but I haven't
> > had time to do any deeper investigation. I think this series is still beneficial
> > even if it's not a complete solution, so I'll come back with another patch on
> > top of this if/once I get it fully fixed.
> > 
> > I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> > timer) ≈ 150 should be a conservative iteration limit?
> 
> Should be OK.
> 
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?

Yeah, I don't know how we can address that unfortunately. Or maybe we
can call it timer-broken-1 ? It's as good as an ID than any other ID :)

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04  9:06         ` Maxime Ripard
  0 siblings, 0 replies; 60+ messages in thread
From: Maxime Ripard @ 2018-07-04  9:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 04, 2018 at 09:16:32AM +0100, Marc Zyngier wrote:
> On 03/07/18 19:42, Samuel Holland wrote:
> > On 07/03/18 10:09, Marc Zyngier wrote:
> >> On 11/05/18 03:27, Samuel Holland wrote:
> >>> Hello,
> >>>
> >>> Several people (including me) have experienced extremely large system
> >>> clock jumps on their A64-based devices, apparently due to the architectural
> >>> timer going backward, which is interpreted by Linux as the timer wrapping
> >>> around after 2^56 cycles.
> >>>
> >>> Investigation led to discovery of some obvious problems with this SoC's 
> >>> architectural timer, and this patch series introduces what I believe is
> >>> the simplest workaround. More details are in the commit message for patch
> >>> 1. Patch 2 simply enables the workaround in the device tree.
> >>
> >> What's the deal with this series? There was a couple of nits to address, and 
> >> I was more or less expecting a v2.
> > 
> > I got reports that people were still occasionally having clock jumps after
> > applying this series, so I wanted to attempt a more complete fix, but I haven't
> > had time to do any deeper investigation. I think this series is still beneficial
> > even if it's not a complete solution, so I'll come back with another patch on
> > top of this if/once I get it fully fixed.
> > 
> > I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> > timer) ? 150 should be a conservative iteration limit?
> 
> Should be OK.
> 
> Maxime: How do you want to deal with the documentation aspect? We need
> an erratum number, but AFAIU the concept hasn't made it into the silicom
> vendor's brain yet. Any chance you could come up with something that
> uniquely identifies this?

Yeah, I don't know how we can address that unfortunately. Or maybe we
can call it timer-broken-1 ? It's as good as an ID than any other ID :)

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20180704/0afc6017/attachment.sig>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04  8:39           ` Marc Zyngier
@ 2018-07-04 10:00             ` Thomas Gleixner
  -1 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 10:00 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	linux-sunxi, Mark Rutland

On Wed, 4 Jul 2018, Marc Zyngier wrote:
> On 04/07/18 09:23, Daniel Lezcano wrote:
> > 
> > If the patches fix a bug which already exist, it makes sense to
> > propagated the fix back to the stable versions.
>
> That's your call, but I'm not supportive of that decision, specially as
> we have information from the person developing the workaround that this
> doesn't fully address the issue.

The patches should not be applied at all. Simply because they don't fix the
issue completely.

From a quick glance at various links and information about this, this very
much smells like the FSL_ERRATUM_A008585.

Has that been tried? It looks way more robust than the magic 11 bit
crystal ball logic.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 10:00             ` Thomas Gleixner
  0 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 10:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 4 Jul 2018, Marc Zyngier wrote:
> On 04/07/18 09:23, Daniel Lezcano wrote:
> > 
> > If the patches fix a bug which already exist, it makes sense to
> > propagated the fix back to the stable versions.
>
> That's your call, but I'm not supportive of that decision, specially as
> we have information from the person developing the workaround that this
> doesn't fully address the issue.

The patches should not be applied at all. Simply because they don't fix the
issue completely.

>From a quick glance at various links and information about this, this very
much smells like the FSL_ERRATUM_A008585.

Has that been tried? It looks way more robust than the magic 11 bit
crystal ball logic.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-03 18:42     ` Samuel Holland
@ 2018-07-04 12:52       ` Andre Przywara
  -1 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 12:52 UTC (permalink / raw)
  To: samuel, Marc Zyngier, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, Daniel Lezcano, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

Hi,

On 03/07/18 19:42, Samuel Holland wrote:
> On 07/03/18 10:09, Marc Zyngier wrote:
>> On 11/05/18 03:27, Samuel Holland wrote:
>>> Hello,
>>>
>>> Several people (including me) have experienced extremely large system
>>> clock jumps on their A64-based devices, apparently due to the architectural
>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>> around after 2^56 cycles.
>>>
>>> Investigation led to discovery of some obvious problems with this SoC's 
>>> architectural timer, and this patch series introduces what I believe is
>>> the simplest workaround. More details are in the commit message for patch
>>> 1. Patch 2 simply enables the workaround in the device tree.
>>
>> What's the deal with this series? There was a couple of nits to address, and 
>> I was more or less expecting a v2.
> 
> I got reports that people were still occasionally having clock jumps after
> applying this series, so I wanted to attempt a more complete fix, but I haven't
> had time to do any deeper investigation.

Looking at the FSL workaround, I see that they cover TVAL reads as well.
Not sure entirely why, but maybe it's worth to follow this lead?

Cheers,
Andre.

> I think this series is still beneficial
> even if it's not a complete solution, so I'll come back with another patch on
> top of this if/once I get it fully fixed.
> 
> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> timer) ≈ 150 should be a conservative iteration limit?
> 
> Also, does this make sense to CC to stable?
> 
> Thanks,
> Samuel
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 12:52       ` Andre Przywara
  0 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 12:52 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On 03/07/18 19:42, Samuel Holland wrote:
> On 07/03/18 10:09, Marc Zyngier wrote:
>> On 11/05/18 03:27, Samuel Holland wrote:
>>> Hello,
>>>
>>> Several people (including me) have experienced extremely large system
>>> clock jumps on their A64-based devices, apparently due to the architectural
>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>> around after 2^56 cycles.
>>>
>>> Investigation led to discovery of some obvious problems with this SoC's 
>>> architectural timer, and this patch series introduces what I believe is
>>> the simplest workaround. More details are in the commit message for patch
>>> 1. Patch 2 simply enables the workaround in the device tree.
>>
>> What's the deal with this series? There was a couple of nits to address, and 
>> I was more or less expecting a v2.
> 
> I got reports that people were still occasionally having clock jumps after
> applying this series, so I wanted to attempt a more complete fix, but I haven't
> had time to do any deeper investigation.

Looking at the FSL workaround, I see that they cover TVAL reads as well.
Not sure entirely why, but maybe it's worth to follow this lead?

Cheers,
Andre.

> I think this series is still beneficial
> even if it's not a complete solution, so I'll come back with another patch on
> top of this if/once I get it fully fixed.
> 
> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
> timer) ? 150 should be a conservative iteration limit?
> 
> Also, does this make sense to CC to stable?
> 
> Thanks,
> Samuel
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 10:00             ` Thomas Gleixner
@ 2018-07-04 13:08               ` Andre Przywara
  -1 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 13:08 UTC (permalink / raw)
  To: tglx, Marc Zyngier
  Cc: Daniel Lezcano, Samuel Holland, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	linux-sunxi, Mark Rutland

Hi,

On 04/07/18 11:00, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>
>>> If the patches fix a bug which already exist, it makes sense to
>>> propagated the fix back to the stable versions.
>>
>> That's your call, but I'm not supportive of that decision, specially as
>> we have information from the person developing the workaround that this
>> doesn't fully address the issue.
> 
> The patches should not be applied at all. Simply because they don't fix the
> issue completely.
> 
> From a quick glance at various links and information about this, this very
> much smells like the FSL_ERRATUM_A008585.
> Has that been tried? It looks way more robust than the magic 11 bit
> crystal ball logic.

The Freescale erratum is similar, but not identical [1].
It seems like the A64 is less variable, so we can use a cheaper
workaround, which gets away with normally just one sysreg read. But then
again the newer error reports may actually suggest otherwise ...

And as it currently stands, the Freescale erratum has the drawback of
relying on the CPU running much faster than the timer. The A64 can run
at 24 MHz (for power savings, or possibly during DVFS transitions),
which is the timer frequency. So subsequent counter reads will never
return the same value and the workaround times out.

Cheers,
Andre.

[1] https://lists.denx.de/pipermail/u-boot/2016-November/271836.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 13:08               ` Andre Przywara
  0 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 13:08 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On 04/07/18 11:00, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>
>>> If the patches fix a bug which already exist, it makes sense to
>>> propagated the fix back to the stable versions.
>>
>> That's your call, but I'm not supportive of that decision, specially as
>> we have information from the person developing the workaround that this
>> doesn't fully address the issue.
> 
> The patches should not be applied at all. Simply because they don't fix the
> issue completely.
> 
> From a quick glance at various links and information about this, this very
> much smells like the FSL_ERRATUM_A008585.
> Has that been tried? It looks way more robust than the magic 11 bit
> crystal ball logic.

The Freescale erratum is similar, but not identical [1].
It seems like the A64 is less variable, so we can use a cheaper
workaround, which gets away with normally just one sysreg read. But then
again the newer error reports may actually suggest otherwise ...

And as it currently stands, the Freescale erratum has the drawback of
relying on the CPU running much faster than the timer. The A64 can run
at 24 MHz (for power savings, or possibly during DVFS transitions),
which is the timer frequency. So subsequent counter reads will never
return the same value and the workaround times out.

Cheers,
Andre.

[1] https://lists.denx.de/pipermail/u-boot/2016-November/271836.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 13:08               ` Andre Przywara
@ 2018-07-04 14:31                 ` Thomas Gleixner
  -1 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 14:31 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

On Wed, 4 Jul 2018, Andre Przywara wrote:
> On 04/07/18 11:00, Thomas Gleixner wrote:
> > On Wed, 4 Jul 2018, Marc Zyngier wrote:
> >> On 04/07/18 09:23, Daniel Lezcano wrote:
> >>>
> >>> If the patches fix a bug which already exist, it makes sense to
> >>> propagated the fix back to the stable versions.
> >>
> >> That's your call, but I'm not supportive of that decision, specially as
> >> we have information from the person developing the workaround that this
> >> doesn't fully address the issue.
> > 
> > The patches should not be applied at all. Simply because they don't fix the
> > issue completely.
> > 
> > From a quick glance at various links and information about this, this very
> > much smells like the FSL_ERRATUM_A008585.
> > Has that been tried? It looks way more robust than the magic 11 bit
> > crystal ball logic.
> 
> The Freescale erratum is similar, but not identical [1].
> It seems like the A64 is less variable, so we can use a cheaper
> workaround, which gets away with normally just one sysreg read. But then
> again the newer error reports may actually suggest otherwise ...
> 
> And as it currently stands, the Freescale erratum has the drawback of
> relying on the CPU running much faster than the timer. The A64 can run
> at 24 MHz (for power savings, or possibly during DVFS transitions),
> which is the timer frequency. So subsequent counter reads will never
> return the same value and the workaround times out.

If that's the case then you need to find a different functional timer for
time keeping. Having an erratic behaving timer for time keeping is not an
option at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 14:31                 ` Thomas Gleixner
  0 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 14:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 4 Jul 2018, Andre Przywara wrote:
> On 04/07/18 11:00, Thomas Gleixner wrote:
> > On Wed, 4 Jul 2018, Marc Zyngier wrote:
> >> On 04/07/18 09:23, Daniel Lezcano wrote:
> >>>
> >>> If the patches fix a bug which already exist, it makes sense to
> >>> propagated the fix back to the stable versions.
> >>
> >> That's your call, but I'm not supportive of that decision, specially as
> >> we have information from the person developing the workaround that this
> >> doesn't fully address the issue.
> > 
> > The patches should not be applied at all. Simply because they don't fix the
> > issue completely.
> > 
> > From a quick glance at various links and information about this, this very
> > much smells like the FSL_ERRATUM_A008585.
> > Has that been tried? It looks way more robust than the magic 11 bit
> > crystal ball logic.
> 
> The Freescale erratum is similar, but not identical [1].
> It seems like the A64 is less variable, so we can use a cheaper
> workaround, which gets away with normally just one sysreg read. But then
> again the newer error reports may actually suggest otherwise ...
> 
> And as it currently stands, the Freescale erratum has the drawback of
> relying on the CPU running much faster than the timer. The A64 can run
> at 24 MHz (for power savings, or possibly during DVFS transitions),
> which is the timer frequency. So subsequent counter reads will never
> return the same value and the workaround times out.

If that's the case then you need to find a different functional timer for
time keeping. Having an erratic behaving timer for time keeping is not an
option at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 14:31                 ` Thomas Gleixner
@ 2018-07-04 14:44                   ` Andre Przywara
  -1 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 14:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

Hi,

On 04/07/18 15:31, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Andre Przywara wrote:
>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>
>>>>> If the patches fix a bug which already exist, it makes sense to
>>>>> propagated the fix back to the stable versions.
>>>>
>>>> That's your call, but I'm not supportive of that decision, specially as
>>>> we have information from the person developing the workaround that this
>>>> doesn't fully address the issue.
>>>
>>> The patches should not be applied at all. Simply because they don't fix the
>>> issue completely.
>>>
>>> From a quick glance at various links and information about this, this very
>>> much smells like the FSL_ERRATUM_A008585.
>>> Has that been tried? It looks way more robust than the magic 11 bit
>>> crystal ball logic.
>>
>> The Freescale erratum is similar, but not identical [1].
>> It seems like the A64 is less variable, so we can use a cheaper
>> workaround, which gets away with normally just one sysreg read. But then
>> again the newer error reports may actually suggest otherwise ...
>>
>> And as it currently stands, the Freescale erratum has the drawback of
>> relying on the CPU running much faster than the timer. The A64 can run
>> at 24 MHz (for power savings, or possibly during DVFS transitions),
>> which is the timer frequency. So subsequent counter reads will never
>> return the same value and the workaround times out.
> 
> If that's the case then you need to find a different functional timer for
> time keeping. Having an erratic behaving timer for time keeping is not an
> option at all.

That's not an option on arm64. There are other usable time sources in
the SoC, but the arch timer is somewhat mandatory for all practical
purposes on arm64. We rely on it in some many places that it's not
feasible to run without it. That's why we call it "architected" timer
after all ;-)
But I am quite confident that we can find a correct workaround. Maybe
it's really the TVAL (the downcounter) write which is the culprit here,
since the hardware actually writes "now() + TVAL" into the CVAL
(upcounter) register. This internal counter access may be flawed as well.

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 14:44                   ` Andre Przywara
  0 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 14:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On 04/07/18 15:31, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Andre Przywara wrote:
>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>
>>>>> If the patches fix a bug which already exist, it makes sense to
>>>>> propagated the fix back to the stable versions.
>>>>
>>>> That's your call, but I'm not supportive of that decision, specially as
>>>> we have information from the person developing the workaround that this
>>>> doesn't fully address the issue.
>>>
>>> The patches should not be applied at all. Simply because they don't fix the
>>> issue completely.
>>>
>>> From a quick glance at various links and information about this, this very
>>> much smells like the FSL_ERRATUM_A008585.
>>> Has that been tried? It looks way more robust than the magic 11 bit
>>> crystal ball logic.
>>
>> The Freescale erratum is similar, but not identical [1].
>> It seems like the A64 is less variable, so we can use a cheaper
>> workaround, which gets away with normally just one sysreg read. But then
>> again the newer error reports may actually suggest otherwise ...
>>
>> And as it currently stands, the Freescale erratum has the drawback of
>> relying on the CPU running much faster than the timer. The A64 can run
>> at 24 MHz (for power savings, or possibly during DVFS transitions),
>> which is the timer frequency. So subsequent counter reads will never
>> return the same value and the workaround times out.
> 
> If that's the case then you need to find a different functional timer for
> time keeping. Having an erratic behaving timer for time keeping is not an
> option at all.

That's not an option on arm64. There are other usable time sources in
the SoC, but the arch timer is somewhat mandatory for all practical
purposes on arm64. We rely on it in some many places that it's not
feasible to run without it. That's why we call it "architected" timer
after all ;-)
But I am quite confident that we can find a correct workaround. Maybe
it's really the TVAL (the downcounter) write which is the culprit here,
since the hardware actually writes "now() + TVAL" into the CVAL
(upcounter) register. This internal counter access may be flawed as well.

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 14:44                   ` Andre Przywara
@ 2018-07-04 15:01                     ` Marc Zyngier
  -1 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04 15:01 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Thomas Gleixner, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

On Wed, 04 Jul 2018 15:44:36 +0100,
Andre Przywara <andre.przywara@arm.com> wrote:
> 
> Hi,
> 
> On 04/07/18 15:31, Thomas Gleixner wrote:
> > On Wed, 4 Jul 2018, Andre Przywara wrote:
> >> On 04/07/18 11:00, Thomas Gleixner wrote:
> >>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
> >>>> On 04/07/18 09:23, Daniel Lezcano wrote:
> >>>>>
> >>>>> If the patches fix a bug which already exist, it makes sense to
> >>>>> propagated the fix back to the stable versions.
> >>>>
> >>>> That's your call, but I'm not supportive of that decision, specially as
> >>>> we have information from the person developing the workaround that this
> >>>> doesn't fully address the issue.
> >>>
> >>> The patches should not be applied at all. Simply because they don't fix the
> >>> issue completely.
> >>>
> >>> From a quick glance at various links and information about this, this very
> >>> much smells like the FSL_ERRATUM_A008585.
> >>> Has that been tried? It looks way more robust than the magic 11 bit
> >>> crystal ball logic.
> >>
> >> The Freescale erratum is similar, but not identical [1].
> >> It seems like the A64 is less variable, so we can use a cheaper
> >> workaround, which gets away with normally just one sysreg read. But then
> >> again the newer error reports may actually suggest otherwise ...
> >>
> >> And as it currently stands, the Freescale erratum has the drawback of
> >> relying on the CPU running much faster than the timer. The A64 can run
> >> at 24 MHz (for power savings, or possibly during DVFS transitions),
> >> which is the timer frequency. So subsequent counter reads will never
> >> return the same value and the workaround times out.
> > 
> > If that's the case then you need to find a different functional timer for
> > time keeping. Having an erratic behaving timer for time keeping is not an
> > option at all.
> 
> That's not an option on arm64. There are other usable time sources in
> the SoC, but the arch timer is somewhat mandatory for all practical
> purposes on arm64. We rely on it in some many places that it's not
> feasible to run without it. That's why we call it "architected" timer
> after all ;-)
> But I am quite confident that we can find a correct workaround. Maybe
> it's really the TVAL (the downcounter) write which is the culprit here,
> since the hardware actually writes "now() + TVAL" into the CVAL
> (upcounter) register. This internal counter access may be flawed as well.

You got it backward: CVAL is not a counter at all. It is a
Comparator. And TVAL has an implicit read from the counter, as it is
defined as "CVAL - CNT" (i.e. the number of ticks until the timer
expires).

So it might be worth trying to handle TVAL entirely in SW.

But this relies on being able to read the timer and get a number of
correct values out of it. One possibility would be to sacrifice
precision and always ignore some of the bottom bits, but this is
always going to suck terribly.

The alternative is burn that thing, and pretend it never existed.

	M.

-- 
Jazz is not dead, it just smell funny.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 15:01                     ` Marc Zyngier
  0 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04 15:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 04 Jul 2018 15:44:36 +0100,
Andre Przywara <andre.przywara@arm.com> wrote:
> 
> Hi,
> 
> On 04/07/18 15:31, Thomas Gleixner wrote:
> > On Wed, 4 Jul 2018, Andre Przywara wrote:
> >> On 04/07/18 11:00, Thomas Gleixner wrote:
> >>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
> >>>> On 04/07/18 09:23, Daniel Lezcano wrote:
> >>>>>
> >>>>> If the patches fix a bug which already exist, it makes sense to
> >>>>> propagated the fix back to the stable versions.
> >>>>
> >>>> That's your call, but I'm not supportive of that decision, specially as
> >>>> we have information from the person developing the workaround that this
> >>>> doesn't fully address the issue.
> >>>
> >>> The patches should not be applied at all. Simply because they don't fix the
> >>> issue completely.
> >>>
> >>> From a quick glance at various links and information about this, this very
> >>> much smells like the FSL_ERRATUM_A008585.
> >>> Has that been tried? It looks way more robust than the magic 11 bit
> >>> crystal ball logic.
> >>
> >> The Freescale erratum is similar, but not identical [1].
> >> It seems like the A64 is less variable, so we can use a cheaper
> >> workaround, which gets away with normally just one sysreg read. But then
> >> again the newer error reports may actually suggest otherwise ...
> >>
> >> And as it currently stands, the Freescale erratum has the drawback of
> >> relying on the CPU running much faster than the timer. The A64 can run
> >> at 24 MHz (for power savings, or possibly during DVFS transitions),
> >> which is the timer frequency. So subsequent counter reads will never
> >> return the same value and the workaround times out.
> > 
> > If that's the case then you need to find a different functional timer for
> > time keeping. Having an erratic behaving timer for time keeping is not an
> > option at all.
> 
> That's not an option on arm64. There are other usable time sources in
> the SoC, but the arch timer is somewhat mandatory for all practical
> purposes on arm64. We rely on it in some many places that it's not
> feasible to run without it. That's why we call it "architected" timer
> after all ;-)
> But I am quite confident that we can find a correct workaround. Maybe
> it's really the TVAL (the downcounter) write which is the culprit here,
> since the hardware actually writes "now() + TVAL" into the CVAL
> (upcounter) register. This internal counter access may be flawed as well.

You got it backward: CVAL is not a counter at all. It is a
Comparator. And TVAL has an implicit read from the counter, as it is
defined as "CVAL - CNT" (i.e. the number of ticks until the timer
expires).

So it might be worth trying to handle TVAL entirely in SW.

But this relies on being able to read the timer and get a number of
correct values out of it. One possibility would be to sacrifice
precision and always ignore some of the bottom bits, but this is
always going to suck terribly.

The alternative is burn that thing, and pretend it never existed.

	M.

-- 
Jazz is not dead, it just smell funny.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 14:44                   ` Andre Przywara
@ 2018-07-04 15:14                     ` Thomas Gleixner
  -1 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 15:14 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

On Wed, 4 Jul 2018, Andre Przywara wrote:
> On 04/07/18 15:31, Thomas Gleixner wrote:
> > If that's the case then you need to find a different functional timer for
> > time keeping. Having an erratic behaving timer for time keeping is not an
> > option at all.
> 
> That's not an option on arm64. There are other usable time sources in
> the SoC, but the arch timer is somewhat mandatory for all practical
> purposes on arm64. We rely on it in some many places that it's not
> feasible to run without it. That's why we call it "architected" timer
> after all ;-)

The argument that it has to be used just because someone defined it as
'architected' is bullshit and doesn't change the fact that it's broken and
not usable for timekeeping. There is no wiggle room. Either it works or
not, but works mostly is not an option.

> But I am quite confident that we can find a correct workaround. Maybe
> it's really the TVAL (the downcounter) write which is the culprit here,
> since the hardware actually writes "now() + TVAL" into the CVAL
> (upcounter) register. This internal counter access may be flawed as well.

If the write to the event device is wreckaging the counter which provides
time, then there is something seriously wrong either in the design or in
that particular piece of silicon.

Yet another proof for the theory that timers are implemented by janitors
and that silicon/IP vendors have a competition running who can create the
most subtly broken timers. Intel surely had a head start with that, but ARM
is definitely catching up.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 15:14                     ` Thomas Gleixner
  0 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 15:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 4 Jul 2018, Andre Przywara wrote:
> On 04/07/18 15:31, Thomas Gleixner wrote:
> > If that's the case then you need to find a different functional timer for
> > time keeping. Having an erratic behaving timer for time keeping is not an
> > option at all.
> 
> That's not an option on arm64. There are other usable time sources in
> the SoC, but the arch timer is somewhat mandatory for all practical
> purposes on arm64. We rely on it in some many places that it's not
> feasible to run without it. That's why we call it "architected" timer
> after all ;-)

The argument that it has to be used just because someone defined it as
'architected' is bullshit and doesn't change the fact that it's broken and
not usable for timekeeping. There is no wiggle room. Either it works or
not, but works mostly is not an option.

> But I am quite confident that we can find a correct workaround. Maybe
> it's really the TVAL (the downcounter) write which is the culprit here,
> since the hardware actually writes "now() + TVAL" into the CVAL
> (upcounter) register. This internal counter access may be flawed as well.

If the write to the event device is wreckaging the counter which provides
time, then there is something seriously wrong either in the design or in
that particular piece of silicon.

Yet another proof for the theory that timers are implemented by janitors
and that silicon/IP vendors have a competition running who can create the
most subtly broken timers. Intel surely had a head start with that, but ARM
is definitely catching up.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 15:01                     ` Marc Zyngier
@ 2018-07-04 15:15                       ` Andre Przywara
  -1 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 15:15 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Thomas Gleixner, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

Hi,

On 04/07/18 16:01, Marc Zyngier wrote:
> On Wed, 04 Jul 2018 15:44:36 +0100,
> Andre Przywara <andre.przywara@arm.com> wrote:
>>
>> Hi,
>>
>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>> On Wed, 4 Jul 2018, Andre Przywara wrote:
>>>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>>>
>>>>>>> If the patches fix a bug which already exist, it makes sense to
>>>>>>> propagated the fix back to the stable versions.
>>>>>>
>>>>>> That's your call, but I'm not supportive of that decision, specially as
>>>>>> we have information from the person developing the workaround that this
>>>>>> doesn't fully address the issue.
>>>>>
>>>>> The patches should not be applied at all. Simply because they don't fix the
>>>>> issue completely.
>>>>>
>>>>> From a quick glance at various links and information about this, this very
>>>>> much smells like the FSL_ERRATUM_A008585.
>>>>> Has that been tried? It looks way more robust than the magic 11 bit
>>>>> crystal ball logic.
>>>>
>>>> The Freescale erratum is similar, but not identical [1].
>>>> It seems like the A64 is less variable, so we can use a cheaper
>>>> workaround, which gets away with normally just one sysreg read. But then
>>>> again the newer error reports may actually suggest otherwise ...
>>>>
>>>> And as it currently stands, the Freescale erratum has the drawback of
>>>> relying on the CPU running much faster than the timer. The A64 can run
>>>> at 24 MHz (for power savings, or possibly during DVFS transitions),
>>>> which is the timer frequency. So subsequent counter reads will never
>>>> return the same value and the workaround times out.
>>>
>>> If that's the case then you need to find a different functional timer for
>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>> option at all.
>>
>> That's not an option on arm64. There are other usable time sources in
>> the SoC, but the arch timer is somewhat mandatory for all practical
>> purposes on arm64. We rely on it in some many places that it's not
>> feasible to run without it. That's why we call it "architected" timer
>> after all ;-)
>> But I am quite confident that we can find a correct workaround. Maybe
>> it's really the TVAL (the downcounter) write which is the culprit here,
>> since the hardware actually writes "now() + TVAL" into the CVAL
>> (upcounter) register. This internal counter access may be flawed as well.
> 
> You got it backward: CVAL is not a counter at all. It is a
> Comparator. And TVAL has an implicit read from the counter, as it is
> defined as "CVAL - CNT" (i.e. the number of ticks until the timer
> expires).

Yes, that's what I meant actually, sorry for the lousy wording.

What I am actually more concerned about than reading (do we actually
read TVAL?), is writing TVAL. The original BSP errata hack hints at this
being a problem:
https://github.com/longsleep/linux-pine64/blob/5b10a45ae8b0/drivers/clocksource/arm_arch_timer.c#L231-L244

> So it might be worth trying to handle TVAL entirely in SW.
> 
> But this relies on being able to read the timer and get a number of
> correct values out of it. One possibility would be to sacrifice
> precision and always ignore some of the bottom bits, but this is
> always going to suck terribly.
> 
> The alternative is burn that thing, and pretend it never existed.

Yes, that crossed my mind multiple times.

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 15:15                       ` Andre Przywara
  0 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 15:15 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On 04/07/18 16:01, Marc Zyngier wrote:
> On Wed, 04 Jul 2018 15:44:36 +0100,
> Andre Przywara <andre.przywara@arm.com> wrote:
>>
>> Hi,
>>
>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>> On Wed, 4 Jul 2018, Andre Przywara wrote:
>>>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>>>
>>>>>>> If the patches fix a bug which already exist, it makes sense to
>>>>>>> propagated the fix back to the stable versions.
>>>>>>
>>>>>> That's your call, but I'm not supportive of that decision, specially as
>>>>>> we have information from the person developing the workaround that this
>>>>>> doesn't fully address the issue.
>>>>>
>>>>> The patches should not be applied at all. Simply because they don't fix the
>>>>> issue completely.
>>>>>
>>>>> From a quick glance at various links and information about this, this very
>>>>> much smells like the FSL_ERRATUM_A008585.
>>>>> Has that been tried? It looks way more robust than the magic 11 bit
>>>>> crystal ball logic.
>>>>
>>>> The Freescale erratum is similar, but not identical [1].
>>>> It seems like the A64 is less variable, so we can use a cheaper
>>>> workaround, which gets away with normally just one sysreg read. But then
>>>> again the newer error reports may actually suggest otherwise ...
>>>>
>>>> And as it currently stands, the Freescale erratum has the drawback of
>>>> relying on the CPU running much faster than the timer. The A64 can run
>>>> at 24 MHz (for power savings, or possibly during DVFS transitions),
>>>> which is the timer frequency. So subsequent counter reads will never
>>>> return the same value and the workaround times out.
>>>
>>> If that's the case then you need to find a different functional timer for
>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>> option at all.
>>
>> That's not an option on arm64. There are other usable time sources in
>> the SoC, but the arch timer is somewhat mandatory for all practical
>> purposes on arm64. We rely on it in some many places that it's not
>> feasible to run without it. That's why we call it "architected" timer
>> after all ;-)
>> But I am quite confident that we can find a correct workaround. Maybe
>> it's really the TVAL (the downcounter) write which is the culprit here,
>> since the hardware actually writes "now() + TVAL" into the CVAL
>> (upcounter) register. This internal counter access may be flawed as well.
> 
> You got it backward: CVAL is not a counter at all. It is a
> Comparator. And TVAL has an implicit read from the counter, as it is
> defined as "CVAL - CNT" (i.e. the number of ticks until the timer
> expires).

Yes, that's what I meant actually, sorry for the lousy wording.

What I am actually more concerned about than reading (do we actually
read TVAL?), is writing TVAL. The original BSP errata hack hints at this
being a problem:
https://github.com/longsleep/linux-pine64/blob/5b10a45ae8b0/drivers/clocksource/arm_arch_timer.c#L231-L244

> So it might be worth trying to handle TVAL entirely in SW.
> 
> But this relies on being able to read the timer and get a number of
> correct values out of it. One possibility would be to sacrifice
> precision and always ignore some of the bottom bits, but this is
> always going to suck terribly.
> 
> The alternative is burn that thing, and pretend it never existed.

Yes, that crossed my mind multiple times.

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 15:01                     ` Marc Zyngier
@ 2018-07-04 15:23                       ` Samuel Holland
  -1 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-07-04 15:23 UTC (permalink / raw)
  To: Marc Zyngier, Andre Przywara
  Cc: Thomas Gleixner, Daniel Lezcano, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	linux-sunxi, Mark Rutland

On 07/04/18 10:01, Marc Zyngier wrote:
> On Wed, 04 Jul 2018 15:44:36, Andre Przywara <andre.przywara@arm.com> wrote:
>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>> On Wed, 4 Jul 2018, Andre Przywara wrote:
>>>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>>> 
>>>>>>> If the patches fix a bug which already exist, it makes sense to 
>>>>>>> propagated the fix back to the stable versions.
>>>>>> 
>>>>>> That's your call, but I'm not supportive of that decision, 
>>>>>> specially as we have information from the person developing the 
>>>>>> workaround that this doesn't fully address the issue.
>>>>> 
>>>>> The patches should not be applied at all. Simply because they don't 
>>>>> fix the issue completely.
>>>>> 
>>>>> From a quick glance at various links and information about this,
>>>>> this very much smells like the FSL_ERRATUM_A008585. Has that been
>>>>> tried? It looks way more robust than the magic 11 bit crystal ball
>>>>> logic.
>>>> 
>>>> The Freescale erratum is similar, but not identical [1]. It seems like 
>>>> the A64 is less variable, so we can use a cheaper workaround, which 
>>>> gets away with normally just one sysreg read. But then again the newer 
>>>> error reports may actually suggest otherwise ...
>>>> 
>>>> And as it currently stands, the Freescale erratum has the drawback of 
>>>> relying on the CPU running much faster than the timer. The A64 can run
>>>>  at 24 MHz (for power savings, or possibly during DVFS transitions), 
>>>> which is the timer frequency. So subsequent counter reads will never 
>>>> return the same value and the workaround times out.
>>> 
>>> If that's the case then you need to find a different functional timer for
>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>> option at all.
>> 
>> That's not an option on arm64. There are other usable time sources in the 
>> SoC, but the arch timer is somewhat mandatory for all practical purposes
>> on arm64. We rely on it in some many places that it's not feasible to run 
>> without it. That's why we call it "architected" timer after all ;-) But I 
>> am quite confident that we can find a correct workaround. Maybe it's
>> really the TVAL (the downcounter) write which is the culprit here, since
>> the hardware actually writes "now() + TVAL" into the CVAL (upcounter)
>> register. This internal counter access may be flawed as well.
> 
> You got it backward: CVAL is not a counter at all. It is a Comparator. And 
> TVAL has an implicit read from the counter, as it is defined as "CVAL - CNT" 
> (i.e. the number of ticks until the timer expires).
> 
> So it might be worth trying to handle TVAL entirely in SW.
> 
> But this relies on being able to read the timer and get a number of correct 
> values out of it. One possibility would be to sacrifice precision and always 
> ignore some of the bottom bits, but this is always going to suck terribly.
> 
> The alternative is burn that thing, and pretend it never existed.

From the testing I have done, this patch series fully stabilizes reading CNTPCT
and CNTVCT. So with the workaround, the timer *can* accurately count time. So
merging this would be an improvement on the current situation.

The system clock jumps might be explained by interaction with CVAL/TVAL, and
that's the part I haven't investigated yet. As I mentioned before, and Andre
just mentioned again, the BSP provided by the vendor has another workaround for
writing the TVAL register. Hopefully, that's the missing piece which will fix
the clock jumps.

Thanks,
Samuel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 15:23                       ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-07-04 15:23 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/04/18 10:01, Marc Zyngier wrote:
> On Wed, 04 Jul 2018 15:44:36, Andre Przywara <andre.przywara@arm.com> wrote:
>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>> On Wed, 4 Jul 2018, Andre Przywara wrote:
>>>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>>> 
>>>>>>> If the patches fix a bug which already exist, it makes sense to 
>>>>>>> propagated the fix back to the stable versions.
>>>>>> 
>>>>>> That's your call, but I'm not supportive of that decision, 
>>>>>> specially as we have information from the person developing the 
>>>>>> workaround that this doesn't fully address the issue.
>>>>> 
>>>>> The patches should not be applied at all. Simply because they don't 
>>>>> fix the issue completely.
>>>>> 
>>>>> From a quick glance at various links and information about this,
>>>>> this very much smells like the FSL_ERRATUM_A008585. Has that been
>>>>> tried? It looks way more robust than the magic 11 bit crystal ball
>>>>> logic.
>>>> 
>>>> The Freescale erratum is similar, but not identical [1]. It seems like 
>>>> the A64 is less variable, so we can use a cheaper workaround, which 
>>>> gets away with normally just one sysreg read. But then again the newer 
>>>> error reports may actually suggest otherwise ...
>>>> 
>>>> And as it currently stands, the Freescale erratum has the drawback of 
>>>> relying on the CPU running much faster than the timer. The A64 can run
>>>>  at 24 MHz (for power savings, or possibly during DVFS transitions), 
>>>> which is the timer frequency. So subsequent counter reads will never 
>>>> return the same value and the workaround times out.
>>> 
>>> If that's the case then you need to find a different functional timer for
>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>> option at all.
>> 
>> That's not an option on arm64. There are other usable time sources in the 
>> SoC, but the arch timer is somewhat mandatory for all practical purposes
>> on arm64. We rely on it in some many places that it's not feasible to run 
>> without it. That's why we call it "architected" timer after all ;-) But I 
>> am quite confident that we can find a correct workaround. Maybe it's
>> really the TVAL (the downcounter) write which is the culprit here, since
>> the hardware actually writes "now() + TVAL" into the CVAL (upcounter)
>> register. This internal counter access may be flawed as well.
> 
> You got it backward: CVAL is not a counter at all. It is a Comparator. And 
> TVAL has an implicit read from the counter, as it is defined as "CVAL - CNT" 
> (i.e. the number of ticks until the timer expires).
> 
> So it might be worth trying to handle TVAL entirely in SW.
> 
> But this relies on being able to read the timer and get a number of correct 
> values out of it. One possibility would be to sacrifice precision and always 
> ignore some of the bottom bits, but this is always going to suck terribly.
> 
> The alternative is burn that thing, and pretend it never existed.

>From the testing I have done, this patch series fully stabilizes reading CNTPCT
and CNTVCT. So with the workaround, the timer *can* accurately count time. So
merging this would be an improvement on the current situation.

The system clock jumps might be explained by interaction with CVAL/TVAL, and
that's the part I haven't investigated yet. As I mentioned before, and Andre
just mentioned again, the BSP provided by the vendor has another workaround for
writing the TVAL register. Hopefully, that's the missing piece which will fix
the clock jumps.

Thanks,
Samuel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 15:15                       ` Andre Przywara
@ 2018-07-04 15:30                         ` Marc Zyngier
  -1 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04 15:30 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Thomas Gleixner, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

On 04/07/18 16:15, Andre Przywara wrote:
> Hi,
> 
> On 04/07/18 16:01, Marc Zyngier wrote:
>> On Wed, 04 Jul 2018 15:44:36 +0100,
>> Andre Przywara <andre.przywara@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>>> On Wed, 4 Jul 2018, Andre Przywara wrote:
>>>>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>>>>
>>>>>>>> If the patches fix a bug which already exist, it makes sense to
>>>>>>>> propagated the fix back to the stable versions.
>>>>>>>
>>>>>>> That's your call, but I'm not supportive of that decision, specially as
>>>>>>> we have information from the person developing the workaround that this
>>>>>>> doesn't fully address the issue.
>>>>>>
>>>>>> The patches should not be applied at all. Simply because they don't fix the
>>>>>> issue completely.
>>>>>>
>>>>>> From a quick glance at various links and information about this, this very
>>>>>> much smells like the FSL_ERRATUM_A008585.
>>>>>> Has that been tried? It looks way more robust than the magic 11 bit
>>>>>> crystal ball logic.
>>>>>
>>>>> The Freescale erratum is similar, but not identical [1].
>>>>> It seems like the A64 is less variable, so we can use a cheaper
>>>>> workaround, which gets away with normally just one sysreg read. But then
>>>>> again the newer error reports may actually suggest otherwise ...
>>>>>
>>>>> And as it currently stands, the Freescale erratum has the drawback of
>>>>> relying on the CPU running much faster than the timer. The A64 can run
>>>>> at 24 MHz (for power savings, or possibly during DVFS transitions),
>>>>> which is the timer frequency. So subsequent counter reads will never
>>>>> return the same value and the workaround times out.
>>>>
>>>> If that's the case then you need to find a different functional timer for
>>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>>> option at all.
>>>
>>> That's not an option on arm64. There are other usable time sources in
>>> the SoC, but the arch timer is somewhat mandatory for all practical
>>> purposes on arm64. We rely on it in some many places that it's not
>>> feasible to run without it. That's why we call it "architected" timer
>>> after all ;-)
>>> But I am quite confident that we can find a correct workaround. Maybe
>>> it's really the TVAL (the downcounter) write which is the culprit here,
>>> since the hardware actually writes "now() + TVAL" into the CVAL
>>> (upcounter) register. This internal counter access may be flawed as well.
>>
>> You got it backward: CVAL is not a counter at all. It is a
>> Comparator. And TVAL has an implicit read from the counter, as it is
>> defined as "CVAL - CNT" (i.e. the number of ticks until the timer
>> expires).
> 
> Yes, that's what I meant actually, sorry for the lousy wording.
> 
> What I am actually more concerned about than reading (do we actually
> read TVAL?), is writing TVAL. The original BSP errata hack hints at this
> being a problem:
> https://github.com/longsleep/linux-pine64/blob/5b10a45ae8b0/drivers/clocksource/arm_arch_timer.c#L231-L244

Right, and they only address the comparator, ignoring the counter.
Braindead. I specially enjoy the "we should try to fix this" comment.

>> So it might be worth trying to handle TVAL entirely in SW.

Given the above, I think the above makes sense:

- write TVAL: read CNT until stable, add the delta, write CVAL instead
- read TVAL: read CNT until stable, substract CVAL, return the delta

The low frequency problem remains. If it can't be solved, drop the arch
timer from the DT (it is dead), and use a separate timer/counter. Simply
not fit for purpose.

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 15:30                         ` Marc Zyngier
  0 siblings, 0 replies; 60+ messages in thread
From: Marc Zyngier @ 2018-07-04 15:30 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/07/18 16:15, Andre Przywara wrote:
> Hi,
> 
> On 04/07/18 16:01, Marc Zyngier wrote:
>> On Wed, 04 Jul 2018 15:44:36 +0100,
>> Andre Przywara <andre.przywara@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>>> On Wed, 4 Jul 2018, Andre Przywara wrote:
>>>>> On 04/07/18 11:00, Thomas Gleixner wrote:
>>>>>> On Wed, 4 Jul 2018, Marc Zyngier wrote:
>>>>>>> On 04/07/18 09:23, Daniel Lezcano wrote:
>>>>>>>>
>>>>>>>> If the patches fix a bug which already exist, it makes sense to
>>>>>>>> propagated the fix back to the stable versions.
>>>>>>>
>>>>>>> That's your call, but I'm not supportive of that decision, specially as
>>>>>>> we have information from the person developing the workaround that this
>>>>>>> doesn't fully address the issue.
>>>>>>
>>>>>> The patches should not be applied at all. Simply because they don't fix the
>>>>>> issue completely.
>>>>>>
>>>>>> From a quick glance at various links and information about this, this very
>>>>>> much smells like the FSL_ERRATUM_A008585.
>>>>>> Has that been tried? It looks way more robust than the magic 11 bit
>>>>>> crystal ball logic.
>>>>>
>>>>> The Freescale erratum is similar, but not identical [1].
>>>>> It seems like the A64 is less variable, so we can use a cheaper
>>>>> workaround, which gets away with normally just one sysreg read. But then
>>>>> again the newer error reports may actually suggest otherwise ...
>>>>>
>>>>> And as it currently stands, the Freescale erratum has the drawback of
>>>>> relying on the CPU running much faster than the timer. The A64 can run
>>>>> at 24 MHz (for power savings, or possibly during DVFS transitions),
>>>>> which is the timer frequency. So subsequent counter reads will never
>>>>> return the same value and the workaround times out.
>>>>
>>>> If that's the case then you need to find a different functional timer for
>>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>>> option at all.
>>>
>>> That's not an option on arm64. There are other usable time sources in
>>> the SoC, but the arch timer is somewhat mandatory for all practical
>>> purposes on arm64. We rely on it in some many places that it's not
>>> feasible to run without it. That's why we call it "architected" timer
>>> after all ;-)
>>> But I am quite confident that we can find a correct workaround. Maybe
>>> it's really the TVAL (the downcounter) write which is the culprit here,
>>> since the hardware actually writes "now() + TVAL" into the CVAL
>>> (upcounter) register. This internal counter access may be flawed as well.
>>
>> You got it backward: CVAL is not a counter at all. It is a
>> Comparator. And TVAL has an implicit read from the counter, as it is
>> defined as "CVAL - CNT" (i.e. the number of ticks until the timer
>> expires).
> 
> Yes, that's what I meant actually, sorry for the lousy wording.
> 
> What I am actually more concerned about than reading (do we actually
> read TVAL?), is writing TVAL. The original BSP errata hack hints at this
> being a problem:
> https://github.com/longsleep/linux-pine64/blob/5b10a45ae8b0/drivers/clocksource/arm_arch_timer.c#L231-L244

Right, and they only address the comparator, ignoring the counter.
Braindead. I specially enjoy the "we should try to fix this" comment.

>> So it might be worth trying to handle TVAL entirely in SW.

Given the above, I think the above makes sense:

- write TVAL: read CNT until stable, add the delta, write CVAL instead
- read TVAL: read CNT until stable, substract CVAL, return the delta

The low frequency problem remains. If it can't be solved, drop the arch
timer from the DT (it is dead), and use a separate timer/counter. Simply
not fit for purpose.

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 15:14                     ` Thomas Gleixner
@ 2018-07-04 15:43                       ` Andre Przywara
  -1 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 15:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

Hi,

On 04/07/18 16:14, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Andre Przywara wrote:
>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>> If that's the case then you need to find a different functional timer for
>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>> option at all.
>>
>> That's not an option on arm64. There are other usable time sources in
>> the SoC, but the arch timer is somewhat mandatory for all practical
>> purposes on arm64. We rely on it in some many places that it's not
>> feasible to run without it. That's why we call it "architected" timer
>> after all ;-)
> 
> The argument that it has to be used just because someone defined it as
> 'architected' is bullshit and doesn't change the fact that it's broken and
> not usable for timekeeping. There is no wiggle room. Either it works or
> not, but works mostly is not an option.

The "architected" part of the arch timer is fine, it's just that
eventually someone has to implement that at some point. And as you
mention below, this is where Murphy's law is kicking in ;-) Especially
for such seemingly simple tasks as connecting a counter in the "uncore"
part of the chip (Allwinner SoC) to the counter register interface in
the core (ARM Cortex-A53) [1]. Apparently the propagation is not really
atomic for all bits here ...

>> But I am quite confident that we can find a correct workaround. Maybe
>> it's really the TVAL (the downcounter) write which is the culprit here,
>> since the hardware actually writes "now() + TVAL" into the CVAL
>> (upcounter) register. This internal counter access may be flawed as well.
> 
> If the write to the event device is wreckaging the counter which provides
> time, then there is something seriously wrong either in the design or in
> that particular piece of silicon.

Apologies, that was my lousy wording: There is one 64-bit comparison
register (CVAL), which signals when the counter (an independent
register) is greater or equal. TVAL is just a different *view* of that
same relation. So this part is fine, it's really that the "strictly
monotonic counter" nature of CNTPCT is not really observed by the chip.

> Yet another proof for the theory that timers are implemented by janitors
> and that silicon/IP vendors have a competition running who can create the
> most subtly broken timers. Intel surely had a head start with that, but ARM
> is definitely catching up.

ARM is trying really hard to be actually better ;-)

Cheers,
Andre.

[1]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500g/BABIGHII.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 15:43                       ` Andre Przywara
  0 siblings, 0 replies; 60+ messages in thread
From: Andre Przywara @ 2018-07-04 15:43 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On 04/07/18 16:14, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Andre Przywara wrote:
>> On 04/07/18 15:31, Thomas Gleixner wrote:
>>> If that's the case then you need to find a different functional timer for
>>> time keeping. Having an erratic behaving timer for time keeping is not an
>>> option at all.
>>
>> That's not an option on arm64. There are other usable time sources in
>> the SoC, but the arch timer is somewhat mandatory for all practical
>> purposes on arm64. We rely on it in some many places that it's not
>> feasible to run without it. That's why we call it "architected" timer
>> after all ;-)
> 
> The argument that it has to be used just because someone defined it as
> 'architected' is bullshit and doesn't change the fact that it's broken and
> not usable for timekeeping. There is no wiggle room. Either it works or
> not, but works mostly is not an option.

The "architected" part of the arch timer is fine, it's just that
eventually someone has to implement that at some point. And as you
mention below, this is where Murphy's law is kicking in ;-) Especially
for such seemingly simple tasks as connecting a counter in the "uncore"
part of the chip (Allwinner SoC) to the counter register interface in
the core (ARM Cortex-A53) [1]. Apparently the propagation is not really
atomic for all bits here ...

>> But I am quite confident that we can find a correct workaround. Maybe
>> it's really the TVAL (the downcounter) write which is the culprit here,
>> since the hardware actually writes "now() + TVAL" into the CVAL
>> (upcounter) register. This internal counter access may be flawed as well.
> 
> If the write to the event device is wreckaging the counter which provides
> time, then there is something seriously wrong either in the design or in
> that particular piece of silicon.

Apologies, that was my lousy wording: There is one 64-bit comparison
register (CVAL), which signals when the counter (an independent
register) is greater or equal. TVAL is just a different *view* of that
same relation. So this part is fine, it's really that the "strictly
monotonic counter" nature of CNTPCT is not really observed by the chip.

> Yet another proof for the theory that timers are implemented by janitors
> and that silicon/IP vendors have a competition running who can create the
> most subtly broken timers. Intel surely had a head start with that, but ARM
> is definitely catching up.

ARM is trying really hard to be actually better ;-)

Cheers,
Andre.

[1]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500g/BABIGHII.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04 15:43                       ` Andre Przywara
@ 2018-07-04 19:49                         ` Thomas Gleixner
  -1 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 19:49 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Marc Zyngier, Daniel Lezcano, Samuel Holland, Maxime Ripard,
	Chen-Yu Tsai, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-sunxi, Mark Rutland

On Wed, 4 Jul 2018, Andre Przywara wrote:
> On 04/07/18 16:14, Thomas Gleixner wrote:
> > On Wed, 4 Jul 2018, Andre Przywara wrote:
> >> On 04/07/18 15:31, Thomas Gleixner wrote:
> >>> If that's the case then you need to find a different functional timer for
> >>> time keeping. Having an erratic behaving timer for time keeping is not an
> >>> option at all.
> >>
> >> That's not an option on arm64. There are other usable time sources in
> >> the SoC, but the arch timer is somewhat mandatory for all practical
> >> purposes on arm64. We rely on it in some many places that it's not
> >> feasible to run without it. That's why we call it "architected" timer
> >> after all ;-)
> > 
> > The argument that it has to be used just because someone defined it as
> > 'architected' is bullshit and doesn't change the fact that it's broken and
> > not usable for timekeeping. There is no wiggle room. Either it works or
> > not, but works mostly is not an option.
> 
> The "architected" part of the arch timer is fine, it's just that
> eventually someone has to implement that at some point. And as you
> mention below, this is where Murphy's law is kicking in ;-) Especially
> for such seemingly simple tasks as connecting a counter in the "uncore"
> part of the chip (Allwinner SoC) to the counter register interface in
> the core (ARM Cortex-A53) [1]. Apparently the propagation is not really
> atomic for all bits here ...

I've immediately spotted the fail in that document:

  The Cortex-A53 processor does not include the system counter. This
  resides in the SoC.

> >> But I am quite confident that we can find a correct workaround. Maybe
> >> it's really the TVAL (the downcounter) write which is the culprit here,
> >> since the hardware actually writes "now() + TVAL" into the CVAL
> >> (upcounter) register. This internal counter access may be flawed as well.
> > 
> > If the write to the event device is wreckaging the counter which provides
> > time, then there is something seriously wrong either in the design or in
> > that particular piece of silicon.
> 
> Apologies, that was my lousy wording: There is one 64-bit comparison
> register (CVAL), which signals when the counter (an independent
> register) is greater or equal. TVAL is just a different *view* of that
> same relation. So this part is fine, it's really that the "strictly
> monotonic counter" nature of CNTPCT is not really observed by the chip.
> 
> > Yet another proof for the theory that timers are implemented by janitors
> > and that silicon/IP vendors have a competition running who can create the
> > most subtly broken timers. Intel surely had a head start with that, but ARM
> > is definitely catching up.
> 
> ARM is trying really hard to be actually better ;-)

Better in terms of subtle brokenness? I surely can do consulting for
that. I've seen most of it in all colours, but I surely can come up with
new even subtler ways to wreckage them. You know how to reach me.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [linux-sunxi] Re: [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-04 19:49                         ` Thomas Gleixner
  0 siblings, 0 replies; 60+ messages in thread
From: Thomas Gleixner @ 2018-07-04 19:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 4 Jul 2018, Andre Przywara wrote:
> On 04/07/18 16:14, Thomas Gleixner wrote:
> > On Wed, 4 Jul 2018, Andre Przywara wrote:
> >> On 04/07/18 15:31, Thomas Gleixner wrote:
> >>> If that's the case then you need to find a different functional timer for
> >>> time keeping. Having an erratic behaving timer for time keeping is not an
> >>> option at all.
> >>
> >> That's not an option on arm64. There are other usable time sources in
> >> the SoC, but the arch timer is somewhat mandatory for all practical
> >> purposes on arm64. We rely on it in some many places that it's not
> >> feasible to run without it. That's why we call it "architected" timer
> >> after all ;-)
> > 
> > The argument that it has to be used just because someone defined it as
> > 'architected' is bullshit and doesn't change the fact that it's broken and
> > not usable for timekeeping. There is no wiggle room. Either it works or
> > not, but works mostly is not an option.
> 
> The "architected" part of the arch timer is fine, it's just that
> eventually someone has to implement that at some point. And as you
> mention below, this is where Murphy's law is kicking in ;-) Especially
> for such seemingly simple tasks as connecting a counter in the "uncore"
> part of the chip (Allwinner SoC) to the counter register interface in
> the core (ARM Cortex-A53) [1]. Apparently the propagation is not really
> atomic for all bits here ...

I've immediately spotted the fail in that document:

  The Cortex-A53 processor does not include the system counter. This
  resides in the SoC.

> >> But I am quite confident that we can find a correct workaround. Maybe
> >> it's really the TVAL (the downcounter) write which is the culprit here,
> >> since the hardware actually writes "now() + TVAL" into the CVAL
> >> (upcounter) register. This internal counter access may be flawed as well.
> > 
> > If the write to the event device is wreckaging the counter which provides
> > time, then there is something seriously wrong either in the design or in
> > that particular piece of silicon.
> 
> Apologies, that was my lousy wording: There is one 64-bit comparison
> register (CVAL), which signals when the counter (an independent
> register) is greater or equal. TVAL is just a different *view* of that
> same relation. So this part is fine, it's really that the "strictly
> monotonic counter" nature of CNTPCT is not really observed by the chip.
> 
> > Yet another proof for the theory that timers are implemented by janitors
> > and that silicon/IP vendors have a competition running who can create the
> > most subtly broken timers. Intel surely had a head start with that, but ARM
> > is definitely catching up.
> 
> ARM is trying really hard to be actually better ;-)

Better in terms of subtle brokenness? I surely can do consulting for
that. I've seen most of it in all colours, but I surely can come up with
new even subtler ways to wreckage them. You know how to reach me.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/2] Allwinner A64 timer workaround
  2018-07-04  8:41         ` Daniel Lezcano
@ 2018-07-12  2:23           ` Samuel Holland
  -1 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-07-12  2:23 UTC (permalink / raw)
  To: Daniel Lezcano, Marc Zyngier, Maxime Ripard, Chen-Yu Tsai,
	Catalin Marinas, Will Deacon, Thomas Gleixner
  Cc: linux-arm-kernel, linux-kernel, linux-sunxi, Mark Rutland

On 07/04/18 03:41, Daniel Lezcano wrote:
> On 04/07/2018 10:16, Marc Zyngier wrote:
>> On 03/07/18 19:42, Samuel Holland wrote:
>>> On 07/03/18 10:09, Marc Zyngier wrote:
>>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>>> Hello,
>>>>>
>>>>> Several people (including me) have experienced extremely large system
>>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>>> around after 2^56 cycles.
>>>>>
>>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>>> architectural timer, and this patch series introduces what I believe is
>>>>> the simplest workaround. More details are in the commit message for patch
>>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>>
>>>> What's the deal with this series? There was a couple of nits to address, and 
>>>> I was more or less expecting a v2.
>>>
>>> I got reports that people were still occasionally having clock jumps after
>>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>>> had time to do any deeper investigation. I think this series is still beneficial
>>> even if it's not a complete solution, so I'll come back with another patch on
>>> top of this if/once I get it fully fixed.
>>>
>>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>>> timer) ≈ 150 should be a conservative iteration limit?
>>
>> Should be OK.
>>
>> Maxime: How do you want to deal with the documentation aspect? We need
>> an erratum number, but AFAIU the concept hasn't made it into the silicom
>> vendor's brain yet. Any chance you could come up with something that
>> uniquely identifies this?
> 
> I went through the different pointers provided in the description but I
> did not find a clear statement that is a hardware issue or may be I
> missed it.
> 
> Are we sure there isn't another subsystem responsible on this
> instability ? (eg PM or something else)

This issue has been observed on kernels with and without DVFS, across several
Linux, U-Boot, and Trusted Firmware versions. It has not been observed on any
other Allwinner SoC, including the A64's twin, the H5.

In fact, this workaround was recently successfully used in U-Boot [1] to fix
issues with an MMC driver that needed reliable numbers from CNTVCT.

So while the vendor hasn't confirmed it (and I wouldn't count on that
happening), everything I've seen points to it being a silicon bug, not a
software issue.

[1]:
https://git.denx.de/?p=u-boot.git;a=commit;h=be0d217952222b2bd3ed071de9bb0c66d8cc80d9

>>> Also, does this make sense to CC to stable?
>>
>> Probably not, as the HW never worked, so it is not a regression.
>>
>> Thanks,
>>
>> 	M.
>>
> 
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/2] Allwinner A64 timer workaround
@ 2018-07-12  2:23           ` Samuel Holland
  0 siblings, 0 replies; 60+ messages in thread
From: Samuel Holland @ 2018-07-12  2:23 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/04/18 03:41, Daniel Lezcano wrote:
> On 04/07/2018 10:16, Marc Zyngier wrote:
>> On 03/07/18 19:42, Samuel Holland wrote:
>>> On 07/03/18 10:09, Marc Zyngier wrote:
>>>> On 11/05/18 03:27, Samuel Holland wrote:
>>>>> Hello,
>>>>>
>>>>> Several people (including me) have experienced extremely large system
>>>>> clock jumps on their A64-based devices, apparently due to the architectural
>>>>> timer going backward, which is interpreted by Linux as the timer wrapping
>>>>> around after 2^56 cycles.
>>>>>
>>>>> Investigation led to discovery of some obvious problems with this SoC's 
>>>>> architectural timer, and this patch series introduces what I believe is
>>>>> the simplest workaround. More details are in the commit message for patch
>>>>> 1. Patch 2 simply enables the workaround in the device tree.
>>>>
>>>> What's the deal with this series? There was a couple of nits to address, and 
>>>> I was more or less expecting a v2.
>>>
>>> I got reports that people were still occasionally having clock jumps after
>>> applying this series, so I wanted to attempt a more complete fix, but I haven't
>>> had time to do any deeper investigation. I think this series is still beneficial
>>> even if it's not a complete solution, so I'll come back with another patch on
>>> top of this if/once I get it fully fixed.
>>>
>>> I'll prepare a v2 with a bounded loop. Presumably, 3 * (max CPU Hz) / (24MHz
>>> timer) ? 150 should be a conservative iteration limit?
>>
>> Should be OK.
>>
>> Maxime: How do you want to deal with the documentation aspect? We need
>> an erratum number, but AFAIU the concept hasn't made it into the silicom
>> vendor's brain yet. Any chance you could come up with something that
>> uniquely identifies this?
> 
> I went through the different pointers provided in the description but I
> did not find a clear statement that is a hardware issue or may be I
> missed it.
> 
> Are we sure there isn't another subsystem responsible on this
> instability ? (eg PM or something else)

This issue has been observed on kernels with and without DVFS, across several
Linux, U-Boot, and Trusted Firmware versions. It has not been observed on any
other Allwinner SoC, including the A64's twin, the H5.

In fact, this workaround was recently successfully used in U-Boot [1] to fix
issues with an MMC driver that needed reliable numbers from CNTVCT.

So while the vendor hasn't confirmed it (and I wouldn't count on that
happening), everything I've seen points to it being a silicon bug, not a
software issue.

[1]:
https://git.denx.de/?p=u-boot.git;a=commit;h=be0d217952222b2bd3ed071de9bb0c66d8cc80d9

>>> Also, does this make sense to CC to stable?
>>
>> Probably not, as the HW never worked, so it is not a regression.
>>
>> Thanks,
>>
>> 	M.
>>
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2018-07-12  2:31 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-11  2:27 [PATCH 0/2] Allwinner A64 timer workaround Samuel Holland
2018-05-11  2:27 ` Samuel Holland
2018-05-11  2:27 ` [PATCH 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland
2018-05-11  2:27   ` Samuel Holland
2018-05-11  8:26   ` Maxime Ripard
2018-05-11  8:26     ` Maxime Ripard
2018-05-11  8:48   ` Marc Zyngier
2018-05-11  8:48     ` Marc Zyngier
2018-05-11 15:08     ` Samuel Holland
2018-05-11 15:08       ` Samuel Holland
2018-05-26 15:15   ` André Przywara
2018-05-26 15:15     ` André Przywara
2018-05-11  2:27 ` [PATCH 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland
2018-05-11  2:27   ` Samuel Holland
2018-05-11  9:24 ` [PATCH 0/2] Allwinner " Andre Przywara
2018-05-11  9:24   ` Andre Przywara
2018-07-03 15:09 ` Marc Zyngier
2018-07-03 15:09   ` Marc Zyngier
2018-07-03 18:42   ` Samuel Holland
2018-07-03 18:42     ` Samuel Holland
2018-07-04  8:16     ` Marc Zyngier
2018-07-04  8:16       ` Marc Zyngier
2018-07-04  8:19       ` Chen-Yu Tsai
2018-07-04  8:19         ` Chen-Yu Tsai
2018-07-04  8:23       ` Daniel Lezcano
2018-07-04  8:23         ` Daniel Lezcano
2018-07-04  8:39         ` Marc Zyngier
2018-07-04  8:39           ` Marc Zyngier
2018-07-04 10:00           ` Thomas Gleixner
2018-07-04 10:00             ` Thomas Gleixner
2018-07-04 13:08             ` [linux-sunxi] " Andre Przywara
2018-07-04 13:08               ` Andre Przywara
2018-07-04 14:31               ` Thomas Gleixner
2018-07-04 14:31                 ` Thomas Gleixner
2018-07-04 14:44                 ` Andre Przywara
2018-07-04 14:44                   ` Andre Przywara
2018-07-04 15:01                   ` Marc Zyngier
2018-07-04 15:01                     ` Marc Zyngier
2018-07-04 15:15                     ` Andre Przywara
2018-07-04 15:15                       ` Andre Przywara
2018-07-04 15:30                       ` Marc Zyngier
2018-07-04 15:30                         ` Marc Zyngier
2018-07-04 15:23                     ` Samuel Holland
2018-07-04 15:23                       ` Samuel Holland
2018-07-04 15:14                   ` Thomas Gleixner
2018-07-04 15:14                     ` Thomas Gleixner
2018-07-04 15:43                     ` Andre Przywara
2018-07-04 15:43                       ` Andre Przywara
2018-07-04 19:49                       ` Thomas Gleixner
2018-07-04 19:49                         ` Thomas Gleixner
2018-07-04  8:41       ` Daniel Lezcano
2018-07-04  8:41         ` Daniel Lezcano
2018-07-12  2:23         ` Samuel Holland
2018-07-12  2:23           ` Samuel Holland
2018-07-04  9:06       ` Maxime Ripard
2018-07-04  9:06         ` Maxime Ripard
2018-07-04  8:41     ` Daniel Lezcano
2018-07-04  8:41       ` Daniel Lezcano
2018-07-04 12:52     ` [linux-sunxi] " Andre Przywara
2018-07-04 12:52       ` Andre Przywara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.