linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/2] Allwinner A64 timer workaround
@ 2019-01-13  2:17 Samuel Holland
  2019-01-13  2:17 ` [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Samuel Holland @ 2019-01-13  2:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Maxime Ripard, Chen-Yu Tsai,
	Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	Marc Zyngier
  Cc: devicetree, linux-arm-kernel, linux-kernel, linux-sunxi, Samuel Holland

This is the third version of a patch series to fix system clock jumps
and other timer instability on the Allwinner A64 SoC. It has now been
tested for a week, and I've received no reports of date jumps with this
version. So this is, as far as I can tell, a complete workaround.

See the commit messages for a detailed description of the issue, but the
summary is that, when a high counter bit rolls over, indeterminance in
the low bits causes CNTPCT/CNTVCT and their respective TVAL registers to
jump forward or backward. Backward jumps (or the next read after forward
jumps) are sometimes seen by the kernel and interpreted as the timer
wrapping around after 2^56 cycles. This causes the system clock to jump
forward approximately 91 years.

changes since v2;
- Reduced workaround threshold from 11 to 10 bits based on reports from
  other hardare and the U-Boot version of this workaround
- Added TVAL handling based on Marc's suggestion
- Added erratum documentation and renamed symbols to match
- Added Maxime's Acked-by

changes since v1:
- Add an iteration limit like most other arch timer workarounds
- Added Andre's Tested-by

Samuel Holland (2):
  arm64: arch_timer: Workaround for Allwinner A64 timer instability
  arm64: dts: allwinner: a64: Enable A64 timer workaround

 Documentation/arm64/silicon-errata.txt        |  2 +
 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi |  1 +
 drivers/clocksource/Kconfig                   | 10 ++++
 drivers/clocksource/arm_arch_timer.c          | 55 +++++++++++++++++++
 4 files changed, 68 insertions(+)

-- 
2.19.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2019-01-13  2:17 [PATCH v3 0/2] Allwinner A64 timer workaround Samuel Holland
@ 2019-01-13  2:17 ` Samuel Holland
  2019-01-14  9:25   ` Marc Zyngier
  2019-01-13  2:17 ` [PATCH v3 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland
  2019-01-14 12:56 ` [PATCH v3 0/2] Allwinner " Daniel Lezcano
  2 siblings, 1 reply; 10+ messages in thread
From: Samuel Holland @ 2019-01-13  2:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Maxime Ripard, Chen-Yu Tsai,
	Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	Marc Zyngier
  Cc: devicetree, linux-arm-kernel, linux-kernel, linux-sunxi, Samuel Holland

The Allwinner A64 SoC is known[1] to have an unstable architectural
timer, which manifests itself most obviously in the time jumping forward
a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a
timer frequency of 24 MHz, implying that the time went slightly backward
(and this was interpreted by the kernel as it jumping forward and
wrapping around past the epoch).

Investigation revealed instability in the low bits of CNTVCT at the
point a high bit rolls over. This leads to power-of-two cycle forward
and backward jumps. (Testing shows that forward jumps are about twice as
likely as backward jumps.) Since the counter value returns to normal
after an indeterminate read, each "jump" really consists of both a
forward and backward jump from the software perspective.

Unless the kernel is trapping CNTVCT reads, a userspace program is able
to read the register in a loop faster than it changes. A test program
running on all 4 CPU cores that reported jumps larger than 100 ms was
run for 13.6 hours and reported the following:

 Count | Event
-------+---------------------------
  9940 | jumped backward      699ms
   268 | jumped backward     1398ms
     1 | jumped backward     2097ms
 16020 | jumped forward       175ms
  6443 | jumped forward       699ms
  2976 | jumped forward      1398ms
     9 | jumped forward    356516ms
     9 | jumped forward    357215ms
     4 | jumped forward    714430ms
     1 | jumped forward   3578440ms

This works out to a jump larger than 100 ms about every 5.5 seconds on
each CPU core.

The largest jump (almost an hour!) was the following sequence of reads:
    0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000

Note that the middle bits don't necessarily all read as all zeroes or
all ones during the anomalous behavior; however the low 10 bits checked
by the function in this patch have never been observed with any other
value.

Also note that smaller jumps are much more common, with backward jumps
of 2048 (2^11) cycles observed over 400 times per second on each core.
(Of course, this is partially explained by lower bits rolling over more
frequently.) Any one of these could have caused the 95 year time skip.

Similar anomalies were observed while reading CNTPCT (after patching the
kernel to allow reads from userspace). However, the CNTPCT jumps are
much less frequent, and only small jumps were observed. The same program
as before (except now reading CNTPCT) observed after 72 hours:

 Count | Event
-------+---------------------------
    17 | jumped backward      699ms
    52 | jumped forward       175ms
  2831 | jumped forward       699ms
     5 | jumped forward      1398ms

Further investigation showed that the instability in CNTPCT/CNTVCT also
affected the respective timer's TVAL register. The following values were
observed immediately after writing CNVT_TVAL to 0x10000000:

 CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
--------------------+------------+--------------------+-----------------
 0x000000d4a2d8bfff | 0x10003fff | 0x000000d4b2d8bfff | +0x00004000
 0x000000d4a2d94000 | 0x0fffffff | 0x000000d4b2d97fff | -0x00004000
 0x000000d4a2d97fff | 0x10003fff | 0x000000d4b2d97fff | +0x00004000
 0x000000d4a2d9c000 | 0x0fffffff | 0x000000d4b2d9ffff | -0x00004000

The pattern of errors in CNTV_TVAL seemed to depend on exactly which
value was written to it. For example, after writing 0x10101010:

 CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
--------------------+------------+--------------------+-----------------
 0x000001ac3effffff | 0x1110100f | 0x000001ac4f10100f | +0x1000000
 0x000001ac40000000 | 0x1010100f | 0x000001ac5110100f | -0x1000000
 0x000001ac58ffffff | 0x1110100f | 0x000001ac6910100f | +0x1000000
 0x000001ac66000000 | 0x1010100f | 0x000001ac7710100f | -0x1000000
 0x000001ac6affffff | 0x1110100f | 0x000001ac7b10100f | +0x1000000
 0x000001ac6e000000 | 0x1010100f | 0x000001ac7f10100f | -0x1000000

I was also twice able to reproduce the issue covered by Allwinner's
workaround[4], that writing to TVAL sometimes fails, and both CVAL and
TVAL are left with entirely bogus values. One was the following values:

 CNTVCT             | CNTV_TVAL  | CNTV_CVAL
--------------------+------------+--------------------------------------
 0x000000d4a2d6014c | 0x8fbd5721 | 0x000000d132935fff (615s in the past)

========================================================================

Because the CPU can read the CNTPCT/CNTVCT registers faster than they
change, performing two reads of the register and comparing the high bits
(like other workarounds) is not a workable solution. And because the
timer can jump both forward and backward, no pair of reads can
distinguish a good value from a bad one. The only way to guarantee a
good value from consecutive reads would be to read _three_ times, and
take the middle value only if the three values are 1) each unique and
2) increasing. This takes at minimum 3 counter cycles (125 ns), or more
if an anomaly is detected.

However, since there is a distinct pattern to the bad values, we can
optimize the common case (1022/1024 of the time) to a single read by
simply ignoring values that match the error pattern. This still takes no
more than 3 cycles in the worst case, and requires much less code. As an
additional safety check, we still limit the loop iteration to the number
of max-frequency (1.2 GHz) CPU cycles in three 24 MHz counter periods.

For the TVAL registers, the simple solution is to not use them. Instead,
read or write the CVAL and calculate the TVAL value in software.

Although the manufacturer is aware of at least part of the erratum[4],
there is no official name for it. For now, use the kernel-internal name
"UNKNOWN1".

[1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
[2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
[3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
[4]: https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/master/drivers/clocksource/arm_arch_timer.c#L272

Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
Tested-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Samuel Holland <samuel@sholland.org>
---
 Documentation/arm64/silicon-errata.txt |  2 +
 drivers/clocksource/Kconfig            | 10 +++++
 drivers/clocksource/arm_arch_timer.c   | 55 ++++++++++++++++++++++++++
 3 files changed, 67 insertions(+)

diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt
index 8f9577621144..4a269732d2a0 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -44,6 +44,8 @@ stable kernels.
 
 | Implementor    | Component       | Erratum ID      | Kconfig                     |
 +----------------+-----------------+-----------------+-----------------------------+
+| Allwinner      | A64/R18         | UNKNOWN1        | SUN50I_ERRATUM_UNKNOWN1     |
+|                |                 |                 |                             |
 | ARM            | Cortex-A53      | #826319         | ARM64_ERRATUM_826319        |
 | ARM            | Cortex-A53      | #827319         | ARM64_ERRATUM_827319        |
 | ARM            | Cortex-A53      | #824069         | ARM64_ERRATUM_824069        |
diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 55c77e44bb2d..d20ff4da07c3 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -364,6 +364,16 @@ config ARM64_ERRATUM_858921
 	  The workaround will be dynamically enabled when an affected
 	  core is detected.
 
+config SUN50I_ERRATUM_UNKNOWN1
+	bool "Workaround for Allwinner A64 erratum UNKNOWN1"
+	default y
+	depends on ARM_ARCH_TIMER && ARM64 && ARCH_SUNXI
+	select ARM_ARCH_TIMER_OOL_WORKAROUND
+	help
+	  This option enables a workaround for instability in the timer on
+	  the Allwinner A64 SoC. The workaround will only be active if the
+	  allwinner,erratum-unknown1 property is found in the timer node.
+
 config ARM_GLOBAL_TIMER
 	bool "Support for the ARM global timer" if COMPILE_TEST
 	select TIMER_OF if OF
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 9a7d4dc00b6e..a8b20b65bd4b 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -326,6 +326,48 @@ static u64 notrace arm64_1188873_read_cntvct_el0(void)
 }
 #endif
 
+#ifdef CONFIG_SUN50I_ERRATUM_UNKNOWN1
+/*
+ * The low bits of the counter registers are indeterminate while bit 10 or
+ * greater is rolling over. Since the counter value can jump both backward
+ * (7ff -> 000 -> 800) and forward (7ff -> fff -> 800), ignore register values
+ * with all ones or all zeros in the low bits. Bound the loop by the maximum
+ * number of CPU cycles in 3 consecutive 24 MHz counter periods.
+ */
+#define __sun50i_a64_read_reg(reg) ({					\
+	u64 _val;							\
+	int _retries = 150;						\
+									\
+	do {								\
+		_val = read_sysreg(reg);				\
+		_retries--;						\
+	} while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries);	\
+									\
+	WARN_ON_ONCE(!_retries);					\
+	_val;								\
+})
+
+static u64 notrace sun50i_a64_read_cntpct_el0(void)
+{
+	return __sun50i_a64_read_reg(cntpct_el0);
+}
+
+static u64 notrace sun50i_a64_read_cntvct_el0(void)
+{
+	return __sun50i_a64_read_reg(cntvct_el0);
+}
+
+static u32 notrace sun50i_a64_read_cntp_tval_el0(void)
+{
+	return read_sysreg(cntp_cval_el0) - sun50i_a64_read_cntpct_el0();
+}
+
+static u32 notrace sun50i_a64_read_cntv_tval_el0(void)
+{
+	return read_sysreg(cntv_cval_el0) - sun50i_a64_read_cntvct_el0();
+}
+#endif
+
 #ifdef CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND
 DEFINE_PER_CPU(const struct arch_timer_erratum_workaround *, timer_unstable_counter_workaround);
 EXPORT_SYMBOL_GPL(timer_unstable_counter_workaround);
@@ -423,6 +465,19 @@ static const struct arch_timer_erratum_workaround ool_workarounds[] = {
 		.read_cntvct_el0 = arm64_1188873_read_cntvct_el0,
 	},
 #endif
+#ifdef CONFIG_SUN50I_ERRATUM_UNKNOWN1
+	{
+		.match_type = ate_match_dt,
+		.id = "allwinner,erratum-unknown1",
+		.desc = "Allwinner erratum UNKNOWN1",
+		.read_cntp_tval_el0 = sun50i_a64_read_cntp_tval_el0,
+		.read_cntv_tval_el0 = sun50i_a64_read_cntv_tval_el0,
+		.read_cntpct_el0 = sun50i_a64_read_cntpct_el0,
+		.read_cntvct_el0 = sun50i_a64_read_cntvct_el0,
+		.set_next_event_phys = erratum_set_next_event_tval_phys,
+		.set_next_event_virt = erratum_set_next_event_tval_virt,
+	},
+#endif
 };
 
 typedef bool (*ate_match_fn_t)(const struct arch_timer_erratum_workaround *,
-- 
2.19.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v3 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround
  2019-01-13  2:17 [PATCH v3 0/2] Allwinner A64 timer workaround Samuel Holland
  2019-01-13  2:17 ` [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland
@ 2019-01-13  2:17 ` Samuel Holland
  2019-01-24 10:06   ` [linux-sunxi] " Chen-Yu Tsai
  2019-01-14 12:56 ` [PATCH v3 0/2] Allwinner " Daniel Lezcano
  2 siblings, 1 reply; 10+ messages in thread
From: Samuel Holland @ 2019-01-13  2:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Maxime Ripard, Chen-Yu Tsai,
	Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	Marc Zyngier
  Cc: devicetree, linux-arm-kernel, linux-kernel, linux-sunxi, Samuel Holland

As instability in the architectural timer has been observed on multiple
devices using this SoC, inluding the Pine64 and the Orange Pi Win,
enable the workaround in the SoC's device tree.

Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
Signed-off-by: Samuel Holland <samuel@sholland.org>
---
 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
index f3a66f888205..13eac92a8c55 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
@@ -175,6 +175,7 @@
 
 	timer {
 		compatible = "arm,armv8-timer";
+		allwinner,erratum-unknown1;
 		interrupts = <GIC_PPI 13
 			(GIC_CPU_MASK_SIMPLE(4) | IRQ_TYPE_LEVEL_HIGH)>,
 			     <GIC_PPI 14
-- 
2.19.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2019-01-13  2:17 ` [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland
@ 2019-01-14  9:25   ` Marc Zyngier
  2019-12-04  4:18     ` [linux-sunxi] " Vasily Khoruzhick
  0 siblings, 1 reply; 10+ messages in thread
From: Marc Zyngier @ 2019-01-14  9:25 UTC (permalink / raw)
  To: Samuel Holland, Catalin Marinas, Will Deacon, Maxime Ripard,
	Chen-Yu Tsai, Rob Herring, Mark Rutland, Daniel Lezcano,
	Thomas Gleixner
  Cc: devicetree, linux-arm-kernel, linux-kernel, linux-sunxi

Hi Samuel,

On 13/01/2019 02:17, Samuel Holland wrote:
> The Allwinner A64 SoC is known[1] to have an unstable architectural
> timer, which manifests itself most obviously in the time jumping forward
> a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a
> timer frequency of 24 MHz, implying that the time went slightly backward
> (and this was interpreted by the kernel as it jumping forward and
> wrapping around past the epoch).
> 
> Investigation revealed instability in the low bits of CNTVCT at the
> point a high bit rolls over. This leads to power-of-two cycle forward
> and backward jumps. (Testing shows that forward jumps are about twice as
> likely as backward jumps.) Since the counter value returns to normal
> after an indeterminate read, each "jump" really consists of both a
> forward and backward jump from the software perspective.
> 
> Unless the kernel is trapping CNTVCT reads, a userspace program is able
> to read the register in a loop faster than it changes. A test program
> running on all 4 CPU cores that reported jumps larger than 100 ms was
> run for 13.6 hours and reported the following:
> 
>  Count | Event
> -------+---------------------------
>   9940 | jumped backward      699ms
>    268 | jumped backward     1398ms
>      1 | jumped backward     2097ms
>  16020 | jumped forward       175ms
>   6443 | jumped forward       699ms
>   2976 | jumped forward      1398ms
>      9 | jumped forward    356516ms
>      9 | jumped forward    357215ms
>      4 | jumped forward    714430ms
>      1 | jumped forward   3578440ms
> 
> This works out to a jump larger than 100 ms about every 5.5 seconds on
> each CPU core.
> 
> The largest jump (almost an hour!) was the following sequence of reads:
>     0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
> 
> Note that the middle bits don't necessarily all read as all zeroes or
> all ones during the anomalous behavior; however the low 10 bits checked
> by the function in this patch have never been observed with any other
> value.
> 
> Also note that smaller jumps are much more common, with backward jumps
> of 2048 (2^11) cycles observed over 400 times per second on each core.
> (Of course, this is partially explained by lower bits rolling over more
> frequently.) Any one of these could have caused the 95 year time skip.
> 
> Similar anomalies were observed while reading CNTPCT (after patching the
> kernel to allow reads from userspace). However, the CNTPCT jumps are
> much less frequent, and only small jumps were observed. The same program
> as before (except now reading CNTPCT) observed after 72 hours:
> 
>  Count | Event
> -------+---------------------------
>     17 | jumped backward      699ms
>     52 | jumped forward       175ms
>   2831 | jumped forward       699ms
>      5 | jumped forward      1398ms
> 
> Further investigation showed that the instability in CNTPCT/CNTVCT also
> affected the respective timer's TVAL register. The following values were
> observed immediately after writing CNVT_TVAL to 0x10000000:
> 
>  CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
> --------------------+------------+--------------------+-----------------
>  0x000000d4a2d8bfff | 0x10003fff | 0x000000d4b2d8bfff | +0x00004000
>  0x000000d4a2d94000 | 0x0fffffff | 0x000000d4b2d97fff | -0x00004000
>  0x000000d4a2d97fff | 0x10003fff | 0x000000d4b2d97fff | +0x00004000
>  0x000000d4a2d9c000 | 0x0fffffff | 0x000000d4b2d9ffff | -0x00004000
> 
> The pattern of errors in CNTV_TVAL seemed to depend on exactly which
> value was written to it. For example, after writing 0x10101010:
> 
>  CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
> --------------------+------------+--------------------+-----------------
>  0x000001ac3effffff | 0x1110100f | 0x000001ac4f10100f | +0x1000000
>  0x000001ac40000000 | 0x1010100f | 0x000001ac5110100f | -0x1000000
>  0x000001ac58ffffff | 0x1110100f | 0x000001ac6910100f | +0x1000000
>  0x000001ac66000000 | 0x1010100f | 0x000001ac7710100f | -0x1000000
>  0x000001ac6affffff | 0x1110100f | 0x000001ac7b10100f | +0x1000000
>  0x000001ac6e000000 | 0x1010100f | 0x000001ac7f10100f | -0x1000000
> 
> I was also twice able to reproduce the issue covered by Allwinner's
> workaround[4], that writing to TVAL sometimes fails, and both CVAL and
> TVAL are left with entirely bogus values. One was the following values:
> 
>  CNTVCT             | CNTV_TVAL  | CNTV_CVAL
> --------------------+------------+--------------------------------------
>  0x000000d4a2d6014c | 0x8fbd5721 | 0x000000d132935fff (615s in the past)
> 
> ========================================================================
> 
> Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> change, performing two reads of the register and comparing the high bits
> (like other workarounds) is not a workable solution. And because the
> timer can jump both forward and backward, no pair of reads can
> distinguish a good value from a bad one. The only way to guarantee a
> good value from consecutive reads would be to read _three_ times, and
> take the middle value only if the three values are 1) each unique and
> 2) increasing. This takes at minimum 3 counter cycles (125 ns), or more
> if an anomaly is detected.
> 
> However, since there is a distinct pattern to the bad values, we can
> optimize the common case (1022/1024 of the time) to a single read by
> simply ignoring values that match the error pattern. This still takes no
> more than 3 cycles in the worst case, and requires much less code. As an
> additional safety check, we still limit the loop iteration to the number
> of max-frequency (1.2 GHz) CPU cycles in three 24 MHz counter periods.
> 
> For the TVAL registers, the simple solution is to not use them. Instead,
> read or write the CVAL and calculate the TVAL value in software.
> 
> Although the manufacturer is aware of at least part of the erratum[4],
> there is no official name for it. For now, use the kernel-internal name
> "UNKNOWN1".
> 
> [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
> [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
> [4]: https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/master/drivers/clocksource/arm_arch_timer.c#L272

nit: In general, I'm not overly keen on URLs in commit messages, as they
may vanish without notice and the commit message becomes less useful. In
the future, please keep those in the cover letter (though in this
particular case, the commit message explains the issue pretty well, so
no harm done once GitHub dies a horrible death... ;-).

The fix itself looks pretty solid, and will hopefully make the
"AllLoosers" HW more usable.

Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>

Daniel, please consider this for v5.1.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 0/2] Allwinner A64 timer workaround
  2019-01-13  2:17 [PATCH v3 0/2] Allwinner A64 timer workaround Samuel Holland
  2019-01-13  2:17 ` [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland
  2019-01-13  2:17 ` [PATCH v3 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland
@ 2019-01-14 12:56 ` Daniel Lezcano
  2019-01-15  2:52   ` Chen-Yu Tsai
  2 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2019-01-14 12:56 UTC (permalink / raw)
  To: Samuel Holland, Catalin Marinas, Will Deacon, Maxime Ripard,
	Chen-Yu Tsai, Rob Herring, Mark Rutland, Thomas Gleixner,
	Marc Zyngier
  Cc: devicetree, linux-arm-kernel, linux-kernel, linux-sunxi

On 13/01/2019 03:17, Samuel Holland wrote:
> This is the third version of a patch series to fix system clock jumps
> and other timer instability on the Allwinner A64 SoC. It has now been
> tested for a week, and I've received no reports of date jumps with this
> version. So this is, as far as I can tell, a complete workaround.
> 
> See the commit messages for a detailed description of the issue, but the
> summary is that, when a high counter bit rolls over, indeterminance in
> the low bits causes CNTPCT/CNTVCT and their respective TVAL registers to
> jump forward or backward. Backward jumps (or the next read after forward
> jumps) are sometimes seen by the kernel and interpreted as the timer
> wrapping around after 2^56 cycles. This causes the system clock to jump
> forward approximately 91 years.
> 
> changes since v2;
> - Reduced workaround threshold from 11 to 10 bits based on reports from
>   other hardare and the U-Boot version of this workaround
> - Added TVAL handling based on Marc's suggestion
> - Added erratum documentation and renamed symbols to match
> - Added Maxime's Acked-by
> 
> changes since v1:
> - Add an iteration limit like most other arch timer workarounds
> - Added Andre's Tested-by
> 
> Samuel Holland (2):
>   arm64: arch_timer: Workaround for Allwinner A64 timer instability
>   arm64: dts: allwinner: a64: Enable A64 timer workaround
> 
>  Documentation/arm64/silicon-errata.txt        |  2 +
>  arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi |  1 +
>  drivers/clocksource/Kconfig                   | 10 ++++
>  drivers/clocksource/arm_arch_timer.c          | 55 +++++++++++++++++++
>  4 files changed, 68 insertions(+)
> 

Applied. Took the opportunity to add the stable@ tag.

Thanks

  -- Daniel


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3 0/2] Allwinner A64 timer workaround
  2019-01-14 12:56 ` [PATCH v3 0/2] Allwinner " Daniel Lezcano
@ 2019-01-15  2:52   ` Chen-Yu Tsai
  0 siblings, 0 replies; 10+ messages in thread
From: Chen-Yu Tsai @ 2019-01-15  2:52 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Samuel Holland, Catalin Marinas, Will Deacon, Maxime Ripard,
	Rob Herring, Mark Rutland, Thomas Gleixner, Marc Zyngier,
	devicetree, linux-arm-kernel, linux-kernel, linux-sunxi

On Mon, Jan 14, 2019 at 8:57 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 13/01/2019 03:17, Samuel Holland wrote:
> > This is the third version of a patch series to fix system clock jumps
> > and other timer instability on the Allwinner A64 SoC. It has now been
> > tested for a week, and I've received no reports of date jumps with this
> > version. So this is, as far as I can tell, a complete workaround.
> >
> > See the commit messages for a detailed description of the issue, but the
> > summary is that, when a high counter bit rolls over, indeterminance in
> > the low bits causes CNTPCT/CNTVCT and their respective TVAL registers to
> > jump forward or backward. Backward jumps (or the next read after forward
> > jumps) are sometimes seen by the kernel and interpreted as the timer
> > wrapping around after 2^56 cycles. This causes the system clock to jump
> > forward approximately 91 years.
> >
> > changes since v2;
> > - Reduced workaround threshold from 11 to 10 bits based on reports from
> >   other hardare and the U-Boot version of this workaround
> > - Added TVAL handling based on Marc's suggestion
> > - Added erratum documentation and renamed symbols to match
> > - Added Maxime's Acked-by
> >
> > changes since v1:
> > - Add an iteration limit like most other arch timer workarounds
> > - Added Andre's Tested-by
> >
> > Samuel Holland (2):
> >   arm64: arch_timer: Workaround for Allwinner A64 timer instability
> >   arm64: dts: allwinner: a64: Enable A64 timer workaround
> >
> >  Documentation/arm64/silicon-errata.txt        |  2 +
> >  arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi |  1 +
> >  drivers/clocksource/Kconfig                   | 10 ++++
> >  drivers/clocksource/arm_arch_timer.c          | 55 +++++++++++++++++++
> >  4 files changed, 68 insertions(+)
> >
>
> Applied. Took the opportunity to add the stable@ tag.

Not seeing it in tip just yet. Was it applied for -rc or -next?
Need to know which branch to apply the device tree patch to.

Thanks
ChenYu

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [linux-sunxi] [PATCH v3 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround
  2019-01-13  2:17 ` [PATCH v3 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland
@ 2019-01-24 10:06   ` Chen-Yu Tsai
  0 siblings, 0 replies; 10+ messages in thread
From: Chen-Yu Tsai @ 2019-01-24 10:06 UTC (permalink / raw)
  To: Samuel Holland
  Cc: Catalin Marinas, Will Deacon, Maxime Ripard, Rob Herring,
	Mark Rutland, Daniel Lezcano, Thomas Gleixner, Marc Zyngier,
	devicetree, linux-arm-kernel, linux-kernel, linux-sunxi

On Sun, Jan 13, 2019 at 10:17 AM Samuel Holland <samuel@sholland.org> wrote:
>
> As instability in the architectural timer has been observed on multiple
> devices using this SoC, inluding the Pine64 and the Orange Pi Win,
> enable the workaround in the SoC's device tree.
>
> Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
> Signed-off-by: Samuel Holland <samuel@sholland.org>

Applied for 5.1.

ChenYu

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2019-01-14  9:25   ` Marc Zyngier
@ 2019-12-04  4:18     ` Vasily Khoruzhick
  2019-12-04 12:21       ` Marc Zyngier
  0 siblings, 1 reply; 10+ messages in thread
From: Vasily Khoruzhick @ 2019-12-04  4:18 UTC (permalink / raw)
  To: marc.zyngier
  Cc: Samuel Holland, Catalin Marinas, Will Deacon, Maxime Ripard,
	Chen-Yu Tsai, Rob Herring, Mark Rutland, Daniel Lezcano,
	Thomas Gleixner, devicetree, arm-linux, linux-kernel,
	linux-sunxi

On Mon, Jan 14, 2019 at 1:25 AM Marc Zyngier <marc.zyngier@arm.com> wrote:
>
> Hi Samuel,

Hi Samuel,

> On 13/01/2019 02:17, Samuel Holland wrote:
> > The Allwinner A64 SoC is known[1] to have an unstable architectural
> > timer, which manifests itself most obviously in the time jumping forward
> > a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a
> > timer frequency of 24 MHz, implying that the time went slightly backward
> > (and this was interpreted by the kernel as it jumping forward and
> > wrapping around past the epoch).
> >
> > Investigation revealed instability in the low bits of CNTVCT at the
> > point a high bit rolls over. This leads to power-of-two cycle forward
> > and backward jumps. (Testing shows that forward jumps are about twice as
> > likely as backward jumps.) Since the counter value returns to normal
> > after an indeterminate read, each "jump" really consists of both a
> > forward and backward jump from the software perspective.
> >
> > Unless the kernel is trapping CNTVCT reads, a userspace program is able
> > to read the register in a loop faster than it changes. A test program
> > running on all 4 CPU cores that reported jumps larger than 100 ms was
> > run for 13.6 hours and reported the following:
> >
> >  Count | Event
> > -------+---------------------------
> >   9940 | jumped backward      699ms
> >    268 | jumped backward     1398ms
> >      1 | jumped backward     2097ms
> >  16020 | jumped forward       175ms
> >   6443 | jumped forward       699ms
> >   2976 | jumped forward      1398ms
> >      9 | jumped forward    356516ms
> >      9 | jumped forward    357215ms
> >      4 | jumped forward    714430ms
> >      1 | jumped forward   3578440ms
> >
> > This works out to a jump larger than 100 ms about every 5.5 seconds on
> > each CPU core.
> >
> > The largest jump (almost an hour!) was the following sequence of reads:
> >     0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
> >
> > Note that the middle bits don't necessarily all read as all zeroes or
> > all ones during the anomalous behavior; however the low 10 bits checked
> > by the function in this patch have never been observed with any other
> > value.
> >
> > Also note that smaller jumps are much more common, with backward jumps
> > of 2048 (2^11) cycles observed over 400 times per second on each core.
> > (Of course, this is partially explained by lower bits rolling over more
> > frequently.) Any one of these could have caused the 95 year time skip.
> >
> > Similar anomalies were observed while reading CNTPCT (after patching the
> > kernel to allow reads from userspace). However, the CNTPCT jumps are
> > much less frequent, and only small jumps were observed. The same program
> > as before (except now reading CNTPCT) observed after 72 hours:
> >
> >  Count | Event
> > -------+---------------------------
> >     17 | jumped backward      699ms
> >     52 | jumped forward       175ms
> >   2831 | jumped forward       699ms
> >      5 | jumped forward      1398ms
> >
> > Further investigation showed that the instability in CNTPCT/CNTVCT also
> > affected the respective timer's TVAL register. The following values were
> > observed immediately after writing CNVT_TVAL to 0x10000000:
> >
> >  CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
> > --------------------+------------+--------------------+-----------------
> >  0x000000d4a2d8bfff | 0x10003fff | 0x000000d4b2d8bfff | +0x00004000
> >  0x000000d4a2d94000 | 0x0fffffff | 0x000000d4b2d97fff | -0x00004000
> >  0x000000d4a2d97fff | 0x10003fff | 0x000000d4b2d97fff | +0x00004000
> >  0x000000d4a2d9c000 | 0x0fffffff | 0x000000d4b2d9ffff | -0x00004000
> >
> > The pattern of errors in CNTV_TVAL seemed to depend on exactly which
> > value was written to it. For example, after writing 0x10101010:
> >
> >  CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
> > --------------------+------------+--------------------+-----------------
> >  0x000001ac3effffff | 0x1110100f | 0x000001ac4f10100f | +0x1000000
> >  0x000001ac40000000 | 0x1010100f | 0x000001ac5110100f | -0x1000000
> >  0x000001ac58ffffff | 0x1110100f | 0x000001ac6910100f | +0x1000000
> >  0x000001ac66000000 | 0x1010100f | 0x000001ac7710100f | -0x1000000
> >  0x000001ac6affffff | 0x1110100f | 0x000001ac7b10100f | +0x1000000
> >  0x000001ac6e000000 | 0x1010100f | 0x000001ac7f10100f | -0x1000000
> >
> > I was also twice able to reproduce the issue covered by Allwinner's
> > workaround[4], that writing to TVAL sometimes fails, and both CVAL and
> > TVAL are left with entirely bogus values. One was the following values:
> >
> >  CNTVCT             | CNTV_TVAL  | CNTV_CVAL
> > --------------------+------------+--------------------------------------
> >  0x000000d4a2d6014c | 0x8fbd5721 | 0x000000d132935fff (615s in the past)
> >
> > ========================================================================
> >
> > Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> > change, performing two reads of the register and comparing the high bits
> > (like other workarounds) is not a workable solution. And because the
> > timer can jump both forward and backward, no pair of reads can
> > distinguish a good value from a bad one. The only way to guarantee a
> > good value from consecutive reads would be to read _three_ times, and
> > take the middle value only if the three values are 1) each unique and
> > 2) increasing. This takes at minimum 3 counter cycles (125 ns), or more
> > if an anomaly is detected.
> >
> > However, since there is a distinct pattern to the bad values, we can
> > optimize the common case (1022/1024 of the time) to a single read by
> > simply ignoring values that match the error pattern. This still takes no
> > more than 3 cycles in the worst case, and requires much less code. As an
> > additional safety check, we still limit the loop iteration to the number
> > of max-frequency (1.2 GHz) CPU cycles in three 24 MHz counter periods.
> >
> > For the TVAL registers, the simple solution is to not use them. Instead,
> > read or write the CVAL and calculate the TVAL value in software.
> >
> > Although the manufacturer is aware of at least part of the erratum[4],
> > there is no official name for it. For now, use the kernel-internal name
> > "UNKNOWN1".
> >
> > [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> > [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
> > [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
> > [4]: https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/master/drivers/clocksource/arm_arch_timer.c#L272
>
> nit: In general, I'm not overly keen on URLs in commit messages, as they
> may vanish without notice and the commit message becomes less useful. In
> the future, please keep those in the cover letter (though in this
> particular case, the commit message explains the issue pretty well, so
> no harm done once GitHub dies a horrible death... ;-).
>
> The fix itself looks pretty solid, and will hopefully make the
> "AllLoosers" HW more usable.

Unfortunately this patch doesn't completely eliminate the jumps. There
have been reports from users who still saw 95y jump even with the
patch applied.

Personally I've seen it once or twice on my Pine64-LTS.

Looks like we need bigger hammer. Does anyone have any idea what it could be?

Regards,
Vasily


> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
>
> Daniel, please consider this for v5.1.
>
> Thanks,
>
>         M.
> --
> Jazz is not dead. It just smells funny...
>
> --
> You received this message because you are subscribed to the Google Groups "linux-sunxi" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-sunxi+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64:  arch_timer: Workaround for Allwinner A64 timer instability
  2019-12-04  4:18     ` [linux-sunxi] " Vasily Khoruzhick
@ 2019-12-04 12:21       ` Marc Zyngier
  2019-12-04 15:37         ` Vasily Khoruzhick
  0 siblings, 1 reply; 10+ messages in thread
From: Marc Zyngier @ 2019-12-04 12:21 UTC (permalink / raw)
  To: anarsoul
  Cc: marc.zyngier, Samuel Holland, Catalin Marinas, Will Deacon,
	Maxime Ripard, Chen-Yu Tsai, Rob Herring, Mark Rutland,
	Daniel Lezcano, Thomas Gleixner, devicetree, arm-linux,
	linux-kernel, linux-sunxi

[please note that my email address has changed]

On 2019-12-04 04:18, Vasily Khoruzhick wrote:

[...]

> Unfortunately this patch doesn't completely eliminate the jumps. 
> There
> have been reports from users who still saw 95y jump even with the
> patch applied.
>
> Personally I've seen it once or twice on my Pine64-LTS.
>
> Looks like we need bigger hammer. Does anyone have any idea what it 
> could be?

Which kernel version did you see this happening on?

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
  2019-12-04 12:21       ` Marc Zyngier
@ 2019-12-04 15:37         ` Vasily Khoruzhick
  0 siblings, 0 replies; 10+ messages in thread
From: Vasily Khoruzhick @ 2019-12-04 15:37 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Samuel Holland, Catalin Marinas, Will Deacon, Chen-Yu Tsai,
	Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	devicetree, arm-linux, linux-kernel, linux-sunxi, Maxime Ripard

On Wed, Dec 4, 2019 at 4:21 AM Marc Zyngier <maz@kernel.org> wrote:
>
> [please note that my email address has changed]
>
> On 2019-12-04 04:18, Vasily Khoruzhick wrote:
>
> [...]
>
> > Unfortunately this patch doesn't completely eliminate the jumps.
> > There
> > have been reports from users who still saw 95y jump even with the
> > patch applied.
> >
> > Personally I've seen it once or twice on my Pine64-LTS.
> >
> > Looks like we need bigger hammer. Does anyone have any idea what it
> > could be?
>
> Which kernel version did you see this happening on?

I've seen it on 5.3

>          M.
> --
> Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-12-04 15:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-13  2:17 [PATCH v3 0/2] Allwinner A64 timer workaround Samuel Holland
2019-01-13  2:17 ` [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability Samuel Holland
2019-01-14  9:25   ` Marc Zyngier
2019-12-04  4:18     ` [linux-sunxi] " Vasily Khoruzhick
2019-12-04 12:21       ` Marc Zyngier
2019-12-04 15:37         ` Vasily Khoruzhick
2019-01-13  2:17 ` [PATCH v3 2/2] arm64: dts: allwinner: a64: Enable A64 timer workaround Samuel Holland
2019-01-24 10:06   ` [linux-sunxi] " Chen-Yu Tsai
2019-01-14 12:56 ` [PATCH v3 0/2] Allwinner " Daniel Lezcano
2019-01-15  2:52   ` Chen-Yu Tsai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).