All of lore.kernel.org
 help / color / mirror / Atom feed
* Broken watchdog in u-boot master branch
@ 2022-10-09 19:12 Pali Rohár
  2022-10-10 13:55 ` Tom Rini
  2022-10-10 16:28 ` Tom Rini
  0 siblings, 2 replies; 15+ messages in thread
From: Pali Rohár @ 2022-10-09 19:12 UTC (permalink / raw)
  To: u-boot

Hello! Watchdog code seems to be broken in u-boot master branch.
On Nokia N900 I'm getting following message in qemu:

cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling

Seems that watchdog core code is not prepared for "slower" watchdogs
which communicate over slower i2c bus, like it is the case for N900.

Disabling slower watchdog is a bad idea as it would result in reboot
loop instead of slower - but working code.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-09 19:12 Broken watchdog in u-boot master branch Pali Rohár
@ 2022-10-10 13:55 ` Tom Rini
  2022-10-10 16:19   ` Pali Rohár
  2022-10-11  7:18   ` Rasmus Villemoes
  2022-10-10 16:28 ` Tom Rini
  1 sibling, 2 replies; 15+ messages in thread
From: Tom Rini @ 2022-10-10 13:55 UTC (permalink / raw)
  To: Pali Rohár, Stefan Roese; +Cc: u-boot

[-- Attachment #1: Type: text/plain, Size: 634 bytes --]

On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:

> Hello! Watchdog code seems to be broken in u-boot master branch.
> On Nokia N900 I'm getting following message in qemu:
> 
> cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> 
> Seems that watchdog core code is not prepared for "slower" watchdogs
> which communicate over slower i2c bus, like it is the case for N900.
> 
> Disabling slower watchdog is a bad idea as it would result in reboot
> loop instead of slower - but working code.

Adding the watchdog custodian. Have you bisected this behavior yet?
Thanks.

-- 
Tom

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 13:55 ` Tom Rini
@ 2022-10-10 16:19   ` Pali Rohár
  2022-10-11  7:18   ` Rasmus Villemoes
  1 sibling, 0 replies; 15+ messages in thread
From: Pali Rohár @ 2022-10-10 16:19 UTC (permalink / raw)
  To: Tom Rini; +Cc: Stefan Roese, u-boot

On Monday 10 October 2022 09:55:12 Tom Rini wrote:
> On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> 
> > Hello! Watchdog code seems to be broken in u-boot master branch.
> > On Nokia N900 I'm getting following message in qemu:
> > 
> > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > 
> > Seems that watchdog core code is not prepared for "slower" watchdogs
> > which communicate over slower i2c bus, like it is the case for N900.
> > 
> > Disabling slower watchdog is a bad idea as it would result in reboot
> > loop instead of slower - but working code.
> 
> Adding the watchdog custodian. Have you bisected this behavior yet?
> Thanks.

No, I have not bisected it yet, I have not had time for it as I did
other thing (asm compiler error, HW runtime issues, ...) mentioned in
other threads.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-09 19:12 Broken watchdog in u-boot master branch Pali Rohár
  2022-10-10 13:55 ` Tom Rini
@ 2022-10-10 16:28 ` Tom Rini
  2022-10-10 17:22   ` Pali Rohár
  1 sibling, 1 reply; 15+ messages in thread
From: Tom Rini @ 2022-10-10 16:28 UTC (permalink / raw)
  To: Pali Rohár, Stefan Roese; +Cc: u-boot

[-- Attachment #1: Type: text/plain, Size: 926 bytes --]

On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> Hello! Watchdog code seems to be broken in u-boot master branch.
> On Nokia N900 I'm getting following message in qemu:
> 
> cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> 
> Seems that watchdog core code is not prepared for "slower" watchdogs
> which communicate over slower i2c bus, like it is the case for N900.
> 
> Disabling slower watchdog is a bad idea as it would result in reboot
> loop instead of slower - but working code.

So, looking at this in more detail, we have
CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
the too long comes from). And picking a random CI run:
https://source.denx.de/u-boot/u-boot/-/jobs/511177
I do see we hit this in CI once, but not every time, QEMU runs here. Is
that the max time is configurable enough to satisfy your concerns here?

-- 
Tom

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 16:28 ` Tom Rini
@ 2022-10-10 17:22   ` Pali Rohár
  2022-10-10 17:40     ` Tom Rini
  0 siblings, 1 reply; 15+ messages in thread
From: Pali Rohár @ 2022-10-10 17:22 UTC (permalink / raw)
  To: Tom Rini; +Cc: Stefan Roese, u-boot

On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > Hello! Watchdog code seems to be broken in u-boot master branch.
> > On Nokia N900 I'm getting following message in qemu:
> > 
> > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > 
> > Seems that watchdog core code is not prepared for "slower" watchdogs
> > which communicate over slower i2c bus, like it is the case for N900.
> > 
> > Disabling slower watchdog is a bad idea as it would result in reboot
> > loop instead of slower - but working code.
> 
> So, looking at this in more detail, we have
> CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> the too long comes from). And picking a random CI run:
> https://source.denx.de/u-boot/u-boot/-/jobs/511177
> I do see we hit this in CI once, but not every time, QEMU runs here. Is
> that the max time is configurable enough to satisfy your concerns here?
> 
> -- 
> Tom

It is needed to investigate, how to _properly_ fix this issue, not just
workarounded it. Probably other boards may be affected.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 17:22   ` Pali Rohár
@ 2022-10-10 17:40     ` Tom Rini
  2022-10-10 17:44       ` Pali Rohár
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Rini @ 2022-10-10 17:40 UTC (permalink / raw)
  To: Pali Rohár; +Cc: Stefan Roese, u-boot

[-- Attachment #1: Type: text/plain, Size: 1881 bytes --]

On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > Hello! Watchdog code seems to be broken in u-boot master branch.
> > > On Nokia N900 I'm getting following message in qemu:
> > > 
> > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > 
> > > Seems that watchdog core code is not prepared for "slower" watchdogs
> > > which communicate over slower i2c bus, like it is the case for N900.
> > > 
> > > Disabling slower watchdog is a bad idea as it would result in reboot
> > > loop instead of slower - but working code.
> > 
> > So, looking at this in more detail, we have
> > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > the too long comes from). And picking a random CI run:
> > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > I do see we hit this in CI once, but not every time, QEMU runs here. Is
> > that the max time is configurable enough to satisfy your concerns here?
> 
> It is needed to investigate, how to _properly_ fix this issue, not just
> workarounded it. Probably other boards may be affected.

So it's the cyclic watchdog code, which we merged as early as possible
that's the reason here. And it was merged as early as we could to see if
there's problems. Are there problems? We're seeing "system too slow,
disabling" on QEMU, sometimes, and the value of too slow is
configurable. I know you reported other problems with n900 HW, so we
can't see if it's failing there, and I don't have any omap3 HW setup in
my lab atm, just newer generation boards and don't see the problem
there. Which is why I'm asking, is being able to configure the "too
slow" value enough? Or is there something else that needs to be done?

-- 
Tom

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 17:40     ` Tom Rini
@ 2022-10-10 17:44       ` Pali Rohár
  2022-10-10 17:56         ` Tom Rini
  0 siblings, 1 reply; 15+ messages in thread
From: Pali Rohár @ 2022-10-10 17:44 UTC (permalink / raw)
  To: Tom Rini; +Cc: Stefan Roese, u-boot

On Monday 10 October 2022 13:40:38 Tom Rini wrote:
> On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> > On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > > Hello! Watchdog code seems to be broken in u-boot master branch.
> > > > On Nokia N900 I'm getting following message in qemu:
> > > > 
> > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > > 
> > > > Seems that watchdog core code is not prepared for "slower" watchdogs
> > > > which communicate over slower i2c bus, like it is the case for N900.
> > > > 
> > > > Disabling slower watchdog is a bad idea as it would result in reboot
> > > > loop instead of slower - but working code.
> > > 
> > > So, looking at this in more detail, we have
> > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > > the too long comes from). And picking a random CI run:
> > > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > > I do see we hit this in CI once, but not every time, QEMU runs here. Is
> > > that the max time is configurable enough to satisfy your concerns here?
> > 
> > It is needed to investigate, how to _properly_ fix this issue, not just
> > workarounded it. Probably other boards may be affected.
> 
> So it's the cyclic watchdog code, which we merged as early as possible
> that's the reason here. And it was merged as early as we could to see if
> there's problems. Are there problems? We're seeing "system too slow,
> disabling" on QEMU, sometimes, and the value of too slow is
> configurable. I know you reported other problems with n900 HW, so we
> can't see if it's failing there

I was tested it with older asm code (as described in that other email,
via git checkout commit -- file) on n900 HW and watchdog problem is
there too. Phone reboots in about 20 seconds. But as I do not have
serial console, I do not know if that "disabling" message is printed
there too (but I guess it is).

> and I don't have any omap3 HW setup in
> my lab atm, just newer generation boards and don't see the problem
> there. Which is why I'm asking, is being able to configure the "too
> slow" value enough? Or is there something else that needs to be done?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 17:44       ` Pali Rohár
@ 2022-10-10 17:56         ` Tom Rini
  2022-10-10 18:01           ` Pali Rohár
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Rini @ 2022-10-10 17:56 UTC (permalink / raw)
  To: Pali Rohár; +Cc: Stefan Roese, u-boot

[-- Attachment #1: Type: text/plain, Size: 2700 bytes --]

On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
> On Monday 10 October 2022 13:40:38 Tom Rini wrote:
> > On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> > > On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > > > Hello! Watchdog code seems to be broken in u-boot master branch.
> > > > > On Nokia N900 I'm getting following message in qemu:
> > > > > 
> > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > > > 
> > > > > Seems that watchdog core code is not prepared for "slower" watchdogs
> > > > > which communicate over slower i2c bus, like it is the case for N900.
> > > > > 
> > > > > Disabling slower watchdog is a bad idea as it would result in reboot
> > > > > loop instead of slower - but working code.
> > > > 
> > > > So, looking at this in more detail, we have
> > > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > > > the too long comes from). And picking a random CI run:
> > > > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > > > I do see we hit this in CI once, but not every time, QEMU runs here. Is
> > > > that the max time is configurable enough to satisfy your concerns here?
> > > 
> > > It is needed to investigate, how to _properly_ fix this issue, not just
> > > workarounded it. Probably other boards may be affected.
> > 
> > So it's the cyclic watchdog code, which we merged as early as possible
> > that's the reason here. And it was merged as early as we could to see if
> > there's problems. Are there problems? We're seeing "system too slow,
> > disabling" on QEMU, sometimes, and the value of too slow is
> > configurable. I know you reported other problems with n900 HW, so we
> > can't see if it's failing there
> 
> I was tested it with older asm code (as described in that other email,
> via git checkout commit -- file) on n900 HW and watchdog problem is
> there too. Phone reboots in about 20 seconds. But as I do not have
> serial console, I do not know if that "disabling" message is printed
> there too (but I guess it is).

I think I'm a bit baffled at this point, honestly. The watchdog timeout
is 60 seconds. If you're confident in it being about 20 seconds,
consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10
seconds) should let you see if U-Boot has configured the watchdog and
it's being tripped, or if it's still at the prior stage value.

I would have expected that QEMU would see problems that real HW doesn't
(the value in your log is much higher than the one in CI), but I could
be wrong here.

-- 
Tom

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 17:56         ` Tom Rini
@ 2022-10-10 18:01           ` Pali Rohár
  2022-10-10 18:14             ` Tom Rini
  0 siblings, 1 reply; 15+ messages in thread
From: Pali Rohár @ 2022-10-10 18:01 UTC (permalink / raw)
  To: Tom Rini; +Cc: Stefan Roese, u-boot

On Monday 10 October 2022 13:56:10 Tom Rini wrote:
> On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
> > On Monday 10 October 2022 13:40:38 Tom Rini wrote:
> > > On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> > > > On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > > > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > > > > Hello! Watchdog code seems to be broken in u-boot master branch.
> > > > > > On Nokia N900 I'm getting following message in qemu:
> > > > > > 
> > > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > > > > 
> > > > > > Seems that watchdog core code is not prepared for "slower" watchdogs
> > > > > > which communicate over slower i2c bus, like it is the case for N900.
> > > > > > 
> > > > > > Disabling slower watchdog is a bad idea as it would result in reboot
> > > > > > loop instead of slower - but working code.
> > > > > 
> > > > > So, looking at this in more detail, we have
> > > > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > > > > the too long comes from). And picking a random CI run:
> > > > > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > > > > I do see we hit this in CI once, but not every time, QEMU runs here. Is
> > > > > that the max time is configurable enough to satisfy your concerns here?
> > > > 
> > > > It is needed to investigate, how to _properly_ fix this issue, not just
> > > > workarounded it. Probably other boards may be affected.
> > > 
> > > So it's the cyclic watchdog code, which we merged as early as possible
> > > that's the reason here. And it was merged as early as we could to see if
> > > there's problems. Are there problems? We're seeing "system too slow,
> > > disabling" on QEMU, sometimes, and the value of too slow is
> > > configurable. I know you reported other problems with n900 HW, so we
> > > can't see if it's failing there
> > 
> > I was tested it with older asm code (as described in that other email,
> > via git checkout commit -- file) on n900 HW and watchdog problem is
> > there too. Phone reboots in about 20 seconds. But as I do not have
> > serial console, I do not know if that "disabling" message is printed
> > there too (but I guess it is).
> 
> I think I'm a bit baffled at this point, honestly. The watchdog timeout
> is 60 seconds. If you're confident in it being about 20 seconds,
> consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10
> seconds) should let you see if U-Boot has configured the watchdog and
> it's being tripped, or if it's still at the prior stage value.

$ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig
configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000

Also watchdog is started by NOLO (which loads and execute U-Boot) so
there can be some smaller timeout.

So I have feeling that on the real HW is same issue. cyclic code
disabled watchdog kicking and then watchdog restarted phone.

I do not remember exact time (if it is 20s or 25s; I have not measured
it precisely), but it sounds plausible.

> I would have expected that QEMU would see problems that real HW doesn't
> (the value in your log is much higher than the one in CI), but I could
> be wrong here.
> 
> -- 
> Tom



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 18:01           ` Pali Rohár
@ 2022-10-10 18:14             ` Tom Rini
  2022-10-10 19:24               ` Tom Rini
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Rini @ 2022-10-10 18:14 UTC (permalink / raw)
  To: Pali Rohár; +Cc: Stefan Roese, u-boot

[-- Attachment #1: Type: text/plain, Size: 3771 bytes --]

On Mon, Oct 10, 2022 at 08:01:23PM +0200, Pali Rohár wrote:
> On Monday 10 October 2022 13:56:10 Tom Rini wrote:
> > On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
> > > On Monday 10 October 2022 13:40:38 Tom Rini wrote:
> > > > On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> > > > > On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > > > > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > > > > > Hello! Watchdog code seems to be broken in u-boot master branch.
> > > > > > > On Nokia N900 I'm getting following message in qemu:
> > > > > > > 
> > > > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > > > > > 
> > > > > > > Seems that watchdog core code is not prepared for "slower" watchdogs
> > > > > > > which communicate over slower i2c bus, like it is the case for N900.
> > > > > > > 
> > > > > > > Disabling slower watchdog is a bad idea as it would result in reboot
> > > > > > > loop instead of slower - but working code.
> > > > > > 
> > > > > > So, looking at this in more detail, we have
> > > > > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > > > > > the too long comes from). And picking a random CI run:
> > > > > > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > > > > > I do see we hit this in CI once, but not every time, QEMU runs here. Is
> > > > > > that the max time is configurable enough to satisfy your concerns here?
> > > > > 
> > > > > It is needed to investigate, how to _properly_ fix this issue, not just
> > > > > workarounded it. Probably other boards may be affected.
> > > > 
> > > > So it's the cyclic watchdog code, which we merged as early as possible
> > > > that's the reason here. And it was merged as early as we could to see if
> > > > there's problems. Are there problems? We're seeing "system too slow,
> > > > disabling" on QEMU, sometimes, and the value of too slow is
> > > > configurable. I know you reported other problems with n900 HW, so we
> > > > can't see if it's failing there
> > > 
> > > I was tested it with older asm code (as described in that other email,
> > > via git checkout commit -- file) on n900 HW and watchdog problem is
> > > there too. Phone reboots in about 20 seconds. But as I do not have
> > > serial console, I do not know if that "disabling" message is printed
> > > there too (but I guess it is).
> > 
> > I think I'm a bit baffled at this point, honestly. The watchdog timeout
> > is 60 seconds. If you're confident in it being about 20 seconds,
> > consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10
> > seconds) should let you see if U-Boot has configured the watchdog and
> > it's being tripped, or if it's still at the prior stage value.
> 
> $ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig
> configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
> 
> Also watchdog is started by NOLO (which loads and execute U-Boot) so
> there can be some smaller timeout.
> 
> So I have feeling that on the real HW is same issue. cyclic code
> disabled watchdog kicking and then watchdog restarted phone.
> 
> I do not remember exact time (if it is 20s or 25s; I have not measured
> it precisely), but it sounds plausible.

OK, so what happens if you increase CONFIG_CYCLIC_MAX_CPU_TIME_US to
something very high (so we should still enable the watchdog and
configure the timeout) along with CONFIG_WATCHDOG_TIMEOUT_MSECS being
high too (so if we can't service it in time really it's so long as to be
noticeable) ? Or CONFIG_WATCHDOG_TIMEOUT_MSECS to something much lower
(so that if the device is resetting quicker we're crashing elsewhere) ?

-- 
Tom

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 18:14             ` Tom Rini
@ 2022-10-10 19:24               ` Tom Rini
  2022-10-10 19:33                 ` Pali Rohár
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Rini @ 2022-10-10 19:24 UTC (permalink / raw)
  To: Pali Rohár; +Cc: Stefan Roese, u-boot

[-- Attachment #1: Type: text/plain, Size: 5288 bytes --]

On Mon, Oct 10, 2022 at 02:14:25PM -0400, Tom Rini wrote:
> On Mon, Oct 10, 2022 at 08:01:23PM +0200, Pali Rohár wrote:
> > On Monday 10 October 2022 13:56:10 Tom Rini wrote:
> > > On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
> > > > On Monday 10 October 2022 13:40:38 Tom Rini wrote:
> > > > > On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> > > > > > On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > > > > > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > > > > > > Hello! Watchdog code seems to be broken in u-boot master branch.
> > > > > > > > On Nokia N900 I'm getting following message in qemu:
> > > > > > > > 
> > > > > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > > > > > > 
> > > > > > > > Seems that watchdog core code is not prepared for "slower" watchdogs
> > > > > > > > which communicate over slower i2c bus, like it is the case for N900.
> > > > > > > > 
> > > > > > > > Disabling slower watchdog is a bad idea as it would result in reboot
> > > > > > > > loop instead of slower - but working code.
> > > > > > > 
> > > > > > > So, looking at this in more detail, we have
> > > > > > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > > > > > > the too long comes from). And picking a random CI run:
> > > > > > > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > > > > > > I do see we hit this in CI once, but not every time, QEMU runs here. Is
> > > > > > > that the max time is configurable enough to satisfy your concerns here?
> > > > > > 
> > > > > > It is needed to investigate, how to _properly_ fix this issue, not just
> > > > > > workarounded it. Probably other boards may be affected.
> > > > > 
> > > > > So it's the cyclic watchdog code, which we merged as early as possible
> > > > > that's the reason here. And it was merged as early as we could to see if
> > > > > there's problems. Are there problems? We're seeing "system too slow,
> > > > > disabling" on QEMU, sometimes, and the value of too slow is
> > > > > configurable. I know you reported other problems with n900 HW, so we
> > > > > can't see if it's failing there
> > > > 
> > > > I was tested it with older asm code (as described in that other email,
> > > > via git checkout commit -- file) on n900 HW and watchdog problem is
> > > > there too. Phone reboots in about 20 seconds. But as I do not have
> > > > serial console, I do not know if that "disabling" message is printed
> > > > there too (but I guess it is).
> > > 
> > > I think I'm a bit baffled at this point, honestly. The watchdog timeout
> > > is 60 seconds. If you're confident in it being about 20 seconds,
> > > consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10
> > > seconds) should let you see if U-Boot has configured the watchdog and
> > > it's being tripped, or if it's still at the prior stage value.
> > 
> > $ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig
> > configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
> > 
> > Also watchdog is started by NOLO (which loads and execute U-Boot) so
> > there can be some smaller timeout.
> > 
> > So I have feeling that on the real HW is same issue. cyclic code
> > disabled watchdog kicking and then watchdog restarted phone.
> > 
> > I do not remember exact time (if it is 20s or 25s; I have not measured
> > it precisely), but it sounds plausible.
> 
> OK, so what happens if you increase CONFIG_CYCLIC_MAX_CPU_TIME_US to
> something very high (so we should still enable the watchdog and
> configure the timeout) along with CONFIG_WATCHDOG_TIMEOUT_MSECS being
> high too (so if we can't service it in time really it's so long as to be
> noticeable) ? Or CONFIG_WATCHDOG_TIMEOUT_MSECS to something much lower
> (so that if the device is resetting quicker we're crashing elsewhere) ?

OK, on my beagleboard xM with a small change:
diff --git a/drivers/watchdog/omap_wdt.c b/drivers/watchdog/omap_wdt.c
index ca2bc7cfb59e..f0e57b4f7286 100644
--- a/drivers/watchdog/omap_wdt.c
+++ b/drivers/watchdog/omap_wdt.c
@@ -39,7 +39,7 @@
 #include <common.h>
 #include <log.h>
 #include <watchdog.h>
-#include <asm/arch/hardware.h>
+#include <asm/ti-common/omap_wdt.h>
 #include <asm/io.h>
 #include <asm/processor.h>
 #include <asm/arch/cpu.h>

On my beagleboard xM I now see:
U-Boot SPL 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)
Trying to boot from MMC1


U-Boot 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)

OMAP3630/3730-GP ES1.1, CPU-OPP2, L3-200MHz, Max CPU Clock 800 MHz
Model: TI OMAP3 BeagleBoard
OMAP3 Beagle board + LPDDR/NAND
I2C:   ready
DRAM:  256 MiB
Core:  45 devices, 19 uclasses, devicetree: separate
WDT:   Started wdt@48314000 without servicing  (60s timeout)
NAND:  0 MiB
MMC:   OMAP SD/MMC: 0
Loading Environment from NAND... *** Warning - readenv() failed, using default environment

Beagle xM Rev A/B
No EEPROM on expansion board
OMAP die ID: 6e5e00211ff00000015739eb08031024
Net:   No ethernet found.
Hit any key to stop autoboot:  0

So, this is as close as I can get to testing on n900 HW, and it's fine
here.

-- 
Tom

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 19:24               ` Tom Rini
@ 2022-10-10 19:33                 ` Pali Rohár
  0 siblings, 0 replies; 15+ messages in thread
From: Pali Rohár @ 2022-10-10 19:33 UTC (permalink / raw)
  To: Tom Rini; +Cc: Stefan Roese, u-boot

On Monday 10 October 2022 15:24:13 Tom Rini wrote:
> On Mon, Oct 10, 2022 at 02:14:25PM -0400, Tom Rini wrote:
> > On Mon, Oct 10, 2022 at 08:01:23PM +0200, Pali Rohár wrote:
> > > On Monday 10 October 2022 13:56:10 Tom Rini wrote:
> > > > On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
> > > > > On Monday 10 October 2022 13:40:38 Tom Rini wrote:
> > > > > > On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> > > > > > > On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > > > > > > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > > > > > > > Hello! Watchdog code seems to be broken in u-boot master branch.
> > > > > > > > > On Nokia N900 I'm getting following message in qemu:
> > > > > > > > > 
> > > > > > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > > > > > > > 
> > > > > > > > > Seems that watchdog core code is not prepared for "slower" watchdogs
> > > > > > > > > which communicate over slower i2c bus, like it is the case for N900.
> > > > > > > > > 
> > > > > > > > > Disabling slower watchdog is a bad idea as it would result in reboot
> > > > > > > > > loop instead of slower - but working code.
> > > > > > > > 
> > > > > > > > So, looking at this in more detail, we have
> > > > > > > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > > > > > > > the too long comes from). And picking a random CI run:
> > > > > > > > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > > > > > > > I do see we hit this in CI once, but not every time, QEMU runs here. Is
> > > > > > > > that the max time is configurable enough to satisfy your concerns here?
> > > > > > > 
> > > > > > > It is needed to investigate, how to _properly_ fix this issue, not just
> > > > > > > workarounded it. Probably other boards may be affected.
> > > > > > 
> > > > > > So it's the cyclic watchdog code, which we merged as early as possible
> > > > > > that's the reason here. And it was merged as early as we could to see if
> > > > > > there's problems. Are there problems? We're seeing "system too slow,
> > > > > > disabling" on QEMU, sometimes, and the value of too slow is
> > > > > > configurable. I know you reported other problems with n900 HW, so we
> > > > > > can't see if it's failing there
> > > > > 
> > > > > I was tested it with older asm code (as described in that other email,
> > > > > via git checkout commit -- file) on n900 HW and watchdog problem is
> > > > > there too. Phone reboots in about 20 seconds. But as I do not have
> > > > > serial console, I do not know if that "disabling" message is printed
> > > > > there too (but I guess it is).
> > > > 
> > > > I think I'm a bit baffled at this point, honestly. The watchdog timeout
> > > > is 60 seconds. If you're confident in it being about 20 seconds,
> > > > consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10
> > > > seconds) should let you see if U-Boot has configured the watchdog and
> > > > it's being tripped, or if it's still at the prior stage value.
> > > 
> > > $ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig
> > > configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
> > > 
> > > Also watchdog is started by NOLO (which loads and execute U-Boot) so
> > > there can be some smaller timeout.
> > > 
> > > So I have feeling that on the real HW is same issue. cyclic code
> > > disabled watchdog kicking and then watchdog restarted phone.
> > > 
> > > I do not remember exact time (if it is 20s or 25s; I have not measured
> > > it precisely), but it sounds plausible.
> > 
> > OK, so what happens if you increase CONFIG_CYCLIC_MAX_CPU_TIME_US to
> > something very high (so we should still enable the watchdog and
> > configure the timeout) along with CONFIG_WATCHDOG_TIMEOUT_MSECS being
> > high too (so if we can't service it in time really it's so long as to be
> > noticeable) ? Or CONFIG_WATCHDOG_TIMEOUT_MSECS to something much lower
> > (so that if the device is resetting quicker we're crashing elsewhere) ?
> 
> OK, on my beagleboard xM with a small change:
> diff --git a/drivers/watchdog/omap_wdt.c b/drivers/watchdog/omap_wdt.c
> index ca2bc7cfb59e..f0e57b4f7286 100644
> --- a/drivers/watchdog/omap_wdt.c
> +++ b/drivers/watchdog/omap_wdt.c
> @@ -39,7 +39,7 @@
>  #include <common.h>
>  #include <log.h>
>  #include <watchdog.h>
> -#include <asm/arch/hardware.h>
> +#include <asm/ti-common/omap_wdt.h>
>  #include <asm/io.h>
>  #include <asm/processor.h>
>  #include <asm/arch/cpu.h>
> 
> On my beagleboard xM I now see:
> U-Boot SPL 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)
> Trying to boot from MMC1
> 
> 
> U-Boot 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)
> 
> OMAP3630/3730-GP ES1.1, CPU-OPP2, L3-200MHz, Max CPU Clock 800 MHz
> Model: TI OMAP3 BeagleBoard
> OMAP3 Beagle board + LPDDR/NAND
> I2C:   ready
> DRAM:  256 MiB
> Core:  45 devices, 19 uclasses, devicetree: separate
> WDT:   Started wdt@48314000 without servicing  (60s timeout)
> NAND:  0 MiB
> MMC:   OMAP SD/MMC: 0
> Loading Environment from NAND... *** Warning - readenv() failed, using default environment
> 
> Beagle xM Rev A/B
> No EEPROM on expansion board
> OMAP die ID: 6e5e00211ff00000015739eb08031024
> Net:   No ethernet found.
> Hit any key to stop autoboot:  0
> 
> So, this is as close as I can get to testing on n900 HW, and it's fine
> here.
> 
> -- 
> Tom

This is omap watchdog which has registers directly on the processor.

With N900 is issue that uses watchdog connected via i2c (not that omap
watchdog) and there is some issue in new u-boot that i2c does not work
at high speeds. So i2c is configured at lower speed now (when it works
correctly). u-boot i2c driver has more udelay() calls during i2c
transfer and I think that this is the reason why n900 watchdog kick
spend much more time... (just guessing). If you have watchdog registers
directly mapped to MMIO (and even part of processor) then any ldr or str
instruction is executed immediately and not with ms delay timeouts.

It was easier to lower i2c speed than debugging it as nobody knows why
since some u-boot version, i2c transfers started failing at higher
speeds.

I will try to play with those config options later to see what happens.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-10 13:55 ` Tom Rini
  2022-10-10 16:19   ` Pali Rohár
@ 2022-10-11  7:18   ` Rasmus Villemoes
  2022-10-11  7:25     ` Pali Rohár
  2022-10-17  6:52     ` Stefan Roese
  1 sibling, 2 replies; 15+ messages in thread
From: Rasmus Villemoes @ 2022-10-11  7:18 UTC (permalink / raw)
  To: Tom Rini, Pali Rohár, Stefan Roese; +Cc: u-boot

On 10/10/2022 15.55, Tom Rini wrote:
> On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> 
>> Hello! Watchdog code seems to be broken in u-boot master branch.
>> On Nokia N900 I'm getting following message in qemu:
>>
>> cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
>>
>> Seems that watchdog core code is not prepared for "slower" watchdogs
>> which communicate over slower i2c bus, like it is the case for N900.
>>
>> Disabling slower watchdog is a bad idea as it would result in reboot
>> loop instead of slower - but working code.

So, a few thoughts.

First, I assume that that board has a very coarse-grained tick, probably
just 1000Hz. Otherwise it would be pretty amazing for cpu_time to come
out as 10ms exactly. That's not the board's fault, of course, just an
observation, but it is something we need to bear in mind. If the
resolution is merely 100Hz, so 10ms is simply the granularity, we cannot
really meaningfully compare the cpu_time to anything less than that,
because every once in a while it _will_ happen that we sample "now" just
before the tick, run the function, then sample again just after, and it
may only have taken 17us, yet the diff comes out as 10ms.

Second, perhaps the threshold should not be a compile-time constant, but
instead a fraction of the requested call frequency (say 1.5%, 1/64).
I.e., if we've registered a function to be called every 10 seconds, we'd
check if its runtime exceeded (10000000 >> 6) us. Preferably per above
that bound is rounded up to a multiple of the timer's granularity (we
can get that, right?)

Third, perhaps we shouldn't disable it, but just print a (one-time)
warning. Adding a "already-warned" field to struct cyclic_info is
certainly simple enough.

Rasmus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-11  7:18   ` Rasmus Villemoes
@ 2022-10-11  7:25     ` Pali Rohár
  2022-10-17  6:52     ` Stefan Roese
  1 sibling, 0 replies; 15+ messages in thread
From: Pali Rohár @ 2022-10-11  7:25 UTC (permalink / raw)
  To: Rasmus Villemoes; +Cc: Tom Rini, Stefan Roese, u-boot

On Tuesday 11 October 2022 09:18:56 Rasmus Villemoes wrote:
> On 10/10/2022 15.55, Tom Rini wrote:
> > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > 
> >> Hello! Watchdog code seems to be broken in u-boot master branch.
> >> On Nokia N900 I'm getting following message in qemu:
> >>
> >> cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> >>
> >> Seems that watchdog core code is not prepared for "slower" watchdogs
> >> which communicate over slower i2c bus, like it is the case for N900.
> >>
> >> Disabling slower watchdog is a bad idea as it would result in reboot
> >> loop instead of slower - but working code.
> 
> So, a few thoughts.
> 
> First, I assume that that board has a very coarse-grained tick, probably
> just 1000Hz.

IIRC i2c is running at 100kHz. But u-boot i2c driver contains more
udelay() calls and more busy loops. And I was told that there are also
some hw erratas. So at the end if some other driver calls u-boot dm
function for i2c transfer, it can spend more time than expected...

Also there is another issue is that udelay() itself calls watchdog reset
function. So if watchdog is on i2c which driver calls udelay() it means
that watchdog reset function may be called in infinite loop. For this
reason n900 watchdog driver has lock which prevents and eliminate this
loop (if somebody calls reset during active lock then reset does
nothing).

> Otherwise it would be pretty amazing for cpu_time to come
> out as 10ms exactly. That's not the board's fault, of course, just an
> observation, but it is something we need to bear in mind. If the
> resolution is merely 100Hz, so 10ms is simply the granularity, we cannot
> really meaningfully compare the cpu_time to anything less than that,
> because every once in a while it _will_ happen that we sample "now" just
> before the tick, run the function, then sample again just after, and it
> may only have taken 17us, yet the diff comes out as 10ms.
> 
> Second, perhaps the threshold should not be a compile-time constant, but
> instead a fraction of the requested call frequency (say 1.5%, 1/64).
> I.e., if we've registered a function to be called every 10 seconds, we'd
> check if its runtime exceeded (10000000 >> 6) us. Preferably per above
> that bound is rounded up to a multiple of the timer's granularity (we
> can get that, right?)
> 
> Third, perhaps we shouldn't disable it, but just print a (one-time)
> warning. Adding a "already-warned" field to struct cyclic_info is
> certainly simple enough.
> 
> Rasmus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Broken watchdog in u-boot master branch
  2022-10-11  7:18   ` Rasmus Villemoes
  2022-10-11  7:25     ` Pali Rohár
@ 2022-10-17  6:52     ` Stefan Roese
  1 sibling, 0 replies; 15+ messages in thread
From: Stefan Roese @ 2022-10-17  6:52 UTC (permalink / raw)
  To: Rasmus Villemoes, Tom Rini, Pali Rohár; +Cc: u-boot

On 11.10.22 09:18, Rasmus Villemoes wrote:
> On 10/10/2022 15.55, Tom Rini wrote:
>> On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
>>
>>> Hello! Watchdog code seems to be broken in u-boot master branch.
>>> On Nokia N900 I'm getting following message in qemu:
>>>
>>> cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
>>>
>>> Seems that watchdog core code is not prepared for "slower" watchdogs
>>> which communicate over slower i2c bus, like it is the case for N900.
>>>
>>> Disabling slower watchdog is a bad idea as it would result in reboot
>>> loop instead of slower - but working code.
> 
> So, a few thoughts.
> 
> First, I assume that that board has a very coarse-grained tick, probably
> just 1000Hz. Otherwise it would be pretty amazing for cpu_time to come
> out as 10ms exactly. That's not the board's fault, of course, just an
> observation, but it is something we need to bear in mind. If the
> resolution is merely 100Hz, so 10ms is simply the granularity, we cannot
> really meaningfully compare the cpu_time to anything less than that,
> because every once in a while it _will_ happen that we sample "now" just
> before the tick, run the function, then sample again just after, and it
> may only have taken 17us, yet the diff comes out as 10ms.
> 
> Second, perhaps the threshold should not be a compile-time constant, but
> instead a fraction of the requested call frequency (say 1.5%, 1/64).
> I.e., if we've registered a function to be called every 10 seconds, we'd
> check if its runtime exceeded (10000000 >> 6) us.

In general I think this is a good idea. Now that we have multiple users
of the cyclic IF in one board it shows, that this threshold might better
be defined as per-cyclic-function. Or ...

> Preferably per above
> that bound is rounded up to a multiple of the timer's granularity (we
> can get that, right?)
> 
> Third, perhaps we shouldn't disable it, but just print a (one-time)
> warning. Adding a "already-warned" field to struct cyclic_info is
> certainly simple enough.

... just print this warning once. As it's pretty simple, I'll send
a patch implementing this change later today. We can work on other
improvements later, once we've fixed this watchdog breakage.

Thanks,
Stefan


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-10-17  6:53 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-09 19:12 Broken watchdog in u-boot master branch Pali Rohár
2022-10-10 13:55 ` Tom Rini
2022-10-10 16:19   ` Pali Rohár
2022-10-11  7:18   ` Rasmus Villemoes
2022-10-11  7:25     ` Pali Rohár
2022-10-17  6:52     ` Stefan Roese
2022-10-10 16:28 ` Tom Rini
2022-10-10 17:22   ` Pali Rohár
2022-10-10 17:40     ` Tom Rini
2022-10-10 17:44       ` Pali Rohár
2022-10-10 17:56         ` Tom Rini
2022-10-10 18:01           ` Pali Rohár
2022-10-10 18:14             ` Tom Rini
2022-10-10 19:24               ` Tom Rini
2022-10-10 19:33                 ` Pali Rohár

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.