* The Power9 host booting problem with OpenBMC kernel 5.7.x
@ 2020-08-10 18:44 Alexander A. Filippov
  2020-08-11  6:12 ` Joel Stanley
  0 siblings, 1 reply; 6+ messages in thread
From: Alexander A. Filippov @ 2020-08-10 18:44 UTC (permalink / raw)
  To: Joel Stanley, Eddie James; +Cc: Alexander Amelkin, Artem Senichev, openbmc

[-- Attachment #1: Type: text/plain, Size: 3203 bytes --]

Since the kernel in OpenBMC was updated to 5.7.x, we have had a problem
booting P9 hosts.
On a host with one Power9 CPU, the failure happens while Petitboot is trying
to initialize the network, and it leads to host restarts.
On a host with two Power9 CPUs, the same failure happens during OS boot. It
increases boot time, but in the end the host OS starts completely.

In both cases the error looks like:

[   22.564986] rcu: INFO: rcu_sched self-detected stall on CPU
[   22.565013] rcu:     7-....: (2155 ticks this GP) idle=1da/0/0x3 softirq=92/126 fqs=1050 
[   22.565048]  (t=2100 jiffies g=-1131 q=31)
[   22.565071] NMI backtrace for cpu 7
[   22.565084] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.6.16 #2
[   22.565110] Call Trace:
[   22.565134] [c000003ff463f280] [c000000000c457d8] dump_stack+0xbc/0x104 (unreliable)
[   22.565165] [c000003ff463f2c0] [c000000000c50af4] nmi_cpu_backtrace+0x104/0x130
[   22.565213] [c000003ff463f340] [c000000000c50bf4] nmi_trigger_cpumask_backtrace+0xd4/0x1d0
[   22.565244] [c000003ff463f3e0] [c0000000000679a8] arch_trigger_cpumask_backtrace+0x28/0x40
[   22.565293] [c000003ff463f400] [c0000000001c504c] rcu_dump_cpu_stacks+0xe0/0x154
[   22.565342] [c000003ff463f450] [c0000000001c35e8] rcu_sched_clock_irq+0x408/0xaa0
[   22.565381] [c000003ff463f530] [c0000000001d2774] update_process_times+0x44/0x90
[   22.565410] [c000003ff463f560] [c0000000001e81c8] tick_periodic+0xf8/0x120
[   22.565448] [c000003ff463f590] [c0000000001e822c] tick_handle_periodic+0x3c/0xd0
[   22.565488] [c000003ff463f5d0] [c00000000002cad0] timer_interrupt+0x1d0/0x300
[   22.565528] [c000003ff463f630] [c00000000000e6f8] fast_exception_return+0x1a8/0x1cc
[   22.565570] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
                   LR = arch_local_irq_restore+0x5c/0x90
[   22.565605] [c000003ff463f930] [c000000000164220] vtime_account_irq_enter+0x40/0x70 (unreliable)
[   22.565657] [c000003ff463f950] [c000000000c6d038] __do_softirq+0xd8/0x474
[   22.565704] [c000003ff463fa50] [c000000000121968] irq_exit+0x88/0x100
[   22.565741] [c000003ff463fa80] [c00000000002cbc4] timer_interrupt+0x2c4/0x300
[   22.565781] [c000003ff463fae0] [c00000000000e6f8] fast_exception_return+0x1a8/0x1cc
[   22.565831] --- interrupt: 901 at arch_local_irq_restore+0x7c/0x90
                   LR = arch_local_irq_restore+0x48/0x90
[   22.565896] [c000003ff463fde0] [0000000000000080] 0x80 (unreliable)
[   22.565934] [c000003ff463fe00] [c0000000000242a8] arch_cpu_idle+0xb8/0x150
[   22.565971] [c000003ff463fe30] [c000000000c6c234] default_idle_call+0x64/0x78
[   22.566020] [c000003ff463fe50] [c000000000164ee8] do_idle+0x208/0x3f0
[   22.566067] [c000003ff463fed0] [c000000000165304] cpu_startup_entry+0x44/0x50
[   22.566105] [c000003ff463ff00] [c0000000000542b4] start_secondary+0x614/0x620
[   22.566145] [c000003ff463ff90] [c00000000000b354] start_secondary_prolog+0x10/0x14
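
When going through longer console logs, the stall reports can be picked out
mechanically. A minimal sketch, assuming Python 3 is available wherever the
logs are examined; the helper name find_rcu_stalls and the two patterns are
illustrative and were written only against the two "rcu:" lines shown above,
so other kernels may format these lines differently:

```python
import re

# Header line of a stall report, as it appears in the log above.
STALL_RE = re.compile(r"rcu: INFO: (\w+) self-detected stall on CPU")
# Per-CPU detail line, e.g. "rcu:     7-....: (2155 ticks this GP)".
CPU_RE = re.compile(r"rcu:\s+(\d+)-\S*: \((\d+) ticks this GP\)")


def find_rcu_stalls(log_text):
    """Return one (flavor, cpu, ticks) tuple per stall report in log_text."""
    stalls = []
    flavor = None
    for line in log_text.splitlines():
        header = STALL_RE.search(line)
        if header:
            # Remember the RCU flavor until the matching detail line shows up.
            flavor = header.group(1)
            continue
        detail = CPU_RE.search(line)
        if detail and flavor is not None:
            stalls.append((flavor, int(detail.group(1)), int(detail.group(2))))
            flavor = None
    return stalls


# The first three lines of the report above, used as sample input.
SAMPLE = """\
[   22.564986] rcu: INFO: rcu_sched self-detected stall on CPU
[   22.565013] rcu:     7-....: (2155 ticks this GP) idle=1da/0/0x3 softirq=92/126 fqs=1050
[   22.565048]  (t=2100 jiffies g=-1131 q=31)
"""

if __name__ == "__main__":
    for flavor, cpu, ticks in find_rcu_stalls(SAMPLE):
        print(f"{flavor}: CPU {cpu}, {ticks} ticks this GP")
```

Running it over a full console log gives a quick count of how many stalls
occurred and on which CPUs.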


So, I have two questions:
- Could you please check whether Romulus is also affected by this issue?
- Do you have any idea what is going wrong?

I've attached the tarball with full logs.
- poopsy is a system with two Power9 CPUs
- whoopsy is a system with one Power9 CPU

--
Regards,
Alexander

[-- Attachment #2: nicole-broken-merge.tar.gz --]
[-- Type: application/gzip, Size: 90810 bytes --]

* Re: The Power9 host booting problem with OpenBMC kernel 5.7.x
  2020-08-10 18:44 The Power9 host booting problem with OpenBMC kernel 5.7.x Alexander A. Filippov
@ 2020-08-11  6:12 ` Joel Stanley
  2020-08-11 11:55   ` Artem Senichev
  2020-08-11 18:33   ` Alexander A. Filippov
  0 siblings, 2 replies; 6+ messages in thread
From: Joel Stanley @ 2020-08-11  6:12 UTC (permalink / raw)
  To: Alexander A. Filippov
  Cc: Eddie James, Alexander Amelkin, Artem Senichev, openbmc

On Mon, 10 Aug 2020 at 18:48, Alexander A. Filippov
<a.filippov@yadro.com> wrote:
>
> Since the kernel in OpenBMC was updated to 5.7.x, we have had a problem
> booting P9 hosts.
> On a host with one Power9 CPU, the failure happens while Petitboot is trying
> to initialize the network, and it leads to host restarts.
> On a host with two Power9 CPUs, the same failure happens during OS boot. It
> increases boot time, but in the end the host OS starts completely.

Oh no. I have spent some time testing the 5.7 tree primarily on
Tacoma, our ast2600/p9 platform. We saw some strange systemd failures,
where services such as udevd and journald would be killed by systemd's
watchdog functionality. I did some preliminary debugging but didn't
find a root cause.

I have since published a 5.8 based tree that does not suffer from this
issue. Could you give that a spin on your hardware and see if it
recreates your issue?

 https://gerrit.openbmc-project.xyz/c/openbmc/meta-aspeed/+/35315

> So, I have two questions:
> - Could you please check whether Romulus is also affected by this issue?
> - Do you have any idea what is going wrong?

I'll fire up a romulus and see if it reproduces.

My guess is it's something to do with the timekeeping, irq or rcu
code. All areas of complexity!

Thanks for the report.

Cheers,

Joel

> I've attached the tarball with full logs.
> - poopsy is a system with two Power9 CPUs
> - whoopsy is a system with one Power9 CPU
>
> --
> Regards,
> Alexander

* Re: The Power9 host booting problem with OpenBMC kernel 5.7.x
  2020-08-11  6:12 ` Joel Stanley
@ 2020-08-11 11:55   ` Artem Senichev
  2020-08-11 18:33   ` Alexander A. Filippov
  1 sibling, 0 replies; 6+ messages in thread
From: Artem Senichev @ 2020-08-11 11:55 UTC (permalink / raw)
  To: Joel Stanley
  Cc: Alexander A. Filippov, Eddie James, Alexander Amelkin, openbmc

On Tue, Aug 11, 2020 at 06:12:30AM +0000, Joel Stanley wrote:
> On Mon, 10 Aug 2020 at 18:48, Alexander A. Filippov
> <a.filippov@yadro.com> wrote:
> >
> > Since the kernel in OpenBMC was updated to 5.7.x, we have had a problem
> > booting P9 hosts.
> > On a host with one Power9 CPU, the failure happens while Petitboot is trying
> > to initialize the network, and it leads to host restarts.
> > On a host with two Power9 CPUs, the same failure happens during OS boot. It
> > increases boot time, but in the end the host OS starts completely.
> 
> Oh no. I have spent some time testing the 5.7 tree primarily on
> Tacoma, our ast2600/p9 platform. We saw some strange systemd failures,
> where services such as udevd and journald would be killed by systemd's
> watchdog functionality. I did some preliminary debugging but didn't
> find a root cause.
> 
> I'll fire up a romulus and see if it reproduces.
> 
> My guess is it's something to do with the timekeeping, irq or rcu
> code. All areas of complexity!
> 

We saw similar behaviour on P8 when we tried to use ColdFire FSI:
https://github.com/openbmc/openbmc/issues/3433

In that issue, htop shows 100% load on one host CPU, and it does not come
from an OS task. It looks like FSI never stops working and fully loads one
core.

-- 
Regards,
Artem Senichev
Software Engineer, YADRO.

* Re: The Power9 host booting problem with OpenBMC kernel 5.7.x
  2020-08-11  6:12 ` Joel Stanley
  2020-08-11 11:55   ` Artem Senichev
@ 2020-08-11 18:33   ` Alexander A. Filippov
  2020-08-12  8:56     ` Joel Stanley
  1 sibling, 1 reply; 6+ messages in thread
From: Alexander A. Filippov @ 2020-08-11 18:33 UTC (permalink / raw)
  To: Joel Stanley
  Cc: Alexander A. Filippov, Eddie James, Alexander Amelkin,
	Artem Senichev, openbmc

On Tue, Aug 11, 2020 at 06:12:30AM +0000, Joel Stanley wrote:
> On Mon, 10 Aug 2020 at 18:48, Alexander A. Filippov
> <a.filippov@yadro.com> wrote:
> >
> > Since the kernel in OpenBMC was updated to 5.7.x, we have had a problem
> > booting P9 hosts.
> > On a host with one Power9 CPU, the failure happens while Petitboot is trying
> > to initialize the network, and it leads to host restarts.
> > On a host with two Power9 CPUs, the same failure happens during OS boot. It
> > increases boot time, but in the end the host OS starts completely.
> 
> Oh no. I have spent some time testing the 5.7 tree primarily on
> Tacoma, our ast2600/p9 platform. We saw some strange systemd failures,
> where services such as udevd and journald would be killed by systemd's
> watchdog functionality. I did some preliminary debugging but didn't
> find a root cause.
> 
> I have since published a 5.8 based tree that does not suffer from this
> issue. Could you give that a spin on your hardware and see if it
> recreates your issue?
> 
>  https://gerrit.openbmc-project.xyz/c/openbmc/meta-aspeed/+/35315
> 

With kernel 5.8 the host still does not boot.
I've checked both machines and they give very different results:
 - On the machine with two CPUs the issue still reproduces.
   I see no difference in either the behavior or the logs.
 - On the machine with one CPU the failure is due to the PNOR flash.
   It looks like this:

[ 16:23:27 ] --== Welcome to Hostboot hostboot-9865ef9/hbicore.bin ==--
[ 16:23:27 ] 
[ 16:23:27 ]   5.31049|secure|SecureROM valid - enabling functionality
[ 16:23:30 ]   8.00820|Booting from SBE side 0 on master proc=00050000
[ 16:23:30 ]   8.04587|ISTEP  6. 5 - host_init_fsi
[ 16:23:30 ]   8.21815|ISTEP  6. 6 - host_set_ipl_parms
[ 16:23:30 ]   8.40171|ISTEP  6. 7 - host_discover_targets
[ 16:23:32 ]   9.55142|HWAS|PRESENT> DIMM[03]=A0A0000000000000
[ 16:23:32 ]   9.55144|HWAS|PRESENT> Proc[05]=8000000000000000
[ 16:23:32 ]   9.55145|HWAS|PRESENT> Core[07]=33FFC30000000000
[ 16:23:33 ]  10.38865|ISTEP  6. 8 - host_update_master_tpm
[ 16:23:33 ]  10.41071|SECURE|Security Access Bit> 0x0000000000000000
[ 16:23:33 ]  10.41072|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
[ 16:23:33 ]  10.41089|ISTEP  6. 9 - host_gard
[ 16:23:33 ]  10.68154|HWAS|FUNCTIONAL> DIMM[03]=A0A0000000000000
[ 16:23:33 ]  10.68156|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
[ 16:23:33 ]  10.68157|HWAS|FUNCTIONAL> Core[07]=33FFC30000000000
[ 16:23:33 ]  10.68776|ISTEP  6.11 - host_start_occ_xstop_handler
[ 16:23:34 ]  11.10376|ECC error in PNOR flash in section offset 0x030DF600
[ 16:23:34 ] 
[ 16:23:34 ]  11.10387|System shutting down with error status 0x60F
[ 16:24:52 ] 
[ 16:24:52 ] 
[ 16:24:52 ] --== Welcome to SBE - CommitId[0xc58e8fd0] ==--


   After that the PNOR flash is corrupted, and every further boot attempt
   stops at the 'SBE starting hostboot' stage.

I've noticed that kernel 5.8 detects the flash chip incorrectly:
mx25l51245g instead of mx66l51235f.
It happens on both machines, and I don't understand why it causes problems
on only one of them.

After restoring the previous firmware and power cycling, both machines work
fine.

> > So, I have two questions:
> > - Could you please check whether Romulus is also affected by this issue?
> > - Do you have any idea what is going wrong?
> 
> I'll fire up a romulus and see if it reproduces.
> 
> My guess is it's something to do with the timekeeping, irq or rcu
> code. All areas of complexity!
> 
> Thanks for the report.
> 
> Cheers,
> 
> Joel
> 
> > I've attached the tarball with full logs.
> > - poopsy is a system with two Power9 CPUs
> > - whoopsy is a system with one Power9 CPU
> >
> > --
> > Regards,
> > Alexander

--
Regards,
Alexander

* Re: The Power9 host booting problem with OpenBMC kernel 5.7.x
  2020-08-11 18:33   ` Alexander A. Filippov
@ 2020-08-12  8:56     ` Joel Stanley
  2020-08-12 13:59       ` Alexander A. Filippov
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Stanley @ 2020-08-12  8:56 UTC (permalink / raw)
  To: Alexander A. Filippov
  Cc: Eddie James, Alexander Amelkin, Artem Senichev, openbmc

Thanks for the response. I've merged the two threads, and I have a
candidate for a fix.

On Tue, 11 Aug 2020 at 18:33, Alexander A. Filippov
<a.filippov@yadro.com> wrote:
> With kernel 5.8 the host still does not boot.
> I've checked both machines and they give very different results:
>  - On the machine with two CPUs the issue still reproduces.
>    I see no difference in either the behavior or the logs.
>  - On the machine with one CPU the failure is due to the PNOR flash.
>    It looks like this:

>
> I've noticed that kernel 5.8 detects the flash chip incorrectly:
> mx25l51245g instead of mx66l51235f.
> It happens on both machines, and I don't understand why it causes problems
> on only one of them.

I found upstream v5.8 has a regression in the spi-nor driver on
aspeed. I've put a revert of the patch that caused the regression on
the list, but it requires some more investigation to find a proper
fix:

 https://patchwork.ozlabs.org/project/openbmc/patch/20200812035847.2352277-1-joel@jms.id.au/

On Tue, 11 Aug 2020 at 11:54, Artem Senichev <artemsen@gmail.com> wrote:
> > My guess is it's something to do with the timekeeping, irq or rcu
> > code. All areas of complexity!
> >
>
> We saw similar behaviour on P8 when we tried to use ColdFire FSI:
> https://github.com/openbmc/openbmc/issues/3433
>
> In that issue, htop shows 100% load on one host CPU, and it does not come
> from an OS task. It looks like FSI never stops working and fully loads one
> core.

I think we have an issue with the irq polarity of the vuart device.
Did you notice an excessive number of lpc_serirq interrupts on the
host (check /proc/interrupts)?
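
The check can also be scripted. A minimal sketch, assuming Python 3 on the
host and the usual /proc/interrupts layout (a header row of CPU names, then
one row per IRQ with per-CPU counts first and the handler name last). The
helper name irq_totals and the sample rows are made up for illustration;
only the name lpc_serirq comes from this thread:

```python
def irq_totals(interrupts_text, name_substring):
    """Sum the per-CPU counts of every row whose text contains name_substring.

    Returns a dict keyed by the last column of each matching row.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        if name_substring not in line:
            continue
        _label, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        # Only the first ncpus columns are counters; the rest describe the IRQ.
        counts = [int(f) for f in fields[:ncpus] if f.isdigit()]
        totals[fields[-1]] = totals.get(fields[-1], 0) + sum(counts)
    return totals


# Made-up sample in the assumed layout; real counts will differ.
SAMPLE = """\
           CPU0       CPU1
 16:       1000       2000  XICS   5 Level     lpc_serirq_mux1
 17:         12          3  XICS   6 Level     other_device
"""

if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        for name, total in irq_totals(f.read(), "lpc_serirq").items():
            print(f"{name}: {total}")
```

Comparing the totals from two runs a few seconds apart shows whether the
counter is still climbing.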

Try doing this on your BMC before booting your host:

root@bmc:~# echo 0 > /sys/devices/platform/ahb/ahb:apb/1e787000.serial/sirq_polarity

If that fixes it we can make a change to the device tree to make the
setting permanent.
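
For repeated testing, the same write can be wrapped in a small script that
also reports the previous value. This is only a sketch under assumptions:
Python 3 must be present on the BMC image, the attribute is assumed to be
readable as well as writable, and the restriction to 0 or 1 is this sketch's
own choice. The helper name set_sirq_polarity is hypothetical; the path is
the one from the command above:

```python
from pathlib import Path

# Path taken from the command above; it is specific to that BMC's layout.
SIRQ_POLARITY = Path(
    "/sys/devices/platform/ahb/ahb:apb/1e787000.serial/sirq_polarity"
)


def set_sirq_polarity(value, attr=SIRQ_POLARITY):
    """Write value to the attribute and return (previous, current) as strings."""
    if value not in (0, 1):
        # Assumption of this sketch: only 0 and 1 are meaningful here.
        raise ValueError("sirq_polarity expects 0 or 1")
    attr = Path(attr)
    previous = attr.read_text().strip()
    attr.write_text(f"{value}\n")
    # Read back so the caller can confirm the write took effect.
    return previous, attr.read_text().strip()


if __name__ == "__main__":
    before, after = set_sirq_polarity(0)
    print(f"sirq_polarity: {before} -> {after}")
```

Like the echo command, this changes the setting only on the running system.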

Cheers,

Joel

* Re: The Power9 host booting problem with OpenBMC kernel 5.7.x
  2020-08-12  8:56     ` Joel Stanley
@ 2020-08-12 13:59       ` Alexander A. Filippov
  0 siblings, 0 replies; 6+ messages in thread
From: Alexander A. Filippov @ 2020-08-12 13:59 UTC (permalink / raw)
  To: Joel Stanley
  Cc: Alexander A. Filippov, Eddie James, Alexander Amelkin,
	Artem Senichev, openbmc

On Wed, Aug 12, 2020 at 08:56:16AM +0000, Joel Stanley wrote:
> Thanks for the response. I've merged the two threads, and I have a
> candidate for a fix.
> 
> On Tue, 11 Aug 2020 at 18:33, Alexander A. Filippov
> <a.filippov@yadro.com> wrote:
> > With kernel 5.8 the host still does not boot.
> > I've checked both machines and they give very different results:
> >  - On the machine with two CPUs the issue still reproduces.
> >    I see no difference in either the behavior or the logs.
> >  - On the machine with one CPU the failure is due to the PNOR flash.
> >    It looks like this:
> 
> >
> > I've noticed that kernel 5.8 detects the flash chip incorrectly:
> > mx25l51245g instead of mx66l51235f.
> > It happens on both machines, and I don't understand why it causes problems
> > on only one of them.
> 
> I found upstream v5.8 has a regression in the spi-nor driver on
> aspeed. I've put a revert of the patch that caused the regression on
> the list, but it requires some more investigation to find a proper
> fix:
> 
>  https://patchwork.ozlabs.org/project/openbmc/patch/20200812035847.2352277-1-joel@jms.id.au/
> 

Yes, this solves the problem with the flash devices.
They are still reported under other model names, but they work properly.


> On Tue, 11 Aug 2020 at 11:54, Artem Senichev <artemsen@gmail.com> wrote:
> > > My guess is it's something to do with the timekeeping, irq or rcu
> > > code. All areas of complexity!
> > >
> >
> > We saw similar behaviour on P8 when we tried to use ColdFire FSI:
> > https://github.com/openbmc/openbmc/issues/3433
> >
> > In that issue, htop shows 100% load on one host CPU, and it does not come
> > from an OS task. It looks like FSI never stops working and fully loads one
> > core.
> 
> I think we have an issue with the irq polarity of the vuart device.
> Did you notice an excessive number of lpc_serirq interrupts on the
> host (check /proc/interrupts)?

You are right: the lpc_serirq_mux1 count is 183507008 just after the host OS
has booted.

> 
> Try doing this on your BMC before booting your host:
> 
> root@bmc:~# echo 0 > /sys/devices/platform/ahb/ahb:apb/1e787000.serial/sirq_polarity
>

Yes, after this both hosts work properly.

Thanks for your help.

> If that fixes it we can make a change to the device tree to make the
> setting permanent.
> 
> Cheers,
> 
> Joel

--
Regards,
Alexander
