* SQUASHFS errors and OpenBMC hang
@ 2020-08-29  0:40 Kun Zhao
  2020-09-01 12:35 ` Patrick Williams
  2020-09-01 23:07 ` Milton Miller II
  0 siblings, 2 replies; 5+ messages in thread
From: Kun Zhao @ 2020-08-29  0:40 UTC (permalink / raw)
  To: openbmc


Hi Team,

I've been validating OpenBMC on our POC system for a while, but starting two weeks ago the BMC filesystem sometimes reports failures, and after that the BMC will sometimes hang after running for a while. It started happening on one system and then on another. I tried re-flashing with a programmer and still see the issue. I tried flashing back to the very first known-good OpenBMC image we built and still see the same symptoms. It looks like a SPI ROM failure, but when I flash back the POC system's original third-party BMC firmware there is no such issue at all. Has anyone run into similar issues before?

There are two symptoms:

#1,

The BMC debug console spontaneously shows these errors:

[ 4242.029061] SQUASHFS error: xz decompression failed, data probably corrupt
[ 4242.035970] SQUASHFS error: squashfs_read_data failed to read block 0xce5cb0
[ 4242.043159] SQUASHFS error: Unable to read data cache entry [ce5cb0]
[ 4242.049627] SQUASHFS error: Unable to read page, block ce5cb0, size da44
[ 4242.056386] SQUASHFS error: Unable to read data cache entry [ce5cb0]

After rebooting, the BMC may show that error again and then get stuck reading the rootfs, with the following errors:

[ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}
[ 3.383951] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0a60. {1985,e002,15000044,98f7fb1d}
[ 3.394949] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e09e4. {1985,e002,15000044,98f7fb1d}
[ 3.405958] jffs2: notice: (78) check_node_data: wrong data CRC in data node at 0x003e0af0: read 0x5ab53bf4, calculated 0xb6f14204.
[ 3.417873] jffs2: warning: (78) jffs2_do_read_inode_internal: no data nodes found for ino #8
[ 3.426478] jffs2: Returned error for crccheck of ino #8. Expect badness...
[ 3.492939] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0bc8. {1985,e002,15000044,98f7fb1d}
[ 3.503923] jffs2: warning: (78) jffs2_do_read_inode_internal: no data nodes found for ino #9
[ 3.512462] jffs2: Returned error for crccheck of ino #9. Expect badness...

After that, the BMC either enters recovery mode or hangs.

#2,

The BMC debug console shows the same SQUASHFS errors as above. Checking filesystem usage, we can see rwfs usage keep increasing:

root@dgx:~# df
Filesystem 1K-blocks Used Available Use% Mounted on
dev 212904 0 212904 0% /dev
tmpfs 246728 20172 226556 8% /run
/dev/mtdblock4 22656 22656 0 100% /run/initramfs/ro
/dev/mtdblock5 4096 880 3216 21% /run/initramfs/rw
cow 4096 880 3216 21% /
tmpfs 246728 8 246720 0% /dev/shm
tmpfs 246728 0 246728 0% /sys/fs/cgroup
tmpfs 246728 0 246728 0% /tmp
tmpfs 246728 8 246720 0% /var/volatile

and we can see more and more ipmid coredump files:

root@dgx:~# ls -al /run/initramfs/rw/cow/var/lib/systemd/coredump/
drwxr-xr-x 2 root root 0 Aug 21 16:04 .
-rw-r----- 1 root root 57344 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5710.1598025874000000000000.xzaba143da6d9b5571
-rw-r----- 1 root root 655360 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5710.1598025874000000000000ba58c927628d3950
-rw-r----- 1 root root 0 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5713.1598025880000000000000.xzee8c94e72fc5b173
-rw-r----- 1 root root 655360 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5713.159802588000000000000089ee90c2a557ac1c
drwxr-xr-x 6 root root 0 Jan 1 1970 ..
-rw-r----- 1 root root 92492 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5630.1598025699000000000000.xz
-rw-r----- 1 root root 92572 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5641.1598025723000000000000.xz
-rw-r----- 1 root root 92652 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5645.1598025728000000000000.xz
-rw-r----- 1 root root 92476 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5651.1598025754000000000000.xz

Checking the journal logs, I found that ipmid failed to access files like /usr/share/ipmi-providers/channel_config.json, so ipmid seems to be another victim of the filesystem failure.
And after a while, the BMC just hangs.
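
For reference, the check was along these lines (the unit name here is my
assumption, from the standard phosphor-ipmi-host service):

root@dgx:~# journalctl -u phosphor-ipmi-host -b | grep channel_config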


Some recovery methods exist, but the success rate is very low:


  *   Leave the BMC alone for some time; it sometimes comes back, but not always.
  *   Reboot the BMC or AC-cycle the system; this sometimes works, but not always.


I found the following actions can trigger this failure:


  1.  SSH into the BMC remotely; when the failure is triggered, it shows this error:
$ ssh root@<bmc ip>
ssh_exchange_identification: read: Connection reset by peer


  2.  Set the BMC MAC address with fw_setenv on the BMC debug console, reboot the BMC, and run 'ip a'.
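
For #2, the exact sequence was something like this ('ethaddr' assuming
the conventional U-Boot variable name; the MAC below is an example):

root@dgx:~# fw_setenv ethaddr 00:11:22:33:44:55
root@dgx:~# fw_printenv ethaddr    # confirm it reads back
root@dgx:~# reboot
(after the BMC is back up)
root@dgx:~# ip a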


The code is based on upstream commit 5ddb5fa99ec259 on the master branch.
The flash layout definition is the default openbmc-flash-layout.dtsi.
The SPI ROM is a Macronix MX25L25635F.

Some questions:

  1.  Is any SPI lock feature enabled in OpenBMC?
  2.  If yes, do I have to unlock the u-boot-env partition before running fw_setenv?


Thanks.

Best regards,

Kun Zhao
/*
  zkxz@hotmail.com
*/




* Re: SQUASHFS errors and OpenBMC hang
  2020-08-29  0:40 SQUASHFS errors and OpenBMC hang Kun Zhao
@ 2020-09-01 12:35 ` Patrick Williams
  2020-09-02 22:46   ` Kun Zhao
  2020-09-01 23:07 ` Milton Miller II
  1 sibling, 1 reply; 5+ messages in thread
From: Patrick Williams @ 2020-09-01 12:35 UTC (permalink / raw)
  To: Kun Zhao; +Cc: openbmc


On Sat, Aug 29, 2020 at 12:40:31AM +0000, Kun Zhao wrote:
> Hi Team,
> 
> I've been validating OpenBMC on our POC system for a while, but starting two weeks ago the BMC filesystem sometimes reports failures, and after that the BMC will sometimes hang after running for a while. It started happening on one system and then on another. I tried re-flashing with a programmer and still see the issue. I tried flashing back to the very first known-good OpenBMC image we built and still see the same symptoms. It looks like a SPI ROM failure, but when I flash back the POC system's original third-party BMC firmware there is no such issue at all. Has anyone run into similar issues before?

Yeah, this does look like a bad SPI NOR.  Have you tried flashing a
fresh image to the NOR and then reading it back to confirm all the bits
keep their values?  It is possible that the corruption is hitting the
other BMC code in a less-important location.
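
The read-back half is easy from the BMC shell.  A rough sketch
(untested; assumes the flashed image is still around under /tmp and
that /dev/mtd0 is the whole-chip 'bmc' device -- check /proc/mtd; also
expect u-boot-env and rwfs to differ once the image has booted):

sz=$(stat -c %s /tmp/obmc-image.static.mtd)
head -c "$sz" /dev/mtd0 > /tmp/readback.bin
# list differing bytes; mismatches in u-boot/kernel/rofs point at the NOR
cmp -l /tmp/obmc-image.static.mtd /tmp/readback.bin | head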

> [ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}

I'm surprised to see anyone using jffs2.  Don't we generally use ubifs
in OpenBMC?  Is there a reason you've chosen to use jffs2?

I don't necessarily think jffs2 will be better or worse in this
particular scenario but we've seen lots of upgrade issues over the years
with jffs2.

> The BMC debug console shows the same SQUASHFS errors as above. Checking filesystem usage, we can see rwfs usage keep increasing:
> 
> root@dgx:~# df
> Filesystem 1K-blocks Used Available Use% Mounted on
> dev 212904 0 212904 0% /dev
> tmpfs 246728 20172 226556 8% /run
> /dev/mtdblock4 22656 22656 0 100% /run/initramfs/ro
> /dev/mtdblock5 4096 880 3216 21% /run/initramfs/rw
> cow 4096 880 3216 21% /
> tmpfs 246728 8 246720 0% /dev/shm
> tmpfs 246728 0 246728 0% /sys/fs/cgroup
> tmpfs 246728 0 246728 0% /tmp
> tmpfs 246728 8 246720 0% /var/volatile
> 
> and we can see more and more ipmid coredump files:

This implies to me that we need to adjust the systemd recovery for
ipmid.  We shouldn't just keep re-launching the same process over and
over after a coredump.  Systemd has some thresholding capability.
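
Something like this drop-in would do it (untested sketch; I'm assuming
the unit is phosphor-ipmi-host.service -- check 'systemctl status ipmid'):

mkdir -p /etc/systemd/system/phosphor-ipmi-host.service.d
cat > /etc/systemd/system/phosphor-ipmi-host.service.d/limit.conf <<'EOF'
[Unit]
# give up after 3 crashes within 60s instead of looping forever
StartLimitIntervalSec=60
StartLimitBurst=3
EOF
systemctl daemon-reload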

> I found the following actions can trigger this failure:
> 
> 
>   1.  SSH into the BMC remotely; when the failure is triggered, it shows this error:
> $ ssh root@<bmc ip>
> ssh_exchange_identification: read: Connection reset by peer
> 
> 
>   2.  Set the BMC MAC address with fw_setenv on the BMC debug console, reboot the BMC, and run 'ip a'.

I have no idea why this procedure would solve SPI NOR issues.  It
doesn't seem connected on the surface.

> The code is based on upstream commit 5ddb5fa99ec259 on the master branch.
> The flash layout definition is the default openbmc-flash-layout.dtsi.
> The SPI ROM is a Macronix MX25L25635F.
> 
> Some questions:
> 
>   1.  Is any SPI lock feature enabled in OpenBMC?
>   2.  If yes, do I have to unlock the u-boot-env partition before running fw_setenv?

There is not, to my knowledge, a software SPI lock.  Some machines have
a 'golden' NOR which they enable by, in hardware, setting the
write-protect input pin on the SPI NOR (with a strapping resistor).
Does your machine use this mechanism?  If so, it is possible that you're
booting onto the 'wrong' NOR flash in some conditions and a reboot
resets the chip-select logic in the SPI controller.  (Usually, you have
the watchdog configured to automatically swap the chip-select after some
number of boot failures.)

-- 
Patrick Williams



* RE: SQUASHFS errors and OpenBMC hang
  2020-08-29  0:40 SQUASHFS errors and OpenBMC hang Kun Zhao
  2020-09-01 12:35 ` Patrick Williams
@ 2020-09-01 23:07 ` Milton Miller II
  2020-09-02 22:56   ` Kun Zhao
  1 sibling, 1 reply; 5+ messages in thread
From: Milton Miller II @ 2020-09-01 23:07 UTC (permalink / raw)
  To: Patrick Williams; +Cc: Kun Zhao, openbmc

On September 1, 2020 around 7:36AM in some timezone, Patrick Williams wrote:
>On Sat, Aug 29, 2020 at 12:40:31AM +0000, Kun Zhao wrote:
>> I've been validating OpenBMC on our POC system for a while, but
>> starting two weeks ago the BMC filesystem sometimes reports failures,
>> and after that the BMC will sometimes hang after running for a while.
>> It started happening on one system and then on another. I tried
>> re-flashing with a programmer and still see the issue. I tried
>> flashing back to the very first known-good OpenBMC image we built and
>> still see the same symptoms. It looks like a SPI ROM failure, but
>> when I flash back the POC system's original third-party BMC firmware
>> there is no such issue at all. Has anyone run into similar issues
>> before?
>
>Yeah, this does look like a bad SPI NOR.  Have you tried flashing a
>fresh image to the NOR and then reading it back to confirm all the
>bits keep their values?  It is possible that the corruption is hitting
>the other BMC code in a less-important location.
>
>> [ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header
>> CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}
>
>I'm surprised to see anyone using jffs2.  Don't we generally use
>ubifs in OpenBMC?  Is there a reason you've chosen to use jffs2?
>
>I don't necessarily think jffs2 will be better or worse in this
>particular scenario but we've seen lots of upgrade issues over the
>years with jffs2.

The default layout is static partitions with squashfs over mtdblock 
for the read-only layer and jffs2 for the read-write layer.

The ubifs option is opt-in and the code update supports two images 
so that a new image is always available.  These options should be 
orthogonal but in practice are probably tied in the code update 
repository.

The third option is eMMC support on the sdhci controller.  This 
was prototyped on ast2500 and in use on the ast2600.

There are some differences in the overlay strategy in the current 
builds, but I will support anyone willing to test merging the new 
limited writable directories from ubifs and emmc into the static mtd 
layout.  By that I mean I'm willing to update the init scripts.

>
>> The BMC debug console shows the same SQUASHFS errors as above.
>> Checking filesystem usage, we can see rwfs usage keep increasing:
>> 
>> root@dgx:~# df
>> Filesystem 1K-blocks Used Available Use% Mounted on
>> dev 212904 0 212904 0% /dev
>> tmpfs 246728 20172 226556 8% /run
>> /dev/mtdblock4 22656 22656 0 100% /run/initramfs/ro
>> /dev/mtdblock5 4096 880 3216 21% /run/initramfs/rw
>> cow 4096 880 3216 21% /
>> tmpfs 246728 8 246720 0% /dev/shm
>> tmpfs 246728 0 246728 0% /sys/fs/cgroup
>> tmpfs 246728 0 246728 0% /tmp
>> tmpfs 246728 8 246720 0% /var/volatile
>> 
>> and we can see more and more ipmid coredump files:
>
>This implies to me that we need to adjust the systemd recovery for
>ipmid.  We shouldn't just keep re-launching the same process over and
>over after a coredump.  Systemd has some thresholding capability.
>

I've seen problems in the past where the squashfs image was bigger 
than the allotted space and it became partially overwritten by the 
jffs2 writable filesystem.  We added code that tries to catch this, 
and we have seen such reports, so I wanted to bring it up.  Also, we 
don't support the host accessing the flash controller while Linux is 
up, in case your host is trying to flash the BMC firmware (or even 
read it directly); all data must go through an API such as IPMI or REST.
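
A quick way to sanity-check the fit on a running BMC (my own sketch
here, not the detection code mentioned above; /proc/mtd reports each
partition's size in hex bytes):

grep '"rofs"' /proc/mtd
# e.g. mtd4: 01660000 00010000 "rofs"  ->  0x01660000 bytes available
stat -c %s rofs.squashfs   # must be <= the partition size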

>> I found the following actions can trigger this failure:
>> 
>> 
>>   1.  SSH into the BMC remotely; when the failure is triggered, it
>> shows this error:
>> $ ssh root@<bmc ip>
>> ssh_exchange_identification: read: Connection reset by peer
>> 
>> 
>>   2.  Set the BMC MAC address with fw_setenv on the BMC debug
>> console, reboot the BMC, and run 'ip a'.
>
>I have no idea why this procedure would solve SPI NOR issues.  It
>doesn't seem connected on the surface.
>
>> The code is based on upstream commit 5ddb5fa99ec259 on the master branch.
>> The flash layout definition is the default openbmc-flash-layout.dtsi.
>> The SPI ROM is a Macronix MX25L25635F.
>> 
>> Some questions:
>> 
>>   1.  Is any SPI lock feature enabled in OpenBMC?
>>   2.  If yes, do I have to unlock the u-boot-env partition before
>> running fw_setenv?
>
>There is not, to my knowledge, a software SPI lock.  Some machines
>have a 'golden' NOR which they enable by, in hardware, setting the
>write-protect input pin on the SPI NOR (with a strapping resistor).
>Does your machine use this mechanism?  If so, it is possible that
>you're booting onto the 'wrong' NOR flash in some conditions and a
>reboot resets the chip-select logic in the SPI controller.  (Usually,
>you have the watchdog configured to automatically swap the chip-select
>after some number of boot failures.)
>
>-- 
>Patrick Williams

Our default is that the OS is in control of the flash and we do not 
mark any areas as read-only.

milton
---
I speak only for myself.  But I have written or reviewed the layouts 
and initrd scripting.


* Re: SQUASHFS errors and OpenBMC hang
  2020-09-01 12:35 ` Patrick Williams
@ 2020-09-02 22:46   ` Kun Zhao
  0 siblings, 0 replies; 5+ messages in thread
From: Kun Zhao @ 2020-09-02 22:46 UTC (permalink / raw)
  To: Patrick Williams; +Cc: openbmc


On 9/1/20 5:35 AM, Patrick Williams wrote:
> On Sat, Aug 29, 2020 at 12:40:31AM +0000, Kun Zhao wrote:
>> Hi Team,
>>
>> I've been validating OpenBMC on our POC system for a while, but starting two weeks ago the BMC filesystem sometimes reports failures, and after that the BMC will sometimes hang after running for a while. It started happening on one system and then on another. I tried re-flashing with a programmer and still see the issue. I tried flashing back to the very first known-good OpenBMC image we built and still see the same symptoms. It looks like a SPI ROM failure, but when I flash back the POC system's original third-party BMC firmware there is no such issue at all. Has anyone run into similar issues before?
> Yeah, this does look like a bad SPI NOR. 
Thank you, Patrick, for the comments. I think so too. My only confusion is that the POC system's original third-party BMC doesn't have any issue, and it also uses jffs2.
>  Have you tried flashing a fresh image to the NOR and then reading it
> back to confirm all the bits keep their values?  It is possible that the
> corruption is hitting the other BMC code in a less-important location.

I suspected that, too, so I burned my image to the NOR, booted it, and then read it back. The only differences in the read-back image are the contents of the u-boot-env and rwfs partitions, which is expected, and no data overflowed across any partition boundary either.

I also tried moving the rofs/rwfs positions, making their sizes bigger and smaller, reducing the kernel partition size, and putting 64KB neutral zones between them. None of that improved things.

>
>> [ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}
> I'm surprised to see anyone using jffs2.  Don't we generally use ubifs
> in OpenBMC?  Is there a reason you've chosen to use jffs2?
I just used the default settings based on ast2500-evb for our POC. Thanks for the hint, though; I'm trying to enable ubifs now.
>
> I don't necessarily think jffs2 will be better or worse in this
> particular scenario but we've seen lots of upgrade issues over the years
> with jffs2.
>
>> The BMC debug console shows the same SQUASHFS errors as above. Checking filesystem usage, we can see rwfs usage keep increasing:
>>
>> root@dgx:~# df
>> Filesystem 1K-blocks Used Available Use% Mounted on
>> dev 212904 0 212904 0% /dev
>> tmpfs 246728 20172 226556 8% /run
>> /dev/mtdblock4 22656 22656 0 100% /run/initramfs/ro
>> /dev/mtdblock5 4096 880 3216 21% /run/initramfs/rw
>> cow 4096 880 3216 21% /
>> tmpfs 246728 8 246720 0% /dev/shm
>> tmpfs 246728 0 246728 0% /sys/fs/cgroup
>> tmpfs 246728 0 246728 0% /tmp
>> tmpfs 246728 8 246720 0% /var/volatile
>>
>> and we can see more and more ipmid coredump files:
> This implies to me that we need to adjust the systemd recovery for
> ipmid.  We shouldn't just keep re-launching the same process over and
> over after a coredump.  Systemd has some thresholding capability.
Can I disable the coredump for ipmid?
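
For instance, would something like this be the right knob?  (Just
guessing from the systemd-coredump docs, and it would disable on-disk
cores globally, not only for ipmid.)

# /etc/systemd/coredump.conf
[Coredump]
Storage=none
ProcessSizeMax=0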
>> I found the following actions can trigger this failure:
>>
>>
>>   1.  SSH into the BMC remotely; when the failure is triggered, it shows this error:
>> $ ssh root@<bmc ip>
>> ssh_exchange_identification: read: Connection reset by peer
>>
>>
>>   2.  Set the BMC MAC address with fw_setenv on the BMC debug console, reboot the BMC, and run 'ip a'.
> I have no idea why this procedure would solve SPI NOR issues.  It
> doesn't seem connected on the surface.
Not to solve the issues; these actions trigger the errors to be printed on the BMC debug console. I think the reason is that some files on rwfs or u-boot-env get read or written when we do them.
>> The code is based on upstream commit 5ddb5fa99ec259 on the master branch.
>> The flash layout definition is the default openbmc-flash-layout.dtsi.
>> The SPI ROM is a Macronix MX25L25635F.
>>
>> Some questions:
>>
>>   1.  Is any SPI lock feature enabled in OpenBMC?
>>   2.  If yes, do I have to unlock the u-boot-env partition before running fw_setenv?
> There is not, to my knowledge, a software SPI lock.  Some machines have
> a 'golden' NOR which they enable by, in hardware, setting the
> write-protect input pin on the SPI NOR (with a strapping resistor).
> Does your machine use this mechanism?  If so, it is possible that you're
> booting onto the 'wrong' NOR flash in some conditions and a reboot
> resets the chip-select logic in the SPI controller.  (Usually, you have
> the watchdog configured to automatically swap the chip-select after some
> number of boot failures.)
>
No, we have only one NOR flash in the system. By 'SPI lock feature' I mean the NOR flash chip's software block-protection functions, which can enable/disable write protection for particular blocks holding BMC code, not the hardware W/P pin.
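
If it helps, that state can in principle be poked from the BMC shell
with mtd-utils -- a sketch, assuming the SPI-NOR driver implements
locking for this chip and that /dev/mtd2 is u-boot-env (check /proc/mtd):

flash_unlock /dev/mtd2    # clear block-protect bits over the partition
fw_setenv ethaddr 00:11:22:33:44:55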


* Re: SQUASHFS errors and OpenBMC hang
  2020-09-01 23:07 ` Milton Miller II
@ 2020-09-02 22:56   ` Kun Zhao
  0 siblings, 0 replies; 5+ messages in thread
From: Kun Zhao @ 2020-09-02 22:56 UTC (permalink / raw)
  To: Milton Miller II, Patrick Williams; +Cc: openbmc


On 9/1/20 4:07 PM, Milton Miller II wrote:
> On September 1, 2020 around 7:36AM in some timezone, Patrick Williams wrote:
>> On Sat, Aug 29, 2020 at 12:40:31AM +0000, Kun Zhao wrote:
>>> I've been validating OpenBMC on our POC system for a while, but
>>> starting two weeks ago the BMC filesystem sometimes reports
>>> failures, and after that the BMC will sometimes hang after running
>>> for a while. It started happening on one system and then on another.
>>> I tried re-flashing with a programmer and still see the issue. I
>>> tried flashing back to the very first known-good OpenBMC image we
>>> built and still see the same symptoms. It looks like a SPI ROM
>>> failure, but when I flash back the POC system's original third-party
>>> BMC firmware there is no such issue at all. Has anyone run into
>>> similar issues before?
>>
>> Yeah, this does look like a bad SPI NOR.  Have you tried flashing a
>> fresh image to the NOR and then reading it back to confirm all the
>> bits keep their values?  It is possible that the corruption is
>> hitting the other BMC code in a less-important location.
>>
>>> [ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header
>>> CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}
>>
>> I'm surprised to see anyone using jffs2.  Don't we generally use
>> ubifs in OpenBMC?  Is there a reason you've chosen to use jffs2?
>>
>> I don't necessarily think jffs2 will be better or worse in this
>> particular scenario but we've seen lots of upgrade issues over the
>> years with jffs2.
> The default layout is static partitions with squashfs over mtdblock 
> for the read-only layer and jffs2 for the read-write layer.
>
> The ubifs option is opt-in and the code update supports two images 
> so that a new image is always available.  These options should be 
> orthogonal but in practice are probably tied in the code update 
> repository.
>
> The third option is eMMC support on the sdhci controller.  This 
> was prototyped on ast2500 and in use on the ast2600.
>
> There are some differences in the overlay strategy in the current 
> builds, but I will support anyone willing to test merging the new 
> limited writable directories from ubifs and emmc into the static mtd 
> layout.  By that I mean I'm willing to update the init scripts.
Thank you, Milton, for the comments. Can I push a ubifs image to a statically partitioned BMC via code update, or do I have to program it directly to the NOR flash?
>>> The BMC debug console shows the same SQUASHFS errors as above.
>>> Checking filesystem usage, we can see rwfs usage keep increasing:
>>> root@dgx:~# df
>>> Filesystem 1K-blocks Used Available Use% Mounted on
>>> dev 212904 0 212904 0% /dev
>>> tmpfs 246728 20172 226556 8% /run
>>> /dev/mtdblock4 22656 22656 0 100% /run/initramfs/ro
>>> /dev/mtdblock5 4096 880 3216 21% /run/initramfs/rw
>>> cow 4096 880 3216 21% /
>>> tmpfs 246728 8 246720 0% /dev/shm
>>> tmpfs 246728 0 246728 0% /sys/fs/cgroup
>>> tmpfs 246728 0 246728 0% /tmp
>>> tmpfs 246728 8 246720 0% /var/volatile
>>>
>>> and we can see more and more ipmid coredump files:
>> This implies to me that we need to adjust the systemd recovery for
>> ipmid.  We shouldn't just keep re-launching the same process over and
>> over after a coredump.  Systemd has some thresholding capability.
>>
> I've seen problems in the past where the squashfs image was bigger 
> than the allotted space and it became partially overwritten by the 
> jffs2 writable filesystem.  We added code that tries to catch this, 
> and we have seen such reports, so I wanted to bring it up.
Do you still have links to those issues?
> Also, we don't support the host accessing the flash controller while 
> Linux is up, in case your host is trying to flash the BMC firmware 
> (or even read it directly); all data must go through an API such as 
> IPMI or REST.
Do you mean that if the BMC is up and running and I use a tool like socflash to program the BMC directly from the host OS, there will be a problem?
>>> I found the following actions can trigger this failure:
>>>
>>>
>>>   1.  SSH into the BMC remotely; when the failure is triggered, it
>>> shows this error:
>>> $ ssh root@<bmc ip>
>>> ssh_exchange_identification: read: Connection reset by peer
>>>
>>>
>>>   2.  Set the BMC MAC address with fw_setenv on the BMC debug
>>> console, reboot the BMC, and run 'ip a'.
>>
>> I have no idea why this procedure would solve SPI NOR issues.  It
>> doesn't seem connected on the surface.
>>
>>> The code is based on upstream commit 5ddb5fa99ec259 on the master branch.
>>> The flash layout definition is the default openbmc-flash-layout.dtsi.
>>> The SPI ROM is a Macronix MX25L25635F.
>>>
>>> Some questions:
>>>
>>>   1.  Is any SPI lock feature enabled in OpenBMC?
>>>   2.  If yes, do I have to unlock the u-boot-env partition before
>>> running fw_setenv?
>>
>> There is not, to my knowledge, a software SPI lock.  Some machines
>> have a 'golden' NOR which they enable by, in hardware, setting the
>> write-protect input pin on the SPI NOR (with a strapping resistor).
>> Does your machine use this mechanism?  If so, it is possible that
>> you're booting onto the 'wrong' NOR flash in some conditions and a
>> reboot resets the chip-select logic in the SPI controller.  (Usually,
>> you have the watchdog configured to automatically swap the
>> chip-select after some number of boot failures.)
>>
>> -- 
>> Patrick Williams
> Our default is that the OS is in control of the flash and we do not 
> mark any areas as read-only.

Got it. Thanks for the confirmation.


> milton
> ---
> I speak only for myself.  But I have written or reviewed the layouts 
> and initrd scripting.
>
Kun

