openbmc.lists.ozlabs.org archive mirror
From: "Matuszczak, Piotr" <piotr.matuszczak@intel.com>
To: Andrew Geissler <geissonator@gmail.com>,
	Patrick Williams <patrick@stwcx.xyz>
Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: RE: Critical BMC process failure recovery
Date: Mon, 26 Oct 2020 13:19:14 +0000	[thread overview]
Message-ID: <CY4PR1101MB2311ABDFBA0EA222BB602B7686190@CY4PR1101MB2311.namprd11.prod.outlook.com> (raw)
In-Reply-To: <A7171080-B143-42AD-B235-951A06B247A4@gmail.com>

Hi, this is quite an interesting discussion. Have you considered some kind of minimal-feature recovery image, to which the BMC can switch after N resets within a defined amount of time? Such an image could hold the error log and send a periodic event about the BMC failure.

Piotr Matuszczak
---------------------------------------------------------------------
Intel Technology Poland sp. z o.o. 
ul. Slowackiego 173, 80-298 Gdansk
KRS 101882
NIP 957-07-52-316

-----Original Message-----
From: openbmc <openbmc-bounces+piotr.matuszczak=intel.com@lists.ozlabs.org> On Behalf Of Andrew Geissler
Sent: Thursday, October 22, 2020 6:00 PM
To: Patrick Williams <patrick@stwcx.xyz>
Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: Critical BMC process failure recovery



> On Oct 20, 2020, at 9:28 AM, Patrick Williams <patrick@stwcx.xyz> wrote:
> 
> Hi Andrew,
> 
> I like the proposal to reuse what systemd already provides.  It does 
> look like Lei pointed to some existing bbclass that could be enhanced 
> for this purpose so that any recipe can simply 'inherit ...' and maybe 
> set a variable to indicate that it is providing "critical services".

Yeah, looks like currently it opts in every service (except for a few special cases). I like the idea of putting it on the individual service to opt itself in. I’ve def seen what James mentions in his response, where you get into situations where the BMC is rebooting itself too much due to non-critical services failing.
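For illustration, a per-service opt-in could be a small systemd drop-in like the sketch below. The directive names (`StartLimitIntervalSec`, `StartLimitBurst`, `StartLimitAction`, `Restart`) are standard systemd; the specific values and the idea of using a drop-in are just an example, not OpenBMC's actual configuration:

```ini
# Hypothetical drop-in for a service that declares itself "critical".
# If the service keeps failing and hits its start-rate limit,
# systemd escalates to a BMC reboot.
[Unit]
# e.g. at most 3 failed (re)starts within 5 minutes...
StartLimitIntervalSec=300
StartLimitBurst=3
# ...then reboot the BMC instead of giving up silently.
StartLimitAction=reboot

[Service]
# Let systemd retry the service a few times before escalating.
Restart=on-failure
RestartSec=5
```

A bbclass could emit such a drop-in for any recipe that sets a "critical service" variable, which matches the opt-in model discussed above.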

> 
> On Mon, Oct 19, 2020 at 02:53:11PM -0500, Andrew Geissler wrote:
>> Greetings,
>> 
>> I've started initial investigation into two IBM requirements:
>> 
>> - Reboot the BMC if a "critical" process fails and can not recover
>> - Limit the amount of times the BMC reboots for recovery
>>  - Limit should be configurable, i.e. 3 resets within 5 minutes
> 
> I like that it has a time bound on it here.  If the reset didn't have 
> a time bound that would be a problem to me because it means that a 
> slow memory leak could eventually get the BMCs into this state.
> 
> Do you need to do anything in relationship with the WDT and failover 
> settings there?  I'm thinking you'll need to do something to ensure 
> that you don't swap flash banks between these resets.  Do you need to 
> do N resets on one flash bank and then M on the other?

I’m hoping to keep the flash bank switch a separate discussion. The key here is to not impact whatever design decision is made there.

We’re still going back and forth a bit on whether we want to continue with that automatic flash bank switch design point. It sometimes causes more confusion than it’s worth.

I know we did make this work with our Witherspoon system from a watchdog perspective. We would reboot a certain amount of times and swap flash banks after a certain limit was reached. I’m not sure how we did it though :)

> 
> It seems that the most likely cause of N resets in a short time is 
> some sort of flash corruption, BMC chip error, or a bug aggravated 
> some RWFS setting.  None of these are particularly recovered by the 
> reset but at least you know you're in a bad situation at that point.

Yeah, I would really like some data on how often a reboot of the BMC really does fix an issue. The focus for us should def be on avoiding the reboot in the first place. But the reboot is our last ditch effort.
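The "N resets within M minutes" policy from the original proposal is essentially a sliding-window counter that has to be persisted across reboots. A minimal sketch (not OpenBMC code; names and persistence details are illustrative) of the decision logic:

```python
# Sketch of the time-bounded reset limit discussed in this thread:
# allow at most MAX_RESETS BMC reboots within WINDOW seconds, then halt.

MAX_RESETS = 3       # e.g. 3 resets...
WINDOW = 5 * 60      # ...within 5 minutes

def should_halt(reset_times, now, max_resets=MAX_RESETS, window=WINDOW):
    """Record a reset at time `now` and report whether the limit is exceeded.

    `reset_times` is a list of prior reset timestamps in seconds; a real
    implementation would persist it across reboots (e.g. in the RWFS or a
    u-boot environment variable).
    """
    reset_times.append(now)
    # Keep only resets that fall inside the sliding window.
    reset_times[:] = [t for t in reset_times if now - t <= window]
    return len(reset_times) > max_resets

# Example: a 4th reset inside 5 minutes trips the limit.
history = []
assert not should_halt(history, 0)
assert not should_halt(history, 60)
assert not should_halt(history, 120)
assert should_halt(history, 180)
```

The time bound is what protects against Patrick's slow-memory-leak scenario: resets spread out over hours never accumulate enough entries in the window to trigger the halt.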

> 
>>  - If limit reached, display error to panel (if one available) and halt
>>    the BMC.
> 
> And then what?  What is the remediation for this condition?  Are there 
> any services, such as SSH, that will continue to run in this state?  I 
> hope the only answer for remediation is physical access / power cycle.

I believe the best option (and what we’ve done historically) is to try and put an error code on the panel and halt in u-boot, requiring physical access / power cycle to recover.
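One way to get the "halt in u-boot" behavior is u-boot's standard bootcount mechanism (`CONFIG_BOOTCOUNT_LIMIT`): u-boot increments `bootcount` on each boot, and once it exceeds `bootlimit` it runs `altbootcmd` instead of `bootcmd`. The environment sketch below uses those real variables, but the values and the panel/halt command are illustrative only:

```
# Hypothetical u-boot environment using the bootcount mechanism.
# After bootcount exceeds bootlimit, altbootcmd runs instead of bootcmd.
bootlimit=3
altbootcmd=echo BMC exceeded boot limit, halting for manual recovery
```

Userspace would clear the counter (e.g. `fw_setenv bootcount 0`) only after the BMC reaches a known-good state, so repeated failed boots accumulate and eventually park the BMC in u-boot, where a panel error code could be displayed and physical access is required to recover.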

> 
> --
> Patrick Williams



Thread overview: 11+ messages
2020-10-19 19:53 Critical BMC process failure recovery Andrew Geissler
2020-10-19 21:35 ` Neil Bradley
2020-10-22 15:41   ` Andrew Geissler
2020-10-20  2:58 ` Lei Yu
2020-10-20 18:30   ` Bills, Jason M
2020-10-20 14:28 ` Patrick Williams
2020-10-22 16:00   ` Andrew Geissler
2020-10-26 13:19     ` Matuszczak, Piotr [this message]
2020-10-27 21:57       ` Andrew Geissler
2021-09-02  0:29 ` Andrew Geissler
2022-02-21 20:54   ` Andrew Geissler
