From: Patrick Williams <patrick@stwcx.xyz>
To: Andrew Geissler <geissonator@gmail.com>
Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: Critical BMC process failure recovery
Date: Tue, 20 Oct 2020 09:28:46 -0500
Message-ID: <20201020142846.GB5030@patrickw3-mbp.lan.stwcx.xyz>
In-Reply-To: <C270F145-2236-4CA1-8D57-A63AB622A47C@gmail.com>


Hi Andrew,

I like the proposal to reuse what systemd already provides.  It does
look like Lei pointed to some existing bbclass that could be enhanced
for this purpose so that any recipe can simply 'inherit ...' and maybe
set a variable to indicate that it is providing "critical services".
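
Roughly what I picture on the recipe side is something like the
following (the class and variable names here are made up for
illustration, not an existing API):

    # my-criticald_git.bb
    inherit critical-service                   # hypothetical bbclass
    CRITICAL_SERVICES = "my-criticald.service"

where the class would install a systemd drop-in for each listed unit
with the restart/reboot policy filled in.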

On Mon, Oct 19, 2020 at 02:53:11PM -0500, Andrew Geissler wrote:
> Greetings,
> 
> I've started initial investigation into two IBM requirements:
> 
> - Reboot the BMC if a "critical" process fails and can not recover
> - Limit the amount of times the BMC reboots for recovery
>   - Limit should be configurable, i.e. 3 resets within 5 minutes

I like that there is a time bound on it here.  If the reset limit
didn't have a time bound, that would be a problem to me, because it
would mean a slow memory leak could eventually get the BMC into this
state.
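
For reference, this is roughly the kind of unit configuration I'd
expect it to map onto (the daemon name and values are made up, and this
only rate-limits restarts within a single boot, so limiting the number
of BMC reboots across resets would still need some persistent state):

    [Unit]
    Description=Hypothetical critical BMC daemon
    # Stop restarting after 3 failed starts within 5 minutes...
    StartLimitIntervalSec=300
    StartLimitBurst=3
    # ...and reboot the BMC once that limit is hit.
    StartLimitAction=reboot

    [Service]
    ExecStart=/usr/bin/my-criticald
    Restart=on-failure
    RestartSec=5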

Do you need to do anything in relation to the WDT and failover
settings there?  I'm thinking you'll need to do something to ensure that
you don't swap flash banks between these resets.  Do you need to do N
resets on one flash bank and then M on the other?

It seems that the most likely cause of N resets in a short time is some
sort of flash corruption, a BMC chip error, or a bug aggravated by some
RWFS setting.  None of these is really recovered by a reset, but at
least you know you're in a bad situation at that point.

>   - If limit reached, display error to panel (if one available) and halt
>     the BMC.

And then what?  What is the remediation for this condition?  Are there
any services, such as SSH, that will continue to run in this state?  I
hope the only answer for remediation is physical access / power cycle.

-- 
Patrick Williams


Thread overview: 11+ messages
2020-10-19 19:53 Critical BMC process failure recovery Andrew Geissler
2020-10-19 21:35 ` Neil Bradley
2020-10-22 15:41   ` Andrew Geissler
2020-10-20  2:58 ` Lei Yu
2020-10-20 18:30   ` Bills, Jason M
2020-10-20 14:28 ` Patrick Williams [this message]
2020-10-22 16:00   ` Andrew Geissler
2020-10-26 13:19     ` Matuszczak, Piotr
2020-10-27 21:57       ` Andrew Geissler
2021-09-02  0:29 ` Andrew Geissler
2022-02-21 20:54   ` Andrew Geissler
