From: Andrew Geissler <geissonator@gmail.com>
To: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: Critical BMC process failure recovery
Date: Wed, 1 Sep 2021 19:29:02 -0500
Message-ID: <F503539B-1F5B-4EC0-A11F-A8A6EEA950B2@gmail.com>
In-Reply-To: <C270F145-2236-4CA1-8D57-A63AB622A47C@gmail.com>



> On Oct 19, 2020, at 2:53 PM, Andrew Geissler <geissonator@gmail.com> wrote:
> 
> Greetings,
> 
> I've started initial investigation into two IBM requirements:
> 
> - Reboot the BMC if a "critical" process fails and cannot recover
> - Limit the number of times the BMC reboots for recovery
>  - Limit should be configurable, i.e. 3 resets within 5 minutes
>  - If limit reached, display error to panel (if one available) and halt
>    the BMC.

I've started to dig into this and have had some internal discussions about it
here at IBM. We're now looking in a somewhat different direction for the case
where a service goes into the failed state. Rebooting the BMC has rarely been
shown to help in these application failure scenarios, and it makes debugging
the issue very difficult. We'd prefer to log an error (with a BMC dump) and
maybe put the BMC state into something reflecting this error state.

Based on our previous emails, though, there does seem to be some interest in
that capability (BMC reboot on service failure). As a flexible option, I'm
thinking of the following:

- Create a new obmc-bmc-service-failure.target
- Create a bbclass or some other mechanism for services to have an
  "OnFailure=obmc-bmc-service-failure.target"
- By default an error log is created in this target
- System owners can plop whatever other services they want into this target
  - Reboot BMC
  - Capture additional debug data
  - ...
- Introduce a new BMC State, Quiesce. The BMC state changes to this when the
  new obmc-bmc-service-failure.target is started. This then gets mapped to
  the redfish/v1/Managers/bmc status as Quiesced so users know the BMC
  has entered a bad state.
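
As a rough illustration of the OnFailure= hookup above, here is a minimal
sketch of what a bbclass (or any other mechanism) might end up generating for
each opted-in service: a systemd drop-in that points the unit at the new
target. The helper names, the drop-in file name "onfailure.conf", and the
example service name are assumptions for illustration only; just the target
name comes from this proposal.

```python
from pathlib import PurePosixPath

# Target name as proposed in this mail.
FAILURE_TARGET = "obmc-bmc-service-failure.target"


def dropin_path(service, root="/etc/systemd/system"):
    """Return where systemd would look for a drop-in for `service`.

    The file name "onfailure.conf" is a hypothetical choice; systemd reads
    any *.conf file inside the <unit>.d/ directory.
    """
    return PurePosixPath(root) / f"{service}.d" / "onfailure.conf"


def render_dropin(target=FAILURE_TARGET):
    """Return drop-in text that starts `target` when the unit fails."""
    return f"[Unit]\nOnFailure={target}\n"


if __name__ == "__main__":
    # "example-critical.service" is a made-up unit name.
    print(dropin_path("example-critical.service"))
    print(render_dropin(), end="")
```

System owners would then add their own units (reboot, extra debug capture,
and so on) to the target without touching the individual services.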

BMC kernel panics and such would still trigger the BMC reboot path, and some
TBD function will ensure we only reboot X times before stopping in the boot
loader or systemd rescue mode.
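
The reboot-limiting function is still TBD, but the "X reboots within some
window" check could look roughly like the sliding-window sketch below. The
class name, method name, and defaults (3 reboots in 300 seconds, taken from
the example in the original requirements) are assumptions, and a real version
would have to persist the timestamps somewhere that survives a BMC reboot,
which this sketch does not do.

```python
from collections import deque


class RebootLimiter:
    """Hypothetical sliding-window limit on recovery reboots."""

    def __init__(self, max_reboots=3, window_seconds=300.0):
        self.max_reboots = max_reboots
        self.window_seconds = window_seconds
        self._history = deque()  # timestamps of reboots that were allowed

    def allow_reboot(self, now):
        """Return True and record `now` if another reboot is within the limit."""
        # Forget reboots that have aged out of the window.
        while self._history and now - self._history[0] >= self.window_seconds:
            self._history.popleft()
        if len(self._history) >= self.max_reboots:
            # Limit reached: caller would halt in the boot loader / rescue mode.
            return False
        self._history.append(now)
        return True
```

With the defaults, reboots requested at t=0, 10 and 20 seconds are allowed, a
fourth at t=30 is refused, and one at t=301 is allowed again because the
first entry has aged out of the window.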

Thoughts on this new idea?

> 
> The goal here is to have the BMC try and get itself back into a working state
> via a reboot of itself.
> 
> This same reboot logic and limits would also apply to kernel panics and/or
> BMC hardware watchdog expirations.
> 
> Some thoughts that have been thrown around internally:
> 
> - Spend more time ensuring code doesn't fail vs. handling them failing
> - Put all BMC code into a single application so it's all or nothing (vs. 
>  trying to pick and choose specific applications and dealing with all of
>  the intricacies of restarting individual ones)
> - Rebooting the BMC and getting the proper ordering of service starts is
>  sometimes easier than testing every individual service restart for recovery
>  paths
> 
> "Critical" processes would be things like mapper or dbus-broker. There's
> definitely a grey area though with other services so we'd need some
> guidelines around defining them and allow the meta layers to have a way
> to deem whichever they want critical.
> 
> So anyway, just throwing this out there to see if anyone has any input
> or is looking for something similar.
> 
> High level, I'd probably start looking into utilizing systemd as much as
> possible. "FailureAction=reboot-force" in the critical services and something
> that monitors for these types of reboots and enforces the reboot limits.
> 
> Andrew

