From: Neil Bradley <Neil_Bradley@phoenix.com>
To: Andrew Geissler <geissonator@gmail.com>,
	OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: RE: Critical BMC process failure recovery
Date: Mon, 19 Oct 2020 21:35:43 +0000
Message-ID: <95ad99d7921c405e93b794463d702853@SCL-EXCHMB-13.phoenix.com>
In-Reply-To: <C270F145-2236-4CA1-8D57-A63AB622A47C@gmail.com>

Hey Andrew!

At least initially, the requirements don't really seem like requirements - they read more like someone's idea of what a solution would be. For example, why reset 3 times? Why not 10? Or 2? It seems completely arbitrary. If the BMC resets twice in a row, there's no reason to think it would be OK the 3rd time. It's kinda like how people have been known to do 4-5 firmware updates to "fix" a problem and it "still doesn't work". 😉

If the ultimate goal is availability, then there's more nuance to the discussion. Let's assume the goal is "highest availability possible".

With that in mind, defining what "failure" is gets to be a bit more convoluted. Back when we did the CMM code for the Intel modular server, we had a several-pronged approach:

1) Run procmon - Look for any service that is supposed to be running (but isn't) and restart it and/or its process dependency tree.
2) Create a monitor (either a standalone program or a script) that periodically connects to the various services available - IPMI, web, KVM, etc. - think of it like a functional "ping" (a rough sketch of the idea follows below). A bit more involved, as this master control program (Tron reference 😉 ) would have to speak each service's protocol well enough to gauge how alive it is. There have been plenty of situations where a BMC is otherwise healthy but one service isn't working, and it's overkill to take a 30-45 second outage while the whole BMC restarts.
3) Kernel panics were set to automatically reboot the BMC, with a double whammy of the hardware watchdog being enabled in case the CPU didn't reset.

There's more to it than this, as sometimes you'd have to quiesce procmon to not restart services that, through normal operation, would cease functioning for a brief period of time, so tuning would be required.
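To make item 2 concrete, here's a minimal sketch of what such a functional-ping monitor could look like (Python; the unit names and ports are placeholders, not the real OpenBMC ones). A real version would speak each protocol properly (e.g. an IPMI Get Device ID or an HTTPS request) rather than just opening a socket, but the shape is the same: probe each service, and restart only the one that stops answering.

#!/usr/bin/env python3
# Toy "functional ping" monitor: probe each service and restart only the
# one that stops answering, instead of rebooting the whole BMC.
# Unit names and ports below are placeholders for illustration.
import socket
import subprocess
import time

CHECKS = {
    "ipmi.service": ("127.0.0.1", 623),       # hypothetical IPMI endpoint
    "webserver.service": ("127.0.0.1", 443),  # hypothetical web endpoint
}

def alive(host, port, timeout=3.0):
    """Treat 'accepts a TCP connection' as alive; a real check would
    exchange a protocol-level request/response."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main():
    while True:
        for unit, (host, port) in CHECKS.items():
            if not alive(host, port):
                # Restart just the broken service; systemd's dependency
                # handling can pull along anything that depends on it.
                subprocess.run(["systemctl", "restart", unit], check=False)
        time.sleep(30)

if __name__ == "__main__":
    main()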

-->Neil

-----Original Message-----
From: openbmc <openbmc-bounces+neil_bradley=phoenix.com@lists.ozlabs.org> On Behalf Of Andrew Geissler
Sent: Monday, October 19, 2020 12:53 PM
To: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Critical BMC process failure recovery

Greetings,

I've started initial investigation into two IBM requirements:

- Reboot the BMC if a "critical" process fails and cannot recover
- Limit the number of times the BMC reboots for recovery
  - Limit should be configurable, e.g. 3 resets within 5 minutes (a rough
    sketch of this check is below)
  - If limit reached, display error to panel (if one available) and halt
    the BMC.

The goal here is to have the BMC try to get itself back into a working state by rebooting itself.
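Purely to illustrate the limit check, here is one way the counting could work, assuming reboot timestamps are kept somewhere persistent across BMC reboots (the path and thresholds below are invented for the example, not a settled design):

#!/usr/bin/env python3
# Illustrative reboot-limit check, run early in boot.  The state file
# location and the limits are placeholders.
import time
from pathlib import Path

STATE = Path("/var/lib/bmc-recovery/reboot-times")  # hypothetical path
LIMIT = 3            # e.g. 3 recovery resets...
WINDOW = 5 * 60      # ...within 5 minutes

def reboot_limit_reached():
    now = time.time()
    times = []
    if STATE.exists():
        times = [float(t) for t in STATE.read_text().split()]
    times = [t for t in times if now - t < WINDOW]   # keep only recent reboots
    times.append(now)
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text("\n".join(str(t) for t in times))
    return len(times) > LIMIT

if __name__ == "__main__":
    if reboot_limit_reached():
        # The real code would log an error, drive the panel (if present),
        # and hold the BMC in a halted/quiesced state instead of retrying.
        print("reboot limit exceeded; stopping automatic recovery")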

This same reboot logic and limits would also apply to kernel panics and/or BMC hardware watchdog expirations.
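For reference, the kernel-panic side is typically just a sysctl to auto-reboot after a panic, plus having systemd pet the hardware watchdog so a hung BMC still gets reset; the values here are examples only, not a proposal:

# e.g. /etc/sysctl.d/panic.conf - reboot 10 seconds after a kernel panic
kernel.panic = 10

# e.g. /etc/systemd/system.conf - systemd keeps the hardware watchdog fed;
# if the BMC hangs hard, the watchdog expires and resets it
[Manager]
RuntimeWatchdogSec=30s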

Some thoughts that have been thrown around internally:

- Spend more time ensuring code doesn't fail vs. handling it when it does fail
- Put all BMC code into a single application so it's all or nothing (vs.
  trying to pick and choose specific applications and dealing with all of
  the intricacies of restarting individual ones)
- Rebooting the BMC and getting the proper ordering of service starts is
  sometimes easier than testing every individual service restart for recovery
  paths

"Critical" processes would be things like mapper or dbus-broker. There's definitely a grey area though with other services so we'd need some guidelines around defining them and allow the meta layers to have a way to deem whichever they want critical.

So anyway, just throwing this out there to see if anyone has any input or is looking for something similar.

High level, I'd probably start by looking into utilizing systemd as much as possible: "FailureAction=reboot-force" in the critical services, plus something that monitors for these types of reboots and enforces the reboot limits.
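As a rough sketch (the directive names are real systemd ones, but the policy values are just examples), a drop-in for a critical service could look something like this: Restart= handles individual crashes, the start-limit settings cap how many in-place restarts are attempted, and StartLimitAction=/FailureAction= escalate to a forced BMC reboot once the unit ends up failed:

# e.g. /etc/systemd/system/<critical>.service.d/recovery.conf (hypothetical path)
[Unit]
# Give up on in-place restarts after 3 attempts within 5 minutes...
StartLimitIntervalSec=5min
StartLimitBurst=3
# ...then escalate to forcing a BMC reboot.
StartLimitAction=reboot-force
FailureAction=reboot-force

[Service]
Restart=on-failure

Note these start limits only cover restarts within a single boot; counting the BMC reboots themselves across the 5 minute window (and halting at the limit) would still need the separate monitor mentioned above, since these counters don't persist across a reboot.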

Andrew

