Re: Critical BMC process failure recovery

From: Andrew Geissler <geissonator@gmail.com>
To: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: Critical BMC process failure recovery
Date: Mon, 21 Feb 2022 14:54:33 -0600
Message-ID: <1663E960-38C1-401C-910D-32A74EFC6455@gmail.com>
In-Reply-To: <F503539B-1F5B-4EC0-A11F-A8A6EEA950B2@gmail.com>

> On Sep 1, 2021, at 7:29 PM, Andrew Geissler <geissonator@gmail.com> wrote:
> 
> 
> 
>> On Oct 19, 2020, at 2:53 PM, Andrew Geissler <geissonator@gmail.com> wrote:
>> 
>> Greetings,
>> 
>> I've started initial investigation into two IBM requirements:
>> 
>> - Reboot the BMC if a "critical" process fails and cannot recover
>> - Limit the number of times the BMC reboots for recovery
>>   - Limit should be configurable, e.g. 3 resets within 5 minutes
>>   - If the limit is reached, display an error on the panel (if one is
>>     available) and halt the BMC.
> 
> I’ve started to dig into this and have had some internal discussions about it
> here at IBM. We're starting to look in a slightly different direction for when
> a service goes into the failed state. Rebooting the BMC has rarely been shown to
> help in these application failure scenarios, and it makes debugging the issue
> very difficult. We'd prefer to log an error (with a BMC dump) and maybe put the
> BMC into a state that reflects this error.
> 

For those following along, there has been yet another (slight) change in direction.
Instead of trying to manipulate the service files directly for each “critical” service,
we went with a JSON file design. There will be a default JSON file that defines
the base BMC “critical” services, and system owners can then override it or
just append their own JSON file for their critical services.

This builds upon the existing target monitoring design, which made implementation
quite easy.
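
To make the override/append idea concrete, here is a rough Python sketch of how
a default list and a system owner's list could be combined. This is only an
illustration, not the phosphor-state-manager implementation: the "services" key,
the service names, and the merge_critical_services() helper are all assumptions
made up for the example. See the reviews below for the real file format.

```python
import json


def load_services(text):
    """Parse one JSON document and return its list of service names.

    Assumes a hypothetical layout of {"services": ["a.service", ...]}.
    """
    return list(json.loads(text).get("services", []))


def merge_critical_services(default_text, owner_text=None, override=False):
    """Combine the default critical services with a system owner's list.

    With override=True the owner's list replaces the default entirely;
    otherwise the owner's entries are appended, skipping duplicates and
    preserving order.
    """
    services = load_services(default_text)
    if owner_text is None:
        return services
    owner = load_services(owner_text)
    if override:
        return owner
    for name in owner:
        if name not in services:
            services.append(name)
    return services


# Illustrative inputs only; these names are placeholders, not the real defaults.
DEFAULT_JSON = '{"services": ["mapper.service", "dbus-broker.service"]}'
OWNER_JSON = '{"services": ["example-vendor.service", "mapper.service"]}'

if __name__ == "__main__":
    print(merge_critical_services(DEFAULT_JSON, OWNER_JSON))
    print(merge_critical_services(DEFAULT_JSON, OWNER_JSON, override=True))
```

Whether "append" should de-duplicate, and whether an override replaces the whole
list or only individual entries, are design choices this sketch simply picks a
side on.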

# Updated Design Docs
https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/46690
https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/51413

# Initial critical services:
https://gerrit.openbmc-project.xyz/c/openbmc/phosphor-state-manager/+/51128/2/data/phosphor-service-monitor-default.json

> 
> Thoughts on this new idea?
> 
>> 
>> The goal here is to have the BMC try and get itself back into a working state
>> via a reboot of itself.
>> 
>> This same reboot logic and limits would also apply to kernel panics and/or
>> BMC hardware watchdog expirations.
>> 
>> Some thoughts that have been thrown around internally:
>> 
>> - Spend more time ensuring code doesn't fail vs. handling its failures
>> - Put all BMC code into a single application so it's all or nothing (vs.
>> trying to pick and choose specific applications and dealing with all of
>> the intricacies of restarting individual ones)
>> - Rebooting the BMC and getting the proper ordering of service starts is
>> sometimes easier than testing every individual service restart for recovery
>> paths
>> 
>> "Critical" processes would be things like mapper or dbus-broker. There's
>> definitely a grey area though with other services so we'd need some
>> guidelines around defining them and allow the meta layers to have a way
>> to deem whichever they want critical.
>> 
>> So anyway, just throwing this out there to see if anyone has any input
>> or is looking for something similar.
>> 
>> High level, I'd probably start looking into utilizing systemd as much as
>> possible. "FailureAction=reboot-force" in the critical services and something
>> that monitors for these types of reboots and enforces the reboot limits.
>> 
>> Andrew
>