openbmc.lists.ozlabs.org archive mirror
* Critical BMC process failure recovery
@ 2020-10-19 19:53 Andrew Geissler
  2020-10-19 21:35 ` Neil Bradley
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Andrew Geissler @ 2020-10-19 19:53 UTC (permalink / raw)
  To: OpenBMC Maillist

Greetings,

I've started initial investigation into two IBM requirements:

- Reboot the BMC if a "critical" process fails and cannot recover
- Limit the number of times the BMC reboots for recovery
  - Limit should be configurable, e.g. 3 resets within 5 minutes
  - If limit reached, display error to panel (if one available) and halt
    the BMC.

The goal here is to have the BMC try to get itself back into a working state
by rebooting itself.

This same reboot logic and limits would also apply to kernel panics and/or
BMC hardware watchdog expirations.

Some thoughts that have been thrown around internally:

- Spend more time ensuring code doesn't fail vs. handling it failing
- Put all BMC code into a single application so it's all or nothing (vs. 
  trying to pick and choose specific applications and dealing with all of
  the intricacies of restarting individual ones)
- Rebooting the BMC and getting the proper ordering of service starts is
  sometimes easier than testing every individual service restart for recovery
  paths

"Critical" processes would be things like mapper or dbus-broker. There's
definitely a grey area though with other services so we'd need some
guidelines around defining them and allow the meta layers to have a way
to deem whichever they want critical.

So anyway, just throwing this out there to see if anyone has any input
or is looking for something similar.

High level, I'd probably start looking into utilizing systemd as much as
possible: "FailureAction=reboot-force" in the critical services, plus something
that monitors for these types of reboots and enforces the reboot limits.
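
As a rough sketch (the unit name, file path, and values below are placeholders,
not a tested OpenBMC config), the per-service piece could be a drop-in along
these lines:

  # /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/10-critical.conf
  [Unit]
  # Escalate to a full BMC reboot once this unit lands in the failed state.
  FailureAction=reboot-force
  # Only give up (and thus fail) after 3 failed starts within 5 minutes.
  StartLimitIntervalSec=300
  StartLimitBurst=3

  [Service]
  # Let systemd retry the service itself before escalating.
  Restart=on-failure
  RestartSec=5

The BMC-level limit (N reboots in M minutes before halting) would still need
the separate monitor mentioned above, since these directives only cover a
single unit within a single boot.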

Andrew


* RE: Critical BMC process failure recovery
  2020-10-19 19:53 Critical BMC process failure recovery Andrew Geissler
@ 2020-10-19 21:35 ` Neil Bradley
  2020-10-22 15:41   ` Andrew Geissler
  2020-10-20  2:58 ` Lei Yu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Neil Bradley @ 2020-10-19 21:35 UTC (permalink / raw)
  To: Andrew Geissler, OpenBMC Maillist

Hey Andrew!

At least initially, the requirements don't really seem like requirements - they seem like someone's idea of what they think a solution would be.  For example, why reset 3 times? Why not 10? Or 2? Seems completely arbitrary. If the BMC resets twice in a row, there's no reason to think it would be OK the 3rd time. It's kinda like how people have been known to do 4-5 firmware updates to "fix" a problem and it "still doesn't work". 😉

If the ultimate goal is availability, then there's more nuance to the discussion to be had. Let's assume the goal is "highest availability possible".

With that in mind, defining what "failure" is gets to be a bit more convoluted. Back when we did the CMM code for the Intel modular server, we had a several-pronged approach:

1) Run procmon - Look for any service that is supposed to be running (but isn't) and restart it and/or its process dependency tree.
2) Create a monitor (either a standalone program or a script) that periodically connects to the various services available - IPMI, web, KVM, etc.... - think of it like a functional "ping". A bit more involved, as this master control program (Tron reference 😉 ) would have to speak sentiently to each service to gauge how alive it is. There have been plenty of situations where a BMC is otherwise healthy but one service wasn't working, and it's overkill to have a 30-45 second outage while the BMC restarts.
3) Kernel panics were set to automatically reboot the BMC, with a double whammy of the hardware watchdog being enabled in case the CPU didn't reset.

There's more to it than this, as sometimes you'd have to quiesce procmon to not restart services that, through normal operation, would cease functioning for a brief period of time, so tuning would be required.
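
For reference, item 3 maps fairly directly onto stock kernel and systemd knobs
on a current BMC stack; a minimal sketch (paths and values are illustrative,
not anyone's shipping defaults):

  # /etc/sysctl.d/10-panic.conf - reboot shortly after a kernel panic
  kernel.panic = 10
  kernel.panic_on_oops = 1

  # /etc/systemd/system.conf - have PID 1 pet the hardware watchdog, so a hung
  # kernel/CPU still gets reset even if the panic path never runs
  [Manager]
  RuntimeWatchdogSec=30
  ShutdownWatchdogSec=2min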

-->Neil

-----Original Message-----
From: openbmc <openbmc-bounces+neil_bradley=phoenix.com@lists.ozlabs.org> On Behalf Of Andrew Geissler
Sent: Monday, October 19, 2020 12:53 PM
To: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Critical BMC process failure recovery

Greetings,

I've started initial investigation into two IBM requirements:

- Reboot the BMC if a "critical" process fails and can not recover
- Limit the amount of times the BMC reboots for recovery
  - Limit should be configurable, i.e. 3 resets within 5 minutes
  - If limit reached, display error to panel (if one available) and halt
    the BMC.

The goal here is to have the BMC try and get itself back into a working state via a reboot of itself.

This same reboot logic and limits would also apply to kernel panics and/or BMC hardware watchdog expirations.

Some thoughts that have been thrown around internally:

- Spend more time ensuring code doesn't fail vs. handling them failing
- Put all BMC code into a single application so it's all or nothing (vs. 
  trying to pick and choose specific applications and dealing with all of
  the intricacies of restarting individual ones)
- Rebooting the BMC and getting the proper ordering of service starts is
  sometimes easier then testing every individual service restart for recovery
  paths

"Critical" processes would be things like mapper or dbus-broker. There's definitely a grey area though with other services so we'd need some guidelines around defining them and allow the meta layers to have a way to deem whichever they want critical.

So anyway, just throwing this out there to see if anyone has any input or is looking for something similar.

High level, I'd probably start looking into utilizing systemd as much as possible. "FailureAction=reboot-force" in the critical services and something that monitors for these types of reboots and enforces the reboot limits.

Andrew



* Re: Critical BMC process failure recovery
  2020-10-19 19:53 Critical BMC process failure recovery Andrew Geissler
  2020-10-19 21:35 ` Neil Bradley
@ 2020-10-20  2:58 ` Lei Yu
  2020-10-20 18:30   ` Bills, Jason M
  2020-10-20 14:28 ` Patrick Williams
  2021-09-02  0:29 ` Andrew Geissler
  3 siblings, 1 reply; 11+ messages in thread
From: Lei Yu @ 2020-10-20  2:58 UTC (permalink / raw)
  To: Andrew Geissler; +Cc: OpenBMC Maillist

Hi Andrew,

In Intel-BMC/openbmc, there are watchdog configs for every service so
that, if a service fails, the BMC is reset using the watchdog. See
the related configs and scripts below.

https://github.com/Intel-BMC/openbmc/blob/intel/meta-openbmc-mods/meta-common/classes/systemd-watchdog.bbclass
https://github.com/Intel-BMC/openbmc/blob/intel/meta-openbmc-mods/meta-common/recipes-phosphor/watchdog/system-watchdog/watchdog-reset.sh

It probably meets most of the requirements.
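
One common way to wire something like this up (illustrative only - not
necessarily how the linked bbclass does it) is an OnFailure= hook per service
that pokes the hardware watchdog:

  # Drop-in added to each monitored service (names here are hypothetical)
  [Unit]
  OnFailure=watchdog-reset.service

  # watchdog-reset.service - one-shot unit that triggers the BMC watchdog
  [Unit]
  Description=Reset BMC via watchdog after a service failure
  [Service]
  Type=oneshot
  ExecStart=/usr/bin/watchdog-reset.sh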


On Tue, Oct 20, 2020 at 3:54 AM Andrew Geissler <geissonator@gmail.com> wrote:
>
> Greetings,
>
> I've started initial investigation into two IBM requirements:
>
> - Reboot the BMC if a "critical" process fails and can not recover
> - Limit the amount of times the BMC reboots for recovery
>   - Limit should be configurable, i.e. 3 resets within 5 minutes
>   - If limit reached, display error to panel (if one available) and halt
>     the BMC.
>
> The goal here is to have the BMC try and get itself back into a working state
> via a reboot of itself.
>
> This same reboot logic and limits would also apply to kernel panics and/or
> BMC hardware watchdog expirations.
>
> Some thoughts that have been thrown around internally:
>
> - Spend more time ensuring code doesn't fail vs. handling them failing
> - Put all BMC code into a single application so it's all or nothing (vs.
>   trying to pick and choose specific applications and dealing with all of
>   the intricacies of restarting individual ones)
> - Rebooting the BMC and getting the proper ordering of service starts is
>   sometimes easier then testing every individual service restart for recovery
>   paths
>
> "Critical" processes would be things like mapper or dbus-broker. There's
> definitely a grey area though with other services so we'd need some
> guidelines around defining them and allow the meta layers to have a way
> to deem whichever they want critical.
>
> So anyway, just throwing this out there to see if anyone has any input
> or is looking for something similar.
>
> High level, I'd probably start looking into utilizing systemd as much as
> possible. "FailureAction=reboot-force" in the critical services and something
> that monitors for these types of reboots and enforces the reboot limits.
>
> Andrew



-- 
BRs,
Lei YU


* Re: Critical BMC process failure recovery
  2020-10-19 19:53 Critical BMC process failure recovery Andrew Geissler
  2020-10-19 21:35 ` Neil Bradley
  2020-10-20  2:58 ` Lei Yu
@ 2020-10-20 14:28 ` Patrick Williams
  2020-10-22 16:00   ` Andrew Geissler
  2021-09-02  0:29 ` Andrew Geissler
  3 siblings, 1 reply; 11+ messages in thread
From: Patrick Williams @ 2020-10-20 14:28 UTC (permalink / raw)
  To: Andrew Geissler; +Cc: OpenBMC Maillist


Hi Andrew,

I like the proposal to reuse what systemd already provides.  It does
look like Lei pointed to some existing bbclass that could be enhanced
for this purpose so that any recipe can simply 'inherit ...' and maybe
set a variable to indicate that it is providing "critical services".

On Mon, Oct 19, 2020 at 02:53:11PM -0500, Andrew Geissler wrote:
> Greetings,
> 
> I've started initial investigation into two IBM requirements:
> 
> - Reboot the BMC if a "critical" process fails and can not recover
> - Limit the amount of times the BMC reboots for recovery
>   - Limit should be configurable, i.e. 3 resets within 5 minutes

I like that it has a time bound on it here.  If the reset didn't have a
time bound, that would be a problem to me, because it would mean that a slow
memory leak could eventually get the BMC into this state.

Do you need to do anything in relation to the WDT and failover
settings there?  I'm thinking you'll need to do something to ensure that
you don't swap flash banks between these resets.  Do you need to do N
resets on one flash bank and then M on the other?

It seems that the most likely cause of N resets in a short time is some
sort of flash corruption, BMC chip error, or a bug aggravated by some RWFS
setting.  None of these are particularly recovered by the reset, but at
least you know you're in a bad situation at that point.

>   - If limit reached, display error to panel (if one available) and halt
>     the BMC.

And then what?  What is the remediation for this condition?  Are there
any services, such as SSH, that will continue to run in this state?  I
hope the only answer for remediation is physical access / power cycle.

-- 
Patrick Williams



* Re: Critical BMC process failure recovery
  2020-10-20  2:58 ` Lei Yu
@ 2020-10-20 18:30   ` Bills, Jason M
  0 siblings, 0 replies; 11+ messages in thread
From: Bills, Jason M @ 2020-10-20 18:30 UTC (permalink / raw)
  To: openbmc



On 10/19/2020 7:58 PM, Lei Yu wrote:
> Hi Andrew,
> 
> In Intel-BMC/openbmc, there are watchdog configs for every service
> that in case it fails, it will reset the BMC using the watchdog. See
> the below related configs and scripts.
> 
> https://github.com/Intel-BMC/openbmc/blob/intel/meta-openbmc-mods/meta-common/classes/systemd-watchdog.bbclass
> https://github.com/Intel-BMC/openbmc/blob/intel/meta-openbmc-mods/meta-common/recipes-phosphor/watchdog/system-watchdog/watchdog-reset.sh
> 
> It probably meets most of the requirements.
As an FYI - this approach has been very aggressive, so it is resetting 
more often than we would like and doesn't seem to be recovering some of 
the cases that it should.

We are considering disabling this watchdog reset in our platforms for 
now, and will likely need to refine the approach before we would enable 
it again.

> 
> 
> On Tue, Oct 20, 2020 at 3:54 AM Andrew Geissler <geissonator@gmail.com> wrote:
>>
>> Greetings,
>>
>> I've started initial investigation into two IBM requirements:
>>
>> - Reboot the BMC if a "critical" process fails and can not recover
>> - Limit the amount of times the BMC reboots for recovery
>>    - Limit should be configurable, i.e. 3 resets within 5 minutes
>>    - If limit reached, display error to panel (if one available) and halt
>>      the BMC.
>>
>> The goal here is to have the BMC try and get itself back into a working state
>> via a reboot of itself.
>>
>> This same reboot logic and limits would also apply to kernel panics and/or
>> BMC hardware watchdog expirations.
>>
>> Some thoughts that have been thrown around internally:
>>
>> - Spend more time ensuring code doesn't fail vs. handling them failing
>> - Put all BMC code into a single application so it's all or nothing (vs.
>>    trying to pick and choose specific applications and dealing with all of
>>    the intricacies of restarting individual ones)
>> - Rebooting the BMC and getting the proper ordering of service starts is
>>    sometimes easier then testing every individual service restart for recovery
>>    paths
>>
>> "Critical" processes would be things like mapper or dbus-broker. There's
>> definitely a grey area though with other services so we'd need some
>> guidelines around defining them and allow the meta layers to have a way
>> to deem whichever they want critical.
>>
>> So anyway, just throwing this out there to see if anyone has any input
>> or is looking for something similar.
>>
>> High level, I'd probably start looking into utilizing systemd as much as
>> possible. "FailureAction=reboot-force" in the critical services and something
>> that monitors for these types of reboots and enforces the reboot limits.
>>
>> Andrew
> 
> 
> 


* Re: Critical BMC process failure recovery
  2020-10-19 21:35 ` Neil Bradley
@ 2020-10-22 15:41   ` Andrew Geissler
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Geissler @ 2020-10-22 15:41 UTC (permalink / raw)
  To: Neil Bradley; +Cc: OpenBMC Maillist



> On Oct 19, 2020, at 4:35 PM, Neil Bradley <Neil_Bradley@phoenix.com> wrote:
> 
> Hey Andrew!
> 
> At least initially, the requirements don't really seem like requirements - they seem like what someone's idea of what they think a solution would be.  For example, why reset 3 times? Why not 10? Or 2? Seems completely arbitrary.

Hey Neil. I was starting with what our previous closed-source system
requirements were. The processes that cause a reset and the number
of times we reset should definitely be configurable.

> If the BMC resets twice in a row, there's no reason to think it would be OK the 3rd time. It's kinda like how people have been known do 4-5 firmware updates to "fix" a problem and it "still doesn't work". 😉

Yeah, history has shown that if one reboot doesn’t fix it then you’re
probably out of luck. But… it is up to the system owner to
configure whatever they like.

> 
> If the ultimate goal is availability, then there's more nuance to the discussion to be had. Let's assume the goal is "highest availability possible".
> 
> With that in mind, defining what "failure" is gets to be a bit more convoluted. Back when we did the CMM code for the Intel modular server, we had a several-pronged approach:
> 
> 1) Run procmon - Look for any service that is supposed to be running (but isn't) and restart it and/or its process dependency tree.
> 2) Create a monitor (either a standalone program or a script) that periodically connects to the various services available - IPMI, web, KVM, etc.... - think of it like a functional "ping". A bit more involved, as this master control program (Tron reference 😉 ) would have to speak sentiently to each service to gauge how alive it is. There have been plenty of situations where a BMC is otherwise healthy but one service wasn't working, and it's overkill to have a 30-45 second outage while the BMC restarts.

This sounds like it fits in with https://github.com/openbmc/phosphor-health-monitor
That to me is the next level of process health and recovery, but initially here
I was just looking for a broad "what do we do if our service is restarted
X times, is still in a failed state, and is critical to the basic
functionality of the BMC". To me the only options are to try a reboot
of the BMC or log an error and indicate the BMC is in an unstable
state.

> 
> -----Original Message-----
> From: openbmc <openbmc-bounces+neil_bradley=phoenix.com@lists.ozlabs.org> On Behalf Of Andrew Geissler
> Sent: Monday, October 19, 2020 12:53 PM
> To: OpenBMC Maillist <openbmc@lists.ozlabs.org>
> Subject: Critical BMC process failure recovery
> 
> Greetings,
> 
> I've started initial investigation into two IBM requirements:
> 
> - Reboot the BMC if a "critical" process fails and can not recover
> - Limit the amount of times the BMC reboots for recovery
>  - Limit should be configurable, i.e. 3 resets within 5 minutes
>  - If limit reached, display error to panel (if one available) and halt
>    the BMC.
> 
> The goal here is to have the BMC try and get itself back into a working state via a reboot of itself.
> 
> This same reboot logic and limits would also apply to kernel panics and/or BMC hardware watchdog expirations.
> 
> Some thoughts that have been thrown around internally:
> 
> - Spend more time ensuring code doesn't fail vs. handling them failing
> - Put all BMC code into a single application so it's all or nothing (vs. 
>  trying to pick and choose specific applications and dealing with all of
>  the intricacies of restarting individual ones)
> - Rebooting the BMC and getting the proper ordering of service starts is
>  sometimes easier then testing every individual service restart for recovery
>  paths
> 
> "Critical" processes would be things like mapper or dbus-broker. There's definitely a grey area though with other services so we'd need some guidelines around defining them and allow the meta layers to have a way to deem whichever they want critical.
> 
> So anyway, just throwing this out there to see if anyone has any input or is looking for something similar.
> 
> High level, I'd probably start looking into utilizing systemd as much as possible. "FailureAction=reboot-force" in the critical services and something that monitors for these types of reboots and enforces the reboot limits.
> 
> Andrew
> 



* Re: Critical BMC process failure recovery
  2020-10-20 14:28 ` Patrick Williams
@ 2020-10-22 16:00   ` Andrew Geissler
  2020-10-26 13:19     ` Matuszczak, Piotr
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Geissler @ 2020-10-22 16:00 UTC (permalink / raw)
  To: Patrick Williams; +Cc: OpenBMC Maillist



> On Oct 20, 2020, at 9:28 AM, Patrick Williams <patrick@stwcx.xyz> wrote:
> 
> Hi Andrew,
> 
> I like the proposal to reuse what systemd already provides.  It does
> look like Lei pointed to some existing bbclass that could be enhanced
> for this purpose so that any recipe can simply 'inherit ...' and maybe
> set a variable to indicate that it is providing "critical services”.

Yeah, it looks like currently it opts in every service (except for a few special
cases). I like the idea of putting it on the individual service to
opt itself in. I’ve def seen what James mentions in his response,
where you get into situations where the BMC is rebooting itself too
much due to non-critical services failing.

> 
> On Mon, Oct 19, 2020 at 02:53:11PM -0500, Andrew Geissler wrote:
>> Greetings,
>> 
>> I've started initial investigation into two IBM requirements:
>> 
>> - Reboot the BMC if a "critical" process fails and can not recover
>> - Limit the amount of times the BMC reboots for recovery
>>  - Limit should be configurable, i.e. 3 resets within 5 minutes
> 
> I like that it has a time bound on it here.  If the reset didn't have a
> time bound that would be a problem to me because it means that a slow
> memory leak could eventually get the BMCs into this state.
> 
> Do you need to do anything in relationship with the WDT and failover
> settings there?  I'm thinking you'll need to do something to ensure that
> you don't swap flash banks between these resets.  Do you need to do N
> resets on one flash bank and then M on the other?

I’m hoping to keep the flash bank switch a separate discussion. The key
here is to not impact whatever design decision is made there.

We’re still going back and forth a bit on whether we want to continue
with that automatic flash bank switch design point. It sometimes causes
more confusion than it’s worth.

I know we did make this work with our Witherspoon system from
a watchdog perspective. We would reboot a certain number of times
and swap flash banks after a certain limit was reached. I’m not
sure how we did it though :)
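
(For what it's worth, the usual mechanism for that kind of behavior is U-Boot's
boot-count feature; a hedged sketch, assuming U-Boot is built with
CONFIG_BOOTCOUNT_LIMIT and an environment-backed counter - the "bootside"
variable and exact commands are hypothetical:)

  # U-Boot environment
  setenv bootlimit 3        # allow 3 boot attempts on the current flash side
  setenv altbootcmd 'setenv bootside b; saveenv; run bootcmd'
  saveenv

  # Userspace clears the counter once the BMC is up and healthy, e.g. from a
  # "boot complete" service:
  fw_setenv bootcount 0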

> 
> It seems that the most likely cause of N resets in a short time is some
> sort of flash corruption, BMC chip error, or a bug aggravated some RWFS
> setting.  None of these are particularly recovered by the reset but at
> least you know your in a bad situation at that point.

Yeah, I would really like some data on how often a reboot of the
BMC really does fix an issue. The focus for us should def be on
avoiding the reboot in the first place. But the reboot is our last-ditch
effort.

> 
>>  - If limit reached, display error to panel (if one available) and halt
>>    the BMC.
> 
> And then what?  What is the remediation for this condition?  Are there
> any services, such as SSH, that will continue to run in this state?  I
> hope the only answer for remediation is physical access / power cycle.

I believe the best option (and what we’ve done historically) is to try and
put an error code on the panel and halt in u-boot, requiring physical
access / power cycle to recover.

> 
> -- 
> Patrick Williams



* RE: Critical BMC process failure recovery
  2020-10-22 16:00   ` Andrew Geissler
@ 2020-10-26 13:19     ` Matuszczak, Piotr
  2020-10-27 21:57       ` Andrew Geissler
  0 siblings, 1 reply; 11+ messages in thread
From: Matuszczak, Piotr @ 2020-10-26 13:19 UTC (permalink / raw)
  To: Andrew Geissler, Patrick Williams; +Cc: OpenBMC Maillist

Hi, it's quite an interesting discussion. Have you considered some kind of minimal-feature-set recovery image, to which the BMC can switch after N resets within a defined amount of time? Such an image could hold an error log and send a periodic event about the BMC failure.

Piotr Matuszczak
---------------------------------------------------------------------
Intel Technology Poland sp. z o.o. 
ul. Slowackiego 173, 80-298 Gdansk
KRS 101882
NIP 957-07-52-316

-----Original Message-----
From: openbmc <openbmc-bounces+piotr.matuszczak=intel.com@lists.ozlabs.org> On Behalf Of Andrew Geissler
Sent: Thursday, October 22, 2020 6:00 PM
To: Patrick Williams <patrick@stwcx.xyz>
Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: Critical BMC process failure recovery



> On Oct 20, 2020, at 9:28 AM, Patrick Williams <patrick@stwcx.xyz> wrote:
> 
> Hi Andrew,
> 
> I like the proposal to reuse what systemd already provides.  It does 
> look like Lei pointed to some existing bbclass that could be enhanced 
> for this purpose so that any recipe can simply 'inherit ...' and maybe 
> set a variable to indicate that it is providing "critical services”.

Yeah, looks like currently it opts in every service (except for a few special cases). I like the idea of putting it on the individual service to opt itself in. I’ve def seen what James mentions in his response where you get in situations where the BMC is rebooting itself too much due to non-critical services failing.

> 
> On Mon, Oct 19, 2020 at 02:53:11PM -0500, Andrew Geissler wrote:
>> Greetings,
>> 
>> I've started initial investigation into two IBM requirements:
>> 
>> - Reboot the BMC if a "critical" process fails and can not recover
>> - Limit the amount of times the BMC reboots for recovery
>>  - Limit should be configurable, i.e. 3 resets within 5 minutes
> 
> I like that it has a time bound on it here.  If the reset didn't have 
> a time bound that would be a problem to me because it means that a 
> slow memory leak could eventually get the BMCs into this state.
> 
> Do you need to do anything in relationship with the WDT and failover 
> settings there?  I'm thinking you'll need to do something to ensure 
> that you don't swap flash banks between these resets.  Do you need to 
> do N resets on one flash bank and then M on the other?

I’m hoping to keep the flash bank switch a separate discussion. The key here is to not impact whatever design decision is made there.

We’re still going back and forth a bit on whether we want to continue with that automatic flash bank switch design point. It sometimes causes more confusion than it’s worth.

I know we did make this work with our Witherspoon system from a watchdog perspective. We would reboot a certain amount of times and swap flash banks after a certain limit was reached. I’m not sure how we did it though :)

> 
> It seems that the most likely cause of N resets in a short time is 
> some sort of flash corruption, BMC chip error, or a bug aggravated 
> some RWFS setting.  None of these are particularly recovered by the 
> reset but at least you know your in a bad situation at that point.

Yeah, I would really like some data on how often a reboot of the BMC really does fix an issue. The focus for us should def be on avoiding the reboot in the first place. But the reboot is our last ditch effort.

> 
>>  - If limit reached, display error to panel (if one available) and halt
>>    the BMC.
> 
> And then what?  What is the remediation for this condition?  Are there 
> any services, such as SSH, that will continue to run in this state?  I 
> hope the only answer for remediation is physical access / power cycle.

I believe the best option (and what we’ve done historically) is to try and put an error code on the panel and halt in u-boot, requiring physical access / power cycle to recover.

> 
> --
> Patrick Williams



* Re: Critical BMC process failure recovery
  2020-10-26 13:19     ` Matuszczak, Piotr
@ 2020-10-27 21:57       ` Andrew Geissler
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Geissler @ 2020-10-27 21:57 UTC (permalink / raw)
  To: Matuszczak, Piotr; +Cc: OpenBMC Maillist



> On Oct 26, 2020, at 8:19 AM, Matuszczak, Piotr <piotr.matuszczak@intel.com> wrote:
> 
> Hi, It's quite interesting discussion. Have you considered some kind of minimal set of features recovery image, to which BMC can switch after N resets during defined amount of time? Such image could hold error log and send periodic event about BMC failure. 

That could definitely be useful. Some sort of safe mode. I believe systemd
has rescue/emergency mode options we could look at. I do think, though, as Patrick
pointed out earlier, that most issues are some sort of BMC hardware
failure. Anything that needs the kernel running and even basic services going
is going to be difficult to get working in those scenarios.

> 
> Piotr Matuszczak
> ---------------------------------------------------------------------
> Intel Technology Poland sp. z o.o. 
> ul. Slowackiego 173, 80-298 Gdansk
> KRS 101882
> NIP 957-07-52-316
> 



* Re: Critical BMC process failure recovery
  2020-10-19 19:53 Critical BMC process failure recovery Andrew Geissler
                   ` (2 preceding siblings ...)
  2020-10-20 14:28 ` Patrick Williams
@ 2021-09-02  0:29 ` Andrew Geissler
  2022-02-21 20:54   ` Andrew Geissler
  3 siblings, 1 reply; 11+ messages in thread
From: Andrew Geissler @ 2021-09-02  0:29 UTC (permalink / raw)
  To: OpenBMC Maillist



> On Oct 19, 2020, at 2:53 PM, Andrew Geissler <geissonator@gmail.com> wrote:
> 
> Greetings,
> 
> I've started initial investigation into two IBM requirements:
> 
> - Reboot the BMC if a "critical" process fails and can not recover
> - Limit the amount of times the BMC reboots for recovery
>  - Limit should be configurable, i.e. 3 resets within 5 minutes
>  - If limit reached, display error to panel (if one available) and halt
>    the BMC.

I’ve started to dig into this, and have had some internal discussions on it
here at IBM. We're starting to look in a bit of a different direction when a service
goes into the failed state. Rebooting the BMC has rarely been shown to help these
application failure scenarios, and it makes debug of the issue very difficult.
We'd prefer to log an error (with a BMC dump) and maybe put the BMC state into
something reflecting this error state.

It does seem, based on our previous emails, that there is some
interest in that capability though (BMC reboot on service failure). As a
flexible option, I'm thinking the following:

- Create a new obmc-bmc-service-failure.target
- Create a bbclass or some other mechanism for services to have a
  "OnFailure=obmc-bmc-service-failure.target"
- By default an error log is created in this target
- System owners can plop whatever other services they want into this target
  - Reboot BMC
  - Capture additional debug data
  - ...
- Introduce a new BMC State, Quiesce. The BMC state changes to this when the
  new obmc-bmc-service-failure.target is started. This then gets mapped to
  the redfish/v1/Managers/bmc status as Quiesced so users know the BMC
  has entered a bad state.

BMC kernel panics and such would still trigger the BMC reboot path, and some
TBD function will ensure we only reboot X times before stopping
in the boot loader or systemd rescue mode.
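
Roughly, the per-service wiring for the above could look like this (the target
name is from the proposal; everything else is just a sketch with hypothetical
unit names):

  # obmc-bmc-service-failure.target (new unit)
  [Unit]
  Description=BMC critical service failure handling

  # drop-in added to each critical service
  [Unit]
  OnFailure=obmc-bmc-service-failure.target

  # System owners then wire whatever actions they want into the target, e.g.:
  #   systemctl add-wants obmc-bmc-service-failure.target create-bmc-dump.service
  #   systemctl add-wants obmc-bmc-service-failure.target reboot-bmc.service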

Thoughts on this new idea?

> 
> The goal here is to have the BMC try and get itself back into a working state
> via a reboot of itself.
> 
> This same reboot logic and limits would also apply to kernel panics and/or
> BMC hardware watchdog expirations.
> 
> Some thoughts that have been thrown around internally:
> 
> - Spend more time ensuring code doesn't fail vs. handling them failing
> - Put all BMC code into a single application so it's all or nothing (vs. 
>  trying to pick and choose specific applications and dealing with all of
>  the intricacies of restarting individual ones)
> - Rebooting the BMC and getting the proper ordering of service starts is
>  sometimes easier then testing every individual service restart for recovery
>  paths
> 
> "Critical" processes would be things like mapper or dbus-broker. There's
> definitely a grey area though with other services so we'd need some
> guidelines around defining them and allow the meta layers to have a way
> to deem whichever they want critical.
> 
> So anyway, just throwing this out there to see if anyone has any input
> or is looking for something similar.
> 
> High level, I'd probably start looking into utilizing systemd as much as
> possible. "FailureAction=reboot-force" in the critical services and something
> that monitors for these types of reboots and enforces the reboot limits.
> 
> Andrew



* Re: Critical BMC process failure recovery
  2021-09-02  0:29 ` Andrew Geissler
@ 2022-02-21 20:54   ` Andrew Geissler
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Geissler @ 2022-02-21 20:54 UTC (permalink / raw)
  To: OpenBMC Maillist



> On Sep 1, 2021, at 7:29 PM, Andrew Geissler <geissonator@gmail.com> wrote:
> 
> 
> 
>> On Oct 19, 2020, at 2:53 PM, Andrew Geissler <geissonator@gmail.com> wrote:
>> 
>> Greetings,
>> 
>> I've started initial investigation into two IBM requirements:
>> 
>> - Reboot the BMC if a "critical" process fails and can not recover
>> - Limit the amount of times the BMC reboots for recovery
>> - Limit should be configurable, i.e. 3 resets within 5 minutes
>> - If limit reached, display error to panel (if one available) and halt
>>   the BMC.
> 
> I’ve started to dig into this, and have had some internal discussions on this
> here at IBM. We're starting to look in a bit different direction when a service
> goes into the fail state. Rebooting the BMC has rarely shown to help these
> application failure scenarios and it makes debug of the issue very difficult.
> We'd prefer to log an error (with a bmc dump) and maybe but the BMC state into
> something reflecting this error state.
> 

For those following along, there has been yet another (slight) direction change.
Instead of trying to manipulate the service files directly for each “critical” service,
a JSON-based design was chosen. So we’ll have a default JSON file that defines
the base BMC “critical” services, and then system owners can override it or
just append their own JSON file for their critical services.

This builds upon the existing target monitoring design, which made the implementation
quite easy.
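
Purely to illustrate the shape of the thing (the linked reviews below have the
authoritative schema and contents; the keys and service names here are my
shorthand), the default file is along the lines of:

  {
      "services": [
          "xyz.openbmc_project.ObjectMapper.service",
          "dbus-broker.service"
      ]
  }

with system-owner layers able to install additional JSON files that add to or
override that list.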

# Updated Design Docs
https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/46690
https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/51413

# Initial critical services:
https://gerrit.openbmc-project.xyz/c/openbmc/phosphor-state-manager/+/51128/2/data/phosphor-service-monitor-default.json

> 
> Thoughts on this new idea?
> 
>> 
>> The goal here is to have the BMC try and get itself back into a working state
>> via a reboot of itself.
>> 
>> This same reboot logic and limits would also apply to kernel panics and/or
>> BMC hardware watchdog expirations.
>> 
>> Some thoughts that have been thrown around internally:
>> 
>> - Spend more time ensuring code doesn't fail vs. handling them failing
>> - Put all BMC code into a single application so it's all or nothing (vs. 
>> trying to pick and choose specific applications and dealing with all of
>> the intricacies of restarting individual ones)
>> - Rebooting the BMC and getting the proper ordering of service starts is
>> sometimes easier then testing every individual service restart for recovery
>> paths
>> 
>> "Critical" processes would be things like mapper or dbus-broker. There's
>> definitely a grey area though with other services so we'd need some
>> guidelines around defining them and allow the meta layers to have a way
>> to deem whichever they want critical.
>> 
>> So anyway, just throwing this out there to see if anyone has any input
>> or is looking for something similar.
>> 
>> High level, I'd probably start looking into utilizing systemd as much as
>> possible. "FailureAction=reboot-force" in the critical services and something
>> that monitors for these types of reboots and enforces the reboot limits.
>> 
>> Andrew
> 



end of thread

Thread overview: 11+ messages
2020-10-19 19:53 Critical BMC process failure recovery Andrew Geissler
2020-10-19 21:35 ` Neil Bradley
2020-10-22 15:41   ` Andrew Geissler
2020-10-20  2:58 ` Lei Yu
2020-10-20 18:30   ` Bills, Jason M
2020-10-20 14:28 ` Patrick Williams
2020-10-22 16:00   ` Andrew Geissler
2020-10-26 13:19     ` Matuszczak, Piotr
2020-10-27 21:57       ` Andrew Geissler
2021-09-02  0:29 ` Andrew Geissler
2022-02-21 20:54   ` Andrew Geissler
