From: Andrew Geissler <geissonator@gmail.com>
To: Andrew Jeffery <andrew@aj.id.au>
Cc: Xo Wang <xow@google.com>, OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: How to deal with failing services in the boot targets
Date: Thu, 9 Feb 2017 10:32:05 -0600
Message-ID: <CALLMt=rHDvrGppGkNp8hXNu0Lwxbr36HFDwptFkeTP62nS+7XQ@mail.gmail.com>
In-Reply-To: <1486104037.7271.10.camel@aj.id.au>

Some updates on this topic as I've gone into implementation.

The use of "system" in the quiesce target resulted in a lot of
discussion.  What's a system in relation to a chassis or host?  We
then started discussing blade centers (one BMC to many blade servers)
and high-end servers (one BMC to many chassis).

This also led to discussion on whether we can really do a generic
recovery when the chassis power on target vs. the host start target
fails.

The outcome was the following:
- We'll rename obmc-quiesce-system@.target to
obmc-quiesce-host@.target.  We'll create an instance per host (which
is currently 1 for all systems).
- When the chassis power on target fails, we'll point its
"OnFailure=" at chassis power off instead of quiesce.  The power on
should be very minimal, and if it fails, the only recovery is to power
off and try again.  When the host start target fails, we'll go to the
host quiesce target.  The recovery from that could be to power off and
on again, but it could also be to just re-run the host start target
(i.e. leave chassis power on).
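
As a rough sketch of those two failure paths: only
obmc-quiesce-host@.target is a name from this thread; the power on,
power off and host start unit names below are placeholders assumed for
illustration, and the real files in the review may look different.

```ini
# --- file: obmc-chassis-poweron@.target (hypothetical name) ---
[Unit]
Description=Chassis%i (Power On)
# If power on fails, power the chassis back off instead of quiescing.
OnFailure=obmc-chassis-poweroff@%i.target

# --- file: obmc-host-start@.target (hypothetical name) ---
[Unit]
Description=Start Host%i
# If host start fails, go to the per-host quiesce target.
OnFailure=obmc-quiesce-host@%i.target
```

The %i specifier carries the instance through, so a failure of the
host 0 start target would lead to obmc-quiesce-host@0.target.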

Also, Brad noted in the code review that we should not be putting the
target-to-service dependencies in the target files.  The target files
are meant to be generic, to allow different implementations of the
services required to achieve them.  Instead, I'll be putting the
relationship into the recipe files.
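
For illustration only, one way systemd lets a dependency live outside
the target file is for the service to declare it in its own [Install]
section, so whatever ships the service also owns the relationship.
The service name, binary path and target name below are made up for
this sketch, and the actual recipe changes in the review may do this
differently.

```ini
# --- file: example-host-step@.service (hypothetical service) ---
[Unit]
Description=Example host start step for host %i

[Service]
# oneshot: the start job succeeds only once the process exits cleanly,
# so a failure here can fail the target that requires this service.
Type=oneshot
ExecStart=/usr/bin/example-host-step --host %i

[Install]
# Enabling the unit creates a symlink under the target's .requires/
# directory; the target file itself stays generic.
RequiredBy=obmc-host-start@%i.target
```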

If you'd like to follow along on the reviews:
https://gerrit.openbmc-project.xyz/#/c/2253/  # doc updates
https://gerrit.openbmc-project.xyz/#/c/2115/  # target and recipe updates

Andrew

On Fri, Feb 3, 2017 at 12:40 AM, Andrew Jeffery <andrew@aj.id.au> wrote:
> On Wed, 2017-02-01 at 14:42 -0600, Andrew Geissler wrote:
>> Finally got around to doing some testing on this, here's what I got.
>>
>> My story this sprint, https://github.com/openbmc/openbmc/issues/1033,
>> is focused on handling errors when things go wrong.  Specifically,
>> when required services fail to execute properly during a systemd
>> target execution (power on, power off).  When a fail happens, the obmc
>> software needs to notify the users of the system and provide
>> mechanisms for either the system to automatically retry the failed
>> operation (i.e. reboot the system) or to stay in a quiesced state so
>> that error data can be collected and the fail can be investigated.
>>
>> Michael is working on a story that ties in with this function this
>> sprint, https://github.com/openbmc/openbmc/issues/942, in which we’ll
>> allow system users to enable or disable the auto reboot function on
>> errors (service failure, host checkstop failure, host watchdog
>> failure).  He will utilize the new target I’ll be creating in my story
>> for this.
>>
>> So we have two main fail scenarios:
>>
>> 1. A service within a target fails
>> - If the service is a oneshot type, and you put that it is required
>> (not wanted) by the target then the target will fail if the service
>> fails
>>   - You can simply define a behavior for when the target fails using
>> the “OnFailure” option (i.e. go to a new failure target if any
>> required service fails)
>>
>> - If the service is not a oneshot, then you can not have it fail the
>> target (the target only knows that it started successfully)
>>   - You have to define a behavior for when the service fails, using
>> the “OnFailure” option.
>>   - The service can not have “RemainAfterExit=yes”, otherwise the
>> OnFailure action does not occur until the service is stopped (instead
>> of when it fails)
>>
>> 2. A failure outside of a normal systemd target/service (host watchdog
>> expires, host checkstop detected)
>> - The service which detects this failure is responsible for logging
>> the appropriate error, and instructing systemd to go to the
>> appropriate target
>>
>> The current proposal is that we create a new quiesce target.  This is
>> the target that the target/services put for their “OnFailure=”
>> instruction and where the services in fail #2 above detect a problem
>> will instruct systemd to go to.  We’ll then have code that monitors
>> for the entry into this new quiesce target and handles the halt vs
>> automatic reboot functionality.
>>
>> The above info sets up some general guidelines for our targets and
>> services (and some refactoring for my story this sprint)
>>
>> - All targets should have an “OnFailure=obmc-quiesce-system@.target”
>> - All services which are required for a target to achieve its
>> function should be RequiredBy that target (not WantedBy)
>> - All services should first try to be Type=oneshot so that we can just
>> rely on the target fail path
>> - If a service can not be “Type=oneshot”, then it needs to have an
>> “OnFailure=obmc-quiesce-system@.target” and a “RemainAfterExit=no”
>> - If a service can not be any of these then it’s up to the service
>> application to call systemd with the obmc-quiesce-system@.target on
>> failures
>>
>> Thoughts/Questions?
>
> I think this is a sensible set of suggestions. We need to document them
> somewhere obvious so a) we can point people at them and b) reviewers
> can refer to them when reviewing patches adding/updating systemd unit
> files and targets.
>
> Thanks for considering the problem.
>
> Andrew
