* How to deal with failing services in the boot targets
@ 2017-01-25 23:29 Xo Wang
  2017-01-26  0:33 ` Andrew Geissler
  2017-01-27  1:16 ` Andrew Jeffery
  0 siblings, 2 replies; 7+ messages in thread
From: Xo Wang @ 2017-01-25 23:29 UTC (permalink / raw)
  To: OpenBMC Maillist

Hi folks,

I'm seeing vcs-on@0.service failing occasionally. I know the cause of
it (i2c errors) but I'd like to know how to deal with failing services
in the context of OpenBMC boot sequencing.

For example, the service failure isn't reflected by any subsequent
target failures (it reaches obmc-chassis-start@0.target with no
command line errors, only a journal error for vcs-on@0.service
itself), nor did it prevent the boot from proceeding to pdbg host
control.

This is expected behavior given the systemd Unit relationships I used,
but I don't see a clean way to make a unit like vcs-on@.service block
the boot.

I tried making vcs-on@.service [Install]
RequiredBy=obmc-chassis-start@%i.target (and modifying the service
install similarly), but this only prints out a message that
obmc-chassis-start@0.target couldn't be reached due to its failed
dependency. It did not stop the pdbg start IPL.
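
For reference, the [Install] change described above looks roughly like
the sketch below. It is not the actual unit file: the Description and
ExecStart values are placeholders, and Type=oneshot is an assumption.

```ini
# vcs-on@.service (sketch; everything except RequiredBy= is a placeholder)
[Unit]
Description=Placeholder description for vcs-on instance %i

[Service]
# Assumed: the real service type is not stated in this thread.
Type=oneshot
# Placeholder command, not the real one.
ExecStart=/usr/bin/vcs-on-placeholder %i

[Install]
# Makes the instantiated target require this service once it is enabled.
RequiredBy=obmc-chassis-start@%i.target
```

With this in place the target itself fails when the service fails, but
nothing that is merely ordered around the target is held back.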

I also tried RequiredBy=obmc-host-start-pre@%i.target. This turned out
even worse because our targets don't require their precedent targets,
so obmc-host-start@0.target is still reachable even with a failure in
obmc-host-start-pre@0.target. Likewise for
obmc-chassis-start@0.target, which now prints no console error at all.

Finally I could add RequiredBy=start_host@%i.service to
vcs-on@0.service, but this seems fragile compared to using the targets
as synchronization points.

1) How should I make a host boot service be a blocking step in the chain?

2) Will this require a structural change in the OpenBMC targets?
Making targets require their precedent targets comes to mind. This
would make targets useful not only for sequencing but also for
dependency checking.

3) Do other people also want this? To me it seems obvious that failure
to power on should always block starting IPL, but maybe somebody else
has a good reason to use weaker relationships.
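
To make question 2 concrete, here is a minimal sketch of a target that
requires the target before it. The existing contents of the OpenBMC
target files are not shown in this thread, so the Description is a
placeholder and only the two dependency lines are the point.

```ini
# obmc-host-start@.target (sketch of the idea in question 2)
[Unit]
Description=Placeholder description for host start, instance %i
# Pull in the earlier target and refuse to start if it fails.
Requires=obmc-host-start-pre@%i.target
# Requires= alone does not order the units; After= makes the failure count.
After=obmc-host-start-pre@%i.target
```

With both lines present, a failure in obmc-host-start-pre@0.target
would keep obmc-host-start@0.target from being started.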

thanks
xo

* Re: How to deal with failing services in the boot targets
  2017-01-25 23:29 How to deal with failing services in the boot targets Xo Wang
@ 2017-01-26  0:33 ` Andrew Geissler
  2017-01-27  1:16 ` Andrew Jeffery
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Geissler @ 2017-01-26  0:33 UTC (permalink / raw)
  To: Xo Wang; +Cc: OpenBMC Maillist

Hey Xo,

Pretty perfect timing on this email, my new story for this sprint is
https://github.com/openbmc/openbmc/issues/1033 (handle service
failures in openbmc) precisely for the reasons you mentioned above.

We've had a few hallway talks but not much in the way of design yet.
One thought was to use the "OnFailure=" tag to start up a target that
conflicts with everything (to stop everything) and puts the system
into some sort of "termination" state.
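
A rough sketch of that idea, using unit-file fragments. The name
obmc-terminate@.target is hypothetical (nothing has been designed yet),
and the Conflicts= list is only illustrative of "conflicts with
everything".

```ini
# --- vcs-on@.service, [Unit] section only ---
[Unit]
# Start the termination target when this service enters the failed state.
OnFailure=obmc-terminate@%i.target

# --- obmc-terminate@.target (hypothetical name) ---
[Unit]
Description=Placeholder description for a termination state, instance %i
# Starting this target stops the listed units.
Conflicts=obmc-chassis-start@%i.target obmc-host-start@%i.target
```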

We have two different types of targets in openbmc.  The ones that
have "wants" relationships (i.e. run these services) and targets that
are more for synchronization among those services.  I recently added a
new chassis power "wants" target, and I think I'll need to add some of
the dependencies you mention above there, so that if someone starts the
"boot host" target, it will automatically run the "turn on power"
target.

Anyway, hope to have a few more thoughts on this out by the end of
this week.  Will do some experimenting with your notes.

Andrew

* Re: How to deal with failing services in the boot targets
  2017-01-25 23:29 How to deal with failing services in the boot targets Xo Wang
  2017-01-26  0:33 ` Andrew Geissler
@ 2017-01-27  1:16 ` Andrew Jeffery
  2017-02-01 20:42   ` Andrew Geissler
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Jeffery @ 2017-01-27  1:16 UTC (permalink / raw)
  To: Xo Wang, OpenBMC Maillist

On Wed, 2017-01-25 at 15:29 -0800, Xo Wang wrote:
> 3) Do other people also want this? To me it seems obvious that failure
> to power on should always block starting IPL, but maybe somebody else
> has a good reason to use weaker relationships.

Sounds highly desirable to me. In an effort to better understand our
dependencies I dumped them out with `systemd-analyze dot`. Safe to say
I'm not much wiser having seen the graph:

http://ozlabs.org/~arj/openbmc/systemd.svg

(Source: http://ozlabs.org/~arj/openbmc/systemd.dot.xz )

Andrew

* Re: How to deal with failing services in the boot targets
  2017-01-27  1:16 ` Andrew Jeffery
@ 2017-02-01 20:42   ` Andrew Geissler
  2017-02-03  6:40     ` Andrew Jeffery
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Geissler @ 2017-02-01 20:42 UTC (permalink / raw)
  To: Andrew Jeffery; +Cc: Xo Wang, OpenBMC Maillist

Finally got around to doing some testing on this, here's what I got.

My story this sprint, https://github.com/openbmc/openbmc/issues/1033,
is focused on handling errors when things go wrong.  Specifically,
when required services fail to execute properly during a systemd
target execution (power on, power off).  When a fail happens, the obmc
software needs to notify the users of the system and provide
mechanisms for either the system to automatically retry the failed
operation (i.e. reboot the system) or to stay in a quiesced state so
that error data can be collected and the fail can be investigated.

Michael is working on a story that ties in with this function this
sprint, https://github.com/openbmc/openbmc/issues/942, in which we’ll
allow system users to enable or disable the auto reboot function on
errors (service failure, host checkstop failure, host watchdog
failure).  He will utilize the new target I’ll be creating in my story
for this.

So we have two main fail scenarios:

1. A service within a target fails
- If the service is a oneshot type, and you make it required
(not wanted) by the target, then the target will fail if the service
fails
  - You can simply define a behavior for when the target fails using
the "OnFailure" option (i.e. go to a new failure target if any
required service fails)

- If the service is not a oneshot, then you cannot have it fail the
target (the target only knows that it started successfully)
  - You have to define a behavior for when the service fails using the
"OnFailure" option.
  - The service cannot have "RemainAfterExit=yes", otherwise the
OnFailure action does not occur until the service is stopped (instead
of when it fails)

2. A failure outside of a normal systemd target/service (host watchdog
expires, host checkstop detected)
- The service which detects this failure is responsible for logging
the appropriate error, and instructing systemd to go to the
appropriate target
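
As an illustration of the oneshot case in scenario 1, here is a minimal
sketch. The service name, its command, and the failure target name
obmc-failure@.target are all hypothetical; only
obmc-chassis-start@.target is a real name from this thread.

```ini
# --- power-step@.service (hypothetical oneshot step) ---
[Unit]
Description=Hypothetical power-on step for instance %i

[Service]
# A oneshot unit is not considered started until its command has exited,
# so a non-zero exit fails the start job.
Type=oneshot
ExecStart=/usr/bin/power-step-placeholder %i

[Install]
# Required (not wanted), so a failure here fails the target too.
RequiredBy=obmc-chassis-start@%i.target

# --- obmc-chassis-start@.target, [Unit] section only ---
[Unit]
# Hypothetical failure target to go to when a required service fails.
OnFailure=obmc-failure@%i.target
```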

The current proposal is that we create a new quiesce target.  This is
the target that targets and services name in their "OnFailure="
setting, and the target that the services in fail #2 above will
instruct systemd to go to when they detect a problem.  We'll then have
code that monitors for entry into this new quiesce target and handles
the halt vs automatic reboot functionality.

The above info sets up some general guidelines for our targets and
services (and some refactoring for my story this sprint)

- All targets should have an "OnFailure=obmc-quiesce-system@.target"
- All services which are required for a target to achieve its
function should be RequiredBy that target (not WantedBy)
- All services should first try to be Type=oneshot so that we can just
rely on the target fail path
- If a service cannot be "Type=oneshot", then it needs to have an
"OnFailure=obmc-quiesce-system@.target" and a "RemainAfterExit=no"
- If a service cannot be any of these, then it's up to the service
application to call systemd with the obmc-quiesce-system@.target on
failures
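
Here is a minimal sketch of the fourth guideline, a service that cannot
be a oneshot. The service name and command are hypothetical, and
writing the quiesce target with the %i specifier is an assumption about
how the instance would be passed along.

```ini
# --- some-daemon@.service (hypothetical long-running service) ---
[Unit]
Description=Hypothetical long-running service for instance %i
# The target never sees this service fail after startup, so the service
# has to name the quiesce target itself.
OnFailure=obmc-quiesce-system@%i.target

[Service]
Type=simple
ExecStart=/usr/bin/some-daemon-placeholder %i
# "no" is the default; spelled out here to match the guideline.
RemainAfterExit=no

[Install]
RequiredBy=obmc-host-start@%i.target
```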

Thoughts/Questions?
Andrew

* Re: How to deal with failing services in the boot targets
  2017-02-01 20:42   ` Andrew Geissler
@ 2017-02-03  6:40     ` Andrew Jeffery
  2017-02-09 16:32       ` Andrew Geissler
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Jeffery @ 2017-02-03  6:40 UTC (permalink / raw)
  To: Andrew Geissler; +Cc: Xo Wang, OpenBMC Maillist

On Wed, 2017-02-01 at 14:42 -0600, Andrew Geissler wrote:
> Finally got around to doing some testing on this, here's what I got.
> 
> My story this sprint, https://github.com/openbmc/openbmc/issues/1033,
> is focused on handling errors when things go wrong.  Specifically,
> when required services fail to execute properly during a systemd
> target execution (power on, power off).  When a fail happens, the obmc
> software needs to notify the users of the system and provide
> mechanisms for either the system to automatically retry the failed
> operation (i.e. reboot the system) or to stay in a quiesced state so
> that error data can be collected and the fail can be investigated.
> 
> Michael is working on a story that ties in with this function this
> sprint, https://github.com/openbmc/openbmc/issues/942, in which we’ll
> allow system users to enable or disable the auto reboot function on
> errors (service failure, host checkstop failure, host watchdog
> failure).  He will utilize the new target I’ll be creating in my story
> for this.
> 
> So we have two main fail scenarios:
> 
> 1. A service within a target fails
> - If the service is a oneshot type, and you put that it is required
> (not wanted) by the target then the target will fail if the service
> fails
>   - You can simply define a behavior for when the target fails using
> the “OnFailure” option (i.e. go to a new failure target if any
> required service fails)
> 
> - If the service is not a oneshot, then you can not have it fail the
> target (the target only knows that it started successfully)
>   - You have to define a behavior for when the service fails (OnFailure) option.
>   - The service can not have "RemainAfterExit=yes” otherwise the
> OnFailure action does not occur until the service is stopped (instead
> of when it fails)
> 
> 2. A failure outside of a normal systemd target/service (host watchdog
> expires, host checkstop detected)
> - The service which detects this failure is responsible for logging
> the appropriate error, and instructing systemd to go to the
> appropriate target
> 
> The current proposal is that we create a new quiesce target.  This is
> the target that the target/services put for their “OnFailure=“
> instruction and where the services in fail #2 above detect a problem
> will instruct systemd to go to.  We’ll then have code that monitors
> for the entry into this new quiesce target and handles the halt vs
> automatic reboot functionality.
> 
> The above info sets up some general guidelines for our targets and
> services (and some refactoring for my story this sprint)
> 
> - All targets should have an “OnFailure=obmc-quiesce-system@.target”
> - All services which are required for a target to achieve it’s
> function should be RequiredBy that target (not WantedBy)
> - All services should first try to be Type=oneshot so that we can just
> rely on the target fail path
> - If a service can not be “Type=oneshot”, then it needs to have a
> “OnFailure=obmc-quiesce-system@.target” and a "RemainAfterExit=no”
> - If a service can not be any of these then it’s up to the service
> application to call systemd with the obmc-quiesce-system@.target on
> failures
> 
> Thoughts/Questions?

I think this is a sensible set of suggestions. We need to document them
somewhere obvious so a) we can point people at them and b) reviewers
can refer to them when reviewing patches adding/updating systemd unit
files and targets.

Thanks for considering the problem.

Andrew

* Re: How to deal with failing services in the boot targets
  2017-02-03  6:40     ` Andrew Jeffery
@ 2017-02-09 16:32       ` Andrew Geissler
  2017-02-09 19:28         ` Xo Wang
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Geissler @ 2017-02-09 16:32 UTC (permalink / raw)
  To: Andrew Jeffery; +Cc: Xo Wang, OpenBMC Maillist

Some updates on this topic as I've gone into implementation.

The use of "system" in the quiesce target resulted in a lot of
discussion.  What's a system in relation to a chassis or host?  We
then started discussing blade centers (1 bmc to many blade servers)
and high end servers (1 bmc to many chassis).

This also led to discussion on whether we can really do a generic
recovery when the chassis power on target vs. the host start target
fails.

The outcome was the following:
- We'll rename obmc-quiesce-system@.target to
obmc-quiesce-host@.target.  We'll create an instance per host (which
is currently 1 for all systems).
- When the chassis power on target fails, we'll set its "OnFailure="
to the chassis power off target instead of quiesce.  The power on should
be very minimal and if we fail, the only recovery is to power off and
try again.  When the host start target fails, we'll go to the host
quiesce target.  The recovery from that could be to power on/off again
but it could also be to just re-run the host start target (i.e. leave
chassis power on).
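
In unit-file terms, the two failure paths above would look roughly like
this sketch. The name obmc-chassis-stop@.target is an assumption
standing in for "the chassis power off target"; the other names are
from this thread.

```ini
# --- obmc-chassis-start@.target, [Unit] section only ---
[Unit]
# Power-on failed: the only recovery is to power off and try again.
# (Assumed name for the chassis power off target.)
OnFailure=obmc-chassis-stop@%i.target

# --- obmc-host-start@.target, [Unit] section only ---
[Unit]
# Host start failed: go to the per-host quiesce target and let the
# monitoring code decide between halting and retrying.
OnFailure=obmc-quiesce-host@%i.target
```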

Also, Brad noted in the code review that we should not be putting the
target-to-service dependencies in the target files.  The target files
are meant to be generic, to allow different implementations of the
services required to achieve them.  Instead, I'll be putting the
relationships into the recipe files.

If you'd like to follow along on the reviews:
https://gerrit.openbmc-project.xyz/#/c/2253/  # doc updates
https://gerrit.openbmc-project.xyz/#/c/2115/  # target and recipe updates

Andrew

* Re: How to deal with failing services in the boot targets
  2017-02-09 16:32       ` Andrew Geissler
@ 2017-02-09 19:28         ` Xo Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Xo Wang @ 2017-02-09 19:28 UTC (permalink / raw)
  To: Andrew Geissler; +Cc: Andrew Jeffery, OpenBMC Maillist

On Thu, Feb 9, 2017 at 8:32 AM, Andrew Geissler <geissonator@gmail.com> wrote:
> Some updates on this topic as I've gone into implementation.
>
> The use of "system" in the quiesce target resulted in a lot of
> discussion.  What's a system in relation to a chassis or host?  We
> then started discussing blade centers (1 bmc to many blade servers)
> and high end servers (1 bmc to many chassis).
>
> This also led to discussion on whether we can really do a generic
> recovery when the chassis power on target vs. the host start target
> fails.
>
> The outcome was the following:
> - We'll rename the  obmc-quiesce-system@.target to
> obmc-quiesce-host@.target.  We'll create an instance per host (which
> is currently 1 for all systems).
> - When the chassis power on target fails, we'll have it's
> "onFailure=chassis power off" instead of quiesce.  The power on should
> be very minimal and if we fail, the only recovery is to power off and
> try again.  When the host start target fails, we'll go to the host
> quiesce target.  The recovery from that could be to power on/off again
> but it could also be to just re-run the host start target (i.e. leave
> chassis power on).
>

Sounds reasonable to me. If at some point we have specific power
sequencing requirements, this would handle "tearing down" that
sequence in the correct order.

> Also, Brad noted in the code review that we should not be putting the
> target to service dependencies in the target files.  The target files
> are meant to be generic to allow different implementations on the
> services required to achieve them.  Instead, I'll be putting the
> relationship into the recipe files.
>

I guess normally this is expressed by the [Install] directives in the
service files, then associated to targets at service install time by
systemd.

It looks to me like our installation is handled by SYSTEMD_LINK in
obmc-phosphor-systemd so, agreed, that's ultimately where we express
the relationships.

> If you'd like to follow along on the reviews:
> https://gerrit.openbmc-project.xyz/#/c/2253/  # doc updates
> https://gerrit.openbmc-project.xyz/#/c/2115/  # target and recipe updates
>
> Andrew
>

Thanks for taking this on, Andrew.

cheers
xo
