On Wed, 2017-02-01 at 14:42 -0600, Andrew Geissler wrote:
> Finally got around to doing some testing on this, here's what I got.
>
> My story this sprint, https://github.com/openbmc/openbmc/issues/1033,
> is focused on handling errors when things go wrong.  Specifically,
> when required services fail to execute properly during a systemd
> target execution (power on, power off).  When a failure happens, the
> obmc software needs to notify the users of the system and provide
> mechanisms for either the system to automatically retry the failed
> operation (i.e. reboot the system) or to stay in a quiesced state so
> that error data can be collected and the failure can be investigated.
>
> Michael is working on a story that ties in with this function this
> sprint, https://github.com/openbmc/openbmc/issues/942, in which we'll
> allow system users to enable or disable the auto reboot function on
> errors (service failure, host checkstop failure, host watchdog
> failure).  He will utilize the new target I'll be creating in my story
> for this.
>
> So we have two main failure scenarios:
>
> 1. A service within a target fails
> - If the service is a oneshot type, and you mark it as required
>   (not wanted) by the target, then the target will fail if the
>   service fails
>   - You can simply define a behavior for when the target fails using
>     the "OnFailure" option (i.e. go to a new failure target if any
>     required service fails)
>
> - If the service is not a oneshot, then you cannot have it fail the
>   target (the target only knows that it started successfully)
>   - You have to define a behavior for when the service fails using
>     the "OnFailure" option.
>   - The service cannot have "RemainAfterExit=yes", otherwise the
>     OnFailure action does not occur until the service is stopped
>     (instead of when it fails)
>
> 2. A failure outside of a normal systemd target/service (host watchdog
> expires, host checkstop detected)
> - The service which detects this failure is responsible for logging
>   the appropriate error, and instructing systemd to go to the
>   appropriate target
>
> The current proposal is that we create a new quiesce target.  This is
> the target that the targets/services name in their "OnFailure="
> instruction, and the one that the services in failure scenario 2
> above will instruct systemd to go to when they detect a problem.
> We'll then have code that monitors for entry into this new quiesce
> target and handles the halt vs automatic reboot functionality.
>
> The above info sets up some general guidelines for our targets and
> services (and some refactoring for my story this sprint)
>
> - All targets should have an "OnFailure=obmc-quiesce-system@.target"
> - All services which are required for a target to achieve its
>   function should be RequiredBy that target (not WantedBy)
> - All services should first try to be Type=oneshot so that we can just
>   rely on the target fail path
> - If a service cannot be "Type=oneshot", then it needs to have an
>   "OnFailure=obmc-quiesce-system@.target" and a "RemainAfterExit=no"
> - If a service cannot be any of these then it's up to the service
>   application to call systemd with the obmc-quiesce-system@.target on
>   failures
>
> Thoughts/Questions?

I think this is a sensible set of suggestions. We need to document them
somewhere obvious so a) we can point people at them and b) reviewers
can refer to them when reviewing patches adding/updating systemd unit
files and targets.

Thanks for considering the problem.

Andrew
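
For whoever ends up documenting this, the quoted guidelines could be illustrated with unit fragments roughly like the ones below. This is only a sketch: the target name obmc-quiesce-system@.target is taken from the proposal as written, while every other unit name, description and ExecStart path is hypothetical and exists purely for illustration. The three fragments would live in three separate unit files; they are shown in one block for brevity.

```ini
# --- obmc-example-power-on@.target (hypothetical target name) ---
# Guideline: every target names the quiesce target in OnFailure=.
[Unit]
Description=Example power-on target (illustration only)
OnFailure=obmc-quiesce-system@.target

# --- example-step@.service (hypothetical, preferred oneshot form) ---
# Guideline: required services are RequiredBy (not WantedBy) the target
# and are Type=oneshot, so a failure here fails the target itself.
[Unit]
Description=Example required step (illustration only)

[Service]
Type=oneshot
ExecStart=/usr/bin/example-step

[Install]
RequiredBy=obmc-example-power-on@.target

# --- example-daemon@.service (hypothetical, cannot be oneshot) ---
# Guideline: a non-oneshot service carries its own OnFailure= and must
# not keep RemainAfterExit=yes, so the action fires when it fails.
[Unit]
Description=Example long-running service (illustration only)
OnFailure=obmc-quiesce-system@.target

[Service]
ExecStart=/usr/bin/example-daemon
RemainAfterExit=no

[Install]
RequiredBy=obmc-example-power-on@.target
```

The last guideline (a service application calling systemd itself) has no unit-file form, so it is not shown here.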