* Add support to debug unresponsive host
@ 2019-05-15 12:39 Jayanth Othayoth
  2019-05-15 18:26 ` Neeraj Ladkani
  2019-05-16  6:36 ` Deepak Kodihalli
  0 siblings, 2 replies; 7+ messages in thread
From: Jayanth Othayoth @ 2019-05-15 12:39 UTC (permalink / raw)
  To: openbmc, geissonator, bradleyb

## Problem Description
Issue #457: Add support to debug unresponsive host.

Scope: High-level design direction to solve this problem.

## Background and References
There are situations at customer sites where OPAL/Linux becomes unresponsive, causing a system hang, and there is no way to figure out what went wrong with the Linux kernel or OPAL. We are looking for a way to trigger a dump capture on the Linux host so that the OS dump can be captured for post-analysis.

## Proposed Design for POWER processor based systems:
Get all host CPUs into the reset vector; Linux then has a mechanism to patch this into the panic-kdump path to trigger dump capture. This will enable us to analyze and fix customer issues where Linux hangs and the system is unresponsive.

### Redfish Schema used:
* Reference: DSP2046 2018.3
* The ComputerSystem 1.6.0 schema provides an action called "#ComputerSystem.Reset", which is used to reset the system. The ResetType parameter indicates the type of reset to be performed. In this use case we can use the "Nmi" type.
    * Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86 systems) to cease normal operations, perform diagnostic actions and typically halt the system.

### d-bus:

Option 1: Extend the existing d-bus interface in the state.Host namespace (/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml) to support a new RequestedHostTransition value called "Nmi". The d-bus backend can internally invoke a processor-specific target to do an SRESET (equivalent to an x86 NMI) and the associated actions.

Option 2: Introduce a new d-bus interface in the control.state namespace (/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml) and implement the new d-bus back-end for the respective processor-specific targets.

## Alternatives Considered
NA

## Impacts:
NA

## Testing
NA

Looking for input from the team on this high-level design direction.
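For readers less familiar with the Redfish side of the proposal, the request it relies on can be sketched as follows. This is only an illustration: the helper name `build_nmi_reset_request`, the host `https://bmc.example`, and the system member name `system` are assumptions made for the example, not something stated in the thread. The action path and the `ResetType` value follow the ComputerSystem schema referenced in the proposal. The function only describes the request and does not send it.

```python
import json
from urllib.parse import urljoin


def build_nmi_reset_request(bmc_base_url: str, system_id: str = "system") -> dict:
    """Describe the Redfish POST that asks for a diagnostic interrupt.

    `system_id` is a placeholder (assumption); the real member name comes
    from the /redfish/v1/Systems collection on the target BMC.
    """
    path = f"/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset"
    return {
        "method": "POST",
        "url": urljoin(bmc_base_url, path),
        "headers": {"Content-Type": "application/json"},
        # "Nmi" is the ResetType value named in the proposal.
        "body": json.dumps({"ResetType": "Nmi"}),
    }


if __name__ == "__main__":
    # Hypothetical BMC address, used only to show the resulting request.
    request = build_nmi_reset_request("https://bmc.example")
    print(request["method"], request["url"])
    print(request["body"])
```

Keeping the request as plain data makes it easy to hand to whichever HTTP client and authentication scheme a given deployment uses.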
* RE: Add support to debug unresponsive host
  2019-05-15 12:39 Add support to debug unresponsive host Jayanth Othayoth
@ 2019-05-15 18:26 ` Neeraj Ladkani
  2019-05-16  9:11   ` Artem Senichev
  2019-05-16  6:36 ` Deepak Kodihalli
  1 sibling, 1 reply; 7+ messages in thread
From: Neeraj Ladkani @ 2019-05-15 18:26 UTC (permalink / raw)
  To: Jayanth Othayoth, openbmc, geissonator, bradleyb

Some questions.

1. How does the BMC know when to trigger the NMI? Are we relying on agents to run and send a heartbeat? Can this be done agentless?
2. How do we send an NMI on non-x86 platforms?

We should brainstorm to create a generic framework to solve this problem.

What
Neeraj
* Re: Add support to debug unresponsive host
  2019-05-15 18:26 ` Neeraj Ladkani
@ 2019-05-16  9:11   ` Artem Senichev
  0 siblings, 0 replies; 7+ messages in thread
From: Artem Senichev @ 2019-05-16 9:11 UTC (permalink / raw)
  To: Neeraj Ladkani; +Cc: Jayanth Othayoth, openbmc, geissonator, bradleyb

We have solved a similar task for our VESNIN servers, which are based on the POWER8 CPU (OpenPOWER platform).
OpenBMC includes the pdbg debugger (meta-openpower/recipes-bsp/pdbg). Among other things, this utility can send the SRESET signal from OpenBMC to the host's CPU. When the host-side Linux kernel handles the signal, it initiates kdump.
This procedure inevitably reboots the host system, whether the host is working or hung, so it is not a good idea to do this automatically. A system administrator initiates the procedure manually from the OpenBMC console.

-- 
Regards,
Artem Senichev
Software Engineer, YADRO.

On Wed, May 15, 2019 at 06:26:08PM +0000, Neeraj Ladkani wrote:
> Some questions.
>
> 1. How does BMC know when to trigger NMI? Are we relying on agents to run and send heartbeat? Can this be done agentless ?
> 2. How do we NMI on non x86 platforms ?
>
> we should brainstorm to create a generic framework to solve this problem.
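The point about not triggering SRESET automatically could be sketched as a small guard that refuses to produce a command unless an operator has explicitly confirmed it. Everything here is hypothetical: the names `plan_host_sreset` and `ConfirmationRequired` are invented for the example, and the default argument list `pdbg -a sreset` is an assumption about the pdbg command line that should be checked against `pdbg --help` on the target image. The function only returns the argument list and never runs it.

```python
from typing import List, Sequence

# Assumed invocation (not taken from the thread); verify on the target image.
DEFAULT_SRESET_ARGV = ("pdbg", "-a", "sreset")


class ConfirmationRequired(Exception):
    """Raised when SRESET is requested without explicit operator consent."""


def plan_host_sreset(
    operator_confirmed: bool,
    argv: Sequence[str] = DEFAULT_SRESET_ARGV,
) -> List[str]:
    """Return the command to run, or refuse if nobody confirmed it.

    SRESET leads to kdump and a host reboot, so the caller must pass
    operator_confirmed=True; nothing is executed by this function.
    """
    if not operator_confirmed:
        raise ConfirmationRequired(
            "SRESET reboots the host; explicit operator confirmation is required"
        )
    return list(argv)


if __name__ == "__main__":
    try:
        plan_host_sreset(operator_confirmed=False)
    except ConfirmationRequired as err:
        print("refused:", err)
    print("would run:", plan_host_sreset(operator_confirmed=True))
```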
* Re: Add support to debug unresponsive host
  2019-05-15 12:39 Add support to debug unresponsive host Jayanth Othayoth
  2019-05-15 18:26 ` Neeraj Ladkani
@ 2019-05-16  6:36 ` Deepak Kodihalli
  2019-05-16 13:01   ` Andrew Geissler
  1 sibling, 1 reply; 7+ messages in thread
From: Deepak Kodihalli @ 2019-05-16 6:36 UTC (permalink / raw)
  To: Jayanth Othayoth; +Cc: OpenBMC Maillist

On 15/05/19 6:09 PM, Jayanth Othayoth wrote:
> Option 1: Extending the existing d-bus interface state.Host name
> space (
> /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml
> ) to support new RequestedHostTransition property called “Nmi”. d-bus
> backend can internally invoke processor specific target to do Sreset(
> equivalent to x86 NMI) and associated actions.

I don't prefer this option, because it would mean adding host-specific code to phosphor-state-manager, which I think has been host-agnostic until now.
So for that reason, Option 2 sounds better. There are some good questions from Neeraj as well, so I would suggest adding this as a design template on Gerrit to gather better feedback.

Thanks,
Deepak

> Option 2: Introducing new d-bus interface in the control.state namespace
> (
> /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
> namespace and implement the new d-bus back-end for respective processor
> specific targets.
* Re: Add support to debug unresponsive host
  2019-05-16  6:36 ` Deepak Kodihalli
@ 2019-05-16 13:01   ` Andrew Geissler
  2019-05-27  7:15     ` Jayanth Othayoth
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Geissler @ 2019-05-16 13:01 UTC (permalink / raw)
  To: Deepak Kodihalli; +Cc: Jayanth Othayoth, OpenBMC Maillist

On Thu, May 16, 2019 at 1:36 AM Deepak Kodihalli
<dkodihal@linux.vnet.ibm.com> wrote:
>
> On 15/05/19 6:09 PM, Jayanth Othayoth wrote:
> > Option 1: Extending the existing d-bus interface state.Host name
> > space (
> > /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml
> > ) to support new RequestedHostTransition property called “Nmi”. d-bus
> > backend can internally invoke processor specific target to do Sreset(
> > equivalent to x86 NMI) and associated actions.
>
> I don't prefer this option, because this would mean adding host specific
> code in phosphor-state-manager, which I think until now is host agnostic.

Yeah, this was my main concern with tying it into phosphor-state-manager.
The fact that Redfish put it in with their other state-related commands (which are implemented by phosphor-state-manager) is the only reason I'm a little wishy-washy here. We could just create a generic systemd target, "host-nmi" or something, and phosphor-state-manager could call that to abstract any of the specifics, but it still doesn't really feel like it fits to me.

I think I prefer option 2, and then we can just map bmcweb to that API when the Redfish command comes in. Sounds like for ppc64 systems we can just use pdbg to issue the NMI.

> So for that reason, Option 2 sounds better. There are some good
> questions from Neeraj as well, so I would suggest adding this as a
> design template on Gerrit to gather better feedback.
>
> Thanks,
> Deepak
>
> > Option 2: Introducing new d-bus interface in the control.state namespace
> > (
> > /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
> > namespace and implement the new d-bus back-end for respective processor
> > specific targets.
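The split that Option 2 implies, a generic NMI entry point with processor-specific back-ends behind it and the web server mapping the Redfish reset type onto it, could look roughly like the sketch below. It models the idea in plain Python without any D-Bus library. The names `NmiService`, `register_backend` and `handle_redfish_reset`, the architecture strings, and the returned labels are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict


class NmiService:
    """Stand-in for the proposed host NMI interface (Option 2)."""

    def __init__(self) -> None:
        # Maps an architecture name to the callable that raises the NMI.
        self._backends: Dict[str, Callable[[], str]] = {}

    def register_backend(self, arch: str, backend: Callable[[], str]) -> None:
        self._backends[arch] = backend

    def nmi(self, arch: str) -> str:
        backend = self._backends.get(arch)
        if backend is None:
            raise NotImplementedError(f"no NMI backend registered for {arch!r}")
        return backend()


def handle_redfish_reset(reset_type: str, service: NmiService, arch: str) -> str:
    """Route only ResetType "Nmi" to the NMI service.

    Every other reset type is left to the host-agnostic state manager,
    represented here by a plain label.
    """
    if reset_type != "Nmi":
        return "handled-by-state-manager"
    return service.nmi(arch)


if __name__ == "__main__":
    service = NmiService()
    # Hypothetical ppc64 back-end; a real one might invoke pdbg.
    service.register_backend("ppc64", lambda: "sreset-requested")
    print(handle_redfish_reset("Nmi", service, "ppc64"))
    print(handle_redfish_reset("On", service, "ppc64"))
```

With this shape the state-handling code needs no host-specific knowledge, which is the concern raised against Option 1.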
* Re: Add support to debug unresponsive host
  2019-05-16 13:01   ` Andrew Geissler
@ 2019-05-27  7:15     ` Jayanth Othayoth
  2019-05-27 12:42       ` vishwa
  0 siblings, 1 reply; 7+ messages in thread
From: Jayanth Othayoth @ 2019-05-27 7:15 UTC (permalink / raw)
  To: Andrew Geissler
  Cc: OpenBMC Maillist, bradleyb, artemsen, dkodihal, yugang.chen

The design template review is available here:

https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/21772
* Re: Add support to debug unresponsive host
  2019-05-27  7:15     ` Jayanth Othayoth
@ 2019-05-27 12:42       ` vishwa
  0 siblings, 0 replies; 7+ messages in thread
From: vishwa @ 2019-05-27 12:42 UTC (permalink / raw)
  To: Jayanth Othayoth, Andrew Geissler
  Cc: OpenBMC Maillist, bradleyb, artemsen, yugang.chen

I kind of remember this topic being discussed in the past. It looks like we need to do two things prior to calling SRESET. I will comment on the review.

!! Vishwa !!

On 5/27/19 12:45 PM, Jayanth Othayoth wrote:
> Design template Review is available here
>
> https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/21772
Thread overview: 7+ messages
2019-05-15 12:39 Add support to debug unresponsive host Jayanth Othayoth
2019-05-15 18:26 ` Neeraj Ladkani
2019-05-16  9:11   ` Artem Senichev
2019-05-16  6:36 ` Deepak Kodihalli
2019-05-16 13:01   ` Andrew Geissler
2019-05-27  7:15     ` Jayanth Othayoth
2019-05-27 12:42       ` vishwa