openbmc.lists.ozlabs.org archive mirror
* [RFC] BMC RAS Feature
@ 2023-03-21  5:14 Supreeth Venkatesh
  2023-03-21 10:40 ` Patrick Williams
  2023-07-14 22:05 ` Bills, Jason M
  0 siblings, 2 replies; 20+ messages in thread
From: Supreeth Venkatesh @ 2023-03-21  5:14 UTC (permalink / raw)
  To: openbmc
  Cc: Michael Shen, ed, bradleyb, supreeth.venkatesh, Abinaya.Dhandapani

Thanks in advance for your inputs/feedback.

## Purpose

Gather feedback on the BMC RAS design so that it can be used in a
processor-agnostic manner, find collaborators for refining the
design/implementation, and request an OpenBMC repository [preferably
bmc-ras, oob-ras, bmc-crashdump, or oob-crashdump] with the initial
maintainers Supreeth Venkatesh [Supreeth.Venkatesh@amd.com] and
Abinaya Dhandapani [Abinaya.Dhandapani@amd.com].

### BMC RAS, Crash dump

Authors:

Supreeth Venkatesh

Abinaya Dhandapani

Primary assignees:

Supreeth Venkatesh

Abinaya Dhandapani

Other contributors:

Created: 03/20/2023

#### Problem Description

Collection of crash data at runtime in a processor-agnostic manner
presents both a challenge and an opportunity to standardize.

#### Background and References

Host processors allow an external management controller (i.e., a BMC)
to harvest CPU crash data over a vendor-specific interface during
fatal errors [the APML interface in the case of the AMD processor family].

This feature allows more efficient real-time diagnosis of hardware
failures, without waiting for Boot Error Record Table (BERT) logs in the
next boot cycle.

The crash data collected may be used to triage, debug, or attempt to
reproduce the system conditions that led to the failure.

#### Requirements

 1. Collecting the RAS/crash dump shall be processor specific. Hence the
    use of virtual APIs to allow overrides for the processor-specific way
    of collecting the data.
 2. Crash data shall be stored in the Common Platform Error Record (CPER)
    format per the UEFI specification [https://uefi.org/specs/UEFI/2.10/];
    see the header sketch after this list.
 3. Configuration parameters of the service shall be standard, with scope
    for processor-specific extensions.

#### Proposed Design

When one or more processors register a fatal error condition, an
interrupt is generated on the host processor.

The host processor in the failed state asserts a signal to indicate to
the BMC that a fatal hang has occurred [APML_ALERT# in the case of the
AMD processor family].

The BMC RAS application listens for this event [APML_ALERT# in the case
of the AMD processor family].

Upon detection of the fatal error event, the BMC checks a status
register of the host processor [implementation-defined method] to see
if the assertion is due to the fatal error.

Upon a fatal error, the BMC attempts to harvest crash data from all
processors [via the APML interface (mailbox) in the case of the AMD
processor family].

The BMC generates a single raw crashdump record and saves it in the
non-volatile location /var/lib/bmc-ras.

Per the BMC policy configuration, the BMC then initiates a system reset
to recover the system from the fatal error condition.
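
A minimal sketch of this monitoring loop is below, assuming the libgpiod
v1 C API; the gpiochip name, line offset, socket count, and the helper
functions are hypothetical stand-ins for the implementation-defined
(e.g. APML) accesses:

    // Sketch of the fatal-error monitoring loop (assumes libgpiod v1).
    // Chip name, line offset, and the helper functions are hypothetical.
    #include <gpiod.h>
    #include <ctime>

    bool isFatalError(int socket);     // implementation-defined status-register check
    void harvestCrashData(int socket); // e.g. APML mailbox reads on AMD platforms
    void writeCperRecord();            // serialize one record to /var/lib/bmc-ras
    void recoverSystem();              // warm/cold/no reset per BMC policy

    int main()
    {
        constexpr int numSockets = 2;  // platform dependent
        gpiod_chip* chip = gpiod_chip_open_by_name("gpiochip0");
        gpiod_line* alert = gpiod_chip_get_line(chip, 42); // APML_ALERT# line
        gpiod_line_request_falling_edge_events(alert, "bmc-ras");

        while (true)
        {
            timespec timeout{10, 0};
            gpiod_line_event ev;
            if (gpiod_line_event_wait(alert, &timeout) != 1 ||
                gpiod_line_event_read(alert, &ev) != 0)
                continue; // timeout or read error; keep listening

            bool fatal = false;
            for (int s = 0; s < numSockets; ++s)
                fatal = isFatalError(s) || fatal;

            if (fatal)
            {
                for (int s = 0; s < numSockets; ++s)
                    harvestCrashData(s); // harvest from all processors
                writeCperRecord();
                recoverSystem();
            }
        }
    }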

The generated crashdump record is in the Common Platform Error Record
(CPER) format as defined in the UEFI specification
[https://uefi.org/specs/UEFI/2.10/].

The application supports a configurable number of records, with the
default set to 10. If the number of records exceeds the limit, the
records are rotated.
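
A possible rotation scheme (a sketch using C++17 std::filesystem; the
.cper naming follows the Redfish sample later in this mail, and 10 is
the stated default limit):

    // Sketch: keep at most maxRecords CPER files, deleting the oldest first.
    #include <algorithm>
    #include <filesystem>
    #include <vector>

    namespace fs = std::filesystem;

    void rotateRecords(const fs::path& dir = "/var/lib/bmc-ras",
                       std::size_t maxRecords = 10)
    {
        std::vector<fs::path> files;
        for (const auto& entry : fs::directory_iterator(dir))
            if (entry.path().extension() == ".cper")
                files.push_back(entry.path());

        // Oldest first, by modification time.
        std::sort(files.begin(), files.end(),
                  [](const fs::path& a, const fs::path& b)
                  { return fs::last_write_time(a) < fs::last_write_time(b); });

        while (files.size() > maxRecords)
        {
            fs::remove(files.front());
            files.erase(files.begin());
        }
    }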

Crashdump records are saved in /var/lib/bmc-ras and can be retrieved
via the Redfish interface.

Format of RAS/Crash dump record below

Sample CPER file on fatal error:

_Configuring the RAS config file_

A configuration file is created in /var/lib/bmc-ras which allows the
user to configure the values below.

The configuration fields below are AMD specific; however, they can be
_standardized_ based on the feedback.

"APML retries" - retry count for APML mailbox commands

"Harvest PPIN" - if enabled, harvest the PPIN and dump it into the CPER
file

"Harvest Microcode" - if enabled, harvest the microcode version and dump
it into the CPER file

"System Recovery" - warm reset, cold reset, or no reset, per the user's
requirement

The configuration file values can be viewed and changed via Redfish
GET/SET commands.

The Redfish URI to configure the BMC config file:
https://<BMC-IP>/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration

Sample Redfish output:

curl -s -k -u root:0penBmc -H "Content-Type: application/json" -X GET \
  https://onyx-63dd/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration

{
    "@odata.id": "/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration",
    "@odata.type": "#LogService.v1_2_0.LogService",
    "apmlRetries": 10,
    "harvestPpin": true,
    "systemRecovery": 2,
    "uCodeVersion": true
}
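
A write to the same action might look like the following (hypothetical:
this mail does not show the SET request; Redfish actions conventionally
take a POST, and the field names here simply mirror the GET output
above):

    curl -s -k -u root:0penBmc -H "Content-Type: application/json" -X POST \
      https://<BMC-IP>/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration \
      -d '{"apmlRetries": 5, "harvestPpin": true, "systemRecovery": 1, "uCodeVersion": true}'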

_D-Bus interface for the crashdump service_

The BMC RAS application is started by the systemd service
com.amd.crashdump [this will be changed to a generic service name based
on feedback/interest from the community].

A D-Bus interface is maintained which exposes the config file
information and the CPER files currently in the system; these can be
downloaded via the Redfish interface.

The service name and object path need to be renamed from these
OEM-specific names so that all contributors can use the same service
name and object path to pull the crash data; a hypothetical generic
interface is sketched below.
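
As a strawman for that discussion, a generic interface in the
phosphor-dbus-interfaces YAML style could look like the sketch below;
the interface name, properties, and method are hypothetical proposals,
not an existing interface:

    # Hypothetical yaml/xyz/openbmc_project/Dump/Crashdump.interface.yaml
    description: >
        Implement to expose processor crash dump (CPER) records collected
        out-of-band by the BMC.

    properties:
        - name: Filename
          type: string
          description: >
              Absolute path of the CPER record on the BMC filesystem.
        - name: Timestamp
          type: uint64
          description: >
              Time the record was harvested, in milliseconds since the epoch.

    methods:
        - name: GenerateCrashdump
          description: >
              Trigger an on-demand harvest of crash data from all processors.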

#### Alternatives Considered

In-band mechanisms using System Management Mode (SMM) exist.

However, the out-of-band method of gathering RAS data is processor
specific.

#### Impacts

Since the crash dump data follows the Common Platform Error Record
(CPER) format per the UEFI specification
[https://uefi.org/specs/UEFI/2.10/], there is no security impact.

This implementation takes load off the host processor by offloading the
data-collection process to the BMC, thereby improving the performance of
the system as a whole.

#### Testing

It has been tested on AMD Genoa platforms, namely Onyx, Quartz, Ruby,
and Titanite.

Further testing support is appreciated.

Thanks,

Abinaya & Supreeth



* Re: [RFC] BMC RAS Feature
  2023-03-21  5:14 [RFC] BMC RAS Feature Supreeth Venkatesh
@ 2023-03-21 10:40 ` Patrick Williams
  2023-03-21 15:07   ` Supreeth Venkatesh
  2023-07-14 22:05 ` Bills, Jason M
  1 sibling, 1 reply; 20+ messages in thread
From: Patrick Williams @ 2023-03-21 10:40 UTC (permalink / raw)
  To: Supreeth Venkatesh
  Cc: openbmc, bradleyb, Abinaya.Dhandapani, Michael Shen, ed


On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:

> #### Alternatives Considered
> 
> In-band mechanisms using System Management Mode (SMM) exists.
> 
> However, out of band method to gather RAS data is processor specific.
> 

How does this compare with existing implementations in
phosphor-debug-collector?  I believe there was some attempt to extend
P-D-C previously to handle Intel's crashdump behavior.  The repository
currently handles IBM's processors, I think, or maybe that is covered by
openpower-debug-collector.

In any case, I think you should look at the existing D-Bus interfaces
(and associated Redfish implementation) of these repositories and
determine if you can use those approaches (or document why not).

-- 
Patrick Williams



* Re: [RFC] BMC RAS Feature
  2023-03-21 10:40 ` Patrick Williams
@ 2023-03-21 15:07   ` Supreeth Venkatesh
  2023-03-21 16:26     ` dhruvaraj S
  0 siblings, 1 reply; 20+ messages in thread
From: Supreeth Venkatesh @ 2023-03-21 15:07 UTC (permalink / raw)
  To: Patrick Williams; +Cc: openbmc, bradleyb, Abinaya.Dhandapani, Michael Shen, ed


On 3/21/23 05:40, Patrick Williams wrote:
> On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:
>
>> #### Alternatives Considered
>>
>> In-band mechanisms using System Management Mode (SMM) exists.
>>
>> However, out of band method to gather RAS data is processor specific.
>>
> How does this compare with existing implementations in
> phosphor-debug-collector.
Thanks for your feedback. See below.
> I believe there was some attempt to extend
> P-D-C previously to handle Intel's crashdump behavior.
Intel's crashdump interface uses com.intel.crashdump.
We have implemented com.amd.crashdump based on that reference. However, 
can this be made generic?

PoC below:

busctl tree com.amd.crashdump

└─/com
   └─/com/amd
     └─/com/amd/crashdump
       ├─/com/amd/crashdump/0
       ├─/com/amd/crashdump/1
       ├─/com/amd/crashdump/2
       ├─/com/amd/crashdump/3
       ├─/com/amd/crashdump/4
       ├─/com/amd/crashdump/5
       ├─/com/amd/crashdump/6
       ├─/com/amd/crashdump/7
       ├─/com/amd/crashdump/8
       └─/com/amd/crashdump/9

> The repository
> currently handles IBM's processors, I think, or maybe that is covered by
> openpower-debug-collector.
>
> In any case, I think you should look at the existing D-Bus interfaces
> (and associated Redfish implementation) of these repositories and
> determine if you can use those approaches (or document why now).
I could not find an existing D-Bus interface for RAS in
xyz/openbmc_project/. It would be helpful if you could point me to it.
There are references to com.intel.crashdump in bmcweb code, but the
interface itself does not exist in yaml/com/intel/. We can add
com.amd.crashdump as a start or even come up with a new generic D-Bus
interface.
As far as the Redfish implementation is concerned, we are following the
specification; the redfish/v1/Systems/system/LogServices/Crashdump
schema is being used.

{
    "@odata.id": "/redfish/v1/Systems/system/LogServices/Crashdump/Entries",
    "@odata.type": "#LogEntryCollection.LogEntryCollection",
    "Description": "Collection of Crashdump Entries",
    "Members": [
        {
            "@odata.id": "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/0",
            "@odata.type": "#LogEntry.v1_7_0.LogEntry",
            "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/0/ras-error0.cper",
            "Created": "1970-1-1T0:4:12Z",
            "DiagnosticDataType": "OEM",
            "EntryType": "Oem",
            "Id": "0",
            "Name": "CPU Crashdump",
            "OEMDiagnosticDataType": "APMLCrashdump"
        },
        {
            "@odata.id": "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/1",
            "@odata.type": "#LogEntry.v1_7_0.LogEntry",
            "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/1/ras-error1.cper",
            "Created": "1970-1-1T0:4:12Z",
            "DiagnosticDataType": "OEM",
            "EntryType": "Oem",
            "Id": "1",
            "Name": "CPU Crashdump",
            "OEMDiagnosticDataType": "APMLCrashdump"
        }
    ],
    "Members@odata.count": 2,
    "Name": "Open BMC Crashdump Entries"
}
>


* Re: [RFC] BMC RAS Feature
  2023-03-21 15:07   ` Supreeth Venkatesh
@ 2023-03-21 16:26     ` dhruvaraj S
  2023-03-21 17:25       ` Supreeth Venkatesh
  0 siblings, 1 reply; 20+ messages in thread
From: dhruvaraj S @ 2023-03-21 16:26 UTC (permalink / raw)
  To: Supreeth Venkatesh
  Cc: Michael Shen, openbmc, ed, bradleyb, Abinaya.Dhandapani


On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh <supreeth.venkatesh@amd.com>
wrote:

>
> On 3/21/23 05:40, Patrick Williams wrote:
> > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:
> >
> >> #### Alternatives Considered
> >>
> >> In-band mechanisms using System Management Mode (SMM) exists.
> >>
> >> However, out of band method to gather RAS data is processor specific.
> >>
> > How does this compare with existing implementations in
> > phosphor-debug-collector.
> Thanks for your feedback. See below.
> > I believe there was some attempt to extend
> > P-D-C previously to handle Intel's crashdump behavior.
> Intel's crashdump interface uses com.intel.crashdump.
> We have implemented com.amd.crashdump based on that reference. However,
> can this be made generic?
>
> PoC below:
>
> busctl tree com.amd.crashdump
>
> └─/com
>    └─/com/amd
>      └─/com/amd/crashdump
>        ├─/com/amd/crashdump/0
>        ├─/com/amd/crashdump/1
>        ├─/com/amd/crashdump/2
>        ├─/com/amd/crashdump/3
>        ├─/com/amd/crashdump/4
>        ├─/com/amd/crashdump/5
>        ├─/com/amd/crashdump/6
>        ├─/com/amd/crashdump/7
>        ├─/com/amd/crashdump/8
>        └─/com/amd/crashdump/9
>
> > The repository
> > currently handles IBM's processors, I think, or maybe that is covered by
> > openpower-debug-collector.
> >
> > In any case, I think you should look at the existing D-Bus interfaces
> > (and associated Redfish implementation) of these repositories and
> > determine if you can use those approaches (or document why now).
> I could not find an existing D-Bus interface for RAS in
> xyz/openbmc_project/.
> It would be helpful if you could point me to it.
>

There is an interface for the dumps generated from the host, which can
be used for these kinds of dumps:
https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml

The fault log also provides similar dumps:
https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml


The tree for the dump manager looks like this:
`-/xyz
  `-/xyz/openbmc_project
    `-/xyz/openbmc_project/dump
      |-/xyz/openbmc_project/dump/bmc
      | `-/xyz/openbmc_project/dump/bmc/entry
      |   |-/xyz/openbmc_project/dump/bmc/entry/1
      |   |-/xyz/openbmc_project/dump/bmc/entry/2
      |   |-/xyz/openbmc_project/dump/bmc/entry/3
      |   `-/xyz/openbmc_project/dump/bmc/entry/4
      |-/xyz/openbmc_project/dump/faultlog
      |-/xyz/openbmc_project/dump/hardware
      |-/xyz/openbmc_project/dump/hostboot
      |-/xyz/openbmc_project/dump/internal
      | `-/xyz/openbmc_project/dump/internal/manager
      |-/xyz/openbmc_project/dump/resource
      |-/xyz/openbmc_project/dump/sbe
      `-/xyz/openbmc_project/dump/system

> There are references to com.intel.crashdump in bmcweb code, but the
> interface itself does not exist in yaml/com/intel/
> we can add com.amd.crashdump as a start or even come up with a new
> generic Dbus interface.
> As far as Redfish implementation is concerned, we are following the
> specification.
> redfish/v1/Systems/system/LogServices/Crashdump schema is being used.
>
> {
>
> "@odata.id": "/redfish/v1/Systems/system/LogServices/Crashdump/Entries",
> "@odata.type": "#LogEntryCollection.LogEntryCollection",
> "Description": "Collection of Crashdump Entries",
> "Members":
>   [
> {"@odata.id":
> "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/0",
> "@odata.type": "#LogEntry.v1_7_0.LogEntry",
> "AdditionalDataURI":
>
> "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/0/ras-error0.cper",
> "Created": "1970-1-1T0:4:12Z",
> "DiagnosticDataType": "OEM",
> "EntryType": "Oem",
> "Id": "0",
> "Name": "CPU Crashdump",
> "OEMDiagnosticDataType": "APMLCrashdump"
> },
> {"@odata.id":
> "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/1",
> "@odata.type": "#LogEntry.v1_7_0.LogEntry",
> "AdditionalDataURI":
>
> "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/1/ras-error1.cper",
> "Created": "1970-1-1T0:4:12Z",
> "DiagnosticDataType": "OEM",
> "EntryType": "Oem",
> "Id": "1",
> "Name": "CPU Crashdump",
> "OEMDiagnosticDataType": "APMLCrashdump"
> },
> ],
> "Members@odata.count": 2,
> "Name": "Open BMC Crashdump Entries"}
> >
>


-- 
--------------
Dhruvaraj S



* Re: [RFC] BMC RAS Feature
  2023-03-21 16:26     ` dhruvaraj S
@ 2023-03-21 17:25       ` Supreeth Venkatesh
  2023-03-22  7:10         ` Lei Yu
  0 siblings, 1 reply; 20+ messages in thread
From: Supreeth Venkatesh @ 2023-03-21 17:25 UTC (permalink / raw)
  To: dhruvaraj S; +Cc: Michael Shen, openbmc, ed, bradleyb, Abinaya.Dhandapani


On 3/21/23 11:26, dhruvaraj S wrote:
>
> 	
> Caution: This message originated from an External Source. Use proper 
> caution when opening attachments, clicking links, or responding.
>
>
>
>
> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh 
> <supreeth.venkatesh@amd.com> wrote:
>
>
>     On 3/21/23 05:40, Patrick Williams wrote:
>     > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:
>     >
>     >> #### Alternatives Considered
>     >>
>     >> In-band mechanisms using System Management Mode (SMM) exists.
>     >>
>     >> However, out of band method to gather RAS data is processor
>     specific.
>     >>
>     > How does this compare with existing implementations in
>     > phosphor-debug-collector.
>     Thanks for your feedback. See below.
>     > I believe there was some attempt to extend
>     > P-D-C previously to handle Intel's crashdump behavior.
>     Intel's crashdump interface uses com.intel.crashdump.
>     We have implemented com.amd.crashdump based on that reference.
>     However,
>     can this be made generic?
>
>     PoC below:
>
>     busctl tree com.amd.crashdump
>
>     └─/com
>        └─/com/amd
>          └─/com/amd/crashdump
>            ├─/com/amd/crashdump/0
>            ├─/com/amd/crashdump/1
>            ├─/com/amd/crashdump/2
>            ├─/com/amd/crashdump/3
>            ├─/com/amd/crashdump/4
>            ├─/com/amd/crashdump/5
>            ├─/com/amd/crashdump/6
>            ├─/com/amd/crashdump/7
>            ├─/com/amd/crashdump/8
>            └─/com/amd/crashdump/9
>
>     > The repository
>     > currently handles IBM's processors, I think, or maybe that is
>     covered by
>     > openpower-debug-collector.
>     >
>     > In any case, I think you should look at the existing D-Bus
>     interfaces
>     > (and associated Redfish implementation) of these repositories and
>     > determine if you can use those approaches (or document why now).
>     I could not find an existing D-Bus interface for RAS in
>     xyz/openbmc_project/.
>     It would be helpful if you could point me to it.
>
>
> There is an interface for the dumps generated from the host, which can 
> be used for these kinds of dumps
> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>
> The fault log also provides similar dumps
> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>
Thanks Dhruvaraj. The interface looks useful for the purpose. However,
the current bmcweb implementation references
https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
[com.intel.crashdump]:

constexpr char const* crashdumpPath = "/com/intel/crashdump";
constexpr char const* crashdumpInterface = "com.intel.crashdump";
constexpr char const* crashdumpObject = "com.intel.crashdump";

Is
https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
or
https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
exercised in the Redfish LogServices?

> The tree for the dump manager looks like this
> `-/xyz
>   `-/xyz/openbmc_project
>     `-/xyz/openbmc_project/dump
>       |-/xyz/openbmc_project/dump/bmc
>       | `-/xyz/openbmc_project/dump/bmc/entry
>       |   |-/xyz/openbmc_project/dump/bmc/entry/1
>       |   |-/xyz/openbmc_project/dump/bmc/entry/2
>       |   |-/xyz/openbmc_project/dump/bmc/entry/3
>       |   `-/xyz/openbmc_project/dump/bmc/entry/4
>       |-/xyz/openbmc_project/dump/faultlog
>       |-/xyz/openbmc_project/dump/hardware
>       |-/xyz/openbmc_project/dump/hostboot
>       |-/xyz/openbmc_project/dump/internal
>       | `-/xyz/openbmc_project/dump/internal/manager
>       |-/xyz/openbmc_project/dump/resource
>       |-/xyz/openbmc_project/dump/sbe
>       `-/xyz/openbmc_project/dump/system
>
>     There are references to com.intel.crashdump in bmcweb code, but the
>     interface itself does not exist in yaml/com/intel/
>     we can add com.amd.crashdump as a start or even come up with a new
>     generic Dbus interface.
>     As far as Redfish implementation is concerned, we are following the
>     specification.
>     redfish/v1/Systems/system/LogServices/Crashdump schema is being used.
>
>     {
>
>     "@odata.id":
>     "/redfish/v1/Systems/system/LogServices/Crashdump/Entries",
>     "@odata.type": "#LogEntryCollection.LogEntryCollection",
>     "Description": "Collection of Crashdump Entries",
>     "Members":
>       [
>     {"@odata.id":
>     "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/0",
>     "@odata.type": "#LogEntry.v1_7_0.LogEntry",
>     "AdditionalDataURI":
>     "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/0/ras-error0.cper",
>     "Created": "1970-1-1T0:4:12Z",
>     "DiagnosticDataType": "OEM",
>     "EntryType": "Oem",
>     "Id": "0",
>     "Name": "CPU Crashdump",
>     "OEMDiagnosticDataType": "APMLCrashdump"
>     },
>     {"@odata.id":
>     "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/1",
>     "@odata.type": "#LogEntry.v1_7_0.LogEntry",
>     "AdditionalDataURI":
>     "/redfish/v1/Systems/system/LogServices/Crashdump/Entries/1/ras-error1.cper",
>     "Created": "1970-1-1T0:4:12Z",
>     "DiagnosticDataType": "OEM",
>     "EntryType": "Oem",
>     "Id": "1",
>     "Name": "CPU Crashdump",
>     "OEMDiagnosticDataType": "APMLCrashdump"
>     },
>     ],
>     "Members@odata.count": 2,
>     "Name": "Open BMC Crashdump Entries"}
>     >
>
>
>
> -- 
> --------------
> Dhruvaraj S


* Re: [RFC] BMC RAS Feature
  2023-03-21 17:25       ` Supreeth Venkatesh
@ 2023-03-22  7:10         ` Lei Yu
  2023-03-23  0:07           ` Supreeth Venkatesh
  0 siblings, 1 reply; 20+ messages in thread
From: Lei Yu @ 2023-03-22  7:10 UTC (permalink / raw)
  To: Supreeth Venkatesh
  Cc: Ed Tanous, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Abinaya.Dhandapani

> > On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
> > <supreeth.venkatesh@amd.com> wrote:
> >
> >
> >     On 3/21/23 05:40, Patrick Williams wrote:
> >     > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:
> >     >
> >     >> #### Alternatives Considered
> >     >>
> >     >> In-band mechanisms using System Management Mode (SMM) exists.
> >     >>
> >     >> However, out of band method to gather RAS data is processor
> >     specific.
> >     >>
> >     > How does this compare with existing implementations in
> >     > phosphor-debug-collector.
> >     Thanks for your feedback. See below.
> >     > I believe there was some attempt to extend
> >     > P-D-C previously to handle Intel's crashdump behavior.
> >     Intel's crashdump interface uses com.intel.crashdump.
> >     We have implemented com.amd.crashdump based on that reference.
> >     However,
> >     can this be made generic?
> >
> >     PoC below:
> >
> >     busctl tree com.amd.crashdump
> >
> >     └─/com
> >        └─/com/amd
> >          └─/com/amd/crashdump
> >            ├─/com/amd/crashdump/0
> >            ├─/com/amd/crashdump/1
> >            ├─/com/amd/crashdump/2
> >            ├─/com/amd/crashdump/3
> >            ├─/com/amd/crashdump/4
> >            ├─/com/amd/crashdump/5
> >            ├─/com/amd/crashdump/6
> >            ├─/com/amd/crashdump/7
> >            ├─/com/amd/crashdump/8
> >            └─/com/amd/crashdump/9
> >
> >     > The repository
> >     > currently handles IBM's processors, I think, or maybe that is
> >     covered by
> >     > openpower-debug-collector.
> >     >
> >     > In any case, I think you should look at the existing D-Bus
> >     interfaces
> >     > (and associated Redfish implementation) of these repositories and
> >     > determine if you can use those approaches (or document why now).
> >     I could not find an existing D-Bus interface for RAS in
> >     xyz/openbmc_project/.
> >     It would be helpful if you could point me to it.
> >
> >
> > There is an interface for the dumps generated from the host, which can
> > be used for these kinds of dumps
> > https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
> >
> > The fault log also provides similar dumps
> > https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
> >
> ThanksDdhruvraj. The interface looks useful for the purpose. However,
> the current BMCWEB implementation references
> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
> [com.intel.crashdump]
> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>
> constexpr char const* crashdumpInterface = "com.intel.crashdump";
> constexpr char const* crashdumpObject = "com.intel.crashdump";
>
> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
> or
> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
> is it exercised in Redfish logservices?

In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
to copy the crashdump JSON file to the dump tarball.
The crashdump tool (Intel or AMD) can trigger a dump after the
crashdump is completed, and then we get a dump entry containing
the crashdump.
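
[For readers unfamiliar with dreport, such a plugin is a small shell
script along the lines of the sketch below; this is hedged: the actual
acddump plugin may differ, add_copy_file is the helper provided by
phosphor-debug-collector's dreport functions, and the crashdump
location here is a hypothetical one taken from this RFC.]

    #!/bin/bash
    # Sketch of a dreport plugin that copies crashdump records into the
    # dump tarball (the dreport config header is omitted here).

    . $DREPORT_INCLUDE/functions

    desc="crashdump records"
    crashdump_dir="/var/lib/bmc-ras"   # hypothetical location, per this RFC

    for f in "$crashdump_dir"/*; do
        [ -e "$f" ] && add_copy_file "$f" "$desc"
    done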


-- 
BRs,
Lei YU


* Re: [RFC] BMC RAS Feature
  2023-03-22  7:10         ` Lei Yu
@ 2023-03-23  0:07           ` Supreeth Venkatesh
  2023-04-03 11:44             ` Patrick Williams
       [not found]             ` <d65937a46b6fb4f9f94edbdef44af58e@imap.linux.ibm.com>
  0 siblings, 2 replies; 20+ messages in thread
From: Supreeth Venkatesh @ 2023-03-23  0:07 UTC (permalink / raw)
  To: Lei Yu
  Cc: Ed Tanous, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Abinaya.Dhandapani


On 3/22/23 02:10, Lei Yu wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>> <supreeth.venkatesh@amd.com> wrote:
>>>
>>>
>>>      On 3/21/23 05:40, Patrick Williams wrote:
>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:
>>>      >
>>>      >> #### Alternatives Considered
>>>      >>
>>>      >> In-band mechanisms using System Management Mode (SMM) exists.
>>>      >>
>>>      >> However, out of band method to gather RAS data is processor
>>>      specific.
>>>      >>
>>>      > How does this compare with existing implementations in
>>>      > phosphor-debug-collector.
>>>      Thanks for your feedback. See below.
>>>      > I believe there was some attempt to extend
>>>      > P-D-C previously to handle Intel's crashdump behavior.
>>>      Intel's crashdump interface uses com.intel.crashdump.
>>>      We have implemented com.amd.crashdump based on that reference.
>>>      However,
>>>      can this be made generic?
>>>
>>>      PoC below:
>>>
>>>      busctl tree com.amd.crashdump
>>>
>>>      └─/com
>>>         └─/com/amd
>>>           └─/com/amd/crashdump
>>>             ├─/com/amd/crashdump/0
>>>             ├─/com/amd/crashdump/1
>>>             ├─/com/amd/crashdump/2
>>>             ├─/com/amd/crashdump/3
>>>             ├─/com/amd/crashdump/4
>>>             ├─/com/amd/crashdump/5
>>>             ├─/com/amd/crashdump/6
>>>             ├─/com/amd/crashdump/7
>>>             ├─/com/amd/crashdump/8
>>>             └─/com/amd/crashdump/9
>>>
>>>      > The repository
>>>      > currently handles IBM's processors, I think, or maybe that is
>>>      covered by
>>>      > openpower-debug-collector.
>>>      >
>>>      > In any case, I think you should look at the existing D-Bus
>>>      interfaces
>>>      > (and associated Redfish implementation) of these repositories and
>>>      > determine if you can use those approaches (or document why now).
>>>      I could not find an existing D-Bus interface for RAS in
>>>      xyz/openbmc_project/.
>>>      It would be helpful if you could point me to it.
>>>
>>>
>>> There is an interface for the dumps generated from the host, which can
>>> be used for these kinds of dumps
>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>
>>> The fault log also provides similar dumps
>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>
>> ThanksDdhruvraj. The interface looks useful for the purpose. However,
>> the current BMCWEB implementation references
>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
>> [com.intel.crashdump]
>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>
>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>
>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>> or
>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>> is it exercised in Redfish logservices?
> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
> to copy the crashdump json file to the dump tarball.
> The crashdump tool (Intel or AMD) could trigger a dump after the
> crashdump is completed, and then we could get a dump entry containing
> the crashdump.
Thanks Lei Yu for your input. We are using Redfish to retrieve the CPER
binary file, which can then be passed through a plugin/script for
detailed analysis.
In any case, irrespective of which D-Bus interface we use, we need a
repository that will gather data from the AMD processor via APML per
the AMD design.
APML spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
Can someone please help create a bmc-ras or amd-debug-collector
repository, as there are instances of the openpower-debug-collector
repository being used for OpenPOWER systems?
>
>
> --
> BRs,
> Lei YU


* Re: [RFC] BMC RAS Feature
  2023-03-23  0:07           ` Supreeth Venkatesh
@ 2023-04-03 11:44             ` Patrick Williams
  2023-04-03 16:32               ` Supreeth Venkatesh
       [not found]             ` <d65937a46b6fb4f9f94edbdef44af58e@imap.linux.ibm.com>
  1 sibling, 1 reply; 20+ messages in thread
From: Patrick Williams @ 2023-04-03 11:44 UTC (permalink / raw)
  To: Supreeth Venkatesh
  Cc: Lei Yu, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Ed Tanous, Abinaya.Dhandapani


On Wed, Mar 22, 2023 at 07:07:24PM -0500, Supreeth Venkatesh wrote:
> Can someone please help create bmc-ras or amd-debug-collector repository 
> as there are instances of openpower-debug-collector repository used for 
> Open Power systems?

The typical process for requesting a new repository is to open an issue
to the TOF:

    https://github.com/openbmc/technical-oversight-forum/issues

Ideally you would submit a refreshed version of your design into the
Docs repository, as one of the questions the TOF will likely have is
"what is the proposed design for this repository".  You will also need
to have a list of who you expect to be the maintainers (OWNERS) of this
repository.

There have been a few other issues requesting new repositories in the
last year.  You might want to read those for examples of the kinds of
discussion to expect.

-- 
Patrick Williams



* Re: [RFC] BMC RAS Feature
  2023-04-03 11:44             ` Patrick Williams
@ 2023-04-03 16:32               ` Supreeth Venkatesh
  0 siblings, 0 replies; 20+ messages in thread
From: Supreeth Venkatesh @ 2023-04-03 16:32 UTC (permalink / raw)
  To: Patrick Williams
  Cc: Lei Yu, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Ed Tanous, Abinaya.Dhandapani


On 4/3/23 06:44, Patrick Williams wrote:
> On Wed, Mar 22, 2023 at 07:07:24PM -0500, Supreeth Venkatesh wrote:
>> Can someone please help create bmc-ras or amd-debug-collector repository
>> as there are instances of openpower-debug-collector repository used for
>> Open Power systems?
> The typical process for requesting a new repository is to open an issue
> to the TOF:
>
>      https://github.com/openbmc/technical-oversight-forum/issues
>
> Ideally you would submit a refreshed version of your design into the
> Docs repository, as one of the questions the TOF will likely have is
> "what is the proposed design for this repository".  You will also need
> to have a list of who you expect to be the maintainers (OWNERS) of this
> repository.
>
> There have been a few other issues requesting new repositories in the
> last year.  You might want to read those for examples of the kinds of
> discussion to expect.
>
Thanks Patrick. I have opened an issue 
https://github.com/openbmc/technical-oversight-forum/issues/24




* Re: [RFC] BMC RAS Feature
       [not found]             ` <d65937a46b6fb4f9f94edbdef44af58e@imap.linux.ibm.com>
@ 2023-04-03 16:36               ` Supreeth Venkatesh
  2023-07-21 10:29                 ` J Dhanasekar
  0 siblings, 1 reply; 20+ messages in thread
From: Supreeth Venkatesh @ 2023-04-03 16:36 UTC (permalink / raw)
  To: Zane Shelley
  Cc: Lei Yu, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Ed Tanous, Abinaya.Dhandapani


On 3/23/23 13:57, Zane Shelley wrote:
> Caution: This message originated from an External Source. Use proper 
> caution when opening attachments, clicking links, or responding.
>
>
> On 2023-03-22 19:07, Supreeth Venkatesh wrote:
>> On 3/22/23 02:10, Lei Yu wrote:
>>> Caution: This message originated from an External Source. Use proper
>>> caution when opening attachments, clicking links, or responding.
>>>
>>>
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>>>> <supreeth.venkatesh@amd.com> wrote:
>>>>>
>>>>>
>>>>>      On 3/21/23 05:40, Patrick Williams wrote:
>>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh
>>>>> wrote:
>>>>>      >
>>>>>      >> #### Alternatives Considered
>>>>>      >>
>>>>>      >> In-band mechanisms using System Management Mode (SMM)
>>>>> exists.
>>>>>      >>
>>>>>      >> However, out of band method to gather RAS data is processor
>>>>>      specific.
>>>>>      >>
>>>>>      > How does this compare with existing implementations in
>>>>>      > phosphor-debug-collector.
>>>>>      Thanks for your feedback. See below.
>>>>>      > I believe there was some attempt to extend
>>>>>      > P-D-C previously to handle Intel's crashdump behavior.
>>>>>      Intel's crashdump interface uses com.intel.crashdump.
>>>>>      We have implemented com.amd.crashdump based on that reference.
>>>>>      However,
>>>>>      can this be made generic?
>>>>>
>>>>>      PoC below:
>>>>>
>>>>>      busctl tree com.amd.crashdump
>>>>>
>>>>>      └─/com
>>>>>         └─/com/amd
>>>>>           └─/com/amd/crashdump
>>>>>             ├─/com/amd/crashdump/0
>>>>>             ├─/com/amd/crashdump/1
>>>>>             ├─/com/amd/crashdump/2
>>>>>             ├─/com/amd/crashdump/3
>>>>>             ├─/com/amd/crashdump/4
>>>>>             ├─/com/amd/crashdump/5
>>>>>             ├─/com/amd/crashdump/6
>>>>>             ├─/com/amd/crashdump/7
>>>>>             ├─/com/amd/crashdump/8
>>>>>             └─/com/amd/crashdump/9
>>>>>
>>>>>      > The repository
>>>>>      > currently handles IBM's processors, I think, or maybe that is
>>>>>      covered by
>>>>>      > openpower-debug-collector.
>>>>>      >
>>>>>      > In any case, I think you should look at the existing D-Bus
>>>>>      interfaces
>>>>>      > (and associated Redfish implementation) of these repositories
>>>>> and
>>>>>      > determine if you can use those approaches (or document why
>>>>> now).
>>>>>      I could not find an existing D-Bus interface for RAS in
>>>>>      xyz/openbmc_project/.
>>>>>      It would be helpful if you could point me to it.
>>>>>
>>>>>
>>>>> There is an interface for the dumps generated from the host, which
>>>>> can
>>>>> be used for these kinds of dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
>>>>>
>>>>>
>>>>> The fault log also provides similar dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
>>>>>
>>>>>
>>>> ThanksDdhruvraj. The interface looks useful for the purpose. However,
>>>> the current BMCWEB implementation references
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp 
>>>>
>>>> [com.intel.crashdump]
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>>>
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>>>
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
>>>>
>>>> or
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
>>>>
>>>> is it exercised in Redfish logservices?
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
>>> to copy the crashdump json file to the dump tarball.
>>> The crashdump tool (Intel or AMD) could trigger a dump after the
>>> crashdump is completed, and then we could get a dump entry containing
>>> the crashdump.
>> Thanks Lei Yu for your input. We are using Redfish to retrieve the
>> CPER binary file which can then be passed through a plugin/script for
>> detailed analysis.
>> In any case irrespective of whichever Dbus interface we use, we need a
>> repository which will gather data from AMD processor via APML as per
>> AMD design.
>> APML
>> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
>> Can someone please help create bmc-ras or amd-debug-collector
>> repository as there are instances of openpower-debug-collector
>> repository used for Open Power systems?
>>>
>>>
>>> -- 
>>> BRs,
>>> Lei YU
> I am interested in possibly standardizing some of this. IBM POWER has
> several related components. openpower-hw-diags is a service that will
> listen for the hardware interrupts via a GPIO pin. When an error is
> detected, it will use openpower-libhei to query hardware registers to
> determine what happened. Based on that information openpower-hw-diags
> will generate a PEL, which is an extended log in phosphor-logging, that
> is used to tell service what to replace if necessary. Afterward,
> openpower-hw-diags will initiate openpower-debug-collector, which
> gathers a significant amount of data from the hardware for additional
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It
> uses data files (currently XML, but moving to JSON) to define register
> addresses and rules for isolation. openpower-hw-diags is fairly POWER
> specific, but I can see some parts can be made generic. Dhruv would have
> to help with openpower-debug-collector.
Thank you. Let's collaborate on standardizing some aspects of it.
>
> Regarding creation of a new repository, I think we'll need to have some
> more collaboration to determine the scope before creating it. It
> certainly sounds like we are doing similar things, but we need to
> determine if enough can be abstracted to make it worth our time.
I have put in a request here: 
https://github.com/openbmc/technical-oversight-forum/issues/24
Please chime in.


* Re: [RFC] BMC RAS Feature
  2023-03-21  5:14 [RFC] BMC RAS Feature Supreeth Venkatesh
  2023-03-21 10:40 ` Patrick Williams
@ 2023-07-14 22:05 ` Bills, Jason M
  2023-07-15  9:01   ` dhruvaraj S
  2023-07-24 14:29   ` Venkatesh, Supreeth
  1 sibling, 2 replies; 20+ messages in thread
From: Bills, Jason M @ 2023-07-14 22:05 UTC (permalink / raw)
  To: openbmc

Sorry for missing this earlier.  Here are some of my thoughts.

On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
> 
> #### Requirements
> 
> 1. Collecting RAS/Crashdump shall be processor specific. Hence the use
>     of virtual APIs to allow override for processor specific way of
>     collecting the data.
> 2. Crash data format shall be stored in common platform error record
>     (CPER) format as per UEFI specification
>     [https://uefi.org/specs/UEFI/2.10/].

Do we have to define a single output format? Could it be made flexible
with respect to the format of the collected crash data?

> 3. Configuration parameters of the service shall be standard with scope
>     for processor specific extensions.
> 
> #### Proposed Design
> 
> When one or more processors register a fatal error condition , then an 
> interrupt is generated to the host processor.
> 
> The host processor in the failed state asserts the signal to indicate to 
> the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD 
> processor family]
> 
> BMC RAS application listens on the event [APML_ALERT# in case of AMD 
> processor family ].

The host-error-monitor application provides support for listening for
events and taking actions such as logging or triggering a crashdump,
which may meet this requirement.


One thought may be to break this up into various layers to allow for 
flexibility and standardization. For example:
1. Redfish -> provided by bmcweb which pulls from
2. D-Bus -> provided by a new service which looks for data stored by
3. processor-specific collector -> provided by separate services as 
needed and triggered by
4. platform-specific monitoring service -> provided by 
host-error-monitor or other service as needed.

Ideally, we could make 2 a generic service.
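
[As a sketch of what a generic layer-2 service could look like, here is
a minimal sdbusplus skeleton; the bus name and object path below are
placeholders for discussion, not an agreed interface.]

    // Minimal sketch of a generic crashdump D-Bus service using sdbusplus.
    // The bus name and object path are placeholders, not a proposal of record.
    #include <sdbusplus/bus.hpp>
    #include <sdbusplus/server/manager.hpp>

    int main()
    {
        auto bus = sdbusplus::bus::new_default();

        // Object manager so clients can enumerate crashdump entry objects.
        sdbusplus::server::manager::manager objManager(
            bus, "/xyz/openbmc_project/dump/crashdump");

        bus.request_name("xyz.openbmc_project.Dump.Crashdump");

        while (true) // entry objects would be added as records are collected
        {
            bus.process_discard();
            bus.wait();
        }
    }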

> 
> Upon detection of FATAL error event , BMC will check the status register 
> of the host processor [implementation defined method] to see
> 
> if the assertion is due to the fatal error.
> 
> Upon fatal error , BMC will attempt to harvest crash data from all 
> processors. [via the APML interface (mailbox) in case of AMD processor 
> family].
> 
> BMC will generate a single raw crashdump record and saves it in 
> non-volatile location /var/lib/bmc-ras.
> 



* Re: [RFC] BMC RAS Feature
  2023-07-14 22:05 ` Bills, Jason M
@ 2023-07-15  9:01   ` dhruvaraj S
  2023-07-24 14:29   ` Venkatesh, Supreeth
  1 sibling, 0 replies; 20+ messages in thread
From: dhruvaraj S @ 2023-07-15  9:01 UTC (permalink / raw)
  To: Bills, Jason M; +Cc: openbmc

Please find a few comments on using phosphor-debug-collector for this:

Phosphor-debug-collector employs a set of scripts for BMC dump
collections, which can be customised per processor architecture.
Architecture-specific dump collections are appended to dump-extensions
and activated exclusively on systems that support them, identified by
their corresponding feature code.

Data Format: The data is packaged as a basic tarball or a custom
package according to host specifications.

Event Triggering: The phosphor-debug-collector responds to specific
events to initialize dump creation. A core monitor observes a
designated directory, generating a BMC dump containing the core file
upon event detection. On IBM systems, an attention handler awaits
notifications from processors or the host to trigger dump creation via
phosphor-debug-collector.

Layered Design: The phosphor-debug-collector operates as a
processor-specific collector within the proposed layered architecture,
initiated by a platform-specific monitoring service like the
host-error-monitor. The created dumps are exposed via D-Bus, which can
then be served by bmcweb via Redfish.

Phosphor-debug-collector allows for extensions to accommodate
processor-specific parameters. This is achieved by adjusting the dump
collection scripts in line with the particular processor requirements.

The phosphor-debug-collector interacts with specific applications
during the dump collection process. For example, on IBM systems, it
invokes an IBM-specific application via the dump collection script to
retrieve the dump from the host processor.

On Sat, 15 Jul 2023 at 03:37, Bills, Jason M
<jason.m.bills@linux.intel.com> wrote:
>
> Sorry for missing this earlier.  Here are some of my thoughts.
>
> On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
> >
> > #### Requirements
> >
> > 1. Collecting RAS/Crashdump shall be processor specific. Hence the use
> >     of virtual APIs to allow override for processor specific way of
> >     collecting the data.
> > 2. Crash data format shall be stored in common platform error record
> >     (CPER) format as per UEFI specification
> >     [https://uefi.org/specs/UEFI/2.10/].
>
> Do we have to define a single output format? Could it be made to be
> flexible with the format of the collected crash data?
>
> > 3. Configuration parameters of the service shall be standard with scope
> >     for processor specific extensions.
> >
> > #### Proposed Design
> >
> > When one or more processors register a fatal error condition , then an
> > interrupt is generated to the host processor.
> >
> > The host processor in the failed state asserts the signal to indicate to
> > the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD
> > processor family]
> >
> > BMC RAS application listens on the event [APML_ALERT# in case of AMD
> > processor family ].
>
> The host-error-monitor application provides support for listening for
> events and taking action such as logging or triggering a crashdump that
> may meet this requirement.
>
>
> One thought may be to break this up into various layers to allow for
> flexibility and standardization. For example:
> 1. Redfish -> provided by bmcweb which pulls from
> 2. D-Bus -> provided by a new service which looks for data stored by
> 3. processor-specific collector -> provided by separate services as
> needed and triggered by
> 4. platform-specific monitoring service -> provided by
> host-error-monitor or other service as needed.
>
> Ideally, we could make 2 a generic service.
>
> >
> > Upon detection of FATAL error event , BMC will check the status register
> > of the host processor [implementation defined method] to see
> >
> > if the assertion is due to the fatal error.
> >
> > Upon fatal error , BMC will attempt to harvest crash data from all
> > processors. [via the APML interface (mailbox) in case of AMD processor
> > family].
> >
> > BMC will generate a single raw crashdump record and saves it in
> > non-volatile location /var/lib/bmc-ras.
> >
>


-- 
--------------
Dhruvaraj S


* Re: [RFC] BMC RAS Feature
  2023-04-03 16:36               ` Supreeth Venkatesh
@ 2023-07-21 10:29                 ` J Dhanasekar
  2023-07-21 14:03                   ` Venkatesh, Supreeth
  0 siblings, 1 reply; 20+ messages in thread
From: J Dhanasekar @ 2023-07-21 10:29 UTC (permalink / raw)
  To: Supreeth Venkatesh
  Cc: Lei Yu, Zane Shelley, Michael Shen, openbmc, dhruvaraj S,
	Brad Bishop, Ed Tanous, Abinaya.Dhandapani


Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona platform? I have been working
on OpenBMC development for the Daytonax platform.
If this RAS feature works for the Daytona platform, I will include it in
my project.

Please provide your suggestions.

Thanks,
Dhanasekar

---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh@amd.com> wrote ---




On 3/23/23 13:57, Zane Shelley wrote: 
> Caution: This message originated from an External Source. Use proper 
> caution when opening attachments, clicking links, or responding. 
> 
> 
> On 2023-03-22 19:07, Supreeth Venkatesh wrote: 
>> On 3/22/23 02:10, Lei Yu wrote: 
>>> Caution: This message originated from an External Source. Use proper 
>>> caution when opening attachments, clicking links, or responding. 
>>> 
>>> 
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh 
>>>>> <supreeth.venkatesh@amd.com> wrote: 
>>>>> 
>>>>> 
>>>>>      On 3/21/23 05:40, Patrick Williams wrote: 
>>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh 
>>>>> wrote: 
>>>>>      > 
>>>>>      >> #### Alternatives Considered 
>>>>>      >> 
>>>>>      >> In-band mechanisms using System Management Mode (SMM) 
>>>>> exists. 
>>>>>      >> 
>>>>>      >> However, out of band method to gather RAS data is processor 
>>>>>      specific. 
>>>>>      >> 
>>>>>      > How does this compare with existing implementations in 
>>>>>      > phosphor-debug-collector. 
>>>>>      Thanks for your feedback. See below. 
>>>>>      > I believe there was some attempt to extend 
>>>>>      > P-D-C previously to handle Intel's crashdump behavior. 
>>>>>      Intel's crashdump interface uses com.intel.crashdump. 
>>>>>      We have implemented com.amd.crashdump based on that reference. 
>>>>>      However, 
>>>>>      can this be made generic? 
>>>>> 
>>>>>      PoC below: 
>>>>> 
>>>>>      busctl tree com.amd.crashdump 
>>>>> 
>>>>>      └─/com 
>>>>>         └─/com/amd 
>>>>>           └─/com/amd/crashdump 
>>>>>             ├─/com/amd/crashdump/0 
>>>>>             ├─/com/amd/crashdump/1 
>>>>>             ├─/com/amd/crashdump/2 
>>>>>             ├─/com/amd/crashdump/3 
>>>>>             ├─/com/amd/crashdump/4 
>>>>>             ├─/com/amd/crashdump/5 
>>>>>             ├─/com/amd/crashdump/6 
>>>>>             ├─/com/amd/crashdump/7 
>>>>>             ├─/com/amd/crashdump/8 
>>>>>             └─/com/amd/crashdump/9 
>>>>> 
>>>>>      > The repository 
>>>>>      > currently handles IBM's processors, I think, or maybe that is 
>>>>>      covered by 
>>>>>      > openpower-debug-collector. 
>>>>>      > 
>>>>>      > In any case, I think you should look at the existing D-Bus 
>>>>>      interfaces 
>>>>>      > (and associated Redfish implementation) of these repositories 
>>>>> and 
>>>>>      > determine if you can use those approaches (or document why 
>>>>> now). 
>>>>>      I could not find an existing D-Bus interface for RAS in 
>>>>>      xyz/openbmc_project/. 
>>>>>      It would be helpful if you could point me to it. 
>>>>> 
>>>>> 
>>>>> There is an interface for the dumps generated from the host, which 
>>>>> can 
>>>>> be used for these kinds of dumps 
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
>>>>> 
>>>>> 
>>>>> The fault log also provides similar dumps 
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
>>>>> 
>>>>> 
>>>> ThanksDdhruvraj. The interface looks useful for the purpose. However, 
>>>> the current BMCWEB implementation references 
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp 
>>>> 
>>>> [com.intel.crashdump] 
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump"; 
>>>> 
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump"; 
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump"; 
>>>> 
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
>>>> 
>>>> or 
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
>>>> 
>>>> is it exercised in Redfish logservices? 
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added 
>>> to copy the crashdump json file to the dump tarball. 
>>> The crashdump tool (Intel or AMD) could trigger a dump after the 
>>> crashdump is completed, and then we could get a dump entry containing 
>>> the crashdump. 
>> Thanks Lei Yu for your input. We are using Redfish to retrieve the 
>> CPER binary file which can then be passed through a plugin/script for 
>> detailed analysis. 
>> In any case irrespective of whichever Dbus interface we use, we need a 
>> repository which will gather data from AMD processor via APML as per 
>> AMD design. 
>> APML 
>> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip 
>> Can someone please help create bmc-ras or amd-debug-collector 
>> repository as there are instances of openpower-debug-collector 
>> repository used for Open Power systems? 
>>> 
>>> 
>>> -- 
>>> BRs, 
>>> Lei YU 
> I am interested in possibly standardizing some of this. IBM POWER has 
> several related components. openpower-hw-diags is a service that will 
> listen for the hardware interrupts via a GPIO pin. When an error is 
> detected, it will use openpower-libhei to query hardware registers to 
> determine what happened. Based on that information openpower-hw-diags 
> will generate a PEL, which is an extended log in phosphor-logging, that 
> is used to tell service what to replace if necessary. Afterward, 
> openpower-hw-diags will initiate openpower-debug-collector, which 
> gathers a significant amount of data from the hardware for additional 
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It 
> uses data files (currently XML, but moving to JSON) to define register 
> addresses and rules for isolation. openpower-hw-diags is fairly POWER 
> specific, but I can see some parts can be made generic. Dhruv would have 
> to help with openpower-debug-collector. 
Thank you. Lets collaborate in standardizing some aspects of it. 
> 
> Regarding creation of a new repository, I think we'll need to have some 
> more collaboration to determine the scope before creating it. It 
> certainly sounds like we are doing similar things, but we need to 
> determine if enough can be abstracted to make it worth our time. 
I have put in a request here: 
https://github.com/openbmc/technical-oversight-forum/issues/24 
Please chime in.



* RE: [RFC] BMC RAS Feature
  2023-07-21 10:29                 ` J Dhanasekar
@ 2023-07-21 14:03                   ` Venkatesh, Supreeth
  2023-07-24 13:04                     ` J Dhanasekar
  0 siblings, 1 reply; 20+ messages in thread
From: Venkatesh, Supreeth @ 2023-07-21 14:03 UTC (permalink / raw)
  To: J Dhanasekar
  Cc: Lei Yu, Zane Shelley, Michael Shen, openbmc, dhruvaraj S,
	Brad Bishop, Ed Tanous, Dhandapani, Abinaya



[AMD Official Use Only - General]

Hi Dhanasekar,

It is supported for the EPYC Genoa family and beyond at this time.
Daytona uses the EPYC Milan family, and support is not present there.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Zane Shelley <zshelle@imap.linux.ibm.com>; Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: Re: [RFC] BMC RAS Feature

Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.

Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona Platform.  i have been working in openBMC development for the Daytonax platform.
If this RAS works for Daytona Platform. I will include it in my project.

Please provide your suggestions.

Thanks,
Dhanasekar





---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh@amd.com> wrote ---


On 3/23/23 13:57, Zane Shelley wrote:
> Caution: This message originated from an External Source. Use proper
> caution when opening attachments, clicking links, or responding.
>
>
> On 2023-03-22 19:07, Supreeth Venkatesh wrote:
>> On 3/22/23 02:10, Lei Yu wrote:
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>>>> <supreeth.venkatesh@amd.com> wrote:
>>>>>
>>>>>
>>>>>      On 3/21/23 05:40, Patrick Williams wrote:
>>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh
>>>>> wrote:
>>>>>      >
>>>>>      >> #### Alternatives Considered
>>>>>      >>
>>>>>      >> In-band mechanisms using System Management Mode (SMM)
>>>>> exists.
>>>>>      >>
>>>>>      >> However, out of band method to gather RAS data is processor
>>>>>      specific.
>>>>>      >>
>>>>>      > How does this compare with existing implementations in
>>>>>      > phosphor-debug-collector.
>>>>>      Thanks for your feedback. See below.
>>>>>      > I believe there was some attempt to extend
>>>>>      > P-D-C previously to handle Intel's crashdump behavior.
>>>>>      Intel's crashdump interface uses com.intel.crashdump.
>>>>>      We have implemented com.amd.crashdump based on that reference.
>>>>>      However,
>>>>>      can this be made generic?
>>>>>
>>>>>      PoC below:
>>>>>
>>>>>      busctl tree com.amd.crashdump
>>>>>
>>>>>      └─/com
>>>>>         └─/com/amd
>>>>>           └─/com/amd/crashdump
>>>>>             ├─/com/amd/crashdump/0
>>>>>             ├─/com/amd/crashdump/1
>>>>>             ├─/com/amd/crashdump/2
>>>>>             ├─/com/amd/crashdump/3
>>>>>             ├─/com/amd/crashdump/4
>>>>>             ├─/com/amd/crashdump/5
>>>>>             ├─/com/amd/crashdump/6
>>>>>             ├─/com/amd/crashdump/7
>>>>>             ├─/com/amd/crashdump/8
>>>>>             └─/com/amd/crashdump/9
>>>>>
>>>>>      > The repository
>>>>>      > currently handles IBM's processors, I think, or maybe that is
>>>>>      covered by
>>>>>      > openpower-debug-collector.
>>>>>      >
>>>>>      > In any case, I think you should look at the existing D-Bus
>>>>>      interfaces
>>>>>      > (and associated Redfish implementation) of these repositories
>>>>> and
>>>>>      > determine if you can use those approaches (or document why
>>>>> now).
>>>>>      I could not find an existing D-Bus interface for RAS in
>>>>>      xyz/openbmc_project/.
>>>>>      It would be helpful if you could point me to it.
>>>>>
>>>>>
>>>>> There is an interface for the dumps generated from the host, which
>>>>> can
>>>>> be used for these kinds of dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>>
>>>>>
>>>>> The fault log also provides similar dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>>
>>>>>
>>>> Thanks Dhruvaraj. The interface looks useful for the purpose. However,
>>>> the current BMCWEB implementation references
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
>>>>
>>>> [com.intel.crashdump]
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>>>
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>>>
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>
>>>> or
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>
>>>> is it exercised in Redfish logservices?
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
>>> to copy the crashdump json file to the dump tarball.
>>> The crashdump tool (Intel or AMD) could trigger a dump after the
>>> crashdump is completed, and then we could get a dump entry containing
>>> the crashdump.
>> Thanks Lei Yu for your input. We are using Redfish to retrieve the
>> CPER binary file which can then be passed through a plugin/script for
>> detailed analysis.
>> In any case irrespective of whichever Dbus interface we use, we need a
>> repository which will gather data from AMD processor via APML as per
>> AMD design.
>> APML
>> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
>> Can someone please help create bmc-ras or amd-debug-collector
>> repository as there are instances of openpower-debug-collector
>> repository used for Open Power systems?
>>>
>>>
>>> --
>>> BRs,
>>> Lei YU
> I am interested in possibly standardizing some of this. IBM POWER has
> several related components. openpower-hw-diags is a service that will
> listen for the hardware interrupts via a GPIO pin. When an error is
> detected, it will use openpower-libhei to query hardware registers to
> determine what happened. Based on that information openpower-hw-diags
> will generate a PEL, which is an extended log in phosphor-logging, that
> is used to tell service what to replace if necessary. Afterward,
> openpower-hw-diags will initiate openpower-debug-collector, which
> gathers a significant amount of data from the hardware for additional
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It
> uses data files (currently XML, but moving to JSON) to define register
> addresses and rules for isolation. openpower-hw-diags is fairly POWER
> specific, but I can see some parts can be made generic. Dhruv would have
> to help with openpower-debug-collector.
Thank you. Let's collaborate on standardizing some aspects of it.
>
> Regarding creation of a new repository, I think we'll need to have some
> more collaboration to determine the scope before creating it. It
> certainly sounds like we are doing similar things, but we need to
> determine if enough can be abstracted to make it worth our time.
I have put in a request here:
https://github.com/openbmc/technical-oversight-forum/issues/24
Please chime in.




* RE: [RFC] BMC RAS Feature
  2023-07-21 14:03                   ` Venkatesh, Supreeth
@ 2023-07-24 13:04                     ` J Dhanasekar
  2023-07-24 14:14                       ` Venkatesh, Supreeth
  0 siblings, 1 reply; 20+ messages in thread
From: J Dhanasekar @ 2023-07-24 13:04 UTC (permalink / raw)
  To: Venkatesh, Supreeth
  Cc: Lei Yu, Zane Shelley, Michael Shen, openbmc, dhruvaraj S,
	Brad Bishop, Ed Tanous, Dhandapani,  Abinaya

Hi Supreeth,

Thanks for the info. We hoped that DaytonaX would be upstreamed. Unfortunately, it is not available.
Actually, we need to enable SOL, POST code, and PSU features in Daytona. Will we get support for this feature enablement, or are there any reference implementations available for AMD boards?

Thanks,
Dhanasekar


---- On Fri, 21 Jul 2023 19:33:41 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

It is supported for the EPYC Genoa family and beyond at this time.
Daytona uses the EPYC Milan family, for which support is not available.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Zane Shelley <zshelle@imap.linux.ibm.com>; Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: Re: [RFC] BMC RAS Feature

Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona platform? I have been working on OpenBMC development for the DaytonaX platform.
If this RAS works for the Daytona platform, I will include it in my project.

Please provide your suggestions.

Thanks,
Dhanasekar

---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh@amd.com> wrote ---

On 3/23/23 13:57, Zane Shelley wrote:
 > On 2023-03-22 19:07, Supreeth Venkatesh wrote: 
 >> On 3/22/23 02:10, Lei Yu wrote: 
 >>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh 
 >>>>> <supreeth.venkatesh@amd.com> wrote:
 >>>>> 
 >>>>> 
 >>>>>      On 3/21/23 05:40, Patrick Williams wrote: 
 >>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh 
 >>>>> wrote: 
 >>>>>      > 
 >>>>>      >> #### Alternatives Considered 
 >>>>>      >> 
 >>>>>      >> In-band mechanisms using System Management Mode (SMM) 
 >>>>> exists. 
 >>>>>      >> 
 >>>>>      >> However, out of band method to gather RAS data is processor 
 >>>>>      specific. 
 >>>>>      >> 
 >>>>>      > How does this compare with existing implementations in 
 >>>>>      > phosphor-debug-collector. 
 >>>>>      Thanks for your feedback. See below. 
 >>>>>      > I believe there was some attempt to extend 
 >>>>>      > P-D-C previously to handle Intel's crashdump behavior. 
 >>>>>      Intel's crashdump interface uses com.intel.crashdump. 
 >>>>>      We have implemented com.amd.crashdump based on that reference. 
 >>>>>      However, 
 >>>>>      can this be made generic? 
 >>>>> 
 >>>>>      PoC below: 
 >>>>> 
 >>>>>      busctl tree com.amd.crashdump 
 >>>>> 
 >>>>>      └─/com 
 >>>>>         └─/com/amd 
 >>>>>           └─/com/amd/crashdump 
 >>>>>             ├─/com/amd/crashdump/0 
 >>>>>             ├─/com/amd/crashdump/1 
 >>>>>             ├─/com/amd/crashdump/2 
 >>>>>             ├─/com/amd/crashdump/3 
 >>>>>             ├─/com/amd/crashdump/4 
 >>>>>             ├─/com/amd/crashdump/5 
 >>>>>             ├─/com/amd/crashdump/6 
 >>>>>             ├─/com/amd/crashdump/7 
 >>>>>             ├─/com/amd/crashdump/8 
 >>>>>             └─/com/amd/crashdump/9 
 >>>>> 
 >>>>>      > The repository 
 >>>>>      > currently handles IBM's processors, I think, or maybe that is 
 >>>>>      covered by 
 >>>>>      > openpower-debug-collector. 
 >>>>>      > 
 >>>>>      > In any case, I think you should look at the existing D-Bus 
 >>>>>      interfaces 
 >>>>>      > (and associated Redfish implementation) of these repositories 
 >>>>> and 
 >>>>>      > determine if you can use those approaches (or document why 
 >>>>> now). 
 >>>>>      I could not find an existing D-Bus interface for RAS in 
 >>>>>      xyz/openbmc_project/. 
 >>>>>      It would be helpful if you could point me to it. 
 >>>>> 
 >>>>> 
 >>>>> There is an interface for the dumps generated from the host, which 
 >>>>> can 
 >>>>> be used for these kinds of dumps 
 >>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
 >>>>> 
 >>>>> 
 >>>>> The fault log also provides similar dumps 
 >>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
 >>>>> 
 >>>>> 
 >>>> Thanks Dhruvaraj. The interface looks useful for the purpose. However,
 >>>> the current BMCWEB implementation references 
 >>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp 
 >>>> 
 >>>> [com.intel.crashdump] 
 >>>> constexpr char const* crashdumpPath = "/com/intel/crashdump"; 
 >>>> 
 >>>> constexpr char const* crashdumpInterface = "com.intel.crashdump"; 
 >>>> constexpr char const* crashdumpObject = "com.intel.crashdump"; 
 >>>> 
 >>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
 >>>> 
 >>>> or 
 >>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
 >>>> 
 >>>> is it exercised in Redfish logservices? 
 >>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added 
 >>> to copy the crashdump json file to the dump tarball. 
 >>> The crashdump tool (Intel or AMD) could trigger a dump after the 
 >>> crashdump is completed, and then we could get a dump entry containing 
 >>> the crashdump. 
 >> Thanks Lei Yu for your input. We are using Redfish to retrieve the 
 >> CPER binary file which can then be passed through a plugin/script for 
 >> detailed analysis. 
 >> In any case irrespective of whichever Dbus interface we use, we need a 
 >> repository which will gather data from AMD processor via APML as per 
 >> AMD design. 
 >> APML 
 >> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip 
 >> Can someone please help create bmc-ras or amd-debug-collector 
 >> repository as there are instances of openpower-debug-collector 
 >> repository used for Open Power systems? 
 >>> 
 >>> 
 >>> -- 
 >>> BRs, 
 >>> Lei YU 
 > I am interested in possibly standardizing some of this. IBM POWER has 
 > several related components. openpower-hw-diags is a service that will 
 > listen for the hardware interrupts via a GPIO pin. When an error is 
 > detected, it will use openpower-libhei to query hardware registers to 
 > determine what happened. Based on that information openpower-hw-diags 
 > will generate a PEL, which is an extended log in phosphor-logging, that 
 > is used to tell service what to replace if necessary. Afterward, 
 > openpower-hw-diags will initiate openpower-debug-collector, which 
 > gathers a significant amount of data from the hardware for additional 
 > debug when necessary. I wrote openpower-libhei to be fairly agnostic. It 
 > uses data files (currently XML, but moving to JSON) to define register 
 > addresses and rules for isolation. openpower-hw-diags is fairly POWER 
 > specific, but I can see some parts can be made generic. Dhruv would have 
 > to help with openpower-debug-collector. 
 Thank you. Let's collaborate on standardizing some aspects of it.
 > 
 > Regarding creation of a new repository, I think we'll need to have some 
 > more collaboration to determine the scope before creating it. It 
 > certainly sounds like we are doing similar things, but we need to 
 > determine if enough can be abstracted to make it worth our time. 
 I have put in a request here: 
 https://github.com/openbmc/technical-oversight-forum/issues/24 
 Please chime in.

* RE: [RFC] BMC RAS Feature
  2023-07-24 13:04                     ` J Dhanasekar
@ 2023-07-24 14:14                       ` Venkatesh, Supreeth
  2023-07-25 13:09                         ` J Dhanasekar
  0 siblings, 1 reply; 20+ messages in thread
From: Venkatesh, Supreeth @ 2023-07-24 14:14 UTC (permalink / raw)
  To: J Dhanasekar
  Cc: Lei Yu, Zane Shelley, Michael Shen, openbmc, dhruvaraj S,
	Brad Bishop, Ed Tanous, Dhandapani, Abinaya


Hi Dhanasekar,

DaytonaX and EthanolX platforms were only OpenBMC PoCs with limited functionality.
We are in the process of upstreaming new AMD CRBs with OpenBMC, which have all the functionality you mention below.
A public instance of the staging/intermediary repository before upstream is here:
https://github.com/AMDESE/OpenBMC

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Monday, July 24, 2023 8:04 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Lei Yu <yulei.sh@bytedance.com>; Zane Shelley <zshelle@imap.linux.ibm.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: RE: [RFC] BMC RAS Feature

Hi Supreeth,

Thanks for the info. We hoped that DaytonaX would be upstreamed. Unfortunately, it is not available.
Actually, we need to enable SOL, POST code, and PSU features in Daytona. Will we get support for this feature enablement, or are there any reference implementations available for AMD boards?

Thanks,
Dhanasekar

---- On Fri, 21 Jul 2023 19:33:41 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

It is supported for the EPYC Genoa family and beyond at this time.
Daytona uses the EPYC Milan family, for which support is not available.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Zane Shelley <zshelle@imap.linux.ibm.com>; Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: Re: [RFC] BMC RAS Feature

Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona platform? I have been working on OpenBMC development for the DaytonaX platform.
If this RAS works for the Daytona platform, I will include it in my project.

Please provide your suggestions.

Thanks,
Dhanasekar

---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh@amd.com> wrote ---


On 3/23/23 13:57, Zane Shelley wrote:
> On 2023-03-22 19:07, Supreeth Venkatesh wrote:
>> On 3/22/23 02:10, Lei Yu wrote:
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>>>> <supreeth.venkatesh@amd.com> wrote:
>>>>>
>>>>>
>>>>>      On 3/21/23 05:40, Patrick Williams wrote:
>>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh
>>>>> wrote:
>>>>>      >
>>>>>      >> #### Alternatives Considered
>>>>>      >>
>>>>>      >> In-band mechanisms using System Management Mode (SMM)
>>>>> exists.
>>>>>      >>
>>>>>      >> However, out of band method to gather RAS data is processor
>>>>>      specific.
>>>>>      >>
>>>>>      > How does this compare with existing implementations in
>>>>>      > phosphor-debug-collector.
>>>>>      Thanks for your feedback. See below.
>>>>>      > I believe there was some attempt to extend
>>>>>      > P-D-C previously to handle Intel's crashdump behavior.
>>>>>      Intel's crashdump interface uses com.intel.crashdump.
>>>>>      We have implemented com.amd.crashdump based on that reference.
>>>>>      However,
>>>>>      can this be made generic?
>>>>>
>>>>>      PoC below:
>>>>>
>>>>>      busctl tree com.amd.crashdump
>>>>>
>>>>>      └─/com
>>>>>         └─/com/amd
>>>>>           └─/com/amd/crashdump
>>>>>             ├─/com/amd/crashdump/0
>>>>>             ├─/com/amd/crashdump/1
>>>>>             ├─/com/amd/crashdump/2
>>>>>             ├─/com/amd/crashdump/3
>>>>>             ├─/com/amd/crashdump/4
>>>>>             ├─/com/amd/crashdump/5
>>>>>             ├─/com/amd/crashdump/6
>>>>>             ├─/com/amd/crashdump/7
>>>>>             ├─/com/amd/crashdump/8
>>>>>             └─/com/amd/crashdump/9
>>>>>
>>>>>      > The repository
>>>>>      > currently handles IBM's processors, I think, or maybe that is
>>>>>      covered by
>>>>>      > openpower-debug-collector.
>>>>>      >
>>>>>      > In any case, I think you should look at the existing D-Bus
>>>>>      interfaces
>>>>>      > (and associated Redfish implementation) of these repositories
>>>>> and
>>>>>      > determine if you can use those approaches (or document why
>>>>> now).
>>>>>      I could not find an existing D-Bus interface for RAS in
>>>>>      xyz/openbmc_project/.
>>>>>      It would be helpful if you could point me to it.
>>>>>
>>>>>
>>>>> There is an interface for the dumps generated from the host, which
>>>>> can
>>>>> be used for these kinds of dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>>
>>>>>
>>>>> The fault log also provides similar dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>>
>>>>>
>>>> Thanks Dhruvaraj. The interface looks useful for the purpose. However,
>>>> the current BMCWEB implementation references
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
>>>>
>>>> [com.intel.crashdump]
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>>>
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>>>
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>
>>>> or
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>
>>>> is it exercised in Redfish logservices?
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
>>> to copy the crashdump json file to the dump tarball.
>>> The crashdump tool (Intel or AMD) could trigger a dump after the
>>> crashdump is completed, and then we could get a dump entry containing
>>> the crashdump.
>> Thanks Lei Yu for your input. We are using Redfish to retrieve the
>> CPER binary file which can then be passed through a plugin/script for
>> detailed analysis.
>> In any case irrespective of whichever Dbus interface we use, we need a
>> repository which will gather data from AMD processor via APML as per
>> AMD design.
>> APML
>> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
>> Can someone please help create bmc-ras or amd-debug-collector
>> repository as there are instances of openpower-debug-collector
>> repository used for Open Power systems?
>>>
>>>
>>> --
>>> BRs,
>>> Lei YU
> I am interested in possibly standardizing some of this. IBM POWER has
> several related components. openpower-hw-diags is a service that will
> listen for the hardware interrupts via a GPIO pin. When an error is
> detected, it will use openpower-libhei to query hardware registers to
> determine what happened. Based on that information openpower-hw-diags
> will generate a PEL, which is an extended log in phosphor-logging, that
> is used to tell service what to replace if necessary. Afterward,
> openpower-hw-diags will initiate openpower-debug-collector, which
> gathers a significant amount of data from the hardware for additional
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It
> uses data files (currently XML, but moving to JSON) to define register
> addresses and rules for isolation. openpower-hw-diags is fairly POWER
> specific, but I can see some parts can be made generic. Dhruv would have
> to help with openpower-debug-collector.
Thank you. Let's collaborate on standardizing some aspects of it.
>
> Regarding creation of a new repository, I think we'll need to have some
> more collaboration to determine the scope before creating it. It
> certainly sounds like we are doing similar things, but we need to
> determine if enough can be abstracted to make it worth our time.
I have put in a request here:
https://github.com/openbmc/technical-oversight-forum/issues/24
Please chime in.






* RE: [RFC] BMC RAS Feature
  2023-07-14 22:05 ` Bills, Jason M
  2023-07-15  9:01   ` dhruvaraj S
@ 2023-07-24 14:29   ` Venkatesh, Supreeth
  1 sibling, 0 replies; 20+ messages in thread
From: Venkatesh, Supreeth @ 2023-07-24 14:29 UTC (permalink / raw)
  To: Bills, Jason M, openbmc


Thanks for your feedback, Jason. Sorry for the delay in my response.

1. The format can be anything. [We could use phosphor-debug-collector, which collects different debug dumps.]
2. Agree with this path:
        i. Redfish -> provided by bmcweb, which pulls from
        ii. D-Bus -> provided by a new service, which looks for data stored by
        iii. processor-specific collector -> provided by separate services as needed and triggered by
        iv. platform-specific monitoring service -> provided by host-error-monitor or another service as needed.
We need a repository for the processor-specific collector.
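
On point 1, since CPER keeps coming up as the candidate format: every CPER record begins with a fixed 128-byte header, sketched below for orientation. The field layout is paraphrased from the UEFI 2.10 specification (Appendix N); treat it as illustrative and verify against the spec before relying on it.

// Sketch of the fixed CPER record header (UEFI 2.10, Appendix N).
// Paraphrased for illustration; verify against the specification.
#include <cstdint>

struct Guid
{
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint8_t  data4[8];
};

struct __attribute__((packed)) CperRecordHeader
{
    uint32_t signatureStart;   // 'CPER'
    uint16_t revision;
    uint32_t signatureEnd;     // 0xFFFFFFFF
    uint16_t sectionCount;     // section descriptors following this header
    uint32_t errorSeverity;    // corrected / fatal / recoverable / informational
    uint32_t validationBits;
    uint32_t recordLength;     // header + descriptors + sections, in bytes
    uint64_t timestamp;
    Guid     platformId;
    Guid     partitionId;
    Guid     creatorId;        // e.g. a BMC-specific GUID
    Guid     notificationType; // e.g. machine check, PCIe, etc.
    uint64_t recordId;
    uint32_t flags;
    uint64_t persistenceInfo;
    uint8_t  reserved[12];
};

static_assert(sizeof(CperRecordHeader) == 128, "CPER header must be 128 bytes");

A generic dump entry could carry the raw record as-is and leave decoding to offline tooling, which keeps the D-Bus layer processor-agnostic.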

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software


-----Original Message-----
From: openbmc <openbmc-bounces+supreeth.venkatesh=amd.com@lists.ozlabs.org> On Behalf Of Bills, Jason M
Sent: Friday, July 14, 2023 5:05 PM
To: openbmc@lists.ozlabs.org
Subject: Re: [RFC] BMC RAS Feature



Sorry for missing this earlier.  Here are some of my thoughts.

On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
>
> #### Requirements
>
> 1. Collecting RAS/Crashdump shall be processor specific. Hence the use
>     of virtual APIs to allow override for processor specific way of
>     collecting the data.
> 2. Crash data format shall be stored in common platform error record
>     (CPER) format as per UEFI specification
>     [https://uefi.org/specs/UEFI/2.10/].

Do we have to define a single output format? Could it be made flexible with respect to the format of the collected crash data?

> 3. Configuration parameters of the service shall be standard with scope
>     for processor specific extensions.
>
> #### Proposed Design
>
> When one or more processors register a fatal error condition , then an
> interrupt is generated to the host processor.
>
> The host processor in the failed state asserts the signal to indicate
> to the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD
> processor family]
>
> BMC RAS application listens on the event [APML_ALERT# in case of AMD
> processor family ].

The host-error-monitor application provides support for listening for events and taking actions such as logging or triggering a crashdump, which may meet this requirement.


One thought may be to break this up into various layers to allow for flexibility and standardization. For example:
1. Redfish -> provided by bmcweb which pulls from
2. D-Bus -> provided by a new service which looks for data stored by
3. processor-specific collector -> provided by separate services as needed and triggered by
4. platform-specific monitoring service -> provided by host-error-monitor or other service as needed.

Ideally, we could make 2 a generic service.
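
A rough sketch of what that generic layer-2 service might look like with sdbusplus follows; the service name, object path, and interface name are hypothetical placeholders, not an agreed-upon OpenBMC interface.

// Hypothetical generic crashdump-store service (sketch, not an agreed design).
#include <boost/asio/io_context.hpp>
#include <sdbusplus/asio/connection.hpp>
#include <sdbusplus/asio/object_server.hpp>

#include <memory>
#include <string>

int main()
{
    boost::asio::io_context io;
    auto conn = std::make_shared<sdbusplus::asio::connection>(io);
    conn->request_name("xyz.openbmc_project.CrashdumpStore"); // hypothetical

    sdbusplus::asio::object_server server(conn);

    // One object per record; a real service would enumerate whatever the
    // processor-specific collector wrote (e.g. under /var/lib/bmc-ras).
    auto entry = server.add_interface(
        "/xyz/openbmc_project/crashdump/0",     // hypothetical path
        "xyz.openbmc_project.Crashdump.Entry"); // hypothetical interface

    entry->register_property("Filename",
                             std::string("/var/lib/bmc-ras/record0.cper"));
    entry->register_property("Format", std::string("CPER"));
    entry->initialize();

    io.run(); // serve until stopped
}

bmcweb (layer 1) could then find these objects through the object mapper and expose each record as a Redfish log entry, independent of which processor-specific collector produced the file.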

>
> Upon detection of FATAL error event , BMC will check the status
> register of the host processor [implementation defined method] to see
>
> if the assertion is due to the fatal error.
>
> Upon fatal error , BMC will attempt to harvest crash data from all
> processors. [via the APML interface (mailbox) in case of AMD processor
> family].
>
> BMC will generate a single raw crashdump record and saves it in
> non-volatile location /var/lib/bmc-ras.
>



* RE: [RFC] BMC RAS Feature
  2023-07-24 14:14                       ` Venkatesh, Supreeth
@ 2023-07-25 13:09                         ` J Dhanasekar
  2023-07-25 14:02                           ` Venkatesh, Supreeth
  0 siblings, 1 reply; 20+ messages in thread
From: J Dhanasekar @ 2023-07-25 13:09 UTC (permalink / raw)
  To: Venkatesh, Supreeth
  Cc: Lei Yu, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Ed Tanous, Dhandapani,  Abinaya

Hi Supreeth,

I am working on SP5 servers too. SP5 servers have the ASPEED AST2600 chip and the BMC is off the board, whereas EthanolX/DaytonaX has the AST2500 and the BMC is on the board.
Will the algorithms and steps for implementing the functionalities (SOL, PostCode, PSU, ...) remain the same?

Thanks,
Dhanasekar


---- On Mon, 24 Jul 2023 19:44:52 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

DaytonaX and EthanolX platforms were only OpenBMC PoCs with limited functionality.
We are in the process of upstreaming new AMD CRBs with OpenBMC, which have all the functionality you mention below.
A public instance of the staging/intermediary repository before upstream is here:
https://github.com/AMDESE/OpenBMC

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Monday, July 24, 2023 8:04 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Lei Yu <yulei.sh@bytedance.com>; Zane Shelley <zshelle@imap.linux.ibm.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: RE: [RFC] BMC RAS Feature

Hi Supreeth,

Thanks for the info. We hoped that DaytonaX would be upstreamed. Unfortunately, it is not available.
Actually, we need to enable SOL, POST code, and PSU features in Daytona. Will we get support for this feature enablement, or are there any reference implementations available for AMD boards?

Thanks,
Dhanasekar

---- On Fri, 21 Jul 2023 19:33:41 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

It is supported for the EPYC Genoa family and beyond at this time.
Daytona uses the EPYC Milan family, for which support is not available.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Zane Shelley <zshelle@imap.linux.ibm.com>; Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: Re: [RFC] BMC RAS Feature

Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona platform? I have been working on OpenBMC development for the DaytonaX platform.
If this RAS works for the Daytona platform, I will include it in my project.

Please provide your suggestions.

Thanks,
Dhanasekar

---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh@amd.com> wrote ---

On 3/23/23 13:57, Zane Shelley wrote:
 > On 2023-03-22 19:07, Supreeth Venkatesh wrote: 
 >> On 3/22/23 02:10, Lei Yu wrote: 
 >>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh  
 >>>>> <supreeth.venkatesh@amd.com> wrote:
 >>>>> 
 >>>>> 
 >>>>>      On 3/21/23 05:40, Patrick Williams wrote: 
 >>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh 
 >>>>> wrote: 
 >>>>>      > 
 >>>>>      >> #### Alternatives Considered 
 >>>>>      >> 
 >>>>>      >> In-band mechanisms using System Management Mode (SMM) 
 >>>>> exists. 
 >>>>>      >> 
 >>>>>      >> However, out of band method to gather RAS data is processor 
 >>>>>      specific. 
 >>>>>      >> 
 >>>>>      > How does this compare with existing implementations in 
 >>>>>      > phosphor-debug-collector. 
 >>>>>      Thanks for your feedback. See below. 
 >>>>>      > I believe there was some attempt to extend  
 >>>>>      > P-D-C previously to handle Intel's crashdump behavior. 
 >>>>>      Intel's crashdump interface uses com.intel.crashdump. 
 >>>>>      We have implemented com.amd.crashdump based on that reference. 
 >>>>>      However, 
 >>>>>      can this be made generic? 
 >>>>> 
 >>>>>      PoC below: 
 >>>>> 
 >>>>>      busctl tree com.amd.crashdump 
 >>>>> 
 >>>>>      └─/com 
 >>>>>         └─/com/amd 
 >>>>>           └─/com/amd/crashdump 
 >>>>>             ├─/com/amd/crashdump/0 
 >>>>>             ├─/com/amd/crashdump/1 
 >>>>>             ├─/com/amd/crashdump/2 
 >>>>>             ├─/com/amd/crashdump/3 
 >>>>>             ├─/com/amd/crashdump/4 
 >>>>>             ├─/com/amd/crashdump/5 
 >>>>>             ├─/com/amd/crashdump/6 
 >>>>>             ├─/com/amd/crashdump/7 
 >>>>>             ├─/com/amd/crashdump/8 
 >>>>>             └─/com/amd/crashdump/9 
 >>>>> 
 >>>>>      > The repository 
 >>>>>      > currently handles IBM's processors, I think, or maybe that is 
 >>>>>      covered by 
 >>>>>      > openpower-debug-collector. 
 >>>>>      > 
 >>>>>      > In any case, I think you should look at the existing D-Bus 
 >>>>>      interfaces 
 >>>>>      > (and associated Redfish implementation) of these repositories 
 >>>>> and 
 >>>>>      > determine if you can use those approaches (or document why 
 >>>>> now). 
 >>>>>      I could not find an existing D-Bus interface for RAS in 
 >>>>>      xyz/openbmc_project/. 
 >>>>>      It would be helpful if you could point me to it.  
 >>>>> 
 >>>>> 
 >>>>> There is an interface for the dumps generated from the host, which 
 >>>>> can 
 >>>>> be used for these kinds of dumps 
 >>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
 >>>>> 
 >>>>> 
 >>>>> The fault log also provides similar dumps 
 >>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
 >>>>> 
 >>>>> 
 >>>> Thanks Dhruvaraj. The interface looks useful for the purpose. However,
 >>>> the current BMCWEB implementation references 
 >>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp 
 >>>> 
 >>>> [com.intel.crashdump] 
 >>>> constexpr char const* crashdumpPath = "/com/intel/crashdump"; 
 >>>> 
 >>>> constexpr char const* crashdumpInterface = "com.intel.crashdump"; 
 >>>> constexpr char const* crashdumpObject = "com.intel.crashdump"; 
 >>>> 
 >>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
 >>>> 
 >>>> or 
 >>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
 >>>> 
 >>>> is it exercised in Redfish logservices? 
 >>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added 
 >>> to copy the crashdump json file to the dump tarball.  
 >>> The crashdump tool (Intel or AMD) could trigger a dump after the 
 >>> crashdump is completed, and then we could get a dump entry containing 
 >>> the crashdump. 
 >> Thanks Lei Yu for your input. We are using Redfish to retrieve the 
 >> CPER binary file which can then be passed through a plugin/script for 
 >> detailed analysis. 
 >> In any case irrespective of whichever Dbus interface we use, we need a 
 >> repository which will gather data from AMD processor via APML as per 
 >> AMD design. 
 >> APML 
 >> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip 
 >> Can someone please help create bmc-ras or amd-debug-collector 
 >> repository as there are instances of openpower-debug-collector 
 >> repository used for Open Power systems? 
 >>> 
 >>> 
 >>> -- 
 >>> BRs, 
 >>> Lei YU 
 > I am interested in possibly standardizing some of this. IBM POWER has 
 > several related components. openpower-hw-diags is a service that will 
 > listen for the hardware interrupts via a GPIO pin. When an error is 
 > detected, it will use openpower-libhei to query hardware registers to 
 > determine what happened. Based on that information openpower-hw-diags 
 > will generate a PEL, which is an extended log in phosphor-logging, that 
 > is used to tell service what to replace if necessary. Afterward, 
 > openpower-hw-diags will initiate openpower-debug-collector, which 
 > gathers a significant amount of data from the hardware for additional 
 > debug when necessary. I wrote openpower-libhei to be fairly agnostic. It 
 > uses data files (currently XML, but moving to JSON) to define register 
 > addresses and rules for isolation. openpower-hw-diags is fairly POWER 
 > specific, but I can see some parts can be made generic. Dhruv would have 
 > to help with openpower-debug-collector. 
 Thank you. Let's collaborate on standardizing some aspects of it.
 > 
 > Regarding creation of a new repository, I think we'll need to have some 
 > more collaboration to determine the scope before creating it. It 
 > certainly sounds like we are doing similar things, but we need to 
 > determine if enough can be abstracted to make it worth our time. 
 I have put in a request here: 
 https://github.com/openbmc/technical-oversight-forum/issues/24 
 Please chime in.

* RE: [RFC] BMC RAS Feature
  2023-07-25 13:09                         ` J Dhanasekar
@ 2023-07-25 14:02                           ` Venkatesh, Supreeth
  2023-07-27 10:20                             ` J Dhanasekar
  0 siblings, 1 reply; 20+ messages in thread
From: Venkatesh, Supreeth @ 2023-07-25 14:02 UTC (permalink / raw)
  To: J Dhanasekar
  Cc: Lei Yu, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Ed Tanous, Dhandapani, Abinaya


Hi Dhanasekar,

Algorithms and steps for implementing these functionalities (SOL, PostCode, ...) will be the same.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Tuesday, July 25, 2023 8:09 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: RE: [RFC] BMC RAS Feature

Hi Supreeth,

I am working on SP5 servers too. SP5 servers have the ASPEED AST2600 chip and the BMC is off the board, whereas EthanolX/DaytonaX has the AST2500 and the BMC is on the board.
Will the algorithms and steps for implementing the functionalities (SOL, PostCode, PSU, ...) remain the same?

Thanks,
Dhanasekar

---- On Mon, 24 Jul 2023 19:44:52 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

DaytonaX and EthanolX platforms were only OpenBMC PoCs with limited functionality.
We are in the process of upstreaming new AMD CRBs with OpenBMC, which have all the functionality you mention below.
A public instance of the staging/intermediary repository before upstream is here:
https://github.com/AMDESE/OpenBMC

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Monday, July 24, 2023 8:04 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Lei Yu <yulei.sh@bytedance.com>; Zane Shelley <zshelle@imap.linux.ibm.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: RE: [RFC] BMC RAS Feature

Hi Supreeth,

Thanks for the info. We hoped that DaytonaX would be upstreamed. Unfortunately, it is not available.
Actually, we need to enable SOL, POST code, and PSU features in Daytona. Will we get support for this feature enablement, or are there any reference implementations available for AMD boards?

Thanks,
Dhanasekar

---- On Fri, 21 Jul 2023 19:33:41 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

It is supported for the EPYC Genoa family and beyond at this time.
Daytona uses the EPYC Milan family, for which support is not available.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Zane Shelley <zshelle@imap.linux.ibm.com>; Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: Re: [RFC] BMC RAS Feature

Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona platform? I have been working on OpenBMC development for the DaytonaX platform.
If this RAS works for the Daytona platform, I will include it in my project.

Please provide your suggestions.

Thanks,
Dhanasekar

---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh@amd.com> wrote ---


On 3/23/23 13:57, Zane Shelley wrote:
> On 2023-03-22 19:07, Supreeth Venkatesh wrote:
>> On 3/22/23 02:10, Lei Yu wrote:
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>>>> <supreeth.venkatesh@amd.com<mailto:supreeth.venkatesh@amd.com>> wrote:
>>>>>
>>>>>
>>>>>      On 3/21/23 05:40, Patrick Williams wrote:
>>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh
>>>>> wrote:
>>>>>      >
>>>>>      >> #### Alternatives Considered
>>>>>      >>
>>>>>      >> In-band mechanisms using System Management Mode (SMM)
>>>>> exists.
>>>>>      >>
>>>>>      >> However, out of band method to gather RAS data is processor
>>>>>      specific.
>>>>>      >>
>>>>>      > How does this compare with existing implementations in
>>>>>      > phosphor-debug-collector.
>>>>>      Thanks for your feedback. See below.
>>>>>      > I believe there was some attempt to extend
>>>>>      > P-D-C previously to handle Intel's crashdump behavior.
>>>>>      Intel's crashdump interface uses com.intel.crashdump.
>>>>>      We have implemented com.amd.crashdump based on that reference.
>>>>>      However,
>>>>>      can this be made generic?
>>>>>
>>>>>      PoC below:
>>>>>
>>>>>      busctl tree com.amd.crashdump
>>>>>
>>>>>      └─/com
>>>>>         └─/com/amd
>>>>>           └─/com/amd/crashdump
>>>>>             ├─/com/amd/crashdump/0
>>>>>             ├─/com/amd/crashdump/1
>>>>>             ├─/com/amd/crashdump/2
>>>>>             ├─/com/amd/crashdump/3
>>>>>             ├─/com/amd/crashdump/4
>>>>>             ├─/com/amd/crashdump/5
>>>>>             ├─/com/amd/crashdump/6
>>>>>             ├─/com/amd/crashdump/7
>>>>>             ├─/com/amd/crashdump/8
>>>>>             └─/com/amd/crashdump/9
>>>>>
>>>>>      > The repository
>>>>>      > currently handles IBM's processors, I think, or maybe that is
>>>>>      covered by
>>>>>      > openpower-debug-collector.
>>>>>      >
>>>>>      > In any case, I think you should look at the existing D-Bus
>>>>>      interfaces
>>>>>      > (and associated Redfish implementation) of these repositories
>>>>> and
>>>>>      > determine if you can use those approaches (or document why
>>>>> now).
>>>>>      I could not find an existing D-Bus interface for RAS in
>>>>>      xyz/openbmc_project/.
>>>>>      It would be helpful if you could point me to it.
>>>>>
>>>>>
>>>>> There is an interface for the dumps generated from the host, which
>>>>> can
>>>>> be used for these kinds of dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>>
>>>>>
>>>>> The fault log also provides similar dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>>
>>>>>
 >>>> Thanks Dhruvaraj. The interface looks useful for the purpose. However,
>>>> the current BMCWEB implementation references
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
>>>>
>>>> [com.intel.crashdump]
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>>>
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>>>
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>
>>>> or
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>
>>>> is it exercised in Redfish logservices?
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
>>> to copy the crashdump json file to the dump tarball.
>>> The crashdump tool (Intel or AMD) could trigger a dump after the
>>> crashdump is completed, and then we could get a dump entry containing
>>> the crashdump.
>> Thanks Lei Yu for your input. We are using Redfish to retrieve the
>> CPER binary file which can then be passed through a plugin/script for
>> detailed analysis.
>> In any case irrespective of whichever Dbus interface we use, we need a
>> repository which will gather data from AMD processor via APML as per
>> AMD design.
>> APML
>> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
>> Can someone please help create bmc-ras or amd-debug-collector
>> repository as there are instances of openpower-debug-collector
>> repository used for Open Power systems?
>>>
>>>
>>> --
>>> BRs,
>>> Lei YU
> I am interested in possibly standardizing some of this. IBM POWER has
> several related components. openpower-hw-diags is a service that will
> listen for the hardware interrupts via a GPIO pin. When an error is
> detected, it will use openpower-libhei to query hardware registers to
> determine what happened. Based on that information openpower-hw-diags
> will generate a PEL, which is an extended log in phosphor-logging, that
> is used to tell service what to replace if necessary. Afterward,
> openpower-hw-diags will initiate openpower-debug-collector, which
> gathers a significant amount of data from the hardware for additional
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It
> uses data files (currently XML, but moving to JSON) to define register
> addresses and rules for isolation. openpower-hw-diags is fairly POWER
> specific, but I can see some parts can be made generic. Dhruv would have
> to help with openpower-debug-collector.
 Thank you. Let's collaborate on standardizing some aspects of it.
>
> Regarding creation of a new repository, I think we'll need to have some
> more collaboration to determine the scope before creating it. It
> certainly sounds like we are doing similar things, but we need to
> determine if enough can be abstracted to make it worth our time.
I have put in a request here:
https://github.com/openbmc/technical-oversight-forum/issues/24
 Please chime in.

* RE: [RFC] BMC RAS Feature
  2023-07-25 14:02                           ` Venkatesh, Supreeth
@ 2023-07-27 10:20                             ` J Dhanasekar
  0 siblings, 0 replies; 20+ messages in thread
From: J Dhanasekar @ 2023-07-27 10:20 UTC (permalink / raw)
  To: Venkatesh, Supreeth
  Cc: Lei Yu, Michael Shen, openbmc, dhruvaraj S, Brad Bishop,
	Ed Tanous, Dhandapani,  Abinaya

Hi Supreeth,

Thanks for the info.

-Dhanasekar


---- On Tue, 25 Jul 2023 19:32:59 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

Algorithms and steps for implementing these functionalities (SOL, PostCode, ...) will be the same.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Tuesday, July 25, 2023 8:09 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: RE: [RFC] BMC RAS Feature

Hi Supreeth,

I am working on SP5 servers too. SP5 servers have the ASPEED AST2600 chip and the BMC is off the board, whereas EthanolX/DaytonaX has the AST2500 and the BMC is on the board.
Will the algorithms and steps for implementing the functionalities (SOL, PostCode, PSU, ...) remain the same?

Thanks,
Dhanasekar

---- On Mon, 24 Jul 2023 19:44:52 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

DaytonaX and EthanolX platforms were only OpenBMC PoCs with limited functionality.
We are in the process of upstreaming new AMD CRBs with OpenBMC, which have all the functionality you mention below.
A public instance of the staging/intermediary repository before upstream is here:
https://github.com/AMDESE/OpenBMC

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Monday, July 24, 2023 8:04 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Lei Yu <yulei.sh@bytedance.com>; Zane Shelley <zshelle@imap.linux.ibm.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: RE: [RFC] BMC RAS Feature

Hi Supreeth,

Thanks for the info. We hoped that DaytonaX would be upstreamed. Unfortunately, it is not available.
Actually, we need to enable SOL, POST code, and PSU features in Daytona. Will we get support for this feature enablement, or are there any reference implementations available for AMD boards?

Thanks,
Dhanasekar

---- On Fri, 21 Jul 2023 19:33:41 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com> wrote ---

Hi Dhanasekar,

It is supported for the EPYC Genoa family and beyond at this time.
Daytona uses the EPYC Milan family, for which support is not available.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar@velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh@amd.com>
Cc: Zane Shelley <zshelle@imap.linux.ibm.com>; Lei Yu <yulei.sh@bytedance.com>; Michael Shen <gpgpgp@google.com>; openbmc <openbmc@lists.ozlabs.org>; dhruvaraj S <dhruvaraj@gmail.com>; Brad Bishop <bradleyb@fuzziesquirrel.com>; Ed Tanous <ed@tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani@amd.com>
Subject: Re: [RFC] BMC RAS Feature

Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona platform? I have been working on OpenBMC development for the DaytonaX platform.
If this RAS works for the Daytona platform, I will include it in my project.

Please provide your suggestions.

Thanks,
Dhanasekar

---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh@amd.com> wrote ---

On 3/23/23 13:57, Zane Shelley wrote:
 > On 2023-03-22 19:07, Supreeth Venkatesh wrote:  
 >> On 3/22/23 02:10, Lei Yu wrote: 
 >>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh 
 >>>>> <supreeth.venkatesh@amd.com> wrote:
 >>>>> 
 >>>>> 
 >>>>>      On 3/21/23 05:40, Patrick Williams wrote: 
 >>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh 
 >>>>> wrote: 
 >>>>>      > 
 >>>>>      >> #### Alternatives Considered 
 >>>>>      >> 
 >>>>>      >> In-band mechanisms using System Management Mode (SMM) 
 >>>>> exists. 
 >>>>>      >> 
 >>>>>      >> However, out of band method to gather RAS data is processor 
 >>>>>      specific. 
 >>>>>      >> 
 >>>>>      > How does this compare with existing implementations in 
 >>>>>      > phosphor-debug-collector? 
 >>>>>      Thanks for your feedback. See below.  
 >>>>>      > I believe there was some attempt to extend 
 >>>>>      > P-D-C previously to handle Intel's crashdump behavior. 
 >>>>>      Intel's crashdump interface uses com.intel.crashdump. 
 >>>>>      We have implemented com.amd.crashdump based on that reference. 
 >>>>>      However, 
 >>>>>      can this be made generic? 
 >>>>> 
 >>>>>      PoC below: 
 >>>>> 
 >>>>>      busctl tree com.amd.crashdump 
 >>>>> 
 >>>>>      └─/com 
 >>>>>         └─/com/amd 
 >>>>>           └─/com/amd/crashdump 
 >>>>>             ├─/com/amd/crashdump/0 
 >>>>>             ├─/com/amd/crashdump/1 
 >>>>>             ├─/com/amd/crashdump/2 
 >>>>>             ├─/com/amd/crashdump/3 
 >>>>>             ├─/com/amd/crashdump/4 
 >>>>>             ├─/com/amd/crashdump/5 
 >>>>>             ├─/com/amd/crashdump/6 
 >>>>>             ├─/com/amd/crashdump/7 
 >>>>>             ├─/com/amd/crashdump/8 
 >>>>>             └─/com/amd/crashdump/9 
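
The PoC tree above is exactly the kind of layout a generic client could discover at runtime instead of hard-coding the com.amd prefix. As a minimal sketch of that idea, assuming the standard xyz.openbmc_project.ObjectMapper service and the sdbusplus C++ bindings (the root path and the empty interface filter here are illustrative placeholders, not a settled design):

    #include <sdbusplus/bus.hpp>

    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        // Discover crashdump entry objects instead of hard-coding ten paths.
        auto bus = sdbusplus::bus::new_default();
        auto method = bus.new_method_call(
            "xyz.openbmc_project.ObjectMapper",
            "/xyz/openbmc_project/object_mapper",
            "xyz.openbmc_project.ObjectMapper", "GetSubTreePaths");
        // Root is the PoC's path; depth 0 means unlimited, and an empty
        // interface list means "objects with any interface".
        method.append("/com/amd/crashdump", 0, std::vector<std::string>{});

        std::vector<std::string> paths;
        auto reply = bus.call(method);
        reply.read(paths);

        for (const auto& p : paths)
            std::cout << p << '\n'; // e.g. /com/amd/crashdump/0 ... /9
        return 0;
    }

Pointing the same call at a vendor-neutral root rather than /com/amd/crashdump is essentially the generalization being asked about here.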
 >>>>> 
 >>>>>      > The repository 
 >>>>>      > currently handles IBM's processors, I think, or maybe that is 
 >>>>>      covered by 
 >>>>>      > openpower-debug-collector. 
 >>>>>      > 
 >>>>>      > In any case, I think you should look at the existing D-Bus 
 >>>>>      interfaces 
 >>>>>      > (and associated Redfish implementation) of these repositories 
 >>>>> and 
 >>>>>      > determine if you can use those approaches (or document why 
 >>>>> not). 
 >>>>>      I could not find an existing D-Bus interface for RAS in 
 >>>>>      xyz/openbmc_project/. 
 >>>>>      It would be helpful if you could point me to it. 
 >>>>> 
 >>>>> 
 >>>>> There is an interface for the dumps generated from the host, which 
 >>>>> can 
 >>>>> be used for these kinds of dumps 
 >>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
 >>>>> 
 >>>>> 
 >>>>> The fault log also provides similar dumps  
 >>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
 >>>>> 
 >>>>> 
 >>>> Thanks, Dhruvaraj. The interface looks useful for the purpose. However, 
 >>>> the current bmcweb implementation references 
 >>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp 
 >>>> 
 >>>> [com.intel.crashdump] 
 >>>> constexpr char const* crashdumpPath = "/com/intel/crashdump"; 
 >>>> 
 >>>> constexpr char const* crashdumpInterface = "com.intel.crashdump"; 
 >>>> constexpr char const* crashdumpObject = "com.intel.crashdump"; 
 >>>> 
 >>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml 
 >>>> 
 >>>> or 
 >>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml 
 >>>> 
 >>>> Is it exercised in Redfish LogServices? 
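
For comparison, if bmcweb were switched from the vendor prefix to the standard dump manager, the constants quoted above would presumably become something like the following hypothetical sketch (names taken from phosphor-dbus-interfaces and phosphor-debug-collector; whether log_services.hpp actually wires them up is exactly the open question):

    // Hypothetical generic replacements for the com.intel.crashdump
    // constants above; a sketch, not current bmcweb code.
    constexpr const char* dumpService = "xyz.openbmc_project.Dump.Manager";
    constexpr const char* faultLogPath = "/xyz/openbmc_project/dump/faultlog";
    constexpr const char* dumpEntryIface = "xyz.openbmc_project.Dump.Entry";
    constexpr const char* faultLogEntryIface =
        "xyz.openbmc_project.Dump.Entry.FaultLog";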
 >>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added 
 >>> to copy the crashdump json file to the dump tarball. 
 >>> The crashdump tool (Intel or AMD) could trigger a dump after the 
 >>> crashdump is completed, and then we could get a dump entry containing 
 >>> the crashdump. 
 >> Thanks Lei Yu for your input. We are using Redfish to retrieve the 
 >> CPER binary file which can then be passed through a plugin/script for 
 >> detailed analysis. 
 >> In any case irrespective of whichever Dbus interface we use, we need a 
 >> repository which will gather data from AMD processor via APML as per 
 >> AMD design. 
 >> APML 
 >> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip 
 >> Can someone please help create bmc-ras or amd-debug-collector 
 >> repository as there are instances of openpower-debug-collector 
 >> repository used for Open Power systems? 
 >>> 
 >>> 
 >>> -- 
 >>> BRs, 
 >>> Lei YU 
 > I am interested in possibly standardizing some of this. IBM POWER has 
 > several related components. openpower-hw-diags is a service that will 
 > listen for the hardware interrupts via a GPIO pin. When an error is 
 > detected, it will use openpower-libhei to query hardware registers to 
 > determine what happened. Based on that information openpower-hw-diags 
 > will generate a PEL, which is an extended log in phosphor-logging, that 
 > is used to tell service what to replace if necessary. Afterward, 
 > openpower-hw-diags will initiate openpower-debug-collector, which 
 > gathers a significant amount of data from the hardware for additional 
 > debug when necessary. I wrote openpower-libhei to be fairly agnostic. It 
 > uses data files (currently XML, but moving to JSON) to define register 
 > addresses and rules for isolation. openpower-hw-diags is fairly POWER 
 > specific, but I can see some parts can be made generic. Dhruv would have 
 > to help with openpower-debug-collector. 
 Thank you. Let's collaborate on standardizing some aspects of it. 
 > 
 > Regarding creation of a new repository, I think we'll need to have some 
 > more collaboration to determine the scope before creating it. It 
 > certainly sounds like we are doing similar things, but we need to 
 > determine if enough can be abstracted to make it worth our time. 
 I have put in a request here: 
 https://github.com/openbmc/technical-oversight-forum/issues/24 
 Please chime in.
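
For readers comparing the two designs: the interrupt-listening pattern described above (openpower-hw-diags on a GPIO pin, or APML_ALERT# in the AMD proposal) boils down to a small event loop. A minimal sketch using the libgpiod C++ bindings, where the chip name, line name, and harvest hook are all assumed placeholders rather than code from any existing repository:

    #include <gpiod.hpp>

    #include <chrono>
    #include <iostream>

    // Placeholder for the processor-specific harvest step (e.g. reading
    // crash data over the APML mailbox); not part of any existing repo.
    void harvestCrashData()
    {
        std::cout << "fatal error reported: harvesting crash data\n";
    }

    int main()
    {
        gpiod::chip chip("gpiochip0");            // assumed GPIO chip
        auto line = chip.find_line("APML_ALERT"); // assumed line name
        line.request({"ras-monitor",
                      gpiod::line_request::EVENT_FALLING_EDGE, 0});

        while (true)
        {
            // Block until the alert line fires (10 s poll timeout),
            // then consume the edge event and collect data.
            if (line.event_wait(std::chrono::seconds(10)))
            {
                line.event_read();
                harvestCrashData();
            }
        }
    }

In a production service the wait would live behind sd_event or boost::asio rather than a bare loop, and harvestCrashData() would branch into the processor-specific collector (APML mailbox reads for AMD, libhei register isolation for POWER).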

[-- Attachment #2.1: Type: text/html, Size: 59956 bytes --]

[-- Attachment #2.2: 1.png --]
[-- Type: image/png, Size: 3608 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-07-27 10:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-21  5:14 [RFC] BMC RAS Feature Supreeth Venkatesh
2023-03-21 10:40 ` Patrick Williams
2023-03-21 15:07   ` Supreeth Venkatesh
2023-03-21 16:26     ` dhruvaraj S
2023-03-21 17:25       ` Supreeth Venkatesh
2023-03-22  7:10         ` Lei Yu
2023-03-23  0:07           ` Supreeth Venkatesh
2023-04-03 11:44             ` Patrick Williams
2023-04-03 16:32               ` Supreeth Venkatesh
     [not found]             ` <d65937a46b6fb4f9f94edbdef44af58e@imap.linux.ibm.com>
2023-04-03 16:36               ` Supreeth Venkatesh
2023-07-21 10:29                 ` J Dhanasekar
2023-07-21 14:03                   ` Venkatesh, Supreeth
2023-07-24 13:04                     ` J Dhanasekar
2023-07-24 14:14                       ` Venkatesh, Supreeth
2023-07-25 13:09                         ` J Dhanasekar
2023-07-25 14:02                           ` Venkatesh, Supreeth
2023-07-27 10:20                             ` J Dhanasekar
2023-07-14 22:05 ` Bills, Jason M
2023-07-15  9:01   ` dhruvaraj S
2023-07-24 14:29   ` Venkatesh, Supreeth

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).