openbmc.lists.ozlabs.org archive mirror
* Proposal for operations on isolated hardware units using Redfish logging
@ 2020-12-10 14:55 dhruvaraj S
  2020-12-10 17:29 ` Ed Tanous
  0 siblings, 1 reply; 5+ messages in thread
From: dhruvaraj S @ 2020-12-10 14:55 UTC
  To: openbmc; +Cc: bradleyb, gmills

Hi,
Please find below a proposal for operations on isolated hardware units
using Redfish logging.


Hardware Isolation
On systems with multiple processor units and other redundant vital
resources, system downtime can be avoided by isolating faulty hardware
units. Most of the actions required to isolate a part depend on the
architecture and are executed on the host. However, the BMC needs to
support a few operations: letting users query which units are isolated,
clearing an isolation, isolating a suspected part, and isolating a unit
when the host is down because of a fault in a critical unit.
Since these actions need a user interface, this proposal uses the
Redfish log service to carry them out.

Requirements
Isolate a hardware unit when the user requests it.
Get the list of all isolated resources.
Remove the isolation of a hardware unit.
Remove all existing isolations.

Isolating a hardware unit (through the LogService.CollectDiagnosticData action):
/redfish/v1/Systems/system/LogServices/IsolatedHardware
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware",
  "@odata.type": "#LogService.v1_2_0.LogService",
  "Actions": {
    "#LogService.CollectDiagnosticData": {
      "target": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Actions/LogService.CollectDiagnosticData"
    }
  },
  "Description": "Isolated Hardware",
  "Entries": {
    "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries"
  },
  "Id": "IsolatedHardware",
  "Name": "Isolated Hardware LogService",
  "OverWritePolicy": "WrapsWhenFull"
}

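As a rough sketch of how a client might invoke the proposed isolation action, the Python below reads the action target out of the LogService document and builds (but does not send) a POST request using only the standard library. The BMC address is hypothetical, and the empty JSON body is a placeholder: the standard CollectDiagnosticData action takes a DiagnosticDataType parameter, and the proposal does not say what isolation would pass.

```python
import json
import urllib.request

# LogService document from the proposal, trimmed to the fields used here.
LOG_SERVICE = json.loads("""
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware",
  "Actions": {
    "#LogService.CollectDiagnosticData": {
      "target": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Actions/LogService.CollectDiagnosticData"
    }
  }
}
""")

BMC = "https://bmc.example.com"  # hypothetical BMC address


def action_target(service: dict, action: str) -> str:
    """Return the target URI the service advertises for the named action."""
    return service["Actions"][action]["target"]


def build_isolate_request(service: dict, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that would invoke the isolation action."""
    target = action_target(service, "#LogService.CollectDiagnosticData")
    return urllib.request.Request(
        BMC + target,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder body: the isolation parameters are not specified in the proposal.
req = build_isolate_request(LOG_SERVICE, {})
print(req.get_method(), req.full_url)
```

Discovering the target from the Actions object, rather than hard-coding it, keeps the client working if the service URI changes.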
Listing isolated hardware units:
/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of Isolated Hardware Components",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
      "@odata.type": "#LogEntry.v1_7_0.LogEntry",
      "Created": "2020-10-15T10:30:08+00:00",
      "EntryType": "Event",
      "Id": "1",
      "Resolved": false,
      "Name": "Processor 1",
      "Links": {
        "OriginOfCondition": {
          "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
        }
      },
      "Severity": "Critical",
      "SensorType": "Processor",
      "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/EventLog/attachment/111",
      "AdditionalDataSizeBytes": 1024
    }
  ],
  "Members@odata.count": 1,
  "Name": "Isolated Hardware Entries"
}

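A client listing the currently isolated units would walk Members, skip resolved entries, and follow OriginOfCondition back to the affected resource. A minimal Python sketch, assuming Resolved is a JSON boolean and the link sits under the standard capitalized "Links" property; the sample data is a trimmed copy of the entry shown above.

```python
import json

# Trimmed copy of the proposed Entries collection.
ENTRIES = json.loads("""
{
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
      "Id": "1",
      "Name": "Processor 1",
      "Resolved": false,
      "Severity": "Critical",
      "Links": {
        "OriginOfCondition": {"@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"}
      }
    }
  ],
  "Members@odata.count": 1
}
""")


def isolated_units(collection: dict) -> list[tuple[str, str]]:
    """Return (entry Id, OriginOfCondition URI) for every entry not yet resolved."""
    units = []
    for entry in collection.get("Members", []):
        if entry.get("Resolved", False):
            continue  # unit was serviced and is back in service
        origin = entry.get("Links", {}).get("OriginOfCondition", {}).get("@odata.id")
        units.append((entry["Id"], origin))
    return units


for entry_id, origin in isolated_units(ENTRIES):
    print(f"entry {entry_id}: {origin}")
```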
Users will be able to delete any entry or all of the entries. If an
isolated unit is serviced, that unit returns to service, and the
"Resolved" property of its entry is set to true.
"AdditionalDataURI": a link to the error log associated with this
isolation action.
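The proposal does not say how deletion would be exposed. One plausible mapping, sketched below as an assumption, uses a plain HTTP DELETE on an entry URI for a single isolation and the standard LogService.ClearLog action for all of them. The BMC address is hypothetical, and the ClearLog target is hard-coded here even though the proposed LogService does not advertise that action.

```python
import json
import urllib.request

BMC = "https://bmc.example.com"  # hypothetical BMC address
SERVICE = "/redfish/v1/Systems/system/LogServices/IsolatedHardware"


def delete_entry_request(entry_uri: str) -> urllib.request.Request:
    """Build a DELETE for one isolation entry (assumed to remove that isolation)."""
    return urllib.request.Request(BMC + entry_uri, method="DELETE")


def clear_all_request(service_uri: str = SERVICE) -> urllib.request.Request:
    """Build a POST of LogService.ClearLog to drop every entry.

    Assumption: the proposal does not list ClearLog among the service's Actions;
    a real client should read the target from the Actions object instead.
    """
    return urllib.request.Request(
        BMC + service_uri + "/Actions/LogService.ClearLog",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


one = delete_entry_request(SERVICE + "/Entries/1")
everything = clear_all_request()
print(one.get_method(), one.full_url)
print(everything.get_method(), everything.full_url)
```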
--------------
Dhruvaraj S

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Proposal for operations on isolated hardware units using Redfish logging
  2020-12-10 14:55 Proposal for operations on isolated hardware units using Redfish logging dhruvaraj S
@ 2020-12-10 17:29 ` Ed Tanous
  2020-12-11 11:27   ` dhruvaraj S
  2020-12-11 16:12   ` Gunnar Mills
  0 siblings, 2 replies; 5+ messages in thread
From: Ed Tanous @ 2020-12-10 17:29 UTC
  To: dhruvaraj S; +Cc: openbmc, Brad Bishop, Gunnar Mills

On Thu, Dec 10, 2020 at 7:49 AM dhruvaraj S <dhruvaraj@gmail.com> wrote:
>
> Hi,
> Please find the option for operations on isolated hardware units using
> Redfish logging
>
>
> Hardware Isolation
> On systems with multiple processor units and other redundant vital resources,
> the system downtime can be prevented by isolating the faulty hardware units.
> Most of the actions required to isolate the parts will be dependent on
> the architecture and
> executed in the host. But the BMC needs to support a few steps like
> provide a method to users to query the units in isolation, clearing
> isolation, isolating a
> suspected part, or isolating when the host is down due to a fault in a
> critical unit.
> Since a user interface is needed for the above actions proposing a method to use
> Redfish log service to carry out these actions.

Right off the bat, LogServices seems like a strange choice for this.
In your requirements, you're taking actions on the unit itself, not
logging the actions that occurred, so I'm struggling to see the design
choice here.  Can you elaborate why LogService, something intended to
be for historical logging, would be appropriate for a design that
needs to accept user action?

>
> Requirements
> When user requests, isolate a hardware unit.
> Getting the list of all isolated resources.
> Remove the isolation of a hardware unit.
> Remove all existing isolation
>
> Isolating a hardware unit:
> redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware
> {
>   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware",
>   "@odata.type": "#LogService.v1_2_0.LogService",
>   "Actions": {
>     "#LogService.CollectDiagnosticData": {
>       "target":
> "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Actions/LogService.CollectDiagnosticData"

What is this action intended to do?

>     }
>   },
>   "Description": "Isolated Hardware",
>   "Entries": {
>     "@odata.id":
> "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries"
>   },
>   "Id": "IsolatedHardware",
>   "Name": "Isolated Hardware LogService",
>   "OverWritePolicy": "WrapsWhenFull"
>
> Listing isolated hardware units.
> redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries
> {
>   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
>   "@odata.type": "#LogEntryCollection.LogEntryCollection",
>   "Description": "Collection of Isolated Hardware Components",
>   "Members": [
>     {
>       "@odata.id":
> "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
>       "@odata.type": "#LogEntry.v1_7_0.LogEntry",
>       "Created": "2020-10-15T10:30:08+00:00",
>       "EntryType": "Event",
>       "Id": "1",
>       "Resolved": "false",

LogEntry doesn't have a "Resolved" field that I can see.

>       "Name": "Processor 1",
>       "links":  {
>                  "OriginOfCondition": {
>                         "@odata.id":
> "/redfish/v1/Systems/system/Processors/cpu1"
>                     },
>       "Severity": "Critical",
>        "SensorType" : "Processor",

SensorType doesn't really make sense in this case, as you're not
reporting errors from a sensor, but from a resource.

>
>  "AdditionalDataURI":
> "/redfish/v1/Systems/system/LogServices/EventLog/attachment/111"
>  "AdditionalDataSizeBytes": "1024"
>
>   }
>   ],
>   "Members@odata.count": 1,
>   "Name": "Isolated Hardware Entries"
>
> Users will be able to delete any entry or all the entries, but if an
> isolated unit is serviced then that unit will be back in service, in
> such cases the "Resolved" property in the entries will be marked as
> "true"
> "AdditionalDataURI" : This is a link to the error log associated with
> this isolation action.
> --------------
> Dhruvaraj S


I suspect overall you need to separate this into two different
resources.  One for logging things that have happened in the past,
under log service, and one for interacting directly with the system in
its current state.  The second one would likely take the form of being
able to set the Status property to something like "Disabled",
"UnavailableOffline", or something similar on your Processor
resources.
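To make the suggested split concrete: the current-state question ("which units are out of service right now?") can be answered from each Processor's Status.State instead of from a log. "Disabled" and "UnavailableOffline" are standard values of the Redfish State enumeration. A minimal Python sketch, assuming the processor documents have already been fetched; cpu0 and all sample values are made up for illustration.

```python
# Status.State values named in the suggestion above.
OUT_OF_SERVICE_STATES = {"Disabled", "UnavailableOffline"}


def is_out_of_service(resource: dict) -> bool:
    """True when a resource's Status.State marks it as taken out of service."""
    return resource.get("Status", {}).get("State") in OUT_OF_SERVICE_STATES


def current_isolations(processors: list[dict]) -> list[str]:
    """Current-state view: @odata.id of every processor that is out of service."""
    return [p["@odata.id"] for p in processors if is_out_of_service(p)]


# Hypothetical sample documents.
cpus = [
    {"@odata.id": "/redfish/v1/Systems/system/Processors/cpu0",
     "Status": {"State": "Enabled", "Health": "OK"}},
    {"@odata.id": "/redfish/v1/Systems/system/Processors/cpu1",
     "Status": {"State": "Disabled", "Health": "Critical"}},
]
print(current_isolations(cpus))
```

The log service then only has to record history (when and why a unit was isolated), while the resource itself carries its present state.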


* Re: Proposal for operations on isolated hardware units using Redfish logging
  2020-12-10 17:29 ` Ed Tanous
@ 2020-12-11 11:27   ` dhruvaraj S
  2020-12-11 16:12   ` Gunnar Mills
  1 sibling, 0 replies; 5+ messages in thread
From: dhruvaraj S @ 2020-12-11 11:27 UTC
  To: Ed Tanous; +Cc: openbmc, Brad Bishop, Gunnar Mills

On Thu, Dec 10, 2020 at 10:59 PM Ed Tanous <ed@tanous.net> wrote:
>
> On Thu, Dec 10, 2020 at 7:49 AM dhruvaraj S <dhruvaraj@gmail.com> wrote:
> >
> > Hi,
> > Please find the option for operations on isolated hardware units using
> > Redfish logging
> >
> >
> > Hardware Isolation
> > On systems with multiple processor units and other redundant vital resources,
> > the system downtime can be prevented by isolating the faulty hardware units.
> > Most of the actions required to isolate the parts will be dependent on
> > the architecture and
> > executed in the host. But the BMC needs to support a few steps like
> > provide a method to users to query the units in isolation, clearing
> > isolation, isolating a
> > suspected part, or isolating when the host is down due to a fault in a
> > critical unit.
> > Since a user interface is needed for the above actions proposing a method to use
> > Redfish log service to carry out these actions.
>
> Right off the bat, LogServices seems like a strange choice for this.
> In your requirements, you're taking actions on the unit itself, not
> logging the actions that occurred, so I'm struggling to see the design
> choice here.  Can you elaborate why LogService, something intended to
> be for historical logging, would be appropriate for a design that
> needs to accept user action?

Apart from user-requested isolation, hardware units are usually
isolated because of a past event in the system. For example, if a
processor core encounters an error while running and cannot continue
in service, it is listed as isolated. Users need a way to see the list
of such units, and since a log service exists to show such records, I
think a log service is suitable for that.
After the repair, once the unit is back in service, its log service
entry will be marked as resolved.
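The entry lifecycle described here (an entry is created when a unit is isolated and, once the unit is repaired, kept but marked resolved) can be modeled as below. IsolationEntry and its methods are hypothetical names for illustration, not part of OpenBMC or Redfish.

```python
from dataclasses import dataclass


@dataclass
class IsolationEntry:
    """Hypothetical model of one IsolatedHardware log entry's lifecycle."""
    entry_id: str
    origin: str            # OriginOfCondition URI of the isolated unit
    resolved: bool = False

    def mark_repaired(self) -> None:
        # The unit is back in service; the entry stays as history.
        self.resolved = True

    def to_redfish(self) -> dict:
        return {
            "Id": self.entry_id,
            "Resolved": self.resolved,
            "Links": {"OriginOfCondition": {"@odata.id": self.origin}},
        }


# A core error isolates cpu1; a later repair resolves the entry.
entry = IsolationEntry("1", "/redfish/v1/Systems/system/Processors/cpu1")
before = entry.to_redfish()["Resolved"]
entry.mark_repaired()
after = entry.to_redfish()["Resolved"]
print(before, after)
```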

>
> >
> > Requirements
> > When user requests, isolate a hardware unit.
> > Getting the list of all isolated resources.
> > Remove the isolation of a hardware unit.
> > Remove all existing isolation
> >
> > Isolating a hardware unit:
> > redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware
> > {
> >   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware",
> >   "@odata.type": "#LogService.v1_2_0.LogService",
> >   "Actions": {
> >     "#LogService.CollectDiagnosticData": {
> >       "target":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Actions/LogService.CollectDiagnosticData"
>
> What is this action intended to do?
>
> >     }
> >   },
> >   "Description": "Isolated Hardware",
> >   "Entries": {
> >     "@odata.id":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries"
> >   },
> >   "Id": "IsolatedHardware",
> >   "Name": "Isolated Hardware LogService",
> >   "OverWritePolicy": "WrapsWhenFull"
> >
> > Listing isolated hardware units.
> > redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries
> > {
> >   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
> >   "@odata.type": "#LogEntryCollection.LogEntryCollection",
> >   "Description": "Collection of Isolated Hardware Components",
> >   "Members": [
> >     {
> >       "@odata.id":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
> >       "@odata.type": "#LogEntry.v1_7_0.LogEntry",
> >       "Created": "2020-10-15T10:30:08+00:00",
> >       "EntryType": "Event",
> >       "Id": "1",
> >       "Resolved": "false",
>
> LogEntry doesn't have a "Resolved" field that I can see.
>
> >       "Name": "Processor 1",
> >       "links":  {
> >                  "OriginOfCondition": {
> >                         "@odata.id":
> > "/redfish/v1/Systems/system/Processors/cpu1"
> >                     },
> >       "Severity": "Critical",
> >        "SensorType" : "Processor",
>
> SensorType doesn't really make sense in this case, as you're not
> reporting errors from a sensor, but from a resource.
>
> >
> >  "AdditionalDataURI":
> > "/redfish/v1/Systems/system/LogServices/EventLog/attachment/111"
> >  "AdditionalDataSizeBytes": "1024"
> >
> >   }
> >   ],
> >   "Members@odata.count": 1,
> >   "Name": "Isolated Hardware Entries"
> >
> > Users will be able to delete any entry or all the entries, but if an
> > isolated unit is serviced then that unit will be back in service, in
> > such cases the "Resolved" property in the entries will be marked as
> > "true"
> > "AdditionalDataURI" : This is a link to the error log associated with
> > this isolation action.
> > --------------
> > Dhruvaraj S
>
>
> I suspect overall you need to separate this into two different
> resources.  One for logging things that have happened in the past,
> under log service, and one for interacting directly with the system in
> its current state.  The second one would likely take the form of being
> able to set the Status property to something like "Disabled",
> "UnavailableOffline", or something similar on your Processor
> resources.

The log service is already used to generate dumps, which is a
user-initiated action in a log service, so I think user-initiated
isolation can also live in the same place.
But, as you suggested, setting Disabled/UnavailableOffline on the units
themselves is also a good option; I need to look into that more.

-- 
--------------
Dhruvaraj S


* Re: Proposal for operations on isolated hardware units using Redfish logging
  2020-12-10 17:29 ` Ed Tanous
  2020-12-11 11:27   ` dhruvaraj S
@ 2020-12-11 16:12   ` Gunnar Mills
  2020-12-23 16:54     ` dhruvaraj S
  1 sibling, 1 reply; 5+ messages in thread
From: Gunnar Mills @ 2020-12-11 16:12 UTC
  To: Ed Tanous, dhruvaraj S; +Cc: openbmc, Brad Bishop

On 12/10/2020 10:29 AM, Ed Tanous wrote:
> On Thu, Dec 10, 2020 at 7:49 AM dhruvaraj S <dhruvaraj@gmail.com> wrote:
>>
>>
>> Listing isolated hardware units.
>> redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries
>> {
>>    "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
>>    "@odata.type": "#LogEntryCollection.LogEntryCollection",
>>    "Description": "Collection of Isolated Hardware Components",
>>    "Members": [
>>      {
>>        "@odata.id":
>> "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
>>        "@odata.type": "#LogEntry.v1_7_0.LogEntry",
>>        "Created": "2020-10-15T10:30:08+00:00",
>>        "EntryType": "Event",
>>        "Id": "1",
>>        "Resolved": "false",
> 
> LogEntry doesn't have a "Resolved" field that I can see.

Resolved is part of Redfish 2020.4. It matches OpenBMC's Logging Entry
interface:
https://github.com/openbmc/phosphor-dbus-interfaces/blob/05dd96872560bc6f11616be48b1873f539904142/xyz/openbmc_project/Logging/Entry.interface.yaml#L29

> 
>>        "Name": "Processor 1",
>>        "links":  {
>>                   "OriginOfCondition": {
>>                          "@odata.id":
>> "/redfish/v1/Systems/system/Processors/cpu1"
>>                      },
>>        "Severity": "Critical",
>>         "SensorType" : "Processor",
> 




* Re: Proposal for operations on isolated hardware units using Redfish logging
  2020-12-11 16:12   ` Gunnar Mills
@ 2020-12-23 16:54     ` dhruvaraj S
  0 siblings, 0 replies; 5+ messages in thread
From: dhruvaraj S @ 2020-12-23 16:54 UTC
  To: Gunnar Mills; +Cc: Brad Bishop, openbmc, Ed Tanous

Hi,

Update: instead of using LogService.CollectDiagnosticData to manually
isolate the hardware, the new proposal is to set the ReadyToRemove
property to true:
/redfish/v1/Systems/system/Processors/CPU1
{
  "@odata.type": "#Processor.v1_7_0.Processor",
  "Id": "CPU1",
  "Name": "Processor",
  "Socket": "CPU 1",
  "ProcessorType": "CPU",
  "ProcessorId": {
    "VendorId": "XXXX",
    "IdentificationRegisters": "XXXX"
  },
  "MaxSpeedMHz": 3700,
  "TotalCores": 8,
  "TotalThreads": 16,
  "Status": {
    "State": "Enabled",
    "Health": "OK",
    "ReadyToRemove": true   <---
  },
  "@odata.id": "/redfish/v1/Systems/system/Processors/CPU1"
}
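Under the updated proposal a client would write the flag on the processor resource itself. The Python sketch below builds (but does not send) such a PATCH. It assumes ReadyToRemove is a boolean accepted at the position shown in the proposal, inside Status; whether the schema permits writing it there is not settled in this thread. The BMC address is hypothetical.

```python
import json
import urllib.request

BMC = "https://bmc.example.com"  # hypothetical BMC address


def ready_to_remove_patch(processor_uri: str, ready: bool = True) -> urllib.request.Request:
    """Build (but do not send) a PATCH setting the proposed Status.ReadyToRemove flag."""
    body = {"Status": {"ReadyToRemove": ready}}
    return urllib.request.Request(
        BMC + processor_uri,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


req = ready_to_remove_patch("/redfish/v1/Systems/system/Processors/CPU1")
print(req.get_method(), req.full_url, req.data.decode())
```

Clearing the isolation would be the same request with ready=False.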



-- 
--------------
Dhruvaraj S


Thread overview: 5+ messages
2020-12-10 14:55 Proposal for operations on isolated hardware units using Redfish logging dhruvaraj S
2020-12-10 17:29 ` Ed Tanous
2020-12-11 11:27   ` dhruvaraj S
2020-12-11 16:12   ` Gunnar Mills
2020-12-23 16:54     ` dhruvaraj S
