* BMC health metrics (again!)
@ 2019-04-09 16:25 Kun Yi
  2019-04-11 12:56 ` Sivas Srr
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Kun Yi @ 2019-04-09 16:25 UTC (permalink / raw)
  To: OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 2480 bytes --]

Hello there,

This topic has been brought up several times on the mailing list and
offline, but in general it seems we as a community haven't reached a
consensus on what would be the most valuable things to monitor, and how to
monitor them. While a general-purpose monitoring infrastructure for OpenBMC
seems to be a hard problem, I have some simple ideas that I hope can
provide immediate and direct benefits.

1. Monitoring host IPMI link reliability (host side)

The essentials I want are "IPMI commands sent" and "IPMI commands
succeeded" counts over time. More metrics like response time would
be helpful as well. The issue to address here: when some IPMI sensor
readings are flaky, it would be really helpful to use the IPMI command
stats to determine whether it is a hardware issue or an IPMI issue.
Moreover, it would be a very useful regression test metric for rolling
out new BMC software.

Looking at the host IPMI side, there are some metrics exposed
through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't
dug into whether they contain information mapping to the interrupts. Time
to read the source code, I guess.
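To sketch what scraping that file could look like: the snippet below assumes the rough `name: value` line layout of /proc/ipmi/0/si_stats, but the field names vary by kernel version, so the sample fields are illustrative only.

```python
import re

def parse_si_stats(text):
    """Parse 'name: count' lines (as in /proc/ipmi/0/si_stats) into a dict."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w ]+?)\s*:\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

# Illustrative sample; real field names depend on the kernel version.
sample = """\
interrupts enabled:    1
short timeouts:        10
long timeouts:         4
complete transactions: 12345
hosed count:           0
"""
print(parse_si_stats(sample))
```

A monitoring daemon could diff two such samples taken over an interval to derive rates.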

Another idea would be to instrument caller libraries like the interfaces in
ipmitool, though I feel that approach is harder due to fragmentation of
IPMI libraries.

2. Read and expose core BMC performance metrics from procfs

This is straightforward: have a smallish daemon (or bmc-state-manager)
read, parse, and process procfs and put the values on D-Bus. Core metrics
I'm interested in getting this way: load average, memory, disk
used/available, net stats... The values can then simply be exported as IPMI
sensors or Redfish resource properties.

A nice byproduct of this effort would be a procfs parsing library. Since
different platforms will probably have different monitoring requirements,
and procfs output has no standard format, I'm thinking the user would just
provide a configuration file containing a list of (procfs path, property
regex, D-Bus property name) tuples, and compile-time generated code would
provide an object for each property.
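As a rough sketch of the configuration-driven idea (the paths, regexes, and property names below are hypothetical examples, and a real implementation would generate code from the config at compile time rather than interpret it at runtime):

```python
import re

# Hypothetical config entries: (procfs path, property regex, D-Bus property name).
CONFIG = [
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"^MemAvailable:\s+(\d+) kB", "MemAvailableKiB"),
    ("/proc/uptime", r"^(\S+)", "UptimeSeconds"),
]

def scrape(read_file):
    """Return {D-Bus property name: first regex capture} per readable entry."""
    props = {}
    for path, pattern, name in CONFIG:
        try:
            text = read_file(path)
        except OSError:
            continue  # this platform lacks the procfs node; skip it
        m = re.search(pattern, text, re.MULTILINE)
        if m:
            props[name] = m.group(1)
    return props

# A daemon would call scrape(lambda p: open(p).read()) periodically and
# publish the resulting values on D-Bus.
```

Injecting the file reader also makes the parsing library trivially unit-testable against canned procfs snapshots.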

All of this is merely thoughts, nothing concrete yet. That said, it
would be really great if you could provide some feedback such as "I want
this, but I really need that feature", or let me know it's all implemented
already :)

If this seems valuable, then after gathering more feedback on feature
requirements, I'm going to turn these ideas into design docs and upload
them for review.

-- 
Regards,
Kun

[-- Attachment #2: Type: text/html, Size: 2977 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BMC health metrics (again!)
  2019-04-09 16:25 BMC health metrics (again!) Kun Yi
@ 2019-04-11 12:56 ` Sivas Srr
  2019-04-20  1:04   ` Kun Yi
  2019-04-12 13:02 ` Andrew Geissler
  2019-05-08  8:11 ` vishwa
  2 siblings, 1 reply; 13+ messages in thread
From: Sivas Srr @ 2019-04-11 12:56 UTC (permalink / raw)
  To: kunyi; +Cc: openbmc

[-- Attachment #1: Type: text/html, Size: 7582 bytes --]


* Re: BMC health metrics (again!)
  2019-04-09 16:25 BMC health metrics (again!) Kun Yi
  2019-04-11 12:56 ` Sivas Srr
@ 2019-04-12 13:02 ` Andrew Geissler
  2019-04-20  1:08   ` Kun Yi
  2019-05-08  8:11 ` vishwa
  2 siblings, 1 reply; 13+ messages in thread
From: Andrew Geissler @ 2019-04-12 13:02 UTC (permalink / raw)
  To: Kun Yi; +Cc: OpenBMC Maillist

On Tue, Apr 9, 2019 at 11:26 AM Kun Yi <kunyi@google.com> wrote:
>
> Hello there,
>
> This topic has been brought up several times on the mailing list and offline, but in general seems we as a community didn't reach a consensus on what things would be the most valuable to monitor, and how to monitor them. While it seems a general purposed monitoring infrastructure for OpenBMC is a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.

I like it, start simple and we can build from there.

>
> 1. Monitoring host IPMI link reliability (host side)
>
> The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to tell from IPMI command stats to determine whether it is a hardware issue, or IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.

Are you thinking this is mostly for out-of-band IPMI? Or in-band as
well? I can't say I've looked into this much, but are there known
issues in this area? The only time I've run into IPMI issues is
usually when communicating with the host firmware. We've hit a variety
of race conditions and timeouts in that path.

>
> Looking at the host IPMI side, there is some metrics exposed through /proc/ipmi/0/si_stats if ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code I guess.
>
> Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.
>
> 2. Read and expose core BMC performance metrics from procfs
>
> This is straightforward: have a smallish daemon (or bmc-state-manager) read,parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting through this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.

Yes, I would definitely be interested in being able to look at these
things. I assume the sampling rate would be configurable? We could build
this into our CI images and collect the information for each run. I'm not
sure if we'd use this in production due to the additional resources it
will consume, but I could see it being very useful in lab/debug/CI
areas.

> A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and procfs output format has no standard, I'm thinking the user would just provide a configuration file containing list of (procfs path, property regex, D-Bus property name), and the compile-time generated code to provide an object for each property.

Sounds flexible and reasonable to me.

> All of this is merely thoughts and nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)
>
> If this seems valuable, after gathering more feedback of feature requirements, I'm going to turn them into design docs and upload for review.

Perfect

> --
> Regards,
> Kun


* Re: BMC health metrics (again!)
  2019-04-11 12:56 ` Sivas Srr
@ 2019-04-20  1:04   ` Kun Yi
  0 siblings, 0 replies; 13+ messages in thread
From: Kun Yi @ 2019-04-20  1:04 UTC (permalink / raw)
  To: Sivas Srr; +Cc: OpenBMC Maillist

Thanks Sivas. Response inline.

On Thu, Apr 11, 2019 at 5:57 AM Sivas Srr <sivas.srr@in.ibm.com> wrote:
>
> Thank you Kun Yi for your proposal.
> My input starts with word "Response:".
>
>
>
> With regards,
> Sivas
>
>
> ----- Original message -----
> From: Kun Yi <kunyi@google.com>
> Sent by: "openbmc" <openbmc-bounces+sivas.srr=in.ibm.com@lists.ozlabs.org>
> To: OpenBMC Maillist <openbmc@lists.ozlabs.org>
> Cc:
> Subject: BMC health metrics (again!)
> Date: Tue, Apr 9, 2019 9:57 PM
>
> Hello there,
>
> This topic has been brought up several times on the mailing list and offline, but in general seems we as a community didn't reach a consensus on what things would be the most valuable to monitor, and how to monitor them. While it seems a general purposed monitoring infrastructure for OpenBMC is a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.
>
> 1. Monitoring host IPMI link reliability (host side)
>
> The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to tell from IPMI command stats to determine whether it is a hardware issue, or IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.
>
> Looking at the host IPMI side, there is some metrics exposed through /proc/ipmi/0/si_stats if ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code I guess.
>
> Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.
>
> Response: Can we have it as part of a debug tarball image, so that response-time collection is used only at that time?
> And moreover, is the IPMI interface not fading away? I will let others provide input.

A debug tarball tool is an interesting idea, though it seems from my
preliminary probing that getting command response times from kernel
stats alone is not feasible without modifying the driver.

> 2. Read and expose core BMC performance metrics from procfs
>
> This is straightforward: have a smallish daemon (or bmc-state-manager) read,parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting through this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.
>
> A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and procfs output format has no standard, I'm thinking the user would just provide a configuration file containing list of (procfs path, property regex, D-Bus property name), and the compile-time generated code to provide an object for each property.
>
> All of this is merely thoughts and nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)
>
> If this seems valuable, after gathering more feedback of feature requirements, I'm going to turn them into design docs and upload for review.
>
> Response: As the BMC is a small embedded system, do we really need to put this in? We may need to decide based on the memory / flash footprint.

Yes, obviously it depends on whether the daemon itself is lightweight.
I don't envision it being larger than any standard phosphor daemon.
Again, it could be configured and included on a platform-by-platform
basis.

>
> Feature request: an event when BMC usage goes > 90%:
>
> From the end user's perspective, if BMC CPU utilization, memory, or file system usage consistently exceeds 90%, we should have a way to generate an event accordingly. This will help the end user. I feel this is higher priority.
>
> Based on the event, the involved application could try to correct itself.

Agree with generating event logs for degraded BMC performance. There
is a standard software watchdog [1] that can reset/recover the system
based on configuration, and we are using it on our platforms; we
should look into whether it can be hooked up to generate an event.
[1] https://linux.die.net/man/8/watchdog
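One possible shape for the threshold-to-event hook Sivas asked for is sketched below; the 90%/85% figures are illustrative, and a real daemon would translate the transitions into phosphor-logging (or Redfish) event entries rather than tuples.

```python
def monitor(samples, threshold=90.0, clear=85.0):
    """Yield ('assert'|'deassert', value) when a metric crosses the
    threshold, with hysteresis so a flapping value doesn't spam events."""
    asserted = False
    for value in samples:
        if not asserted and value > threshold:
            asserted = True
            yield ("assert", value)    # would create an event log entry
        elif asserted and value < clear:
            asserted = False
            yield ("deassert", value)  # would resolve the event

# Example: CPU utilization samples over time.
events = list(monitor([50, 95, 92, 80, 96]))
```

The hysteresis band (assert above 90, deassert only below 85) keeps a value hovering near the threshold from generating an event storm.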

>
> If, after this, the BMC still has a good footprint, then there is nothing wrong in having a small daemon that parses procfs and uses D-Bus to expose performance metrics.

As I have mentioned, I think there is still value from a QA
perspective in profiling the performance, even if the BMC itself is
running fine.

>
> With regards,
> Sivas
> --
>
> Regards,
> Kun
>
>
>


--
Regards,
Kun


* Re: BMC health metrics (again!)
  2019-04-12 13:02 ` Andrew Geissler
@ 2019-04-20  1:08   ` Kun Yi
  0 siblings, 0 replies; 13+ messages in thread
From: Kun Yi @ 2019-04-20  1:08 UTC (permalink / raw)
  To: Andrew Geissler; +Cc: OpenBMC Maillist

On Fri, Apr 12, 2019 at 6:02 AM Andrew Geissler <geissonator@gmail.com> wrote:
>
> On Tue, Apr 9, 2019 at 11:26 AM Kun Yi <kunyi@google.com> wrote:
> >
> > Hello there,
> >
> > This topic has been brought up several times on the mailing list and offline, but in general seems we as a community didn't reach a consensus on what things would be the most valuable to monitor, and how to monitor them. While it seems a general purposed monitoring infrastructure for OpenBMC is a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.
>
> I like it, start simple and we can build from there.
>
> >
> > 1. Monitoring host IPMI link reliability (host side)
> >
> > The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to tell from IPMI command stats to determine whether it is a hardware issue, or IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.
>
> Are you thinking this is mostly for out of band IPMI? Or in-band as
> well? I can't say I've looked into this much but are there known
> issues in this area? The only time I've run into IPMI issues are
> usually when communicating with the host firmware. We've hit a variety
> of race conditions and timeouts in that path.

Good question. Mostly for in-band IPMI, because we don't have much
experience with OOB IPMI. :)

Compared to the other proposal, this one needs to be fleshed out more
for a design proposal. I'm currently low on resources but will pick it
up after the next 1-2 weeks.

>
> >
> > Looking at the host IPMI side, there is some metrics exposed through /proc/ipmi/0/si_stats if ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code I guess.
> >
> > Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.
> >
> > 2. Read and expose core BMC performance metrics from procfs
> >
> > This is straightforward: have a smallish daemon (or bmc-state-manager) read,parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting through this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.
>
> Yes, I would def be interested in being able to look at these things.
> I assume the sampling rate would be configurable? We could build this
> into our CI images and collect the information for each run. I'm not
> sure if we'd use this in production due to the additional resources it
> will consume but I could see it being very useful in lab/debug/CI
> areas.
>
> > A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and procfs output format has no standard, I'm thinking the user would just provide a configuration file containing list of (procfs path, property regex, D-Bus property name), and the compile-time generated code to provide an object for each property.
>
> Sounds flexible and reasonable to me.
>
> > All of this is merely thoughts and nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)
> >
> > If this seems valuable, after gathering more feedback of feature requirements, I'm going to turn them into design docs and upload for review.
>
> Perfect
>
> > --
> > Regards,
> > Kun



-- 
Regards,
Kun


* Re: BMC health metrics (again!)
  2019-04-09 16:25 BMC health metrics (again!) Kun Yi
  2019-04-11 12:56 ` Sivas Srr
  2019-04-12 13:02 ` Andrew Geissler
@ 2019-05-08  8:11 ` vishwa
  2019-05-17  6:30   ` Neeraj Ladkani
  2 siblings, 1 reply; 13+ messages in thread
From: vishwa @ 2019-05-08  8:11 UTC (permalink / raw)
  To: Kun Yi, OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 4397 bytes --]

Hello Kun,

Thanks for initiating it. I liked the /proc parsing. On the IPMI thing,
is it targeted only at IPMI, -or- at a generic BMC-Host communication
link ?

Some of the things in my wish-list are:

1/. Flash wear and tear detection and the threshold to be a config option
2/. Any SoC specific health checks ( If that is exposed )
3/. Mechanism to detect spurious interrupts on any HW link
4/. Some kind of check to see if there will be any I2C lock to a given 
end device
5/. Ability to detect errors on HW links

On the watchdog(8) area, I was just thinking of these:

How about having some kind of BMC_health D-Bus properties -or- a
compile-time feed, whose values can be fed into a configuration file,
rather than watchdog always using the default /etc/watchdog.conf. If the
properties are coming from D-Bus, then we could either append to
/etc/watchdog.conf -or- treat those values alone as the config file that
is given to watchdog.
The systemd service files would be set up accordingly.


We have seen instances where we get an error indicating that no
resources are available. Those could be file descriptors / socket
descriptors, etc. Could we plug this into watchdog as part of a test
binary that checks for this ? We could hook up a repair binary to take
the action.


Another thing that I was looking at hooking into watchdog is a test of
the file system usage, as defined by a policy.
The policy could mention the file system mounts and also the thresholds.

For example, /tmp, /root etc. We could again hook up a repair binary to
do some cleanup if needed.

If we see this list of custom requirements growing, then it probably
does not make sense to pollute watchdog(8); should we
have these consumed into the app instead ?
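A minimal sketch of such a test binary, assuming a hypothetical mount/threshold policy (watchdog(8) runs a configured test binary and treats a non-zero exit status as a failure, which can then invoke a repair binary):

```python
import os
import sys

# Hypothetical policy: mount point -> maximum used space, in percent.
POLICY = {"/tmp": 90.0, "/var": 90.0}

def used_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return 100.0 * (total - free) / total if total else 0.0

def main():
    status = 0
    for mount, limit in POLICY.items():
        try:
            usage = used_percent(mount)
        except OSError:
            continue  # mount not present on this platform; skip it
        if usage > limit:
            print(f"{mount} is over {limit:.0f}% used", file=sys.stderr)
            status = 1  # non-zero tells watchdog to run the repair binary
    return status

# When installed as watchdog's test binary, main()'s return value would
# become the process exit code: 0 = healthy, non-zero = trigger repair.
status = main()
```

The same skeleton could carry the fd/socket-exhaustion check by swapping the filesystem probe for a count of open descriptors in critical processes.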

!! Vishwa !!

On 4/9/19 9:55 PM, Kun Yi wrote:
> Hello there,
>
> This topic has been brought up several times on the mailing list and 
> offline, but in general seems we as a community didn't reach a 
> consensus on what things would be the most valuable to monitor, and 
> how to monitor them. While it seems a general purposed monitoring 
> infrastructure for OpenBMC is a hard problem, I have some simple ideas 
> that I hope can provide immediate and direct benefits.
>
> 1. Monitoring host IPMI link reliability (host side)
>
> The essentials I want are "IPMI commands sent" and "IPMI commands 
> succeeded" counts over time. More metrics like response time would 
> be helpful as well. The issue to address here: when some IPMI sensor 
> readings are flaky, it would be really helpful to tell from IPMI 
> command stats to determine whether it is a hardware issue, or IPMI 
> issue. Moreover, it would be a very useful regression test metric for 
> rolling out new BMC software.
>
> Looking at the host IPMI side, there is some metrics exposed 
> through /proc/ipmi/0/si_stats if ipmi_si driver is used, but I haven't 
> dug into whether it contains information mapping to the interrupts. 
> Time to read the source code I guess.
>
> Another idea would be to instrument caller libraries like the 
> interfaces in ipmitool, though I feel that approach is harder due to 
> fragmentation of IPMI libraries.
>
> 2. Read and expose core BMC performance metrics from procfs
>
> This is straightforward: have a smallish daemon (or bmc-state-manager) 
> read,parse, and process procfs and put values on D-Bus. Core metrics 
> I'm interested in getting through this way: load average, memory, disk 
> used/available, net stats... The values can then simply be exported as 
> IPMI sensors or Redfish resource properties.
>
> A nice byproduct of this effort would be a procfs parsing library. 
> Since different platforms would probably have different monitoring 
> requirements and procfs output format has no standard, I'm thinking 
> the user would just provide a configuration file containing list of 
> (procfs path, property regex, D-Bus property name), and the 
> compile-time generated code to provide an object for each property.
>
> All of this is merely thoughts and nothing concrete. With that said, 
> it would be really great if you could provide some feedback such as "I 
> want this, but I really need that feature", or let me know it's all 
> implemented already :)
>
> If this seems valuable, after gathering more feedback of feature 
> requirements, I'm going to turn them into design docs and upload for 
> review.
>
> -- 
> Regards,
> Kun

[-- Attachment #2: Type: text/html, Size: 6603 bytes --]


* RE: BMC health metrics (again!)
  2019-05-08  8:11 ` vishwa
@ 2019-05-17  6:30   ` Neeraj Ladkani
  2019-05-17  7:17     ` vishwa
  0 siblings, 1 reply; 13+ messages in thread
From: Neeraj Ladkani @ 2019-05-17  6:30 UTC (permalink / raw)
  To: vishwa, Kun Yi, OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 6060 bytes --]

At cloud scale, telemetry and health monitoring are very critical. We should define a framework that allows platform owners to add their own telemetry hooks. The telemetry service should be designed to make this data accessible and to store it in a resilient way (like a black box in a plane crash).

Is there any workgroup that drives this feature, “Platform telemetry and health monitoring” ?

Wishlist

BMC telemetry :

  1.  Linux subsystem
     *   Uptime
     *   CPU load average
     *   Memory info
     *   Storage usage ( RW )
     *   Dmesg
     *   Syslog
     *   FDs of critical processes
     *   Alignment traps
     *   WDT excursions
  2.  IPMI subsystem
     *   Request and response logging per interface, with timestamps ( KCS, LAN, USB )
     *   Request and response of IPMB
         i.   Request, response, number of retries
  3.  Misc
     *   Critical temperature excursions
         i.   Minimum reading of sensor
         ii.  Max reading of sensor
         iii. Count of state transitions
         iv.  Retry count
     *   Count of assertions/deassertions of GPIO, and ability to capture the state
     *   Timestamp of last assertion/deassertion of GPIO

Thanks
~Neeraj

From: openbmc <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> On Behalf Of vishwa
Sent: Wednesday, May 8, 2019 1:11 AM
To: Kun Yi <kunyi@google.com>; OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: BMC health metrics (again!)


Hello Kun,

Thanks for initiating it. I liked the /proc parsing. On the IPMI thing, is it only targeted to IPMI -or- a generic BMC-Host communication kink ?

Some of the things in my wish-list are:

1/. Flash wear and tear detection and the threshold to be a config option
2/. Any SoC specific health checks ( If that is exposed )
3/. Mechanism to detect spurious interrupts on any HW link
4/. Some kind of check to see if there will be any I2C lock to a given end device
5/. Ability to detect errors on HW links

On the watchdog(8) area, I was just thinking these:

How about having some kind of BMC_health D-Bus properties -or- a compile time feed, whose values can be fed into a configuration file than watchdog using the default /etc/watchdog.conf always. If the properties are coming from a D-Bus, then we could either append to /etc/watchdog.conf -or- treat those values only as the config file that can be given to watchdog.
The systemd service files to be setup accordingly.

We have seen instances where we get an error that is indicating no resources available. Those could be file descriptors / socket descriptors etc. A way to plug this into watchdog as part of test binary that checks for this ? We could hook a repair-binary to take the action.

Another thing that I was looking at hooking into watchdog is the test to see the file system usage as defined by the policy.
Policy could mention the file system mounts and also the threshold.

For example, /tmp , /root etc.. We could again hook a repair binary to do some cleanup if needed

If we see the list is growing with these custom requirements, then probably does not make sense to pollute the watchdog(2) but
have these consumed into the app instead ?

!! Vishwa !!
On 4/9/19 9:55 PM, Kun Yi wrote:
Hello there,

This topic has been brought up several times on the mailing list and offline, but in general seems we as a community didn't reach a consensus on what things would be the most valuable to monitor, and how to monitor them. While it seems a general purposed monitoring infrastructure for OpenBMC is a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.

1. Monitoring host IPMI link reliability (host side)

The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to tell from IPMI command stats to determine whether it is a hardware issue, or IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.

Looking at the host IPMI side, there is some metrics exposed through /proc/ipmi/0/si_stats if ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code I guess.

Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.

2. Read and expose core BMC performance metrics from procfs

This is straightforward: have a smallish daemon (or bmc-state-manager) read,parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting through this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.

A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and procfs output format has no standard, I'm thinking the user would just provide a configuration file containing list of (procfs path, property regex, D-Bus property name), and the compile-time generated code to provide an object for each property.

All of this is merely thoughts and nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)

If this seems valuable, after gathering more feedback of feature requirements, I'm going to turn them into design docs and upload for review.

--
Regards,
Kun

[-- Attachment #2: Type: text/html, Size: 28105 bytes --]


* Re: BMC health metrics (again!)
  2019-05-17  6:30   ` Neeraj Ladkani
@ 2019-05-17  7:17     ` vishwa
  2019-05-17  7:23       ` Neeraj Ladkani
  0 siblings, 1 reply; 13+ messages in thread
From: vishwa @ 2019-05-17  7:17 UTC (permalink / raw)
  To: Neeraj Ladkani, Kun Yi, OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 6609 bytes --]

Neeraj,

Thanks for the inputs. It's nice to see us having a similar thought.

AFAIK, we don't have any work-group that is driving “Platform telemetry
and health monitoring”. Also, do we want to see these as 2 different
entities ? In the past, there were thoughts about using websockets to
channel some of the thermal parameters as telemetry data, but that was
not implemented.

We can discuss here, I think.

!! Vishwa !!

On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>
> At cloud scale, telemetry and health monitoring is very critical. We 
> should define a framework that allows platform owners to add their own 
> telemetry hooks. Telemetry service should be designed to make this 
> data accessible and store in resilient way (like blackbox during plane 
> crash).
>
> Is there any workgroup that drives this feature “Platform telemetry 
> and health monitoring” ?
>
> Wishlist
>
> BMC telemetry :
>
>  1. Linux subsystem
>      1. Uptime
>      2. CPU load average
>      3. Memory info
>      4. Storage usage ( RW )
>      5. Dmesg
>      6. Syslog
>      7. FDs of critical processes
>      8. Alignment traps
>      9. WDT excursions
>  2. IPMI subsystem
>      1. Request and response logging per interface, with timestamps ( KCS, LAN, USB )
>      2. Request and response of IPMB
>          i. Request, response, number of retries
>  3. Misc
>      1. Critical temperature excursions
>          i. Minimum reading of sensor
>          ii. Max reading of sensor
>          iii. Count of state transitions
>          iv. Retry count
>      2. Count of assertions/deassertions of GPIO, and ability to capture the state
>      3. Timestamp of last assertion/deassertion of GPIO
>
> Thanks
>
> ~Neeraj
>
> *From:*openbmc <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> 
> *On Behalf Of *vishwa
> *Sent:* Wednesday, May 8, 2019 1:11 AM
> *To:* Kun Yi <kunyi@google.com>; OpenBMC Maillist 
> <openbmc@lists.ozlabs.org>
> *Subject:* Re: BMC health metrics (again!)
>
> Hello Kun,
>
> Thanks for initiating it. I liked the /proc parsing. On the IPMI 
> thing, is it only targeted to IPMI -or- a generic BMC-Host 
> communication kink ?
>
> Some of the things in my wish-list are:
>
> 1/. Flash wear and tear detection and the threshold to be a config option
> 2/. Any SoC specific health checks ( If that is exposed )
> 3/. Mechanism to detect spurious interrupts on any HW link
> 4/. Some kind of check to see if there will be any I2C lock to a given 
> end device
> 5/. Ability to detect errors on HW links
>
> On the watchdog(8) area, I was just thinking these:
>
> How about having some kind of BMC_health D-Bus properties -or- a 
> compile time feed, whose values can be fed into a configuration file 
> than watchdog using the default /etc/watchdog.conf always. If the 
> properties are coming from a D-Bus, then we could either append to 
> /etc/watchdog.conf -or- treat those values only as the config file 
> that can be given to watchdog.
> The systemd service files to be setup accordingly.
>
>
> We have seen instances where we get an error that is indicating no 
> resources available. Those could be file descriptors / socket 
> descriptors etc. A way to plug this into watchdog as part of test 
> binary that checks for this ? We could hook a repair-binary to take 
> the action.
>
>
> Another thing that I was looking at hooking into watchdog is the test 
> to see the file system usage as defined by the policy.
> Policy could mention the file system mounts and also the threshold.
>
> For example, /tmp , /root etc.. We could again hook a repair binary to 
> do some cleanup if needed
>
> If we see the list is growing with these custom requirements, then 
> probably does not make sense to pollute the watchdog(2) but
> have these consumed into the app instead ?
>
> !! Vishwa !!
>
> On 4/9/19 9:55 PM, Kun Yi wrote:
>
>     Hello there,
>
>     This topic has been brought up several times on the mailing list
>     and offline, but it seems we as a community haven't reached a
>     consensus on what things would be the most valuable to monitor,
>     and how to monitor them. While a general-purpose
>     monitoring infrastructure for OpenBMC seems a hard problem, I have
>     some simple ideas that I hope can provide immediate and direct
>     benefits.
>
>     1. Monitoring host IPMI link reliability (host side)
>
>     The essentials I want are "IPMI commands sent" and "IPMI commands
>     succeeded" counts over time. More metrics, like response time, would
>     be helpful as well. The issue to address here: when some IPMI
>     sensor readings are flaky, it would be really helpful to use
>     IPMI command stats to determine whether it is a hardware issue or
>     an IPMI issue. Moreover, it would be a very useful regression test
>     metric for rolling out new BMC software.
>
>     Looking at the host IPMI side, there are some metrics exposed
>     through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I
>     haven't dug into whether it contains information mapping to the
>     interrupts. Time to read the source code I guess.
>
>     Another idea would be to instrument caller libraries like the
>     interfaces in ipmitool, though I feel that approach is harder due
>     to fragmentation of IPMI libraries.
>
>     2. Read and expose core BMC performance metrics from procfs
>
>     This is straightforward: have a smallish daemon (or
>     bmc-state-manager) read, parse, and process procfs and put values
>     on D-Bus. Core metrics I'm interested in getting this way:
>     load average, memory, disk used/available, net stats... The values
>     can then simply be exported as IPMI sensors or Redfish resource
>     properties.
>
>     A nice byproduct of this effort would be a procfs parsing library.
>     Since different platforms would probably have different monitoring
>     requirements and procfs output format has no standard, I'm
>     thinking the user would just provide a configuration file
>     containing a list of (procfs path, property regex, D-Bus property
>     name) tuples, with compile-time generated code providing an object
>     for each property.
>
>     All of this is merely thoughts and nothing concrete. With that
>     said, it would be really great if you could provide some feedback
>     such as "I want this, but I really need that feature", or let me
>     know it's all implemented already :)
>
>     If this seems valuable, after gathering more feedback on feature
>     requirements, I'm going to turn it into design docs and upload
>     them for review.
>
>     -- 
>
>     Regards,
>
>     Kun
>

[-- Attachment #2: Type: text/html, Size: 31597 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BMC health metrics (again!)
  2019-05-17  7:17     ` vishwa
@ 2019-05-17  7:23       ` Neeraj Ladkani
  2019-05-17  7:27         ` vishwa
  0 siblings, 1 reply; 13+ messages in thread
From: Neeraj Ladkani @ 2019-05-17  7:23 UTC (permalink / raw)
  To: vishwa, Kun Yi, OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 7060 bytes --]

Sure thing. Is there a design document that exists for this feature?

I can volunteer to drive this work group if we have quorum.

Neeraj

Get Outlook for Android<https://aka.ms/ghei36>

________________________________
From: vishwa <vishwa@linux.vnet.ibm.com>
Sent: Friday, May 17, 2019 12:17:51 AM
To: Neeraj Ladkani; Kun Yi; OpenBMC Maillist
Subject: Re: BMC health metrics (again!)


Neeraj,

Thanks for the inputs. It's nice to see us having a similar thought.

AFAIK, we don't have any work-group that is driving “Platform telemetry and health monitoring”. Also, do we want to see these as 2 different entities? In the past, there were thoughts about using websockets to channel some of the thermal parameters as telemetry data. But then it was not implemented.

We can discuss here I think.

!! Vishwa !!

On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
At cloud scale, telemetry and health monitoring are critical. We should define a framework that allows platform owners to add their own telemetry hooks. The telemetry service should be designed to make this data accessible and to store it in a resilient way (like a blackbox during a plane crash).

Is there any workgroup that drives this feature “Platform telemetry and health monitoring” ?

Wishlist

BMC telemetry :

  1.  Linux subsystem
     *   Uptime
     *   CPU Load average
     *   Memory info
     *   Storage usage (RW)
     *   Dmesg
     *   Syslog
     *   FDs of critical processes
     *   Alignment traps
     *   WDT excursions
  2.  IPMI subsystem
     *   Request and Response logging per interface with timestamps (KCS, LAN, USB)
     *   Request and Response of IPMB
          i.   Request, Response, No. of Retries
  3.  Misc
     *   Critical Temperature Excursions
          i.   Minimum Reading of Sensor
          ii.  Max Reading of a sensor
          iii. Count of state transitions
          iv.  Retry Count
     *   Count of assertions/deassertions of GPIO and ability to capture the state
     *   Timestamp of last assertion/deassertion of GPIO
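A couple of the wishlist items above (per-interface request/response logging with timestamps, and the blackbox-style resilient storage mentioned earlier) could be prototyped as a small bounded log. A minimal sketch; the record fields and the flush format are invented for illustration and reflect no existing OpenBMC interface:

```python
import collections
import json
import time

class IpmiBlackbox:
    """Bounded in-memory log of IPMI transactions, flushable to a file."""

    def __init__(self, capacity=1024):
        # A deque with maxlen drops the oldest record once full,
        # so memory use stays bounded on a busy link.
        self.records = collections.deque(maxlen=capacity)

    def log(self, interface, netfn, cmd, cc, retries=0, ts=None):
        self.records.append({
            "ts": ts if ts is not None else time.time(),
            "interface": interface,   # e.g. "kcs", "lan", "ipmb"
            "netfn": netfn,
            "cmd": cmd,
            "cc": cc,                 # IPMI completion code; 0x00 = success
            "retries": retries,
        })

    def stats(self):
        """Sent/succeeded counters over the retained window."""
        sent = len(self.records)
        ok = sum(1 for r in self.records if r["cc"] == 0x00)
        return {"sent": sent, "succeeded": ok}

    def flush(self, path):
        # One JSON object per line, so a partial write still leaves
        # earlier records parseable after a crash.
        with open(path, "w") as f:
            for r in self.records:
                f.write(json.dumps(r) + "\n")
```

Flushing to persistent storage on a timer (rather than per record) would keep flash wear down, which ties back to item 1 of vishwa's list.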

Thanks
~Neeraj

From: openbmc <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org><mailto:openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> On Behalf Of vishwa
Sent: Wednesday, May 8, 2019 1:11 AM
To: Kun Yi <kunyi@google.com><mailto:kunyi@google.com>; OpenBMC Maillist <openbmc@lists.ozlabs.org><mailto:openbmc@lists.ozlabs.org>
Subject: Re: BMC health metrics (again!)


Hello Kun,

Thanks for initiating it. I liked the /proc parsing. On the IPMI thing, is it only targeted at IPMI -or- at a generic BMC-Host communication link?

Some of the things in my wish-list are:

1/. Flash wear and tear detection and the threshold to be a config option
2/. Any SoC specific health checks ( If that is exposed )
3/. Mechanism to detect spurious interrupts on any HW link
4/. Some kind of check to see if there will be any I2C lock to a given end device
5/. Ability to detect errors on HW links
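On the flash wear point (1/.), recent kernels expose eMMC life estimates through sysfs (the life_time and pre_eol_info attributes); the exact sysfs path is platform-dependent, so treat the path and the threshold handling below as assumptions. A sketch of parsing those values against a configurable threshold:

```python
def parse_life_time(text):
    """Parse the two hex values of an eMMC 'life_time' sysfs attribute.

    Each value 0x01..0x0B estimates device life used in 10% steps
    (0x01 = 0-10% used, 0x0B = estimated lifetime exceeded), per the
    eMMC 5.0 DEVICE_LIFE_TIME_EST fields. Returns the worst-case
    upper bound as a percentage.
    """
    a, b = (int(v, 16) for v in text.split())
    return max(a, b) * 10

def wear_exceeded(text, threshold_pct=80):
    """True once the estimated wear crosses the configured threshold."""
    return parse_life_time(text) >= threshold_pct
```

On a real BMC the text would come from something like /sys/bus/mmc/devices/<dev>/life_time (path varies), and threshold_pct would be the config option vishwa mentions.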

On the watchdog(8) area, I was just thinking these:

How about having some kind of BMC_health D-Bus properties -or- a compile-time feed, whose values can be fed into a configuration file, rather than watchdog always using the default /etc/watchdog.conf. If the properties are coming from D-Bus, then we could either append to /etc/watchdog.conf -or- treat those values alone as the config file that is given to watchdog.
The systemd service files would be set up accordingly.
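The generation step could be a tiny renderer: read the health properties (a plain dict stands in for the D-Bus values here) and emit a watchdog.conf fragment. The option names below (max-load-1, min-memory, watchdog-device) are real watchdog.conf keys; the D-Bus property names are invented for illustration:

```python
def render_watchdog_conf(props):
    """Render a watchdog.conf fragment from BMC health properties.

    `props` stands in for values read from a hypothetical BMC_health
    D-Bus object; only options actually present are emitted, so the
    output can either replace /etc/watchdog.conf or be appended to it.
    """
    # Hypothetical D-Bus property name -> watchdog.conf option.
    mapping = {
        "MaxLoad1": "max-load-1",
        "MinMemoryPages": "min-memory",
        "WatchdogDevice": "watchdog-device",
    }
    lines = [f"{opt} = {props[name]}"
             for name, opt in mapping.items() if name in props]
    return "\n".join(lines) + "\n"
```

A systemd unit could run this in ExecStartPre= and point watchdog(8) at the generated file with -c.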

We have seen instances where we get an error indicating that no resources are available. Those could be file descriptors, socket descriptors, etc. A way to plug this into watchdog as part of a test binary that checks for this? We could hook a repair binary to take the action.
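watchdog(8) treats a non-zero exit from a test binary as a fault and can then run a repair binary. A sketch of the descriptor-exhaustion check in that shape, with the decision logic kept pure so it can be exercised directly; reading /proc/self/fd and the rlimit for the calling process is standard, but wiring this up per critical process is left as an assumption:

```python
import os

def fd_headroom_ok(open_fds, soft_limit, min_free=64):
    """True while at least `min_free` descriptors remain below the soft limit."""
    return (soft_limit - open_fds) >= min_free

def self_check(min_free=64):
    """Watchdog-style test for the calling process: 0 = healthy, 1 = starved."""
    import resource
    open_fds = len(os.listdir("/proc/self/fd"))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return 0 if fd_headroom_ok(open_fds, soft, min_free) else 1
```

For monitoring another daemon, the counts would come from /proc/&lt;pid&gt;/fd and /proc/&lt;pid&gt;/limits instead.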

Another thing that I was looking at hooking into watchdog is the test to see the file system usage as defined by the policy.
Policy could mention the file system mounts and also the threshold.

For example, /tmp, /root, etc. We could again hook a repair binary to do some cleanup if needed.
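The mount-plus-threshold policy fits the same test/repair-binary pattern. A sketch, with the evaluation separated from the statvfs call so the policy logic itself is easy to verify; the policy format is invented for illustration:

```python
# Hypothetical policy: mount point -> maximum percent used.
POLICY = {"/tmp": 80, "/root": 90}

def pct_used(total_blocks, free_blocks):
    """Percent of a filesystem in use, given statvfs-style block counts."""
    used = total_blocks - free_blocks
    return 100.0 * used / total_blocks

def over_threshold(policy, usage):
    """Return the mounts whose measured usage exceeds the policy.

    `usage` maps mount point -> percent used, e.g. gathered per mount
    via os.statvfs(mount) as pct_used(st.f_blocks, st.f_bavail).
    """
    return [m for m, limit in policy.items()
            if usage.get(m, 0) > limit]
```

The repair binary would receive the offending mount and do the cleanup vishwa describes.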

If we see the list growing with these custom requirements, then it probably does not make sense to pollute watchdog(8); we could
have these consumed by the app instead.

!! Vishwa !!
On 4/9/19 9:55 PM, Kun Yi wrote:
Hello there,

This topic has been brought up several times on the mailing list and offline, but it seems we as a community haven't reached a consensus on what things would be the most valuable to monitor, and how to monitor them. While a general-purpose monitoring infrastructure for OpenBMC seems a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.

1. Monitoring host IPMI link reliability (host side)

The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics, like response time, would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to use IPMI command stats to determine whether it is a hardware issue or an IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.

Looking at the host IPMI side, there are some metrics exposed through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code I guess.
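For reference, si_stats is a simple "name: value" listing of ipmi_si driver counters (short_timeouts, long_timeouts, interrupts, hosed_count, and so on); I haven't verified every field name, so the sample below should be treated as a plausible excerpt rather than authoritative output. Parsing it into numbers is trivial:

```python
def parse_si_stats(text):
    """Parse the 'name: value' lines of /proc/ipmi/<n>/si_stats into a dict."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        stats[name.strip()] = int(value)
    return stats

# Assumed sample of what the file might contain.
SAMPLE = """\
interrupts_enabled:    1
short_timeouts:        152
long_timeouts:         3
interrupts:            20451
hosed_count:           0
complete_transactions: 20431
"""
```

hosed_count and the timeout counters would be the first candidates for the "hardware vs. IPMI" triage described above.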

Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.

2. Read and expose core BMC performance metrics from procfs

This is straightforward: have a smallish daemon (or bmc-state-manager) read, parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.
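As a minimal example of the daemon's read/parse step, /proc/loadavg is a stable one-line format ("0.12 0.34 0.56 1/234 5678"). The sketch below parses it into the values that would then be set as D-Bus properties; the property names are invented, and the D-Bus publishing itself is omitted:

```python
def parse_loadavg(text):
    """Parse /proc/loadavg: three load averages, runnable/total tasks, last PID."""
    fields = text.split()
    load1, load5, load15 = (float(f) for f in fields[:3])
    runnable, total = (int(n) for n in fields[3].split("/"))
    return {
        "Load1": load1,            # hypothetical D-Bus property names
        "Load5": load5,
        "Load15": load15,
        "RunnableTasks": runnable,
        "TotalTasks": total,
    }
```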

A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and procfs output format has no standard, I'm thinking the user would just provide a configuration file containing a list of (procfs path, property regex, D-Bus property name) tuples, with compile-time generated code providing an object for each property.
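The (procfs path, property regex, D-Bus property name) configuration could look like the entries below. Here it is interpreted at runtime rather than generated at compile time, which is enough to show the mapping; the entries and names are mine, not an existing OpenBMC format:

```python
import re

# (procfs path, regex with one capture group, D-Bus property name)
CONFIG = [
    ("/proc/meminfo", r"^MemAvailable:\s+(\d+) kB", "MemAvailableKiB"),
    ("/proc/uptime",  r"^([0-9.]+)",                "UptimeSeconds"),
]

def extract(config, read_file):
    """Apply each configured regex to its file's contents.

    `read_file` abstracts the procfs read so the logic is testable;
    on a BMC it would be: lambda p: open(p).read().
    """
    props = {}
    for path, pattern, name in config:
        match = re.search(pattern, read_file(path), re.MULTILINE)
        if match:
            props[name] = match.group(1)
    return props
```

A compile-time variant would emit one D-Bus object per CONFIG row instead of returning a dict.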

All of this is merely thoughts and nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)

If this seems valuable, after gathering more feedback on feature requirements, I'm going to turn it into design docs and upload them for review.

--
Regards,
Kun

[-- Attachment #2: Type: text/html, Size: 30712 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BMC health metrics (again!)
  2019-05-17  7:23       ` Neeraj Ladkani
@ 2019-05-17  7:27         ` vishwa
  2019-05-17 15:50           ` Kun Yi
  0 siblings, 1 reply; 13+ messages in thread
From: vishwa @ 2019-05-17  7:27 UTC (permalink / raw)
  To: Neeraj Ladkani, Kun Yi, OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 7382 bytes --]

IMO, we could start fresh here. The initial thought was a year+ ago.

!! Vishwa !!

On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
> Sure thing. Is there a design document that exists for this feature?
>
> I can volunteer to drive this work group if we have quorum.
>
> Neeraj
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
> ------------------------------------------------------------------------
> *From:* vishwa <vishwa@linux.vnet.ibm.com>
> *Sent:* Friday, May 17, 2019 12:17:51 AM
> *To:* Neeraj Ladkani; Kun Yi; OpenBMC Maillist
> *Subject:* Re: BMC health metrics (again!)
>
> Neeraj,
>
> Thanks for the inputs. It's nice to see us having a similar thought.
>
> AFAIK, we don't have any work-group that is driving “Platform 
> telemetry and health monitoring”. Also, do we want to see these as 2 
> different entities? In the past, there were thoughts about using 
> websockets to channel some of the thermal parameters as telemetry 
> data. But then it was not implemented.
>
> We can discuss here I think.
>
> !! Vishwa !!
>
> On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>>
>> At cloud scale, telemetry and health monitoring are critical. We 
>> should define a framework that allows platform owners to add their 
>> own telemetry hooks. The telemetry service should be designed to make 
>> this data accessible and store it in a resilient way (like a blackbox 
>> during a plane crash).
>>
>> Is there any workgroup that drives this feature “Platform telemetry 
>> and health monitoring” ?
>>
>> Wishlist
>>
>> BMC telemetry :
>>
>>  1. Linux subsystem
>>      1. Uptime
>>      2. CPU Load average
>>      3. Memory info
>>      4. Storage usage ( RW )
>>      5. Dmesg
>>      6. Syslog
>>      7. FDs of critical processes
>>      8. Alignment traps
>>      9. WDT excursions
>>  2. IPMI subsystem
>>      1. Request and Response logging per interface with timestamps
>>         (KCS, LAN, USB)
>>      2. Request and Response of IPMB
>>
>>         i. Request, Response, No. of Retries
>>
>>  3. Misc
>>
>>  1. Critical Temperature Excursions
>>
>>         i.   Minimum Reading of Sensor
>>         ii.  Max Reading of a sensor
>>         iii. Count of state transitions
>>         iv.  Retry Count
>>
>>  2. Count of assertions/deassertions of GPIO and ability to capture
>>     the state
>>  3. timestamp of last assertion/deassertion of GPIO
>>
>> Thanks
>>
>> ~Neeraj
>>
>> *From:*openbmc 
>> <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> *On Behalf Of 
>> *vishwa
>> *Sent:* Wednesday, May 8, 2019 1:11 AM
>> *To:* Kun Yi <kunyi@google.com>; OpenBMC Maillist 
>> <openbmc@lists.ozlabs.org>
>> *Subject:* Re: BMC health metrics (again!)
>>
>> Hello Kun,
>>
>> Thanks for initiating it. I liked the /proc parsing. On the IPMI 
>> thing, is it only targeted at IPMI -or- at a generic BMC-Host 
>> communication link?
>>
>> Some of the things in my wish-list are:
>>
>> 1/. Flash wear and tear detection and the threshold to be a config option
>> 2/. Any SoC specific health checks ( If that is exposed )
>> 3/. Mechanism to detect spurious interrupts on any HW link
>> 4/. Some kind of check to see if there will be any I2C lock to a 
>> given end device
>> 5/. Ability to detect errors on HW links
>>
>> On the watchdog(8) area, I was just thinking these:
>>
>> How about having some kind of BMC_health D-Bus properties -or- a 
>> compile time feed, whose values can be fed into a configuration file 
>> than watchdog using the default /etc/watchdog.conf always. If the 
>> properties are coming from a D-Bus, then we could either append to 
>> /etc/watchdog.conf -or- treat those values only as the config file 
>> that can be given to watchdog.
>> The systemd service files to be setup accordingly.
>>
>>
>> We have seen instances where we get an error that is indicating no 
>> resources available. Those could be file descriptors / socket 
>> descriptors etc. A way to plug this into watchdog as part of test 
>> binary that checks for this ? We could hook a repair-binary to take 
>> the action.
>>
>>
>> Another thing that I was looking at hooking into watchdog is the test 
>> to see the file system usage as defined by the policy.
>> Policy could mention the file system mounts and also the threshold.
>>
>> For example, /tmp , /root etc.. We could again hook a repair binary 
>> to do some cleanup if needed
>>
>> If we see the list is growing with these custom requirements, then 
>> probably does not make sense to pollute the watchdog(2) but
>> have these consumed into the app instead ?
>>
>> !! Vishwa !!
>>
>> On 4/9/19 9:55 PM, Kun Yi wrote:
>>
>>     Hello there,
>>
>>     This topic has been brought up several times on the mailing list
>>     and offline, but it seems we as a community haven't reached
>>     a consensus on what things would be the most valuable to monitor,
>>     and how to monitor them. While it seems a general-purpose
>>     monitoring infrastructure for OpenBMC is a hard problem, I have
>>     some simple ideas that I hope can provide immediate and direct
>>     benefits.
>>
>>     1. Monitoring host IPMI link reliability (host side)
>>
>>     The essentials I want are "IPMI commands sent" and "IPMI commands
>>     succeeded" counts over time. More metrics like response time
>>     would be helpful as well. The issue to address here: when some
>>     IPMI sensor readings are flaky, it would be really helpful to
>>     tell from IPMI command stats to determine whether it is a
>>     hardware issue, or IPMI issue. Moreover, it would be a very
>>     useful regression test metric for rolling out new BMC software.
>>
>>     Looking at the host IPMI side, there are some metrics exposed
>>     through /proc/ipmi/0/si_stats if ipmi_si driver is used, but I
>>     haven't dug into whether it contains information mapping to the
>>     interrupts. Time to read the source code I guess.
>>
>>     Another idea would be to instrument caller libraries like the
>>     interfaces in ipmitool, though I feel that approach is harder due
>>     to fragmentation of IPMI libraries.
>>
>>     2. Read and expose core BMC performance metrics from procfs
>>
>>     This is straightforward: have a smallish daemon (or
>>     bmc-state-manager) read, parse, and process procfs and put values
>>     on D-Bus. Core metrics I'm interested in getting through this
>>     way: load average, memory, disk used/available, net stats... The
>>     values can then simply be exported as IPMI sensors or Redfish
>>     resource properties.
>>
>>     A nice byproduct of this effort would be a procfs parsing
>>     library. Since different platforms would probably have different
>>     monitoring requirements and procfs output format has no standard,
>>     I'm thinking the user would just provide a configuration file
>>     containing list of (procfs path, property regex, D-Bus property
>>     name), and the compile-time generated code to provide an object
>>     for each property.
>>
>>     All of this is merely thoughts and nothing concrete. With that
>>     said, it would be really great if you could provide some feedback
>>     such as "I want this, but I really need that feature", or let me
>>     know it's all implemented already :)
>>
>>     If this seems valuable, after gathering more feedback of feature
>>     requirements, I'm going to turn them into design docs and upload
>>     for review.
>>
>>     -- 
>>
>>     Regards,
>>
>>     Kun
>>

[-- Attachment #2: Type: text/html, Size: 35191 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BMC health metrics (again!)
  2019-05-17  7:27         ` vishwa
@ 2019-05-17 15:50           ` Kun Yi
  2019-05-17 18:25             ` vishwa
  0 siblings, 1 reply; 13+ messages in thread
From: Kun Yi @ 2019-05-17 15:50 UTC (permalink / raw)
  To: vishwa; +Cc: Neeraj Ladkani, OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 8023 bytes --]

I'd also like to be in the metrics workgroup. Neeraj, I can see that the
first and second points you listed align with my goals in the original
proposal very well.

On Fri, May 17, 2019 at 12:28 AM vishwa <vishwa@linux.vnet.ibm.com> wrote:

> IMO, we could start fresh here. The initial thought was a year+ ago.
>
> !! Vishwa !!
> On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
>
> Sure thing. Is there a design document that exists for this feature?
>
> I can volunteer to drive this work group if we have quorum.
>
> Neeraj
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
> ------------------------------
> *From:* vishwa <vishwa@linux.vnet.ibm.com> <vishwa@linux.vnet.ibm.com>
> *Sent:* Friday, May 17, 2019 12:17:51 AM
> *To:* Neeraj Ladkani; Kun Yi; OpenBMC Maillist
> *Subject:* Re: BMC health metrics (again!)
>
>
> Neeraj,
>
> Thanks for the inputs. It's nice to see us having a similar thought.
>
> AFAIK, we don't have any work-group that is driving “Platform telemetry
> and health monitoring”. Also, do we want to see these as 2 different
> entities? In the past, there were thoughts about using websockets to
> channel some of the thermal parameters as telemetry data. But then it was
> not implemented.
>
> We can discuss here I think.
>
> !! Vishwa !!
> On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>
> At cloud scale, telemetry and health monitoring are critical. We
> should define a framework that allows platform owners to add their own
> telemetry hooks. The telemetry service should be designed to make this data
> accessible and store it in a resilient way (like a blackbox during a plane crash).
>
>
>
> Is there any workgroup that drives this feature “Platform telemetry and
> health monitoring” ?
>
>
>
> Wishlist
>
>
>
> BMC telemetry :
>
>    1. Linux subsystem
>       1. Uptime
>       2. CPU Load average
>       3. Memory info
>       4. Storage usage ( RW )
>       5. Dmesg
>       6. Syslog
>       7. FDs of critical processes
>       8. Alignment traps
>       9. WDT excursions
>    2. IPMI subsystem
>       1. Request and Response logging per interface with timestamps
>          (KCS, LAN, USB)
>       2. Request and Response of IPMB
>          i.   Request, Response, No. of Retries
>
>    3. Misc
>       1. Critical Temperature Excursions
>          i.   Minimum Reading of Sensor
>          ii.  Max Reading of a sensor
>          iii. Count of state transitions
>          iv.  Retry Count
>       2. Count of assertions/deassertions of GPIO and ability to capture
>          the state
>       3. Timestamp of last assertion/deassertion of GPIO
>
>
>
> Thanks
>
> ~Neeraj
>
>
>
> *From:* openbmc <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org>
> <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> *On Behalf Of *
> vishwa
> *Sent:* Wednesday, May 8, 2019 1:11 AM
> *To:* Kun Yi <kunyi@google.com> <kunyi@google.com>; OpenBMC Maillist
> <openbmc@lists.ozlabs.org> <openbmc@lists.ozlabs.org>
> *Subject:* Re: BMC health metrics (again!)
>
>
>
> Hello Kun,
>
> Thanks for initiating it. I liked the /proc parsing. On the IPMI thing, is
> it only targeted at IPMI -or- at a generic BMC-Host communication link?
>
> Some of the things in my wish-list are:
>
> 1/. Flash wear and tear detection and the threshold to be a config option
> 2/. Any SoC specific health checks ( If that is exposed )
> 3/. Mechanism to detect spurious interrupts on any HW link
> 4/. Some kind of check to see if there will be any I2C lock to a given end
> device
> 5/. Ability to detect errors on HW links
>
> On the watchdog(8) area, I was just thinking these:
>
> How about having some kind of BMC_health D-Bus properties -or- a compile-
> time feed, whose values can be fed into a configuration file, rather than
> watchdog always using the default /etc/watchdog.conf. If the properties are
> coming from D-Bus, then we could either append to /etc/watchdog.conf -or-
> treat those values alone as the config file that is given to watchdog.
> The systemd service files would be set up accordingly.
>
>
> We have seen instances where we get an error that is indicating no
> resources available. Those could be file descriptors / socket descriptors
> etc. A way to plug this into watchdog as part of test binary that checks
> for this ? We could hook a repair-binary to take the action.
>
>
> Another thing that I was looking at hooking into watchdog is the test to
> see the file system usage as defined by the policy.
> Policy could mention the file system mounts and also the threshold.
>
> For example, /tmp , /root etc.. We could again hook a repair binary to do
> some cleanup if needed
>
> If we see the list is growing with these custom requirements, then
> probably does not make sense to pollute the watchdog(2) but
> have these consumed into the app instead ?
>
> !! Vishwa !!
>
> On 4/9/19 9:55 PM, Kun Yi wrote:
>
> Hello there,
>
>
>
> This topic has been brought up several times on the mailing list and
> offline, but it seems we as a community haven't reached a consensus on
> what things would be the most valuable to monitor, and how to monitor them.
> While it seems a general-purpose monitoring infrastructure for OpenBMC is
> a hard problem, I have some simple ideas that I hope can provide immediate
> and direct benefits.
>
>
>
> 1. Monitoring host IPMI link reliability (host side)
>
>
>
> The essentials I want are "IPMI commands sent" and "IPMI commands
> succeeded" counts over time. More metrics like response time would
> be helpful as well. The issue to address here: when some IPMI sensor
> readings are flaky, it would be really helpful to tell from IPMI command
> stats to determine whether it is a hardware issue, or IPMI issue. Moreover,
> it would be a very useful regression test metric for rolling out new BMC
> software.
>
>
>
> Looking at the host IPMI side, there are some metrics exposed
> through /proc/ipmi/0/si_stats if ipmi_si driver is used, but I haven't dug
> into whether it contains information mapping to the interrupts. Time to
> read the source code I guess.
>
>
>
> Another idea would be to instrument caller libraries like the interfaces
> in ipmitool, though I feel that approach is harder due to fragmentation of
> IPMI libraries.
>
>
>
> 2. Read and expose core BMC performance metrics from procfs
>
>
>
> This is straightforward: have a smallish daemon (or bmc-state-manager)
> read, parse, and process procfs and put values on D-Bus. Core metrics I'm
> interested in getting through this way: load average, memory, disk
> used/available, net stats... The values can then simply be exported as IPMI
> sensors or Redfish resource properties.
>
>
>
> A nice byproduct of this effort would be a procfs parsing library. Since
> different platforms would probably have different monitoring requirements
> and procfs output format has no standard, I'm thinking the user would just
> provide a configuration file containing list of (procfs path, property
> regex, D-Bus property name), and the compile-time generated code to provide
> an object for each property.
>
>
>
> All of this is merely thoughts and nothing concrete. With that said, it
> would be really great if you could provide some feedback such as "I want
> this, but I really need that feature", or let me know it's all implemented
> already :)
>
>
>
> If this seems valuable, after gathering more feedback of feature
> requirements, I'm going to turn them into design docs and upload for review.
>
>
>
> --
>
> Regards,
>
> Kun
>
>

-- 
Regards,
Kun

[-- Attachment #2: Type: text/html, Size: 22156 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BMC health metrics (again!)
  2019-05-17 15:50           ` Kun Yi
@ 2019-05-17 18:25             ` vishwa
  2019-05-20 21:29               ` Neeraj Ladkani
  0 siblings, 1 reply; 13+ messages in thread
From: vishwa @ 2019-05-17 18:25 UTC (permalink / raw)
  To: Kun Yi; +Cc: Neeraj Ladkani, OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 8808 bytes --]

This is great !!

Neeraj / Kun, were you guys planning on putting up an initial proposal?

!! Vishwa !!

On 5/17/19 9:20 PM, Kun Yi wrote:
> I'd also like to be in the metrics workgroup. Neeraj, I can see that the 
> first and second points you listed align with my goals in the original 
> proposal very well.
>
> On Fri, May 17, 2019 at 12:28 AM vishwa <vishwa@linux.vnet.ibm.com 
> <mailto:vishwa@linux.vnet.ibm.com>> wrote:
>
>     IMO, we could start fresh here. The initial thought was a year+ ago.
>
>     !! Vishwa !!
>
>     On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
>>     Sure thing. Is there a design document that exists for this
>>     feature?
>>
>>     I can volunteer to drive this work group if we have quorum.
>>
>>     Neeraj
>>
>>     Get Outlook for Android <https://aka.ms/ghei36>
>>
>>     ------------------------------------------------------------------------
>>     *From:* vishwa <vishwa@linux.vnet.ibm.com>
>>     <mailto:vishwa@linux.vnet.ibm.com>
>>     *Sent:* Friday, May 17, 2019 12:17:51 AM
>>     *To:* Neeraj Ladkani; Kun Yi; OpenBMC Maillist
>>     *Subject:* Re: BMC health metrics (again!)
>>
>>     Neeraj,
>>
>>     Thanks for the inputs. It's nice to see us having a similar thought.
>>
>>     AFAIK, we don't have any work-group that is driving “Platform
>>     telemetry and health monitoring”. Also, do we want to see these as
>>     2 different entities? In the past, there were thoughts about
>>     using websockets to channel some of the thermal parameters as
>>     telemetry data. But then it was not implemented.
>>
>>     We can discuss here I think.
>>
>>     !! Vishwa !!
>>
>>     On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>>>
>>>     At cloud scale, telemetry and health monitoring are
>>>     critical. We should define a framework that allows platform
>>>     owners to add their own telemetry hooks. The telemetry service
>>>     should be designed to make this data accessible and store it in
>>>     a resilient way (like a blackbox during a plane crash).
>>>
>>>     Is there any workgroup that drives this feature “Platform
>>>     telemetry and health monitoring” ?
>>>
>>>     Wishlist
>>>
>>>     BMC telemetry :
>>>
>>>      1. Linux subsystem
>>>          1. Uptime
>>>          2. CPU Load average
>>>          3. Memory info
>>>          4. Storage usage ( RW )
>>>          5. Dmesg
>>>          6. Syslog
>>>          7. FDs of critical processes
>>>          8. Alignment traps
>>>          9. WDT excursions
>>>      2. IPMI subsystem
>>>          1. Request and Response logging per interface with
>>>             timestamps (KCS, LAN, USB)
>>>          2. Request and Response of IPMB
>>>
>>>             i. Request, Response, No. of Retries
>>>
>>>      3. Misc
>>>
>>>      1. Critical Temperature Excursions
>>>
>>>          i.   Minimum Reading of Sensor
>>>          ii.  Max Reading of a sensor
>>>          iii. Count of state transitions
>>>          iv.  Retry Count
>>>
>>>      2. Count of assertions/deassertions of GPIO and ability to
>>>         capture the state
>>>      3. timestamp of last assertion/deassertion of GPIO
>>>
>>>     Thanks
>>>
>>>     ~Neeraj
>>>
>>>     *From:*openbmc
>>>     <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org>
>>>     <mailto:openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org>
>>>     *On Behalf Of *vishwa
>>>     *Sent:* Wednesday, May 8, 2019 1:11 AM
>>>     *To:* Kun Yi <kunyi@google.com> <mailto:kunyi@google.com>;
>>>     OpenBMC Maillist <openbmc@lists.ozlabs.org>
>>>     <mailto:openbmc@lists.ozlabs.org>
>>>     *Subject:* Re: BMC health metrics (again!)
>>>
>>>     Hello Kun,
>>>
>>>     Thanks for initiating it. I liked the /proc parsing. On the IPMI
>>>     thing, is it only targeted at IPMI -or- at a generic BMC-Host
>>>     communication link?
>>>
>>>     Some of the things in my wish-list are:
>>>
>>>     1/. Flash wear and tear detection and the threshold to be a
>>>     config option
>>>     2/. Any SoC specific health checks ( If that is exposed )
>>>     3/. Mechanism to detect spurious interrupts on any HW link
>>>     4/. Some kind of check to see if there will be any I2C lock to a
>>>     given end device
>>>     5/. Ability to detect errors on HW links
>>>
>>>     On the watchdog(8) area, I was just thinking these:
>>>
>>>     How about having some kind of BMC_health D-Bus properties -or- a
>>>     compile time feed, whose values can be fed into a configuration
>>>     file than watchdog using the default /etc/watchdog.conf always.
>>>     If the properties are coming from a D-Bus, then we could either
>>>     append to /etc/watchdog.conf -or- treat those values only as the
>>>     config file that can be given to watchdog.
>>>     The systemd service files to be setup accordingly.
>>>
>>>
>>>     We have seen instances where we get an error that is indicating
>>>     no resources available. Those could be file descriptors / socket
>>>     descriptors etc. A way to plug this into watchdog as part of
>>>     test binary that checks for this ? We could hook a repair-binary
>>>     to take the action.
>>>
>>>
>>>     Another thing that I was looking at hooking into watchdog is the
>>>     test to see the file system usage as defined by the policy.
>>>     Policy could mention the file system mounts and also the threshold.
>>>
>>>     For example, /tmp , /root etc.. We could again hook a repair
>>>     binary to do some cleanup if needed
>>>
>>>     If we see the list growing with these custom requirements, then
>>>     it probably does not make sense to pollute watchdog(8), but to
>>>     have these consumed in a dedicated app instead ?
>>>
>>>     !! Vishwa !!
>>>
>>>     On 4/9/19 9:55 PM, Kun Yi wrote:
>>>
>>>         Hello there,
>>>
>>>         This topic has been brought up several times on the mailing
>>>         list and offline, but it seems we as a community haven't
>>>         reached a consensus on what things would be the most
>>>         valuable to monitor, and how to monitor them. While a
>>>         general-purpose monitoring infrastructure for OpenBMC is
>>>         a hard problem, I have some simple ideas that I hope can
>>>         provide immediate and direct benefits.
>>>
>>>         1. Monitoring host IPMI link reliability (host side)
>>>
>>>         The essentials I want are "IPMI commands sent" and "IPMI
>>>         commands succeeded" counts over time. More metrics like
>>>         response time would be helpful as well. The issue to address
>>>         here: when some IPMI sensor readings are flaky, it would be
>>>         really helpful to use the IPMI command stats to determine
>>>         whether it is a hardware issue or an IPMI issue. Moreover, it
>>>         would be a very useful regression test metric for rolling
>>>         out new BMC software.
>>>
>>>         Looking at the host IPMI side, there are some metrics exposed
>>>         through /proc/ipmi/0/si_stats if the ipmi_si driver is used,
>>>         but I haven't dug into whether it contains information
>>>         mapping to the interrupts. Time to read the source code I guess.
>>>
>>>         Another idea would be to instrument caller libraries like
>>>         the interfaces in ipmitool, though I feel that approach is
>>>         harder due to fragmentation of IPMI libraries.
>>>
>>>         2. Read and expose core BMC performance metrics from procfs
>>>
>>>         This is straightforward: have a smallish daemon (or
>>>         bmc-state-manager) read, parse, and process procfs and put
>>>         values on D-Bus. Core metrics I'm interested in getting
>>>         this way: load average, memory, disk used/available,
>>>         net stats... The values can then simply be exported as IPMI
>>>         sensors or Redfish resource properties.
>>>
>>>         A nice byproduct of this effort would be a procfs parsing
>>>         library. Since different platforms would probably have
>>>         different monitoring requirements and procfs output format
>>>         has no standard, I'm thinking the user would just provide a
>>>         configuration file containing a list of (procfs path, property
>>>         regex, D-Bus property name) tuples, and compile-time generated
>>>         code would provide an object for each property.
>>>
>>>         All of this is merely thoughts and nothing concrete. With
>>>         that said, it would be really great if you could provide
>>>         some feedback such as "I want this, but I really need that
>>>         feature", or let me know if it's all implemented already :)
>>>
>>>         If this seems valuable, after gathering more feedback on
>>>         feature requirements, I'm going to turn them into design
>>>         docs and upload them for review.
>>>
>>>         -- 
>>>
>>>         Regards,
>>>
>>>         Kun
>>>
>
>
> -- 
> Regards,
> Kun

[-- Attachment #2: Type: text/html, Size: 26560 bytes --]


* RE: BMC health metrics (again!)
  2019-05-17 18:25             ` vishwa
@ 2019-05-20 21:29               ` Neeraj Ladkani
  0 siblings, 0 replies; 13+ messages in thread
From: Neeraj Ladkani @ 2019-05-20 21:29 UTC (permalink / raw)
  To: vishwa, Kun Yi; +Cc: OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 8026 bytes --]

Just sent an email for participation. Let's discuss specific requirements and come up with a proposal.

Neeraj

From: vishwa <vishwa@linux.vnet.ibm.com>
Sent: Friday, May 17, 2019 11:26 AM
To: Kun Yi <kunyi@google.com>
Cc: Neeraj Ladkani <neladk@microsoft.com>; OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: BMC health metrics (again!)


This is great !!

Neeraj / Kun, were you guys planning on putting up an initial proposal ?

!! Vishwa !!
On 5/17/19 9:20 PM, Kun Yi wrote:
I'd also like to be in the metrics workgroup. Neeraj, I can see the first and second points you listed align with my goals in the original proposal very well.

On Fri, May 17, 2019 at 12:28 AM vishwa <vishwa@linux.vnet.ibm.com<mailto:vishwa@linux.vnet.ibm.com>> wrote:

IMO, we could start fresh here. The initial thought was a year+ ago.

!! Vishwa !!
On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
Sure thing. Is there a design document that exists for this feature ?
I can volunteer to drive this work group if we have quorum.
Neeraj

________________________________
From: vishwa <vishwa@linux.vnet.ibm.com><mailto:vishwa@linux.vnet.ibm.com>
Sent: Friday, May 17, 2019 12:17:51 AM
To: Neeraj Ladkani; Kun Yi; OpenBMC Maillist
Subject: Re: BMC health metrics (again!)


Neeraj,

Thanks for the inputs. It's nice to see us having similar thoughts.

AFAIK, we don't have any work-group that is driving “Platform telemetry and health monitoring”. Also, do we want to see these as 2 different entities ? In the past, there were thoughts about using websockets to channel some of the thermal parameters as telemetry data, but that was not implemented.

We can discuss here I think.

!! Vishwa !!
On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
At cloud scale, telemetry and health monitoring are very critical. We should define a framework that allows platform owners to add their own telemetry hooks. The telemetry service should be designed to make this data accessible and to store it in a resilient way (like a black box during a plane crash).

Is there any workgroup that drives this feature “Platform telemetry and health monitoring” ?

Wishlist

BMC telemetry:

  1.  Linux subsystem
      *   Uptime
      *   CPU load average
      *   Memory info
      *   Storage usage ( RW )
      *   dmesg
      *   syslog
      *   FDs of critical processes
      *   Alignment traps
      *   WDT excursions

  2.  IPMI subsystem
      *   Request and response logging per interface, with timestamps ( KCS, LAN, USB )
      *   Request and response of IPMB
          i.   Request, response, number of retries

  3.  Misc
      *   Critical temperature excursions
          i.   Minimum reading of a sensor
          ii.  Maximum reading of a sensor
          iii. Count of state transitions
          iv.  Retry count
      *   Count of assertions/deassertions of a GPIO, and ability to capture the state
      *   Timestamp of the last assertion/deassertion of a GPIO
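The GPIO items in the wishlist amount to simple per-line bookkeeping. A minimal sketch of what such a record could look like, decoupled from any actual GPIO event source (the class name and fields here are purely illustrative, not an existing OpenBMC interface):

```python
import time


class GpioEventStats:
    """Per-GPIO-line bookkeeping: assertion/deassertion counts and the
    state + timestamp of the last observed edge. An event source (e.g. a
    libgpiod monitor loop) would call record() on every edge."""

    def __init__(self):
        self.asserted = 0
        self.deasserted = 0
        self.last_change = None  # (state, timestamp) of the most recent edge

    def record(self, state, ts=None):
        """Record one edge; state is truthy for assertion."""
        ts = time.monotonic() if ts is None else ts
        if state:
            self.asserted += 1
        else:
            self.deasserted += 1
        self.last_change = (state, ts)


if __name__ == "__main__":
    stats = GpioEventStats()
    stats.record(1)
    stats.record(0)
    print(stats.asserted, stats.deasserted, stats.last_change)
```

Each counter and the last-change tuple would then map naturally onto D-Bus properties or Redfish resource fields.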

Thanks
~Neeraj

From: openbmc <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org><mailto:openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> On Behalf Of vishwa
Sent: Wednesday, May 8, 2019 1:11 AM
To: Kun Yi <kunyi@google.com><mailto:kunyi@google.com>; OpenBMC Maillist <openbmc@lists.ozlabs.org><mailto:openbmc@lists.ozlabs.org>
Subject: Re: BMC health metrics (again!)


Hello Kun,

Thanks for initiating it. I liked the /proc parsing. On the IPMI thing, is it only targeted at IPMI -or- a generic BMC-Host communication link ?

Some of the things in my wish-list are:

1/. Flash wear and tear detection, with the threshold as a config option
2/. Any SoC-specific health checks ( if that is exposed )
3/. Mechanism to detect spurious interrupts on any HW link
4/. Some kind of check to see if there is an I2C lock to a given end device
5/. Ability to detect errors on HW links

On the watchdog(8) area, I was just thinking these:

How about having some kind of BMC_health D-Bus properties -or- a compile-time feed, whose values can be fed into a configuration file for watchdog rather than always using the default /etc/watchdog.conf. If the properties are coming from D-Bus, then we could either append to /etc/watchdog.conf -or- treat those values alone as the config file that is given to watchdog.
The systemd service files would be set up accordingly.

We have seen instances where we get an error indicating no resources are available. Those could be file descriptors / socket descriptors etc. A way to plug this into watchdog as part of a test binary that checks for this ? We could hook a repair binary to take the action.

Another thing that I was looking at hooking into watchdog is a test of the file system usage as defined by a policy.
The policy could list the file system mounts and also the thresholds.

For example, /tmp , /root etc. We could again hook a repair binary to do some cleanup if needed.
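The filesystem-usage check could be wired up as a watchdog(8) test binary that exits non-zero when a mount exceeds its threshold, leaving the cleanup to a repair binary. A rough sketch, with the policy hard-coded here for illustration (a real version would read the mounts and thresholds from the policy file mentioned above):

```python
#!/usr/bin/env python3
"""Hypothetical watchdog(8) test binary: exit non-zero when any monitored
mount exceeds its usage threshold, so watchdog can invoke a repair binary."""
import os
import sys

# (mount point, max used fraction) -- illustrative; would come from policy
POLICY = [("/tmp", 0.90), ("/var", 0.80)]


def used_fraction(mount):
    """Fraction of the filesystem at `mount` that is in use."""
    st = os.statvfs(mount)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return (total - free) / total if total else 0.0


def check(policy):
    """Return the mounts that are over their configured threshold."""
    return [mount for mount, limit in policy if used_fraction(mount) > limit]


if __name__ == "__main__":
    failed = check(POLICY)
    if failed:
        print("over threshold:", ", ".join(failed))
        sys.exit(1)  # non-zero exit -> watchdog triggers the repair binary
    sys.exit(0)
```

watchdog.conf's `test-binary` / `repair-binary` options are the natural hook points for a script like this.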

If we see the list growing with these custom requirements, then it probably does not make sense to pollute watchdog(8), but to
have these consumed in a dedicated app instead ?

!! Vishwa !!
On 4/9/19 9:55 PM, Kun Yi wrote:
Hello there,

This topic has been brought up several times on the mailing list and offline, but it seems we as a community haven't reached a consensus on what things would be the most valuable to monitor, and how to monitor them. While a general-purpose monitoring infrastructure for OpenBMC is a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.

1. Monitoring host IPMI link reliability (host side)

The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to use the IPMI command stats to determine whether it is a hardware issue or an IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.

Looking at the host IPMI side, there are some metrics exposed through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code I guess.
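The si_stats file is a simple "name: value" counter dump, so scraping it into a dict is straightforward. A minimal parsing sketch; the field names in the sample are illustrative and should be checked against the ipmi_si output of the kernel actually in use:

```python
import re

# Illustrative sample of /proc/ipmi/0/si_stats-style output; the exact
# field set varies by kernel version, so do not rely on these names.
SAMPLE = """\
interrupts:            42
attentions:            0
hosed_count:           1
complete_transactions: 1234
"""


def parse_si_stats(text):
    """Parse 'name: value' counter lines into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w-]+):\s+(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats


if __name__ == "__main__":
    # On a real host this would read /proc/ipmi/0/si_stats instead.
    print(parse_si_stats(SAMPLE))
```

Sampling the dict periodically and diffing counters over time would yield the "sent vs. succeeded over time" rates described above.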

Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.

2. Read and expose core BMC performance metrics from procfs

This is straightforward: have a smallish daemon (or bmc-state-manager) read, parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.
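The parsing part of such a daemon is small. A sketch of the load-average and meminfo readers (the property names are made up here; the real ones would follow whatever D-Bus interface gets designed):

```python
def parse_loadavg(text):
    """Parse /proc/loadavg: '0.50 0.40 0.30 1/123 456'."""
    one, five, fifteen = (float(v) for v in text.split()[:3])
    # Hypothetical property names for the eventual D-Bus interface.
    return {"Load1": one, "Load5": five, "Load15": fifteen}


def parse_meminfo(text):
    """Parse /proc/meminfo 'Key:   value kB' lines into {key: kB}."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            out[key] = int(rest.split()[0])
    return out


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()))
```

A daemon would run these on a timer and push the resulting dicts onto D-Bus, from where the IPMI sensor / Redfish layers pick them up.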

A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and procfs output format has no standard, I'm thinking the user would just provide a configuration file containing a list of (procfs path, property regex, D-Bus property name) tuples, and compile-time generated code would provide an object for each property.

All of this is merely thoughts and nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know if it's all implemented already :)

If this seems valuable, after gathering more feedback on feature requirements, I'm going to turn them into design docs and upload them for review.

--
Regards,
Kun


--
Regards,
Kun

[-- Attachment #2: Type: text/html, Size: 26856 bytes --]


end of thread, other threads:[~2019-05-20 21:29 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-09 16:25 BMC health metrics (again!) Kun Yi
2019-04-11 12:56 ` Sivas Srr
2019-04-20  1:04   ` Kun Yi
2019-04-12 13:02 ` Andrew Geissler
2019-04-20  1:08   ` Kun Yi
2019-05-08  8:11 ` vishwa
2019-05-17  6:30   ` Neeraj Ladkani
2019-05-17  7:17     ` vishwa
2019-05-17  7:23       ` Neeraj Ladkani
2019-05-17  7:27         ` vishwa
2019-05-17 15:50           ` Kun Yi
2019-05-17 18:25             ` vishwa
2019-05-20 21:29               ` Neeraj Ladkani
