Neeraj,

Thanks for the inputs. It's nice to see that we are thinking along similar lines.

AFAIK, we don't have any work-group that is driving “Platform telemetry and health monitoring”. Also, do we want to see this as 2 different entities? In the past, there were thoughts about using websockets to channel some of the thermal parameters as telemetry data, but that was never implemented.

I think we can discuss it here.

!! Vishwa !!

On 5/17/19 12:00 PM, Neeraj Ladkani wrote:

At cloud scale, telemetry and health monitoring are very critical. We should define a framework that allows platform owners to add their own telemetry hooks. The telemetry service should be designed to make this data accessible and to store it in a resilient way (like a black box during a plane crash).

Is there any workgroup that drives this feature, “Platform telemetry and health monitoring”?

Wishlist

BMC telemetry :

Linux subsystem

Uptime
CPU Load average
Memory info
Storage usage ( RW )
Dmesg
Syslog
FDs of critical processes
Alignment traps
WDT excursions

IPMI subsystem

Request and Response logging per interface with timestamps (KCS, LAN, USB)
Request and Response of IPMB

    i.   Request, Response, No. of Retries

Misc

Critical Temperature Excursions

    i.   Minimum reading of a sensor
    ii.  Maximum reading of a sensor
    iii. Count of state transitions
    iv.  Retry count

Count of assertions/deassertions of GPIO and ability to capture the state
Timestamp of last assertion/deassertion of GPIO
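
The per-sensor items in the wishlist above (minimum reading, maximum reading, count of state transitions, timestamp of the last transition) could be tracked with a small running-statistics object. This is only a sketch under assumptions: the class name `ExcursionStats`, its field names, and the single `critical` threshold are hypothetical and do not come from any existing OpenBMC code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExcursionStats:
    """Running statistics for one sensor against a critical threshold (hypothetical)."""
    critical: float
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    transitions: int = 0
    last_transition_ts: Optional[float] = None
    _in_excursion: bool = False

    def update(self, reading: float, timestamp: float) -> None:
        # Track the minimum and maximum reading seen so far.
        self.minimum = reading if self.minimum is None else min(self.minimum, reading)
        self.maximum = reading if self.maximum is None else max(self.maximum, reading)
        # Count every crossing of the threshold, in either direction.
        now_critical = reading >= self.critical
        if now_critical != self._in_excursion:
            self.transitions += 1
            self.last_transition_ts = timestamp
            self._in_excursion = now_critical


stats = ExcursionStats(critical=85.0)
for ts, reading in enumerate([70.0, 90.0, 92.0, 80.0]):
    stats.update(reading, float(ts))
```

The same shape would work for GPIO assertion/deassertion counts by feeding it 0/1 readings with a threshold of 1.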

Thanks

~Neeraj

From: openbmc <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> On Behalf Of vishwa
Sent: Wednesday, May 8, 2019 1:11 AM
To: Kun Yi <kunyi@google.com>; OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: BMC health metrics (again!)

Hello Kun,

Thanks for initiating it. I liked the /proc parsing. On the IPMI thing, is it only targeted to IPMI -or- a generic BMC-Host communication link?

Some of the things in my wish-list are:

1/. Flash wear and tear detection and the threshold to be a config option
2/. Any SoC specific health checks ( If that is exposed )
3/. Mechanism to detect spurious interrupts on any HW link
4/. Some kind of check to see if there will be any I2C lock to a given end device
5/. Ability to detect errors on HW links

In the watchdog(8) area, I was thinking of the following:

How about having some kind of BMC_health D-Bus properties -or- a compile-time feed, whose values can be fed into a configuration file, rather than watchdog always using the default /etc/watchdog.conf? If the properties come from D-Bus, then we could either append them to /etc/watchdog.conf -or- treat those values as the only config file given to watchdog.
The systemd service files would be set up accordingly.
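
A minimal sketch of the two options described above (append to an existing /etc/watchdog.conf, or use the generated values as the whole config file). It assumes the properties have already been read into a plain dict; how they are fetched from D-Bus is not shown. The function name `render_watchdog_conf` is hypothetical, and the keys and values in the usage lines (`max-load-1`, `min-memory`) are illustrative examples of the `key = value` settings that watchdog.conf accepts.

```python
def render_watchdog_conf(properties: dict, base: str = "") -> str:
    """Render properties as 'key = value' lines.

    With base="" the result is a standalone config file; with base set to the
    text of an existing config, the generated lines are appended to it.
    """
    generated = "".join(f"{key} = {value}\n" for key, value in sorted(properties.items()))
    if base and not base.endswith("\n"):
        base += "\n"
    return base + generated


# Illustrative values only.
health_properties = {"max-load-1": 24, "min-memory": 1024}
standalone = render_watchdog_conf(health_properties)
appended = render_watchdog_conf(health_properties, base="interval = 10")
```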

We have seen instances where we get an error indicating that no resources are available. Those could be file descriptors, socket descriptors, etc. Is there a way to plug this into watchdog as part of a test binary that checks for this? We could hook a repair binary to take the action.
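
As a rough sketch of such a test binary: count the entries under /proc/<pid>/fd for a critical process and return a non-zero exit code when the count gets close to a limit. This assumes that the watchdog daemon treats a non-zero exit status from a test binary as a failure. The function names, the limit, and the 0.9 fraction are all hypothetical.

```python
import os


def count_open_fds(pid: int) -> int:
    """Number of file descriptors currently open in the given process (Linux procfs)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_check_exit_code(open_fds: int, limit: int, max_fraction: float = 0.9) -> int:
    """Return 0 while usage is below max_fraction of the limit, 1 otherwise."""
    return 1 if open_fds >= limit * max_fraction else 0
```

A test binary built on this would end with something like `sys.exit(fd_check_exit_code(count_open_fds(pid), limit))`, where `pid` and `limit` come from the platform's policy.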

Another thing that I was looking at hooking into watchdog is a test to check the file system usage as defined by a policy.
The policy could mention the file system mounts and also the threshold.

For example, /tmp, /root, etc. We could again hook a repair binary to do some cleanup if needed.
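
A sketch of what such a policy check might look like, assuming the policy is simply a mapping from mount point to a maximum usage percentage. The function names and the threshold values are hypothetical; `shutil.disk_usage` is the standard-library call used to read real usage.

```python
import shutil
from typing import Callable, Dict, List, Tuple


def real_usage(path: str) -> Tuple[int, int]:
    """Return (used, total) in bytes for the file system containing path."""
    usage = shutil.disk_usage(path)
    return usage.used, usage.total


def mounts_over_policy(
    policy: Dict[str, float],
    usage: Callable[[str], Tuple[int, int]] = real_usage,
) -> List[str]:
    """Return the mounts whose used percentage exceeds the policy threshold."""
    offenders = []
    for mount, max_percent in policy.items():
        used, total = usage(mount)
        if total and 100.0 * used / total > max_percent:
            offenders.append(mount)
    return offenders


# Illustrative policy: flag /tmp or /root above 80% usage.
example_policy = {"/tmp": 80.0, "/root": 80.0}
```

A non-empty result would be the signal to fail the watchdog test and let the repair binary clean up the listed mounts.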

If we see the list growing with these custom requirements, then it probably does not make sense to pollute watchdog(8), but to
have these consumed into the app instead?

!! Vishwa !!

On 4/9/19 9:55 PM, Kun Yi wrote:

Hello there,

This topic has been brought up several times on the mailing list and offline, but it seems that we as a community didn't reach a consensus on what things would be the most valuable to monitor, and how to monitor them. While a general-purpose monitoring infrastructure for OpenBMC seems to be a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.

1. Monitoring host IPMI link reliability (host side)

The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to be able to tell from IPMI command stats whether it is a hardware issue or an IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.
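
To make the counters above concrete, here is a small sketch of a stats object holding "sent" and "succeeded" counts plus response time. The class and method names are hypothetical, and how each command's outcome is observed (driver stats, library instrumentation) is left open.

```python
from dataclasses import dataclass


@dataclass
class IpmiLinkStats:
    """Hypothetical per-link counters for host IPMI commands."""
    sent: int = 0
    succeeded: int = 0
    total_response_s: float = 0.0

    def record(self, ok: bool, response_s: float = 0.0) -> None:
        # Every command counts as sent; only successful ones add response time.
        self.sent += 1
        if ok:
            self.succeeded += 1
            self.total_response_s += response_s

    def success_rate(self) -> float:
        return self.succeeded / self.sent if self.sent else 1.0

    def mean_response_s(self) -> float:
        return self.total_response_s / self.succeeded if self.succeeded else 0.0


link = IpmiLinkStats()
link.record(True, 0.25)
link.record(False)
link.record(True, 0.75)
```

Sampling `success_rate()` periodically would give the "counts over time" view; a drop in the rate while the sensor hardware is otherwise healthy would point at the IPMI link.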

Looking at the host IPMI side, there are some metrics exposed through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code I guess.

Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.

2. Read and expose core BMC performance metrics from procfs

This is straightforward: have a smallish daemon (or bmc-state-manager) read, parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.
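
A minimal sketch of the parsing half of such a daemon, for two of the metrics mentioned (load average and memory). The function names and the returned key names are hypothetical, and publishing the values on D-Bus is not shown. The parsers take file contents as text so that they can be exercised without a real procfs.

```python
from typing import Dict


def parse_loadavg(text: str) -> Dict[str, float]:
    """Parse the first three fields of /proc/loadavg (1, 5 and 15 minute averages)."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return {"load1": one, "load5": five, "load15": fifteen}


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo lines of the form 'Key:   value [kB]' into integers."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values
```

On a running system the inputs would come from something like `open("/proc/loadavg").read()` and `open("/proc/meminfo").read()`.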

A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and procfs output format has no standard, I'm thinking the user would just provide a configuration file containing a list of (procfs path, property regex, D-Bus property name), and compile-time generated code would provide an object for each property.
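
To illustrate the (procfs path, property regex, D-Bus property name) idea, here is a sketch that applies such a list at run time. Note that this is a swapped-in simplification: the proposal above talks about compile-time generated code, whereas this example just interprets the rules directly. The `Rule` type, the function names, and the example property name are hypothetical.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    procfs_path: str      # file to read, e.g. /proc/meminfo
    pattern: str          # regex with one capture group for the value
    property_name: str    # name the value would be published under


def read_text(path: str) -> str:
    with open(path) as handle:
        return handle.read()


def extract(rules: List[Rule], read_file: Callable[[str], str] = read_text) -> Dict[str, str]:
    """Apply each rule and return {property_name: captured value} for every match."""
    properties = {}
    for rule in rules:
        match = re.search(rule.pattern, read_file(rule.procfs_path), re.MULTILINE)
        if match:
            properties[rule.property_name] = match.group(1)
    return properties


# Hypothetical configuration entry.
example_rules = [Rule("/proc/meminfo", r"^MemFree:\s+(\d+)", "MemoryFreeKiB")]
```

Passing `read_file` as a parameter keeps the extraction testable with canned file contents; rules that do not match are simply skipped rather than treated as errors.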

All of this is merely thoughts and nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)

If this seems valuable, after gathering more feedback on feature requirements, I'm going to turn them into design docs and upload them for review.

--

Regards,

Kun