I'd also like to be in the metric workgroup. Neeraj, I can see that the first and second points you listed align very well with my goals in the original proposal.

On Fri, May 17, 2019 at 12:28 AM vishwa wrote:

> IMO, we could start fresh here. The initial thought was a year+ ago.
>
> !! Vishwa !!
>
> On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
>
> Sure thing. Is there a design document that exists for this feature?
> I can volunteer to drive this work group if we have quorum.
>
> Neeraj
>
> ------------------------------
> From: vishwa
> Sent: Friday, May 17, 2019 12:17:51 AM
> To: Neeraj Ladkani; Kun Yi; OpenBMC Maillist
> Subject: Re: BMC health metrics (again!)
>
> Neeraj,
>
> Thanks for the inputs. It's nice to see us having similar thoughts.
>
> AFAIK, we don't have any work group that is driving "Platform telemetry
> and health monitoring". Also, do we want to see these as two different
> entities? In the past, there were thoughts about using websockets to
> channel some of the thermal parameters as telemetry data, but that was
> never implemented.
>
> We can discuss here, I think.
>
> !! Vishwa !!
>
> On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>
> At cloud scale, telemetry and health monitoring are very critical. We
> should define a framework that allows platform owners to add their own
> telemetry hooks. The telemetry service should be designed to make this
> data accessible and store it in a resilient way (like a black box in a
> plane crash).
>
> Is there any workgroup that drives this feature, "Platform telemetry and
> health monitoring"?
>
> Wishlist
>
> BMC telemetry:
>
> 1. Linux subsystem
>    1. Uptime
>    2. CPU load average
>    3. Memory info
>    4. Storage usage (R/W)
>    5. dmesg
>    6. syslog
>    7. FDs of critical processes
>    8. Alignment traps
>    9. WDT excursions
> 2. IPMI subsystem
>    1. Request and response logging per interface, with timestamps
>       (KCS, LAN, USB)
>    2. Request and response of IPMB
>       i. Request, response, number of retries
> 3. Misc
>    1. Critical temperature excursions
>       i.   Minimum reading of a sensor
>       ii.  Maximum reading of a sensor
>       iii. Count of state transitions
>       iv.  Retry count
>    2. Count of assertions/deassertions of a GPIO and ability to capture
>       the state
>    3. Timestamp of last assertion/deassertion of a GPIO
>
> Thanks
> ~Neeraj
>
> From: openbmc On Behalf Of vishwa
> Sent: Wednesday, May 8, 2019 1:11 AM
> To: Kun Yi; OpenBMC Maillist
> Subject: Re: BMC health metrics (again!)
>
> Hello Kun,
>
> Thanks for initiating it. I liked the /proc parsing. On the IPMI thing,
> is it targeted only at IPMI, or at the generic BMC-host communication
> links?
>
> Some of the things on my wishlist are:
>
> 1/. Flash wear and tear detection, with the threshold as a config option
> 2/. Any SoC-specific health checks (if they are exposed)
> 3/. Mechanism to detect spurious interrupts on any HW link
> 4/. Some kind of check to see if there will be an I2C lock to a given
>     end device
> 5/. Ability to detect errors on HW links
>
> On the watchdog(8) area, I was just thinking about these:
>
> How about having some kind of BMC_health D-Bus properties -or- a
> compile-time feed, whose values can be fed into a configuration file,
> rather than watchdog always using the default /etc/watchdog.conf. If the
> properties are coming from D-Bus, then we could either append to
> /etc/watchdog.conf -or- treat those values alone as the config file that
> is given to watchdog.
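>
> As a rough sketch of the kind of thing I mean (the directives below are
> the standard watchdog.conf(5) ones; the binary paths and threshold
> values are made-up examples that would really come from the BMC_health
> D-Bus properties or the compile-time feed):
>
>     # Fragment of a generated /etc/watchdog.conf
>     watchdog-device = /dev/watchdog
>     interval        = 10
>
>     # Example thresholds (min-memory is in pages)
>     max-load-1      = 24
>     min-memory      = 1024
>
>     # Custom checks: a test binary that probes fd/socket availability,
>     # and a repair binary that attempts cleanup before a reboot is forced
>     test-binary     = /usr/libexec/bmc-health/check-resources
>     repair-binary   = /usr/libexec/bmc-health/repair-resources
>     test-timeout    = 30
>     repair-timeout  = 60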
> The systemd service files would be set up accordingly.
>
> We have seen instances where we get an error indicating that no
> resources are available. Those could be file descriptors, socket
> descriptors, etc. A way to plug this into watchdog is as part of a test
> binary that checks for this; we could then hook a repair binary to take
> the action.
>
> Another thing that I was looking at hooking into watchdog is a test of
> file system usage, as defined by a policy. The policy could name the
> file system mounts and also the thresholds.
>
> For example, /tmp, /root, etc. We could again hook a repair binary to do
> some cleanup if needed.
>
> If we see the list growing with these custom requirements, then it
> probably does not make sense to pollute watchdog(8); should we have
> these consumed into the app instead?
>
> !! Vishwa !!
>
> On 4/9/19 9:55 PM, Kun Yi wrote:
>
> Hello there,
>
> This topic has been brought up several times on the mailing list and
> offline, but in general it seems we as a community didn't reach a
> consensus on what things would be the most valuable to monitor, and how
> to monitor them. While a general-purpose monitoring infrastructure for
> OpenBMC seems to be a hard problem, I have some simple ideas that I hope
> can provide immediate and direct benefits.
>
> 1. Monitoring host IPMI link reliability (host side)
>
> The essentials I want are "IPMI commands sent" and "IPMI commands
> succeeded" counts over time. More metrics, like response time, would be
> helpful as well. The issue to address here: when some IPMI sensor
> readings are flaky, it would be really helpful to tell from the IPMI
> command stats whether it is a hardware issue or an IPMI issue. Moreover,
> it would be a very useful regression-test metric for rolling out new BMC
> software.
>
> Looking at the host IPMI side, there are some metrics exposed through
> /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't dug
> into whether it contains information mapping to the interrupts. Time to
> read the source code, I guess.
>
> Another idea would be to instrument caller libraries like the interfaces
> in ipmitool, though I feel that approach is harder due to the
> fragmentation of IPMI libraries.
>
> 2. Read and expose core BMC performance metrics from procfs
>
> This is straightforward: have a smallish daemon (or bmc-state-manager)
> read, parse, and process procfs and put values on D-Bus. Core metrics
> I'm interested in getting this way: load average, memory, disk
> used/available, net stats... The values can then simply be exported as
> IPMI sensors or Redfish resource properties.
>
> A nice byproduct of this effort would be a procfs parsing library. Since
> different platforms would probably have different monitoring
> requirements and the procfs output format has no standard, I'm thinking
> the user would just provide a configuration file containing a list of
> (procfs path, property regex, D-Bus property name) tuples, and
> compile-time-generated code would provide an object for each property.
>
> All of this is just thoughts and nothing concrete. With that said, it
> would be really great if you could provide some feedback such as "I want
> this, but I really need that feature", or let me know it's all
> implemented already :)
>
> If this seems valuable, after gathering more feedback on feature
> requirements, I'm going to turn them into design docs and upload them
> for review.
>
> --
> Regards,
> Kun

--
Regards,
Kun
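
P.S. To make the configuration-file idea in point 2 above a bit more
concrete, here is a minimal, untested sketch. The procfs paths, regexes,
and property names are only illustrative; a real daemon would publish the
collected values on D-Bus (or generate code from the config at compile
time) rather than just printing them.

    #!/usr/bin/env python3
    """Parse a few procfs entries according to a (path, regex, name) list."""
    import re

    # Hypothetical configuration:
    # (procfs path, capture regex, D-Bus property name)
    CONFIG = [
        ("/proc/loadavg", r"^(\S+)",                      "LoadAverage1Min"),
        ("/proc/meminfo", r"^MemAvailable:\s+(\d+)\s+kB", "MemAvailableKiB"),
        ("/proc/uptime",  r"^(\S+)",                      "UptimeSeconds"),
    ]

    def collect(config=CONFIG):
        """Return {property name: matched value} for each configured entry."""
        values = {}
        for path, pattern, prop in config:
            try:
                with open(path) as f:
                    text = f.read()
            except OSError:
                continue  # entry not present on this platform
            match = re.search(pattern, text, re.MULTILINE)
            if match:
                values[prop] = match.group(1)
        return values

    if __name__ == "__main__":
        # Placeholder for the D-Bus export step: just print the metrics.
        for name, value in collect().items():
            print(f"{name} = {value}")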