From: Kun Yi <kunyi@google.com>
To: vishwa <vishwa@linux.vnet.ibm.com>
Cc: Neeraj Ladkani <neladk@microsoft.com>,
	OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: BMC health metrics (again!)
Date: Fri, 17 May 2019 08:50:47 -0700
Message-ID: <CAGMNF6XyH-VGRh18acGUbJniJ_YLW-3dz6sFJTvKbO7ZraJcZA@mail.gmail.com>
In-Reply-To: <c6ac62c1-48b4-0df8-fbff-8172275ef8b1@linux.vnet.ibm.com>


I'd also like to be in the metric workgroup. Neeraj, I can see that the first
and second points you listed align very well with my goals in the original
proposal.

On Fri, May 17, 2019 at 12:28 AM vishwa <vishwa@linux.vnet.ibm.com> wrote:

> IMO, we could start fresh here. The initial thought was a year+ ago.
>
> !! Vishwa !!
> On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
>
> Sure thing. Is there a design document that exists for this feature?
>
> I can volunteer to drive this work group if we have quorum.
>
> Neeraj
>
>
> ------------------------------
> *From:* vishwa <vishwa@linux.vnet.ibm.com>
> *Sent:* Friday, May 17, 2019 12:17:51 AM
> *To:* Neeraj Ladkani; Kun Yi; OpenBMC Maillist
> *Subject:* Re: BMC health metrics (again!)
>
>
> Neeraj,
>
> Thanks for the inputs. It's nice to see that we're thinking along similar lines.
>
> AFAIK, we don't have any work group that is driving “Platform telemetry
> and health monitoring”. Also, do we want to treat these as two different
> entities? In the past, there were thoughts about using websockets to
> channel some of the thermal parameters as telemetry data, but that was
> never implemented.
>
> We can discuss here I think.
>
> !! Vishwa !!
> On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>
> At cloud scale, telemetry and health monitoring are critical. We should
> define a framework that allows platform owners to add their own telemetry
> hooks. The telemetry service should be designed to make this data
> accessible and to store it in a resilient way (like a black box in a plane
> crash).
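
A minimal sketch of what such a resilient "black box" store could look like, assuming a simple size-capped, append-only JSON-lines file; the path and limits below are made up for illustration, not an existing OpenBMC service:

    import json, os, time

    LOG_PATH = "/var/lib/telemetry/blackbox.jsonl"   # hypothetical location
    MAX_BYTES = 1 << 20                              # rotate at 1 MiB
    KEEP = 2                                         # keep two old generations

    def record(event: dict) -> None:
        """Append one telemetry record; fsync so it survives a crash or reset."""
        os.makedirs(os.path.dirname(LOG_PATH), exist_ok=True)
        if os.path.exists(LOG_PATH) and os.path.getsize(LOG_PATH) > MAX_BYTES:
            # Simple rotation: blackbox.jsonl.2 <- blackbox.jsonl.1 <- blackbox.jsonl
            for i in range(KEEP, 0, -1):
                src = LOG_PATH if i == 1 else f"{LOG_PATH}.{i - 1}"
                if os.path.exists(src):
                    os.replace(src, f"{LOG_PATH}.{i}")
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps({"ts": time.time(), **event}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    # e.g. record({"type": "wdt_excursion", "count": 1})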
>
>
>
> Is there any workgroup that drives this feature, “Platform telemetry and
> health monitoring”?
>
>
>
> Wishlist
>
>
>
> BMC telemetry:
>
>    1. Linux subsystem
>       1. Uptime
>       2. CPU load average
>       3. Memory info
>       4. Storage usage (R/W)
>       5. dmesg
>       6. syslog
>       7. FDs of critical processes
>       8. Alignment traps
>       9. WDT excursions
>    2. IPMI subsystem
>       1. Request and response logging per interface, with timestamps (KCS, LAN, USB)
>       2. Request and response of IPMB: request, response, number of retries (see the sketch after this list)
>    3. Misc
>       1. Critical temperature excursions: minimum reading of a sensor, maximum reading of a sensor, count of state transitions, retry count
>       2. Count of assertions/deassertions of a GPIO and ability to capture the state
>       3. Timestamp of last assertion/deassertion of a GPIO
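
A minimal sketch of the kind of per-interface request/response log meant in item 2, using a bounded in-memory ring buffer; the interface names and record fields here are illustrative, not an existing OpenBMC API:

    import time
    from collections import deque, defaultdict

    class IpmiTrafficLog:
        """Keep the last N request/response pairs per interface (KCS, LAN, IPMB, ...)."""

        def __init__(self, depth: int = 256):
            self._logs = defaultdict(lambda: deque(maxlen=depth))

        def log(self, interface: str, request: bytes, response: bytes, retries: int = 0):
            self._logs[interface].append({
                "ts": time.time(),
                "request": request.hex(),
                "response": response.hex(),
                "retries": retries,
            })

        def dump(self, interface: str):
            return list(self._logs[interface])

    # e.g. log = IpmiTrafficLog(); log.log("ipmb", b"\x18\x01", b"\x1c\x01\x00", retries=1)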
>
>
>
> Thanks
>
> ~Neeraj
>
>
>
> *From:* openbmc <openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org> *On Behalf Of* vishwa
> *Sent:* Wednesday, May 8, 2019 1:11 AM
> *To:* Kun Yi <kunyi@google.com>; OpenBMC Maillist <openbmc@lists.ozlabs.org>
> *Subject:* Re: BMC health metrics (again!)
>
>
>
> Hello Kun,
>
> Thanks for initiating it. I liked the /proc parsing. On the IPMI thing, is
> it targeted only at IPMI, or at a generic BMC-host communication link?
>
> Some of the things in my wish-list are:
>
> 1/. Flash wear-and-tear detection, with the threshold as a config option (see the sketch below)
> 2/. Any SoC-specific health checks (if that is exposed)
> 3/. Mechanism to detect spurious interrupts on any HW link
> 4/. Some kind of check to detect an I2C lock-up to a given end device
> 5/. Ability to detect errors on HW links
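
For item 1/, a minimal sketch of a wear check with a configurable threshold, assuming an eMMC part that exposes the JEDEC life_time estimates in sysfs; the glob path and threshold value are illustrative:

    import glob

    # Each life_time file holds two hex estimates: 0x01 = 0-10% of rated life used,
    # ... 0x0A = 90-100%, 0x0B = exceeded. (Assumes an eMMC 5.0+ device.)
    LIFE_TIME_GLOB = "/sys/bus/mmc/devices/*/life_time"
    THRESHOLD = 0x08   # hypothetical config option: alert at roughly 80% of rated life

    def flash_wear_alerts(threshold: int = THRESHOLD):
        alerts = []
        for path in glob.glob(LIFE_TIME_GLOB):
            with open(path) as f:
                est_a, est_b = (int(v, 16) for v in f.read().split())
            if max(est_a, est_b) >= threshold:
                alerts.append((path, est_a, est_b))
        return alerts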
>
> In the watchdog(8) area, I was thinking of these:
>
> How about having some kind of BMC_health D-Bus properties, or a compile-time
> feed, whose values can be fed into a configuration file, rather than watchdog
> always using the default /etc/watchdog.conf? If the properties come from
> D-Bus, then we could either append them to /etc/watchdog.conf, or treat
> those values alone as the config file given to watchdog.
> The systemd service files would be set up accordingly.
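
A minimal sketch of the "feed D-Bus values into a config file" idea: whatever health properties come off D-Bus (represented here as a plain dict, since that interface doesn't exist yet) get rendered into a watchdog.conf fragment. The output path and property names are illustrative.

    def render_watchdog_conf(props: dict, path: str = "/run/bmc-health-watchdog.conf"):
        """Write watchdog(8) 'key = value' lines from BMC health properties.

        `props` stands in for a hypothetical BMC_health D-Bus interface.
        """
        lines = ["watchdog-device = /dev/watchdog"]
        lines += [f"{key} = {value}" for key, value in sorted(props.items())]
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")

    # e.g. render_watchdog_conf({"max-load-1": 24, "min-memory": 1})

The systemd unit could then point watchdog at this file via -c/--config-file instead of the default /etc/watchdog.conf.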
>
>
> We have seen instances where we get an error indicating that no resources
> are available. Those could be file descriptors, socket descriptors, etc.
> Could we plug this into watchdog as part of a test binary that checks for
> this? We could hook up a repair binary to take the corrective action.
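
A minimal sketch of such a resource check as a watchdog test binary: it exits non-zero when a critical process crosses an FD threshold, so watchdog(8) could invoke it via test-binary and hand the failure to a repair binary. The process names and limit are illustrative.

    #!/usr/bin/env python3
    import os, sys

    CRITICAL = ["ipmid", "bmcweb"]   # illustrative process names
    FD_LIMIT = 900                   # illustrative threshold

    def fd_count(pid: str) -> int:
        return len(os.listdir(f"/proc/{pid}/fd"))

    def main() -> int:
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    name = f.read().strip()
                if name in CRITICAL and fd_count(pid) > FD_LIMIT:
                    print(f"{name} ({pid}) has too many open fds", file=sys.stderr)
                    return 1   # non-zero exit -> watchdog can trigger the repair binary
            except (FileNotFoundError, PermissionError):
                continue       # process exited or not readable; skip it
        return 0

    if __name__ == "__main__":
        sys.exit(main())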
>
>
> Another thing I was looking at hooking into watchdog is a test of file
> system usage as defined by a policy. The policy could specify the file
> system mounts and also the thresholds.
>
> For example, /tmp, /root, etc. We could again hook up a repair binary to do
> some cleanup if needed.
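
A minimal sketch of that policy-driven file system check; the mount points and percentage thresholds are just example policy values:

    import shutil

    POLICY = {"/tmp": 90, "/root": 80}   # mount point -> max percent used (illustrative)

    def filesystems_over_threshold(policy: dict = POLICY):
        over = []
        for mount, limit in policy.items():
            usage = shutil.disk_usage(mount)
            pct = 100 * usage.used / usage.total
            if pct > limit:
                over.append((mount, round(pct, 1)))
        return over   # a non-empty result could trigger the cleanup/repair binary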
>
> If we see this list growing with custom requirements, then it probably does
> not make sense to pollute watchdog(8); should we have these consumed by the
> app instead?
>
> !! Vishwa !!
>
> On 4/9/19 9:55 PM, Kun Yi wrote:
>
> Hello there,
>
>
>
> This topic has been brought up several times on the mailing list and
> offline, but in general it seems we as a community haven't reached a
> consensus on which things would be the most valuable to monitor, and how to
> monitor them. While a general-purpose monitoring infrastructure for OpenBMC
> seems like a hard problem, I have some simple ideas that I hope can provide
> immediate and direct benefits.
>
>
>
> 1. Monitoring host IPMI link reliability (host side)
>
>
>
> The essentials I want are "IPMI commands sent" and "IPMI commands
> succeeded" counts over time. More metrics, like response time, would be
> helpful as well. The issue to address here: when some IPMI sensor readings
> are flaky, it would be really helpful to be able to tell from the IPMI
> command stats whether it is a hardware issue or an IPMI issue. Moreover, it
> would be a very useful regression-test metric for rolling out new BMC
> software.
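
A minimal sketch of the kind of instrumentation meant here, wrapping whatever send routine the host-side library exposes; send_fn is a stand-in, not a real ipmitool API:

    import time

    class IpmiLinkStats:
        """Count commands sent/succeeded and accumulate response time."""

        def __init__(self, send_fn):
            self._send = send_fn   # stand-in for the library's command-send routine
            self.sent = 0
            self.succeeded = 0
            self.total_latency = 0.0

        def send(self, *args, **kwargs):
            self.sent += 1
            start = time.monotonic()
            try:
                resp = self._send(*args, **kwargs)
                self.succeeded += 1
                return resp
            finally:
                self.total_latency += time.monotonic() - start

        @property
        def success_ratio(self):
            return self.succeeded / self.sent if self.sent else None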
>
>
>
> Looking at the host IPMI side, there are some metrics exposed
> through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't
> dug into whether it contains information mapping to the interrupts. Time to
> read the source code, I guess.
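
A minimal sketch of scraping that file, assuming a "name: value" per-line layout; the field names are left as whatever the driver emits, since they vary by kernel version:

    def read_si_stats(path: str = "/proc/ipmi/0/si_stats") -> dict:
        stats = {}
        with open(path) as f:
            for line in f:
                name, _, value = line.rpartition(":")
                if name and value.strip().lstrip("-").isdigit():
                    stats[name.strip()] = int(value)
        return stats

    # Returns something like {'interrupts': 1234, 'hosed count': 0, ...},
    # depending on the driver version.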
>
>
>
> Another idea would be to instrument caller libraries like the interfaces
> in ipmitool, though I feel that approach is harder due to fragmentation of
> IPMI libraries.
>
>
>
> 2. Read and expose core BMC performance metrics from procfs
>
>
>
> This is straightforward: have a smallish daemon (or bmc-state-manager)
> read, parse, and process procfs and put the values on D-Bus. Core metrics
> I'm interested in getting this way: load average, memory, disk
> used/available, net stats... The values can then simply be exported as IPMI
> sensors or Redfish resource properties.
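
A minimal sketch of the collection half of such a daemon (just the procfs parsing; the D-Bus/IPMI/Redfish export is left out):

    def load_average() -> tuple:
        with open("/proc/loadavg") as f:
            one, five, fifteen = f.read().split()[:3]
        return float(one), float(five), float(fifteen)

    def meminfo_kb() -> dict:
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])   # first field; most entries are in kB
        return info

    # e.g. publish load_average()[0] and meminfo_kb()["MemAvailable"] as D-Bus properties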
>
>
>
> A nice byproduct of this effort would be a procfs parsing library. Since
> different platforms will probably have different monitoring requirements,
> and procfs output has no standard format, I'm thinking the user would just
> provide a configuration file containing a list of (procfs path, property
> regex, D-Bus property name) entries, and compile-time generated code would
> provide an object for each property.
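
A minimal sketch of that configuration-driven idea, done at runtime here for brevity whereas the proposal is compile-time generation; the config entries and property names are illustrative:

    import re

    # (procfs path, property regex with one capture group, D-Bus property name)
    CONFIG = [
        ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
        ("/proc/meminfo", r"^MemAvailable:\s+(\d+)", "MemAvailableKiB"),
    ]

    def collect(config=CONFIG) -> dict:
        props = {}
        for path, pattern, prop_name in config:
            with open(path) as f:
                match = re.search(pattern, f.read(), re.MULTILINE)
            if match:
                props[prop_name] = match.group(1)
        return props

    # e.g. {'LoadAverage1Min': '0.42', 'MemAvailableKiB': '123456'}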
>
>
>
> All of this is just thoughts, nothing concrete yet. With that said, it
> would be really great if you could provide some feedback such as "I want
> this, but I really need that feature", or let me know if it's all
> implemented already :)
>
>
>
> If this seems valuable, then after gathering more feedback on feature
> requirements, I'm going to turn these into design docs and upload them for review.
>
>
>
> --
>
> Regards,
>
> Kun
>
>

-- 
Regards,
Kun


Thread overview: 13+ messages
2019-04-09 16:25 BMC health metrics (again!) Kun Yi
2019-04-11 12:56 ` Sivas Srr
2019-04-20  1:04   ` Kun Yi
2019-04-12 13:02 ` Andrew Geissler
2019-04-20  1:08   ` Kun Yi
2019-05-08  8:11 ` vishwa
2019-05-17  6:30   ` Neeraj Ladkani
2019-05-17  7:17     ` vishwa
2019-05-17  7:23       ` Neeraj Ladkani
2019-05-17  7:27         ` vishwa
2019-05-17 15:50           ` Kun Yi [this message]
2019-05-17 18:25             ` vishwa
2019-05-20 21:29               ` Neeraj Ladkani
