* Re: Metrics vs Logging, Continued
@ 2018-01-03  2:22 Christopher Covington
  2018-01-03  4:08 ` Deepak Kodihalli
  2018-01-04 15:39 ` Michael E Brown
  0 siblings, 2 replies; 10+ messages in thread
From: Christopher Covington @ 2018-01-03  2:22 UTC (permalink / raw)
  To: openbmc, Michael_E_Brown, venture

Hi Michael, Patrick,

I probably should have hopped on this list months ago. Thanks for your patience as I come up to
speed on your code, configure my mail client to suit this list, and so on.

> Prometheus metrics is fundamentally a pull model, not a push model. If you have a pull model,
> it greatly simplifies the dependencies:

>	- Pull metrics internally or externally (daemons listen on 127.0.0.1, optionally reverse proxy
>	  that through your web service).

An option for on-demand metrics (as opposed to periodic, always-on monitoring) is nice. I would
use it, for example, to scrutinize upgrades in progress more closely.

>	- Optionally run the metrics server or not depending on configuration.

I agree it should fail gracefully when there is no server present, and think this generalizes to
other network services, even NTP and DHCP.

>	- Pull model naturally self-limits in performance-limited cases... you don’t have a thundering
>	  herd of daemons trying to push metrics. In case metrics server gets loaded it will naturally
>	  slow down polls to backend daemons.

At large scale you'll either need multiple pollers or load-balancing for the receiving server. I'm
not sure what the best solution is. Is load-balancing perhaps more commonplace?

> But what I think would be pretty nice is if you could point graphana/Prometheus towards every
> BMC on your network to get nice graphs of temp, fan speeds, etc.

For metrics/counters, I've been centrally pulling/polling from a fleet running the following RESTful
API:

https://github.com/facebook/openbmc/tree/helium/common/recipes-rest/rest-api/files

But polling the whole fleet doesn't seem ideal, so I'm wondering about a push model.

Prometheus looks interesting, thanks for the pointer. It does seem to support a push model
https://prometheus.io/docs/instrumenting/pushing/
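
Purely to illustrate what that push looks like from a daemon's point of view,
here is a rough Go sketch -- the Pushgateway address, job name, and metric
below are placeholders I made up:

package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical counter; name and help text are placeholders.
	bootCount := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "bmc_boot_count_total",
		Help: "Number of times this BMC has booted.",
	})
	bootCount.Inc()

	// Push to a (hypothetical) Pushgateway instead of waiting to be scraped.
	err := push.New("http://pushgateway.example.com:9091", "openbmc_bmc").
		Collector(bootCount).
		Grouping("instance", "bmc-0042").
		Push()
	if err != nil {
		log.Printf("push failed: %v", err)
	}
}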

Do Go language applications run reasonably well on ASPEED AST2400 SoCs?

I've heard that OpenWRT uses collectd: https://wiki.openwrt.org/doc/howto/statistic.collectd

Thanks,
Christopher Covington


* Re: Metrics vs Logging, Continued
  2018-01-03  2:22 Metrics vs Logging, Continued Christopher Covington
@ 2018-01-03  4:08 ` Deepak Kodihalli
  2018-01-04 15:39 ` Michael E Brown
  1 sibling, 0 replies; 10+ messages in thread
From: Deepak Kodihalli @ 2018-01-03  4:08 UTC (permalink / raw)
  To: openbmc

On 03/01/18 7:52 am, Christopher Covington wrote:

> https://github.com/facebook/openbmc/tree/helium/common/recipes-rest/rest-api/files
> 
> But polling the whole fleet doesn't seem ideal, so I'm wondering about a push model.
> 
> Prometheus looks interesting, thanks for the pointer. It does seem to support a push model
> https://prometheus.io/docs/instrumenting/pushing/

FWIW, the phosphor rest-server
(https://github.com/openbmc/phosphor-rest-server) can push events
occurring in the D-Bus namespace to subscribed clients via WebSockets.
The client subscription protocol is documented here:
https://github.com/openbmc/docs/blob/master/rest-api.md (the last section).
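
As a rough sketch of a client, something like the Go program below would
subscribe and print events. Treat the endpoint, the missing authentication,
and the exact payload keys as my assumptions here -- the document linked
above is the authoritative reference:

package main

import (
	"crypto/tls"
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// BMCs typically use self-signed certificates; a real client would also
	// log in first and pass its session cookie in the request header.
	dialer := websocket.Dialer{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}
	conn, _, err := dialer.Dial("wss://bmc.example.com/subscribe", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Subscription filter; key names assumed from the linked documentation.
	sub := map[string][]string{
		"paths":      {"/xyz/openbmc_project/sensors"},
		"interfaces": {"xyz.openbmc_project.Sensor.Value"},
	}
	if err := conn.WriteJSON(sub); err != nil {
		log.Fatal(err)
	}

	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("event: %s", msg)
	}
}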

Regards,
Deepak


* Re: Metrics vs Logging, Continued
  2018-01-03  2:22 Metrics vs Logging, Continued Christopher Covington
  2018-01-03  4:08 ` Deepak Kodihalli
@ 2018-01-04 15:39 ` Michael E Brown
  1 sibling, 0 replies; 10+ messages in thread
From: Michael E Brown @ 2018-01-04 15:39 UTC (permalink / raw)
  To: Christopher Covington; +Cc: openbmc, venture

On Wed, Jan 03, 2018 at 02:22:39AM +0000, Christopher Covington wrote:
> Hi Michael, Patrick,
> 
> I probably should have hopped on this list months ago. Thanks for your patience as I come up to
> speed on your code, configure my mail client to suit this list, and so on.
> 
> > Prometheus metrics is fundamentally a pull model, not a push model. If you have a pull model,
> > it greatly simplifies the dependencies:
> 
> >	- Pull metrics internally or externally (daemons listen on 127.0.0.1, optionally reverse proxy
> >	  that through your web service).
> 
> An option for on-demand metrics (as opposed to periodic, always-on monitoring) is nice. I would
> use it to more highly scrutinize upgrades in progress for example.

This is a nice point. You can easily turn pulling metrics from your systems
on or off by simply turning the server on or off.  It is harder for push
metrics: you have to *configure* each endpoint with where to push to (which
may or may not change from time to time), and you have to turn push on/off on
each endpoint.

> 
> >	- Optionally run the metrics server or not depending on configuration.
> 
> I agree it should fail gracefully when there is no server present, and think this generalizes to
> other network services, even NTP and DHCP.
> 
> >	- Pull model naturally self-limits in performance-limited cases... you don’t have a thundering
> >	  herd of daemons trying to push metrics. In case metrics server gets loaded it will naturally
> >	  slow down polls to backend daemons.
> 
> At large scale you'll either need multiple pollers or load-balancing for the receiving server. I'm
> not sure what the best solution is. Is load-balancing perhaps more commonplace?

Load balancing is "more commonplace" for things like generic web servers.
Setting up load balancing is distinctly non-trivial, as the specifics of how
you are using it matter a great deal.

Overall this is a matter that reasonable people can disagree on. I favor the
pull approach. It degrades much more predictably (the server pulls more slowly
but still hits all the systems), and you can easily scale up. Load balancing
for this type of thing seems far more difficult to set up and get working well
(how you persistently balance clients between servers, for instance). However,
I think a push model is also pretty easy to argue for successfully (though I
personally won't).

Also, I think the conversation here started more as a discussion of the best
way to collect metrics for individual daemons, rather than specifically of how
to get metrics for OpenBMC overall. I think something like a protocol spec for
how individual metrics collection is done would be pretty useful; it could then
be implemented differently across daemons while keeping the same protocol. And
then we can talk about how to extend that off the box.

Another part of the conversation we probably need to have is "API stability":
are we going to require specific metrics to be stable over time, or can we
live with ad-hoc metrics that may be added or dropped over time?

> 
> > But what I think would be pretty nice is if you could point graphana/Prometheus towards every
> > BMC on your network to get nice graphs of temp, fan speeds, etc.
> 
> For metrics/counters, I've been centrally pulling/polling from a fleet running the following RESTful
> API:
> 
> https://github.com/facebook/openbmc/tree/helium/common/recipes-rest/rest-api/files
> 
> But polling the whole fleet doesn't seem ideal, so I'm wondering about a push model.
> 
> Prometheus looks interesting, thanks for the pointer. It does seem to support a push model
> https://prometheus.io/docs/instrumenting/pushing/

Prometheus is really just a specification for an HTTP endpoint, so it would
(in theory) be relatively easy to write a pull-to-push gateway in any language.
The push model mentioned here is just a local agent that periodically polls the
local pull endpoint and pushes the result somewhere.
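
Something like the following is all such an agent really is. This is only a
sketch -- the URLs, scrape interval, and content type are placeholders, and
there is no auth or retry logic:

package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	const local = "http://127.0.0.1:9100/metrics"        // daemon's pull endpoint (placeholder)
	const remote = "http://collector.example.com/ingest" // wherever we push to (placeholder)

	for range time.Tick(30 * time.Second) {
		resp, err := http.Get(local)
		if err != nil {
			log.Printf("scrape failed: %v", err)
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			log.Printf("read failed: %v", err)
			continue
		}
		// Forward the scraped exposition text as-is.
		pushResp, err := http.Post(remote, "text/plain", bytes.NewReader(body))
		if err != nil {
			log.Printf("push failed: %v", err)
			continue
		}
		pushResp.Body.Close()
	}
}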

> 
> Do Go language applications run reasonably well on ASpeed 2400 SoCs?

I've done several prototypes of Go servers on the Nuvoton ARM chip (slightly
faster than the ASPEED) and the results for me were more than acceptable. I
was very pleased with the ease of development, memory usage, and most other
metrics. Go has the development speed of Python combined with the runtime
speed of Java (and sometimes approaching C).
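
For a sense of scale, a daemon's whole pull endpoint can be about this small.
This is a stdlib-only sketch that hand-writes the exposition text (a real
daemon would likely use a client library), and the metric name is just a
placeholder:

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// Bumped by the daemon's real work; the name is a placeholder.
var i2cFailures atomic.Uint64

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "# TYPE i2c_ioctl_failures_total counter\n")
		fmt.Fprintf(w, "i2c_ioctl_failures_total %d\n", i2cFailures.Load())
	})
	// Loopback only; reverse-proxy through the existing web server if
	// external access is wanted.
	log.Fatal(http.ListenAndServe("127.0.0.1:9100", nil))
}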

> I've heard that OpenWRT uses collectd: https://wiki.openwrt.org/doc/howto/statistic.collectd

A quick look at this project shows that it does two things: a) it defines a
plugin format for collecting various stats, plus some pre-written plugins for
"popular" things, and b) it writes output in RRD format for consumption by
other tools. This is conceptually very similar to Prometheus (same concepts:
gauges, histograms, counters, etc.). However, collectd appears to specify a
file format for output but not an access format. The Prometheus part I'm
focusing on is how we can standardize both the access and file formats.

--
Michael Brown


* RE: Metrics vs Logging, Continued
  2017-12-20 20:10         ` Patrick Venture
@ 2017-12-22 18:02           ` Michael.E.Brown
  0 siblings, 0 replies; 10+ messages in thread
From: Michael.E.Brown @ 2017-12-22 18:02 UTC (permalink / raw)
  To: venture; +Cc: openbmc, bradleyb

Prometheus metrics are fundamentally a pull model, not a push model. A pull model greatly simplifies the dependencies:
	- Make metrics compile-time selectable on the daemon side; compile them out on builds where you don’t need them.
	- Startup dependencies are easier: you don’t have to worry about error cases where the metrics daemon doesn't start.
	- Pull metrics internally or externally (daemons listen on 127.0.0.1; optionally reverse proxy that through your web service).
	- Different metrics servers can poll different endpoints at different rates. A push model is more one-size-fits-all, and it also means that each daemon needs to know the destination for each push.
	- Optionally run the metrics server or not depending on configuration.
	- A pull model naturally self-limits in performance-limited cases... you don’t have a thundering herd of daemons trying to push metrics. If the metrics server gets loaded, it will naturally slow down its polls of the backend daemons.
	- We can write a metrics server that polls daemons and presents D-Bus (or other) endpoints exposing the metrics we think should be part of our API.

All that being said, I would be open to designing a D-Bus API that is similar to the Prometheus API. Boiling metrics down to a few types is a useful abstraction, and I think Prometheus has this basically right: counters, gauges, histograms, etc.
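
As a strawman for what "boiling down the types" could look like in code --
the names below are made up for illustration, not a proposal for the actual
D-Bus interface:

package metrics

// Counter only ever goes up (e.g. boot count, i2c failures).
type Counter interface {
	Inc()
	Add(delta uint64)
}

// Gauge can go up and down (e.g. a temperature or fan speed).
type Gauge interface {
	Set(value float64)
}

// Histogram buckets observations (e.g. IPMI request latency).
type Histogram interface {
	Observe(value float64)
}

// Registry is what a daemon would expose -- over HTTP, D-Bus, or both.
type Registry interface {
	NewCounter(name string) Counter
	NewGauge(name string) Gauge
	NewHistogram(name string, buckets []float64) Histogram
}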

But what I think would be pretty nice is if you could point Grafana/Prometheus at every BMC on your network to get nice graphs of temperatures, fan speeds, etc.
--
Michael

-----Original Message-----
From: Patrick Venture [mailto:venture@google.com] 
Sent: Wednesday, December 20, 2017 2:11 PM
To: Brown, Michael E <Michael_E_Brown@Dell.com>
Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop <bradleyb@fuzziesquirrel.com>
Subject: Re: Metrics vs Logging, Continued

On Wed, Dec 20, 2017 at 12:01 PM, Patrick Venture <venture@google.com> wrote:
> On Wed, Dec 20, 2017 at 11:07 AM,  <Michael.E.Brown@dell.com> wrote:
>> So what do you mean by "something similar on the inside"? Do you have references?
>
> I was just indicating that this use of http to export metrics is 
> something with which I'm familiar.
>
>>
>> And what do you mean by "not well-suited for an embedded platform"? What metrics are you using to base this opinion on?
>> --
>
> I'm going from design metrics.  Every daemon now will need a thread to 
> provide the information to anyone who asks.  So that's X daemons 
> needing new threads to handle the requests.  That's my understanding 
> of how this works when you add the library to your daemon.

I'm reading through the c++ library for this to see whether it pushes data to some central metric server, which would be preferable design-wise.

>
>
>> Michael
>>
>> -----Original Message-----
>> From: Patrick Venture [mailto:venture@google.com]
>> Sent: Monday, December 18, 2017 12:17 PM
>> To: Brown, Michael E <Michael_E_Brown@Dell.com>
>> Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop 
>> <bradleyb@fuzziesquirrel.com>
>> Subject: Re: Metrics vs Logging, Continued
>>
>> So, we use something similar on the inside, but it's not well-suited for an embedded platform.
>>
>> On Fri, Dec 15, 2017 at 11:06 AM,  <Michael.E.Brown@dell.com> wrote:
>>> Things like Prometheus and graphana already exist and have fairly standardized metrics gathering interfaces. Why re-invent the wheel here? I would advocate for Prometheus for metrics gathering.
>>>
>>> Quick primer on Prometheus: each process exposes an HTTP endpoint that reports metrics in a standardized JSON format. The process can expose counters, gauges, histograms or summaries. The server side is responsible for polling the various clients at whatever interval is desired and can format results graphically and over a bunch of clients. Each process can instrument itself, there are c++ client libraries to do this already. Then we can aggregate them out with a route from the main HTTP server.
>>>
>>> All of the things in your metrics list fall neatly into types of data that Prometheus format handles well.
>>> --
>>> Michael
>>>
>>> -----Original Message-----
>>> From: openbmc
>>> [mailto:openbmc-bounces+michael.e.brown=dell.com@lists.ozlabs.org] 
>>> On Behalf Of Patrick Venture
>>> Sent: Tuesday, December 5, 2017 10:51 AM
>>> To: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop 
>>> <bradleyb@fuzziesquirrel.com>
>>> Subject: Metrics vs Logging, Continued
>>>
>>> Logging being something separate from metrics -- I've been toying around with different approaches to allowing userspace metrics collection and distribution.  There are likely better ways, and I think I saw a message on chat about a metrics library that could be used. -- but I've mostly been following email.
>>>
>>> I was thinking this morning of a couple methods, some y'all might like (one where the daemon owns it, one where the metric owner owns it):
>>>
>>> 1) Each daemon can be responsible for exporting onto dbus some objects with a well-defined path that are of a metric type that has a value and the daemon that owns it is therefore responsible for maintaining it.  to collect the metrics, one must grab the subtree for the starting point and trace out all the different metrics and get the values from their owners. and reports that up somehow. -- the somehow could be several IPMI packets.  or several IPMI packets containing a protobuf (similarly to the flash access approach proposed by Brendan).
>>> The upside to the free-form text and paths is you could parse it out to figure out what was each thing.
>>>
>>> 2) Each daemon that wants to track a metric creates a metric object in another daemon (via dbus calls) and then periodically updates that value.  then the information can be reported in the way described above similarly, except the owner of the dbus objects would be the one daemon and one bus, etc.  This implementation requires a lot more dbus traffic to maintain the values.  However, in situations where one doesn't want to manage their own dbus object for this, they can just make one dbus call to update their value based on whatever mechanism they use for timing this and they can store the metrics internally in their daemon however they please.  Another upside to this is that it'd be straightforward to add to the current set of daemons without needing to restructure anything.  Also, depending on the metric itself, it may not be something updated all that frequently.  For many, I foresee updating on non-critical failures, or interesting failures -- for instance, how often the ipmi daemon's reply is rejected by the btbridge daemon.
>>>
>>> Approach #2 could be rolled into a couple library calls as well, very easily such that they don't even know the internals of the tracking...
>>> I like and don't like the free-form text naming of the metrics, because obviously they can be human-readable.  Another approach might be to assign them human readable names and IDs, similarly to sensors so that you can read back the name for a metric once, and then in the future cache it, making subsequent requests smaller.
>>>
>>> Obvious downside to both implementations (although #2 has an easy mitigation), if the daemon with the internal state crashes the metrics are lost, when it comes back up all the metrics are 0.  If the metrics are owned by another daemon, then the library calls to set up the metrics tracking could check if the metric already exists, and use that value to start with -- then you only have to care about that one daemon crashing.  It could periodically write the values down and then read them on start-up to persist these values.  However, you might want the values to not persist... I imagine I wouldn't, however, something like boot count would...
>>>
>>> There are specific things that the host wants to know, that really fall into metrics over logging:
>>> 1) BMCs boot count
>>> 2) i2c ioctl failure count (which bus/device/reg: count)
>>> 3) Specific sensor requests (reading, writing)
>>> 4) Fan control failsafe mode count, how often it's falling into 
>>> failsafe mode
>>> 5) How often the ipmi daemon's reply to the btbridge daemon fails.
>>>
>>> Given some feedback on this, I'll write up a design and the use-cases it's trying to address.
>>>
>>> Thanks,
>>> Patrick


* Re: Metrics vs Logging, Continued
  2017-12-20 20:01       ` Patrick Venture
@ 2017-12-20 20:10         ` Patrick Venture
  2017-12-22 18:02           ` Michael.E.Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Patrick Venture @ 2017-12-20 20:10 UTC (permalink / raw)
  To: Michael.E.Brown; +Cc: OpenBMC Maillist, Brad Bishop

On Wed, Dec 20, 2017 at 12:01 PM, Patrick Venture <venture@google.com> wrote:
> On Wed, Dec 20, 2017 at 11:07 AM,  <Michael.E.Brown@dell.com> wrote:
>> So what do you mean by "something similar on the inside"? Do you have references?
>
> I was just indicating that this use of http to export metrics is
> something with which I'm familiar.
>
>>
>> And what do you mean by "not well-suited for an embedded platform"? What metrics are you using to base this opinion on?
>> --
>
> I'm going from design metrics.  Every daemon now will need a thread to
> provide the information to anyone who asks.  So that's X daemons
> needing new threads to handle the requests.  That's my understanding
> of how this works when you add the library to your daemon.

I'm reading through the c++ library for this to see whether it pushes
data to some central metric server, which would be preferable
design-wise.

>
>
>> Michael
>>
>> -----Original Message-----
>> From: Patrick Venture [mailto:venture@google.com]
>> Sent: Monday, December 18, 2017 12:17 PM
>> To: Brown, Michael E <Michael_E_Brown@Dell.com>
>> Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop <bradleyb@fuzziesquirrel.com>
>> Subject: Re: Metrics vs Logging, Continued
>>
>> So, we use something similar on the inside, but it's not well-suited for an embedded platform.
>>
>> On Fri, Dec 15, 2017 at 11:06 AM,  <Michael.E.Brown@dell.com> wrote:
>>> Things like Prometheus and graphana already exist and have fairly standardized metrics gathering interfaces. Why re-invent the wheel here? I would advocate for Prometheus for metrics gathering.
>>>
>>> Quick primer on Prometheus: each process exposes an HTTP endpoint that reports metrics in a standardized JSON format. The process can expose counters, gauges, histograms or summaries. The server side is responsible for polling the various clients at whatever interval is desired and can format results graphically and over a bunch of clients. Each process can instrument itself, there are c++ client libraries to do this already. Then we can aggregate them out with a route from the main HTTP server.
>>>
>>> All of the things in your metrics list fall neatly into types of data that Prometheus format handles well.
>>> --
>>> Michael
>>>
>>> -----Original Message-----
>>> From: openbmc
>>> [mailto:openbmc-bounces+michael.e.brown=dell.com@lists.ozlabs.org] On
>>> Behalf Of Patrick Venture
>>> Sent: Tuesday, December 5, 2017 10:51 AM
>>> To: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop
>>> <bradleyb@fuzziesquirrel.com>
>>> Subject: Metrics vs Logging, Continued
>>>
>>> Logging being something separate from metrics -- I've been toying around with different approaches to allowing userspace metrics collection and distribution.  There are likely better ways, and I think I saw a message on chat about a metrics library that could be used. -- but I've mostly been following email.
>>>
>>> I was thinking this morning of a couple methods, some y'all might like (one where the daemon owns it, one where the metric owner owns it):
>>>
>>> 1) Each daemon can be responsible for exporting onto dbus some objects with a well-defined path that are of a metric type that has a value and the daemon that owns it is therefore responsible for maintaining it.  to collect the metrics, one must grab the subtree for the starting point and trace out all the different metrics and get the values from their owners. and reports that up somehow. -- the somehow could be several IPMI packets.  or several IPMI packets containing a protobuf (similarly to the flash access approach proposed by Brendan).
>>> The upside to the free-form text and paths is you could parse it out to figure out what was each thing.
>>>
>>> 2) Each daemon that wants to track a metric creates a metric object in another daemon (via dbus calls) and then periodically updates that value.  then the information can be reported in the way described above similarly, except the owner of the dbus objects would be the one daemon and one bus, etc.  This implementation requires a lot more dbus traffic to maintain the values.  However, in situations where one doesn't want to manage their own dbus object for this, they can just make one dbus call to update their value based on whatever mechanism they use for timing this and they can store the metrics internally in their daemon however they please.  Another upside to this is that it'd be straightforward to add to the current set of daemons without needing to restructure anything.  Also, depending on the metric itself, it may not be something updated all that frequently.  For many, I foresee updating on non-critical failures, or interesting failures -- for instance, how often the ipmi daemon's reply is rejected by the btbridge daemon.
>>>
>>> Approach #2 could be rolled into a couple library calls as well, very easily such that they don't even know the internals of the tracking...
>>> I like and don't like the free-form text naming of the metrics, because obviously they can be human-readable.  Another approach might be to assign them human readable names and IDs, similarly to sensors so that you can read back the name for a metric once, and then in the future cache it, making subsequent requests smaller.
>>>
>>> Obvious downside to both implementations (although #2 has an easy mitigation), if the daemon with the internal state crashes the metrics are lost, when it comes back up all the metrics are 0.  If the metrics are owned by another daemon, then the library calls to set up the metrics tracking could check if the metric already exists, and use that value to start with -- then you only have to care about that one daemon crashing.  It could periodically write the values down and then read them on start-up to persist these values.  However, you might want the values to not persist... I imagine I wouldn't, however, something like boot count would...
>>>
>>> There are specific things that the host wants to know, that really fall into metrics over logging:
>>> 1) BMCs boot count
>>> 2) i2c ioctl failure count (which bus/device/reg: count)
>>> 3) Specific sensor requests (reading, writing)
>>> 4) Fan control failsafe mode count, how often it's falling into
>>> failsafe mode
>>> 5) How often the ipmi daemon's reply to the btbridge daemon fails.
>>>
>>> Given some feedback on this, I'll write up a design and the use-cases it's trying to address.
>>>
>>> Thanks,
>>> Patrick


* Re: Metrics vs Logging, Continued
  2017-12-20 19:07     ` Michael.E.Brown
@ 2017-12-20 20:01       ` Patrick Venture
  2017-12-20 20:10         ` Patrick Venture
  0 siblings, 1 reply; 10+ messages in thread
From: Patrick Venture @ 2017-12-20 20:01 UTC (permalink / raw)
  To: Michael.E.Brown; +Cc: OpenBMC Maillist, Brad Bishop

On Wed, Dec 20, 2017 at 11:07 AM,  <Michael.E.Brown@dell.com> wrote:
> So what do you mean by "something similar on the inside"? Do you have references?

I was just indicating that this use of http to export metrics is
something with which I'm familiar.

>
> And what do you mean by "not well-suited for an embedded platform"? What metrics are you using to base this opinion on?
> --

I'm going from design metrics.  Every daemon now will need a thread to
provide the information to anyone who asks.  So that's X daemons
needing new threads to handle the requests.  That's my understanding
of how this works when you add the library to your daemon.


> Michael
>
> -----Original Message-----
> From: Patrick Venture [mailto:venture@google.com]
> Sent: Monday, December 18, 2017 12:17 PM
> To: Brown, Michael E <Michael_E_Brown@Dell.com>
> Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop <bradleyb@fuzziesquirrel.com>
> Subject: Re: Metrics vs Logging, Continued
>
> So, we use something similar on the inside, but it's not well-suited for an embedded platform.
>
> On Fri, Dec 15, 2017 at 11:06 AM,  <Michael.E.Brown@dell.com> wrote:
>> Things like Prometheus and graphana already exist and have fairly standardized metrics gathering interfaces. Why re-invent the wheel here? I would advocate for Prometheus for metrics gathering.
>>
>> Quick primer on Prometheus: each process exposes an HTTP endpoint that reports metrics in a standardized JSON format. The process can expose counters, gauges, histograms or summaries. The server side is responsible for polling the various clients at whatever interval is desired and can format results graphically and over a bunch of clients. Each process can instrument itself, there are c++ client libraries to do this already. Then we can aggregate them out with a route from the main HTTP server.
>>
>> All of the things in your metrics list fall neatly into types of data that Prometheus format handles well.
>> --
>> Michael
>>
>> -----Original Message-----
>> From: openbmc
>> [mailto:openbmc-bounces+michael.e.brown=dell.com@lists.ozlabs.org] On
>> Behalf Of Patrick Venture
>> Sent: Tuesday, December 5, 2017 10:51 AM
>> To: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop
>> <bradleyb@fuzziesquirrel.com>
>> Subject: Metrics vs Logging, Continued
>>
>> Logging being something separate from metrics -- I've been toying around with different approaches to allowing userspace metrics collection and distribution.  There are likely better ways, and I think I saw a message on chat about a metrics library that could be used. -- but I've mostly been following email.
>>
>> I was thinking this morning of a couple methods, some y'all might like (one where the daemon owns it, one where the metric owner owns it):
>>
>> 1) Each daemon can be responsible for exporting onto dbus some objects with a well-defined path that are of a metric type that has a value and the daemon that owns it is therefore responsible for maintaining it.  to collect the metrics, one must grab the subtree for the starting point and trace out all the different metrics and get the values from their owners. and reports that up somehow. -- the somehow could be several IPMI packets.  or several IPMI packets containing a protobuf (similarly to the flash access approach proposed by Brendan).
>> The upside to the free-form text and paths is you could parse it out to figure out what was each thing.
>>
>> 2) Each daemon that wants to track a metric creates a metric object in another daemon (via dbus calls) and then periodically updates that value.  then the information can be reported in the way described above similarly, except the owner of the dbus objects would be the one daemon and one bus, etc.  This implementation requires a lot more dbus traffic to maintain the values.  However, in situations where one doesn't want to manage their own dbus object for this, they can just make one dbus call to update their value based on whatever mechanism they use for timing this and they can store the metrics internally in their daemon however they please.  Another upside to this is that it'd be straightforward to add to the current set of daemons without needing to restructure anything.  Also, depending on the metric itself, it may not be something updated all that frequently.  For many, I foresee updating on non-critical failures, or interesting failures -- for instance, how often the ipmi daemon's reply is rejected by the btbridge daemon.
>>
>> Approach #2 could be rolled into a couple library calls as well, very easily such that they don't even know the internals of the tracking...
>> I like and don't like the free-form text naming of the metrics, because obviously they can be human-readable.  Another approach might be to assign them human readable names and IDs, similarly to sensors so that you can read back the name for a metric once, and then in the future cache it, making subsequent requests smaller.
>>
>> Obvious downside to both implementations (although #2 has an easy mitigation), if the daemon with the internal state crashes the metrics are lost, when it comes back up all the metrics are 0.  If the metrics are owned by another daemon, then the library calls to set up the metrics tracking could check if the metric already exists, and use that value to start with -- then you only have to care about that one daemon crashing.  It could periodically write the values down and then read them on start-up to persist these values.  However, you might want the values to not persist... I imagine I wouldn't, however, something like boot count would...
>>
>> There are specific things that the host wants to know, that really fall into metrics over logging:
>> 1) BMCs boot count
>> 2) i2c ioctl failure count (which bus/device/reg: count)
>> 3) Specific sensor requests (reading, writing)
>> 4) Fan control failsafe mode count, how often it's falling into
>> failsafe mode
>> 5) How often the ipmi daemon's reply to the btbridge daemon fails.
>>
>> Given some feedback on this, I'll write up a design and the use-cases it's trying to address.
>>
>> Thanks,
>> Patrick


* RE: Metrics vs Logging, Continued
  2017-12-18 18:17   ` Patrick Venture
@ 2017-12-20 19:07     ` Michael.E.Brown
  2017-12-20 20:01       ` Patrick Venture
  0 siblings, 1 reply; 10+ messages in thread
From: Michael.E.Brown @ 2017-12-20 19:07 UTC (permalink / raw)
  To: venture; +Cc: openbmc, bradleyb

So what do you mean by "something similar on the inside"? Do you have references?

And what do you mean by "not well-suited for an embedded platform"? What metrics are you basing this opinion on?
--
Michael

-----Original Message-----
From: Patrick Venture [mailto:venture@google.com] 
Sent: Monday, December 18, 2017 12:17 PM
To: Brown, Michael E <Michael_E_Brown@Dell.com>
Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop <bradleyb@fuzziesquirrel.com>
Subject: Re: Metrics vs Logging, Continued

So, we use something similar on the inside, but it's not well-suited for an embedded platform.

On Fri, Dec 15, 2017 at 11:06 AM,  <Michael.E.Brown@dell.com> wrote:
> Things like Prometheus and graphana already exist and have fairly standardized metrics gathering interfaces. Why re-invent the wheel here? I would advocate for Prometheus for metrics gathering.
>
> Quick primer on Prometheus: each process exposes an HTTP endpoint that reports metrics in a standardized JSON format. The process can expose counters, gauges, histograms or summaries. The server side is responsible for polling the various clients at whatever interval is desired and can format results graphically and over a bunch of clients. Each process can instrument itself, there are c++ client libraries to do this already. Then we can aggregate them out with a route from the main HTTP server.
>
> All of the things in your metrics list fall neatly into types of data that Prometheus format handles well.
> --
> Michael
>
> -----Original Message-----
> From: openbmc 
> [mailto:openbmc-bounces+michael.e.brown=dell.com@lists.ozlabs.org] On 
> Behalf Of Patrick Venture
> Sent: Tuesday, December 5, 2017 10:51 AM
> To: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop 
> <bradleyb@fuzziesquirrel.com>
> Subject: Metrics vs Logging, Continued
>
> Logging being something separate from metrics -- I've been toying around with different approaches to allowing userspace metrics collection and distribution.  There are likely better ways, and I think I saw a message on chat about a metrics library that could be used. -- but I've mostly been following email.
>
> I was thinking this morning of a couple methods, some y'all might like (one where the daemon owns it, one where the metric owner owns it):
>
> 1) Each daemon can be responsible for exporting onto dbus some objects with a well-defined path that are of a metric type that has a value and the daemon that owns it is therefore responsible for maintaining it.  to collect the metrics, one must grab the subtree for the starting point and trace out all the different metrics and get the values from their owners. and reports that up somehow. -- the somehow could be several IPMI packets.  or several IPMI packets containing a protobuf (similarly to the flash access approach proposed by Brendan).
> The upside to the free-form text and paths is you could parse it out to figure out what was each thing.
>
> 2) Each daemon that wants to track a metric creates a metric object in another daemon (via dbus calls) and then periodically updates that value.  then the information can be reported in the way described above similarly, except the owner of the dbus objects would be the one daemon and one bus, etc.  This implementation requires a lot more dbus traffic to maintain the values.  However, in situations where one doesn't want to manage their own dbus object for this, they can just make one dbus call to update their value based on whatever mechanism they use for timing this and they can store the metrics internally in their daemon however they please.  Another upside to this is that it'd be straightforward to add to the current set of daemons without needing to restructure anything.  Also, depending on the metric itself, it may not be something updated all that frequently.  For many, I foresee updating on non-critical failures, or interesting failures -- for instance, how often the ipmi daemon's reply is rejected by the btbridge daemon.
>
> Approach #2 could be rolled into a couple library calls as well, very easily such that they don't even know the internals of the tracking...
> I like and don't like the free-form text naming of the metrics, because obviously they can be human-readable.  Another approach might be to assign them human readable names and IDs, similarly to sensors so that you can read back the name for a metric once, and then in the future cache it, making subsequent requests smaller.
>
> Obvious downside to both implementations (although #2 has an easy mitigation), if the daemon with the internal state crashes the metrics are lost, when it comes back up all the metrics are 0.  If the metrics are owned by another daemon, then the library calls to set up the metrics tracking could check if the metric already exists, and use that value to start with -- then you only have to care about that one daemon crashing.  It could periodically write the values down and then read them on start-up to persist these values.  However, you might want the values to not persist... I imagine I wouldn't, however, something like boot count would...
>
> There are specific things that the host wants to know, that really fall into metrics over logging:
> 1) BMCs boot count
> 2) i2c ioctl failure count (which bus/device/reg: count)
> 3) Specific sensor requests (reading, writing)
> 4) Fan control failsafe mode count, how often it's falling into 
> failsafe mode
> 5) How often the ipmi daemon's reply to the btbridge daemon fails.
>
> Given some feedback on this, I'll write up a design and the use-cases it's trying to address.
>
> Thanks,
> Patrick


* Re: Metrics vs Logging, Continued
  2017-12-15 19:06 ` Michael.E.Brown
@ 2017-12-18 18:17   ` Patrick Venture
  2017-12-20 19:07     ` Michael.E.Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Patrick Venture @ 2017-12-18 18:17 UTC (permalink / raw)
  To: Michael.E.Brown; +Cc: OpenBMC Maillist, Brad Bishop

So, we use something similar on the inside, but it's not well-suited
for an embedded platform.

On Fri, Dec 15, 2017 at 11:06 AM,  <Michael.E.Brown@dell.com> wrote:
> Things like Prometheus and graphana already exist and have fairly standardized metrics gathering interfaces. Why re-invent the wheel here? I would advocate for Prometheus for metrics gathering.
>
> Quick primer on Prometheus: each process exposes an HTTP endpoint that reports metrics in a standardized JSON format. The process can expose counters, gauges, histograms or summaries. The server side is responsible for polling the various clients at whatever interval is desired and can format results graphically and over a bunch of clients. Each process can instrument itself, there are c++ client libraries to do this already. Then we can aggregate them out with a route from the main HTTP server.
>
> All of the things in your metrics list fall neatly into types of data that Prometheus format handles well.
> --
> Michael
>
> -----Original Message-----
> From: openbmc [mailto:openbmc-bounces+michael.e.brown=dell.com@lists.ozlabs.org] On Behalf Of Patrick Venture
> Sent: Tuesday, December 5, 2017 10:51 AM
> To: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop <bradleyb@fuzziesquirrel.com>
> Subject: Metrics vs Logging, Continued
>
> Logging being something separate from metrics -- I've been toying around with different approaches to allowing userspace metrics collection and distribution.  There are likely better ways, and I think I saw a message on chat about a metrics library that could be used. -- but I've mostly been following email.
>
> I was thinking this morning of a couple methods, some y'all might like (one where the daemon owns it, one where the metric owner owns it):
>
> 1) Each daemon can be responsible for exporting onto dbus some objects with a well-defined path that are of a metric type that has a value and the daemon that owns it is therefore responsible for maintaining it.  to collect the metrics, one must grab the subtree for the starting point and trace out all the different metrics and get the values from their owners. and reports that up somehow. -- the somehow could be several IPMI packets.  or several IPMI packets containing a protobuf (similarly to the flash access approach proposed by Brendan).
> The upside to the free-form text and paths is you could parse it out to figure out what was each thing.
>
> 2) Each daemon that wants to track a metric creates a metric object in another daemon (via dbus calls) and then periodically updates that value.  then the information can be reported in the way described above similarly, except the owner of the dbus objects would be the one daemon and one bus, etc.  This implementation requires a lot more dbus traffic to maintain the values.  However, in situations where one doesn't want to manage their own dbus object for this, they can just make one dbus call to update their value based on whatever mechanism they use for timing this and they can store the metrics internally in their daemon however they please.  Another upside to this is that it'd be straightforward to add to the current set of daemons without needing to restructure anything.  Also, depending on the metric itself, it may not be something updated all that frequently.  For many, I foresee updating on non-critical failures, or interesting failures -- for instance, how often the ipmi daemon's reply is rejected by the btbridge daemon.
>
> Approach #2 could be rolled into a couple library calls as well, very easily such that they don't even know the internals of the tracking...
> I like and don't like the free-form text naming of the metrics, because obviously they can be human-readable.  Another approach might be to assign them human readable names and IDs, similarly to sensors so that you can read back the name for a metric once, and then in the future cache it, making subsequent requests smaller.
>
> Obvious downside to both implementations (although #2 has an easy mitigation), if the daemon with the internal state crashes the metrics are lost, when it comes back up all the metrics are 0.  If the metrics are owned by another daemon, then the library calls to set up the metrics tracking could check if the metric already exists, and use that value to start with -- then you only have to care about that one daemon crashing.  It could periodically write the values down and then read them on start-up to persist these values.  However, you might want the values to not persist... I imagine I wouldn't, however, something like boot count would...
>
> There are specific things that the host wants to know, that really fall into metrics over logging:
> 1) BMCs boot count
> 2) i2c ioctl failure count (which bus/device/reg: count)
> 3) Specific sensor requests (reading, writing)
> 4) Fan control failsafe mode count, how often it's falling into failsafe mode
> 5) How often the ipmi daemon's reply to the btbridge daemon fails.
>
> Given some feedback on this, I'll write up a design and the use-cases it's trying to address.
>
> Thanks,
> Patrick


* RE: Metrics vs Logging, Continued
  2017-12-05 16:50 Patrick Venture
@ 2017-12-15 19:06 ` Michael.E.Brown
  2017-12-18 18:17   ` Patrick Venture
  0 siblings, 1 reply; 10+ messages in thread
From: Michael.E.Brown @ 2017-12-15 19:06 UTC (permalink / raw)
  To: venture, openbmc, bradleyb

Things like Prometheus and Grafana already exist and have fairly standardized metrics-gathering interfaces. Why re-invent the wheel here? I would advocate for Prometheus for metrics gathering.

Quick primer on Prometheus: each process exposes an HTTP endpoint that reports metrics in a standardized text-based format. The process can expose counters, gauges, histograms or summaries. The server side is responsible for polling the various clients at whatever interval is desired and can format results graphically across a whole fleet of clients. Each process can instrument itself; there are C++ client libraries to do this already. Then we can aggregate them out with a route from the main HTTP server.

All of the things in your metrics list fall neatly into types of data that Prometheus format handles well.
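
To make that concrete, here is roughly what self-instrumentation looks like
with the Go client library (the C++ library is analogous); the metric names
are placeholders:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Placeholder metrics of the kinds mentioned above.
var (
	bootCount = promauto.NewCounter(prometheus.CounterOpts{
		Name: "bmc_boot_count_total",
		Help: "Number of BMC boots.",
	})
	fanSpeed = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "fan_speed_rpm",
		Help: "Current fan speed.",
	})
)

func main() {
	bootCount.Inc()
	fanSpeed.Set(4200)

	// The server side scrapes this endpoint at whatever interval it likes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}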
--
Michael

-----Original Message-----
From: openbmc [mailto:openbmc-bounces+michael.e.brown=dell.com@lists.ozlabs.org] On Behalf Of Patrick Venture
Sent: Tuesday, December 5, 2017 10:51 AM
To: OpenBMC Maillist <openbmc@lists.ozlabs.org>; Brad Bishop <bradleyb@fuzziesquirrel.com>
Subject: Metrics vs Logging, Continued

Logging being something separate from metrics -- I've been toying around with different approaches to allowing userspace metrics collection and distribution.  There are likely better ways, and I think I saw a message on chat about a metrics library that could be used. -- but I've mostly been following email.

I was thinking this morning of a couple methods, some y'all might like (one where the daemon owns it, one where the metric owner owns it):

1) Each daemon can be responsible for exporting onto dbus some objects with a well-defined path that are of a metric type that has a value and the daemon that owns it is therefore responsible for maintaining it.  to collect the metrics, one must grab the subtree for the starting point and trace out all the different metrics and get the values from their owners. and reports that up somehow. -- the somehow could be several IPMI packets.  or several IPMI packets containing a protobuf (similarly to the flash access approach proposed by Brendan).
The upside to the free-form text and paths is you could parse it out to figure out what was each thing.

2) Each daemon that wants to track a metric creates a metric object in another daemon (via dbus calls) and then periodically updates that value.  then the information can be reported in the way described above similarly, except the owner of the dbus objects would be the one daemon and one bus, etc.  This implementation requires a lot more dbus traffic to maintain the values.  However, in situations where one doesn't want to manage their own dbus object for this, they can just make one dbus call to update their value based on whatever mechanism they use for timing this and they can store the metrics internally in their daemon however they please.  Another upside to this is that it'd be straightforward to add to the current set of daemons without needing to restructure anything.  Also, depending on the metric itself, it may not be something updated all that frequently.  For many, I foresee updating on non-critical failures, or interesting failures -- for instance, how often the ipmi daemon's reply is rejected by the btbridge daemon.

Approach #2 could be rolled into a couple library calls as well, very easily such that they don't even know the internals of the tracking...
I like and don't like the free-form text naming of the metrics, because obviously they can be human-readable.  Another approach might be to assign them human readable names and IDs, similarly to sensors so that you can read back the name for a metric once, and then in the future cache it, making subsequent requests smaller.

Obvious downside to both implementations (although #2 has an easy mitigation), if the daemon with the internal state crashes the metrics are lost, when it comes back up all the metrics are 0.  If the metrics are owned by another daemon, then the library calls to set up the metrics tracking could check if the metric already exists, and use that value to start with -- then you only have to care about that one daemon crashing.  It could periodically write the values down and then read them on start-up to persist these values.  However, you might want the values to not persist... I imagine I wouldn't, however, something like boot count would...

There are specific things that the host wants to know, that really fall into metrics over logging:
1) BMCs boot count
2) i2c ioctl failure count (which bus/device/reg: count)
3) Specific sensor requests (reading, writing)
4) Fan control failsafe mode count, how often it's falling into failsafe mode
5) How often the ipmi daemon's reply to the btbridge daemon fails.

Given some feedback on this, I'll write up a design and the use-cases it's trying to address.

Thanks,
Patrick


* Metrics vs Logging, Continued
@ 2017-12-05 16:50 Patrick Venture
  2017-12-15 19:06 ` Michael.E.Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Patrick Venture @ 2017-12-05 16:50 UTC (permalink / raw)
  To: OpenBMC Maillist, Brad Bishop

Logging being something separate from metrics -- I've been toying
around with different approaches to allowing userspace metrics
collection and distribution.  There are likely better ways, and I
think I saw a message on chat about a metrics library that could be
used -- but I've mostly been following email.

I was thinking this morning of a couple methods, some y'all might like
(one where the daemon owns it, one where the metric owner owns it):

1) Each daemon is responsible for exporting onto D-Bus some objects,
under a well-defined path, of a metric type that holds a value; the
daemon that owns an object is responsible for maintaining it.  To
collect the metrics, a collector grabs the subtree from the starting
point, traces out all the different metrics, gets the values from
their owners, and reports that up somehow -- the "somehow" could be
several IPMI packets, or several IPMI packets containing a protobuf
(similar to the flash access approach proposed by Brendan).
The upside of the free-form text and paths is that you could parse
them out to figure out what each thing was.

2) Each daemon that wants to track a metric creates a metric object in
another daemon (via D-Bus calls) and then periodically updates that
value.  The information can then be reported much as described above,
except that the D-Bus objects are all owned by the one daemon, on one
bus, etc.  This implementation requires a lot more D-Bus traffic to
maintain the values.  However, a daemon that doesn't want to manage
its own D-Bus object for this can just make one D-Bus call to update
its value, on whatever schedule it likes, and store the metrics
internally however it pleases.  Another upside is that it'd be
straightforward to add to the current set of daemons without needing
to restructure anything.  Also, depending on the metric, it may not be
updated all that frequently.  For many, I foresee updating on
non-critical or otherwise interesting failures -- for instance, how
often the ipmi daemon's reply is rejected by the btbridge daemon.

Approach #2 could also very easily be rolled into a couple of library
calls, so that daemons don't even know the internals of the
tracking...  I both like and dislike free-form text naming of the
metrics: the names are human-readable, which is nice.  Another
approach might be to assign each metric a human-readable name and an
ID, similarly to sensors, so that you can read back the name for a
metric once, cache it, and make subsequent requests smaller.
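
Roughly, the library call for approach #2 could be as thin as this sketch --
the service, path, and method names are invented for illustration, not an
existing interface:

package bmcmetrics

import "github.com/godbus/dbus/v5"

// SetMetric asks the (hypothetical) central metrics daemon to create the
// metric if needed and store the new value: one bus call per update.  A real
// library would keep the connection open rather than reconnect every time.
func SetMetric(name string, value uint64) error {
	conn, err := dbus.ConnectSystemBus()
	if err != nil {
		return err
	}
	defer conn.Close()

	obj := conn.Object("xyz.openbmc_project.Metrics",
		dbus.ObjectPath("/xyz/openbmc_project/metrics"))
	// Invented method name; a real design would define this interface.
	return obj.Call("xyz.openbmc_project.Metrics.Set", 0, name, value).Err
}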

An obvious downside to both implementations (although #2 has an easy
mitigation): if the daemon holding the internal state crashes, the
metrics are lost, and when it comes back up all the metrics are 0.  If
the metrics are owned by another daemon, the library calls that set up
the metrics tracking could check whether the metric already exists and
start from that value -- then you only have to care about that one
daemon crashing.  That daemon could also periodically write the values
down and read them back at start-up to persist them.  However, you
might not want some values to persist... I imagine I wouldn't for
most, though something like boot count should...

There are specific things that the host wants to know that really
fall under metrics rather than logging:
1) BMC boot count
2) i2c ioctl failure count (per bus/device/register)
3) Specific sensor requests (reading, writing)
4) Fan control failsafe mode count -- how often it falls into failsafe mode
5) How often the ipmi daemon's reply to the btbridge daemon fails.

Given some feedback on this, I'll write up a design and the use-cases
it's trying to address.

Thanks,
Patrick


Thread overview: 10 messages
2018-01-03  2:22 Metrics vs Logging, Continued Christopher Covington
2018-01-03  4:08 ` Deepak Kodihalli
2018-01-04 15:39 ` Michael E Brown
  -- strict thread matches above, loose matches on Subject: below --
2017-12-05 16:50 Patrick Venture
2017-12-15 19:06 ` Michael.E.Brown
2017-12-18 18:17   ` Patrick Venture
2017-12-20 19:07     ` Michael.E.Brown
2017-12-20 20:01       ` Patrick Venture
2017-12-20 20:10         ` Patrick Venture
2017-12-22 18:02           ` Michael.E.Brown
