All of lore.kernel.org
 help / color / mirror / Atom feed
* Logging utilities.
@ 2016-09-17 12:49 Patrick Williams
  2016-09-19  3:18 ` Rick Altherr
  0 siblings, 1 reply; 4+ messages in thread
From: Patrick Williams @ 2016-09-17 12:49 UTC (permalink / raw)
  To: OpenBMC Maillist

[-- Attachment #1: Type: text/plain, Size: 6031 bytes --]

There is a commit on Gerrit where some discussion was requested on the
libraries we plan to use for logging:
    https://gerrit.openbmc-project.xyz/#/c/53/
The suggestion was made that we consider google/glog as a logging
library.

Our intention with the program logging is to heavily leverage systemd's
journal methods both for original program logging but also for later
programmatic extraction on errors.  Adriana is working on the second
part and will have some proof of concepts to share soon.

Left to their own devices, two different programmers might create two
logging events such as:
    ERROR "Failed to open device %s, %s", device, strerror(errno)
    ERROR "fopen on %s failed: %d", device, errno

Notice in this case, not only is the English different but the content
itself is different and impossible to parse out if we want to do
something like automatically transpose "device" into a potentially
failing FRU.  The journald approach is that you do not put content in
English but you use meta-data tags to represent it.

    LEVEL=ERROR, MESSAGE="Failed to open", FILE="%s", device,
    ERRNO="%d", errno

Now we have a message that humans can read ("Failed to open
FILE=device") and one that computers can still interpret.  To take this
a step further, consider adding another meta-data tag:
    ERROR_EVENT="xyz.openbmc-project.Error.Device-Not-Found"

We can now codify behavior, in a data format customizable per-system if
needed, such as:
    on xyz.openbmc-project.Error.Device-Not-Found:
        [Device, AccessType] = parse-device-string(metadata.DEVICE)
        Parts = devices-on-bus(BMC, Device, AccessType)
        add-callout(Parts, High)
        add-callout("Error.Device-Configuration", Low)

Effectively, we separate the error identification from the error policy.
There are a few major advantages of this:
    1. Historically we have found that well over 50% of the code in this
       environment ends up being dealing with errors, which makes the
       code very difficult to understand and maintain.  By separating
       the policy, we end up with something that is much more readable.
    2. Different systems end up with policy quirks that tend to litter
       the code.  "On this system i2c bus 3 goes through this extra
       riser card and the connector ends up oxidizing faster than we
       expected, so put that riser as the highest priority for
       replacement."  We can now put this policy into a system-specific
       enhancement file rather than --enable-firestone and #if
       CONFIG_FIRESTONE littering all the packages.

Now, back to the logging API, what Adriana delivered aligns very well
with these metadata and error concepts because it was designed with
these thoughts in mind (which, I know, we could have done a better job of
communicating).  We enforce, at compile-time, that "messages" are static
English statements and dynamic data is all tagged through the "entry"
mechanism.  Down the line we will be able to further enforce, again at
compile time, that log(error(xyz.openbmc-project.Device-Not-Found)...)
must contain {entry(DEVICE), entry(ERRNO)} so that the program fails to
compile if the required data, per the error definition, were not given.

I don't see an easy way to leverage google/glog to include these
concepts, other than simply having the logging utilities Adriana
delivered use google/glog instead of systemd_journal.  At that point we
are losing the mind-share(?) associated with using glog in the first
place.

Glog seems to be primarily oriented towards a logging infrastructure 
for a stand-alone application.  The default behavior of it is to create
a log file in /tmp per process.  Due to the nature of our project, being
a collection of interacting products, debug is often going to require
aggregation of logging across the system and this is where journald
would be needed as the back-end.  As Rick mentioned in the above review, 
glog doesn't currently support systemd as a back-end but that is likely
trivial to add (or maybe available as an unreleased patch already).  It
does seem like a significant code path to introduce simply as a way to
get from a log statement to journald, when what Adriana delivered
compiles directly to a single journald call.

Glog also uses C++ iostream-style << operators.  I made a simple
hello-world application using it and there ends up being a local object
constructed to contain the buffer, a << operator call to insert a string
into the buffer, and a destructor call to emit the log entry.  There is
4 function calls involved at a minimum for each log entry, plus an
additional function call per-element.  So, in addition to the
performance implications of whatever glog is doing internally, I have
concerns about the code size impacts to using Glog due to the way the
API is structured and implemented.

In summary, I do not see an advantage to using glog over direct
journald calls, and would prefer we incorporate something derived from
what Adriana has proposed, for the following reasons:
    1. The proposed API has syntactic handling for concepts we want in
       our error-event design that glog does not.
        a. glog could be inserted under phosphor::logging at any point
           in the future without any code change to applications that
           use phosphor::logging.
    2. Glog has performance and size implications, over direct journald
       calls, without much in the way of added value.

There are a number of macros that Glog provides to simplify log
interaction:
        LOG_IF(condition) << entry
    rather than
        if (condition) { LOG << entry; }
    (and many more)

It does seem like a concept we could enhance the phosphor::logging
interfaces with if we find that this makes the code more concise.  I
would be interested to hear some of the more widely used macros and see
those enhancements proposed in Gerrit.

-- 
Patrick Williams

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Logging utilities.
  2016-09-17 12:49 Logging utilities Patrick Williams
@ 2016-09-19  3:18 ` Rick Altherr
  2016-10-25  0:43   ` Brendan Higgins
  0 siblings, 1 reply; 4+ messages in thread
From: Rick Altherr @ 2016-09-19  3:18 UTC (permalink / raw)
  To: Patrick Williams; +Cc: OpenBMC Maillist, Adriana Kobylak

[-- Attachment #1: Type: text/plain, Size: 7076 bytes --]

Thank you for the background.  That explains a lot about the path Adriana
has gone down.  I completely understand wanting to separate policy from
reporting of device and process errors.  Where we differ is that I don't
see that as the purpose of logging.  You've described an event-based
telemetry system meant to be primarily consumed by other software to drive
alerts and automated responses.  I view logging as being primarily for the
benefit of a programmer or operator trying to understand what section of
_code_ went wrong.  That distinction is part of why glog isn't designed to
be super high-performance.  Google uses entirely separate systems for
telemetry.  In my and Google's experience, scraping logs to extract
information (even when the messages are well structured) winds up being
painful to maintain and unreliable over time even when using journald.

On Sat, Sep 17, 2016 at 5:49 AM, Patrick Williams <patrick@stwcx.xyz> wrote:

> There is a commit on Gerrit where some discussion was requested on the
> libraries we plan to use for logging:
>     https://gerrit.openbmc-project.xyz/#/c/53/
> The suggestion was made that we consider google/glog as a logging
> library.
>
> Our intention with the program logging is to heavily leverage systemd's
> journal methods both for original program logging but also for later
> programmatic extraction on errors.  Adriana is working on the second
> part and will have some proof of concepts to share soon.
>
> Left to their own devices, two different programmers might create two
> logging events such as:
>     ERROR "Failed to open device %s, %s", device, strerror(errno)
>     ERROR "fopen on %s failed: %d", device, errno
>
> Notice in this case, not only is the English different but the content
> itself is different and impossible to parse out if we want to do
> something like automatically transpose "device" into a potentially
> failing FRU.  The journald approach is that you do not put content in
> English but you use meta-data tags to represent it.
>
>     LEVEL=ERROR, MESSAGE="Failed to open", FILE="%s", device,
>     ERRNO="%d", errno
>
> Now we have a message that humans can read ("Failed to open
> FILE=device") and one that computers can still interpret.  To take this
> a step further, consider adding another meta-data tag:
>     ERROR_EVENT="xyz.openbmc-project.Error.Device-Not-Found"
>
> We can now codify behavior, in a data format customizable per-system if
> needed, such as:
>     on xyz.openbmc-project.Error.Device-Not-Found:
>         [Device, AccessType] = parse-device-string(metadata.DEVICE)
>         Parts = devices-on-bus(BMC, Device, AccessType)
>         add-callout(Parts, High)
>         add-callout("Error.Device-Configuration", Low)
>
> Effectively, we separate the error identification from the error policy.
> There are a few major advantages of this:
>     1. Historically we have found that well over 50% of the code in this
>        environment ends up being dealing with errors, which makes the
>        code very difficult to understand and maintain.  By separating
>        the policy, we end up with something that is much more readable.
>     2. Different systems end up with policy quirks that tend to litter
>        the code.  "On this system i2c bus 3 goes through this extra
>        riser card and the connector ends up oxidizing faster than we
>        expected, so put that riser as the highest priority for
>        replacement."  We can now put this policy into a system-specific
>        enhancement file rather than --enable-firestone and #if
>        CONFIG_FIRESTONE littering all the packages.
>
> Now, back to the logging API, what Adriana delivered aligns very well
> with these metadata and error concepts because it was designed with
> these thoughts in mind (which, I know, we could have done a better job of
> communicating).  We enforce, at compile-time, that "messages" are static
> English statements and dynamic data is all tagged through the "entry"
> mechanism.  Down the line we will be able to further enforce, again at
> compile time, that log(error(xyz.openbmc-project.Device-Not-Found)...)
> must contain {entry(DEVICE), entry(ERRNO)} so that the program fails to
> compile if the required data, per the error definition, were not given.
>
> I don't see an easy way to leverage google/glog to include these
> concepts, other than simply having the logging utilities Adriana
> delivered use google/glog instead of systemd_journal.  At that point we
> are losing the mind-share(?) associated with using glog in the first
> place.
>
> Glog seems to be primarily oriented towards a logging infrastructure
> for a stand-alone application.  The default behavior of it is to create
> a log file in /tmp per process.  Due to the nature of our project, being
> a collection of interacting products, debug is often going to require
> aggregation of logging across the system and this is where journald
> would be needed as the back-end.  As Rick mentioned in the above review,
> glog doesn't currently support systemd as a back-end but that is likely
> trivial to add (or maybe available as an unreleased patch already).  It
> does seem like a significant code path to introduce simply as a way to
> get from a log statement to journald, when what Adriana delivered
> compiles directly to a single journald call.
>
> Glog also uses C++ iostream-style << operators.  I made a simple
> hello-world application using it and there ends up being a local object
> constructed to contain the buffer, a << operator call to insert a string
> into the buffer, and a destructor call to emit the log entry.  There is
> 4 function calls involved at a minimum for each log entry, plus an
> additional function call per-element.  So, in addition to the
> performance implications of whatever glog is doing internally, I have
> concerns about the code size impacts to using Glog due to the way the
> API is structured and implemented.
>
> In summary, I do not see an advantage to using glog over direct
> journald calls, and would prefer we incorporate something derived from
> what Adriana has proposed, for the following reasons:
>     1. The proposed API has syntactic handling for concepts we want in
>        our error-event design that glog does not.
>         a. glog could be inserted under phosphor::logging at any point
>            in the future without any code change to applications that
>            use phosphor::logging.
>     2. Glog has performance and size implications, over direct journald
>        calls, without much in the way of added value.
>
> There are a number of macros that Glog provides to simplify log
> interaction:
>         LOG_IF(condition) << entry
>     rather than
>         if (condition) { LOG << entry; }
>     (and many more)
>
> It does seem like a concept we could enhance the phosphor::logging
> interfaces with if we find that this makes the code more concise.  I
> would be interested to hear some of the more widely used macros and see
> those enhancements proposed in Gerrit.
>
> --
> Patrick Williams
>

[-- Attachment #2: Type: text/html, Size: 8215 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Logging utilities.
  2016-09-19  3:18 ` Rick Altherr
@ 2016-10-25  0:43   ` Brendan Higgins
  2016-10-28 13:28     ` Patrick Williams
  0 siblings, 1 reply; 4+ messages in thread
From: Brendan Higgins @ 2016-10-25  0:43 UTC (permalink / raw)
  To: patrick, raltherr, openbmc, anoo; +Cc: Brendan Higgins

Trying to revive this thread: we never concluded discussion on this. For the
convenience of the reader I have quoted the previous email below:

> Thank you for the background.  That explains a lot about the path Adriana
> has gone down.  I completely understand wanting to separate policy from
> reporting of device and process errors.  Where we differ is that I don't
> see that as the purpose of logging.  You've described an event-based
> telemetry system meant to be primarily consumed by other software to drive
> alerts and automated responses.  I view logging as being primarily for the
> benefit of a programmer or operator trying to understand what section of
> _code_ went wrong.  That distinction is part of why glog isn't designed to
> be super high-performance.  Google uses entirely separate systems for
> telemetry.  In my and Google's experience, scraping logs to extract
> information (even when the messages are well structured) winds up being
> painful to maintain and unreliable over time even when using journald.

We have a patch set out that adds glog support at:
https://gerrit.openbmc-project.xyz/#/c/788/

In addition, we have a patch for glog that adds support for logging to journald,
but I would like to get some feedback before I do anything else.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Logging utilities.
  2016-10-25  0:43   ` Brendan Higgins
@ 2016-10-28 13:28     ` Patrick Williams
  0 siblings, 0 replies; 4+ messages in thread
From: Patrick Williams @ 2016-10-28 13:28 UTC (permalink / raw)
  To: Brendan Higgins; +Cc: raltherr, openbmc, anoo

[-- Attachment #1: Type: text/plain, Size: 5396 bytes --]

On Mon, Oct 24, 2016 at 05:43:22PM -0700, Brendan Higgins wrote:
> Trying to revive this thread: we never concluded discussion on this. For the
> convenience of the reader I have quoted the previous email below:
> 
> > Thank you for the background.  That explains a lot about the path Adriana
> > has gone down.  I completely understand wanting to separate policy from
> > reporting of device and process errors.  Where we differ is that I don't
> > see that as the purpose of logging.  You've described an event-based
> > telemetry system meant to be primarily consumed by other software to drive
> > alerts and automated responses.  I view logging as being primarily for the
> > benefit of a programmer or operator trying to understand what section of
> > _code_ went wrong.  That distinction is part of why glog isn't designed to
> > be super high-performance.  Google uses entirely separate systems for
> > telemetry.  In my and Google's experience, scraping logs to extract
> > information (even when the messages are well structured) winds up being
> > painful to maintain and unreliable over time even when using journald.
> 
> We have a patch set out that adds glog support at:
> https://gerrit.openbmc-project.xyz/#/c/788/
> 
> In addition, we have a patch for glog that adds support for logging to journald,
> but I would like to get some feedback before I do anything else.

The IBM team and Google team had a in-person discussion on this email
thread yesterday (10/27).  I'm going to try to recap some of the points
there.

There ends up being three different aspects of "logging" that we were
discussing and in some cases using the same words for the different
aspects.  Those aspects are:

   1.  Program level information to assist in debugging.  This is
   sometimes called "logging", "tracing" or "flight recording".

   2.  Creating a record of something wrong with the system that may
   require a service action.  IBM has traditionally used the word "error
   logging" to refer to the action and "serviceable event" to refer to
   the thing generated by the action.

   3.  Collection of metrics associated with the system for external
   consumption.  Examples of metrics might be: raw sensor values, BMC
   memory and filesystem consumption, instances of recovered /
   predictive failure errors that have not reached a serviceable
   threshold.

Program level "logging" is intended to go into the journal.  The commit
from Adriana referenced in the discussion implements this.  This logging
interface allows unstructured entries, but we prefer to keep metadata tags
on data items as described earlier in the chain.
  (ex. {MSG="Failed to open file.", ERRNO=%d, FILE_PATH=%s} rather than
       {"Failed to open file(%d): %s", errno, file_path})
The interface also allows syslog-style log levels and we could decide on
a compile flag to disable (compile out) certain log levels.  I believe
the IBM team discussed earlier in the week adding a compile flag to
disable DEBUG level in full image builds, but allowing developers to do
something like "-DPHOSPHOR_LOGGING_DEBUG" to enable full logging on a
particular application.

We understand that there are some nice features of glog, with respect to
backtraces for instance, but strongly feel that the API of it will cause
a fairly significant flash size increase due to iostream-style shift
operators.  There is also concern, though unmeasured, about the
performance impact from the increase in the code path to the journald
interfaces.  The phosphor-logging interfaces compile down to exactly a
single journald function call per log call, which alleviates both of
these concerns.

The creation of service action records proposal was what I described
previously in this email chain.  The commits for that are available in
Gerrit.  We have found that often the information a developer would want
to log for application debug is often related to the information that is
needed to create a serviceable event.  Rather than require the program
(and developer) to log the same information through two different APIs,
we have attempted to combine it by leveraging the meta-data fields from
the program level logging API.  The serviceable event structure is
defined via a YAML file with required meta-data fields and then the API
has compile-time assurance that the required meta-data fields have been
provided.

We recognize this means we are effectively using the journal as an IPC
mechanism between the program identifying a serviceable event and the
program recording and reacting to / executing policy for the serviceable
event.  If this becomes cumbersome or problematic due to journald's
behavior we believe we can change this implementation choice with
little or no API modification.

Metrics are typically going to be generated and distributed through the
system via dbus signals.  For instance, a dbus object is created per
sensor and the org.freedesktop.Object.PropertiesChanged signal is
generated when the value of the sensor changes.  We expect to enhance
the REST server to allow users to subscribe and stream out those sorts
of signals.  If there isn't an interest in the REST API, an implementer
is free to create another daemon that subscribes to those signals and
processes them in another way.

-- 
Patrick Williams

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-10-28 13:29 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-17 12:49 Logging utilities Patrick Williams
2016-09-19  3:18 ` Rick Altherr
2016-10-25  0:43   ` Brendan Higgins
2016-10-28 13:28     ` Patrick Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.