* towards a user-mode diagnostic log mechanism
@ 2011-12-20  2:42 Mark Kampe
  2011-12-23 10:04 ` Wido den Hollander
  2012-01-06  4:09 ` Colin McCabe
  0 siblings, 2 replies; 8+ messages in thread
From: Mark Kampe @ 2011-12-20  2:42 UTC (permalink / raw)
  To: ceph-devel

I'd like to keep this ball moving ... as I believe that the
limitations of our current logging mechanisms are already
making support difficult, and that is about to become worse.

As a first step, I'd just like to get opinions on the general
requirements we are trying to satisfy, and decisions we have
to make along the way.

Comments?

I Requirements

   A. Primary Requirements (must have)
      1. information captured
         a. standard: time, sub-system, level, proc/thread
         b. additional: operation and parameters
         c. extensible for new operations
      2. efficiency
         a. run time overhead < 1%
            (I believe this requires delayed-flush circular buffering)
         b. persistent space O(Gigabytes per node-year)
      3. configurability
         a. capture level per sub-system
      4. persistence
         a. flushed out on process shut-down
         b. recoverable from user-mode core-dumps
      5. presentation
         a. output can be processed w/ grep, less, ...

   B. Secondary Requirements (nice to have)
      1. ease of use
         a. compatible with/convertible from existing calls
         b. run-time definition of new event records
      2. configurability
         a. size/rotation rules per sub-system
         b. separate in-memory/on-disk capture levels

II Decisions to be made

    A. Capture Circumstances
       1. some subset of procedure calls
          (I'm opposed to this, but it is an option)
       2. explicit event logging calls

    B. Capture Format
       1. ASCII text
       2. per-event binary format
       3. binary header + ASCII text

    C. Synchronization
       1. per-process vs per-thread buffers

    D. Flushing
       1. last writer flushes vs dedicated thread
       2. single- vs double-buffered output

    E. Available open source candidates



* Re: towards a user-mode diagnostic log mechanism
  2011-12-20  2:42 towards a user-mode diagnostic log mechanism Mark Kampe
@ 2011-12-23 10:04 ` Wido den Hollander
  2012-01-06  4:09 ` Colin McCabe
  1 sibling, 0 replies; 8+ messages in thread
From: Wido den Hollander @ 2011-12-23 10:04 UTC (permalink / raw)
  To: Mark Kampe; +Cc: ceph-devel

On 12/20/2011 03:42 AM, Mark Kampe wrote:
> I'd like to keep this ball moving ... as I believe that the
> limitations of our current logging mechanisms are already
> making support difficult, and that is about to become worse.
>

I'll have to agree on that.

Running a larger cluster with full debugging on is nearly impossible. It
puts a lot of load on your systems, which could even lead to more trouble.

> As a first step, I'd just like to get opinions on the general
> requirements we are trying to satisfy, and decisions we have
> to make along the way.
>
> Comments?
>
> I Requirements
>
>    A. Primary Requirements (must have)
>       1. information captured
>          a. standard: time, sub-system, level, proc/thread
>          b. additional: operation and parameters
>          c. extensible for new operations
>       2. efficiency
>          a. run time overhead < 1%
>             (I believe this requires delayed-flush circular buffering)
>          b. persistent space O(Gigabytes per node-year)
>       3. configurability
>          a. capture level per sub-system
>       4. persistence
>          a. flushed out on process shut-down
>          b. recoverable from user-mode core-dumps
>       5. presentation
>          a. output can be processed w/ grep, less, ...
>
>    B. Secondary Requirements (nice to have)
>       1. ease of use
>          a. compatible with/convertible from existing calls
>          b. run-time definition of new event records
>       2. configurability
>          a. size/rotation rules per sub-system
>          b. separate in-memory/on-disk capture levels
>
> II Decisions to be made
>
>     A. Capture Circumstances
>        1. some subset of procedure calls
>           (I'm opposed to this, but it is an option)
>        2. explicit event logging calls
>
>     B. Capture Format
>        1. ASCII text
>        2. per-event binary format
>        3. binary header + ASCII text
>
>     C. Synchronization
>        1. per-process vs per-thread buffers
>
>     D. Flushing
>        1. last writer flushes vs dedicated thread
>        2. single- vs double-buffered output
>
>     E. Available open source candidates

I'd still opt for a ring buffer into which all kinds of information is
dumped. A separate reader/analyser can get this information out of the
ring and write logs from it or do performance counting.
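
To make that concrete, here is a minimal sketch of such a ring (invented
names, not Varnish's actual implementation; a real version needs careful
memory ordering between writer and reader):

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    // One fixed-size record per event keeps the ring simple and fast.
    struct LogRecord {
        uint64_t seq;        // written last; lets a reader detect overruns
        uint64_t ts_usec;    // timestamp, however it is obtained
        uint32_t subsys;     // sub-system id
        char     text[100];  // payload, truncated if longer
    };

    class LogRing {
        static const size_t N = 1 << 16;    // power of two: wrap by masking
        LogRecord ring[N];
        std::atomic<uint64_t> head;         // next sequence number to claim
    public:
        LogRing() : head(0) {}
        void append(uint32_t subsys, uint64_t ts, const char *msg) {
            uint64_t seq = head.fetch_add(1, std::memory_order_relaxed);
            LogRecord &r = ring[seq & (N - 1)];   // old entries overwritten
            r.subsys = subsys;
            r.ts_usec = ts;
            strncpy(r.text, msg, sizeof(r.text) - 1);
            r.text[sizeof(r.text) - 1] = '\0';
            r.seq = seq;    // reader re-checks this after copying a record
        }
    };

A reader process polls the ring, copies records out, and re-checks seq to
throw away anything that was overwritten mid-copy -- roughly how
varnishlog consumes the Varnish shared-memory log.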

Currently there is also no statistics information about OSDs. From log
entries you can generate statistics as well: the number of IOPS a
specific OSD has to process, the number of PG operations, and so on.

I'd still suggest taking a look at how Varnish did this with its
varnishlog and varnishncsa tools.

That works for us at 10k req/sec, and we can do full debugging without
performance impact.

Just my 2c

Wido


* Re: towards a user-mode diagnostic log mechanism
  2011-12-20  2:42 towards a user-mode diagnostic log mechanism Mark Kampe
  2011-12-23 10:04 ` Wido den Hollander
@ 2012-01-06  4:09 ` Colin McCabe
  2012-01-07  1:46   ` Mark Kampe
  2012-01-10 23:25   ` Tommi Virtanen
  1 sibling, 2 replies; 8+ messages in thread
From: Colin McCabe @ 2012-01-06  4:09 UTC (permalink / raw)
  To: Mark Kampe; +Cc: ceph-devel

On Mon, Dec 19, 2011 at 6:42 PM, Mark Kampe <mark.kampe@dreamhost.com> wrote:
> I'd like to keep this ball moving ... as I believe that the
> limitations of our current logging mechanisms are already
> making support difficult, and that is about to become worse.
>
> As a first step, I'd just like to get opinions on the general
> requirements we are trying to satisfy, and decisions we have
> to make along the way.
>
> Comments?
>
> I Requirements
>
>  A. Primary Requirements (must have)
>     1. information captured
>        a. standard: time, sub-system, level, proc/thread

We already do capture all this stuff, just not efficiently.

One easy way to get a lot of efficiency would be to add a thread-local
variable to store the thread name / ID.
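
Concretely, something like this (a sketch; log_thread_id is an invented
helper, not existing Ceph code):

    #include <sys/syscall.h>
    #include <unistd.h>

    // Cache the kernel thread id per thread, so the logger pays for the
    // gettid syscall once per thread instead of once per message.
    static __thread pid_t cached_tid = 0;

    static inline pid_t log_thread_id() {
        if (cached_tid == 0)
            cached_tid = (pid_t) syscall(SYS_gettid);
        return cached_tid;
    }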

Getting the system time is a surprisingly expensive operation, and
this poses a problem for logging system designers.  You can use the
rdtsc CPU instruction, but unfortunately on some architectures CPU
frequency scaling makes it very inaccurate.  Even when it is accurate,
it's not synchronized between multiple CPUs.

Another option is to omit the time for most messages and just have a
periodic timestamp that gets injected every so often -- perhaps every
second.
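
One way to sketch that: a helper thread refreshes a shared coarse clock,
and each log call reads it with a single relaxed load (invented names;
the refresh interval bounds the timestamp error):

    #include <atomic>
    #include <stdint.h>
    #include <sys/time.h>
    #include <unistd.h>

    static std::atomic<uint64_t> coarse_usec(0);

    // Runs in a dedicated thread: refresh the shared timestamp once per
    // second; messages then cost one load instead of one clock read.
    static void *clock_refresher(void *) {
        for (;;) {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            coarse_usec.store(tv.tv_sec * 1000000ULL + tv.tv_usec,
                              std::memory_order_relaxed);
            sleep(1);
        }
        return NULL;
    }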

>        b. additional: operation and parameters
>        c. extensible for new operations
>     2. efficiency
>        a. run time overhead < 1%
>           (I believe this requires delayed-flush circular buffering)

The current code could be changed to support delayed flush without too
much trouble.  In order to do that:

1. The message timestamp needs to be added by dout() rather than by
DoutStreambuf::overflow()

2. dendl needs to stop flushing the stream.  This part is simple...
just change the dendl macro to add "\n" rather than std::endl.

3. A signal handler needs to be added that flushes the dout stream
before program exit.

#1 is probably the most complex part.
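
#3 might look something like this (a sketch: flush_log_buffer is a
stand-in for whatever the real flush entry point ends up being, and a
real handler has to restrict itself to async-signal-safe operations):

    #include <csignal>

    extern void flush_log_buffer();  // assumed: writes buffered records out

    static void fatal_signal_handler(int signum) {
        flush_log_buffer();
        signal(signum, SIG_DFL);     // restore default disposition and
        raise(signum);               // re-raise so the core dump still happens
    }

    void install_flush_handlers() {
        int sigs[] = { SIGSEGV, SIGBUS, SIGILL, SIGABRT, SIGFPE, SIGTERM };
        for (unsigned i = 0; i < sizeof(sigs) / sizeof(sigs[0]); i++)
            signal(sigs[i], fatal_signal_handler);
    }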

I never added support for delayed flush because I was afraid that it
would prevent us from debugging bugs which killed processes without
triggering the signal handlers.  For example, when a process is killed
by the OOM killer, it does not get a chance to clean up, and so the
logs will be incomplete unless they are being continuously flushed.

Perhaps Ceph has matured to the point where this makes sense to add,
at least as an option.

>        b. persistent space O(Gigabytes per node-year)
>     3. configurability
>        a. capture level per sub-system
>     4. persistence
>        a. flushed out on process shut-down
>        b. recoverable from user-mode core-dumps
>     5. presentation
>        a. output can be processed w/ grep, less, ...
>
>  B. Secondary Requirements (nice to have)
>     1. ease of use
>        a. compatible with/convertible from existing calls
>        b. run-time definition of new event records
>     2. configurability
>        a. size/rotation rules per sub-system
>        b. separate in-memory/on-disk capture levels
>
> II Decisions to be made
>
>   A. Capture Circumstances
>      1. some subset of procedure calls
>         (I'm opposed to this, but it is an option)

Logging all function calls is very slow.  Trust me on this, I wrote a
profiler that did this once...  I used -finstrument-functions.
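
For context, this is what -finstrument-functions does: the compiler
inserts a call at every function entry and exit, so these two hooks run
for every single call in the program:

    extern "C" {
    // GCC invokes these around every instrumented function; the attribute
    // keeps the hooks themselves from being instrumented (and recursing).
    void __cyg_profile_func_enter(void *fn, void *call_site)
        __attribute__((no_instrument_function));
    void __cyg_profile_func_exit(void *fn, void *call_site)
        __attribute__((no_instrument_function));

    void __cyg_profile_func_enter(void *fn, void *call_site) {
        // even an empty hook adds two extra calls per function invocation;
        // actually recording fn/call_site makes it far more expensive
    }
    void __cyg_profile_func_exit(void *fn, void *call_site) {}
    }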

>      2. explicit event logging calls
>
>   B. Capture Format
>      1. ASCII text
>      2. per-event binary format
>      3. binary header + ASCII text
>
>   C. Synchronization
>      1. per-process vs per-thread buffers
>
>   D. Flushing
>      1. last writer flushes vs dedicated thread
>      2. single- vs double-buffered output
>
>   E. Available open source candidates

Pantheios and log4cpp are two potential candidates.  I don't know that
much about either, unfortunately.

Honestly, I feel like logging is something that you will end up
customizing pretty extensively to suit the needs of your application.
But perhaps it's worth checking out what these libraries provide--
especially in terms of efficiency.

The biggest efficiency killer in logging right now is that dout lock.
However, even if that lock was replaced by an atomic operation, there
would still be the issue of cacheline ping-pong to contend with.  In
order to scale to the hundred-core systems of the future, you have to
get rid of that shared state.
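
The shape of that, sketched with invented names (the handoff between
each owner thread and the flusher still needs care in real code):

    #include <mutex>
    #include <string>
    #include <vector>

    struct ThreadLogBuf {
        std::vector<std::string> entries;  // appended only by its owner
    };

    static __thread ThreadLogBuf *my_buf = 0;
    static std::mutex registry_lock;           // hit once per thread, ever
    static std::vector<ThreadLogBuf *> all_bufs;

    void log_event(const std::string &msg) {
        if (!my_buf) {
            my_buf = new ThreadLogBuf;
            std::lock_guard<std::mutex> g(registry_lock);
            all_bufs.push_back(my_buf);        // register for the flusher
        }
        my_buf->entries.push_back(msg);        // no lock, no shared line
    }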

P.S. Happy new year.

cheers,
Colin



* Re: towards a user-mode diagnostic log mechanism
  2012-01-06  4:09 ` Colin McCabe
@ 2012-01-07  1:46   ` Mark Kampe
  2012-01-10  2:20     ` Colin McCabe
  2012-01-10 23:25   ` Tommi Virtanen
  1 sibling, 1 reply; 8+ messages in thread
From: Mark Kampe @ 2012-01-07  1:46 UTC (permalink / raw)
  To: Colin McCabe; +Cc: ceph-devel

On 01/05/12 20:09, Colin McCabe wrote:

> Getting the system time is a surprisingly expensive operation, and
> this poses a problem for logging system designers.  You can use the
> rdtsc CPU instruction, but unfortunately on some architectures CPU
> frequency scaling makes it very inaccurate.  Even when it is accurate,
> it's not synchronized between multiple CPUs.
>
> Another option is to omit the time for most messages and just have a
> periodic timestamp that gets injected every so often -- perhaps every
> second.

I agree it needs to be cheap ... but my experience with
debugging problems in this sort of system suggests that
we need the finest-grained timestamps we can get
(on every single message).

Even though the clocks on different nodes are not that
closely synchronized, computing the relative offsets
from initial transactions isn't hard ... and then it
becomes possible to construct a total ordering of
events with accurate timings.
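
For the record, the offset computation is the standard NTP-style one:
for a single request/response pair, with local send/receive times t1/t4
and remote receive/send times t2/t3, and assuming roughly symmetric
network delay:

    // Estimated amount by which the remote clock is ahead of ours.
    double clock_offset(double t1, double t2, double t3, double t4) {
        return ((t2 - t1) + (t3 - t4)) / 2.0;
    }
    // Subtracting this offset from a peer's log timestamps puts both
    // nodes' events on one timeline, which is what makes the total
    // ordering possible.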

> Pantheios and log4cpp are two potential candidates.  I don't know that
> much about either, unfortunately.

Good suggestions.  I am also looking at Varnish (suggested by Wido
den Hollander) which does logging in a shared memory segment from
which external processes can save it (or not).  What was (for me)
a new idea here is the clean decoupling of in-memory event capture
from on-disk persistence and log rotation.  After I thought about
it for a few minutes, I concluded it had many nice consequences.

> Honestly, I feel like logging is something that you will end up
> customizing pretty extensively to suit the needs of your application.
> But perhaps it's worth checking out what these libraries provide--
> especially in terms of efficiency.

I agree that the data captured is going to be something we hone
based on experience.  I am (for now) most concerned with the
mechanism ... because I don't want to start making big investments
in instrumentation until we have a good mechanism, around which
we can fix the APIs that instrumentation will use.

I'll try to review the suggested candidates and describe the
mechanisms and advantages of each in another two weeks.

Thank you very much for the feedback.


* Re: towards a user-mode diagnostic log mechanism
  2012-01-07  1:46   ` Mark Kampe
@ 2012-01-10  2:20     ` Colin McCabe
  0 siblings, 0 replies; 8+ messages in thread
From: Colin McCabe @ 2012-01-10  2:20 UTC (permalink / raw)
  To: Mark Kampe; +Cc: ceph-devel

On Fri, Jan 6, 2012 at 5:46 PM, Mark Kampe <mark.kampe@dreamhost.com> wrote:
> On 01/05/12 20:09, Colin McCabe wrote:
>> Pantheios and log4cpp are two potential candidates.  I don't know that
>> much about either, unfortunately.
>
> Good suggestions.  I am also looking at varnish (suggested by Wido
> den Hollander) which does logging in a shared memory segment from
> which external processes can save it (or not).  What was (for me)
> a new idea here is the clean decoupling of in-memory event capture
> from on-disk persistence and log rotation.  After I thought about
> it for a few minutes, I concluded it had many nice consequences.

On-disk storage and log rotation is already decoupled from event
capture -- maybe even a little too decoupled.  The logging system can
send output to stderr, syslog, or a file.  It can even send to
multiple sinks at once.

Log rotation has never been done by Ceph itself.  If you want that,
you have to install a separate daemon.  On Fedora, that would be
"logrotate."  I realize there is a function named something like
"log_rotate" in the logging code, but actually it implements something
called the "symlink dance" which is needed by vstart.sh.  The less
said about that, the better.

Maybe what you are thinking of is copying all log output to a buffer,
but only permanently storing certain logs.  This could be useful in
the event of a crash or assertion.  This is a good idea in general,
but there are some things to keep in mind.

There is a bunch of code that performs really expensive operations
when lots of logging is enabled.  I remember seeing things like code
that printed out all elements in the PG history, etc.  This is
acceptable with the current logging system, because those logs never
even get created when the log level is low.  If you want "always-on"
logging, the first thing you'll have to do is audit all the code to
make sure nobody is performing very expensive operations when logging
is turned all the way up.
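
The pattern such an audit has to enforce, sketched with stand-in names
(log_level, log_event, and format_pg_history are not real Ceph symbols):

    #include <string>

    extern int log_level;
    extern void log_event(const std::string &msg);
    extern std::string format_pg_history();  // the expensive dump

    void log_pg_state() {
        if (log_level >= 20)                  // cheap test on every call...
            log_event(format_pg_history());   // ...expensive work on demand
    }

With always-on capture there is no low level for such call sites to hide
behind, so every one of them has to be found and bounded.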

>
>
>> Honestly, I feel like logging is something that you will end up
>> customizing pretty extensively to suit the needs of your application.
>> But perhaps it's worth checking out what these libraries provide--
>> especially in terms of efficiency.
>
>
> I agree that the data captured is going to be something we hone
> based on experience.  I am (for now) most concerned with the
> mechanism ... because I don't want to start making big investments
> in instrumentation until we have a good mechanism, around which
> we can fix the APIs that instrumentation will use.
>
> I'll try to review the suggested candidates and describe the
> mechanisms and advantages of each in another two weeks.

Good luck with your review.  I hope you come up with something that
serves Ceph well.

Colin

* Re: towards a user-mode diagnostic log mechanism
  2012-01-06  4:09 ` Colin McCabe
  2012-01-07  1:46   ` Mark Kampe
@ 2012-01-10 23:25   ` Tommi Virtanen
       [not found]     ` <4F0CDA9A.4070500@cs.ucsc.edu>
  1 sibling, 1 reply; 8+ messages in thread
From: Tommi Virtanen @ 2012-01-10 23:25 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Mark Kampe, ceph-devel

On Thu, Jan 5, 2012 at 20:09, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> Getting the system time is a surprisingly expensive operation, and
> this poses a problem for logging system designers.  You can use the
> rdtsc CPU instruction, but unfortunately on some architectures CPU
> frequency scaling makes it very inaccurate.  Even when it is accurate,
> it's not synchronized between multiple CPUs.

Google leads me to http://pastebin.com/9csWfDvx, which is a simple
benchmark that happily does a gettimeofday call in about 28
nanoseconds on my MacBook Air. This is not where the bottleneck is.
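
The benchmark is trivial to reconstruct; a sketch in the same spirit
(not the pastebin's exact code):

    #include <cstdio>
    #include <sys/time.h>

    int main() {
        const long N = 10 * 1000 * 1000;
        struct timeval begin, end, tv;
        gettimeofday(&begin, NULL);
        for (long i = 0; i < N; i++)
            gettimeofday(&tv, NULL);          // the call being measured
        gettimeofday(&end, NULL);
        double us = (end.tv_sec - begin.tv_sec) * 1e6
                  + (end.tv_usec - begin.tv_usec);
        printf("%.1f ns per gettimeofday call\n", us * 1000.0 / N);
        return 0;
    }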

* Re: towards a user-mode diagnostic log mechanism
       [not found]     ` <4F0CDA9A.4070500@cs.ucsc.edu>
@ 2012-01-11  1:04       ` Tommi Virtanen
  2012-01-11  1:06         ` Noah Watkins
  0 siblings, 1 reply; 8+ messages in thread
From: Tommi Virtanen @ 2012-01-11  1:04 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Colin McCabe, Mark Kampe, ceph-devel

On Tue, Jan 10, 2012 at 16:40, Noah Watkins <jayhawk@cs.ucsc.edu> wrote:
> The cost of reading the software clock via gettimeofday is dependent on the
> OS implementation, and can vary by as much as 2 orders of magnitude (see
> attached slide: 1.6us vs 60ns per clock read).
>
> Older Linux kernels force a thread into the kernel when calling gettimeofday,
> adding enormous overhead relative to the newer vsyscall implementations that
> evaluate gettimeofday within userspace.

So turning debug to full blast on older kernels will be painful. I'm
personally ok with that, especially when the alternative is
implementing all of the TSC logic myself. This is really complex code,
and somebody went through the effort of solving the hard problems for
us -- let's use their code!


* Re: towards a user-mode diagnostic log mechanism
  2012-01-11  1:04       ` Tommi Virtanen
@ 2012-01-11  1:06         ` Noah Watkins
  0 siblings, 0 replies; 8+ messages in thread
From: Noah Watkins @ 2012-01-11  1:06 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Noah Watkins, Colin McCabe, Mark Kampe, ceph-devel

On 1/10/12 5:04 PM, Tommi Virtanen wrote:
On Tue, Jan 10, 2012 at 16:40, Noah Watkins <jayhawk@cs.ucsc.edu> wrote:
>>
>> Older Linux kernels force a thread into the kernel when calling gettimeofday,
>> adding enormous overhead relative to the newer vsyscall implementations that
>> evaluate gettimeofday within userspace.
>
> So turning debug to full blast on older kernels will be painful. I'm
> personally ok with that, especially when the alternative is
> implementing all of the TSC logic myself. This is really complex code,
> and somebody went through the effort of solving the hard problems for

Yeah, that was just a bit of FYI noise.

> us -- let's use their code!

I definitely agree :)

