* Tracing Ceph results
@ 2017-05-02 20:21 Mohamad Gebai
  2017-05-02 20:47 ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Mohamad Gebai @ 2017-05-02 20:21 UTC (permalink / raw)
  To: ceph-devel

Hi,

I thought I would share some results of tracing I've been conducting 
recently. I instrumented Ceph (OSD only) with GCC's 
-finstrument-functions option, and traced it and the kernel 
simultaneously using LTTng (no instrumentation is actually necessary, 
though rebuilding Ceph is required).
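
To give an idea before the full how-to, here is a minimal sketch of the 
tracing side, assuming the stock liblttng-ust-cyg-profile helper that 
ships with LTTng-UST is used to turn the -finstrument-functions hooks 
into UST events (my branch [4] wires this up a bit differently, and the 
real kernel event list is longer than shown):

  lttng create ceph-osd-trace
  # kernel side: scheduling events, from which the state and
  # critical-path views are built
  lttng enable-event --kernel sched_switch,sched_wakeup
  # userspace side: the events emitted by the instrumentation hooks
  lttng enable-event --userspace 'lttng_ust_cyg_profile:*'
  lttng add-context --userspace -t vpid -t vtid
  lttng start

  # run the instrumented OSD with the helper preloaded
  LD_PRELOAD=liblttng-ust-cyg-profile.so ceph-osd -i 0

  lttng stop
  lttng destroy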

In [1]:
- the top view shows the state of the OSD process (22983), switching 
between the running (green) and blocked (yellow) states
- the middle view shows the critical path of that process, in other 
words what process 22983 was waiting for while it was in the blocked 
(yellow) state
- the bottom view shows the call stack graph, i.e. all function calls 
in the OSD code, generated by -finstrument-functions

Image [2] shows a clearer view of the call stack, and Image [3] a 
clearer view of the critical path.

The code is in branch [4].

After looking extensively at the result, I created a diagram that 
simplifies these interactions and shows the high level behavior of Ceph.

Here's the IO path of a write_full() - how it goes from the client to 
all OSDs and back. The OSDs talk to each other through their respective 
msg-worker threads using sockets, and the threads of the same OSD talk 
to each other through condition variables (int Wait(Mutex &mutex) in 
Cond.h). Colored arrows are messages across nodes (over the network). 
Thick horizontal lines separate nodes; thin horizontal lines separate 
threads on the same node.
https://docs.google.com/drawings/d/1fFb-aI8NQq7RESV2OKzrte2KvaBWAh6X9TL7woAGAmo/edit?usp=sharing
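
(For reference, the client side that kicks off this path is nothing 
more than a plain librados write_full() call; a minimal sketch, with 
the pool and object names as placeholders:)

  #include <rados/librados.hpp>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");             // connect as client.admin
    cluster.conf_read_file(nullptr);   // default ceph.conf locations
    cluster.connect();

    librados::IoCtx io;
    cluster.ioctx_create("data", io);  // pool name is a placeholder

    ceph::bufferlist bl;
    bl.append(std::string(4096, 'x'));
    io.write_full("obj0", bl);         // the traced call; blocks until
                                       // all OSD replicas have acked
    cluster.shutdown();
  }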

I did the same exercise for librbd (built with -finstrument-functions). 
Here's the IO path in librbd for writes against images with RBD 
journaling enabled and disabled. The red parts only appear when 
journaling is enabled. There are a few more differences in the final 
flush() part, but I've omitted them for simplicity. If there's 
need/interest I can add those as well. The dotted vertical lines 
separate iterations of write() in the client (the rbd tool).
https://docs.google.com/drawings/d/1s9pRVygdPzBtQyYrvPueF8Cm6Xi1d4oy5gNMOjx2Wzo/edit?usp=sharing
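
(Again for reference, a minimal sketch of the client-side loop that the 
dotted lines delimit, using the librbd C++ API; the image name, write 
size and iteration count are placeholders:)

  #include <rados/librados.hpp>
  #include <rbd/librbd.hpp>
  #include <string>

  // 'io' is an open IoCtx, as in the previous sketch
  void write_loop(librados::IoCtx& io) {
    librbd::RBD rbd;
    librbd::Image image;
    rbd.open(io, image, "test-img");   // image name is a placeholder

    ceph::bufferlist bl;
    bl.append(std::string(4096, 'x'));
    for (int i = 0; i < 8; i++)        // one iteration per dotted segment
      image.write(i * bl.length(), bl.length(), bl);
    image.flush();                     // the final flush() part above
    image.close();
  }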

I think these are the kind of diagrams we'll be able to get more easily 
from Zipkin/Blkin.

Mohamad


[1] https://drive.google.com/open?id=0B0_2xRlvBgt9c3ltaTNmMEplTlE
[2] https://drive.google.com/open?id=0B0_2xRlvBgt9TjVNTUpzWkpITjQ
[3] https://drive.google.com/open?id=0B0_2xRlvBgt9dl9vS083X2sxWTQ
[4] https://github.com/mogeb/ceph/tree/wip-finstrument-functions


* Re: Tracing Ceph results
  2017-05-02 20:21 Tracing Ceph results Mohamad Gebai
@ 2017-05-02 20:47 ` Sage Weil
  2017-05-02 21:05   ` Mohamad Gebai
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2017-05-02 20:47 UTC (permalink / raw)
  To: Mohamad Gebai; +Cc: ceph-devel, anjaneya.chagam

On Tue, 2 May 2017, Mohamad Gebai wrote:
> Hi,
> 
> I thought I would share some results of tracing I've been conducting recently.
> I instrumented Ceph (OSD only) with GCC's -finstrument-functions option, and
> traced it and the kernel simultaneously using LTTng (no instrumentation is
> actually necessary, though rebuilding Ceph is required).
> 
> In [1]:
> - the top view shows the state of the OSD process (22983), switching
> between the running (green) and blocked (yellow) states
> - the middle view shows the critical path of that process, in other words
> what process 22983 was waiting for while it was in the blocked
> (yellow) state
> - the bottom view shows the call stack graph, i.e. all function calls
> in the OSD code, generated by -finstrument-functions
> 
> Image [2] shows a clearer view of the call stack, and Image [3] a clearer
> view of the critical path.
> 
> The code is in branch [4].

These are awesome! What is the tool you're screenshotting to visualize?

It would be great to apply the build options based on an option passed 
to cmake (e.g., -DWITH_INSTRUMENT_FUNCTIONS).

Also, I wonder if these are a better/simpler alternative to the 
FUNCTRACE() macro in src/common/EventTrace.h.  What kind of overhead is 
introduced?  LTTng lets you enable specific tracepoints, right?
 
> After looking extensively at the result, I created a diagram that simplifies
> these interactions and shows the high level behavior of Ceph.
>
> Here's the IO path of a write_full() - how it goes from the client to all OSDs
> and back. The OSDs talk to each other through their respective msg-worker
> threads using sockets, and the threads of the same OSD talk to each other
> through condition variables (int Wait(Mutex &mutex) in Cond.h). Colored arrows
> are messages across nodes (over the network). Thick horizontal lines separate
> nodes; thin horizontal lines separate threads on the same node.
> https://docs.google.com/drawings/d/1fFb-aI8NQq7RESV2OKzrte2KvaBWAh6X9TL7woAGAmo/edit?usp=sharing
> 
> I did the same exercise for librbd (built with -finstrument-functions). Here's
> the IO path in librbd for writes against images with RBD journaling enabled
> and disabled. The red parts only appear when journaling is enabled. There are
> a few more differences in the final flush() part, but I've omitted them for
> simplicity. If there's need/interest I can add those as well. The dotted
> vertical lines separate iterations of write() in the client (the rbd tool).
> https://docs.google.com/drawings/d/1s9pRVygdPzBtQyYrvPueF8Cm6Xi1d4oy5gNMOjx2Wzo/edit?usp=sharing
> 
> I think these are the kind of diagrams we'll be able to get more easily from
> Zipkin/Blkin.

Yes!
sage


* Re: Tracing Ceph results
  2017-05-02 20:47 ` Sage Weil
@ 2017-05-02 21:05   ` Mohamad Gebai
  2017-05-03  0:49     ` Mark Nelson
  2017-05-03 13:43     ` Jason Dillaman
  0 siblings, 2 replies; 6+ messages in thread
From: Mohamad Gebai @ 2017-05-02 21:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: anjaneya.chagam, ceph-devel

On 05/02/2017 04:47 PM, Sage Weil wrote:
> On Tue, 2 May 2017, Mohamad Gebai wrote:
>> Image [2] shows a clearer view of the call stack, and Image [3] a clearer
>> view of the critical path.
>>
>> The code is in branch [4].
> These are awesome! What is the tool you're screenshotting to visualize?
Sorry, I forgot to mention that part. The viewing tool is TraceCompass 
(use with an SSD because the traces are quite big).
> It would be great to apply the build options based on an option passed to
> cmake (e.g., -DWITH_INSTRUMENT_FUNCTIONS).
That's definitely an option; I'll submit a patch. The catch is that it 
also needs -no-pie in order to resolve the function addresses at 
analysis time.
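
Something like this, I imagine (a hypothetical sketch only - the option 
name is from your mail, and the exact wiring in the patch may differ):

  # CMakeLists.txt
  option(WITH_INSTRUMENT_FUNCTIONS
         "Instrument all functions for LTTng tracing" OFF)
  if(WITH_INSTRUMENT_FUNCTIONS)
    add_compile_options(-finstrument-functions -fno-pie)
    # link without PIE so addresses resolve at analysis time
    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -no-pie")
  endif()
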
> Also, I wonder if these are a better/simpler alternative to the
> FUNCTRACE() macro in src/common/EventTrace.h.  What kind of overhead is
> introduced?  LTTng lets you enable specific tracepoints, right?
Indeed, but these all go through the same pair of tracepoints 
(function_entry and function_exit, with different payloads depending on 
the function being called). We could also limit the stack depth, i.e. 
stop tracing after a certain number of nested function calls.
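
For example, with the stock liblttng-ust-cyg-profile names (my branch 
uses its own, equivalent pair), one could record only the entry/exit 
events, or attach a filter on the event payload:

  lttng enable-event --userspace lttng_ust_cyg_profile:func_entry
  lttng enable-event --userspace lttng_ust_cyg_profile:func_exit
  # or, alternatively, filter on the instrumented function's address
  # (the address value here is just an example):
  lttng enable-event --userspace lttng_ust_cyg_profile:func_entry \
      --filter 'addr == 0x7f3aa2c01230'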

I've done some benchmarking of LTTng in the past. On average, a kernel 
tracepoint costs less than 100ns and a userspace one around 150ns. The 
tracer also hits the disk periodically to flush its buffers, so the 
overhead is definitely non-negligible. I'll write a how-to, as there 
are quite a few steps to get all of this running.
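
(In the meantime, one knob that helps with the flushing: larger and 
more numerous sub-buffers make the consumer hit the disk less often. 
The sizes below are arbitrary examples:)

  lttng enable-channel --userspace --subbuf-size=16M --num-subbuf=8 big
  lttng enable-event --userspace --channel=big 'lttng_ust_cyg_profile:*'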

Mohamad


* Re: Tracing Ceph results
  2017-05-02 21:05   ` Mohamad Gebai
@ 2017-05-03  0:49     ` Mark Nelson
  2017-05-03 13:45       ` Mohamad Gebai
  2017-05-03 13:43     ` Jason Dillaman
  1 sibling, 1 reply; 6+ messages in thread
From: Mark Nelson @ 2017-05-03  0:49 UTC (permalink / raw)
  To: Mohamad Gebai, Sage Weil; +Cc: anjaneya.chagam, ceph-devel

On 05/02/2017 04:05 PM, Mohamad Gebai wrote:
> On 05/02/2017 04:47 PM, Sage Weil wrote:
>> On Tue, 2 May 2017, Mohamad Gebai wrote:
>>> Image [2] shows a clearer view of the call stack, and Image [3] a
>>> clearer view of the critical path.
>>>
>>> The code is in branch [4].
>> These are awesome! What is the tool you're screenshotting to visualize?
> Sorry, I forgot to mention that part. The viewing tool is TraceCompass
> (use with an SSD because the traces are quite big).
>> It would be great to apply the build options based on an option passed to
>> cmake (e.g., -DWITH_INSTRUMENT_FUNCTIONS).
> That's definitely an option; I'll submit a patch. The catch is that it
> also needs -no-pie in order to resolve the function addresses at
> analysis time.
>> Also, I wonder if these are a better/simpler alternative to the
>> FUNCTRACE() macro in src/common/EventTrace.h.  What kind of overhead is
>> introduced?  LTTng lets you enable specific tracepoints, right?
> Indeed, but these all go through the same pair of tracepoints
> (function_entry and function_exit, with different payloads depending on
> the function being called). We could also limit the stack depth, i.e.
> stop tracing after a certain number of nested function calls.
>
> I've done some benchmarking of LTTng in the past. On average, a kernel
> tracepoint costs less than 100ns and a userspace one around 150ns. The
> tracer also hits the disk periodically to flush its buffers, so the
> overhead is definitely non-negligible. I'll write a how-to, as there
> are quite a few steps to get all of this running.

Awesome job Mohamad!  It'd be sweet if we could automate the tracing and 
generation of plots like this.  At some point I'd like to be able to 
include tracing like this during CBT benchmark runs.

Mark

>
> Mohamad


* Re: Tracing Ceph results
  2017-05-02 21:05   ` Mohamad Gebai
  2017-05-03  0:49     ` Mark Nelson
@ 2017-05-03 13:43     ` Jason Dillaman
  1 sibling, 0 replies; 6+ messages in thread
From: Jason Dillaman @ 2017-05-03 13:43 UTC (permalink / raw)
  To: Mohamad Gebai; +Cc: Sage Weil, Chagam, Anjaneya, ceph-devel

On Tue, May 2, 2017 at 5:05 PM, Mohamad Gebai <mgebai@suse.com> wrote:
> I'll write a how-to as there are quite a few steps to get all of this
> running.

Awesome ... thanks!

-- 
Jason


* Re: Tracing Ceph results
  2017-05-03  0:49     ` Mark Nelson
@ 2017-05-03 13:45       ` Mohamad Gebai
  0 siblings, 0 replies; 6+ messages in thread
From: Mohamad Gebai @ 2017-05-03 13:45 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: anjaneya.chagam, ceph-devel

On 05/02/2017 08:49 PM, Mark Nelson wrote:
> On 05/02/2017 04:05 PM, Mohamad Gebai wrote:
>>
>> I've done some benchmarking of LTTng in the past. On average, a kernel
>> tracepoint costs less than 100ns and a userspace one around 150ns. The
>> tracer also hits the disk periodically to flush its buffers, so the
>> overhead is definitely non-negligible. I'll write a how-to, as there
>> are quite a few steps to get all of this running.
>
> Awesome job Mohamad!  It'd be sweet if we could automate the tracing
> and generation of plots like this.  At some point I'd like to be able
> to include tracing like this during CBT benchmark runs.
>
Thanks Mark! I'll try to play with CBT and tracing automation, and I'll 
let you know if I find anything interesting.

Mohamad


Thread overview: 6+ messages
2017-05-02 20:21 Tracing Ceph results Mohamad Gebai
2017-05-02 20:47 ` Sage Weil
2017-05-02 21:05   ` Mohamad Gebai
2017-05-03  0:49     ` Mark Nelson
2017-05-03 13:45       ` Mohamad Gebai
2017-05-03 13:43     ` Jason Dillaman
