* Re: [ceph-users] mon IO usage
       [not found] <CAF6-1L4oWCwNxywALb=cUNP_pbD=ND631MJqvCWyvAfvNdWauQ@mail.gmail.com>
@ 2013-05-21 12:57 ` Mike Dawson
       [not found]   ` <519B6F28.9000400-9dgm/EUDD3RBYT3KYJiKsA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Dawson @ 2013-05-21 12:57 UTC (permalink / raw)
  To: Sylvain Munaut; +Cc: ceph-users, ceph-devel

Sylvain,

I can confirm I see a similar traffic pattern.

Any time I have lots of writes going to my cluster (like heavy writes 
from RBD or remapping/backfilling after losing an OSD), I see all sorts 
of monitor issues.

If my monitor leveldb store.db directories grow past some unknown point 
(maybe ~1GB or so), 'compact on trim' can no longer keep up: the 
store.db grows faster than compaction can trim the garbage. After that 
point, the only way to rein in the store.db size is to stop the OSDs 
and let leveldb compact without any ongoing writes.
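
For what it's worth, I believe there is a 'mon compact on start' option that
forces a full leveldb compaction when the mon restarts, which might get the
size back down without leaving the OSDs stopped for long. I haven't checked
which release it first appeared in, so treat this as a sketch only:

    # assumption: 'mon compact on start' exists in your release; mon id 'a'
    # in /etc/ceph/ceph.conf:
    [mon]
        mon compact on start = true

    # then restart the monitor so it compacts its store.db on the way up
    service ceph restart mon.a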

I sent Sage and Joao a transaction dump of the growth yesterday. Sage 
looked, but the files are so large it is tough to get useful info.

http://tracker.ceph.com/issues/4895

I believe this issue has existed since 0.48.

- Mike

On 5/21/2013 8:16 AM, Sylvain Munaut wrote:
> Hi,
>
>
> I've just added some monitoring of the IO usage of the mon (trying to
> track down that growing mon issue), and I'm kind of surprised by the
> amount of IO generated by the monitor process.
>
> I get a continuous 4 MB/s / 75 iops, with big spikes added at each
> compaction every 3 min or so.
>
> Is there a description somewhere of what the monitor does exactly? I
> mean the monmap / pgmap / osdmap / mdsmap / election epoch don't
> change that often (pgmap changes about once per second, and that's the
> fastest change by several orders of magnitude). So what exactly does
> the monitor do with all that IO?
>
>
> Cheers,
>
>      Sylvain
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


* Re: mon IO usage
       [not found]   ` <519B6F28.9000400-9dgm/EUDD3RBYT3KYJiKsA@public.gmane.org>
@ 2013-05-21 13:25     ` Mike Dawson
  2013-05-21 15:52       ` [ceph-users] " Sylvain Munaut
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Dawson @ 2013-05-21 13:25 UTC (permalink / raw)
  To: Sylvain Munaut
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Thanks for the correction on IRC. I should have written that this issue 
started with 0.59 (when the monitor changes hit).

http://ceph.com/dev-notes/cephs-new-monitor-changes/

The writeup and the release notes mention 0.58 in places, but I believe 
the changes actually shipped in 0.59.

Thanks for the correction, Sylvain.

- Mike

On 5/21/2013 8:57 AM, Mike Dawson wrote:
> Sylvain,
>
> I can confirm I see a similar traffic pattern.
>
> Any time I have lots of writes going to my cluster (like heavy writes
> from RBD or remapping/backfilling after losing an OSD), I see all sorts
> of monitor issues.
>
> If my monitor leveldb store.db directories grow past some unknown point
> (maybe ~1GB or so), 'compact on trim' can no longer keep up: the
> store.db grows faster than compaction can trim the garbage. After that
> point, the only way to rein in the store.db size is to stop the OSDs
> and let leveldb compact without any ongoing writes.
>
> I sent Sage and Joao a transaction dump of the growth yesterday. Sage
> looked, but the files are so large it is tough to get useful info.
>
> http://tracker.ceph.com/issues/4895
>
> I believe this issue has existed since 0.48.
>
> - Mike
>
> On 5/21/2013 8:16 AM, Sylvain Munaut wrote:
>> Hi,
>>
>>
>> I've just added some monitoring of the IO usage of the mon (trying to
>> track down that growing mon issue), and I'm kind of surprised by the
>> amount of IO generated by the monitor process.
>>
>> I get a continuous 4 MB/s / 75 iops, with big spikes added at each
>> compaction every 3 min or so.
>>
>> Is there a description somewhere of what the monitor does exactly? I
>> mean the monmap / pgmap / osdmap / mdsmap / election epoch don't
>> change that often (pgmap changes about once per second, and that's the
>> fastest change by several orders of magnitude). So what exactly does
>> the monitor do with all that IO?
>>
>>
>> Cheers,
>>
>>      Sylvain
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>


* Re: [ceph-users] mon IO usage
  2013-05-21 13:25     ` Mike Dawson
@ 2013-05-21 15:52       ` Sylvain Munaut
  2013-05-21 15:56         ` Gregory Farnum
  2013-05-21 15:57         ` Sage Weil
  0 siblings, 2 replies; 8+ messages in thread
From: Sylvain Munaut @ 2013-05-21 15:52 UTC (permalink / raw)
  To: Mike Dawson; +Cc: ceph-users, ceph-devel

So, AFAICT, the bulk of the writes would be writing out the pgmap to
disk every second or so.

Does it really need to be written out in full? It doesn't change all that
much AFAICT, so writing incremental changes with only a periodic full
flush might be a better option?

Cheers,

     Sylvain


* Re: [ceph-users] mon IO usage
  2013-05-21 15:52       ` [ceph-users] " Sylvain Munaut
@ 2013-05-21 15:56         ` Gregory Farnum
  2013-05-21 15:57         ` Sage Weil
  1 sibling, 0 replies; 8+ messages in thread
From: Gregory Farnum @ 2013-05-21 15:56 UTC (permalink / raw)
  To: Sylvain Munaut; +Cc: Mike Dawson, ceph-users, ceph-devel

On Tue, May 21, 2013 at 8:52 AM, Sylvain Munaut
<s.munaut@whatever-company.com> wrote:
> So, AFAICT, the bulk of the writes would be writing out the pgmap to
> disk every second or so.
>
> Does it really need to be written out in full? It doesn't change all that
> much AFAICT, so writing incremental changes with only a periodic full
> flush might be a better option?

Yeah; this is definitely in our heads as work we'd like to get done.
LevelDB is costing us more throughput than we were expecting, so people
are running into trouble much earlier than we anticipated.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: [ceph-users] mon IO usage
  2013-05-21 15:52       ` [ceph-users] " Sylvain Munaut
  2013-05-21 15:56         ` Gregory Farnum
@ 2013-05-21 15:57         ` Sage Weil
  2013-05-21 16:05           ` Sylvain Munaut
  1 sibling, 1 reply; 8+ messages in thread
From: Sage Weil @ 2013-05-21 15:57 UTC (permalink / raw)
  To: Sylvain Munaut; +Cc: Mike Dawson, ceph-users, ceph-devel

On Tue, 21 May 2013, Sylvain Munaut wrote:
> So, AFAICT, the bulk of the writes would be writing out the pgmap to
> disk every second or so.

It should be writing out the full map only every N commits... see 'paxos 
stash full interval', which defaults to 25.
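
If you want to experiment with it, the interval can be raised in ceph.conf.
Just a sketch -- double-check the option spelling against your release:

    # [mon] section of ceph.conf; 'paxos stash full interval' defaults to 25,
    # the value 100 below is only an example
    [mon]
        paxos stash full interval = 100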

> Does it really need to be written out in full? It doesn't change all that
> much AFAICT, so writing incremental changes with only a periodic full
> flush might be a better option?

Right.  It works this way now only because we haven't fully transitioned 
from the old scheme.  The next step is to store the PGMap over lots of 
leveldb keys (one per pg) so that there is no big encode/decode of the 
entire PGMap structure...

sage


* Re: [ceph-users] mon IO usage
  2013-05-21 15:57         ` Sage Weil
@ 2013-05-21 16:05           ` Sylvain Munaut
       [not found]             ` <CAF6-1L4C1QQHZ_5=3OCATFTCD_As63HEjJLcKsGURAV02PFQPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Sylvain Munaut @ 2013-05-21 16:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mike Dawson, ceph-users, ceph-devel

Hi,

>> So, AFAICT, the bulk of the writes would be writing out the pgmap to
>> disk every second or so.
>
> It should be writing out the full map only every N commits... see 'paxos
> stash full interval', which defaults to 25.

But doesn't it also write it out in full when there is a new pgmap?

I get a new one about every second, and its size * period seemed to
match the IO rate pretty well, which is why I thought it was the reason
for the IO.


>> Does it really need to be written out in full? It doesn't change all that
>> much AFAICT, so writing incremental changes with only a periodic full
>> flush might be a better option?
>
> Right.  It works this way now only because we haven't fully transitioned
> from the old scheme.  The next step is to store the PGMap over lots of
> leveldb keys (one per pg) so that there is no big encode/decode of the
> entire PGMap structure...

Makes sense. I'm not sure about the "per-key" overhead of leveldb though,
in cases where there are lots (> 10k) of PGs.


Cheers,

    Sylvain


* Re: mon IO usage
       [not found]             ` <CAF6-1L4C1QQHZ_5=3OCATFTCD_As63HEjJLcKsGURAV02PFQPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-05-21 16:10               ` Sage Weil
  2013-05-21 18:51                 ` [ceph-users] " Sylvain Munaut
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2013-05-21 16:10 UTC (permalink / raw)
  To: Sylvain Munaut
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, Mike Dawson,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Tue, 21 May 2013, Sylvain Munaut wrote:
> Hi,
> 
> >> So, AFAICT, the bulk of the writes would be writing out the pgmap to
> >> disk every second or so.
> >
> > It should be writing out the full map only every N commits... see 'paxos
> > stash full interval', which defaults to 25.
> 
> But doesn't it also write it out in full when there is a new pgmap?
> 
> I get a new one about every second, and its size * period seemed to
> match the IO rate pretty well, which is why I thought it was the reason
> for the IO.

Hmm.  Can you generate a log with 'debug mon = 20', 'debug paxos = 20', 
'debug ms = 1' for a few minutes over which you see a high data rate and 
send it my way?  It sounds like there is something wrong with the 
stash_full logic.
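
If it's easier than editing ceph.conf and restarting, the levels can usually
be bumped on the running mon, along these lines (the mon id and socket path
below are just examples):

    # sketch: inject the debug settings into a running monitor 'a'
    ceph tell mon.a injectargs '--debug-mon 20 --debug-paxos 20 --debug-ms 1'

    # or set them through the admin socket
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_mon 20
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_paxos 20
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config set debug_ms 1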

Thanks!

> >> Does it really need to be written out in full? It doesn't change all that
> >> much AFAICT, so writing incremental changes with only a periodic full
> >> flush might be a better option?
> >
> > Right.  It works this way now only because we haven't fully transitioned
> > from the old scheme.  The next step is to store the PGMap over lots of
> > leveldb keys (one per pg) so that there is no big encode/decode of the
> > entire PGMap structure...
> 
> Makes sense. I'm not sure about the "per-key" overhead of leveldb though,
> in cases where there are lots (> 10k) of PGs.

Yeah, it will be larger on-disk, but the IO rate will at least be 
proportional to the update rate.  :)

sage


* Re: [ceph-users] mon IO usage
  2013-05-21 16:10               ` Sage Weil
@ 2013-05-21 18:51                 ` Sylvain Munaut
  0 siblings, 0 replies; 8+ messages in thread
From: Sylvain Munaut @ 2013-05-21 18:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mike Dawson, ceph-users, ceph-devel

Hi,

> Hmm.  Can you generate a log with 'debug mon = 20', 'debug paxos = 20',
> 'debug ms = 1' for a few minutes over which you see a high data rate and
> send it my way?  It sounds like there is something wrong with the
> stash_full logic.

Mm, actually I may have been fooled by the instrumentation ... it does a
30 sec average, so looking closer I don't have 4 MB/s constantly;
it's more like a 50 MB burst every 15-20 sec.

In any case, that seems like a lot of data being written.
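
For per-second numbers rather than the 30 sec average, sampling the mon
process directly should show the bursts; a rough sketch, assuming sysstat's
pidstat is available and the daemon process is named ceph-mon:

    # per-second disk IO of the ceph-mon process, to catch the write bursts
    pidstat -d -p $(pidof ceph-mon) 1

    # or watch the kernel's per-process IO counters directly
    watch -n1 cat /proc/$(pidof ceph-mon)/io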

Logs can be downloaded from http://ge.tt/9MOeKHh/v/0


Cheers,

    Sylvain

