* Questions about journals, performance and disk utilization.
@ 2013-01-22 19:59 martin
2013-01-22 21:16 ` Mark Nelson
0 siblings, 1 reply; 12+ messages in thread
From: martin @ 2013-01-22 19:59 UTC (permalink / raw)
To: ceph-devel
Hi list,
In a mixed SSD & SATA setup (5 or 8 nodes, each holding 8x SATA and 4x
SSD), would it make sense to skip having journals on SSD, or is the
advantage of doing so just too great? We're looking into having 2 pools,
sata and ssd, and will be creating guests belonging to either of these
groups based on whether they require high/heavy IO.
Also, we currently lean toward a very simple setup using a
server board with 8x onboard RAID slots (LSI 2308) and 6x onboard SATA
slots, attaching all disks to both the onboard controller and the onboard
slots (for cost and simplicity) and just passing them along as JBOD.
Any suggestions/input about:
- Would it make sense to drop onboard controller and aim for a better
controller (cache/battery backed 12-16 port one)
- Attach another cheapo JBOD card like SAS2008/LSI 2308 etc.
- or just go with this setup (to keep it simpler and cheaper)
Journals:
- Would it make sense to kill say 1 ssd and 1 sata and attach 2 fast
SSD for journals? Or would that be 'redundant' in our case since we
already have a pool with sata and ssd (we do not expect heavy io in the
sata pool)
Rbd striping:
- Performance - afaik rbd is striped over objects; if one would create
say a 20GB rbd image would this mostly be striped over very few
objects/pg (say ~3 nodes as would be min. in our setup) or would one
expect it to be striped over pretty much the entirety of the nodes (5 or
8 in our case) in smaller objects (or even across all OSD?)
Disks:
- Any advice for SATA disks? I know a vendor like Seagate have their
'normal' enterprise disks (ES.3-models) and are also selling their
cloud-based disks (CS models). Any suggestions/experience what to look
at/aim at? Or what are people using in general?
Disk utilization:
- I've noticed in our test setup that we have several PGs taking up more
than 300GB of data each - is this normal? This results in some odd
situations where disk usage can vary by up to 15-20% (2TB disks). If we
adjust the weight, eventually one of these PGs moves to another disk and
300GB of data has to be copied. We're using 0.56.1.
Some output from 'ceph pg dump':
pg_stat objects mip degr unf bytes        log    disklog state        state_stamp                v           reported     up    acting last_scrub  scrub_stamp                last_deep_scrub deep_scrub_stamp
4.5     90772   0   0    0   379301388412 150969 150969  active+clean 2013-01-22 00:07:13.384272 2827'412414 2795'3317565 [1,2] [1,2]  2827'397587 2013-01-22 00:07:13.384225 2744'299767     2013-01-17 05:40:40.737279
Results in disk usage like:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdd1       1.9T  1.4T  446G  77% /srv/ceph/osd5
/dev/sdb1       1.4T  1.1T  331G  77% /srv/ceph/osd0
/dev/sda1       1.9T  1.4T  442G  77% /srv/ceph/osd1
/dev/sdc1       1.9T  1.8T   84G  96% /srv/ceph/osd2
If we reweight sdc down (even by 0.00X % at a time), one of those big
PGs eventually moves to one of the other disks above, and the picture
looks exactly the same except that a different disk now has 96%
usage instead (I've bumped cluster full % to 98% in this setup).
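To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the 'bytes' and 'objects' values from the pg dump and the 1.9T size from the df output. It assumes df's "T" means TiB (as with 'df -h'), and the variable names are purely illustrative.

```python
# Figures copied from the 'ceph pg dump' line for PG 4.5 above.
PG_BYTES = 379301388412
PG_OBJECTS = 90772

# Assumption: df's "1.9T" is in binary units (TiB), as 'df -h' reports.
DISK_TIB = 1.9

GIB = 2 ** 30
MIB = 2 ** 20

pg_gib = PG_BYTES / GIB                       # size of this one PG in GiB
avg_object_mib = PG_BYTES / PG_OBJECTS / MIB  # mean object size in MiB
share_of_disk = pg_gib / (DISK_TIB * 1024)    # fraction of one OSD disk

print(f"PG size:          {pg_gib:.0f} GiB")
print(f"Avg object size:  {avg_object_mib:.2f} MiB")
print(f"Share of a disk:  {share_of_disk:.0%}")
```

Under those assumptions a single PG of this size is roughly 353 GiB, a bit under a fifth of one 1.9T disk, which is in line with the 15-20% swings described above. The average object comes out just under 4 MiB.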
Apologies up front if questions like these are not supposed to go to
this mailing list.
Any advice/ideas/suggestions are very welcome!
Cheers,
Martin Nielsen
* Re: Questions about journals, performance and disk utilization.
2013-01-22 19:59 Questions about journals, performance and disk utilization martin
@ 2013-01-22 21:16 ` Mark Nelson
2013-01-22 21:26 ` Jeff Mitchell
0 siblings, 1 reply; 12+ messages in thread
From: Mark Nelson @ 2013-01-22 21:16 UTC (permalink / raw)
To: martin; +Cc: ceph-devel
On 01/22/2013 01:59 PM, martin wrote:
> Hi list,
>
> In a mixed SSD & SATA setup (5 or 8 nodes each holding 8x SATA and 4x
> SSD) would it make sense to skip having journals on SSD or is the
> advantage of doing so just too great? We're looking into having 2 pools,
> sata and ssd and will be creating guests belonging into either of these
> groups based on if they require high/heavy io.
>
> Also, we currently lean on going with a very simple setup using a
> serverboard with 8x onboard raid slots (LSI 2308) and 6x onboard sata
> slots and just attach all disks to both onboard controller and onboard
> slots (for cost and simplicity) - and just pass them along as JBOD.
>
> Any suggestions/input about:
> - Would it make sense to drop onboard controller and aim for a better
> controller (cache/battery backed 12-16 port one)
> - Attach another cheapo JBOD card like SAS2008/LSI 2308 etc.
> - or just go with this setup (to keep it simpler and cheaper)
You may be interested in reading some of our past performance articles:
http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
We didn't test the on-board controller on the supermicro MB, but those
will at least give you some idea of what different controllers can do
with and without SSDs.
>
> Journals:
> - Would it make sense to kill say 1 ssd and 1 sata and attach 2 fast SSD
> for journals? Or would that be 'redundant' in our case since we already
> have a pool with sata and ssd (we do not expect heavy io in the sata pool)
>
> Rbd striping:
> - Performance - afaik rbd is striped over objects; if one would create
> say a 20GB rbd image would this mostly be striped over very few
> objects/pg (say ~3 nodes as would be min. in our setup) or would one
> expect it to be striped over pretty much the entirety of the nodes (5 or
> 8 in our case) in smaller objects (or even across all OSD?)
>
> Disks:
> - Any advice for SATA disks? I know a vendor like Seagate have their
> 'normal' enterprise disks (ES.3-models) and are also selling their
> cloud-based disks (CS models). Any suggestions/experience what to look
> at/aim at? Or what are people using in general?
>
I've been using 1TB Seagate Constellation Enterprise SATA drives for
testing and have had mostly good luck (1 or 2 duds out of 36) with no
failures. Long term experience for me is that all vendors seem to have
bad batches here and there.
> Disk utilization:
> - I've noticed in our testsetup that we have several pg's taking up
> >300GB data each - is this normal? This results in some odd situations
> where disk usage can vary by up to 15-20% (2TB disks). If we adjust the
> weight it eventually means one of these pg will go to another disk and
> it has to copy 300GB data. We're using 0.56.1.
>
> Some output from 'ceph pg dump':
> pg_stat objects mip degr unf bytes        log    disklog state        state_stamp                v           reported     up    acting last_scrub  scrub_stamp                last_deep_scrub deep_scrub_stamp
> 4.5     90772   0   0    0   379301388412 150969 150969  active+clean 2013-01-22 00:07:13.384272 2827'412414 2795'3317565 [1,2] [1,2]  2827'397587 2013-01-22 00:07:13.384225 2744'299767     2013-01-17 05:40:40.737279
>
> Results in disk usage like:
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdd1       1.9T  1.4T  446G  77% /srv/ceph/osd5
> /dev/sdb1       1.4T  1.1T  331G  77% /srv/ceph/osd0
> /dev/sda1       1.9T  1.4T  442G  77% /srv/ceph/osd1
> /dev/sdc1       1.9T  1.8T   84G  96% /srv/ceph/osd2
>
> If we reweight sdc down (even with 0.00X % at a time) one of those big
> pg's will eventually move to any one of the above disks and the image
> will look exactly the same with the exception another disk will have 96%
> usage instead (I've bumped cluster full % to 98% in this setup).
>
It may (or may not) help to use a power-of-2 number of PGs. It's
generally a good idea to do this anyway, so if you haven't set up your
production cluster yet, you may want to play around with this.
Basically, just take whatever number you were planning on using and round
it up (or down slightly). For example, if you were going to use 7,000 PGs,
round up to 8192.
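The rounding suggested above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is made up for this example and is not part of any Ceph tool.

```python
def nearest_powers_of_two(pg_target):
    """Return (lower, upper): the powers of two that bracket pg_target.

    If pg_target is already a power of two, both values equal pg_target.
    Hypothetical helper for illustration only.
    """
    if pg_target < 1:
        raise ValueError("pg_target must be a positive integer")
    # Smallest power of two that is >= pg_target.
    upper = 1 << (pg_target - 1).bit_length()
    # Largest power of two that is <= pg_target.
    lower = upper if upper == pg_target else upper >> 1
    return lower, upper


lower, upper = nearest_powers_of_two(7000)
print(f"7000 PGs -> round down to {lower} or up to {upper}")
```

For a planned count of 7,000 this gives 4096 and 8192 as the two candidates. Which one to pick is a judgment call that depends on the cluster.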
Mark
> Apologies up front if questions like these are not supposed to go to
> this mailing list.
>
> Any advice/ideas/suggestions are very welcome!
>
> Cheers,
> Martin Nielsen
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:16 ` Mark Nelson
@ 2013-01-22 21:26 ` Jeff Mitchell
2013-01-22 21:50 ` Stefan Priebe
0 siblings, 1 reply; 12+ messages in thread
From: Jeff Mitchell @ 2013-01-22 21:26 UTC (permalink / raw)
To: Mark Nelson; +Cc: martin, ceph-devel
Mark Nelson wrote:
> It may (or may not) help to use a power-of-2 number of PGs. It's
> generally a good idea to do this anyway, so if you haven't set up your
> production cluster yet, you may want to play around with this. Basically
> just take whatever number you were planning on using and round it up (or
> down slightly). IE if you were going to use 7,000 PGs, round up to 8192.
As I was asking about earlier on IRC, I'm in a situation where the docs
did not mention this in the section about calculating PGs so I have a
non-power-of-2 -- and since there are some production things running on
that pool I can't currently change it.
If indeed that makes a difference, here's one vote for a resilvering
mechanism :-)
Alternately, if I stand up a second pool, is there any easy way to
(offline) migrate an RBD from one to the other? (Knowing that this means
I'd have to update state with anything using it, after.) The only thing
I know of right now is to make a second RBD, map both to a client, and dd.
Thanks,
Jeff
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:26 ` Jeff Mitchell
@ 2013-01-22 21:50 ` Stefan Priebe
2013-01-22 21:56 ` Jeff Mitchell
2013-01-22 21:57 ` Mark Nelson
0 siblings, 2 replies; 12+ messages in thread
From: Stefan Priebe @ 2013-01-22 21:50 UTC (permalink / raw)
To: Jeff Mitchell; +Cc: Mark Nelson, martin, ceph-devel
Hi,
Am 22.01.2013 22:26, schrieb Jeff Mitchell:
> Mark Nelson wrote:
>> It may (or may not) help to use a power-of-2 number of PGs. It's
>> generally a good idea to do this anyway, so if you haven't set up your
>> production cluster yet, you may want to play around with this. Basically
>> just take whatever number you were planning on using and round it up (or
>> down slightly). IE if you were going to use 7,000 PGs, round up to 8192.
>
> As I was asking about earlier on IRC, I'm in a situation where the docs
> did not mention this in the section about calculating PGs so I have a
> non-power-of-2 -- and since there are some production things running on
> that pool I can't currently change it.
Oh, same thing here. Did I miss the doc, or can someone point me to the
location?
Is there a chance to change the number of PGs for a pool?
Greets,
Stefan
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:50 ` Stefan Priebe
@ 2013-01-22 21:56 ` Jeff Mitchell
2013-01-22 21:58 ` Stefan Priebe
2013-01-22 21:57 ` Mark Nelson
1 sibling, 1 reply; 12+ messages in thread
From: Jeff Mitchell @ 2013-01-22 21:56 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Mark Nelson, martin, ceph-devel
Stefan Priebe wrote:
> Hi,
> Am 22.01.2013 22:26, schrieb Jeff Mitchell:
>> Mark Nelson wrote:
>>> It may (or may not) help to use a power-of-2 number of PGs. It's
>>> generally a good idea to do this anyway, so if you haven't set up your
>>> production cluster yet, you may want to play around with this.
>>> Basically
>>> just take whatever number you were planning on using and round it up
>>> (or
>>> down slightly). IE if you were going to use 7,000 PGs, round up to
>>> 8192.
>>
>> As I was asking about earlier on IRC, I'm in a situation where the docs
>> did not mention this in the section about calculating PGs so I have a
>> non-power-of-2 -- and since there are some production things running on
>> that pool I can't currently change it.
>
> Oh same thing here - did i miss the doc or can someone point me the
> location.
Here you go: http://ceph.com/docs/master/rados/operations/placement-groups/
(Notice the lack of any power-of-2 mention :-) )
--Jeff
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:50 ` Stefan Priebe
2013-01-22 21:56 ` Jeff Mitchell
@ 2013-01-22 21:57 ` Mark Nelson
2013-01-22 21:58 ` Jeff Mitchell
2013-01-22 22:00 ` Gregory Farnum
1 sibling, 2 replies; 12+ messages in thread
From: Mark Nelson @ 2013-01-22 21:57 UTC (permalink / raw)
To: Stefan Priebe; +Cc: Jeff Mitchell, martin, ceph-devel
On 01/22/2013 03:50 PM, Stefan Priebe wrote:
> Hi,
> Am 22.01.2013 22:26, schrieb Jeff Mitchell:
>> Mark Nelson wrote:
>>> It may (or may not) help to use a power-of-2 number of PGs. It's
>>> generally a good idea to do this anyway, so if you haven't set up your
>>> production cluster yet, you may want to play around with this. Basically
>>> just take whatever number you were planning on using and round it up (or
>>> down slightly). IE if you were going to use 7,000 PGs, round up to 8192.
>>
>> As I was asking about earlier on IRC, I'm in a situation where the docs
>> did not mention this in the section about calculating PGs so I have a
>> non-power-of-2 -- and since there are some production things running on
>> that pool I can't currently change it.
>
> Oh same thing here - did i miss the doc or can someone point me the
> location.
>
> Is there a chance to change the number of PGs for a pool?
>
> Greets,
> Stefan
Honestly I don't know if it will actually have a significant effect.
ceph_stable_mod will map things optimally when pg_num is a power of 2,
but that's only part of how things work. It may not matter very much
with high PG counts.
Mark
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:57 ` Mark Nelson
@ 2013-01-22 21:58 ` Jeff Mitchell
2013-01-23 0:25 ` Josh Durgin
2013-01-22 22:00 ` Gregory Farnum
1 sibling, 1 reply; 12+ messages in thread
From: Jeff Mitchell @ 2013-01-22 21:58 UTC (permalink / raw)
To: Mark Nelson; +Cc: Stefan Priebe, martin, ceph-devel
Mark Nelson wrote:
> On 01/22/2013 03:50 PM, Stefan Priebe wrote:
>> Hi,
>> Am 22.01.2013 22:26, schrieb Jeff Mitchell:
>>> Mark Nelson wrote:
>>>> It may (or may not) help to use a power-of-2 number of PGs. It's
>>>> generally a good idea to do this anyway, so if you haven't set up your
>>>> production cluster yet, you may want to play around with this.
>>>> Basically
>>>> just take whatever number you were planning on using and round it
>>>> up (or
>>>> down slightly). IE if you were going to use 7,000 PGs, round up to
>>>> 8192.
>>>
>>> As I was asking about earlier on IRC, I'm in a situation where the docs
>>> did not mention this in the section about calculating PGs so I have a
>>> non-power-of-2 -- and since there are some production things running on
>>> that pool I can't currently change it.
>>
>> Oh same thing here - did i miss the doc or can someone point me the
>> location.
>>
>> Is there a chance to change the number of PGs for a pool?
>>
>> Greets,
>> Stefan
>
> Honestly I don't know if it will actually have a significant effect.
> ceph_stable_mod will map things optimally when pg_num is a power of 2,
> but that's only part of how things work. It may not matter very much
> with high PG counts.
Yeah, that's why I said *if* it matters, once someone runs suitable benchmarks, please provide a resilvering mechanism :-)
I'd be interested in figuring out the right way to migrate an RBD from one pool to another regardless.
--Jeff
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:56 ` Jeff Mitchell
@ 2013-01-22 21:58 ` Stefan Priebe
0 siblings, 0 replies; 12+ messages in thread
From: Stefan Priebe @ 2013-01-22 21:58 UTC (permalink / raw)
To: Jeff Mitchell; +Cc: Mark Nelson, martin, ceph-devel
Hi,
Am 22.01.2013 22:56, schrieb Jeff Mitchell:
>> Oh same thing here - did i miss the doc or can someone point me the
>> location.
>
> Here you go: http://ceph.com/docs/master/rados/operations/placement-groups/
I know that one... but power of 2 is missing.
> (Notice the lack of any power-of-2 mention :-) )
Greets,
Stefan
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:57 ` Mark Nelson
2013-01-22 21:58 ` Jeff Mitchell
@ 2013-01-22 22:00 ` Gregory Farnum
2013-01-22 22:11 ` Mark Nelson
1 sibling, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2013-01-22 22:00 UTC (permalink / raw)
To: Mark Nelson; +Cc: Stefan Priebe, Jeff Mitchell, martin, ceph-devel
On Tuesday, January 22, 2013 at 1:57 PM, Mark Nelson wrote:
> On 01/22/2013 03:50 PM, Stefan Priebe wrote:
> > Hi,
> > Am 22.01.2013 22:26, schrieb Jeff Mitchell:
> > > Mark Nelson wrote:
> > > > It may (or may not) help to use a power-of-2 number of PGs. It's
> > > > generally a good idea to do this anyway, so if you haven't set up your
> > > > production cluster yet, you may want to play around with this. Basically
> > > > just take whatever number you were planning on using and round it up (or
> > > > down slightly). IE if you were going to use 7,000 PGs, round up to 8192.
> > >
> > >
> > >
> > > As I was asking about earlier on IRC, I'm in a situation where the docs
> > > did not mention this in the section about calculating PGs so I have a
> > > non-power-of-2 -- and since there are some production things running on
> > > that pool I can't currently change it.
> >
> >
> >
> > Oh same thing here - did i miss the doc or can someone point me the
> > location.
> >
> > Is there a chance to change the number of PGs for a pool?
> >
> > Greets,
> > Stefan
>
>
>
> Honestly I don't know if it will actually have a significant effect.
> ceph_stable_mod will map things optimally when pg_num is a power of 2,
> but that's only part of how things work. It may not matter very much
> with high PG counts.
IIRC, having a non-power of 2 count means that the extra PGs (above the lower-bounding power of 2) will be twice the size of the other PGs. For reasonable PG counts this should not cause any problems.
-Greg
* Re: Questions about journals, performance and disk utilization.
2013-01-22 22:00 ` Gregory Farnum
@ 2013-01-22 22:11 ` Mark Nelson
0 siblings, 0 replies; 12+ messages in thread
From: Mark Nelson @ 2013-01-22 22:11 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Stefan Priebe, Jeff Mitchell, martin, ceph-devel
On 01/22/2013 04:00 PM, Gregory Farnum wrote:
> On Tuesday, January 22, 2013 at 1:57 PM, Mark Nelson wrote:
>> On 01/22/2013 03:50 PM, Stefan Priebe wrote:
>>> Hi,
>>> Am 22.01.2013 22:26, schrieb Jeff Mitchell:
>>>> Mark Nelson wrote:
>>>>> It may (or may not) help to use a power-of-2 number of PGs. It's
>>>>> generally a good idea to do this anyway, so if you haven't set up your
>>>>> production cluster yet, you may want to play around with this. Basically
>>>>> just take whatever number you were planning on using and round it up (or
>>>>> down slightly). IE if you were going to use 7,000 PGs, round up to 8192.
>>>>
>>>>
>>>>
>>>> As I was asking about earlier on IRC, I'm in a situation where the docs
>>>> did not mention this in the section about calculating PGs so I have a
>>>> non-power-of-2 -- and since there are some production things running on
>>>> that pool I can't currently change it.
>>>
>>>
>>>
>>> Oh same thing here - did i miss the doc or can someone point me the
>>> location.
>>>
>>> Is there a chance to change the number of PGs for a pool?
>>>
>>> Greets,
>>> Stefan
>>
>>
>>
>> Honestly I don't know if it will actually have a significant effect.
>> ceph_stable_mod will map things optimally when pg_num is a power of 2,
>> but that's only part of how things work. It may not matter very much
>> with high PG counts.
>
> IIRC, having a non-power of 2 count means that the extra PGs (above the lower-bounding power of 2) will be twice the size of the other PGs. For reasonable PG counts this should not cause any problems.
> -Greg
>
Hrm, for some reason I thought there was more to it than that. I
suppose you really are just at the mercy of the distribution
of big PGs vs small PGs on each OSD.
A while back I was talking to Sage about doing something like (forgive
the python):
def ceph_stable_mod2(x, b, bmask):
    if (x & bmask) < b:
        return x & bmask
    else:
        return x % b
but that doesn't give as nice splitting behaviour. Still, unless I'm
missing something, isn't splitting kind of a rare event anyway?
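To make the trade-off concrete, here is a small Python sketch comparing the two mappings. It assumes the stock ceph_stable_mod falls back to `x & (bmask >> 1)` when `(x & bmask) >= b`; that form is reproduced from memory of the Ceph source, not taken from this thread. The pg_num of 12, the use of consecutive integers as a stand-in for uniformly distributed hashes, and the helper names are all illustrative.

```python
from collections import Counter


def ceph_stable_mod(x, b, bmask):
    # Assumed form of the stock function (not quoted in this thread).
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def ceph_stable_mod2(x, b, bmask):
    # Variant proposed in the message above.
    if (x & bmask) < b:
        return x & bmask
    return x % b


def mask_for(pg_num):
    # Smallest (power of two) - 1 that covers pg_num - 1.
    return (1 << (pg_num - 1).bit_length()) - 1


def pg_histogram(mod_fn, pg_num, samples):
    # Count how many sample "hashes" land in each PG.
    bmask = mask_for(pg_num)
    return Counter(mod_fn(x, pg_num, bmask) for x in range(samples))


def moved_on_growth(mod_fn, old_pg_num, new_pg_num, samples):
    # Count samples whose PG changes when pg_num grows.
    old_mask, new_mask = mask_for(old_pg_num), mask_for(new_pg_num)
    return sum(
        mod_fn(x, old_pg_num, old_mask) != mod_fn(x, new_pg_num, new_mask)
        for x in range(samples)
    )


PG_NUM = 12            # hypothetical non-power-of-2 pg_num
SAMPLES = 48 * 1000    # multiple of lcm(16, 12) so the counts come out even

stock = pg_histogram(ceph_stable_mod, PG_NUM, SAMPLES)
variant = pg_histogram(ceph_stable_mod2, PG_NUM, SAMPLES)
stock_moved = moved_on_growth(ceph_stable_mod, PG_NUM, PG_NUM + 1, SAMPLES)
variant_moved = moved_on_growth(ceph_stable_mod2, PG_NUM, PG_NUM + 1, SAMPLES)

for pg in range(PG_NUM):
    print(f"pg {pg:2d}: stock={stock[pg]:5d} variant={variant[pg]:5d}")
print(f"moved when growing 12 -> 13: stock={stock_moved} variant={variant_moved}")
```

In this sketch the stock mapping leaves PGs 4-7 (the ones that have not yet been split) with twice as many samples as the rest, while the variant spreads the samples evenly. Growing pg_num by one moves only the samples that split off into the new PG under the stock mapping, whereas the variant reshuffles several times as many, which is the "not as nice splitting behaviour" mentioned above.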
Mark
* Re: Questions about journals, performance and disk utilization.
2013-01-22 21:58 ` Jeff Mitchell
@ 2013-01-23 0:25 ` Josh Durgin
2013-01-23 2:23 ` Jeff Mitchell
0 siblings, 1 reply; 12+ messages in thread
From: Josh Durgin @ 2013-01-23 0:25 UTC (permalink / raw)
To: Jeff Mitchell; +Cc: Mark Nelson, Stefan Priebe, martin, ceph-devel
On 01/22/2013 01:58 PM, Jeff Mitchell wrote:
> I'd be interested in figuring out the right way to migrate an RBD from
> one pool to another regardless.
Each way involves copying data, since by definition a different pool
will use different placement groups.
You could export/import with the rbd tool, do a manual dd like you
mentioned, or clone and then flatten in the new pool. The simplest
is probably 'rbd cp pool1/image pool2/image'.
Josh
* Re: Questions about journals, performance and disk utilization.
2013-01-23 0:25 ` Josh Durgin
@ 2013-01-23 2:23 ` Jeff Mitchell
0 siblings, 0 replies; 12+ messages in thread
From: Jeff Mitchell @ 2013-01-23 2:23 UTC (permalink / raw)
To: Josh Durgin; +Cc: Mark Nelson, Stefan Priebe, martin, ceph-devel
On Tue, Jan 22, 2013 at 7:25 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On 01/22/2013 01:58 PM, Jeff Mitchell wrote:
>>
>> I'd be interested in figuring out the right way to migrate an RBD from
>> one pool to another regardless.
>
>
> Each way involves copying data, since by definition a different pool
> will use different placement groups.
>
> You could export/import with the rbd tool, do a manual dd like you
> mentioned, or clone and then flatten in the new pool. The simplest
> is probably 'rbd cp pool1/image pool2/image'.
Awesome -- I didn't know about 'rbd cp', and I'll have to look into
cloning/flattening. Thanks for the info.
--Jeff