* Tuning placement group
@ 2012-07-20 16:33 François Charlier
  2012-07-20 18:08 ` Florian Haas
  2012-07-20 19:43 ` Sage Weil
  0 siblings, 2 replies; 4+ messages in thread
From: François Charlier @ 2012-07-20 16:33 UTC (permalink / raw)
  To: ceph-devel

Hello,

Reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/
and thinking about building a Ceph cluster with potentially 1000 OSDs.

Using the recommendations from the link above, pg_num would need to be
set between 10,000 and 30,000. Okay with that. Let's use the
recommended value of 16,384; that is already about 160 placement
groups per OSD.

What if, for a start, we chose to reach this number of 1000 OSDs
slowly, starting with 100 OSDs? That's now 1600 placement groups per OSD.

What if we chose 30,000 (or 32,768) placement groups to keep room for
expansion?

My question is: how will a Ceph pool behave with 1000, 5000, or even
10,000 placement groups per OSD? Will this impact performance? How
badly? Can it be worked around? Is it a matter of RAM size? CPU usage?

Any hint about this would be much appreciated.

Thanks!


* Re: Tuning placement group
  2012-07-20 16:33 Tuning placement group François Charlier
@ 2012-07-20 18:08 ` Florian Haas
  2012-07-20 19:33   ` Yehuda Sadeh
  2012-07-20 19:43 ` Sage Weil
  1 sibling, 1 reply; 4+ messages in thread
From: Florian Haas @ 2012-07-20 18:08 UTC (permalink / raw)
  To: François Charlier; +Cc: ceph-devel

On Fri, Jul 20, 2012 at 9:33 AM, François Charlier
<francois.charlier@enovance.com> wrote:
> Hello,
>
> Reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/
> and thinking about building a Ceph cluster with potentially 1000 OSDs.
>
> Using the recommendations from the link above, pg_num would need to be
> set between 10,000 and 30,000. Okay with that. Let's use the
> recommended value of 16,384; that is already about 160 placement
> groups per OSD.
>
> What if, for a start, we chose to reach this number of 1000 OSDs
> slowly, starting with 100 OSDs? That's now 1600 placement groups per OSD.
>
> What if we chose 30,000 (or 32,768) placement groups to keep room for
> expansion?
>
> My question is: how will a Ceph pool behave with 1000, 5000, or even
> 10,000 placement groups per OSD? Will this impact performance? How
> badly? Can it be worked around? Is it a matter of RAM size? CPU usage?
>
> Any hint about this would be much appreciated.

If I may, I'd like to add an additional point of consideration,
specifically for radosgw setups:

What's the recommended way to set the number of PGs for the half-dozen
pools that radosgw normally creates on its own (.rgw, .rgw.users,
.rgw.buckets and so on)? I *think* wanting to set a custom number of
PGs would require pre-creating these pools manually, but there may be
a way -- undocumented? -- to instruct radosgw to set a user-configured
number of PGs on pool creation. Insight on that would be much
appreciated.
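
Just to illustrate what I mean by pre-creating manually -- something
along these lines, using the pool names mentioned above, with pg_num
values picked out of thin air:

    # hypothetical: create the rgw pools with a chosen pg_num before
    # radosgw ever starts, so it finds and reuses the existing pools
    ceph osd pool create .rgw 512
    ceph osd pool create .rgw.users 512
    ceph osd pool create .rgw.buckets 8192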

Cheers,
Florian


* Re: Tuning placement group
  2012-07-20 18:08 ` Florian Haas
@ 2012-07-20 19:33   ` Yehuda Sadeh
  0 siblings, 0 replies; 4+ messages in thread
From: Yehuda Sadeh @ 2012-07-20 19:33 UTC (permalink / raw)
  To: Florian Haas; +Cc: François Charlier, ceph-devel

On Fri, Jul 20, 2012 at 11:08 AM, Florian Haas <florian@hastexo.com> wrote:
> If I may, I'd like to add an additional point of consideration,
> specifically for radosgw setups:
>
> What's the recommended way to set the number of PGs for the half-dozen
> pools that radosgw normally creates on its own (.rgw, .rgw.users,
> .rgw.buckets and so on)? I *think* wanting to set a custom number of
> PGs would require pre-creating these pools manually, but there may be
> a way -- undocumented? -- to instruct radosgw to set a user-configured
> number of PGs on pool creation. Insight on that would be much
> appreciated.
>
At the moment there's no way to tell radosgw how many PGs should be in
the pools it creates automatically. One way to work around that is to
create these pools yourself before running radosgw for the first time.
For the data pools, you can modify the set of pools used for data
placement with the radosgw-admin 'pool add', 'pool rm', and 'pool list'
commands. Note that buckets that have already been created will retain
their original pool.
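
For example (the pool name here is made up, and the exact flag syntax
may differ between versions -- check radosgw-admin's help output):

    # sketch: create a data pool with the desired pg_num, then add it
    # to the set of pools radosgw may place bucket data in
    ceph osd pool create .rgw.buckets.large 8192
    radosgw-admin pool add --pool=.rgw.buckets.large
    radosgw-admin pool list
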
Data in pools that were created automatically can now be copied to a
different pool (rados cppool), and pools can now be renamed (ceph osd
pool rename <oldname> <newname>). So you can create a new pool with the
required number of PGs, copy the old data into it, and then rename the
old and new pools. NOTE: this should not be done for the data pool
(.rgw.buckets by default)! It can only be done for the pools that hold
the various indexes and metadata; the bucket index in the data pool
relies on internal PG state, which will break if the pool moves around.
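
Roughly, for one of the index/metadata pools, the sequence would look
something like this (pool name and pg_num are only examples, and it's
probably best done while radosgw is stopped so nothing writes during
the copy):

    # sketch: rebuild a metadata pool (NOT .rgw.buckets) with more PGs
    ceph osd pool create .rgw.new 256      # new pool with the desired pg_num
    rados cppool .rgw .rgw.new             # copy the existing objects over
    ceph osd pool rename .rgw .rgw.old     # keep the old pool around, renamed
    ceph osd pool rename .rgw.new .rgw     # put the new pool in its place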

Yehuda


* Re: Tuning placement group
  2012-07-20 16:33 Tuning placement group François Charlier
  2012-07-20 18:08 ` Florian Haas
@ 2012-07-20 19:43 ` Sage Weil
  1 sibling, 0 replies; 4+ messages in thread
From: Sage Weil @ 2012-07-20 19:43 UTC (permalink / raw)
  To: François Charlier; +Cc: ceph-devel

On Fri, 20 Jul 2012, François Charlier wrote:
> Hello,
> 
> Reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/
> and thinking about building a Ceph cluster with potentially 1000 OSDs.
>
> Using the recommendations from the link above, pg_num would need to be
> set between 10,000 and 30,000. Okay with that. Let's use the
> recommended value of 16,384; that is already about 160 placement
> groups per OSD.

I think you mean (16384 * 3x) / 1000 osds ~= 50 pgs per osd?

> What if, for a start, we chose to reach this number of 1000 OSDs
> slowly, starting with 100 OSDs? That's now 1600 placement groups per OSD.

~500
 
> What if we chose 30,000 (or 32,768) placement groups to keep room for
> expansion?

~1000
 
> My question is: how will a Ceph pool behave with 1000, 5000, or even
> 10,000 placement groups per OSD? Will this impact performance? How
> badly? Can it be worked around? Is it a matter of RAM size? CPU usage?
> 
> Any hint about this would be much appreciated.

It will work, but peering will be slower, and there will be more memory 
used.
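
For reference, the per-OSD numbers above are just pg_num * replica
count / number of OSDs (assuming 3x replication) -- a quick
back-of-the-envelope check:

    # pg_num * replicas / OSDs -- rough PGs per OSD
    echo $(( 16384 * 3 / 1000 ))   # ~49,  i.e. ~50 once all 1000 OSDs exist
    echo $(( 16384 * 3 / 100 ))    # ~491, i.e. ~500 while only 100 OSDs exist
    echo $(( 32768 * 3 / 100 ))    # ~983, i.e. ~1000 with the larger pg_num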

The other question is when you expect to move beyond 1000 osds.  The next 
project we'll be doing on the OSD is PG splitting, which will make this 
problem adjustable.  It won't be backported to argonaut, but it will be in 
the next stable release, and will probably appear in our regular 
development release in 2-3 months.

sage


