* BlueStore and maximum number of objects per PG
@ 2017-02-21 20:04 Wido den Hollander
  2017-02-22  2:53 ` Mark Nelson
  2017-02-22 14:34 ` Mike
  0 siblings, 2 replies; 7+ messages in thread
From: Wido den Hollander @ 2017-02-21 20:04 UTC (permalink / raw)
  To: ceph-devel

Hi,

I'm about to start a test where I'll be putting a lot of objects into BlueStore to see how it holds up.

The reasoning behind this is that I have a customer whose cluster holds 165M objects, which results in some PGs having 900k objects.

For FileStore with XFS this is quite heavy. A simple scrub takes ages.

The problem is that we can't simply increase the number of PGs since that will overload the OSDs as well.

On the other hand we could add hardware, but that also takes time.

So just for the sake of testing I'm looking at trying to replicate this situation using BlueStore from master.

Is there anything I should take into account? I'll probably just be creating a lot (millions) of 100-byte objects in the cluster with just a few PGs.
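
Roughly what I have in mind is something like this (just a sketch using the python-rados bindings; the pool name, object count and payload size are placeholders):

#!/usr/bin/env python
# Sketch: fill a pool with many tiny objects via python-rados.
# Pool name, object count and payload size are placeholders.
import rados

POOL = "objtest"       # pre-created pool with only a few PGs
COUNT = 1000000        # bump this up to reach millions of objects
PAYLOAD = b"x" * 100   # 100-byte objects

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
try:
    for i in range(COUNT):
        ioctx.write_full("obj-%08d" % i, PAYLOAD)
        if i % 100000 == 0:
            print("wrote %d objects" % i)
finally:
    ioctx.close()
    cluster.shutdown()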

Wido


* Re: BlueStore and maximum number of objects per PG
  2017-02-21 20:04 BlueStore and maximum number of objects per PG Wido den Hollander
@ 2017-02-22  2:53 ` Mark Nelson
  2017-02-22 10:51   ` Wido den Hollander
  2017-02-22 14:34 ` Mike
  1 sibling, 1 reply; 7+ messages in thread
From: Mark Nelson @ 2017-02-22  2:53 UTC (permalink / raw)
  To: Wido den Hollander, ceph-devel

Hi Wido,

On 02/21/2017 02:04 PM, Wido den Hollander wrote:
> Hi,
>
> I'm about to start a test where I'll be putting a lot of objects into BlueStore to see how it holds up.
>
> The reasoning behind this is that I have a customer whose cluster holds 165M objects, which results in some PGs having 900k objects.
>
> For FileStore with XFS this is quite heavy. A simple scrub takes ages.
>
> The problem is that we can't simply increase the number of PGs since that will overload the OSDs as well.
>
> On the other hand we could add hardware, but that also takes time.
>
> So just for the sake of testing I'm looking at trying to replicate this situation using BlueStore from master.
>
> Is there anything I should take into account? I'll probably just be creating a lot (millions) of 100-byte objects in the cluster with just a few PGs.

Couple of general things:

I don't anticipate you'll run into the same kind of pg splitting 
slowdowns that you see with filestore, but you still may see some 
slowdown as the object count increases since rocksdb will have more 
key/value pairs to deal with.  I expect you'll see a lot of metadata 
movement between levels as it tries to keep things organized.  One thing 
to note is that it's possible you may see rocksdb bottlenecks as the OSD 
volume size increases.  This is one of the things the guys at Sandisk 
were trying to tackle with Zetascale.

If you can put the rocksdb DB and WAL on SSDs that will likely help, but 
you'll want to be mindful of how full the SSDs are getting.  I'll be 
very curious to see how your tests go; it's been a while since we've 
thrown that many objects on a bluestore cluster (back around the 
newstore timeframe we filled bluestore with many 10s of millions of 
objects and from what I remember it did pretty well).
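
If it helps, a quick way to double-check where the DB and WAL actually 
ended up is to look at the symlinks in the OSD data dir (just a sketch, 
assuming the default /var/lib/ceph layout and osd.0):

import os

# Sketch: show where an OSD's BlueStore block, block.db and block.wal
# symlinks point. Assumes the default data dir layout; adjust the OSD id.
osd_dir = "/var/lib/ceph/osd/ceph-0"
for name in ("block", "block.db", "block.wal"):
    path = os.path.join(osd_dir, name)
    if os.path.islink(path):
        print("%s -> %s" % (name, os.readlink(path)))
    else:
        print("%s: not present (colocated with the main device)" % name)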

Mark

>
> Wido


* Re: BlueStore and maximum number of objects per PG
  2017-02-22  2:53 ` Mark Nelson
@ 2017-02-22 10:51   ` Wido den Hollander
  2017-03-09 13:38     ` Wido den Hollander
  0 siblings, 1 reply; 7+ messages in thread
From: Wido den Hollander @ 2017-02-22 10:51 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel


> On 22 February 2017 at 3:53, Mark Nelson <mnelson@redhat.com> wrote:
> 
> 
> Hi Wido,
> 
> On 02/21/2017 02:04 PM, Wido den Hollander wrote:
> > Hi,
> >
> > I'm about to start a test where I'll be putting a lot of objects into BlueStore to see how it holds up.
> >
> > The reasoning behind this is that I have a customer whose cluster holds 165M objects, which results in some PGs having 900k objects.
> >
> > For FileStore with XFS this is quite heavy. A simple scrub takes ages.
> >
> > The problem is that we can't simply increase the number of PGs since that will overload the OSDs as well.
> >
> > On the other hand we could add hardware, but that also takes time.
> >
> > So just for the sake of testing I'm looking at trying to replicate this situation using BlueStore from master.
> >
> > Is there anything I should take into account? I'll probably just be creating a lot (millions) of 100-byte objects in the cluster with just a few PGs.
> 
> Couple of general things:
> 
> I don't anticipate you'll run into the same kind of pg splitting 
> slowdowns that you see with filestore, but you still may see some 
> slowdown as the object count increases since rocksdb will have more 
> key/value pairs to deal with.  I expect you'll see a lot of metadata 
> movement between levels as it tries to keep things organized.  One thing 
> to note is that it's possible you may see rocksdb bottlenecks as the OSD 
> volume size increases.  This is one of the things the guys at Sandisk 
> were trying to tackle with Zetascale.
> 

Ah, ok!

> If you can put the rocksdb DB and WAL on SSDs that will likely help, but 
> you'll want to be mindful of how full the SSDs are getting.  I'll be 
> very curious to see how your tests go; it's been a while since we've 
> thrown that many objects on a bluestore cluster (back around the 
> newstore timeframe we filled bluestore with many 10s of millions of 
> objects and from what I remember it did pretty well).
> 

Thanks for the information! I'll first try with a few OSDs and size = 1, just putting a lot of small objects in the PGs, and see how it goes.

Afterwards I'll time the latency for writing and reading the objects.
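
For the timing I'm thinking of something simple like this (again just a python-rados sketch; pool name and object count are placeholders):

import time
import rados

# Sketch: measure per-object write and read latency against a test pool.
# Pool name, object count and payload size are placeholders.
POOL = "objtest"
COUNT = 10000
PAYLOAD = b"x" * 100

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

write_lat = []
for i in range(COUNT):
    start = time.time()
    ioctx.write_full("lat-%08d" % i, PAYLOAD)
    write_lat.append(time.time() - start)

read_lat = []
for i in range(COUNT):
    start = time.time()
    ioctx.read("lat-%08d" % i)
    read_lat.append(time.time() - start)

print("avg write latency: %.3f ms" % (1000 * sum(write_lat) / len(write_lat)))
print("avg read latency:  %.3f ms" % (1000 * sum(read_lat) / len(read_lat)))

ioctx.close()
cluster.shutdown()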

Wido

> Mark
> 
> >
> > Wido


* Re: BlueStore and maximum number of objects per PG
  2017-02-21 20:04 BlueStore and maximum number of objects per PG Wido den Hollander
  2017-02-22  2:53 ` Mark Nelson
@ 2017-02-22 14:34 ` Mike
  1 sibling, 0 replies; 7+ messages in thread
From: Mike @ 2017-02-22 14:34 UTC (permalink / raw)
  To: Wido den Hollander, ceph-devel

On Tue, 2017-02-21 at 21:04 +0100, Wido den Hollander wrote:
> Hi,
> 
> I'm about to start a test where I'll be putting a lot of objects into BlueStore to
> see how it holds up.
> 
> The reasoning behind this is that I have a customer whose cluster holds 165M objects,
> which results in some PGs having 900k objects.
> 
> For FileStore with XFS this is quite heavy. A simple scrub takes ages.
> 
> The problem is that we can't simply increase the number of PGs since that will
> overload the OSDs as well.

The problem also stands if you have a huge amount of _small_ objects: the hardware capacity is
sufficient (20% used) but the PG count isn't.
 
> 
> On the other hand we could add hardware, but that also takes time.
> 
> So just for the sake of testing I'm looking at trying to replicate this situation
> using BlueStore from master.
> 
> Is there anything I should take into account? I'll probably just be creating a lot
> (millions) of 100-byte objects in the cluster with just a few PGs.
> 
> Wido


* Re: BlueStore and maximum number of objects per PG
  2017-02-22 10:51   ` Wido den Hollander
@ 2017-03-09 13:38     ` Wido den Hollander
  2017-03-09 14:10       ` Mark Nelson
  0 siblings, 1 reply; 7+ messages in thread
From: Wido den Hollander @ 2017-03-09 13:38 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel


> On 22 February 2017 at 11:51, Wido den Hollander <wido@42on.com> wrote:
> 
> 
> 
> > On 22 February 2017 at 3:53, Mark Nelson <mnelson@redhat.com> wrote:
> > 
> > 
> > Hi Wido,
> > 
> > On 02/21/2017 02:04 PM, Wido den Hollander wrote:
> > > Hi,
> > >
> > > I'm about to start a test where I'll be putting a lot of objects into BlueStore to see how it holds up.
> > >
> > > The reasoning behind this is that I have a customer whose cluster holds 165M objects, which results in some PGs having 900k objects.
> > >
> > > For FileStore with XFS this is quite heavy. A simple scrub takes ages.
> > >
> > > The problem is that we can't simply increase the number of PGs since that will overload the OSDs as well.
> > >
> > > On the other hand we could add hardware, but that also takes time.
> > >
> > > So just for the sake of testing I'm looking at trying to replicate this situation using BlueStore from master.
> > >
> > > Is there anything I should take into account? I'll probably just be creating a lot (millions) of 100-byte objects in the cluster with just a few PGs.
> > 
> > Couple of general things:
> > 
> > I don't anticipate you'll run into the same kind of pg splitting 
> > slowdowns that you see with filestore, but you still may see some 
> > slowdown as the object count increases since rocksdb will have more 
> > key/value pairs to deal with.  I expect you'll see a lot of metadata 
> > movement between levels as it tries to keep things organized.  One thing 
> > to note is that it's possible you may see rocksdb bottlenecks as the OSD 
> > volume size increases.  This is one of the things the guys at Sandisk 
> > were trying to tackle with Zetascale.
> > 
> 
> Ah, ok!
> 
> > If you can put the rocksdb DB and WAL on SSDs that will likely help, but 
> > you'll want to be mindful of how full the SSDs are getting.  I'll be 
> > very curious to see how your tests go; it's been a while since we've 
> > thrown that many objects on a bluestore cluster (back around the 
> > newstore timeframe we filled bluestore with many 10s of millions of 
> > objects and from what I remember it did pretty well).
> > 
> 
> Thanks for the information! I'll first try with a few OSDs and size = 1, just putting a lot of small objects in the PGs, and see how it goes.
> 
> Afterwards I'll time the latency for writing and reading the objects.

First test, one OSD running inside VirtualBox with a 300GB disk and Luminous.

1 OSD, size = 1, pg_num = 8.

After 2.5M objects the disk was full... but the OSD was still working fine. I didn't experience any issues, although the OSD was using 3.4GB of RAM at the moment I stopped doing I/O.

2.5M objects of 128 bytes written to the disk.

Would like to scale this test out further, but I don't have hardware available to run it on.

Wido

> 
> Wido
> 
> > Mark
> > 
> > >
> > > Wido


* Re: BlueStore and maximum number of objects per PG
  2017-03-09 13:38     ` Wido den Hollander
@ 2017-03-09 14:10       ` Mark Nelson
  2017-03-10 10:23         ` Wido den Hollander
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Nelson @ 2017-03-09 14:10 UTC (permalink / raw)
  To: Wido den Hollander, ceph-devel



On 03/09/2017 07:38 AM, Wido den Hollander wrote:
>
>> On 22 February 2017 at 11:51, Wido den Hollander <wido@42on.com> wrote:
>>
>>
>>
>>> On 22 February 2017 at 3:53, Mark Nelson <mnelson@redhat.com> wrote:
>>>
>>>
>>> Hi Wido,
>>>
>>> On 02/21/2017 02:04 PM, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> I'm about to start a test where I'll be putting a lot of objects into BlueStore to see how it holds up.
>>>>
>>>> The reasoning behind this is that I have a customer whose cluster holds 165M objects, which results in some PGs having 900k objects.
>>>>
>>>> For FileStore with XFS this is quite heavy. A simple scrub takes ages.
>>>>
>>>> The problem is that we can't simply increase the number of PGs since that will overload the OSDs as well.
>>>>
>>>> On the other hand we could add hardware, but that also takes time.
>>>>
>>>> So just for the sake of testing I'm looking at trying to replicate this situation using BlueStore from master.
>>>>
>>>> Is there anything I should take into account? I'll probably just be creating a lot (millions) of 100-byte objects in the cluster with just a few PGs.
>>>
>>> Couple of general things:
>>>
>>> I don't anticipate you'll run into the same kind of pg splitting
>>> slowdowns that you see with filestore, but you still may see some
>>> slowdown as the object count increases since rocksdb will have more
>>> key/value pairs to deal with.  I expect you'll see a lot of metadata
>>> movement between levels as it tries to keep things organized.  One thing
>>> to note is that it's possible you may see rocksdb bottlenecks as the OSD
>>> volume size increases.  This is one of the things the guys at Sandisk
>>> were trying to tackle with Zetascale.
>>>
>>
>> Ah, ok!
>>
>>> If you can put the rocksdb DB and WAL on SSDs that will likely help, but
>>> you'll want to be mindful of how full the SSDs are getting.  I'll be
>>> very curious to see how your tests go; it's been a while since we've
>>> thrown that many objects on a bluestore cluster (back around the
>>> newstore timeframe we filled bluestore with many 10s of millions of
>>> objects and from what I remember it did pretty well).
>>>
>>
>> Thanks for the information! I'll first try with a few OSDs and size = 1, just putting a lot of small objects in the PGs, and see how it goes.
>>
>> Afterwards I'll time the latency for writing and reading the objects.
>
> First test, one OSD running inside VirtualBox with a 300GB disk and Luminous.
>
> 1 OSD, size = 1, pg_num = 8.
>
> After 2.5M objects the disk was full... but the OSD was still working fine. I didn't experience any issues, although the OSD was using 3.4GB of RAM at the moment I stopped doing I/O.

Glad to hear it continued to work well!  That's pretty much how my 
testing went the last time I did scaling tests.  Based on your test 
parameters, it sounds like you hit something like ~300-400K objects per 
PG?  Did you get a chance to try filestore with the same parameters? 
The memory usage is not too surprising; bluestore uses its own cache. 
We may still need to tweak the defaults a bit, though there are obvious 
trade-offs.  Hopefully Igor's patches should help here.

>
> 2.5M objects of 128 bytes written to the disk.
>
> Would like to scale this test out further, but I don't have hardware available to run it on.
>
> Wido
>
>>
>> Wido
>>
>>> Mark
>>>
>>>>
>>>> Wido


* Re: BlueStore and maximum number of objects per PG
  2017-03-09 14:10       ` Mark Nelson
@ 2017-03-10 10:23         ` Wido den Hollander
  0 siblings, 0 replies; 7+ messages in thread
From: Wido den Hollander @ 2017-03-10 10:23 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel


> On 9 March 2017 at 15:10, Mark Nelson <mnelson@redhat.com> wrote:
> 
> 
> 
> 
> On 03/09/2017 07:38 AM, Wido den Hollander wrote:
> >
> >> On 22 February 2017 at 11:51, Wido den Hollander <wido@42on.com> wrote:
> >>
> >>
> >>
> >>> On 22 February 2017 at 3:53, Mark Nelson <mnelson@redhat.com> wrote:
> >>>
> >>>
> >>> Hi Wido,
> >>>
> >>> On 02/21/2017 02:04 PM, Wido den Hollander wrote:
> >>>> Hi,
> >>>>
> >>>> I'm about to start a test where I'll be putting a lot of objects into BlueStore to see how it holds up.
> >>>>
> >>>> The reasoning behind this is that I have a customer whose cluster holds 165M objects, which results in some PGs having 900k objects.
> >>>>
> >>>> For FileStore with XFS this is quite heavy. A simple scrub takes ages.
> >>>>
> >>>> The problem is that we can't simply increase the number of PGs since that will overload the OSDs as well.
> >>>>
> >>>> On the other hand we could add hardware, but that also takes time.
> >>>>
> >>>> So just for the sake of testing I'm looking at trying to replicate this situation using BlueStore from master.
> >>>>
> >>>> Is there anything I should take into account? I'll probably just be creating a lot (millions) of 100-byte objects in the cluster with just a few PGs.
> >>>
> >>> Couple of general things:
> >>>
> >>> I don't anticipate you'll run into the same kind of pg splitting
> >>> slowdowns that you see with filestore, but you still may see some
> >>> slowdown as the object count increases since rocksdb will have more
> >>> key/value pairs to deal with.  I expect you'll see a lot of metadata
> >>> movement between levels as it tries to keep things organized.  One thing
> >>> to note is that it's possible you may see rocksdb bottlenecks as the OSD
> >>> volume size increases.  This is one of the things the guys at Sandisk
> >>> were trying to tackle with Zetascale.
> >>>
> >>
> >> Ah, ok!
> >>
> >>> If you can put the rocksdb DB and WAL on SSDs that will likely help, but
> >>> you'll want to be mindful of how full the SSDs are getting.  I'll be
> >>> very curious to see how your tests go; it's been a while since we've
> >>> thrown that many objects on a bluestore cluster (back around the
> >>> newstore timeframe we filled bluestore with many 10s of millions of
> >>> objects and from what I remember it did pretty well).
> >>>
> >>
> >> Thanks for the information! I'll first try with a few OSDs and size = 1, just putting a lot of small objects in the PGs, and see how it goes.
> >>
> >> Afterwards I'll time the latency for writing and reading the objects.
> >
> > First test, one OSD running inside VirtualBox with a 300GB disk and Luminous.
> >
> > 1 OSD, size = 1, pg_num = 8.
> >
> > After 2.5M objects the disk was full... but the OSD was still working fine. I didn't experience any issues, although the OSD was using 3.4GB of RAM at the moment I stopped doing I/O.
> 
> Glad to hear it continued to work well!  That's pretty much how my 
> testing went the last time I did scaling tests.  Based on your test 
> parameters, it sounds like you hit something like ~300-400K objects per

Yes, about that number. I wanted to go for 1M objects in a PG to see how that holds up.
 
> PG?  Did you get a chance to try filestore with the same parameters? 

No, I didn't. I just tested this on my laptop in a VM and didn't have much time to do a full-scale test either.

> The memory usage is not too surprising; bluestore uses its own cache. 
> We may still need to tweak the defaults a bit, though there are obvious 
> trade-offs.  Hopefully Igor's patches should help here.
> 

Ok, understood.

I will test further.

Wido

> >
> > 2.5M objects of 128 bytes written to the disk.
> >
> > Would like to scale this test out further, but I don't have hardware available to run it on.
> >
> > Wido
> >
> >>
> >> Wido
> >>
> >>> Mark
> >>>
> >>>>
> >>>> Wido


Thread overview: 7+ messages
2017-02-21 20:04 BlueStore and maximum number of objects per PG Wido den Hollander
2017-02-22  2:53 ` Mark Nelson
2017-02-22 10:51   ` Wido den Hollander
2017-03-09 13:38     ` Wido den Hollander
2017-03-09 14:10       ` Mark Nelson
2017-03-10 10:23         ` Wido den Hollander
2017-02-22 14:34 ` Mike
