* Bluestore different allocator performance Vs FileStore
@ 2016-08-10 16:55 Somnath Roy
  2016-08-10 21:31 ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-08-10 16:55 UTC (permalink / raw)
  To: ceph-devel

Hi,
I spent some time evaluating the performance of the different Bluestore allocators and freelists. I also tried to gauge the performance difference between Bluestore and filestore on a similar setup.

Setup:
--------

16 OSDs (8TB Flash) across 2 OSD nodes

Single pool and single rbd image of 4TB. 2X replication.

Disabled the exclusive lock feature so that I can run multiple write jobs in parallel.
rbd_cache is disabled on the client side.
Each test ran for 15 mins.

Result :
---------

Here is the detailed report on this.

https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a250cb05986/Bluestore_allocator_comp.xlsx

I named each profile <allocator>-<freelist>, so in the graph, for example, "stupid-extent" means the stupid allocator with the extent freelist.
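
(As a sketch of what such a profile looks like in ceph.conf — the option names below are what I believe the BlueStore code of this vintage uses; double-check them against config_opts.h in your build:)

  [osd]
      osd objectstore = bluestore
      # "stupid-extent" profile; the other profiles swap in "bitmap"
      # for either option
      bluestore allocator = stupid
      bluestore freelist type = extent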

I ran the tests for each profile in the following order, creating a fresh rbd image for each Bluestore test.

1. 4K RW for 15 min with 16QD and 10 jobs.

2. 16K RW for 15 min with 16QD and 10 jobs.

3. 64K RW for 15 min with 16QD and 10 jobs.

4. 256K RW for 15 min with 16QD and 10 jobs.

The above are the non-preconditioned cases, i.e. run before filling up the entire image. I don't see a reason to fill up the rbd image first the way we do for filestore, where filling the image first gives stable performance because it creates the backing files in the filesystem.

5. Next, I preconditioned the 4TB image with 1M sequential writes. This is primarily because I want to load BlueStore with more data.

6. Ran the 4K RW test again (called out as "preconditioned" in the profile) for 15 min.

7. Ran a 4K sequential test at a similar QD for 15 min.

8. Ran the 16K RW test again for 15 min.

For the filestore tests, I ran everything after preconditioning the entire image first.
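
(For reference, a fio job along the following lines would drive the runs above — a sketch only, not the exact job file used; the pool/image/client names are placeholders:)

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=test-4tb
  rw=randwrite          ; randwrite for the RW runs, write for the seq runs
  bs=4k                 ; 4k/16k/64k/256k depending on the run
  iodepth=16
  numjobs=10
  direct=1
  time_based=1
  runtime=900           ; 15 min
  group_reporting=1

  [rbd-job]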

Each sheet in the xlsx has the results for a different block size; it is easy to miss the extra sheets when navigating, so I thought I'd mention it here :-)

I have also captured the mkfs time, OSD startup time, and memory usage after the entire run.

Observation:
---------------

1. First of all, with the bitmap allocator, mkfs time (and thus cluster creation time for 16 OSDs) is ~16X slower than with the stupid allocator and filestore. Each OSD creation sometimes takes ~2 min, and I nailed it down to the insert_free() call (marked ****) in the bitmap allocator.

2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next start
2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913322803328 offset 0x4663d00000 length 0x69959451000
****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free instance 139913322803328 off 0x4663d00000 len 0x69959451000****
****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next end****
2016-08-05 16:13:20.748978 7f4024d258c0 10 bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1 extents

2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read buffered 0x4a14eb~265 of ^A:5242880+5242880
2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913306273920 offset 0x4663d00000 length 0x69959451000
*****2016-08-05 16:13:23.438666 7f4024d258c0 20 bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len 0x69959451000*****
*****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist enumerate_next end

2. The same function call is causing a delay during OSD start, which is ~4X slower than stupid/filestore.

3. As you can see in the results, the bitmap allocator performs a bit worse for all block sizes and shows significantly higher 99th-percentile latency in some cases. This could be because of the above call as well, since it is called in the IO path from kv_sync_thread.

4. At the end of each sheet, the graph for the entire 15 min run is shown; performance looks stable for Bluestore, but filestore for small blocks is kind of spiky, as expected.

5. In my environment, filestore still narrowly outperforms Bluestore for small-block RW (4K/16K), but for bigger-block RW (64K/256K) Bluestore performance is ~2X that of filestore.

6. 4K sequential performance is ~2X lower for Bluestore than for Filestore. If you look at the graph, it starts very low and eventually stabilizes around the 10K mark.

7. The 1M sequential run used to precondition the entire image is a ~2X gain for Bluestore.

8. Small-block performance (I didn't measure big blocks) for Bluestore gets slower after preconditioning, mainly because the onode size is growing (and thus the metadata size is growing). I will do some tests with bigger rbd sizes to see how it behaves at scale.

9. I adjusted the Bluestore cache to use most of my 64GB of system memory, and I found the memory growth for each of the allocator tests to be more or less similar. Filestore of course takes far less memory per OSD, but it has kernel-level caches that we need to consider as well. The advantage for filestore is that those kernel caches (buffer cache, dentry/dcache/inode caches) can be reclaimed and reused by the rest of the system.

10. One challenge for Bluestore as of today is keeping track of onode sizes, so I think the BlueStore onode cache should be bounded by size and *not* by the number of onode entries; otherwise memory growth over long runs will be unmanageable. Sage?


Next, I will do some benchmarking on a bigger setup with a much larger data set.

Any feedback is much appreciated,

Thanks & Regards
Somnath

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


* Re: Bluestore different allocator performance Vs FileStore
  2016-08-10 16:55 Bluestore different allocator performance Vs FileStore Somnath Roy
@ 2016-08-10 21:31 ` Sage Weil
  2016-08-10 22:27   ` Somnath Roy
  0 siblings, 1 reply; 34+ messages in thread
From: Sage Weil @ 2016-08-10 21:31 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Wed, 10 Aug 2016, Somnath Roy wrote:
> Hi, I spent some time on evaluating different Bluestore allocator and 
> freelist performance. Also, tried to gaze the performance difference of 
> Bluestore and filestore on the similar setup.
> 
> Setup:
> --------
> 
> 16 OSDs (8TB Flash) across 2 OSD nodes
> 
> Single pool and single rbd image of 4TB. 2X replication.
> 
> Disabled the exclusive lock feature so that I can run multiple write  jobs in parallel.
> rbd_cache is disabled in the client side.
> Each test ran for 15 mins.
> 
> Result :
> ---------
> 
> Here is the detailed report on this.
> 
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a250cb05986/Bluestore_allocator_comp.xlsx
> 
> Each profile I named based on <allocator>-<freelist> , so in the graph for example "stupid-extent" meaning stupid allocator and extent freelist.
> 
> I ran the test for each of the profile in the following order after creating a fresh rbd image for all the Bluestore test.
> 
> 1. 4K RW for 15 min with 16QD and 10 jobs.
> 
> 2. 16K RW for 15 min with 16QD and 10 jobs.
> 
> 3. 64K RW for 15 min with 16QD and 10 jobs.
> 
> 4. 256K RW for 15 min with 16QD and 10 jobs.
> 
> The above are non-preconditioned case i.e ran before filling up the entire image. The reason is I don't see any reason of filling up the rbd image before like filestore case where it will give stable performance if we fill up the rbd images first. Filling up rbd images in case of filestore will create the files in the filesystem.
> 
> 5. Next, I did precondition the 4TB image with 1M seq write. This is primarily because I want to load BlueStore with more data.
> 
> 6. Ran 4K RW test again (this is called out preconditioned in the profile) for 15 min
> 
> 7. Ran 4K Seq test for similar QD for 15 min
> 
> 8. Ran 16K RW test again for 15min
> 
> For filestore test, I ran tests after preconditioning the entire image first.
> 
> Each sheet on the xls have different block size result , I often miss to navigate through the xls sheets , so, thought of mentioning here :-)
> 
> I have also captured the mkfs time , OSD startup time and the memory usage after the entire run.
> 
> Observation:
> ---------------
> 
> 1. First of all, in case of bitmap allocator mkfs time (and thus cluster creation time for 16 OSDs) are ~16X slower than stupid allocator and filestore. Each OSD creation is taking ~2min or so sometimes and I nailed down the insert_free() function call (marked ****) in the Bitmap allocator is causing that.
> 
> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next start
> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
> 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913322803328 offset 0x4663d00000 length 0x69959451000
> ****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free instance 139913322803328 off 0x4663d00000 len 0x69959451000****
> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next end****
> 2016-08-05 16:13:20.748978 7f4024d258c0 10 bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1 extents
> 
> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read buffered 0x4a14eb~265 of ^A:5242880+5242880
> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next 0x4663d00000~69959451000
> 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free instance 139913306273920 offset 0x4663d00000 length 0x69959451000
> *****2016-08-05 16:13:23.438666 7f4024d258c0 20 bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len 0x69959451000*****
> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist enumerate_next end

I'm not sure there's any easy fix for this. We can amortize it by feeding 
space to bluefs slowly (so that we don't have to do all the inserts at 
once), but I'm not sure that's really better.
 
> 2. The same function call is causing delay during OSD start time and it 
> is ~4X slower than stupid/filestore.

This is by design.

For 1 and 2, we might want to experiment with the value size 
(bluestore_freelist_blocks_per_key) and see if that helps.  It'll probably 
be a tradeoff between init time and runtime performance, but I'm not sure 
the current value (picked semi-randomly) is optimizing for either.
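
(A sketch of the kind of experiment meant here — the value below is arbitrary, and the default of 128, if memory serves, should be verified against config_opts.h:)

  [osd]
      # larger values mean fewer, larger freelist keys: fewer kv ops at
      # mkfs/init time, but coarser-grained updates at runtime
      bluestore freelist blocks per key = 512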
 
> 3. As you can see in the result, bitmap allocator is performing a bit 
> poorly for all the block sizes and has some significant 99th latency in 
> some cases. This could be because of the above call as well since it is 
> been called in IO path from kv_sync_thread.

Again, bluestore_freelist_blocks_per_key might be the main knob to start 
with?
 
> 4. In the end of each sheet, the graph for entire 15min run is shown and 
> the performance looks stable for Bluestore but filestore for small block 
> is kind od spiky as expected.
> 
> 5. In my environment, filestore still outperforming narrowly for small 
> blocks RW (4K/16K) but for bigger blocks RW (64K/256K) Bluestore 
> performance is ~2X than filestore.
> 
> 6. 4K sequential performance is ~2X lower for Bluestore than Filestore. 
> If you see the graph, it is starting very low and eventually it is 
> stabilizing ~10K number.

I think this is still the onode encoding overhead?
 
> 7. 1M seq run for the entire image precondition is ~2X gain for 
> Bluestore.
> 
> 8. The small block performance (didn't measure bigblock) for Bluestore 
> after precondition is getting slower and this is mainly because of onode 
> size is growing (thus metadata size is growing). I will do some test 
> with bigger rbd sizes to see how it behaves at scale.
> 
> 9. I have adjusted the Bluestore cache to use most of my 64GB system 
> memory and I found the amount of memory growth for each of the allocator 
> test is more or less similar. Filestore of course is taking way less 
> memory for OSD but it has kernel level cache that we need to consider as 
> well. But, advantage for filestore is these kernel cache (buffer cache, 
> dentries/dcaches/inode caches) can be reused..
> 
> 10. One challenge for Bluestore as of today is to keep track of onode 
> sizes and thus I think the BlueStore onode cache should be based on size 
> and *not* based on number of onode entries otherwise memory growth for 
> the long run will be unmanageable, Sage ?

This is tricky because the amount of memory the onode is consuming is 
difficult to calculate (or even estimate), and in order to do the 
accounting we'd need to re-estimate after every change, which sounds 
tedious and error prone.  I suspect a better path is to use a separate 
allocator instance for all of the key objects we use here and base the 
cache trimming on that.  That probably means chasing down all of the 
relevant STL containers (tedious), but probably most of it can be captured 
with just a few key classes getting switched over...
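
(To make the size-vs-count idea concrete, here is a minimal sketch of an onode cache trimmed by an approximate byte budget rather than an entry count. It is illustrative only — the types and the approx_bytes field are made up for the example, and as noted above the hard part is keeping that estimate accurate as the onode mutates.)

  #include <cstddef>
  #include <list>
  #include <memory>
  #include <string>
  #include <unordered_map>

  struct Onode {
    std::string key;
    size_t approx_bytes = 0;  // estimated in-memory footprint, maintained by the caller
  };

  class SizeBasedOnodeCache {
    std::list<std::shared_ptr<Onode>> lru;  // front = most recently used
    std::unordered_map<std::string,
                       std::list<std::shared_ptr<Onode>>::iterator> index;
    size_t used_bytes = 0;
    size_t max_bytes;
  public:
    explicit SizeBasedOnodeCache(size_t budget) : max_bytes(budget) {}

    void touch(const std::shared_ptr<Onode>& o) {
      auto it = index.find(o->key);
      if (it != index.end()) {
        lru.splice(lru.begin(), lru, it->second);  // already cached: move to front
        return;
      }
      lru.push_front(o);
      index[o->key] = lru.begin();
      used_bytes += o->approx_bytes;
      trim();
    }

    void trim() {
      // evict from the cold end until we are back under the byte budget
      while (used_bytes > max_bytes && !lru.empty()) {
        auto victim = lru.back();
        used_bytes -= victim->approx_bytes;
        index.erase(victim->key);
        lru.pop_back();
      }
    }
  };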

sage


> 
> 
> Next, I will do some benchmarking on bigger setup and much larger data 
> set.
> 
> Any feedback is much appreciated,
> 
> Thanks & Regards
> Somnath
> 
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* RE: Bluestore different allocator performance Vs FileStore
  2016-08-10 21:31 ` Sage Weil
@ 2016-08-10 22:27   ` Somnath Roy
  2016-08-10 22:44     ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-08-10 22:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

<< inline with [Somnath]

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net] 
Sent: Wednesday, August 10, 2016 2:31 PM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Bluestore different allocator performance Vs FileStore

On Wed, 10 Aug 2016, Somnath Roy wrote:
> Hi, I spent some time on evaluating different Bluestore allocator and 
> freelist performance. Also, tried to gaze the performance difference 
> of Bluestore and filestore on the similar setup.
> 
> Setup:
> --------
> 
> 16 OSDs (8TB Flash) across 2 OSD nodes
> 
> Single pool and single rbd image of 4TB. 2X replication.
> 
> Disabled the exclusive lock feature so that I can run multiple write  jobs in parallel.
> rbd_cache is disabled in the client side.
> Each test ran for 15 mins.
> 
> Result :
> ---------
> 
> Here is the detailed report on this.
> 
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a25
> 0cb05986/Bluestore_allocator_comp.xlsx
> 
> Each profile I named based on <allocator>-<freelist> , so in the graph for example "stupid-extent" meaning stupid allocator and extent freelist.
> 
> I ran the test for each of the profile in the following order after creating a fresh rbd image for all the Bluestore test.
> 
> 1. 4K RW for 15 min with 16QD and 10 jobs.
> 
> 2. 16K RW for 15 min with 16QD and 10 jobs.
> 
> 3. 64K RW for 15 min with 16QD and 10 jobs.
> 
> 4. 256K RW for 15 min with 16QD and 10 jobs.
> 
> The above are non-preconditioned case i.e ran before filling up the entire image. The reason is I don't see any reason of filling up the rbd image before like filestore case where it will give stable performance if we fill up the rbd images first. Filling up rbd images in case of filestore will create the files in the filesystem.
> 
> 5. Next, I did precondition the 4TB image with 1M seq write. This is primarily because I want to load BlueStore with more data.
> 
> 6. Ran 4K RW test again (this is called out preconditioned in the 
> profile) for 15 min
> 
> 7. Ran 4K Seq test for similar QD for 15 min
> 
> 8. Ran 16K RW test again for 15min
> 
> For filestore test, I ran tests after preconditioning the entire image first.
> 
> Each sheet on the xls have different block size result , I often miss 
> to navigate through the xls sheets , so, thought of mentioning here 
> :-)
> 
> I have also captured the mkfs time , OSD startup time and the memory usage after the entire run.
> 
> Observation:
> ---------------
> 
> 1. First of all, in case of bitmap allocator mkfs time (and thus cluster creation time for 16 OSDs) are ~16X slower than stupid allocator and filestore. Each OSD creation is taking ~2min or so sometimes and I nailed down the insert_free() function call (marked ****) in the Bitmap allocator is causing that.
> 
> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next 
> start
> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next 
> 0x4663d00000~69959451000
> 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free 
> instance 139913322803328 offset 0x4663d00000 length 0x69959451000
> ****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free 
> instance 139913322803328 off 0x4663d00000 len 0x69959451000****
> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next 
> end****
> 2016-08-05 16:13:20.748978 7f4024d258c0 10 
> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1 
> extents
> 
> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read 
> buffered 0x4a14eb~265 of ^A:5242880+5242880
> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next 
> 0x4663d00000~69959451000
> 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free 
> instance 139913306273920 offset 0x4663d00000 length 0x69959451000
> *****2016-08-05 16:13:23.438666 7f4024d258c0 20 
> bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len 
> 0x69959451000*****
> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist 
> enumerate_next end

I'm not sure there's any easy fix for this. We can amortize it by feeding space to bluefs slowly (so that we don't have to do all the inserts at once), but I'm not sure that's really better.

[Somnath] I don't know that part of the code, so this may be a dumb question: this is during mkfs(), so can't we just tell bluefs that the entire space is free? I can understand that for OSD mount and all the other cases we need to feed in the free space every time.
IMO this is critical to fix, as cluster creation time will otherwise be (number of OSDs) * 2 min. For me, creating a 16-OSD cluster takes ~32 min compared to ~2 min with the stupid allocator/filestore.
BTW, my drive's data partition is ~6.9TB, the db partition is ~100G, and the WAL is ~1G. I guess the time taken depends on the data partition size as well (?)
 
> 2. The same function call is causing delay during OSD start time and 
> it is ~4X slower than stupid/filestore.

This is by design.

For 1 and 2, we might want to experiment with the value size
(bluestore_freelist_blocks_per_key) and see if that helps.  It'll probably be a tradeoff between init time and runtime performance, but I'm not sure the current value (picked semi-randomly) is optimizing for either.

[Somnath] Will do.
 
> 3. As you can see in the result, bitmap allocator is performing a bit 
> poorly for all the block sizes and has some significant 99th latency 
> in some cases. This could be because of the above call as well since 
> it is been called in IO path from kv_sync_thread.

Again, bluestore_freelist_blocks_per_key might be the main knob to start with?
 
> 4. In the end of each sheet, the graph for entire 15min run is shown 
> and the performance looks stable for Bluestore but filestore for small 
> block is kind od spiky as expected.
> 
> 5. In my environment, filestore still outperforming narrowly for small 
> blocks RW (4K/16K) but for bigger blocks RW (64K/256K) Bluestore 
> performance is ~2X than filestore.
> 
> 6. 4K sequential performance is ~2X lower for Bluestore than Filestore. 
> If you see the graph, it is starting very low and eventually it is 
> stabilizing ~10K number.

I think this is still the onode encoding overhead?
 
> 7. 1M seq run for the entire image precondition is ~2X gain for 
> Bluestore.
> 
> 8. The small block performance (didn't measure bigblock) for Bluestore 
> after precondition is getting slower and this is mainly because of 
> onode size is growing (thus metadata size is growing). I will do some 
> test with bigger rbd sizes to see how it behaves at scale.
> 
> 9. I have adjusted the Bluestore cache to use most of my 64GB system 
> memory and I found the amount of memory growth for each of the 
> allocator test is more or less similar. Filestore of course is taking 
> way less memory for OSD but it has kernel level cache that we need to 
> consider as well. But, advantage for filestore is these kernel cache 
> (buffer cache, dentries/dcaches/inode caches) can be reused..
> 
> 10. One challenge for Bluestore as of today is to keep track of onode 
> sizes and thus I think the BlueStore onode cache should be based on 
> size and *not* based on number of onode entries otherwise memory 
> growth for the long run will be unmanageable, Sage ?

This is tricky because the amount of memory the onode is consuming is difficult to calculate (or even estimate), and in order to do the accounting we'd need to restimate after every change, which sounds tedious and error prone.  I suspect a better path is to use a separate allocator instance for all of the key objects we use here and base the cache trimming on that.  That probably means chasing down all of the relevant STL containers (tedious), but probably most of it can be captures with just a few key classes getting switched over...

sage


> 
> 
> Next, I will do some benchmarking on bigger setup and much larger data 
> set.
> 
> Any feedback is much appreciated,
> 
> Thanks & Regards
> Somnath
> 
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* RE: Bluestore different allocator performance Vs FileStore
  2016-08-10 22:27   ` Somnath Roy
@ 2016-08-10 22:44     ` Sage Weil
  2016-08-10 22:58       ` Allen Samuels
  2016-08-11 12:28       ` Milosz Tanski
  0 siblings, 2 replies; 34+ messages in thread
From: Sage Weil @ 2016-08-10 22:44 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Wed, 10 Aug 2016, Somnath Roy wrote:
> << inline with [Somnath]
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net] 
> Sent: Wednesday, August 10, 2016 2:31 PM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Bluestore different allocator performance Vs FileStore
> 
> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > Hi, I spent some time on evaluating different Bluestore allocator and 
> > freelist performance. Also, tried to gaze the performance difference 
> > of Bluestore and filestore on the similar setup.
> > 
> > Setup:
> > --------
> > 
> > 16 OSDs (8TB Flash) across 2 OSD nodes
> > 
> > Single pool and single rbd image of 4TB. 2X replication.
> > 
> > Disabled the exclusive lock feature so that I can run multiple write  jobs in parallel.
> > rbd_cache is disabled in the client side.
> > Each test ran for 15 mins.
> > 
> > Result :
> > ---------
> > 
> > Here is the detailed report on this.
> > 
> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a25
> > 0cb05986/Bluestore_allocator_comp.xlsx
> > 
> > Each profile I named based on <allocator>-<freelist> , so in the graph for example "stupid-extent" meaning stupid allocator and extent freelist.
> > 
> > I ran the test for each of the profile in the following order after creating a fresh rbd image for all the Bluestore test.
> > 
> > 1. 4K RW for 15 min with 16QD and 10 jobs.
> > 
> > 2. 16K RW for 15 min with 16QD and 10 jobs.
> > 
> > 3. 64K RW for 15 min with 16QD and 10 jobs.
> > 
> > 4. 256K RW for 15 min with 16QD and 10 jobs.
> > 
> > The above are non-preconditioned case i.e ran before filling up the entire image. The reason is I don't see any reason of filling up the rbd image before like filestore case where it will give stable performance if we fill up the rbd images first. Filling up rbd images in case of filestore will create the files in the filesystem.
> > 
> > 5. Next, I did precondition the 4TB image with 1M seq write. This is primarily because I want to load BlueStore with more data.
> > 
> > 6. Ran 4K RW test again (this is called out preconditioned in the 
> > profile) for 15 min
> > 
> > 7. Ran 4K Seq test for similar QD for 15 min
> > 
> > 8. Ran 16K RW test again for 15min
> > 
> > For filestore test, I ran tests after preconditioning the entire image first.
> > 
> > Each sheet on the xls have different block size result , I often miss 
> > to navigate through the xls sheets , so, thought of mentioning here 
> > :-)
> > 
> > I have also captured the mkfs time , OSD startup time and the memory usage after the entire run.
> > 
> > Observation:
> > ---------------
> > 
> > 1. First of all, in case of bitmap allocator mkfs time (and thus cluster creation time for 16 OSDs) are ~16X slower than stupid allocator and filestore. Each OSD creation is taking ~2min or so sometimes and I nailed down the insert_free() function call (marked ****) in the Bitmap allocator is causing that.
> > 
> > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next 
> > start
> > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next 
> > 0x4663d00000~69959451000
> > 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free 
> > instance 139913322803328 offset 0x4663d00000 length 0x69959451000
> > ****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free 
> > instance 139913322803328 off 0x4663d00000 len 0x69959451000****
> > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next 
> > end****
> > 2016-08-05 16:13:20.748978 7f4024d258c0 10 
> > bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1 
> > extents
> > 
> > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read 
> > buffered 0x4a14eb~265 of ^A:5242880+5242880
> > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
> > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next 
> > 0x4663d00000~69959451000
> > 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free 
> > instance 139913306273920 offset 0x4663d00000 length 0x69959451000
> > *****2016-08-05 16:13:23.438666 7f4024d258c0 20 
> > bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len 
> > 0x69959451000*****
> > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist 
> > enumerate_next end
> 
> I'm not sure there's any easy fix for this. We can amortize it by feeding space to bluefs slowly (so that we don't have to do all the inserts at once), but I'm not sure that's really better.
> 
> [Somnath] I don't know that part of the code, so, may be a dumb question. This is during mkfs() time , so, can't we say to bluefs entire space is free ? I can understand for osd mount and all other cases we need to feed the free space every time.
> IMO this is critical to fix as cluster creation time will be number of OSDs * 2 min otherwise. For me creating 16 OSDs cluster is taking ~32min compare to ~2 min for stupid allocator/filestore.
> BTW, my drive data partition is ~6.9TB , db partition is ~100G and WAL is ~1G. I guess the time taking is dependent on data partition size as well (?

Well, we're fundamentally limited by the fact that it's a bitmap, and a 
big chunk of space is "allocated" to bluefs and needs to have 1's set.

sage


* RE: Bluestore different allocator performance Vs FileStore
  2016-08-10 22:44     ` Sage Weil
@ 2016-08-10 22:58       ` Allen Samuels
  2016-08-11  4:34         ` Ramesh Chander
  2016-08-11  6:07         ` Ramesh Chander
  2016-08-11 12:28       ` Milosz Tanski
  1 sibling, 2 replies; 34+ messages in thread
From: Allen Samuels @ 2016-08-10 22:58 UTC (permalink / raw)
  To: Sage Weil, Somnath Roy; +Cc: ceph-devel

We always knew that startup time for the bitmap stuff would be somewhat longer. Still, the existing implementation can be sped up significantly. The code in BitMapZone::set_blocks_used isn't very optimized; converting it to use memset for all but the first/last bytes should speed it up significantly.
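
(Roughly the transformation being suggested — a generic sketch, not the actual BitMapZone code: handle the unaligned head and tail bits in a loop and memset the aligned middle bytes in one shot. The caller is assumed to have sized the bitmap vector to cover the range.)

  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Set bits [start, start+len) in a byte-addressed bitmap.
  static void set_bit_range(std::vector<uint8_t>& bits, size_t start, size_t len)
  {
    size_t end = start + len;
    // head: individual bits up to the next byte boundary
    while (start < end && (start % 8) != 0) {
      bits[start / 8] |= uint8_t(1u << (start % 8));
      ++start;
    }
    // middle: whole bytes in a single memset
    size_t whole_bytes = (end - start) / 8;
    if (whole_bytes) {
      std::memset(&bits[start / 8], 0xff, whole_bytes);
      start += whole_bytes * 8;
    }
    // tail: leftover bits in the final partial byte
    while (start < end) {
      bits[start / 8] |= uint8_t(1u << (start % 8));
      ++start;
    }
  }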


Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, August 10, 2016 3:44 PM
> To: Somnath Roy <Somnath.Roy@sandisk.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > << inline with [Somnath]
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Wednesday, August 10, 2016 2:31 PM
> > To: Somnath Roy
> > Cc: ceph-devel
> > Subject: Re: Bluestore different allocator performance Vs FileStore
> >
> > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > Hi, I spent some time on evaluating different Bluestore allocator
> > > and freelist performance. Also, tried to gaze the performance
> > > difference of Bluestore and filestore on the similar setup.
> > >
> > > Setup:
> > > --------
> > >
> > > 16 OSDs (8TB Flash) across 2 OSD nodes
> > >
> > > Single pool and single rbd image of 4TB. 2X replication.
> > >
> > > Disabled the exclusive lock feature so that I can run multiple write  jobs in
> parallel.
> > > rbd_cache is disabled in the client side.
> > > Each test ran for 15 mins.
> > >
> > > Result :
> > > ---------
> > >
> > > Here is the detailed report on this.
> > >
> > >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > 25 0cb05986/Bluestore_allocator_comp.xlsx
> > >
> > > Each profile I named based on <allocator>-<freelist> , so in the graph for
> example "stupid-extent" meaning stupid allocator and extent freelist.
> > >
> > > I ran the test for each of the profile in the following order after creating a
> fresh rbd image for all the Bluestore test.
> > >
> > > 1. 4K RW for 15 min with 16QD and 10 jobs.
> > >
> > > 2. 16K RW for 15 min with 16QD and 10 jobs.
> > >
> > > 3. 64K RW for 15 min with 16QD and 10 jobs.
> > >
> > > 4. 256K RW for 15 min with 16QD and 10 jobs.
> > >
> > > The above are non-preconditioned case i.e ran before filling up the entire
> image. The reason is I don't see any reason of filling up the rbd image before
> like filestore case where it will give stable performance if we fill up the rbd
> images first. Filling up rbd images in case of filestore will create the files in the
> filesystem.
> > >
> > > 5. Next, I did precondition the 4TB image with 1M seq write. This is
> primarily because I want to load BlueStore with more data.
> > >
> > > 6. Ran 4K RW test again (this is called out preconditioned in the
> > > profile) for 15 min
> > >
> > > 7. Ran 4K Seq test for similar QD for 15 min
> > >
> > > 8. Ran 16K RW test again for 15min
> > >
> > > For filestore test, I ran tests after preconditioning the entire image first.
> > >
> > > Each sheet on the xls have different block size result , I often
> > > miss to navigate through the xls sheets , so, thought of mentioning
> > > here
> > > :-)
> > >
> > > I have also captured the mkfs time , OSD startup time and the memory
> usage after the entire run.
> > >
> > > Observation:
> > > ---------------
> > >
> > > 1. First of all, in case of bitmap allocator mkfs time (and thus cluster
> creation time for 16 OSDs) are ~16X slower than stupid allocator and filestore.
> Each OSD creation is taking ~2min or so sometimes and I nailed down the
> insert_free() function call (marked ****) in the Bitmap allocator is causing
> that.
> > >
> > > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next
> > > start
> > > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next
> > > 0x4663d00000~69959451000
> > > 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free
> > > instance 139913322803328 offset 0x4663d00000 length 0x69959451000
> > > ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > bitmapalloc:insert_free instance 139913322803328 off 0x4663d00000
> > > len 0x69959451000****
> > > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > enumerate_next
> > > end****
> > > 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1
> > > extents
> > >
> > > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read
> > > buffered 0x4a14eb~265 of ^A:5242880+5242880
> > > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got
> > > 613
> > > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next
> > > 0x4663d00000~69959451000
> > > 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free
> > > instance 139913306273920 offset 0x4663d00000 length 0x69959451000
> > > *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000
> > > len
> > > 0x69959451000*****
> > > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > enumerate_next end
> >
> > I'm not sure there's any easy fix for this. We can amortize it by feeding
> space to bluefs slowly (so that we don't have to do all the inserts at once),
> but I'm not sure that's really better.
> >
> > [Somnath] I don't know that part of the code, so, may be a dumb question.
> This is during mkfs() time , so, can't we say to bluefs entire space is free ? I
> can understand for osd mount and all other cases we need to feed the free
> space every time.
> > IMO this is critical to fix as cluster creation time will be number of OSDs * 2
> min otherwise. For me creating 16 OSDs cluster is taking ~32min compare to
> ~2 min for stupid allocator/filestore.
> > BTW, my drive data partition is ~6.9TB , db partition is ~100G and WAL is
> ~1G. I guess the time taking is dependent on data partition size as well (?
> 
> Well, we're fundamentally limited by the fact that it's a bitmap, and a big
> chunk of space is "allocated" to bluefs and needs to have 1's set.
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


* RE: Bluestore different allocator performance Vs FileStore
  2016-08-10 22:58       ` Allen Samuels
@ 2016-08-11  4:34         ` Ramesh Chander
  2016-08-11  6:07         ` Ramesh Chander
  1 sibling, 0 replies; 34+ messages in thread
From: Ramesh Chander @ 2016-08-11  4:34 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, Somnath Roy; +Cc: ceph-devel

I think insert_free is limited by the speed of the clear_bits function here.

set_bits and clear_bits have the same logic, except that one sets and the other clears; both operate on 64 bits (the bitmap word size) at a time.

I am not sure whether doing a memset will make it faster, but if we can do it for a group of bitmap words, it might help.

I am looking into the code to see whether we can handle mkfs and OSD mount in a special way to make them faster.

If I don't find an easy fix, we can go down the path of deferring the init to a later stage, as and when required.

-Ramesh

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Thursday, August 11, 2016 4:28 AM
> To: Sage Weil; Somnath Roy
> Cc: ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> We always knew that startup time for bitmap stuff would be somewhat
> longer. Still, the existing implementation can be speeded up significantly. The
> code in BitMapZone::set_blocks_used isn't very optimized. Converting it to
> use memset for all but the first/last bytes should significantly speed it up.
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Wednesday, August 10, 2016 3:44 PM
> > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > << inline with [Somnath]
> > >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Wednesday, August 10, 2016 2:31 PM
> > > To: Somnath Roy
> > > Cc: ceph-devel
> > > Subject: Re: Bluestore different allocator performance Vs FileStore
> > >
> > > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > Hi, I spent some time on evaluating different Bluestore allocator
> > > > and freelist performance. Also, tried to gaze the performance
> > > > difference of Bluestore and filestore on the similar setup.
> > > >
> > > > Setup:
> > > > --------
> > > >
> > > > 16 OSDs (8TB Flash) across 2 OSD nodes
> > > >
> > > > Single pool and single rbd image of 4TB. 2X replication.
> > > >
> > > > Disabled the exclusive lock feature so that I can run multiple
> > > > write  jobs in
> > parallel.
> > > > rbd_cache is disabled in the client side.
> > > > Each test ran for 15 mins.
> > > >
> > > > Result :
> > > > ---------
> > > >
> > > > Here is the detailed report on this.
> > > >
> > > >
> >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > >
> > > > Each profile I named based on <allocator>-<freelist> , so in the
> > > > graph for
> > example "stupid-extent" meaning stupid allocator and extent freelist.
> > > >
> > > > I ran the test for each of the profile in the following order
> > > > after creating a
> > fresh rbd image for all the Bluestore test.
> > > >
> > > > 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > >
> > > > 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > >
> > > > 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > >
> > > > 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > >
> > > > The above are non-preconditioned case i.e ran before filling up
> > > > the entire
> > image. The reason is I don't see any reason of filling up the rbd
> > image before like filestore case where it will give stable performance
> > if we fill up the rbd images first. Filling up rbd images in case of
> > filestore will create the files in the filesystem.
> > > >
> > > > 5. Next, I did precondition the 4TB image with 1M seq write. This
> > > > is
> > primarily because I want to load BlueStore with more data.
> > > >
> > > > 6. Ran 4K RW test again (this is called out preconditioned in the
> > > > profile) for 15 min
> > > >
> > > > 7. Ran 4K Seq test for similar QD for 15 min
> > > >
> > > > 8. Ran 16K RW test again for 15min
> > > >
> > > > For filestore test, I ran tests after preconditioning the entire image first.
> > > >
> > > > Each sheet on the xls have different block size result , I often
> > > > miss to navigate through the xls sheets , so, thought of
> > > > mentioning here
> > > > :-)
> > > >
> > > > I have also captured the mkfs time , OSD startup time and the
> > > > memory
> > usage after the entire run.
> > > >
> > > > Observation:
> > > > ---------------
> > > >
> > > > 1. First of all, in case of bitmap allocator mkfs time (and thus
> > > > cluster
> > creation time for 16 OSDs) are ~16X slower than stupid allocator and
> filestore.
> > Each OSD creation is taking ~2min or so sometimes and I nailed down
> > the
> > insert_free() function call (marked ****) in the Bitmap allocator is
> > causing that.
> > > >
> > > > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next
> > > > start
> > > > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next
> > > > 0x4663d00000~69959451000
> > > > 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > bitmapalloc:init_add_free instance 139913322803328 offset
> > > > 0x4663d00000 length 0x69959451000
> > > > ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > bitmapalloc:insert_free instance 139913322803328 off 0x4663d00000
> > > > len 0x69959451000****
> > > > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > > enumerate_next
> > > > end****
> > > > 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1
> > > > extents
> > > >
> > > > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
> > > > read buffered 0x4a14eb~265 of ^A:5242880+5242880
> > > > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got
> > > > 613
> > > > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next
> > > > 0x4663d00000~69959451000
> > > > 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > bitmapalloc:init_add_free instance 139913306273920 offset
> > > > 0x4663d00000 length 0x69959451000
> > > > *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000
> > > > len
> > > > 0x69959451000*****
> > > > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > > enumerate_next end
> > >
> > > I'm not sure there's any easy fix for this. We can amortize it by
> > > feeding
> > space to bluefs slowly (so that we don't have to do all the inserts at
> > once), but I'm not sure that's really better.
> > >
> > > [Somnath] I don't know that part of the code, so, may be a dumb
> question.
> > This is during mkfs() time , so, can't we say to bluefs entire space
> > is free ? I can understand for osd mount and all other cases we need
> > to feed the free space every time.
> > > IMO this is critical to fix as cluster creation time will be number
> > > of OSDs * 2
> > min otherwise. For me creating 16 OSDs cluster is taking ~32min
> > compare to
> > ~2 min for stupid allocator/filestore.
> > > BTW, my drive data partition is ~6.9TB , db partition is ~100G and
> > > WAL is
> > ~1G. I guess the time taking is dependent on data partition size as well (?
> >
> > Well, we're fundamentally limited by the fact that it's a bitmap, and
> > a big chunk of space is "allocated" to bluefs and needs to have 1's set.
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


* RE: Bluestore different allocator performance Vs FileStore
  2016-08-10 22:58       ` Allen Samuels
  2016-08-11  4:34         ` Ramesh Chander
@ 2016-08-11  6:07         ` Ramesh Chander
  2016-08-11  7:11           ` Somnath Roy
  2016-08-11 16:04           ` Allen Samuels
  1 sibling, 2 replies; 34+ messages in thread
From: Ramesh Chander @ 2016-08-11  6:07 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, Somnath Roy; +Cc: ceph-devel

Somnath,

Basically, per-OSD mkfs time has increased from ~7.5 seconds (2 min / 16) to ~2 minutes (32 min / 16).

But is there a reason you have to create OSDs serially? I think mkfs for multiple OSDs can happen in parallel?

As a fix, I am looking at batching multiple insert_free calls for now. If that still does not help, I am thinking of doing insert_free on different parts of the device in parallel.
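
(A rough sketch of the second option — splitting the init range across threads. The init_add_free(offset, length) call and its thread-safety on disjoint ranges are assumptions for illustration, not guarantees of the current Allocator interface.)

  #include <cstdint>
  #include <thread>
  #include <vector>

  template <typename Alloc>
  void parallel_init_add_free(Alloc& alloc, uint64_t offset, uint64_t length,
                              unsigned nthreads = 4)
  {
    uint64_t chunk = length / nthreads;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < nthreads; ++i) {
      uint64_t off = offset + i * chunk;
      uint64_t len = (i + 1 == nthreads) ? length - i * chunk : chunk;
      // assumes concurrent calls on disjoint ranges do not race
      workers.emplace_back([&alloc, off, len] { alloc.init_add_free(off, len); });
    }
    for (auto& t : workers)
      t.join();
  }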

-Ramesh

> -----Original Message-----
> From: Ramesh Chander
> Sent: Thursday, August 11, 2016 10:04 AM
> To: Allen Samuels; Sage Weil; Somnath Roy
> Cc: ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> I think insert_free is limited by speed of function clear_bits here.
>
> Though set_bits and clear_bits have same logic except one sets and another
> clears. Both of these does 64 bits (bitmap size) at a time.
>
> I am not sure if doing memset will make it faster. But if we can do it for group
> of bitmaps, then it might help.
>
> I am looking in to code if we can handle mkfs and osd mount in special way to
> make it faster.
>
> If I don't find an easy fix, we can go to path of deferring init to later stage as
> and when required.
>
> -Ramesh
>
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > Sent: Thursday, August 11, 2016 4:28 AM
> > To: Sage Weil; Somnath Roy
> > Cc: ceph-devel
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > We always knew that startup time for bitmap stuff would be somewhat
> > longer. Still, the existing implementation can be speeded up
> > significantly. The code in BitMapZone::set_blocks_used isn't very
> > optimized. Converting it to use memset for all but the first/last bytes
> should significantly speed it up.
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Wednesday, August 10, 2016 3:44 PM
> > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > << inline with [Somnath]
> > > >
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > Sent: Wednesday, August 10, 2016 2:31 PM
> > > > To: Somnath Roy
> > > > Cc: ceph-devel
> > > > Subject: Re: Bluestore different allocator performance Vs
> > > > FileStore
> > > >
> > > > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > Hi, I spent some time on evaluating different Bluestore
> > > > > allocator and freelist performance. Also, tried to gaze the
> > > > > performance difference of Bluestore and filestore on the similar
> setup.
> > > > >
> > > > > Setup:
> > > > > --------
> > > > >
> > > > > 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > >
> > > > > Single pool and single rbd image of 4TB. 2X replication.
> > > > >
> > > > > Disabled the exclusive lock feature so that I can run multiple
> > > > > write  jobs in
> > > parallel.
> > > > > rbd_cache is disabled in the client side.
> > > > > Each test ran for 15 mins.
> > > > >
> > > > > Result :
> > > > > ---------
> > > > >
> > > > > Here is the detailed report on this.
> > > > >
> > > > >
> > >
> >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > >
> > > > > Each profile I named based on <allocator>-<freelist> , so in the
> > > > > graph for
> > > example "stupid-extent" meaning stupid allocator and extent freelist.
> > > > >
> > > > > I ran the test for each of the profile in the following order
> > > > > after creating a
> > > fresh rbd image for all the Bluestore test.
> > > > >
> > > > > 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > The above are non-preconditioned case i.e ran before filling up
> > > > > the entire
> > > image. The reason is I don't see any reason of filling up the rbd
> > > image before like filestore case where it will give stable
> > > performance if we fill up the rbd images first. Filling up rbd
> > > images in case of filestore will create the files in the filesystem.
> > > > >
> > > > > 5. Next, I did precondition the 4TB image with 1M seq write.
> > > > > This is
> > > primarily because I want to load BlueStore with more data.
> > > > >
> > > > > 6. Ran 4K RW test again (this is called out preconditioned in
> > > > > the
> > > > > profile) for 15 min
> > > > >
> > > > > 7. Ran 4K Seq test for similar QD for 15 min
> > > > >
> > > > > 8. Ran 16K RW test again for 15min
> > > > >
> > > > > For filestore test, I ran tests after preconditioning the entire image
> first.
> > > > >
> > > > > Each sheet on the xls have different block size result , I often
> > > > > miss to navigate through the xls sheets , so, thought of
> > > > > mentioning here
> > > > > :-)
> > > > >
> > > > > I have also captured the mkfs time , OSD startup time and the
> > > > > memory
> > > usage after the entire run.
> > > > >
> > > > > Observation:
> > > > > ---------------
> > > > >
> > > > > 1. First of all, in case of bitmap allocator mkfs time (and thus
> > > > > cluster
> > > creation time for 16 OSDs) are ~16X slower than stupid allocator and
> > filestore.
> > > Each OSD creation is taking ~2min or so sometimes and I nailed down
> > > the
> > > insert_free() function call (marked ****) in the Bitmap allocator is
> > > causing that.
> > > > >
> > > > > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > > > enumerate_next start
> > > > > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > > > enumerate_next
> > > > > 0x4663d00000~69959451000
> > > > > 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > bitmapalloc:init_add_free instance 139913322803328 offset
> > > > > 0x4663d00000 length 0x69959451000
> > > > > ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > bitmapalloc:insert_free instance 139913322803328 off
> > > > > 0x4663d00000 len 0x69959451000****
> > > > > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > > > enumerate_next
> > > > > end****
> > > > > 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in
> > > > > 1 extents
> > > > >
> > > > > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
> > > > > read buffered 0x4a14eb~265 of ^A:5242880+5242880
> > > > > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
> > > > > got
> > > > > 613
> > > > > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > > > enumerate_next
> > > > > 0x4663d00000~69959451000
> > > > > 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > bitmapalloc:init_add_free instance 139913306273920 offset
> > > > > 0x4663d00000 length 0x69959451000
> > > > > *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > bitmapalloc:insert_free instance 139913306273920 off
> > > > > 0x4663d00000 len
> > > > > 0x69959451000*****
> > > > > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > > > enumerate_next end
> > > >
> > > > I'm not sure there's any easy fix for this. We can amortize it by
> > > > feeding
> > > space to bluefs slowly (so that we don't have to do all the inserts
> > > at once), but I'm not sure that's really better.
> > > >
> > > > [Somnath] I don't know that part of the code, so, may be a dumb
> > question.
> > > This is during mkfs() time , so, can't we say to bluefs entire space
> > > is free ? I can understand for osd mount and all other cases we need
> > > to feed the free space every time.
> > > > IMO this is critical to fix as cluster creation time will be
> > > > number of OSDs * 2
> > > min otherwise. For me creating 16 OSDs cluster is taking ~32min
> > > compare to
> > > ~2 min for stupid allocator/filestore.
> > > > BTW, my drive data partition is ~6.9TB , db partition is ~100G and
> > > > WAL is
> > > ~1G. I guess the time taking is dependent on data partition size as well (?
> > >
> > > Well, we're fundamentally limited by the fact that it's a bitmap,
> > > and a big chunk of space is "allocated" to bluefs and needs to have 1's set.
> > >
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@vger.kernel.org More
> > majordomo
> > > info at http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11  6:07         ` Ramesh Chander
@ 2016-08-11  7:11           ` Somnath Roy
  2016-08-11 11:24             ` Mark Nelson
  2016-08-11 16:04           ` Allen Samuels
  1 sibling, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-08-11  7:11 UTC (permalink / raw)
  To: Ramesh Chander, Allen Samuels, Sage Weil; +Cc: ceph-devel

Yes, we can create OSDs in parallel, but I am not sure how many people create clusters that way, since there is no interface for it on the ceph-deploy side.
FYI, we have introduced some parallelism in the SanDisk installer wrapper script built on ceph-deploy.
I don't think this problem will go away even with fully parallel OSD creation, but it will certainly be reduced a bit, as we have already seen for OSD start time, which is inherently parallel.
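
(For example, a wrapper could fan the prepare step out per device, roughly like the sketch below — the host and device names are placeholders, and the exact ceph-deploy invocation should be checked against your version:)

  for dev in sdb sdc sdd sde; do
      ceph-deploy osd prepare osd-node1:/dev/$dev &
  done
  wait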

Thanks & Regards
Somnath

-----Original Message-----
From: Ramesh Chander
Sent: Wednesday, August 10, 2016 11:07 PM
To: Allen Samuels; Sage Weil; Somnath Roy
Cc: ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

Somnath,

Basically, per-OSD mkfs time has increased from ~7.5 seconds (2 min / 16) to ~2 minutes (32 min / 16).

But is there a reason you have to create OSDs serially? I think mkfs for multiple OSDs can happen in parallel.

As a fix I am looking at batching multiple insert_free calls for now. If that still does not help, I am thinking of doing insert_free on different parts of the device in parallel; a rough sketch of the batching idea is below.
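
The sketch below only shows the shape of the change: touch individual bits
at the edges of a run and clear all the interior bytes with a single memset
instead of a per-bit loop. The names are illustrative, not the actual
clear_bits code.

// Illustrative sketch, not the actual BlueStore clear_bits code.
// Clears the run [first_bit, first_bit + nbits): bit-twiddle only the
// partial first/last bytes and memset everything in between in one call.
#include <cstdint>
#include <cstring>
#include <vector>

static void clear_bit_run(std::vector<uint8_t>& bitmap,
                          uint64_t first_bit, uint64_t nbits)
{
  uint64_t last_bit = first_bit + nbits;      // exclusive end
  uint64_t first_byte = first_bit / 8;
  uint64_t last_byte = last_bit / 8;

  if (first_byte == last_byte) {              // run fits inside one byte
    uint8_t mask = ((1u << (last_bit % 8)) - 1) & ~((1u << (first_bit % 8)) - 1);
    bitmap[first_byte] &= ~mask;
    return;
  }
  if (first_bit % 8) {                        // partial leading byte
    bitmap[first_byte] &= (1u << (first_bit % 8)) - 1;
    ++first_byte;
  }
  if (last_bit % 8)                           // partial trailing byte
    bitmap[last_byte] &= ~((1u << (last_bit % 8)) - 1);
  if (last_byte > first_byte)                 // whole interior bytes at once
    memset(&bitmap[first_byte], 0, last_byte - first_byte);
}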

-Ramesh

> -----Original Message-----
> From: Ramesh Chander
> Sent: Thursday, August 11, 2016 10:04 AM
> To: Allen Samuels; Sage Weil; Somnath Roy
> Cc: ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> I think insert_free is limited by speed of function clear_bits here.
>
> Though set_bits and clear_bits have same logic except one sets and
> another clears. Both of these does 64 bits (bitmap size) at a time.
>
> I am not sure if doing memset will make it faster. But if we can do it
> for group of bitmaps, then it might help.
>
> I am looking in to code if we can handle mkfs and osd mount in special
> way to make it faster.
>
> If I don't find an easy fix, we can go to path of deferring init to
> later stage as and when required.
>
> -Ramesh
>
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > Sent: Thursday, August 11, 2016 4:28 AM
> > To: Sage Weil; Somnath Roy
> > Cc: ceph-devel
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > We always knew that startup time for bitmap stuff would be somewhat
> > longer. Still, the existing implementation can be speeded up
> > significantly. The code in BitMapZone::set_blocks_used isn't very
> > optimized. Converting it to use memset for all but the first/last
> > bytes
> should significantly speed it up.
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Wednesday, August 10, 2016 3:44 PM
> > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs
> > > FileStore
> > >
> > > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > << inline with [Somnath]
> > > >
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > Sent: Wednesday, August 10, 2016 2:31 PM
> > > > To: Somnath Roy
> > > > Cc: ceph-devel
> > > > Subject: Re: Bluestore different allocator performance Vs
> > > > FileStore
> > > >
> > > > On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > Hi, I spent some time on evaluating different Bluestore
> > > > > allocator and freelist performance. Also, tried to gaze the
> > > > > performance difference of Bluestore and filestore on the
> > > > > similar
> setup.
> > > > >
> > > > > Setup:
> > > > > --------
> > > > >
> > > > > 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > >
> > > > > Single pool and single rbd image of 4TB. 2X replication.
> > > > >
> > > > > Disabled the exclusive lock feature so that I can run multiple
> > > > > write  jobs in
> > > parallel.
> > > > > rbd_cache is disabled in the client side.
> > > > > Each test ran for 15 mins.
> > > > >
> > > > > Result :
> > > > > ---------
> > > > >
> > > > > Here is the detailed report on this.
> > > > >
> > > > >
> > >
> >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > >
> > > > > Each profile I named based on <allocator>-<freelist> , so in
> > > > > the graph for
> > > example "stupid-extent" meaning stupid allocator and extent freelist.
> > > > >
> > > > > I ran the test for each of the profile in the following order
> > > > > after creating a
> > > fresh rbd image for all the Bluestore test.
> > > > >
> > > > > 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > >
> > > > > The above are non-preconditioned case i.e ran before filling
> > > > > up the entire
> > > image. The reason is I don't see any reason of filling up the rbd
> > > image before like filestore case where it will give stable
> > > performance if we fill up the rbd images first. Filling up rbd
> > > images in case of filestore will create the files in the filesystem.
> > > > >
> > > > > 5. Next, I did precondition the 4TB image with 1M seq write.
> > > > > This is
> > > primarily because I want to load BlueStore with more data.
> > > > >
> > > > > 6. Ran 4K RW test again (this is called out preconditioned in
> > > > > the
> > > > > profile) for 15 min
> > > > >
> > > > > 7. Ran 4K Seq test for similar QD for 15 min
> > > > >
> > > > > 8. Ran 16K RW test again for 15min
> > > > >
> > > > > For filestore test, I ran tests after preconditioning the
> > > > > entire image
> first.
> > > > >
> > > > > Each sheet on the xls have different block size result , I
> > > > > often miss to navigate through the xls sheets , so, thought of
> > > > > mentioning here
> > > > > :-)
> > > > >
> > > > > I have also captured the mkfs time , OSD startup time and the
> > > > > memory
> > > usage after the entire run.
> > > > >
> > > > > Observation:
> > > > > ---------------
> > > > >
> > > > > 1. First of all, in case of bitmap allocator mkfs time (and
> > > > > thus cluster
> > > creation time for 16 OSDs) are ~16X slower than stupid allocator
> > > and
> > filestore.
> > > Each OSD creation is taking ~2min or so sometimes and I nailed
> > > down the
> > > insert_free() function call (marked ****) in the Bitmap allocator
> > > is causing that.
> > > > >
> > > > > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > > > enumerate_next start
> > > > > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > > > enumerate_next
> > > > > 0x4663d00000~69959451000
> > > > > 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > bitmapalloc:init_add_free instance 139913322803328 offset
> > > > > 0x4663d00000 length 0x69959451000
> > > > > ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > bitmapalloc:insert_free instance 139913322803328 off
> > > > > 0x4663d00000 len 0x69959451000****
> > > > > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > > > enumerate_next
> > > > > end****
> > > > > 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G
> > > > > in
> > > > > 1 extents
> > > > >
> > > > > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
> > > > > read buffered 0x4a14eb~265 of ^A:5242880+5242880
> > > > > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
> > > > > got
> > > > > 613
> > > > > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > > > enumerate_next
> > > > > 0x4663d00000~69959451000
> > > > > 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > bitmapalloc:init_add_free instance 139913306273920 offset
> > > > > 0x4663d00000 length 0x69959451000
> > > > > *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > bitmapalloc:insert_free instance 139913306273920 off
> > > > > 0x4663d00000 len
> > > > > 0x69959451000*****
> > > > > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > > > enumerate_next end
> > > >
> > > > I'm not sure there's any easy fix for this. We can amortize it
> > > > by feeding
> > > space to bluefs slowly (so that we don't have to do all the
> > > inserts at once), but I'm not sure that's really better.
> > > >
> > > > [Somnath] I don't know that part of the code, so, may be a dumb
> > question.
> > > This is during mkfs() time , so, can't we say to bluefs entire
> > > space is free ? I can understand for osd mount and all other cases
> > > we need to feed the free space every time.
> > > > IMO this is critical to fix as cluster creation time will be
> > > > number of OSDs * 2
> > > min otherwise. For me creating 16 OSDs cluster is taking ~32min
> > > compare to
> > > ~2 min for stupid allocator/filestore.
> > > > BTW, my drive data partition is ~6.9TB , db partition is ~100G
> > > > and WAL is
> > > ~1G. I guess the time taking is dependent on data partition size as well (?
> > >
> > > Well, we're fundamentally limited by the fact that it's a bitmap,
> > > and a big chunk of space is "allocated" to bluefs and needs to have 1's set.
> > >
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@vger.kernel.org More
> > majordomo
> > > info at http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Bluestore different allocator performance Vs FileStore
  2016-08-11  7:11           ` Somnath Roy
@ 2016-08-11 11:24             ` Mark Nelson
  2016-08-11 14:06               ` Ben England
  0 siblings, 1 reply; 34+ messages in thread
From: Mark Nelson @ 2016-08-11 11:24 UTC (permalink / raw)
  To: Somnath Roy, Ramesh Chander, Allen Samuels, Sage Weil
  Cc: ceph-devel, Ben England

Ben England added parallel OSD creation to CBT a while back, which 
greatly sped up cluster creation time (not just for the bitmap 
allocator).  I'm not sure if ceph-ansible creates OSDs in parallel, but 
if not he might have some insights into how easy it would be to improve it.

Mark

On 08/11/2016 02:11 AM, Somnath Roy wrote:
> Yes, we can create OSDs in parallel but I am not sure how many people are creating cluster like that as ceph-deploy end there is no interface for that.
> FYI, we have introduced some parallelism in SanDisk wrapper script for installer based on ceph-deploy.
> I don't think even with all these parallel OSD creation, this problem will go away but for sure will be reduced  a bit as we have seen in case of OSD start time since it is inherently parallel.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Ramesh Chander
> Sent: Wednesday, August 10, 2016 11:07 PM
> To: Allen Samuels; Sage Weil; Somnath Roy
> Cc: ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> Somnath,
>
> Basically mkfs time has increased from 7.5 seconds (2min / 16) to 2 minutes ( 32 / 16).
>
> But is there a reason you should create osds in serial? I think for mmultiple osds mkfs can happen in parallel?
>
> As a fix I am looking to batch multiple insert_free calls for now. If still that does not help, thinking of doing insert_free on different part of device in parallel.
>
> -Ramesh
>
>> -----Original Message-----
>> From: Ramesh Chander
>> Sent: Thursday, August 11, 2016 10:04 AM
>> To: Allen Samuels; Sage Weil; Somnath Roy
>> Cc: ceph-devel
>> Subject: RE: Bluestore different allocator performance Vs FileStore
>>
>> I think insert_free is limited by speed of function clear_bits here.
>>
>> Though set_bits and clear_bits have same logic except one sets and
>> another clears. Both of these does 64 bits (bitmap size) at a time.
>>
>> I am not sure if doing memset will make it faster. But if we can do it
>> for group of bitmaps, then it might help.
>>
>> I am looking in to code if we can handle mkfs and osd mount in special
>> way to make it faster.
>>
>> If I don't find an easy fix, we can go to path of deferring init to
>> later stage as and when required.
>>
>> -Ramesh
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Thursday, August 11, 2016 4:28 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel
>>> Subject: RE: Bluestore different allocator performance Vs FileStore
>>>
>>> We always knew that startup time for bitmap stuff would be somewhat
>>> longer. Still, the existing implementation can be speeded up
>>> significantly. The code in BitMapZone::set_blocks_used isn't very
>>> optimized. Converting it to use memset for all but the first/last
>>> bytes
>> should significantly speed it up.
>>>
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Wednesday, August 10, 2016 3:44 PM
>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
>>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>>>> Subject: RE: Bluestore different allocator performance Vs
>>>> FileStore
>>>>
>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
>>>>> << inline with [Somnath]
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sage@newdream.net]
>>>>> Sent: Wednesday, August 10, 2016 2:31 PM
>>>>> To: Somnath Roy
>>>>> Cc: ceph-devel
>>>>> Subject: Re: Bluestore different allocator performance Vs
>>>>> FileStore
>>>>>
>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
>>>>>> Hi, I spent some time on evaluating different Bluestore
>>>>>> allocator and freelist performance. Also, tried to gaze the
>>>>>> performance difference of Bluestore and filestore on the
>>>>>> similar
>> setup.
>>>>>>
>>>>>> Setup:
>>>>>> --------
>>>>>>
>>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
>>>>>>
>>>>>> Single pool and single rbd image of 4TB. 2X replication.
>>>>>>
>>>>>> Disabled the exclusive lock feature so that I can run multiple
>>>>>> write  jobs in
>>>> parallel.
>>>>>> rbd_cache is disabled in the client side.
>>>>>> Each test ran for 15 mins.
>>>>>>
>>>>>> Result :
>>>>>> ---------
>>>>>>
>>>>>> Here is the detailed report on this.
>>>>>>
>>>>>>
>>>>
>>>
>> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
>>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
>>>>>>
>>>>>> Each profile I named based on <allocator>-<freelist> , so in
>>>>>> the graph for
>>>> example "stupid-extent" meaning stupid allocator and extent freelist.
>>>>>>
>>>>>> I ran the test for each of the profile in the following order
>>>>>> after creating a
>>>> fresh rbd image for all the Bluestore test.
>>>>>>
>>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
>>>>>>
>>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
>>>>>>
>>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
>>>>>>
>>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
>>>>>>
>>>>>> The above are non-preconditioned case i.e ran before filling
>>>>>> up the entire
>>>> image. The reason is I don't see any reason of filling up the rbd
>>>> image before like filestore case where it will give stable
>>>> performance if we fill up the rbd images first. Filling up rbd
>>>> images in case of filestore will create the files in the filesystem.
>>>>>>
>>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
>>>>>> This is
>>>> primarily because I want to load BlueStore with more data.
>>>>>>
>>>>>> 6. Ran 4K RW test again (this is called out preconditioned in
>>>>>> the
>>>>>> profile) for 15 min
>>>>>>
>>>>>> 7. Ran 4K Seq test for similar QD for 15 min
>>>>>>
>>>>>> 8. Ran 16K RW test again for 15min
>>>>>>
>>>>>> For filestore test, I ran tests after preconditioning the
>>>>>> entire image
>> first.
>>>>>>
>>>>>> Each sheet on the xls have different block size result , I
>>>>>> often miss to navigate through the xls sheets , so, thought of
>>>>>> mentioning here
>>>>>> :-)
>>>>>>
>>>>>> I have also captured the mkfs time , OSD startup time and the
>>>>>> memory
>>>> usage after the entire run.
>>>>>>
>>>>>> Observation:
>>>>>> ---------------
>>>>>>
>>>>>> 1. First of all, in case of bitmap allocator mkfs time (and
>>>>>> thus cluster
>>>> creation time for 16 OSDs) are ~16X slower than stupid allocator
>>>> and
>>> filestore.
>>>> Each OSD creation is taking ~2min or so sometimes and I nailed
>>>> down the
>>>> insert_free() function call (marked ****) in the Bitmap allocator
>>>> is causing that.
>>>>>>
>>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
>>>>>> enumerate_next start
>>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
>>>>>> enumerate_next
>>>>>> 0x4663d00000~69959451000
>>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
>>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
>>>>>> 0x4663d00000 length 0x69959451000
>>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
>>>>>> bitmapalloc:insert_free instance 139913322803328 off
>>>>>> 0x4663d00000 len 0x69959451000****
>>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
>>>>>> enumerate_next
>>>>>> end****
>>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G
>>>>>> in
>>>>>> 1 extents
>>>>>>
>>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
>>>>>> read buffered 0x4a14eb~265 of ^A:5242880+5242880
>>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
>>>>>> got
>>>>>> 613
>>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
>>>>>> enumerate_next
>>>>>> 0x4663d00000~69959451000
>>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
>>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
>>>>>> 0x4663d00000 length 0x69959451000
>>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
>>>>>> bitmapalloc:insert_free instance 139913306273920 off
>>>>>> 0x4663d00000 len
>>>>>> 0x69959451000*****
>>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
>>>>>> enumerate_next end
>>>>>
>>>>> I'm not sure there's any easy fix for this. We can amortize it
>>>>> by feeding
>>>> space to bluefs slowly (so that we don't have to do all the
>>>> inserts at once), but I'm not sure that's really better.
>>>>>
>>>>> [Somnath] I don't know that part of the code, so, may be a dumb
>>> question.
>>>> This is during mkfs() time , so, can't we say to bluefs entire
>>>> space is free ? I can understand for osd mount and all other cases
>>>> we need to feed the free space every time.
>>>>> IMO this is critical to fix as cluster creation time will be
>>>>> number of OSDs * 2
>>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
>>>> compare to
>>>> ~2 min for stupid allocator/filestore.
>>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G
>>>>> and WAL is
>>>> ~1G. I guess the time taking is dependent on data partition size as well (?
>>>>
>>>> Well, we're fundamentally limited by the fact that it's a bitmap,
>>>> and a big chunk of space is "allocated" to bluefs and needs to have 1's set.
>>>>
>>>> sage
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More
>>> majordomo
>>>> info at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Bluestore different allocator performance Vs FileStore
  2016-08-10 22:44     ` Sage Weil
  2016-08-10 22:58       ` Allen Samuels
@ 2016-08-11 12:28       ` Milosz Tanski
  1 sibling, 0 replies; 34+ messages in thread
From: Milosz Tanski @ 2016-08-11 12:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, ceph-devel

On Wed, Aug 10, 2016 at 6:44 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 10 Aug 2016, Somnath Roy wrote:
>> << inline with [Somnath]
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Wednesday, August 10, 2016 2:31 PM
>> To: Somnath Roy
>> Cc: ceph-devel
>> Subject: Re: Bluestore different allocator performance Vs FileStore
>>
>> On Wed, 10 Aug 2016, Somnath Roy wrote:
>> > Hi, I spent some time on evaluating different Bluestore allocator and
>> > freelist performance. Also, tried to gaze the performance difference
>> > of Bluestore and filestore on the similar setup.
>> >
>> > Setup:
>> > --------
>> >
>> > 16 OSDs (8TB Flash) across 2 OSD nodes
>> >
>> > Single pool and single rbd image of 4TB. 2X replication.
>> >
>> > Disabled the exclusive lock feature so that I can run multiple write  jobs in parallel.
>> > rbd_cache is disabled in the client side.
>> > Each test ran for 15 mins.
>> >
>> > Result :
>> > ---------
>> >
>> > Here is the detailed report on this.
>> >
>> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a25
>> > 0cb05986/Bluestore_allocator_comp.xlsx
>> >
>> > Each profile I named based on <allocator>-<freelist> , so in the graph for example "stupid-extent" meaning stupid allocator and extent freelist.
>> >
>> > I ran the test for each of the profile in the following order after creating a fresh rbd image for all the Bluestore test.
>> >
>> > 1. 4K RW for 15 min with 16QD and 10 jobs.
>> >
>> > 2. 16K RW for 15 min with 16QD and 10 jobs.
>> >
>> > 3. 64K RW for 15 min with 16QD and 10 jobs.
>> >
>> > 4. 256K RW for 15 min with 16QD and 10 jobs.
>> >
>> > The above are non-preconditioned case i.e ran before filling up the entire image. The reason is I don't see any reason of filling up the rbd image before like filestore case where it will give stable performance if we fill up the rbd images first. Filling up rbd images in case of filestore will create the files in the filesystem.
>> >
>> > 5. Next, I did precondition the 4TB image with 1M seq write. This is primarily because I want to load BlueStore with more data.
>> >
>> > 6. Ran 4K RW test again (this is called out preconditioned in the
>> > profile) for 15 min
>> >
>> > 7. Ran 4K Seq test for similar QD for 15 min
>> >
>> > 8. Ran 16K RW test again for 15min
>> >
>> > For filestore test, I ran tests after preconditioning the entire image first.
>> >
>> > Each sheet on the xls have different block size result , I often miss
>> > to navigate through the xls sheets , so, thought of mentioning here
>> > :-)
>> >
>> > I have also captured the mkfs time , OSD startup time and the memory usage after the entire run.
>> >
>> > Observation:
>> > ---------------
>> >
>> > 1. First of all, in case of bitmap allocator mkfs time (and thus cluster creation time for 16 OSDs) are ~16X slower than stupid allocator and filestore. Each OSD creation is taking ~2min or so sometimes and I nailed down the insert_free() function call (marked ****) in the Bitmap allocator is causing that.
>> >
>> > 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist enumerate_next
>> > start
>> > 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist enumerate_next
>> > 0x4663d00000~69959451000
>> > 2016-08-05 16:12:40.975555 7f4024d258c0 10 bitmapalloc:init_add_free
>> > instance 139913322803328 offset 0x4663d00000 length 0x69959451000
>> > ****2016-08-05 16:12:40.975557 7f4024d258c0 20 bitmapalloc:insert_free
>> > instance 139913322803328 off 0x4663d00000 len 0x69959451000****
>> > ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist enumerate_next
>> > end****
>> > 2016-08-05 16:13:20.748978 7f4024d258c0 10
>> > bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in 1
>> > extents
>> >
>> > 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random read
>> > buffered 0x4a14eb~265 of ^A:5242880+5242880
>> > 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random got 613
>> > 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist enumerate_next
>> > 0x4663d00000~69959451000
>> > 2016-08-05 16:13:23.438664 7f4024d258c0 10 bitmapalloc:init_add_free
>> > instance 139913306273920 offset 0x4663d00000 length 0x69959451000
>> > *****2016-08-05 16:13:23.438666 7f4024d258c0 20
>> > bitmapalloc:insert_free instance 139913306273920 off 0x4663d00000 len
>> > 0x69959451000*****
>> > *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
>> > enumerate_next end
>>
>> I'm not sure there's any easy fix for this. We can amortize it by feeding space to bluefs slowly (so that we don't have to do all the inserts at once), but I'm not sure that's really better.
>>
>> [Somnath] I don't know that part of the code, so, may be a dumb question. This is during mkfs() time , so, can't we say to bluefs entire space is free ? I can understand for osd mount and all other cases we need to feed the free space every time.
>> IMO this is critical to fix as cluster creation time will be number of OSDs * 2 min otherwise. For me creating 16 OSDs cluster is taking ~32min compare to ~2 min for stupid allocator/filestore.
>> BTW, my drive data partition is ~6.9TB , db partition is ~100G and WAL is ~1G. I guess the time taking is dependent on data partition size as well (?
>
> Well, we're fundamentally limited by the fact that it's a bitmap, and a
> big chunk of space is "allocated" to bluefs and needs to have 1's set.

There's been a lot of research into compressed bitmaps (on disk and in
memory) over the last 10 years, stemming from database index research.
Some of them can be decompressed at near-memcpy speeds.

The current "best" compressed bitmap format when you need to edit it in
place is Roaring bitmaps. Link: http://roaringbitmap.org/ , and links to
the research: http://arxiv.org/pdf/1603.06549.pdf ,
http://arxiv.org/pdf/1402.6407.pdf .

This could be useful not only for creation of the partition but also for
minimizing memory usage at runtime. And in the default case you can
reserve as much space as the worst-case (uncompressed) bitmap needs, but
in most cases end up using only a fraction of it.
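
To make that concrete, here is a tiny run-length sketch. It has nothing to
do with Roaring's actual container format and is not the BlueStore freelist
code; it just shows why a compressed representation makes "the whole device
is free" cheap: one insert instead of billions of individual bit writes.

// Illustrative sketch: a run-length ("extent") view of free space.
#include <cstdint>
#include <iostream>
#include <map>

struct FreeRuns {
  std::map<uint64_t, uint64_t> runs;            // offset -> length of free run

  void add_free(uint64_t off, uint64_t len) {   // no coalescing, kept minimal
    runs[off] = len;
  }
  uint64_t total_free() const {
    uint64_t sum = 0;
    for (const auto& r : runs) sum += r.second;
    return sum;
  }
};

int main() {
  FreeRuns fl;
  // One insert covers the whole data partition (numbers from the log above):
  fl.add_free(0x4663d00000ull, 0x69959451000ull);
  std::cout << "free bytes: " << fl.total_free()   // ~7.26e12, i.e. the 6757 G
            << " in " << fl.runs.size() << " runs\n";
  return 0;
}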

>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Bluestore different allocator performance Vs FileStore
  2016-08-11 11:24             ` Mark Nelson
@ 2016-08-11 14:06               ` Ben England
  2016-08-11 17:07                 ` Allen Samuels
  0 siblings, 1 reply; 34+ messages in thread
From: Ben England @ 2016-08-11 14:06 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Somnath Roy, Ramesh Chander, Allen Samuels, Sage Weil,
	ceph-devel, Sebastien Han

cc'ing Sebastien Han... 

does "ceph-disk prepare" support completely parallel operation?

The only CBT constraint on parallel OSD creation that I'm aware of is that CBT had to serialize the "ceph osd create" command, so that it knew what OSD number it created and what UUID it mapped to.  But even here it could have gotten the OSD number from the "ceph osd create" output, and since this step only takes about a second it was not a problem.  Everything else can run in parallel, and does, in CBT.  For any one OSD, the creation steps are serialized within an OSD creation thread, but OSDs are created in parallel once "ceph osd create" has run.  CBT does not use ceph-disk, so there is a chance that ceph-ansible, which depends on ceph-disk, has different constraints.

ceph-ansible operates in parallel across OSD hosts, but within an OSD host it is one OSD at a time at present.  The bigger your OSD host count, the more parallelized this can be, although ansible has a default fan-out of 5 hosts, which isn't enough; I run it with a much higher fan-out and haven't seen any problems so far.

For example, on a server with 36 drives this is a bit irritating.  As Somnath said, the mkfs step (i.e. ceph-disk prepare) is the biggest consumer of time, but at least it is bounded by the number of drives per server.  See this line in ceph-ansible that runs ceph-disk prepare:

https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/tasks/scenarios/raw_multi_journal.yml#L6

Ansible does appear to have some support for background tasks; see the bottom of this page:

http://docs.ansible.com/ansible/playbooks_async.html

Sebastien, is there some way to fire up "ceph-disk prepare" for each device as separate asynchronous tasks and then wait for them all to complete before proceeding?  In the worst case, ceph-ansible could launch a shell script that backgrounds the ceph-disk prepare processes within a host and then waits for all of them to complete.  I'll try to follow up on this; I was not sure whether this was still an issue in bluestore until I saw this e-mail.
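
The shape I have in mind is just "launch them all, then wait".  A rough
sketch of that shape follows; in practice it would be a shell script or
ansible async tasks rather than a compiled helper, and the device list and
the exact ceph-disk arguments here are placeholders.

// Rough sketch of the "background them all, then wait" pattern.
#include <cstdlib>
#include <future>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> devs = {"/dev/sdb", "/dev/sdc", "/dev/sdd"};
  std::vector<std::future<int>> jobs;
  for (const auto& d : devs) {
    // Kick off each "ceph-disk prepare" concurrently instead of serially.
    jobs.push_back(std::async(std::launch::async, [d] {
      return std::system(("ceph-disk prepare " + d).c_str());
    }));
  }
  int rc = 0;
  for (auto& j : jobs)     // wait for all of them to complete
    rc |= j.get();
  return rc;
}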

-ben

----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Ramesh Chander" <Ramesh.Chander@sandisk.com>, "Allen Samuels"
> <Allen.Samuels@sandisk.com>, "Sage Weil" <sage@newdream.net>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Ben England" <bengland@redhat.com>
> Sent: Thursday, August 11, 2016 7:24:42 AM
> Subject: Re: Bluestore different allocator performance Vs FileStore
> 
> Ben England added parallel OSD creation to CBT a while back which
> greatly speed up cluster creation time (not just for the bitmap
> alloctaor).  I'm not sure if ceph-ansible creates OSDs in parallel, but
> if not he might have some insights into how easy it would be to improve it.
> 
> Mark
> 
> On 08/11/2016 02:11 AM, Somnath Roy wrote:
> > Yes, we can create OSDs in parallel but I am not sure how many people are
> > creating cluster like that as ceph-deploy end there is no interface for
> > that.
> > FYI, we have introduced some parallelism in SanDisk wrapper script for
> > installer based on ceph-deploy.
> > I don't think even with all these parallel OSD creation, this problem will
> > go away but for sure will be reduced  a bit as we have seen in case of OSD
> > start time since it is inherently parallel.
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Ramesh Chander
> > Sent: Wednesday, August 10, 2016 11:07 PM
> > To: Allen Samuels; Sage Weil; Somnath Roy
> > Cc: ceph-devel
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > Somnath,
> >
> > Basically mkfs time has increased from 7.5 seconds (2min / 16) to 2 minutes
> > ( 32 / 16).
> >
> > But is there a reason you should create osds in serial? I think for
> > mmultiple osds mkfs can happen in parallel?
> >
> > As a fix I am looking to batch multiple insert_free calls for now. If still
> > that does not help, thinking of doing insert_free on different part of
> > device in parallel.
> >
> > -Ramesh
> >
> >> -----Original Message-----
> >> From: Ramesh Chander
> >> Sent: Thursday, August 11, 2016 10:04 AM
> >> To: Allen Samuels; Sage Weil; Somnath Roy
> >> Cc: ceph-devel
> >> Subject: RE: Bluestore different allocator performance Vs FileStore
> >>
> >> I think insert_free is limited by speed of function clear_bits here.
> >>
> >> Though set_bits and clear_bits have same logic except one sets and
> >> another clears. Both of these does 64 bits (bitmap size) at a time.
> >>
> >> I am not sure if doing memset will make it faster. But if we can do it
> >> for group of bitmaps, then it might help.
> >>
> >> I am looking in to code if we can handle mkfs and osd mount in special
> >> way to make it faster.
> >>
> >> If I don't find an easy fix, we can go to path of deferring init to
> >> later stage as and when required.
> >>
> >> -Ramesh
> >>
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> >>> Sent: Thursday, August 11, 2016 4:28 AM
> >>> To: Sage Weil; Somnath Roy
> >>> Cc: ceph-devel
> >>> Subject: RE: Bluestore different allocator performance Vs FileStore
> >>>
> >>> We always knew that startup time for bitmap stuff would be somewhat
> >>> longer. Still, the existing implementation can be speeded up
> >>> significantly. The code in BitMapZone::set_blocks_used isn't very
> >>> optimized. Converting it to use memset for all but the first/last
> >>> bytes
> >> should significantly speed it up.
> >>>
> >>>
> >>> Allen Samuels
> >>> SanDisk |a Western Digital brand
> >>> 2880 Junction Avenue, San Jose, CA 95134
> >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >>>> Subject: RE: Bluestore different allocator performance Vs
> >>>> FileStore
> >>>>
> >>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> >>>>> << inline with [Somnath]
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Sage Weil [mailto:sage@newdream.net]
> >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> >>>>> To: Somnath Roy
> >>>>> Cc: ceph-devel
> >>>>> Subject: Re: Bluestore different allocator performance Vs
> >>>>> FileStore
> >>>>>
> >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> >>>>>> Hi, I spent some time on evaluating different Bluestore
> >>>>>> allocator and freelist performance. Also, tried to gaze the
> >>>>>> performance difference of Bluestore and filestore on the
> >>>>>> similar
> >> setup.
> >>>>>>
> >>>>>> Setup:
> >>>>>> --------
> >>>>>>
> >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> >>>>>>
> >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> >>>>>>
> >>>>>> Disabled the exclusive lock feature so that I can run multiple
> >>>>>> write  jobs in
> >>>> parallel.
> >>>>>> rbd_cache is disabled in the client side.
> >>>>>> Each test ran for 15 mins.
> >>>>>>
> >>>>>> Result :
> >>>>>> ---------
> >>>>>>
> >>>>>> Here is the detailed report on this.
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> >>>>>>
> >>>>>> Each profile I named based on <allocator>-<freelist> , so in
> >>>>>> the graph for
> >>>> example "stupid-extent" meaning stupid allocator and extent freelist.
> >>>>>>
> >>>>>> I ran the test for each of the profile in the following order
> >>>>>> after creating a
> >>>> fresh rbd image for all the Bluestore test.
> >>>>>>
> >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> The above are non-preconditioned case i.e ran before filling
> >>>>>> up the entire
> >>>> image. The reason is I don't see any reason of filling up the rbd
> >>>> image before like filestore case where it will give stable
> >>>> performance if we fill up the rbd images first. Filling up rbd
> >>>> images in case of filestore will create the files in the filesystem.
> >>>>>>
> >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> >>>>>> This is
> >>>> primarily because I want to load BlueStore with more data.
> >>>>>>
> >>>>>> 6. Ran 4K RW test again (this is called out preconditioned in
> >>>>>> the
> >>>>>> profile) for 15 min
> >>>>>>
> >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> >>>>>>
> >>>>>> 8. Ran 16K RW test again for 15min
> >>>>>>
> >>>>>> For filestore test, I ran tests after preconditioning the
> >>>>>> entire image
> >> first.
> >>>>>>
> >>>>>> Each sheet on the xls have different block size result , I
> >>>>>> often miss to navigate through the xls sheets , so, thought of
> >>>>>> mentioning here
> >>>>>> :-)
> >>>>>>
> >>>>>> I have also captured the mkfs time , OSD startup time and the
> >>>>>> memory
> >>>> usage after the entire run.
> >>>>>>
> >>>>>> Observation:
> >>>>>> ---------------
> >>>>>>
> >>>>>> 1. First of all, in case of bitmap allocator mkfs time (and
> >>>>>> thus cluster
> >>>> creation time for 16 OSDs) are ~16X slower than stupid allocator
> >>>> and
> >>> filestore.
> >>>> Each OSD creation is taking ~2min or so sometimes and I nailed
> >>>> down the
> >>>> insert_free() function call (marked ****) in the Bitmap allocator
> >>>> is causing that.
> >>>>>>
> >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> >>>>>> enumerate_next start
> >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> >>>>>> enumerate_next
> >>>>>> 0x4663d00000~69959451000
> >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> >>>>>> 0x4663d00000 length 0x69959451000
> >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> >>>>>> 0x4663d00000 len 0x69959451000****
> >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> >>>>>> enumerate_next
> >>>>>> end****
> >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G
> >>>>>> in
> >>>>>> 1 extents
> >>>>>>
> >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
> >>>>>> read buffered 0x4a14eb~265 of ^A:5242880+5242880
> >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
> >>>>>> got
> >>>>>> 613
> >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> >>>>>> enumerate_next
> >>>>>> 0x4663d00000~69959451000
> >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> >>>>>> 0x4663d00000 length 0x69959451000
> >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> >>>>>> 0x4663d00000 len
> >>>>>> 0x69959451000*****
> >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> >>>>>> enumerate_next end
> >>>>>
> >>>>> I'm not sure there's any easy fix for this. We can amortize it
> >>>>> by feeding
> >>>> space to bluefs slowly (so that we don't have to do all the
> >>>> inserts at once), but I'm not sure that's really better.
> >>>>>
> >>>>> [Somnath] I don't know that part of the code, so, may be a dumb
> >>> question.
> >>>> This is during mkfs() time , so, can't we say to bluefs entire
> >>>> space is free ? I can understand for osd mount and all other cases
> >>>> we need to feed the free space every time.
> >>>>> IMO this is critical to fix as cluster creation time will be
> >>>>> number of OSDs * 2
> >>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
> >>>> compare to
> >>>> ~2 min for stupid allocator/filestore.
> >>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G
> >>>>> and WAL is
> >>>> ~1G. I guess the time taking is dependent on data partition size as well
> >>>> (?
> >>>>
> >>>> Well, we're fundamentally limited by the fact that it's a bitmap,
> >>>> and a big chunk of space is "allocated" to bluefs and needs to have 1's
> >>>> set.
> >>>>
> >>>> sage
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>> in the body of a message to majordomo@vger.kernel.org More
> >>> majordomo
> >>>> info at http://vger.kernel.org/majordomo-info.html
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majordomo@vger.kernel.org More
> >> majordomo
> >>> info at http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Bluestore different allocator performance Vs FileStore
  2016-08-11  6:07         ` Ramesh Chander
  2016-08-11  7:11           ` Somnath Roy
@ 2016-08-11 16:04           ` Allen Samuels
  2016-08-11 16:35             ` Ramesh Chander
  1 sibling, 1 reply; 34+ messages in thread
From: Allen Samuels @ 2016-08-11 16:04 UTC (permalink / raw)
  To: Ramesh Chander; +Cc: Sage Weil, Somnath Roy, ceph-devel

Is the initial creation of the keys for the bitmap done one by one, or are they batched?

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 10, 2016, at 11:07 PM, Ramesh Chander <Ramesh.Chander@sandisk.com> wrote:
> 
> Somnath,
> 
> Basically mkfs time has increased from 7.5 seconds (2min / 16) to 2 minutes ( 32 / 16).
> 
> But is there a reason you should create osds in serial? I think for mmultiple osds mkfs can happen in parallel?
> 
> As a fix I am looking to batch multiple insert_free calls for now. If still that does not help, thinking of doing insert_free on different part of device in parallel.
> 
> -Ramesh 
> 
>> -----Original Message-----
>> From: Ramesh Chander
>> Sent: Thursday, August 11, 2016 10:04 AM
>> To: Allen Samuels; Sage Weil; Somnath Roy
>> Cc: ceph-devel
>> Subject: RE: Bluestore different allocator performance Vs FileStore
>> 
>> I think insert_free is limited by speed of function clear_bits here.
>> 
>> Though set_bits and clear_bits have same logic except one sets and another
>> clears. Both of these does 64 bits (bitmap size) at a time.
>> 
>> I am not sure if doing memset will make it faster. But if we can do it for group
>> of bitmaps, then it might help.
>> 
>> I am looking in to code if we can handle mkfs and osd mount in special way to
>> make it faster.
>> 
>> If I don't find an easy fix, we can go to path of deferring init to later stage as
>> and when required.
>> 
>> -Ramesh
>> 
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Thursday, August 11, 2016 4:28 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel
>>> Subject: RE: Bluestore different allocator performance Vs FileStore
>>> 
>>> We always knew that startup time for bitmap stuff would be somewhat
>>> longer. Still, the existing implementation can be speeded up
>>> significantly. The code in BitMapZone::set_blocks_used isn't very
>>> optimized. Converting it to use memset for all but the first/last bytes
>> should significantly speed it up.
>>> 
>>> 
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Wednesday, August 10, 2016 3:44 PM
>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
>>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>>>> Subject: RE: Bluestore different allocator performance Vs FileStore
>>>> 
>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
>>>>> << inline with [Somnath]
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sage@newdream.net]
>>>>> Sent: Wednesday, August 10, 2016 2:31 PM
>>>>> To: Somnath Roy
>>>>> Cc: ceph-devel
>>>>> Subject: Re: Bluestore different allocator performance Vs
>>>>> FileStore
>>>>> 
>>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
>>>>>> Hi, I spent some time on evaluating different Bluestore
>>>>>> allocator and freelist performance. Also, tried to gaze the
>>>>>> performance difference of Bluestore and filestore on the similar
>> setup.
>>>>>> 
>>>>>> Setup:
>>>>>> --------
>>>>>> 
>>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
>>>>>> 
>>>>>> Single pool and single rbd image of 4TB. 2X replication.
>>>>>> 
>>>>>> Disabled the exclusive lock feature so that I can run multiple
>>>>>> write  jobs in
>>>> parallel.
>>>>>> rbd_cache is disabled in the client side.
>>>>>> Each test ran for 15 mins.
>>>>>> 
>>>>>> Result :
>>>>>> ---------
>>>>>> 
>>>>>> Here is the detailed report on this.
>> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
>>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
>>>>>> 
>>>>>> Each profile I named based on <allocator>-<freelist> , so in the
>>>>>> graph for
>>>> example "stupid-extent" meaning stupid allocator and extent freelist.
>>>>>> 
>>>>>> I ran the test for each of the profile in the following order
>>>>>> after creating a
>>>> fresh rbd image for all the Bluestore test.
>>>>>> 
>>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
>>>>>> 
>>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
>>>>>> 
>>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
>>>>>> 
>>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
>>>>>> 
>>>>>> The above are non-preconditioned case i.e ran before filling up
>>>>>> the entire
>>>> image. The reason is I don't see any reason of filling up the rbd
>>>> image before like filestore case where it will give stable
>>>> performance if we fill up the rbd images first. Filling up rbd
>>>> images in case of filestore will create the files in the filesystem.
>>>>>> 
>>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
>>>>>> This is
>>>> primarily because I want to load BlueStore with more data.
>>>>>> 
>>>>>> 6. Ran 4K RW test again (this is called out preconditioned in
>>>>>> the
>>>>>> profile) for 15 min
>>>>>> 
>>>>>> 7. Ran 4K Seq test for similar QD for 15 min
>>>>>> 
>>>>>> 8. Ran 16K RW test again for 15min
>>>>>> 
>>>>>> For filestore test, I ran tests after preconditioning the entire image
>> first.
>>>>>> 
>>>>>> Each sheet on the xls have different block size result , I often
>>>>>> miss to navigate through the xls sheets , so, thought of
>>>>>> mentioning here
>>>>>> :-)
>>>>>> 
>>>>>> I have also captured the mkfs time , OSD startup time and the
>>>>>> memory
>>>> usage after the entire run.
>>>>>> 
>>>>>> Observation:
>>>>>> ---------------
>>>>>> 
>>>>>> 1. First of all, in case of bitmap allocator mkfs time (and thus
>>>>>> cluster
>>>> creation time for 16 OSDs) are ~16X slower than stupid allocator and
>>> filestore.
>>>> Each OSD creation is taking ~2min or so sometimes and I nailed down
>>>> the
>>>> insert_free() function call (marked ****) in the Bitmap allocator is
>>>> causing that.
>>>>>> 
>>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
>>>>>> enumerate_next start
>>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
>>>>>> enumerate_next
>>>>>> 0x4663d00000~69959451000
>>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
>>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
>>>>>> 0x4663d00000 length 0x69959451000
>>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
>>>>>> bitmapalloc:insert_free instance 139913322803328 off
>>>>>> 0x4663d00000 len 0x69959451000****
>>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
>>>>>> enumerate_next
>>>>>> end****
>>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
>>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in
>>>>>> 1 extents
>>>>>> 
>>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
>>>>>> read buffered 0x4a14eb~265 of ^A:5242880+5242880
>>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
>>>>>> got
>>>>>> 613
>>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
>>>>>> enumerate_next
>>>>>> 0x4663d00000~69959451000
>>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
>>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
>>>>>> 0x4663d00000 length 0x69959451000
>>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
>>>>>> bitmapalloc:insert_free instance 139913306273920 off
>>>>>> 0x4663d00000 len
>>>>>> 0x69959451000*****
>>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
>>>>>> enumerate_next end
>>>>> 
>>>>> I'm not sure there's any easy fix for this. We can amortize it by
>>>>> feeding
>>>> space to bluefs slowly (so that we don't have to do all the inserts
>>>> at once), but I'm not sure that's really better.
>>>>> 
>>>>> [Somnath] I don't know that part of the code, so, may be a dumb
>>> question.
>>>> This is during mkfs() time , so, can't we say to bluefs entire space
>>>> is free ? I can understand for osd mount and all other cases we need
>>>> to feed the free space every time.
>>>>> IMO this is critical to fix as cluster creation time will be
>>>>> number of OSDs * 2
>>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
>>>> compare to
>>>> ~2 min for stupid allocator/filestore.
>>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G and
>>>>> WAL is
>>>> ~1G. I guess the time taking is dependent on data partition size as well (?
>>>> 
>>>> Well, we're fundamentally limited by the fact that it's a bitmap,
>>>> and a big chunk of space is "allocated" to bluefs and needs to have 1's set.
>>>> 
>>>> sage
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More
>>> majordomo
>>>> info at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>>> info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 16:04           ` Allen Samuels
@ 2016-08-11 16:35             ` Ramesh Chander
  2016-08-11 16:38               ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Ramesh Chander @ 2016-08-11 16:35 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Somnath Roy, ceph-devel

I think the freelist does not initialize all keys at mkfs time; it only sets keys that have some allocations.

The remaining keys are assumed to be all 0's when the key does not exist.

The bitmap allocator's insert_free is done on groups of free bits together (possibly covering more than one bitmap freelist key at a time).
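
Roughly the shape I mean, with made-up names (the real BitmapFreelistManager
encoding differs in detail): a key is written only for a region whose bitmap
is non-zero, and an absent key stands for an all-zero region, which is why
mkfs does not have to write one key per region.

// Made-up sketch of a sparse, key-per-region bitmap.
#include <cstdint>
#include <map>
#include <vector>

struct SparseBitmap {
  static constexpr uint64_t kBitsPerKey = 128 * 8;   // one small blob per key
  std::map<uint64_t, std::vector<uint8_t>> kv;       // region index -> bitmap blob

  void set_bit(uint64_t bit) {
    auto& blob = kv[bit / kBitsPerKey];              // key created only on demand
    if (blob.empty())
      blob.resize(kBitsPerKey / 8, 0);
    blob[(bit % kBitsPerKey) / 8] |= 1u << (bit % 8);
  }
  bool test_bit(uint64_t bit) const {
    auto it = kv.find(bit / kBitsPerKey);
    if (it == kv.end())
      return false;                                  // absent key == all zeros
    return (it->second[(bit % kBitsPerKey) / 8] >> (bit % 8)) & 1;
  }
  size_t keys_written() const { return kv.size(); }  // stays tiny right after mkfs
};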

-Ramesh

> -----Original Message-----
> From: Allen Samuels
> Sent: Thursday, August 11, 2016 9:34 PM
> To: Ramesh Chander
> Cc: Sage Weil; Somnath Roy; ceph-devel
> Subject: Re: Bluestore different allocator performance Vs FileStore
>
> Is the initial creation of the keys for the bitmap one by one or are they
> batched?
>
> Sent from my iPhone. Please excuse all typos and autocorrects.
>
> > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> <Ramesh.Chander@sandisk.com> wrote:
> >
> > Somnath,
> >
> > Basically mkfs time has increased from 7.5 seconds (2min / 16) to 2 minutes
> ( 32 / 16).
> >
> > But is there a reason you should create osds in serial? I think for mmultiple
> osds mkfs can happen in parallel?
> >
> > As a fix I am looking to batch multiple insert_free calls for now. If still that
> does not help, thinking of doing insert_free on different part of device in
> parallel.
> >
> > -Ramesh
> >
> >> -----Original Message-----
> >> From: Ramesh Chander
> >> Sent: Thursday, August 11, 2016 10:04 AM
> >> To: Allen Samuels; Sage Weil; Somnath Roy
> >> Cc: ceph-devel
> >> Subject: RE: Bluestore different allocator performance Vs FileStore
> >>
> >> I think insert_free is limited by speed of function clear_bits here.
> >>
> >> Though set_bits and clear_bits have same logic except one sets and
> >> another clears. Both of these does 64 bits (bitmap size) at a time.
> >>
> >> I am not sure if doing memset will make it faster. But if we can do
> >> it for group of bitmaps, then it might help.
> >>
> >> I am looking in to code if we can handle mkfs and osd mount in
> >> special way to make it faster.
> >>
> >> If I don't find an easy fix, we can go to path of deferring init to
> >> later stage as and when required.
> >>
> >> -Ramesh
> >>
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> >>> Sent: Thursday, August 11, 2016 4:28 AM
> >>> To: Sage Weil; Somnath Roy
> >>> Cc: ceph-devel
> >>> Subject: RE: Bluestore different allocator performance Vs FileStore
> >>>
> >>> We always knew that startup time for bitmap stuff would be somewhat
> >>> longer. Still, the existing implementation can be speeded up
> >>> significantly. The code in BitMapZone::set_blocks_used isn't very
> >>> optimized. Converting it to use memset for all but the first/last
> >>> bytes
> >> should significantly speed it up.
> >>>
> >>>
> >>> Allen Samuels
> >>> SanDisk |a Western Digital brand
> >>> 2880 Junction Avenue, San Jose, CA 95134
> >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >>>> Subject: RE: Bluestore different allocator performance Vs FileStore
> >>>>
> >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> >>>>> << inline with [Somnath]
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Sage Weil [mailto:sage@newdream.net]
> >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> >>>>> To: Somnath Roy
> >>>>> Cc: ceph-devel
> >>>>> Subject: Re: Bluestore different allocator performance Vs
> >>>>> FileStore
> >>>>>
> >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> >>>>>> Hi, I spent some time on evaluating different Bluestore allocator
> >>>>>> and freelist performance. Also, tried to gaze the performance
> >>>>>> difference of Bluestore and filestore on the similar
> >> setup.
> >>>>>>
> >>>>>> Setup:
> >>>>>> --------
> >>>>>>
> >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> >>>>>>
> >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> >>>>>>
> >>>>>> Disabled the exclusive lock feature so that I can run multiple
> >>>>>> write  jobs in
> >>>> parallel.
> >>>>>> rbd_cache is disabled in the client side.
> >>>>>> Each test ran for 15 mins.
> >>>>>>
> >>>>>> Result :
> >>>>>> ---------
> >>>>>>
> >>>>>> Here is the detailed report on this.
> >>
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> >>>>>>
> >>>>>> Each profile I named based on <allocator>-<freelist> , so in the
> >>>>>> graph for
> >>>> example "stupid-extent" meaning stupid allocator and extent freelist.
> >>>>>>
> >>>>>> I ran the test for each of the profile in the following order
> >>>>>> after creating a
> >>>> fresh rbd image for all the Bluestore test.
> >>>>>>
> >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> >>>>>>
> >>>>>> The above are non-preconditioned case i.e ran before filling up
> >>>>>> the entire
> >>>> image. The reason is I don't see any reason of filling up the rbd
> >>>> image before like filestore case where it will give stable
> >>>> performance if we fill up the rbd images first. Filling up rbd
> >>>> images in case of filestore will create the files in the filesystem.
> >>>>>>
> >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> >>>>>> This is
> >>>> primarily because I want to load BlueStore with more data.
> >>>>>>
> >>>>>> 6. Ran 4K RW test again (this is called out preconditioned in the
> >>>>>> profile) for 15 min
> >>>>>>
> >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> >>>>>>
> >>>>>> 8. Ran 16K RW test again for 15min
> >>>>>>
> >>>>>> For filestore test, I ran tests after preconditioning the entire
> >>>>>> image
> >> first.
> >>>>>>
> >>>>>> Each sheet on the xls have different block size result , I often
> >>>>>> miss to navigate through the xls sheets , so, thought of
> >>>>>> mentioning here
> >>>>>> :-)
> >>>>>>
> >>>>>> I have also captured the mkfs time , OSD startup time and the
> >>>>>> memory
> >>>> usage after the entire run.
> >>>>>>
> >>>>>> Observation:
> >>>>>> ---------------
> >>>>>>
> >>>>>> 1. First of all, in case of bitmap allocator mkfs time (and thus
> >>>>>> cluster
> >>>> creation time for 16 OSDs) are ~16X slower than stupid allocator
> >>>> and
> >>> filestore.
> >>>> Each OSD creation is taking ~2min or so sometimes and I nailed down
> >>>> the
> >>>> insert_free() function call (marked ****) in the Bitmap allocator
> >>>> is causing that.
> >>>>>>
> >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> >>>>>> enumerate_next start
> >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> >>>>>> enumerate_next
> >>>>>> 0x4663d00000~69959451000
> >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> >>>>>> 0x4663d00000 length 0x69959451000
> >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> >>>>>> 0x4663d00000 len 0x69959451000****
> >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> >>>>>> enumerate_next
> >>>>>> end****
> >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in
> >>>>>> 1 extents
> >>>>>>
> >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
> >>>>>> read buffered 0x4a14eb~265 of ^A:5242880+5242880
> >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
> >>>>>> got
> >>>>>> 613
> >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> >>>>>> enumerate_next
> >>>>>> 0x4663d00000~69959451000
> >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> >>>>>> 0x4663d00000 length 0x69959451000
> >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> >>>>>> 0x4663d00000 len
> >>>>>> 0x69959451000*****
> >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> >>>>>> enumerate_next end
> >>>>>
> >>>>> I'm not sure there's any easy fix for this. We can amortize it by
> >>>>> feeding
> >>>> space to bluefs slowly (so that we don't have to do all the inserts
> >>>> at once), but I'm not sure that's really better.
> >>>>>
> >>>>> [Somnath] I don't know that part of the code, so, may be a dumb
> >>> question.
> >>>> This is during mkfs() time , so, can't we say to bluefs entire
> >>>> space is free ? I can understand for osd mount and all other cases
> >>>> we need to feed the free space every time.
> >>>>> IMO this is critical to fix as cluster creation time will be
> >>>>> number of OSDs * 2
> >>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
> >>>> compare to
> >>>> ~2 min for stupid allocator/filestore.
> >>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G and
> >>>>> WAL is
> >>>> ~1G. I guess the time taking is dependent on data partition size as well
> (?
> >>>>
> >>>> Well, we're fundamentally limited by the fact that it's a bitmap,
> >>>> and a big chunk of space is "allocated" to bluefs and needs to have 1's
> set.
> >>>>
> >>>> sage
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>> in the body of a message to majordomo@vger.kernel.org More
> >>> majordomo
> >>>> info at http://vger.kernel.org/majordomo-info.html
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majordomo@vger.kernel.org More
> >> majordomo
> >>> info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 16:35             ` Ramesh Chander
@ 2016-08-11 16:38               ` Sage Weil
  2016-08-11 17:05                 ` Allen Samuels
  0 siblings, 1 reply; 34+ messages in thread
From: Sage Weil @ 2016-08-11 16:38 UTC (permalink / raw)
  To: Ramesh Chander; +Cc: Allen Samuels, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Ramesh Chander wrote:
> I think the free list does not initialize all keys at mkfs time; it only 
> sets keys that have some allocations.
> 
> The remaining keys are assumed to be all zeros if they do not exist.

Right.. it's the region "allocated" to bluefs that is consuming the time.
 
> The bitmap allocator insert_free is done in group of free bits 
> together(maybe more than bitmap freelist keys at a time).

I think Allen is asking whether we are doing lots of inserts within a 
single rocksdb transaction, or lots of separate transactions.

FWIW, my guess is that increasing the size of the value (i.e., increasing

OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)

) will probably speed this up.
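
To put rough numbers on that guess: the sketch below is back-of-the-envelope
arithmetic only (not Ceph code), and it assumes a 4 KiB freelist block
granularity, which may not match the actual bdev block size or min_alloc_size
in this setup.  It just shows how the number of freelist keys covering a range
shrinks as bluestore_freelist_blocks_per_key grows.

  #include <cstdint>
  #include <cstdio>

  int main() {
    // ~6757 GiB extent reported by _open_alloc in the log above
    const uint64_t range_bytes = 0x69959451000ull;
    const uint64_t block_size  = 4096;               // assumed granularity
    const uint64_t bpk[]       = {128, 512, 2048};   // candidate blocks_per_key values
    const uint64_t blocks      = range_bytes / block_size;
    for (uint64_t b : bpk) {
      uint64_t keys = (blocks + b - 1) / b;          // keys needed to cover the range
      printf("blocks_per_key=%-5llu -> ~%llu freelist keys\n",
             (unsigned long long)b, (unsigned long long)keys);
    }
    return 0;
  }

At 128 blocks per key that works out to roughly 14 million keys for the extent
above; at 512 it drops to about 3.5 million, which is the kind of reduction a
larger value size would buy.  Whether those keys land in one rocksdb
transaction or many is a separate question, per the batching point above.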

sage


> 
> -Ramesh
> 
> > -----Original Message-----
> > From: Allen Samuels
> > Sent: Thursday, August 11, 2016 9:34 PM
> > To: Ramesh Chander
> > Cc: Sage Weil; Somnath Roy; ceph-devel
> > Subject: Re: Bluestore different allocator performance Vs FileStore
> >
> > Is the initial creation of the keys for the bitmap one by one or are they
> > batched?
> >
> > Sent from my iPhone. Please excuse all typos and autocorrects.
> >
> > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > <Ramesh.Chander@sandisk.com> wrote:
> > >
> > > Somnath,
> > >
> > > Basically mkfs time has increased from 7.5 seconds (2min / 16) to 2 minutes
> > ( 32 / 16).
> > >
> > > But is there a reason you should create osds in serial? I think for mmultiple
> > osds mkfs can happen in parallel?
> > >
> > > As a fix I am looking to batch multiple insert_free calls for now. If still that
> > does not help, thinking of doing insert_free on different part of device in
> > parallel.
> > >
> > > -Ramesh
> > >
> > >> -----Original Message-----
> > >> From: Ramesh Chander
> > >> Sent: Thursday, August 11, 2016 10:04 AM
> > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > >> Cc: ceph-devel
> > >> Subject: RE: Bluestore different allocator performance Vs FileStore
> > >>
> > >> I think insert_free is limited by speed of function clear_bits here.
> > >>
> > >> Though set_bits and clear_bits have same logic except one sets and
> > >> another clears. Both of these does 64 bits (bitmap size) at a time.
> > >>
> > >> I am not sure if doing memset will make it faster. But if we can do
> > >> it for group of bitmaps, then it might help.
> > >>
> > >> I am looking in to code if we can handle mkfs and osd mount in
> > >> special way to make it faster.
> > >>
> > >> If I don't find an easy fix, we can go to path of deferring init to
> > >> later stage as and when required.
> > >>
> > >> -Ramesh
> > >>
> > >>> -----Original Message-----
> > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > >>> To: Sage Weil; Somnath Roy
> > >>> Cc: ceph-devel
> > >>> Subject: RE: Bluestore different allocator performance Vs FileStore
> > >>>
> > >>> We always knew that startup time for bitmap stuff would be somewhat
> > >>> longer. Still, the existing implementation can be speeded up
> > >>> significantly. The code in BitMapZone::set_blocks_used isn't very
> > >>> optimized. Converting it to use memset for all but the first/last
> > >>> bytes
> > >> should significantly speed it up.
> > >>>
> > >>>
> > >>> Allen Samuels
> > >>> SanDisk |a Western Digital brand
> > >>> 2880 Junction Avenue, San Jose, CA 95134
> > >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > >>>> Subject: RE: Bluestore different allocator performance Vs FileStore
> > >>>>
> > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > >>>>> << inline with [Somnath]
> > >>>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > >>>>> To: Somnath Roy
> > >>>>> Cc: ceph-devel
> > >>>>> Subject: Re: Bluestore different allocator performance Vs
> > >>>>> FileStore
> > >>>>>
> > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > >>>>>> Hi, I spent some time on evaluating different Bluestore allocator
> > >>>>>> and freelist performance. Also, tried to gaze the performance
> > >>>>>> difference of Bluestore and filestore on the similar
> > >> setup.
> > >>>>>>
> > >>>>>> Setup:
> > >>>>>> --------
> > >>>>>>
> > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > >>>>>>
> > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > >>>>>>
> > >>>>>> Disabled the exclusive lock feature so that I can run multiple
> > >>>>>> write  jobs in
> > >>>> parallel.
> > >>>>>> rbd_cache is disabled in the client side.
> > >>>>>> Each test ran for 15 mins.
> > >>>>>>
> > >>>>>> Result :
> > >>>>>> ---------
> > >>>>>>
> > >>>>>> Here is the detailed report on this.
> > >>
> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > >>>>>>
> > >>>>>> Each profile I named based on <allocator>-<freelist> , so in the
> > >>>>>> graph for
> > >>>> example "stupid-extent" meaning stupid allocator and extent freelist.
> > >>>>>>
> > >>>>>> I ran the test for each of the profile in the following order
> > >>>>>> after creating a
> > >>>> fresh rbd image for all the Bluestore test.
> > >>>>>>
> > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> The above are non-preconditioned case i.e ran before filling up
> > >>>>>> the entire
> > >>>> image. The reason is I don't see any reason of filling up the rbd
> > >>>> image before like filestore case where it will give stable
> > >>>> performance if we fill up the rbd images first. Filling up rbd
> > >>>> images in case of filestore will create the files in the filesystem.
> > >>>>>>
> > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> > >>>>>> This is
> > >>>> primarily because I want to load BlueStore with more data.
> > >>>>>>
> > >>>>>> 6. Ran 4K RW test again (this is called out preconditioned in the
> > >>>>>> profile) for 15 min
> > >>>>>>
> > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > >>>>>>
> > >>>>>> 8. Ran 16K RW test again for 15min
> > >>>>>>
> > >>>>>> For filestore test, I ran tests after preconditioning the entire
> > >>>>>> image
> > >> first.
> > >>>>>>
> > >>>>>> Each sheet on the xls have different block size result , I often
> > >>>>>> miss to navigate through the xls sheets , so, thought of
> > >>>>>> mentioning here
> > >>>>>> :-)
> > >>>>>>
> > >>>>>> I have also captured the mkfs time , OSD startup time and the
> > >>>>>> memory
> > >>>> usage after the entire run.
> > >>>>>>
> > >>>>>> Observation:
> > >>>>>> ---------------
> > >>>>>>
> > >>>>>> 1. First of all, in case of bitmap allocator mkfs time (and thus
> > >>>>>> cluster
> > >>>> creation time for 16 OSDs) are ~16X slower than stupid allocator
> > >>>> and
> > >>> filestore.
> > >>>> Each OSD creation is taking ~2min or so sometimes and I nailed down
> > >>>> the
> > >>>> insert_free() function call (marked ****) in the Bitmap allocator
> > >>>> is causing that.
> > >>>>>>
> > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next start
> > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next
> > >>>>>> 0x4663d00000~69959451000
> > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> > >>>>>> 0x4663d00000 length 0x69959451000
> > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > >>>>>> 0x4663d00000 len 0x69959451000****
> > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next
> > >>>>>> end****
> > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G in
> > >>>>>> 1 extents
> > >>>>>>
> > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
> > >>>>>> read buffered 0x4a14eb~265 of ^A:5242880+5242880
> > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
> > >>>>>> got
> > >>>>>> 613
> > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next
> > >>>>>> 0x4663d00000~69959451000
> > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> > >>>>>> 0x4663d00000 length 0x69959451000
> > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > >>>>>> 0x4663d00000 len
> > >>>>>> 0x69959451000*****
> > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next end
> > >>>>>
> > >>>>> I'm not sure there's any easy fix for this. We can amortize it by
> > >>>>> feeding
> > >>>> space to bluefs slowly (so that we don't have to do all the inserts
> > >>>> at once), but I'm not sure that's really better.
> > >>>>>
> > >>>>> [Somnath] I don't know that part of the code, so, may be a dumb
> > >>> question.
> > >>>> This is during mkfs() time , so, can't we say to bluefs entire
> > >>>> space is free ? I can understand for osd mount and all other cases
> > >>>> we need to feed the free space every time.
> > >>>>> IMO this is critical to fix as cluster creation time will be
> > >>>>> number of OSDs * 2
> > >>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
> > >>>> compare to
> > >>>> ~2 min for stupid allocator/filestore.
> > >>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G and
> > >>>>> WAL is
> > >>>> ~1G. I guess the time taking is dependent on data partition size as well
> > (?
> > >>>>
> > >>>> Well, we're fundamentally limited by the fact that it's a bitmap,
> > >>>> and a big chunk of space is "allocated" to bluefs and needs to have 1's
> > set.
> > >>>>
> > >>>> sage
> > >>>> --
> > >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>>> in the body of a message to majordomo@vger.kernel.org More
> > >>> majordomo
> > >>>> info at http://vger.kernel.org/majordomo-info.html
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>> in the body of a message to majordomo@vger.kernel.org More
> > >> majordomo
> > >>> info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 16:38               ` Sage Weil
@ 2016-08-11 17:05                 ` Allen Samuels
  2016-08-11 17:15                   ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Allen Samuels @ 2016-08-11 17:05 UTC (permalink / raw)
  To: Sage Weil, Ramesh Chander; +Cc: Somnath Roy, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, August 11, 2016 9:38 AM
> To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > I think the free list does not initialize all keys at mkfs time; it
> > only sets keys that have some allocations.
> >
> > The remaining keys are assumed to be all zeros if they do not exist.
> 
> Right.. it's the region "allocated" to bluefs that is consuming the time.
> 
> > The bitmap allocator insert_free is done in group of free bits
> > together(maybe more than bitmap freelist keys at a time).
> 
> I think Allen is asking whether we are doing lots of inserts within a single
> rocksdb transaction, or lots of separate transactions.
> 
> FWIW, my guess is that increasing the size of the value (i.e., increasing
> 
> OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> 
> ) will probably speed this up.

If your assumption (> Right.. it's the region "allocated" to bluefs that is consuming the time) is correct, then I don't understand why this parameter would have any effect on the problem.

Aren't we reading BlueFS extents and setting them in the BitMapAllocator? That doesn't care about the chunking of bitmap bits into KV keys.

I would be cautious about changing this option just to address this problem (though as an experiment, we can change the value and see if it has ANY effect on this problem -- which I don't think it will). The value of this option really needs to be dictated by its effect on the more mainstream read/write operations, not on the initialization problem.
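
To make the distinction concrete, here is a toy sketch (hypothetical code, not
the actual BitMapAllocator or FreelistManager implementations) of the two
costs being discussed: flipping bits in the allocator's in-memory bitmap,
whose cost depends only on the size of the range, versus persisting the same
range as freelist keys, which is the only path where
bluestore_freelist_blocks_per_key matters.

  #include <cstdint>
  #include <cstdio>
  #include <map>
  #include <string>
  #include <vector>

  // Allocator-style path: flip bits in an in-memory bitmap.  Work scales with
  // the number of blocks in the range and ignores any KV chunking.
  void allocator_mark_used(std::vector<uint64_t>& bits,
                           uint64_t first, uint64_t count) {
    for (uint64_t b = first; b < first + count; ++b)
      bits[b >> 6] |= (uint64_t(1) << (b & 63));
  }

  // Freelist-style path: persist the range as KV entries.  Work scales with
  // count / blocks_per_key, so a bigger blocks_per_key means fewer writes.
  void freelist_mark_used(std::map<uint64_t, std::string>& kv, uint64_t first,
                          uint64_t count, uint64_t blocks_per_key) {
    for (uint64_t b = first; b < first + count; b += blocks_per_key)
      kv[b / blocks_per_key] = "<bitmap chunk>";
  }

  int main() {
    std::vector<uint64_t> bits(1024, 0);       // room for 65536 toy blocks
    std::map<uint64_t, std::string> kv;
    allocator_mark_used(bits, 0, 4096);        // ~4096 bit operations
    freelist_mark_used(kv, 0, 4096, 128);      // ~32 key writes
    printf("keys written: %zu\n", kv.size());  // prints 32
    return 0;
  }

That is the crux of the question: if the mkfs time is going into the first
kind of work, the key-size option would not help; if it is going into the
second, it might.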
> 
> sage
> 
> 
> >
> > -Ramesh
> >
> > > -----Original Message-----
> > > From: Allen Samuels
> > > Sent: Thursday, August 11, 2016 9:34 PM
> > > To: Ramesh Chander
> > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > Subject: Re: Bluestore different allocator performance Vs FileStore
> > >
> > > Is the initial creation of the keys for the bitmap one by one or are
> > > they batched?
> > >
> > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > >
> > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > <Ramesh.Chander@sandisk.com> wrote:
> > > >
> > > > Somnath,
> > > >
> > > > Basically mkfs time has increased from 7.5 seconds (2min / 16) to
> > > > 2 minutes
> > > ( 32 / 16).
> > > >
> > > > But is there a reason you should create osds in serial? I think
> > > > for mmultiple
> > > osds mkfs can happen in parallel?
> > > >
> > > > As a fix I am looking to batch multiple insert_free calls for now.
> > > > If still that
> > > does not help, thinking of doing insert_free on different part of
> > > device in parallel.
> > > >
> > > > -Ramesh
> > > >
> > > >> -----Original Message-----
> > > >> From: Ramesh Chander
> > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > >> Cc: ceph-devel
> > > >> Subject: RE: Bluestore different allocator performance Vs
> > > >> FileStore
> > > >>
> > > >> I think insert_free is limited by speed of function clear_bits here.
> > > >>
> > > >> Though set_bits and clear_bits have same logic except one sets
> > > >> and another clears. Both of these does 64 bits (bitmap size) at a time.
> > > >>
> > > >> I am not sure if doing memset will make it faster. But if we can
> > > >> do it for group of bitmaps, then it might help.
> > > >>
> > > >> I am looking in to code if we can handle mkfs and osd mount in
> > > >> special way to make it faster.
> > > >>
> > > >> If I don't find an easy fix, we can go to path of deferring init
> > > >> to later stage as and when required.
> > > >>
> > > >> -Ramesh
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > >>> To: Sage Weil; Somnath Roy
> > > >>> Cc: ceph-devel
> > > >>> Subject: RE: Bluestore different allocator performance Vs
> > > >>> FileStore
> > > >>>
> > > >>> We always knew that startup time for bitmap stuff would be
> > > >>> somewhat longer. Still, the existing implementation can be
> > > >>> speeded up significantly. The code in
> > > >>> BitMapZone::set_blocks_used isn't very optimized. Converting it
> > > >>> to use memset for all but the first/last bytes
> > > >> should significantly speed it up.
> > > >>>
> > > >>>
> > > >>> Allen Samuels
> > > >>> SanDisk |a Western Digital brand
> > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > >>>
> > > >>>
> > > >>>> -----Original Message-----
> > > >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > >>>> Subject: RE: Bluestore different allocator performance Vs
> > > >>>> FileStore
> > > >>>>
> > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > >>>>> << inline with [Somnath]
> > > >>>>>
> > > >>>>> -----Original Message-----
> > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > >>>>> To: Somnath Roy
> > > >>>>> Cc: ceph-devel
> > > >>>>> Subject: Re: Bluestore different allocator performance Vs
> > > >>>>> FileStore
> > > >>>>>
> > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > >>>>>> Hi, I spent some time on evaluating different Bluestore
> > > >>>>>> allocator and freelist performance. Also, tried to gaze the
> > > >>>>>> performance difference of Bluestore and filestore on the
> > > >>>>>> similar
> > > >> setup.
> > > >>>>>>
> > > >>>>>> Setup:
> > > >>>>>> --------
> > > >>>>>>
> > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > >>>>>>
> > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > >>>>>>
> > > >>>>>> Disabled the exclusive lock feature so that I can run
> > > >>>>>> multiple write  jobs in
> > > >>>> parallel.
> > > >>>>>> rbd_cache is disabled in the client side.
> > > >>>>>> Each test ran for 15 mins.
> > > >>>>>>
> > > >>>>>> Result :
> > > >>>>>> ---------
> > > >>>>>>
> > > >>>>>> Here is the detailed report on this.
> > > >>
> > >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > >>>>>>
> > > >>>>>> Each profile I named based on <allocator>-<freelist> , so in
> > > >>>>>> the graph for
> > > >>>> example "stupid-extent" meaning stupid allocator and extent
> freelist.
> > > >>>>>>
> > > >>>>>> I ran the test for each of the profile in the following order
> > > >>>>>> after creating a
> > > >>>> fresh rbd image for all the Bluestore test.
> > > >>>>>>
> > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > >>>>>>
> > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > >>>>>>
> > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > >>>>>>
> > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > >>>>>>
> > > >>>>>> The above are non-preconditioned case i.e ran before filling
> > > >>>>>> up the entire
> > > >>>> image. The reason is I don't see any reason of filling up the
> > > >>>> rbd image before like filestore case where it will give stable
> > > >>>> performance if we fill up the rbd images first. Filling up rbd
> > > >>>> images in case of filestore will create the files in the filesystem.
> > > >>>>>>
> > > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> > > >>>>>> This is
> > > >>>> primarily because I want to load BlueStore with more data.
> > > >>>>>>
> > > >>>>>> 6. Ran 4K RW test again (this is called out preconditioned in
> > > >>>>>> the
> > > >>>>>> profile) for 15 min
> > > >>>>>>
> > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > >>>>>>
> > > >>>>>> 8. Ran 16K RW test again for 15min
> > > >>>>>>
> > > >>>>>> For filestore test, I ran tests after preconditioning the
> > > >>>>>> entire image
> > > >> first.
> > > >>>>>>
> > > >>>>>> Each sheet on the xls have different block size result , I
> > > >>>>>> often miss to navigate through the xls sheets , so, thought
> > > >>>>>> of mentioning here
> > > >>>>>> :-)
> > > >>>>>>
> > > >>>>>> I have also captured the mkfs time , OSD startup time and the
> > > >>>>>> memory
> > > >>>> usage after the entire run.
> > > >>>>>>
> > > >>>>>> Observation:
> > > >>>>>> ---------------
> > > >>>>>>
> > > >>>>>> 1. First of all, in case of bitmap allocator mkfs time (and
> > > >>>>>> thus cluster
> > > >>>> creation time for 16 OSDs) are ~16X slower than stupid
> > > >>>> allocator and
> > > >>> filestore.
> > > >>>> Each OSD creation is taking ~2min or so sometimes and I nailed
> > > >>>> down the
> > > >>>> insert_free() function call (marked ****) in the Bitmap
> > > >>>> allocator is causing that.
> > > >>>>>>
> > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > >>>>>> enumerate_next start
> > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > >>>>>> enumerate_next
> > > >>>>>> 0x4663d00000~69959451000
> > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> > > >>>>>> 0x4663d00000 length 0x69959451000
> > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > >>>>>> enumerate_next
> > > >>>>>> end****
> > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G
> > > >>>>>> in
> > > >>>>>> 1 extents
> > > >>>>>>
> > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> ^A:5242880+5242880
> > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > >>>>>> _read_random got
> > > >>>>>> 613
> > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > >>>>>> enumerate_next
> > > >>>>>> 0x4663d00000~69959451000
> > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> > > >>>>>> 0x4663d00000 length 0x69959451000
> > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > > >>>>>> 0x4663d00000 len
> > > >>>>>> 0x69959451000*****
> > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > >>>>>> enumerate_next end
> > > >>>>>
> > > >>>>> I'm not sure there's any easy fix for this. We can amortize it
> > > >>>>> by feeding
> > > >>>> space to bluefs slowly (so that we don't have to do all the
> > > >>>> inserts at once), but I'm not sure that's really better.
> > > >>>>>
> > > >>>>> [Somnath] I don't know that part of the code, so, may be a
> > > >>>>> dumb
> > > >>> question.
> > > >>>> This is during mkfs() time , so, can't we say to bluefs entire
> > > >>>> space is free ? I can understand for osd mount and all other
> > > >>>> cases we need to feed the free space every time.
> > > >>>>> IMO this is critical to fix as cluster creation time will be
> > > >>>>> number of OSDs * 2
> > > >>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
> > > >>>> compare to
> > > >>>> ~2 min for stupid allocator/filestore.
> > > >>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G
> > > >>>>> and WAL is
> > > >>>> ~1G. I guess the time taking is dependent on data partition
> > > >>>> size as well
> > > (?
> > > >>>>
> > > >>>> Well, we're fundamentally limited by the fact that it's a
> > > >>>> bitmap, and a big chunk of space is "allocated" to bluefs and
> > > >>>> needs to have 1's
> > > set.
> > > >>>>
> > > >>>> sage
> > > >>>> --
> > > >>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> > > >>>> in the body of a message to majordomo@vger.kernel.org More
> > > >>> majordomo
> > > >>>> info at http://vger.kernel.org/majordomo-info.html
> > > >>> --
> > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > >>> in the body of a message to majordomo@vger.kernel.org More
> > > >> majordomo
> > > >>> info at http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 14:06               ` Ben England
@ 2016-08-11 17:07                 ` Allen Samuels
  0 siblings, 0 replies; 34+ messages in thread
From: Allen Samuels @ 2016-08-11 17:07 UTC (permalink / raw)
  To: Ben England, Mark Nelson
  Cc: Somnath Roy, Ramesh Chander, Sage Weil, ceph-devel, Sebastien Han

I'd wait on this until we conclude that there is something fundamental about the long initialization time. IMO, it's likely just something simple to improve in Bluestore.


Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Ben England [mailto:bengland@redhat.com]
> Sent: Thursday, August 11, 2016 7:07 AM
> To: Mark Nelson <mnelson@redhat.com>
> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Ramesh Chander
> <Ramesh.Chander@sandisk.com>; Allen Samuels
> <Allen.Samuels@sandisk.com>; Sage Weil <sage@newdream.net>; ceph-
> devel <ceph-devel@vger.kernel.org>; Sebastien Han <shan@redhat.com>
> Subject: Re: Bluestore different allocator performance Vs FileStore
> 
> cc'ing Sebastien Han...
> 
> does "ceph-disk prepare" support completely parallel operation?
> 
> The only CBT constraint on parallel OSD creation that I'm aware of is that CBT
> had to serialize "ceph osd create" command, so that it knew what OSD
> number it created and what UUID it mapped to.  But even here it could have
> gotten the osd number from "ceph osd create" output.  Since this only takes
> 1 second, this was not a problem.  Everything else can run in parallel, and
> does, in CBT.  For any one OSD, the creation steps are serialized with a OSD
> creation thread, but OSDs are created in parallel once "ceph osd create" has
> run.  Ceph CBT does not use ceph-disk, so there is a chance that ceph-
> ansible, which depends on ceph-disk, has different constraints.
> 
> ceph-ansible operates in parallel across OSD hosts, but within an OSD host it's
> one OSD at a time at present.   The bigger your OSD host count, the more
> parallelized this can be, although ansible has a default fan-out of 5 hosts,
> which isn't enough - I run it with much higher fan-out and haven't seen any
> problems so far.
> 
> For example, for a server with 36 drives, this is a bit irritating.  As Somnath
> said, the mkfs command is the biggest consumer of time.  (i.e. ceph-disk
> prepare).  But at least it is bounded by the number of drives per server.  See
> this line in ceph-ansible that runs ceph-disk prepare.
> 
> https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-
> osd/tasks/scenarios/raw_multi_journal.yml#L6
> 
> Ansible does appear to have some support for background tasks, See bottom
> of this page.
> 
> http://docs.ansible.com/ansible/playbooks_async.html
> 
> Sebastien, is there some way to fire up "ceph-disk prepare" for each device
> as separate asynchronous tasks and then wait for them all to complete
> before proceeding?  In the worst case, ceph-ansible could launch a shell
> script that would background the ceph-disk prepare processes within a host,
> and then wait for all of them to complete.  I'll try to follow up on this, was not
> sure whether this was still an issue in bluestore until I saw this e-mail.
> 
> -ben
> 
> ----- Original Message -----
> > From: "Mark Nelson" <mnelson@redhat.com>
> > To: "Somnath Roy" <Somnath.Roy@sandisk.com>, "Ramesh Chander"
> <Ramesh.Chander@sandisk.com>, "Allen Samuels"
> > <Allen.Samuels@sandisk.com>, "Sage Weil" <sage@newdream.net>
> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Ben England"
> > <bengland@redhat.com>
> > Sent: Thursday, August 11, 2016 7:24:42 AM
> > Subject: Re: Bluestore different allocator performance Vs FileStore
> >
> > Ben England added parallel OSD creation to CBT a while back which
> > greatly sped up cluster creation time (not just for the bitmap
> > allocator).  I'm not sure if ceph-ansible creates OSDs in parallel,
> > but if not he might have some insights into how easy it would be to
> > improve it.
> >
> > Mark
> >
> > On 08/11/2016 02:11 AM, Somnath Roy wrote:
> > > Yes, we can create OSDs in parallel but I am not sure how many
> > > people are creating cluster like that as ceph-deploy end there is no
> > > interface for that.
> > > FYI, we have introduced some parallelism in SanDisk wrapper script
> > > for installer based on ceph-deploy.
> > > I don't think even with all these parallel OSD creation, this
> > > problem will go away but for sure will be reduced  a bit as we have
> > > seen in case of OSD start time since it is inherently parallel.
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: Ramesh Chander
> > > Sent: Wednesday, August 10, 2016 11:07 PM
> > > To: Allen Samuels; Sage Weil; Somnath Roy
> > > Cc: ceph-devel
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > Somnath,
> > >
> > > Basically mkfs time has increased from 7.5 seconds (2min / 16) to 2
> > > minutes ( 32 / 16).
> > >
> > > But is there a reason you should create osds in serial? I think for
> > > mmultiple osds mkfs can happen in parallel?
> > >
> > > As a fix I am looking to batch multiple insert_free calls for now.
> > > If still that does not help, thinking of doing insert_free on
> > > different part of device in parallel.
> > >
> > > -Ramesh
> > >
> > >> -----Original Message-----
> > >> From: Ramesh Chander
> > >> Sent: Thursday, August 11, 2016 10:04 AM
> > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > >> Cc: ceph-devel
> > >> Subject: RE: Bluestore different allocator performance Vs FileStore
> > >>
> > >> I think insert_free is limited by speed of function clear_bits here.
> > >>
> > >> Though set_bits and clear_bits have same logic except one sets and
> > >> another clears. Both of these does 64 bits (bitmap size) at a time.
> > >>
> > >> I am not sure if doing memset will make it faster. But if we can do
> > >> it for group of bitmaps, then it might help.
> > >>
> > >> I am looking in to code if we can handle mkfs and osd mount in
> > >> special way to make it faster.
> > >>
> > >> If I don't find an easy fix, we can go to path of deferring init to
> > >> later stage as and when required.
> > >>
> > >> -Ramesh
> > >>
> > >>> -----Original Message-----
> > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > >>> To: Sage Weil; Somnath Roy
> > >>> Cc: ceph-devel
> > >>> Subject: RE: Bluestore different allocator performance Vs
> > >>> FileStore
> > >>>
> > >>> We always knew that startup time for bitmap stuff would be
> > >>> somewhat longer. Still, the existing implementation can be speeded
> > >>> up significantly. The code in BitMapZone::set_blocks_used isn't
> > >>> very optimized. Converting it to use memset for all but the
> > >>> first/last bytes
> > >> should significantly speed it up.
> > >>>
> > >>>
> > >>> Allen Samuels
> > >>> SanDisk |a Western Digital brand
> > >>> 2880 Junction Avenue, San Jose, CA 95134
> > >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > >>>> Subject: RE: Bluestore different allocator performance Vs
> > >>>> FileStore
> > >>>>
> > >>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > >>>>> << inline with [Somnath]
> > >>>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > >>>>> To: Somnath Roy
> > >>>>> Cc: ceph-devel
> > >>>>> Subject: Re: Bluestore different allocator performance Vs
> > >>>>> FileStore
> > >>>>>
> > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > >>>>>> Hi, I spent some time on evaluating different Bluestore
> > >>>>>> allocator and freelist performance. Also, tried to gaze the
> > >>>>>> performance difference of Bluestore and filestore on the
> > >>>>>> similar
> > >> setup.
> > >>>>>>
> > >>>>>> Setup:
> > >>>>>> --------
> > >>>>>>
> > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > >>>>>>
> > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > >>>>>>
> > >>>>>> Disabled the exclusive lock feature so that I can run multiple
> > >>>>>> write  jobs in
> > >>>> parallel.
> > >>>>>> rbd_cache is disabled in the client side.
> > >>>>>> Each test ran for 15 mins.
> > >>>>>>
> > >>>>>> Result :
> > >>>>>> ---------
> > >>>>>>
> > >>>>>> Here is the detailed report on this.
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>
> > >>
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8
> > >> a
> > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > >>>>>>
> > >>>>>> Each profile I named based on <allocator>-<freelist> , so in
> > >>>>>> the graph for
> > >>>> example "stupid-extent" meaning stupid allocator and extent freelist.
> > >>>>>>
> > >>>>>> I ran the test for each of the profile in the following order
> > >>>>>> after creating a
> > >>>> fresh rbd image for all the Bluestore test.
> > >>>>>>
> > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > >>>>>>
> > >>>>>> The above are non-preconditioned case i.e ran before filling up
> > >>>>>> the entire
> > >>>> image. The reason is I don't see any reason of filling up the rbd
> > >>>> image before like filestore case where it will give stable
> > >>>> performance if we fill up the rbd images first. Filling up rbd
> > >>>> images in case of filestore will create the files in the filesystem.
> > >>>>>>
> > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> > >>>>>> This is
> > >>>> primarily because I want to load BlueStore with more data.
> > >>>>>>
> > >>>>>> 6. Ran 4K RW test again (this is called out preconditioned in
> > >>>>>> the
> > >>>>>> profile) for 15 min
> > >>>>>>
> > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > >>>>>>
> > >>>>>> 8. Ran 16K RW test again for 15min
> > >>>>>>
> > >>>>>> For filestore test, I ran tests after preconditioning the
> > >>>>>> entire image
> > >> first.
> > >>>>>>
> > >>>>>> Each sheet on the xls have different block size result , I
> > >>>>>> often miss to navigate through the xls sheets , so, thought of
> > >>>>>> mentioning here
> > >>>>>> :-)
> > >>>>>>
> > >>>>>> I have also captured the mkfs time , OSD startup time and the
> > >>>>>> memory
> > >>>> usage after the entire run.
> > >>>>>>
> > >>>>>> Observation:
> > >>>>>> ---------------
> > >>>>>>
> > >>>>>> 1. First of all, in case of bitmap allocator mkfs time (and
> > >>>>>> thus cluster
> > >>>> creation time for 16 OSDs) are ~16X slower than stupid allocator
> > >>>> and
> > >>> filestore.
> > >>>> Each OSD creation is taking ~2min or so sometimes and I nailed
> > >>>> down the
> > >>>> insert_free() function call (marked ****) in the Bitmap allocator
> > >>>> is causing that.
> > >>>>>>
> > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next start
> > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next
> > >>>>>> 0x4663d00000~69959451000
> > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> > >>>>>> 0x4663d00000 length 0x69959451000
> > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > >>>>>> 0x4663d00000 len 0x69959451000****
> > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next
> > >>>>>> end****
> > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G
> > >>>>>> in
> > >>>>>> 1 extents
> > >>>>>>
> > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs _read_random
> > >>>>>> read buffered 0x4a14eb~265 of ^A:5242880+5242880
> > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs _read_random
> > >>>>>> got
> > >>>>>> 613
> > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next
> > >>>>>> 0x4663d00000~69959451000
> > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> > >>>>>> 0x4663d00000 length 0x69959451000
> > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > >>>>>> 0x4663d00000 len
> > >>>>>> 0x69959451000*****
> > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > >>>>>> enumerate_next end
> > >>>>>
> > >>>>> I'm not sure there's any easy fix for this. We can amortize it
> > >>>>> by feeding
> > >>>> space to bluefs slowly (so that we don't have to do all the
> > >>>> inserts at once), but I'm not sure that's really better.
> > >>>>>
> > >>>>> [Somnath] I don't know that part of the code, so, may be a dumb
> > >>> question.
> > >>>> This is during mkfs() time , so, can't we say to bluefs entire
> > >>>> space is free ? I can understand for osd mount and all other
> > >>>> cases we need to feed the free space every time.
> > >>>>> IMO this is critical to fix as cluster creation time will be
> > >>>>> number of OSDs * 2
> > >>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
> > >>>> compare to
> > >>>> ~2 min for stupid allocator/filestore.
> > >>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G
> > >>>>> and WAL is
> > >>>> ~1G. I guess the time taking is dependent on data partition size
> > >>>> as well (?
> > >>>>
> > >>>> Well, we're fundamentally limited by the fact that it's a bitmap,
> > >>>> and a big chunk of space is "allocated" to bluefs and needs to
> > >>>> have 1's set.
> > >>>>
> > >>>> sage
> > >>>> --
> > >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>>> in the body of a message to majordomo@vger.kernel.org More
> > >>> majordomo
> > >>>> info at http://vger.kernel.org/majordomo-info.html
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>> in the body of a message to majordomo@vger.kernel.org More
> > >> majordomo
> > >>> info at http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 17:05                 ` Allen Samuels
@ 2016-08-11 17:15                   ` Sage Weil
  2016-08-11 17:26                     ` Allen Samuels
  0 siblings, 1 reply; 34+ messages in thread
From: Sage Weil @ 2016-08-11 17:15 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 9:38 AM
> > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> > 
> > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > I think the free list does not initialize all keys at mkfs time; it
> > > only sets keys that have some allocations.
> > >
> > > The remaining keys are assumed to be all zeros if they do not exist.
> > 
> > Right.. it's the region "allocated" to bluefs that is consuming the time.
> > 
> > > The bitmap allocator insert_free is done in group of free bits
> > > together(maybe more than bitmap freelist keys at a time).
> > 
> > I think Allen is asking whether we are doing lots of inserts within a single
> > rocksdb transaction, or lots of separate transactions.
> > 
> > FWIW, my guess is that increasing the size of the value (i.e., increasing
> > 
> > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > 
> > ) will probably speed this up.
> 
> If your assumption (> Right.. it's the region "allocated" to bluefs that 
> is consuming the time) is correct, then I don't understand why this 
> parameter has any effect on the problem.
> 
> Aren't we reading BlueFS extents and setting them in the 
> BitMapAllocator? That doesn't care about the chunking of bitmap bits 
> into KV keys.

I think this is something different.  During mkfs we take ~2% (or 
something like that) of the block device, mark it 'allocated' (from the 
bluestore freelist's perspective) and give it to bluefs.  On a large 
device that's a lot of bits to set.  Larger keys should speed that up.

The amount of space we start with comes from _open_db():

      uint64_t initial =
	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
			    g_conf->bluestore_bluefs_gift_ratio);
      initial = MAX(initial, g_conf->bluestore_bluefs_min);

Simply lowering min_ratio might also be fine.  The current value of 2% is 
meant to be enough for most stores, and to avoid handing over lots of 
little extents later (and making the bluefs_extents list too big).  That 
can overflow the superblock, another annoying thing we need to address 
(though it is not a big deal to fix).

Anyway, adjusting bluestore_bluefs_min_ratio to .01 should ~halve the time 
spent on this... that is probably another useful test to confirm this is 
what is going on.
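
As a rough illustration of that estimate, the sketch below mirrors the
_open_db() arithmetic quoted above with illustrative inputs: the device size
is approximated by the ~6757 GiB extent from the log, the allocation
granularity is assumed to be 4 KiB, bluestore_bluefs_min is stood in for by
1 GiB, and 0.02 vs 0.01 is used as the before/after ratio sum (the real
defaults for the min/gift ratios should be checked against the tree).

  #include <algorithm>
  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint64_t bdev_size  = 0x69959451000ull;   // ~6757 GiB, from the log
    const uint64_t min_alloc  = 4096;               // assumed block granularity
    const uint64_t bluefs_min = uint64_t(1) << 30;  // stand-in for bluestore_bluefs_min
    const double ratios[] = {0.02, 0.01};           // before vs after the change
    for (double r : ratios) {
      uint64_t initial = std::max<uint64_t>(uint64_t(bdev_size * r), bluefs_min);
      printf("ratio %.2f -> initial bluefs space %llu bytes (~%llu M bits to set)\n",
             r, (unsigned long long)initial,
             (unsigned long long)(initial / min_alloc / 1000000));
    }
    return 0;
  }

Halving the ratio halves the initial chunk handed to bluefs, and with it the
number of freelist bits that have to be flipped to "allocated" at mkfs time,
which is where the ~2x expectation comes from.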

sage

> I would be cautious about just changing this option to affect this 
> problem (though as an experiment, we can change the value and see if it 
> has ANY affect on this problem -- which I don't think it will). The 
> value of this option really needs to be dictated by its effect on the 
> more mainstream read/write operations not on the initialization problem.
> > 
> > sage
> > 
> > 
> > >
> > > -Ramesh
> > >
> > > > -----Original Message-----
> > > > From: Allen Samuels
> > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > To: Ramesh Chander
> > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > Subject: Re: Bluestore different allocator performance Vs FileStore
> > > >
> > > > Is the initial creation of the keys for the bitmap one by one or are
> > > > they batched?
> > > >
> > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > >
> > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > >
> > > > > Somnath,
> > > > >
> > > > > Basically mkfs time has increased from 7.5 seconds (2min / 16) to
> > > > > 2 minutes
> > > > ( 32 / 16).
> > > > >
> > > > > But is there a reason you should create osds in serial? I think
> > > > > for mmultiple
> > > > osds mkfs can happen in parallel?
> > > > >
> > > > > As a fix I am looking to batch multiple insert_free calls for now.
> > > > > If still that
> > > > does not help, thinking of doing insert_free on different part of
> > > > device in parallel.
> > > > >
> > > > > -Ramesh
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Ramesh Chander
> > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > >> Cc: ceph-devel
> > > > >> Subject: RE: Bluestore different allocator performance Vs
> > > > >> FileStore
> > > > >>
> > > > >> I think insert_free is limited by speed of function clear_bits here.
> > > > >>
> > > > >> Though set_bits and clear_bits have same logic except one sets
> > > > >> and another clears. Both of these does 64 bits (bitmap size) at a time.
> > > > >>
> > > > >> I am not sure if doing memset will make it faster. But if we can
> > > > >> do it for group of bitmaps, then it might help.
> > > > >>
> > > > >> I am looking in to code if we can handle mkfs and osd mount in
> > > > >> special way to make it faster.
> > > > >>
> > > > >> If I don't find an easy fix, we can go to path of deferring init
> > > > >> to later stage as and when required.
> > > > >>
> > > > >> -Ramesh
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > >>> To: Sage Weil; Somnath Roy
> > > > >>> Cc: ceph-devel
> > > > >>> Subject: RE: Bluestore different allocator performance Vs
> > > > >>> FileStore
> > > > >>>
> > > > >>> We always knew that startup time for bitmap stuff would be
> > > > >>> somewhat longer. Still, the existing implementation can be
> > > > >>> speeded up significantly. The code in
> > > > >>> BitMapZone::set_blocks_used isn't very optimized. Converting it
> > > > >>> to use memset for all but the first/last bytes
> > > > >> should significantly speed it up.
> > > > >>>
> > > > >>>
> > > > >>> Allen Samuels
> > > > >>> SanDisk |a Western Digital brand
> > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >>>
> > > > >>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > >>>> Subject: RE: Bluestore different allocator performance Vs
> > > > >>>> FileStore
> > > > >>>>
> > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > >>>>> << inline with [Somnath]
> > > > >>>>>
> > > > >>>>> -----Original Message-----
> > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > >>>>> To: Somnath Roy
> > > > >>>>> Cc: ceph-devel
> > > > >>>>> Subject: Re: Bluestore different allocator performance Vs
> > > > >>>>> FileStore
> > > > >>>>>
> > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > >>>>>> Hi, I spent some time on evaluating different Bluestore
> > > > >>>>>> allocator and freelist performance. Also, tried to gaze the
> > > > >>>>>> performance difference of Bluestore and filestore on the
> > > > >>>>>> similar
> > > > >> setup.
> > > > >>>>>>
> > > > >>>>>> Setup:
> > > > >>>>>> --------
> > > > >>>>>>
> > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > >>>>>>
> > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > >>>>>>
> > > > >>>>>> Disabled the exclusive lock feature so that I can run
> > > > >>>>>> multiple write  jobs in
> > > > >>>> parallel.
> > > > >>>>>> rbd_cache is disabled in the client side.
> > > > >>>>>> Each test ran for 15 mins.
> > > > >>>>>>
> > > > >>>>>> Result :
> > > > >>>>>> ---------
> > > > >>>>>>
> > > > >>>>>> Here is the detailed report on this.
> > > > >>
> > > >
> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > >>>>>>
> > > > >>>>>> Each profile I named based on <allocator>-<freelist> , so in
> > > > >>>>>> the graph for
> > > > >>>> example "stupid-extent" meaning stupid allocator and extent
> > freelist.
> > > > >>>>>>
> > > > >>>>>> I ran the test for each of the profile in the following order
> > > > >>>>>> after creating a
> > > > >>>> fresh rbd image for all the Bluestore test.
> > > > >>>>>>
> > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > >>>>>>
> > > > >>>>>> The above are non-preconditioned case i.e ran before filling
> > > > >>>>>> up the entire
> > > > >>>> image. The reason is I don't see any reason of filling up the
> > > > >>>> rbd image before like filestore case where it will give stable
> > > > >>>> performance if we fill up the rbd images first. Filling up rbd
> > > > >>>> images in case of filestore will create the files in the filesystem.
> > > > >>>>>>
> > > > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> > > > >>>>>> This is
> > > > >>>> primarily because I want to load BlueStore with more data.
> > > > >>>>>>
> > > > >>>>>> 6. Ran 4K RW test again (this is called out preconditioned in
> > > > >>>>>> the
> > > > >>>>>> profile) for 15 min
> > > > >>>>>>
> > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > >>>>>>
> > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > >>>>>>
> > > > >>>>>> For filestore test, I ran tests after preconditioning the
> > > > >>>>>> entire image
> > > > >> first.
> > > > >>>>>>
> > > > >>>>>> Each sheet on the xls have different block size result , I
> > > > >>>>>> often miss to navigate through the xls sheets , so, thought
> > > > >>>>>> of mentioning here
> > > > >>>>>> :-)
> > > > >>>>>>
> > > > >>>>>> I have also captured the mkfs time , OSD startup time and the
> > > > >>>>>> memory
> > > > >>>> usage after the entire run.
> > > > >>>>>>
> > > > >>>>>> Observation:
> > > > >>>>>> ---------------
> > > > >>>>>>
> > > > >>>>>> 1. First of all, in case of bitmap allocator mkfs time (and
> > > > >>>>>> thus cluster
> > > > >>>> creation time for 16 OSDs) are ~16X slower than stupid
> > > > >>>> allocator and
> > > > >>> filestore.
> > > > >>>> Each OSD creation is taking ~2min or so sometimes and I nailed
> > > > >>>> down the
> > > > >>>> insert_free() function call (marked ****) in the Bitmap
> > > > >>>> allocator is causing that.
> > > > >>>>>>
> > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > > >>>>>> enumerate_next start
> > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > > >>>>>> enumerate_next
> > > > >>>>>> 0x4663d00000~69959451000
> > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > > >>>>>> enumerate_next
> > > > >>>>>> end****
> > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded 6757 G
> > > > >>>>>> in
> > > > >>>>>> 1 extents
> > > > >>>>>>
> > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> > ^A:5242880+5242880
> > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > > >>>>>> _read_random got
> > > > >>>>>> 613
> > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > > >>>>>> enumerate_next
> > > > >>>>>> 0x4663d00000~69959451000
> > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > > > >>>>>> 0x4663d00000 len
> > > > >>>>>> 0x69959451000*****
> > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > > >>>>>> enumerate_next end
> > > > >>>>>
> > > > >>>>> I'm not sure there's any easy fix for this. We can amortize it
> > > > >>>>> by feeding
> > > > >>>> space to bluefs slowly (so that we don't have to do all the
> > > > >>>> inserts at once), but I'm not sure that's really better.
> > > > >>>>>
> > > > >>>>> [Somnath] I don't know that part of the code, so, may be a
> > > > >>>>> dumb
> > > > >>> question.
> > > > >>>> This is during mkfs() time , so, can't we say to bluefs entire
> > > > >>>> space is free ? I can understand for osd mount and all other
> > > > >>>> cases we need to feed the free space every time.
> > > > >>>>> IMO this is critical to fix as cluster creation time will be
> > > > >>>>> number of OSDs * 2
> > > > >>>> min otherwise. For me creating 16 OSDs cluster is taking ~32min
> > > > >>>> compare to
> > > > >>>> ~2 min for stupid allocator/filestore.
> > > > >>>>> BTW, my drive data partition is ~6.9TB , db partition is ~100G
> > > > >>>>> and WAL is
> > > > >>>> ~1G. I guess the time taking is dependent on data partition
> > > > >>>> size as well
> > > > (?
> > > > >>>>
> > > > >>>> Well, we're fundamentally limited by the fact that it's a
> > > > >>>> bitmap, and a big chunk of space is "allocated" to bluefs and
> > > > >>>> needs to have 1's
> > > > set.
> > > > >>>>
> > > > >>>> sage
> > > > >>>> --
> > > > >>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> > devel"
> > > > >>>> in the body of a message to majordomo@vger.kernel.org More
> > > > >>> majordomo
> > > > >>>> info at http://vger.kernel.org/majordomo-info.html
> > > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >>> in the body of a message to majordomo@vger.kernel.org More
> > > > >> majordomo
> > > > >>> info at http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@vger.kernel.org More
> > majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> > >
> > >
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 17:15                   ` Sage Weil
@ 2016-08-11 17:26                     ` Allen Samuels
  2016-08-11 19:34                       ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Allen Samuels @ 2016-08-11 17:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, August 11, 2016 10:15 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 9:38 AM
> > > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > > I think the free list does not initialize all keys at mkfs time,
> > > > it does sets key that has some allocations.
> > > >
> > > > Rest keys are assumed to have 0's if key does not exist.
> > >
> > > Right.. it's the region "allocated" to bluefs that is consuming the time.
> > >
> > > > The bitmap allocator insert_free is done in group of free bits
> > > > together(maybe more than bitmap freelist keys at a time).
> > >
> > > I think Allen is asking whether we are doing lots of inserts within
> > > a single rocksdb transaction, or lots of separate transactions.
> > >
> > > FWIW, my guess is that increasing the size of the value (i.e.,
> > > increasing
> > >
> > > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > >
> > > ) will probably speed this up.
> >
> > If your assumption (> Right.. it's the region "allocated" to bluefs
> > that is consuming the time) is correct, then I don't understand why
> > this parameter has any effect on the problem.
> >
> > Aren't we reading BlueFS extents and setting them in the
> > BitMapAllocator? That doesn't care about the chunking of bitmap bits
> > into KV keys.
> 
> I think this is something different.  During mkfs we take ~2% (or somethign
> like that) of the block device, mark it 'allocated' (from the bluestore freelist's
> perspective) and give it to bluefs.  On a large device that's a lot of bits to set.
> Larger keys should speed that up.

But the bits in the BitMap shouldn't be chunked up in the same units as the keys. Right? Sharding of the bitmap is done for internal parallelism only -- it has nothing to do with the persistent representation.

BlueFS allocations aren't stored in the KV database (to avoid circularity).

So I don't see why a bitset of 2M bits should be taking so long... Makes me think that we don't really understand the problem.
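
As a sanity check, here's a throwaway standalone micro-sketch -- not the real
BitMapZone code, which also takes per-zone locks and maintains counters -- that
just times marking ~2 billion bits "used", once a 64-bit word at a time and
once as a single memset over the same word-aligned range (assuming a 4K
allocation unit, 2^31 bits is roughly 8TB of device):

  #include <chrono>
  #include <cstdint>
  #include <cstdio>
  #include <cstring>
  #include <vector>

  int main() {
    const uint64_t nbits = 1ull << 31;        // ~2.1 billion bits
    const size_t nwords = nbits / 64;
    std::vector<uint64_t> bits(nwords, 0);

    auto t0 = std::chrono::steady_clock::now();
    for (size_t w = 0; w < nwords; ++w)       // one word per iteration, roughly
      bits[w] = ~0ull;                        // the granularity set_bits/clear_bits work at
    auto t1 = std::chrono::steady_clock::now();

    // same range again, this time as one memset
    std::memset(bits.data(), 0xff, nwords * sizeof(uint64_t));
    auto t2 = std::chrono::steady_clock::now();

    std::printf("word loop: %lld ms, memset: %lld ms\n",
      (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
      (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
  }

Both versions should finish in a fraction of a second on any reasonable box, so
if insert_free really spends ~40 seconds on a comparable range, the time is
presumably going into the per-call bookkeeping around the bit flips rather than
the bit flips themselves.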

> 
> The amount of space we start with comes from _open_db():
> 
>       uint64_t initial =
> 	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
> 			    g_conf->bluestore_bluefs_gift_ratio);
>       initial = MAX(initial, g_conf->bluestore_bluefs_min);
> 
> Simply lowering min_ratio might also be fine.  The current value of 2% is
> meant to be enough for most stores, and to avoid giving over lots of little
> extents later (and making the bluefs_extents list too big).  That can overflow
> the superblock, another annoying thing we need to fix (though not a big deal
> to fix).
> 
> Anyway, adjust bluestore_bluefs_min_ratio to .01 should ~halve the time
> spent on this.. that is probably another useful test to confirm this is what is
> going on.

Yes, this should help -- but still seems like a bandaid.

> 
> sage
> 
> > I would be cautious about just changing this option to affect this
> > problem (though as an experiment, we can change the value and see if
> > it has ANY affect on this problem -- which I don't think it will). The
> > value of this option really needs to be dictated by its effect on the
> > more mainstream read/write operations not on the initialization problem.
> > >
> > > sage
> > >
> > >
> > > >
> > > > -Ramesh
> > > >
> > > > > -----Original Message-----
> > > > > From: Allen Samuels
> > > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > > To: Ramesh Chander
> > > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > > Subject: Re: Bluestore different allocator performance Vs
> > > > > FileStore
> > > > >
> > > > > Is the initial creation of the keys for the bitmap one by one or
> > > > > are they batched?
> > > > >
> > > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > > >
> > > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > > >
> > > > > > Somnath,
> > > > > >
> > > > > > Basically mkfs time has increased from 7.5 seconds (2min / 16)
> > > > > > to
> > > > > > 2 minutes
> > > > > ( 32 / 16).
> > > > > >
> > > > > > But is there a reason you should create osds in serial? I
> > > > > > think for mmultiple
> > > > > osds mkfs can happen in parallel?
> > > > > >
> > > > > > As a fix I am looking to batch multiple insert_free calls for now.
> > > > > > If still that
> > > > > does not help, thinking of doing insert_free on different part
> > > > > of device in parallel.
> > > > > >
> > > > > > -Ramesh
> > > > > >
> > > > > >> -----Original Message-----
> > > > > >> From: Ramesh Chander
> > > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > > >> Cc: ceph-devel
> > > > > >> Subject: RE: Bluestore different allocator performance Vs
> > > > > >> FileStore
> > > > > >>
> > > > > >> I think insert_free is limited by speed of function clear_bits here.
> > > > > >>
> > > > > >> Though set_bits and clear_bits have same logic except one
> > > > > >> sets and another clears. Both of these does 64 bits (bitmap size) at
> a time.
> > > > > >>
> > > > > >> I am not sure if doing memset will make it faster. But if we
> > > > > >> can do it for group of bitmaps, then it might help.
> > > > > >>
> > > > > >> I am looking in to code if we can handle mkfs and osd mount
> > > > > >> in special way to make it faster.
> > > > > >>
> > > > > >> If I don't find an easy fix, we can go to path of deferring
> > > > > >> init to later stage as and when required.
> > > > > >>
> > > > > >> -Ramesh
> > > > > >>
> > > > > >>> -----Original Message-----
> > > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > > >>> To: Sage Weil; Somnath Roy
> > > > > >>> Cc: ceph-devel
> > > > > >>> Subject: RE: Bluestore different allocator performance Vs
> > > > > >>> FileStore
> > > > > >>>
> > > > > >>> We always knew that startup time for bitmap stuff would be
> > > > > >>> somewhat longer. Still, the existing implementation can be
> > > > > >>> speeded up significantly. The code in
> > > > > >>> BitMapZone::set_blocks_used isn't very optimized. Converting
> > > > > >>> it to use memset for all but the first/last bytes
> > > > > >> should significantly speed it up.
> > > > > >>>
> > > > > >>>
> > > > > >>> Allen Samuels
> > > > > >>> SanDisk |a Western Digital brand
> > > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > > >>> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > >>> allen.samuels@SanDisk.com
> > > > > >>>
> > > > > >>>
> > > > > >>>> -----Original Message-----
> > > > > >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > >>>> Subject: RE: Bluestore different allocator performance Vs
> > > > > >>>> FileStore
> > > > > >>>>
> > > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > >>>>> << inline with [Somnath]
> > > > > >>>>>
> > > > > >>>>> -----Original Message-----
> > > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > > >>>>> To: Somnath Roy
> > > > > >>>>> Cc: ceph-devel
> > > > > >>>>> Subject: Re: Bluestore different allocator performance Vs
> > > > > >>>>> FileStore
> > > > > >>>>>
> > > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > >>>>>> Hi, I spent some time on evaluating different Bluestore
> > > > > >>>>>> allocator and freelist performance. Also, tried to gaze
> > > > > >>>>>> the performance difference of Bluestore and filestore on
> > > > > >>>>>> the similar
> > > > > >> setup.
> > > > > >>>>>>
> > > > > >>>>>> Setup:
> > > > > >>>>>> --------
> > > > > >>>>>>
> > > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > > >>>>>>
> > > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > > >>>>>>
> > > > > >>>>>> Disabled the exclusive lock feature so that I can run
> > > > > >>>>>> multiple write  jobs in
> > > > > >>>> parallel.
> > > > > >>>>>> rbd_cache is disabled in the client side.
> > > > > >>>>>> Each test ran for 15 mins.
> > > > > >>>>>>
> > > > > >>>>>> Result :
> > > > > >>>>>> ---------
> > > > > >>>>>>
> > > > > >>>>>> Here is the detailed report on this.
> > > > > >>
> > > > >
> > >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > > >>>>>>
> > > > > >>>>>> Each profile I named based on <allocator>-<freelist> , so
> > > > > >>>>>> in the graph for
> > > > > >>>> example "stupid-extent" meaning stupid allocator and extent
> > > freelist.
> > > > > >>>>>>
> > > > > >>>>>> I ran the test for each of the profile in the following
> > > > > >>>>>> order after creating a
> > > > > >>>> fresh rbd image for all the Bluestore test.
> > > > > >>>>>>
> > > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > > >>>>>>
> > > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > > >>>>>>
> > > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > > >>>>>>
> > > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > > >>>>>>
> > > > > >>>>>> The above are non-preconditioned case i.e ran before
> > > > > >>>>>> filling up the entire
> > > > > >>>> image. The reason is I don't see any reason of filling up
> > > > > >>>> the rbd image before like filestore case where it will give
> > > > > >>>> stable performance if we fill up the rbd images first.
> > > > > >>>> Filling up rbd images in case of filestore will create the files in
> the filesystem.
> > > > > >>>>>>
> > > > > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> > > > > >>>>>> This is
> > > > > >>>> primarily because I want to load BlueStore with more data.
> > > > > >>>>>>
> > > > > >>>>>> 6. Ran 4K RW test again (this is called out
> > > > > >>>>>> preconditioned in the
> > > > > >>>>>> profile) for 15 min
> > > > > >>>>>>
> > > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > > >>>>>>
> > > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > > >>>>>>
> > > > > >>>>>> For filestore test, I ran tests after preconditioning the
> > > > > >>>>>> entire image
> > > > > >> first.
> > > > > >>>>>>
> > > > > >>>>>> Each sheet on the xls have different block size result ,
> > > > > >>>>>> I often miss to navigate through the xls sheets , so,
> > > > > >>>>>> thought of mentioning here
> > > > > >>>>>> :-)
> > > > > >>>>>>
> > > > > >>>>>> I have also captured the mkfs time , OSD startup time and
> > > > > >>>>>> the memory
> > > > > >>>> usage after the entire run.
> > > > > >>>>>>
> > > > > >>>>>> Observation:
> > > > > >>>>>> ---------------
> > > > > >>>>>>
> > > > > >>>>>> 1. First of all, in case of bitmap allocator mkfs time
> > > > > >>>>>> (and thus cluster
> > > > > >>>> creation time for 16 OSDs) are ~16X slower than stupid
> > > > > >>>> allocator and
> > > > > >>> filestore.
> > > > > >>>> Each OSD creation is taking ~2min or so sometimes and I
> > > > > >>>> nailed down the
> > > > > >>>> insert_free() function call (marked ****) in the Bitmap
> > > > > >>>> allocator is causing that.
> > > > > >>>>>>
> > > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > > > >>>>>> enumerate_next start
> > > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > > > >>>>>> enumerate_next
> > > > > >>>>>> 0x4663d00000~69959451000
> > > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > > > >>>>>> enumerate_next
> > > > > >>>>>> end****
> > > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded
> > > > > >>>>>> 6757 G in
> > > > > >>>>>> 1 extents
> > > > > >>>>>>
> > > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> > > ^A:5242880+5242880
> > > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > > > >>>>>> _read_random got
> > > > > >>>>>> 613
> > > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > > > >>>>>> enumerate_next
> > > > > >>>>>> 0x4663d00000~69959451000
> > > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > > > > >>>>>> 0x4663d00000 len
> > > > > >>>>>> 0x69959451000*****
> > > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > > > >>>>>> enumerate_next end
> > > > > >>>>>
> > > > > >>>>> I'm not sure there's any easy fix for this. We can
> > > > > >>>>> amortize it by feeding
> > > > > >>>> space to bluefs slowly (so that we don't have to do all the
> > > > > >>>> inserts at once), but I'm not sure that's really better.
> > > > > >>>>>
> > > > > >>>>> [Somnath] I don't know that part of the code, so, may be a
> > > > > >>>>> dumb
> > > > > >>> question.
> > > > > >>>> This is during mkfs() time , so, can't we say to bluefs
> > > > > >>>> entire space is free ? I can understand for osd mount and
> > > > > >>>> all other cases we need to feed the free space every time.
> > > > > >>>>> IMO this is critical to fix as cluster creation time will
> > > > > >>>>> be number of OSDs * 2
> > > > > >>>> min otherwise. For me creating 16 OSDs cluster is taking
> > > > > >>>> ~32min compare to
> > > > > >>>> ~2 min for stupid allocator/filestore.
> > > > > >>>>> BTW, my drive data partition is ~6.9TB , db partition is
> > > > > >>>>> ~100G and WAL is
> > > > > >>>> ~1G. I guess the time taking is dependent on data partition
> > > > > >>>> size as well
> > > > > (?
> > > > > >>>>
> > > > > >>>> Well, we're fundamentally limited by the fact that it's a
> > > > > >>>> bitmap, and a big chunk of space is "allocated" to bluefs
> > > > > >>>> and needs to have 1's
> > > > > set.
> > > > > >>>>
> > > > > >>>> sage
> > > > > >>>> --
> > > > > >>>> To unsubscribe from this list: send the line "unsubscribe
> > > > > >>>> ceph-
> > > devel"
> > > > > >>>> in the body of a message to majordomo@vger.kernel.org More
> > > > > >>> majordomo
> > > > > >>>> info at http://vger.kernel.org/majordomo-info.html
> > > > > >>> --
> > > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> > > > > >>> in the body of a message to majordomo@vger.kernel.org More
> > > > > >> majordomo
> > > > > >>> info at http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > >
> > > >
> >
> >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 17:26                     ` Allen Samuels
@ 2016-08-11 19:34                       ` Sage Weil
  2016-08-11 19:45                         ` Allen Samuels
  0 siblings, 1 reply; 34+ messages in thread
From: Sage Weil @ 2016-08-11 19:34 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 10:15 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> > 
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > Sent: Thursday, August 11, 2016 9:38 AM
> > > > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > > >
> > > > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > > > I think the free list does not initialize all keys at mkfs time,
> > > > > it does sets key that has some allocations.
> > > > >
> > > > > Rest keys are assumed to have 0's if key does not exist.
> > > >
> > > > Right.. it's the region "allocated" to bluefs that is consuming the time.
> > > >
> > > > > The bitmap allocator insert_free is done in group of free bits
> > > > > together(maybe more than bitmap freelist keys at a time).
> > > >
> > > > I think Allen is asking whether we are doing lots of inserts within
> > > > a single rocksdb transaction, or lots of separate transactions.
> > > >
> > > > FWIW, my guess is that increasing the size of the value (i.e.,
> > > > increasing
> > > >
> > > > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > > >
> > > > ) will probably speed this up.
> > >
> > > If your assumption (> Right.. it's the region "allocated" to bluefs
> > > that is consuming the time) is correct, then I don't understand why
> > > this parameter has any effect on the problem.
> > >
> > > Aren't we reading BlueFS extents and setting them in the
> > > BitMapAllocator? That doesn't care about the chunking of bitmap bits
> > > into KV keys.
> > 
> > I think this is something different.  During mkfs we take ~2% (or somethign
> > like that) of the block device, mark it 'allocated' (from the bluestore freelist's
> > perspective) and give it to bluefs.  On a large device that's a lot of bits to set.
> > Larger keys should speed that up.
> 
> But the bits in the BitMap shouldn't be chunked up in the same units as 
> the Keys. Right? Sharding of the bitmap is done for internal parallelism 
> -- only, it has nothing to do with the persistent representation.

I'm not really sure what the BitmapAllocator is doing, but yeah, it's 
independent.  The tunable I'm talking about though is the one that 
controls how many bits BitmapFreelist puts in each key/value pair.
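(That's the bluestore_freelist_blocks_per_key option quoted above, which
defaults to 128.)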

> BlueFS allocations aren't stored in the KV database (to avoid 
> circularity).
> 
> So I don't see why a bitset of 2m bits should be taking so long..... 
> Makes me thing that we don't really understand the problem.

Could be, I'm just guessing.  During mkfs, _open_fm() does

    fm->create(bdev->get_size(), t);

and then

    fm->allocate(0, reserved, t);

where the value of reserved depends on how much we give to bluefs.  I'm 
assuming this is the mkfs allocation that is taking time, but I haven't 
looked at the allocator code at all or whether insert_free is part of this 
path...
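
For scale (ballpark only, assuming the 2% min_ratio and a 4K block size): on a
~6.9TB data partition, reserved comes out around 140GB, i.e. roughly 34M bits
in the freelist, or about 270K key/value pairs at the default 128 blocks per
key.  That's a lot of bits, but not an obviously 40-second amount of work,
which is another reason to confirm whether the time is in fm->allocate() itself
or in the allocator's insert_free.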

sage



> 
> > 
> > The amount of space we start with comes from _open_db():
> > 
> >       uint64_t initial =
> > 	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
> > 			    g_conf->bluestore_bluefs_gift_ratio);
> >       initial = MAX(initial, g_conf->bluestore_bluefs_min);
> > 
> > Simply lowering min_ratio might also be fine.  The current value of 2% is
> > meant to be enough for most stores, and to avoid giving over lots of little
> > extents later (and making the bluefs_extents list too big).  That can overflow
> > the superblock, another annoying thing we need to fix (though not a big deal
> > to fix).
> > 
> > Anyway, adjust bluestore_bluefs_min_ratio to .01 should ~halve the time
> > spent on this.. that is probably another useful test to confirm this is what is
> > going on.
> 
> Yes, this should help -- but still seems like a bandaid.
> 
> > 
> > sage
> > 
> > > I would be cautious about just changing this option to affect this
> > > problem (though as an experiment, we can change the value and see if
> > > it has ANY affect on this problem -- which I don't think it will). The
> > > value of this option really needs to be dictated by its effect on the
> > > more mainstream read/write operations not on the initialization problem.
> > > >
> > > > sage
> > > >
> > > >
> > > > >
> > > > > -Ramesh
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Allen Samuels
> > > > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > > > To: Ramesh Chander
> > > > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > > > Subject: Re: Bluestore different allocator performance Vs
> > > > > > FileStore
> > > > > >
> > > > > > Is the initial creation of the keys for the bitmap one by one or
> > > > > > are they batched?
> > > > > >
> > > > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > > > >
> > > > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > > > >
> > > > > > > Somnath,
> > > > > > >
> > > > > > > Basically mkfs time has increased from 7.5 seconds (2min / 16)
> > > > > > > to
> > > > > > > 2 minutes
> > > > > > ( 32 / 16).
> > > > > > >
> > > > > > > But is there a reason you should create osds in serial? I
> > > > > > > think for mmultiple
> > > > > > osds mkfs can happen in parallel?
> > > > > > >
> > > > > > > As a fix I am looking to batch multiple insert_free calls for now.
> > > > > > > If still that
> > > > > > does not help, thinking of doing insert_free on different part
> > > > > > of device in parallel.
> > > > > > >
> > > > > > > -Ramesh
> > > > > > >
> > > > > > >> -----Original Message-----
> > > > > > >> From: Ramesh Chander
> > > > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > > > >> Cc: ceph-devel
> > > > > > >> Subject: RE: Bluestore different allocator performance Vs
> > > > > > >> FileStore
> > > > > > >>
> > > > > > >> I think insert_free is limited by speed of function clear_bits here.
> > > > > > >>
> > > > > > >> Though set_bits and clear_bits have same logic except one
> > > > > > >> sets and another clears. Both of these does 64 bits (bitmap size) at
> > a time.
> > > > > > >>
> > > > > > >> I am not sure if doing memset will make it faster. But if we
> > > > > > >> can do it for group of bitmaps, then it might help.
> > > > > > >>
> > > > > > >> I am looking in to code if we can handle mkfs and osd mount
> > > > > > >> in special way to make it faster.
> > > > > > >>
> > > > > > >> If I don't find an easy fix, we can go to path of deferring
> > > > > > >> init to later stage as and when required.
> > > > > > >>
> > > > > > >> -Ramesh
> > > > > > >>
> > > > > > >>> -----Original Message-----
> > > > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > > > >>> To: Sage Weil; Somnath Roy
> > > > > > >>> Cc: ceph-devel
> > > > > > >>> Subject: RE: Bluestore different allocator performance Vs
> > > > > > >>> FileStore
> > > > > > >>>
> > > > > > >>> We always knew that startup time for bitmap stuff would be
> > > > > > >>> somewhat longer. Still, the existing implementation can be
> > > > > > >>> speeded up significantly. The code in
> > > > > > >>> BitMapZone::set_blocks_used isn't very optimized. Converting
> > > > > > >>> it to use memset for all but the first/last bytes
> > > > > > >> should significantly speed it up.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Allen Samuels
> > > > > > >>> SanDisk |a Western Digital brand
> > > > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > > > >>> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > >>> allen.samuels@SanDisk.com
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>> -----Original Message-----
> > > > > > >>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > > >>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > > >>>> Subject: RE: Bluestore different allocator performance Vs
> > > > > > >>>> FileStore
> > > > > > >>>>
> > > > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > >>>>> << inline with [Somnath]
> > > > > > >>>>>
> > > > > > >>>>> -----Original Message-----
> > > > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > > > >>>>> To: Somnath Roy
> > > > > > >>>>> Cc: ceph-devel
> > > > > > >>>>> Subject: Re: Bluestore different allocator performance Vs
> > > > > > >>>>> FileStore
> > > > > > >>>>>
> > > > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > >>>>>> Hi, I spent some time on evaluating different Bluestore
> > > > > > >>>>>> allocator and freelist performance. Also, tried to gaze
> > > > > > >>>>>> the performance difference of Bluestore and filestore on
> > > > > > >>>>>> the similar
> > > > > > >> setup.
> > > > > > >>>>>>
> > > > > > >>>>>> Setup:
> > > > > > >>>>>> --------
> > > > > > >>>>>>
> > > > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > > > >>>>>>
> > > > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > > > >>>>>>
> > > > > > >>>>>> Disabled the exclusive lock feature so that I can run
> > > > > > >>>>>> multiple write  jobs in
> > > > > > >>>> parallel.
> > > > > > >>>>>> rbd_cache is disabled in the client side.
> > > > > > >>>>>> Each test ran for 15 mins.
> > > > > > >>>>>>
> > > > > > >>>>>> Result :
> > > > > > >>>>>> ---------
> > > > > > >>>>>>
> > > > > > >>>>>> Here is the detailed report on this.
> > > > > > >>
> > > > > >
> > > >
> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > > > >>>>>>
> > > > > > >>>>>> Each profile I named based on <allocator>-<freelist> , so
> > > > > > >>>>>> in the graph for
> > > > > > >>>> example "stupid-extent" meaning stupid allocator and extent
> > > > freelist.
> > > > > > >>>>>>
> > > > > > >>>>>> I ran the test for each of the profile in the following
> > > > > > >>>>>> order after creating a
> > > > > > >>>> fresh rbd image for all the Bluestore test.
> > > > > > >>>>>>
> > > > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > > > >>>>>>
> > > > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > > > >>>>>>
> > > > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > > > >>>>>>
> > > > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > > > >>>>>>
> > > > > > >>>>>> The above are non-preconditioned case i.e ran before
> > > > > > >>>>>> filling up the entire
> > > > > > >>>> image. The reason is I don't see any reason of filling up
> > > > > > >>>> the rbd image before like filestore case where it will give
> > > > > > >>>> stable performance if we fill up the rbd images first.
> > > > > > >>>> Filling up rbd images in case of filestore will create the files in
> > the filesystem.
> > > > > > >>>>>>
> > > > > > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq write.
> > > > > > >>>>>> This is
> > > > > > >>>> primarily because I want to load BlueStore with more data.
> > > > > > >>>>>>
> > > > > > >>>>>> 6. Ran 4K RW test again (this is called out
> > > > > > >>>>>> preconditioned in the
> > > > > > >>>>>> profile) for 15 min
> > > > > > >>>>>>
> > > > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > > > >>>>>>
> > > > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > > > >>>>>>
> > > > > > >>>>>> For filestore test, I ran tests after preconditioning the
> > > > > > >>>>>> entire image
> > > > > > >> first.
> > > > > > >>>>>>
> > > > > > >>>>>> Each sheet on the xls have different block size result ,
> > > > > > >>>>>> I often miss to navigate through the xls sheets , so,
> > > > > > >>>>>> thought of mentioning here
> > > > > > >>>>>> :-)
> > > > > > >>>>>>
> > > > > > >>>>>> I have also captured the mkfs time , OSD startup time and
> > > > > > >>>>>> the memory
> > > > > > >>>> usage after the entire run.
> > > > > > >>>>>>
> > > > > > >>>>>> Observation:
> > > > > > >>>>>> ---------------
> > > > > > >>>>>>
> > > > > > >>>>>> 1. First of all, in case of bitmap allocator mkfs time
> > > > > > >>>>>> (and thus cluster
> > > > > > >>>> creation time for 16 OSDs) are ~16X slower than stupid
> > > > > > >>>> allocator and
> > > > > > >>> filestore.
> > > > > > >>>> Each OSD creation is taking ~2min or so sometimes and I
> > > > > > >>>> nailed down the
> > > > > > >>>> insert_free() function call (marked ****) in the Bitmap
> > > > > > >>>> allocator is causing that.
> > > > > > >>>>>>
> > > > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > > > > >>>>>> enumerate_next start
> > > > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > > > > >>>>>> enumerate_next
> > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > > >>>>>> bitmapalloc:init_add_free instance 139913322803328 offset
> > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > > > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10 freelist
> > > > > > >>>>>> enumerate_next
> > > > > > >>>>>> end****
> > > > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc loaded
> > > > > > >>>>>> 6757 G in
> > > > > > >>>>>> 1 extents
> > > > > > >>>>>>
> > > > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > > > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> > > > ^A:5242880+5242880
> > > > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > > > > >>>>>> _read_random got
> > > > > > >>>>>> 613
> > > > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > > > > >>>>>> enumerate_next
> > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > > >>>>>> bitmapalloc:init_add_free instance 139913306273920 offset
> > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > > > > > >>>>>> 0x4663d00000 len
> > > > > > >>>>>> 0x69959451000*****
> > > > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10 freelist
> > > > > > >>>>>> enumerate_next end
> > > > > > >>>>>
> > > > > > >>>>> I'm not sure there's any easy fix for this. We can
> > > > > > >>>>> amortize it by feeding
> > > > > > >>>> space to bluefs slowly (so that we don't have to do all the
> > > > > > >>>> inserts at once), but I'm not sure that's really better.
> > > > > > >>>>>
> > > > > > >>>>> [Somnath] I don't know that part of the code, so, may be a
> > > > > > >>>>> dumb
> > > > > > >>> question.
> > > > > > >>>> This is during mkfs() time , so, can't we say to bluefs
> > > > > > >>>> entire space is free ? I can understand for osd mount and
> > > > > > >>>> all other cases we need to feed the free space every time.
> > > > > > >>>>> IMO this is critical to fix as cluster creation time will
> > > > > > >>>>> be number of OSDs * 2
> > > > > > >>>> min otherwise. For me creating 16 OSDs cluster is taking
> > > > > > >>>> ~32min compare to
> > > > > > >>>> ~2 min for stupid allocator/filestore.
> > > > > > >>>>> BTW, my drive data partition is ~6.9TB , db partition is
> > > > > > >>>>> ~100G and WAL is
> > > > > > >>>> ~1G. I guess the time taking is dependent on data partition
> > > > > > >>>> size as well
> > > > > > (?
> > > > > > >>>>
> > > > > > >>>> Well, we're fundamentally limited by the fact that it's a
> > > > > > >>>> bitmap, and a big chunk of space is "allocated" to bluefs
> > > > > > >>>> and needs to have 1's
> > > > > > set.
> > > > > > >>>>
> > > > > > >>>> sage
> > > > > > >>>> --
> > > > > > >>>> To unsubscribe from this list: send the line "unsubscribe
> > > > > > >>>> ceph-
> > > > devel"
> > > > > > >>>> in the body of a message to majordomo@vger.kernel.org More
> > > > > > >>> majordomo
> > > > > > >>>> info at http://vger.kernel.org/majordomo-info.html
> > > > > > >>> --
> > > > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-
> > devel"
> > > > > > >>> in the body of a message to majordomo@vger.kernel.org More
> > > > > > >> majordomo
> > > > > > >>> info at http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > >
> > > > >
> > >
> > >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 19:34                       ` Sage Weil
@ 2016-08-11 19:45                         ` Allen Samuels
  2016-08-11 20:03                           ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Allen Samuels @ 2016-08-11 19:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, August 11, 2016 12:34 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 10:15 AM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > Sent: Thursday, August 11, 2016 9:38 AM
> > > > > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > devel@vger.kernel.org>
> > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > FileStore
> > > > >
> > > > > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > > > > I think the free list does not initialize all keys at mkfs
> > > > > > time, it does sets key that has some allocations.
> > > > > >
> > > > > > Rest keys are assumed to have 0's if key does not exist.
> > > > >
> > > > > Right.. it's the region "allocated" to bluefs that is consuming the time.
> > > > >
> > > > > > The bitmap allocator insert_free is done in group of free bits
> > > > > > together(maybe more than bitmap freelist keys at a time).
> > > > >
> > > > > I think Allen is asking whether we are doing lots of inserts
> > > > > within a single rocksdb transaction, or lots of separate transactions.
> > > > >
> > > > > FWIW, my guess is that increasing the size of the value (i.e.,
> > > > > increasing
> > > > >
> > > > > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > > > >
> > > > > ) will probably speed this up.
> > > >
> > > > If your assumption (> Right.. it's the region "allocated" to
> > > > bluefs that is consuming the time) is correct, then I don't
> > > > understand why this parameter has any effect on the problem.
> > > >
> > > > Aren't we reading BlueFS extents and setting them in the
> > > > BitMapAllocator? That doesn't care about the chunking of bitmap
> > > > bits into KV keys.
> > >
> > > I think this is something different.  During mkfs we take ~2% (or
> > > somethign like that) of the block device, mark it 'allocated' (from
> > > the bluestore freelist's
> > > perspective) and give it to bluefs.  On a large device that's a lot of bits to
> set.
> > > Larger keys should speed that up.
> >
> > But the bits in the BitMap shouldn't be chunked up in the same units
> > as the Keys. Right? Sharding of the bitmap is done for internal
> > parallelism
> > -- only, it has nothing to do with the persistent representation.
> 
> I'm not really sure what the BitmapAllocator is doing, but yeah, it's
> independent.  The tunable I'm talking about though is the one that controls
> how many bits BitmapFreelist puts in each key/value pair.

I understand, but that should be relevant only to operations that actually read or write the KV store. That's not the case here; allocations by BlueFS are not recorded in the KV store.

Whatever chunking/sharding of the bitmap freelist is present should be independent of (or at most an integer multiple of) the number of bits that are packed into a single KV key/value pair. Hence the initialization here (i.e., marking the BlueFS-allocated space in the freelist) shouldn't involve ANY KVStore operations. I think it's worthwhile to modify the option (say, make it 16x or 64x larger) and see whether that actually affects the initialization time -- if it does, then something is structurally inefficient in the code that is hopefully easy to fix.
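
Concretely, that experiment would be something like setting
bluestore_freelist_blocks_per_key = 2048 (16x the default of 128) in the [osd]
section for one mkfs run, and separately Sage's bluestore_bluefs_min_ratio =
.01 for another, comparing each against the ~2 minute per-OSD baseline.  If
neither moves the needle, that would suggest the time is inside the allocator's
own bit-setting loop rather than in any KV traffic.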
 
> 
> > BlueFS allocations aren't stored in the KV database (to avoid
> > circularity).
> >
> > So I don't see why a bitset of 2m bits should be taking so long.....
> > Makes me thing that we don't really understand the problem.
> 
> Could be, I'm just guessing.  During mkfs, _open_fm() does
> 
>     fm->create(bdev->get_size(), t);
> 
> and then
> 
>     fm->allocate(0, reserved, t);
> 
> where the value of reserved depends on how much we give to bluefs.  I'm
> assuming this is the mkfs allocation that is taking time, but I haven't looked at
> the allocator code at all or whether insert_free is part of this path...

Somnath's data clearly points to this....

> 
> sage
> 
> 
> 
> >
> > >
> > > The amount of space we start with comes from _open_db():
> > >
> > >       uint64_t initial =
> > > 	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
> > > 			    g_conf->bluestore_bluefs_gift_ratio);
> > >       initial = MAX(initial, g_conf->bluestore_bluefs_min);
> > >
> > > Simply lowering min_ratio might also be fine.  The current value of
> > > 2% is meant to be enough for most stores, and to avoid giving over
> > > lots of little extents later (and making the bluefs_extents list too
> > > big).  That can overflow the superblock, another annoying thing we
> > > need to fix (though not a big deal to fix).
> > >
> > > Anyway, adjust bluestore_bluefs_min_ratio to .01 should ~halve the
> > > time spent on this.. that is probably another useful test to confirm
> > > this is what is going on.
> >
> > Yes, this should help -- but still seems like a bandaid. 
> >
> > >
> > > sage
> > >
> > > > I would be cautious about just changing this option to affect this
> > > > problem (though as an experiment, we can change the value and see
> > > > if it has ANY affect on this problem -- which I don't think it
> > > > will). The value of this option really needs to be dictated by its
> > > > effect on the more mainstream read/write operations not on the
> initialization problem.
> > > > >
> > > > > sage
> > > > >
> > > > >
> > > > > >
> > > > > > -Ramesh
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Allen Samuels
> > > > > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > > > > To: Ramesh Chander
> > > > > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > > > > Subject: Re: Bluestore different allocator performance Vs
> > > > > > > FileStore
> > > > > > >
> > > > > > > Is the initial creation of the keys for the bitmap one by
> > > > > > > one or are they batched?
> > > > > > >
> > > > > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > > > > >
> > > > > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > > > > >
> > > > > > > > Somnath,
> > > > > > > >
> > > > > > > > Basically mkfs time has increased from 7.5 seconds (2min /
> > > > > > > > 16) to
> > > > > > > > 2 minutes
> > > > > > > ( 32 / 16).
> > > > > > > >
> > > > > > > > But is there a reason you should create osds in serial? I
> > > > > > > > think for mmultiple
> > > > > > > osds mkfs can happen in parallel?
> > > > > > > >
> > > > > > > > As a fix I am looking to batch multiple insert_free calls for now.
> > > > > > > > If still that
> > > > > > > does not help, thinking of doing insert_free on different
> > > > > > > part of device in parallel.
> > > > > > > >
> > > > > > > > -Ramesh
> > > > > > > >
> > > > > > > >> -----Original Message-----
> > > > > > > >> From: Ramesh Chander
> > > > > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > > > > >> Cc: ceph-devel
> > > > > > > >> Subject: RE: Bluestore different allocator performance Vs
> > > > > > > >> FileStore
> > > > > > > >>
> > > > > > > >> I think insert_free is limited by speed of function clear_bits
> here.
> > > > > > > >>
> > > > > > > >> Though set_bits and clear_bits have same logic except one
> > > > > > > >> sets and another clears. Both of these does 64 bits
> > > > > > > >> (bitmap size) at
> > > a time.
> > > > > > > >>
> > > > > > > >> I am not sure if doing memset will make it faster. But if
> > > > > > > >> we can do it for group of bitmaps, then it might help.
> > > > > > > >>
> > > > > > > >> I am looking in to code if we can handle mkfs and osd
> > > > > > > >> mount in special way to make it faster.
> > > > > > > >>
> > > > > > > >> If I don't find an easy fix, we can go to path of
> > > > > > > >> deferring init to later stage as and when required.
> > > > > > > >>
> > > > > > > >> -Ramesh
> > > > > > > >>
> > > > > > > >>> -----Original Message-----
> > > > > > > >>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > >>> [mailto:ceph-devel- owner@vger.kernel.org] On Behalf Of
> > > > > > > >>> Allen Samuels
> > > > > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > > > > >>> To: Sage Weil; Somnath Roy
> > > > > > > >>> Cc: ceph-devel
> > > > > > > >>> Subject: RE: Bluestore different allocator performance
> > > > > > > >>> Vs FileStore
> > > > > > > >>>
> > > > > > > >>> We always knew that startup time for bitmap stuff would
> > > > > > > >>> be somewhat longer. Still, the existing implementation
> > > > > > > >>> can be speeded up significantly. The code in
> > > > > > > >>> BitMapZone::set_blocks_used isn't very optimized.
> > > > > > > >>> Converting it to use memset for all but the first/last
> > > > > > > >>> bytes
> > > > > > > >> should significantly speed it up.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Allen Samuels
> > > > > > > >>> SanDisk |a Western Digital brand
> > > > > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > >>> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > >>> allen.samuels@SanDisk.com
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>> -----Original Message-----
> > > > > > > >>>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > >>>> [mailto:ceph-devel- owner@vger.kernel.org] On Behalf Of
> > > > > > > >>>> Sage Weil
> > > > > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > > > >>>> Subject: RE: Bluestore different allocator performance
> > > > > > > >>>> Vs FileStore
> > > > > > > >>>>
> > > > > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > >>>>> << inline with [Somnath]
> > > > > > > >>>>>
> > > > > > > >>>>> -----Original Message-----
> > > > > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > > > > >>>>> To: Somnath Roy
> > > > > > > >>>>> Cc: ceph-devel
> > > > > > > >>>>> Subject: Re: Bluestore different allocator performance
> > > > > > > >>>>> Vs FileStore
> > > > > > > >>>>>
> > > > > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > >>>>>> Hi, I spent some time on evaluating different
> > > > > > > >>>>>> Bluestore allocator and freelist performance. Also,
> > > > > > > >>>>>> tried to gaze the performance difference of Bluestore
> > > > > > > >>>>>> and filestore on the similar
> > > > > > > >> setup.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Setup:
> > > > > > > >>>>>> --------
> > > > > > > >>>>>>
> > > > > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > > > > >>>>>>
> > > > > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Disabled the exclusive lock feature so that I can run
> > > > > > > >>>>>> multiple write  jobs in
> > > > > > > >>>> parallel.
> > > > > > > >>>>>> rbd_cache is disabled in the client side.
> > > > > > > >>>>>> Each test ran for 15 mins.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Result :
> > > > > > > >>>>>> ---------
> > > > > > > >>>>>>
> > > > > > > >>>>>> Here is the detailed report on this.
> > > > > > > >>
> > > > > > >
> > > > >
> > >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > > > > >>>>>>
> > > > > > > >>>>>> Each profile I named based on <allocator>-<freelist>
> > > > > > > >>>>>> , so in the graph for
> > > > > > > >>>> example "stupid-extent" meaning stupid allocator and
> > > > > > > >>>> extent
> > > > > freelist.
> > > > > > > >>>>>>
> > > > > > > >>>>>> I ran the test for each of the profile in the
> > > > > > > >>>>>> following order after creating a
> > > > > > > >>>> fresh rbd image for all the Bluestore test.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > > > > >>>>>>
> > > > > > > >>>>>> The above are non-preconditioned case i.e ran before
> > > > > > > >>>>>> filling up the entire
> > > > > > > >>>> image. The reason is I don't see any reason of filling
> > > > > > > >>>> up the rbd image before like filestore case where it
> > > > > > > >>>> will give stable performance if we fill up the rbd images first.
> > > > > > > >>>> Filling up rbd images in case of filestore will create
> > > > > > > >>>> the files in
> > > the filesystem.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq
> write.
> > > > > > > >>>>>> This is
> > > > > > > >>>> primarily because I want to load BlueStore with more data.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 6. Ran 4K RW test again (this is called out
> > > > > > > >>>>>> preconditioned in the
> > > > > > > >>>>>> profile) for 15 min
> > > > > > > >>>>>>
> > > > > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > > > > >>>>>>
> > > > > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > > > > >>>>>>
> > > > > > > >>>>>> For filestore test, I ran tests after preconditioning
> > > > > > > >>>>>> the entire image
> > > > > > > >> first.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Each sheet on the xls have different block size
> > > > > > > >>>>>> result , I often miss to navigate through the xls
> > > > > > > >>>>>> sheets , so, thought of mentioning here
> > > > > > > >>>>>> :-)
> > > > > > > >>>>>>
> > > > > > > >>>>>> I have also captured the mkfs time , OSD startup time
> > > > > > > >>>>>> and the memory
> > > > > > > >>>> usage after the entire run.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Observation:
> > > > > > > >>>>>> ---------------
> > > > > > > >>>>>>
> > > > > > > >>>>>> 1. First of all, in case of bitmap allocator mkfs
> > > > > > > >>>>>> time (and thus cluster
> > > > > > > >>>> creation time for 16 OSDs) are ~16X slower than stupid
> > > > > > > >>>> allocator and
> > > > > > > >>> filestore.
> > > > > > > >>>> Each OSD creation is taking ~2min or so sometimes and I
> > > > > > > >>>> nailed down the
> > > > > > > >>>> insert_free() function call (marked ****) in the Bitmap
> > > > > > > >>>> allocator is causing that.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > > > > > >>>>>> enumerate_next start
> > > > > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > > > > > >>>>>> enumerate_next
> > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > > > >>>>>> bitmapalloc:init_add_free instance 139913322803328
> > > > > > > >>>>>> offset
> > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > > > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > > > > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10
> > > > > > > >>>>>> freelist enumerate_next
> > > > > > > >>>>>> end****
> > > > > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc
> > > > > > > >>>>>> loaded
> > > > > > > >>>>>> 6757 G in
> > > > > > > >>>>>> 1 extents
> > > > > > > >>>>>>
> > > > > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > > > > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> > > > > ^A:5242880+5242880
> > > > > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > > > > > >>>>>> _read_random got
> > > > > > > >>>>>> 613
> > > > > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > > > > > >>>>>> enumerate_next
> > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > > > >>>>>> bitmapalloc:init_add_free instance 139913306273920
> > > > > > > >>>>>> offset
> > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > > > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > > > > > > >>>>>> 0x4663d00000 len
> > > > > > > >>>>>> 0x69959451000*****
> > > > > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10
> > > > > > > >>>>>> freelist enumerate_next end
> > > > > > > >>>>>
> > > > > > > >>>>> I'm not sure there's any easy fix for this. We can
> > > > > > > >>>>> amortize it by feeding
> > > > > > > >>>> space to bluefs slowly (so that we don't have to do all
> > > > > > > >>>> the inserts at once), but I'm not sure that's really better.
> > > > > > > >>>>>
> > > > > > > >>>>> [Somnath] I don't know that part of the code, so, may
> > > > > > > >>>>> be a dumb
> > > > > > > >>> question.
> > > > > > > >>>> This is during mkfs() time , so, can't we say to bluefs
> > > > > > > >>>> entire space is free ? I can understand for osd mount
> > > > > > > >>>> and all other cases we need to feed the free space every
> time.
> > > > > > > >>>>> IMO this is critical to fix as cluster creation time
> > > > > > > >>>>> will be number of OSDs * 2
> > > > > > > >>>> min otherwise. For me creating 16 OSDs cluster is
> > > > > > > >>>> taking ~32min compare to
> > > > > > > >>>> ~2 min for stupid allocator/filestore.
> > > > > > > >>>>> BTW, my drive data partition is ~6.9TB , db partition
> > > > > > > >>>>> is ~100G and WAL is
> > > > > > > >>>> ~1G. I guess the time taking is dependent on data
> > > > > > > >>>> partition size as well
> > > > > > > (?
> > > > > > > >>>>
> > > > > > > >>>> Well, we're fundamentally limited by the fact that it's
> > > > > > > >>>> a bitmap, and a big chunk of space is "allocated" to
> > > > > > > >>>> bluefs and needs to have 1's
> > > > > > > set.
> > > > > > > >>>>
> > > > > > > >>>> sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 19:45                         ` Allen Samuels
@ 2016-08-11 20:03                           ` Sage Weil
  2016-08-11 20:16                             ` Allen Samuels
  0 siblings, 1 reply; 34+ messages in thread
From: Sage Weil @ 2016-08-11 20:03 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 12:34 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> > 
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > Sent: Thursday, August 11, 2016 10:15 AM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > > >
> > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > > Sent: Thursday, August 11, 2016 9:38 AM
> > > > > > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > > > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> > > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > > devel@vger.kernel.org>
> > > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > > FileStore
> > > > > >
> > > > > > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > > > > > I think the free list does not initialize all keys at mkfs
> > > > > > > time, it does sets key that has some allocations.
> > > > > > >
> > > > > > > Rest keys are assumed to have 0's if key does not exist.
> > > > > >
> > > > > > Right.. it's the region "allocated" to bluefs that is consuming the time.
> > > > > >
> > > > > > > The bitmap allocator insert_free is done in group of free bits
> > > > > > > together(maybe more than bitmap freelist keys at a time).
> > > > > >
> > > > > > I think Allen is asking whether we are doing lots of inserts
> > > > > > within a single rocksdb transaction, or lots of separate transactions.
> > > > > >
> > > > > > FWIW, my guess is that increasing the size of the value (i.e.,
> > > > > > increasing
> > > > > >
> > > > > > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > > > > >
> > > > > > ) will probably speed this up.
> > > > >
> > > > > If your assumption (> Right.. it's the region "allocated" to
> > > > > bluefs that is consuming the time) is correct, then I don't
> > > > > understand why this parameter has any effect on the problem.
> > > > >
> > > > > Aren't we reading BlueFS extents and setting them in the
> > > > > BitMapAllocator? That doesn't care about the chunking of bitmap
> > > > > bits into KV keys.
> > > >
> > > > I think this is something different.  During mkfs we take ~2% (or
> > > > somethign like that) of the block device, mark it 'allocated' (from
> > > > the bluestore freelist's
> > > > perspective) and give it to bluefs.  On a large device that's a lot of bits to
> > set.
> > > > Larger keys should speed that up.
> > >
> > > But the bits in the BitMap shouldn't be chunked up in the same units
> > > as the Keys. Right? Sharding of the bitmap is done for internal
> > > parallelism
> > > -- only, it has nothing to do with the persistent representation.
> > 
> > I'm not really sure what the BitmapAllocator is doing, but yeah, it's
> > independent.  The tunable I'm talking about though is the one that controls
> > how many bits BitmapFreelist puts in each key/value pair.
> 
> I understand, but that should be relevant only to operations that 
> actually either read or write to the KV Store. That's not the case here, 
> allocations by BlueFS are not recorded in the KVStore.
> 
> Whatever chunking/sharding of the bitmapfreelist is present should be 
> independent (well an integer multiple thereof....) of the number of bits 
> that are chunked up into a single KV Key/Value pair. Hence when doing 
> the initialization here (i.e., the marking of BlueFS allocated space in 
> the freelist) that shouldn't involve ANY KVStore operations. I think 
> it's worthwhile to modify the option (say make it 16 or 64x larger) and 
> see if that actually affects the initialization time -- if it does, then 
> there's something structurally inefficient in the code that's hopefully 
> easy to fix.

This is the allocation of space *to* bluefs, not *by* bluefs.  At mkfs 
time, we (BlueStore::mkfs() -> _open_fm()) will take 2% of the block 
device and mark it in-use with that fm->allocate() call below, and that 
flips a bunch of bits in the kv store.
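
To make the cost concrete, here's a tiny self-contained model of that 
step (this is NOT the actual BlueStore/FreelistManager code; the 4K 
block size, the 128 blocks-per-key chunking, and all of the names are 
illustrative assumptions):

  // Toy model of a bitmap freelist stored as one key/value pair per
  // chunk of blocks, marking the ~2% bluefs reservation "in use".
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <vector>

  struct ToyBitmapFreelist {
    uint64_t block_size;      // bytes per block, e.g. 4096
    uint64_t blocks_per_key;  // cf. bluestore_freelist_blocks_per_key
    std::map<uint64_t, std::vector<bool>> kv;  // chunk start -> bits

    // Untouched keys are implicitly all-zero (free), so create is cheap.
    void create(uint64_t /*dev_size*/) {}

    // Mark [offset, offset+length) allocated: touches one value per
    // blocks_per_key chunk covering the range.
    void allocate(uint64_t offset, uint64_t length) {
      uint64_t first = offset / block_size;
      uint64_t last = (offset + length) / block_size;
      for (uint64_t b = first; b < last; ++b) {
        auto& chunk = kv[(b / blocks_per_key) * blocks_per_key];
        if (chunk.empty())
          chunk.assign(blocks_per_key, false);
        chunk[b % blocks_per_key] = true;
      }
    }
  };

  int main() {
    ToyBitmapFreelist fm{4096, 128, {}};
    uint64_t dev_size = 8ull << 40;          // 8 TB device
    uint64_t reserved = dev_size * 2 / 100;  // ~2% handed to bluefs
    fm.create(dev_size);
    fm.allocate(0, reserved);
    std::cout << "key/value chunks touched: " << fm.kv.size() << std::endl;
    return 0;
  }

Even in the toy, that's ~335k separate key/value chunks (and tens of 
millions of bit flips) just for the bluefs reservation, which is why the 
mkfs path hurts on a multi-TB device.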

> > > BlueFS allocations aren't stored in the KV database (to avoid
> > > circularity).
> > >
> > > So I don't see why a bitset of 2m bits should be taking so long.....
> > > Makes me thing that we don't really understand the problem.
> > 
> > Could be, I'm just guessing.  During mkfs, _open_fm() does
> > 
> >     fm->create(bdev->get_size(), t);
> > 
> > and then
> > 
> >     fm->allocate(0, reserved, t);

        ^ here.

> > 
> > where the value of reserved depends on how much we give to bluefs.  I'm
> > assuming this is the mkfs allocation that is taking time, but I haven't looked at
> > the allocator code at all or whether insert_free is part of this path...
> 
> Somnath's data clearly points to this....

sage

> 
> > 
> > sage
> > 
> > 
> > 
> > >
> > > >
> > > > The amount of space we start with comes from _open_db():
> > > >
> > > >       uint64_t initial =
> > > > 	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
> > > > 			    g_conf->bluestore_bluefs_gift_ratio);
> > > >       initial = MAX(initial, g_conf->bluestore_bluefs_min);
> > > >
> > > > Simply lowering min_ratio might also be fine.  The current value of
> > > > 2% is meant to be enough for most stores, and to avoid giving over
> > > > lots of little extents later (and making the bluefs_extents list too
> > > > big).  That can overflow the superblock, another annoying thing we
> > > > need to fix (though not a big deal to fix).
> > > >
> > > > Anyway, adjust bluestore_bluefs_min_ratio to .01 should ~halve the
> > > > time spent on this.. that is probably another useful test to confirm
> > > > this is what is going on.
> > >
> > > Yes, this should help -- but still seems like a bandaid. 
> > >
> > > >
> > > > sage
> > > >
> > > > > I would be cautious about just changing this option to affect this
> > > > > problem (though as an experiment, we can change the value and see
> > > > > if it has ANY affect on this problem -- which I don't think it
> > > > > will). The value of this option really needs to be dictated by its
> > > > > effect on the more mainstream read/write operations not on the
> > initialization problem.
> > > > > >
> > > > > > sage
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > -Ramesh
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Allen Samuels
> > > > > > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > > > > > To: Ramesh Chander
> > > > > > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > > > > > Subject: Re: Bluestore different allocator performance Vs
> > > > > > > > FileStore
> > > > > > > >
> > > > > > > > Is the initial creation of the keys for the bitmap one by
> > > > > > > > one or are they batched?
> > > > > > > >
> > > > > > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > > > > > >
> > > > > > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > > > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > > > > > >
> > > > > > > > > Somnath,
> > > > > > > > >
> > > > > > > > > Basically mkfs time has increased from 7.5 seconds (2min /
> > > > > > > > > 16) to
> > > > > > > > > 2 minutes
> > > > > > > > ( 32 / 16).
> > > > > > > > >
> > > > > > > > > But is there a reason you should create osds in serial? I
> > > > > > > > > think for mmultiple
> > > > > > > > osds mkfs can happen in parallel?
> > > > > > > > >
> > > > > > > > > As a fix I am looking to batch multiple insert_free calls for now.
> > > > > > > > > If still that
> > > > > > > > does not help, thinking of doing insert_free on different
> > > > > > > > part of device in parallel.
> > > > > > > > >
> > > > > > > > > -Ramesh
> > > > > > > > >
> > > > > > > > >> -----Original Message-----
> > > > > > > > >> From: Ramesh Chander
> > > > > > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > > > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > > > > > >> Cc: ceph-devel
> > > > > > > > >> Subject: RE: Bluestore different allocator performance Vs
> > > > > > > > >> FileStore
> > > > > > > > >>
> > > > > > > > >> I think insert_free is limited by speed of function clear_bits
> > here.
> > > > > > > > >>
> > > > > > > > >> Though set_bits and clear_bits have same logic except one
> > > > > > > > >> sets and another clears. Both of these does 64 bits
> > > > > > > > >> (bitmap size) at
> > > > a time.
> > > > > > > > >>
> > > > > > > > >> I am not sure if doing memset will make it faster. But if
> > > > > > > > >> we can do it for group of bitmaps, then it might help.
> > > > > > > > >>
> > > > > > > > >> I am looking in to code if we can handle mkfs and osd
> > > > > > > > >> mount in special way to make it faster.
> > > > > > > > >>
> > > > > > > > >> If I don't find an easy fix, we can go to path of
> > > > > > > > >> deferring init to later stage as and when required.
> > > > > > > > >>
> > > > > > > > >> -Ramesh
> > > > > > > > >>
> > > > > > > > >>> -----Original Message-----
> > > > > > > > >>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > >>> [mailto:ceph-devel- owner@vger.kernel.org] On Behalf Of
> > > > > > > > >>> Allen Samuels
> > > > > > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > > > > > >>> To: Sage Weil; Somnath Roy
> > > > > > > > >>> Cc: ceph-devel
> > > > > > > > >>> Subject: RE: Bluestore different allocator performance
> > > > > > > > >>> Vs FileStore
> > > > > > > > >>>
> > > > > > > > >>> We always knew that startup time for bitmap stuff would
> > > > > > > > >>> be somewhat longer. Still, the existing implementation
> > > > > > > > >>> can be speeded up significantly. The code in
> > > > > > > > >>> BitMapZone::set_blocks_used isn't very optimized.
> > > > > > > > >>> Converting it to use memset for all but the first/last
> > > > > > > > >>> bytes
> > > > > > > > >> should significantly speed it up.
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> Allen Samuels
> > > > > > > > >>> SanDisk |a Western Digital brand
> > > > > > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > >>> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > >>> allen.samuels@SanDisk.com
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>> -----Original Message-----
> > > > > > > > >>>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > >>>> [mailto:ceph-devel- owner@vger.kernel.org] On Behalf Of
> > > > > > > > >>>> Sage Weil
> > > > > > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > > > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > > > > >>>> Subject: RE: Bluestore different allocator performance
> > > > > > > > >>>> Vs FileStore
> > > > > > > > >>>>
> > > > > > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > >>>>> << inline with [Somnath]
> > > > > > > > >>>>>
> > > > > > > > >>>>> -----Original Message-----
> > > > > > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > > > > > >>>>> To: Somnath Roy
> > > > > > > > >>>>> Cc: ceph-devel
> > > > > > > > >>>>> Subject: Re: Bluestore different allocator performance
> > > > > > > > >>>>> Vs FileStore
> > > > > > > > >>>>>
> > > > > > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > >>>>>> Hi, I spent some time on evaluating different
> > > > > > > > >>>>>> Bluestore allocator and freelist performance. Also,
> > > > > > > > >>>>>> tried to gaze the performance difference of Bluestore
> > > > > > > > >>>>>> and filestore on the similar
> > > > > > > > >> setup.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Setup:
> > > > > > > > >>>>>> --------
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Disabled the exclusive lock feature so that I can run
> > > > > > > > >>>>>> multiple write  jobs in
> > > > > > > > >>>> parallel.
> > > > > > > > >>>>>> rbd_cache is disabled in the client side.
> > > > > > > > >>>>>> Each test ran for 15 mins.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Result :
> > > > > > > > >>>>>> ---------
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Here is the detailed report on this.
> > > > > > > > >>
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Each profile I named based on <allocator>-<freelist>
> > > > > > > > >>>>>> , so in the graph for
> > > > > > > > >>>> example "stupid-extent" meaning stupid allocator and
> > > > > > > > >>>> extent
> > > > > > freelist.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> I ran the test for each of the profile in the
> > > > > > > > >>>>>> following order after creating a
> > > > > > > > >>>> fresh rbd image for all the Bluestore test.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> The above are non-preconditioned case i.e ran before
> > > > > > > > >>>>>> filling up the entire
> > > > > > > > >>>> image. The reason is I don't see any reason of filling
> > > > > > > > >>>> up the rbd image before like filestore case where it
> > > > > > > > >>>> will give stable performance if we fill up the rbd images first.
> > > > > > > > >>>> Filling up rbd images in case of filestore will create
> > > > > > > > >>>> the files in
> > > > the filesystem.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 5. Next, I did precondition the 4TB image with 1M seq
> > write.
> > > > > > > > >>>>>> This is
> > > > > > > > >>>> primarily because I want to load BlueStore with more data.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 6. Ran 4K RW test again (this is called out
> > > > > > > > >>>>>> preconditioned in the
> > > > > > > > >>>>>> profile) for 15 min
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> For filestore test, I ran tests after preconditioning
> > > > > > > > >>>>>> the entire image
> > > > > > > > >> first.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Each sheet on the xls have different block size
> > > > > > > > >>>>>> result , I often miss to navigate through the xls
> > > > > > > > >>>>>> sheets , so, thought of mentioning here
> > > > > > > > >>>>>> :-)
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> I have also captured the mkfs time , OSD startup time
> > > > > > > > >>>>>> and the memory
> > > > > > > > >>>> usage after the entire run.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Observation:
> > > > > > > > >>>>>> ---------------
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 1. First of all, in case of bitmap allocator mkfs
> > > > > > > > >>>>>> time (and thus cluster
> > > > > > > > >>>> creation time for 16 OSDs) are ~16X slower than stupid
> > > > > > > > >>>> allocator and
> > > > > > > > >>> filestore.
> > > > > > > > >>>> Each OSD creation is taking ~2min or so sometimes and I
> > > > > > > > >>>> nailed down the
> > > > > > > > >>>> insert_free() function call (marked ****) in the Bitmap
> > > > > > > > >>>> allocator is causing that.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10 freelist
> > > > > > > > >>>>>> enumerate_next start
> > > > > > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10 freelist
> > > > > > > > >>>>>> enumerate_next
> > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > > > > >>>>>> bitmapalloc:init_add_free instance 139913322803328
> > > > > > > > >>>>>> offset
> > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > > > > >>>>>> bitmapalloc:insert_free instance 139913322803328 off
> > > > > > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > > > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10
> > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > >>>>>> end****
> > > > > > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc
> > > > > > > > >>>>>> loaded
> > > > > > > > >>>>>> 6757 G in
> > > > > > > > >>>>>> 1 extents
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > > > > > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> > > > > > ^A:5242880+5242880
> > > > > > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > > > > > > >>>>>> _read_random got
> > > > > > > > >>>>>> 613
> > > > > > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10 freelist
> > > > > > > > >>>>>> enumerate_next
> > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > > > > >>>>>> bitmapalloc:init_add_free instance 139913306273920
> > > > > > > > >>>>>> offset
> > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > > > > >>>>>> bitmapalloc:insert_free instance 139913306273920 off
> > > > > > > > >>>>>> 0x4663d00000 len
> > > > > > > > >>>>>> 0x69959451000*****
> > > > > > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10
> > > > > > > > >>>>>> freelist enumerate_next end
> > > > > > > > >>>>>
> > > > > > > > >>>>> I'm not sure there's any easy fix for this. We can
> > > > > > > > >>>>> amortize it by feeding
> > > > > > > > >>>> space to bluefs slowly (so that we don't have to do all
> > > > > > > > >>>> the inserts at once), but I'm not sure that's really better.
> > > > > > > > >>>>>
> > > > > > > > >>>>> [Somnath] I don't know that part of the code, so, may
> > > > > > > > >>>>> be a dumb
> > > > > > > > >>> question.
> > > > > > > > >>>> This is during mkfs() time , so, can't we say to bluefs
> > > > > > > > >>>> entire space is free ? I can understand for osd mount
> > > > > > > > >>>> and all other cases we need to feed the free space every
> > time.
> > > > > > > > >>>>> IMO this is critical to fix as cluster creation time
> > > > > > > > >>>>> will be number of OSDs * 2
> > > > > > > > >>>> min otherwise. For me creating 16 OSDs cluster is
> > > > > > > > >>>> taking ~32min compare to
> > > > > > > > >>>> ~2 min for stupid allocator/filestore.
> > > > > > > > >>>>> BTW, my drive data partition is ~6.9TB , db partition
> > > > > > > > >>>>> is ~100G and WAL is
> > > > > > > > >>>> ~1G. I guess the time taking is dependent on data
> > > > > > > > >>>> partition size as well
> > > > > > > > (?
> > > > > > > > >>>>
> > > > > > > > >>>> Well, we're fundamentally limited by the fact that it's
> > > > > > > > >>>> a bitmap, and a big chunk of space is "allocated" to
> > > > > > > > >>>> bluefs and needs to have 1's
> > > > > > > > set.
> > > > > > > > >>>>
> > > > > > > > >>>> sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 20:03                           ` Sage Weil
@ 2016-08-11 20:16                             ` Allen Samuels
  2016-08-11 20:24                               ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Allen Samuels @ 2016-08-11 20:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

Perhaps my understanding of BlueFS is incorrect -- so please clarify as needed.

I thought that the authoritative indication of space used by BlueFS was contained in the snapshot/journal of BlueFS itself, NOT in the KV store itself. This requires that upon startup, we replay the BlueFS snapshot/journal into the FreeListManager so that it properly records the consumption of BlueFS space (since that allocation MAY NOT be accurate within the FreeListManager itself). But this playback need not generate any KVStore operations (since those would just duplicate the BlueFS metadata). 

So in the code you cite:

fm->allocate(0, reserved, t);

There's no need to commit 't', and in fact, in the general case, you don't want to commit 't'.

That suggests to me that a version of allocate() that doesn't take a transaction could easily be created and would give us the speed we're looking for (and independence from the BitMapAllocator-to-KVStore chunking).

I suspect that we also have long startup times because we're doing the same underlying bitmap operations, except they come from the BlueFS replay code instead of the BlueFS initialization code; same problem, likely the same fix.
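
Something along these lines is what I have in mind -- purely a sketch 
with made-up names, not the real FreelistManager interface:

  // Hypothetical sketch: keep the transactional allocate() for normal
  // BlueStore allocations, and add a memory-only variant for replaying
  // BlueFS-owned space at mkfs/mount time.
  #include <cstdint>
  #include <iostream>
  #include <utility>
  #include <vector>

  struct KvTransaction {  // stand-in for a KV store transaction
    std::vector<std::pair<uint64_t, uint64_t>> dirty_ranges;
  };

  class FreelistSketch {
    std::vector<std::pair<uint64_t, uint64_t>> in_memory_used;
  public:
    // Existing-style call: records the change in memory and queues the
    // bitmap-bit flips in the KV transaction.
    void allocate(uint64_t off, uint64_t len, KvTransaction& t) {
      in_memory_used.emplace_back(off, len);
      t.dirty_ranges.emplace_back(off, len);
    }

    // Proposed variant: BlueFS-owned space is authoritative in the
    // BlueFS journal/snapshot, so only the in-memory view is updated.
    void allocate_untracked(uint64_t off, uint64_t len) {
      in_memory_used.emplace_back(off, len);
    }

    size_t used_ranges() const { return in_memory_used.size(); }
  };

  int main() {
    FreelistSketch fm;
    fm.allocate_untracked(0, 1ull << 37);  // mark a BlueFS reservation
    std::cout << "in-memory used ranges: " << fm.used_ranges()
              << " (no KV work queued)" << std::endl;
    return 0;
  }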

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, August 11, 2016 1:03 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 12:34 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > Sent: Thursday, August 11, 2016 10:15 AM
> > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > devel@vger.kernel.org>
> > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > FileStore
> > > > >
> > > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > Sent: Thursday, August 11, 2016 9:38 AM
> > > > > > > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > > > > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> > > > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > > > devel@vger.kernel.org>
> > > > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > > > FileStore
> > > > > > >
> > > > > > > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > > > > > > I think the free list does not initialize all keys at mkfs
> > > > > > > > time, it does sets key that has some allocations.
> > > > > > > >
> > > > > > > > Rest keys are assumed to have 0's if key does not exist.
> > > > > > >
> > > > > > > Right.. it's the region "allocated" to bluefs that is consuming the
> time.
> > > > > > >
> > > > > > > > The bitmap allocator insert_free is done in group of free
> > > > > > > > bits together(maybe more than bitmap freelist keys at a time).
> > > > > > >
> > > > > > > I think Allen is asking whether we are doing lots of inserts
> > > > > > > within a single rocksdb transaction, or lots of separate
> transactions.
> > > > > > >
> > > > > > > FWIW, my guess is that increasing the size of the value
> > > > > > > (i.e., increasing
> > > > > > >
> > > > > > > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > > > > > >
> > > > > > > ) will probably speed this up.
> > > > > >
> > > > > > If your assumption (> Right.. it's the region "allocated" to
> > > > > > bluefs that is consuming the time) is correct, then I don't
> > > > > > understand why this parameter has any effect on the problem.
> > > > > >
> > > > > > Aren't we reading BlueFS extents and setting them in the
> > > > > > BitMapAllocator? That doesn't care about the chunking of
> > > > > > bitmap bits into KV keys.
> > > > >
> > > > > I think this is something different.  During mkfs we take ~2%
> > > > > (or somethign like that) of the block device, mark it
> > > > > 'allocated' (from the bluestore freelist's
> > > > > perspective) and give it to bluefs.  On a large device that's a
> > > > > lot of bits to
> > > set.
> > > > > Larger keys should speed that up.
> > > >
> > > > But the bits in the BitMap shouldn't be chunked up in the same
> > > > units as the Keys. Right? Sharding of the bitmap is done for
> > > > internal parallelism
> > > > -- only, it has nothing to do with the persistent representation.
> > >
> > > I'm not really sure what the BitmapAllocator is doing, but yeah,
> > > it's independent.  The tunable I'm talking about though is the one
> > > that controls how many bits BitmapFreelist puts in each key/value pair.
> >
> > I understand, but that should be relevant only to operations that
> > actually either read or write to the KV Store. That's not the case
> > here, allocations by BlueFS are not recorded in the KVStore.
> >
> > Whatever chunking/sharding of the bitmapfreelist is present should be
> > independent (well an integer multiple thereof....) of the number of
> > bits that are chunked up into a single KV Key/Value pair. Hence when
> > doing the initialization here (i.e., the marking of BlueFS allocated
> > space in the freelist) that shouldn't involve ANY KVStore operations.
> > I think it's worthwhile to modify the option (say make it 16 or 64x
> > larger) and see if that actually affects the initialization time -- if
> > it does, then there's something structurally inefficient in the code
> > that's hopefully easy to fix.
> 
> This is the allocation of space *to* bluefs, not *by* bluefs.  At mkfs time, we
> (BlueStore::mkfs() -> _open_fm()) will take 2% of the block device and mark
> it in-use with that fm->allocate() call below, and that flips a bunch of bits in
> the kv store.
> 
> > > > BlueFS allocations aren't stored in the KV database (to avoid
> > > > circularity).
> > > >
> > > > So I don't see why a bitset of 2m bits should be taking so long.....
> > > > Makes me thing that we don't really understand the problem.
> > >
> > > Could be, I'm just guessing.  During mkfs, _open_fm() does
> > >
> > >     fm->create(bdev->get_size(), t);
> > >
> > > and then
> > >
> > >     fm->allocate(0, reserved, t);
> 
>         ^ here.
> 
> > >
> > > where the value of reserved depends on how much we give to bluefs.
> > > I'm assuming this is the mkfs allocation that is taking time, but I
> > > haven't looked at the allocator code at all or whether insert_free is part
> of this path...
> >
> > Somnath's data clearly points to this....
> 
> sage
> 
> >
> > >
> > > sage
> > >
> > >
> > >
> > > >
> > > > >
> > > > > The amount of space we start with comes from _open_db():
> > > > >
> > > > >       uint64_t initial =
> > > > > 	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
> > > > > 			    g_conf->bluestore_bluefs_gift_ratio);
> > > > >       initial = MAX(initial, g_conf->bluestore_bluefs_min);
> > > > >
> > > > > Simply lowering min_ratio might also be fine.  The current value
> > > > > of 2% is meant to be enough for most stores, and to avoid giving
> > > > > over lots of little extents later (and making the bluefs_extents
> > > > > list too big).  That can overflow the superblock, another
> > > > > annoying thing we need to fix (though not a big deal to fix).
> > > > >
> > > > > Anyway, adjust bluestore_bluefs_min_ratio to .01 should ~halve
> > > > > the time spent on this.. that is probably another useful test to
> > > > > confirm this is what is going on.
> > > >
> > > > Yes, this should help -- but still seems like a bandaid.
> > > >
> > > > >
> > > > > sage
> > > > >
> > > > > > I would be cautious about just changing this option to affect
> > > > > > this problem (though as an experiment, we can change the value
> > > > > > and see if it has ANY affect on this problem -- which I don't
> > > > > > think it will). The value of this option really needs to be
> > > > > > dictated by its effect on the more mainstream read/write
> > > > > > operations not on the
> > > initialization problem.
> > > > > > >
> > > > > > > sage
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > -Ramesh
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Allen Samuels
> > > > > > > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > > > > > > To: Ramesh Chander
> > > > > > > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > > > > > > Subject: Re: Bluestore different allocator performance
> > > > > > > > > Vs FileStore
> > > > > > > > >
> > > > > > > > > Is the initial creation of the keys for the bitmap one
> > > > > > > > > by one or are they batched?
> > > > > > > > >
> > > > > > > > > Sent from my iPhone. Please excuse all typos and
> autocorrects.
> > > > > > > > >
> > > > > > > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > > > > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Somnath,
> > > > > > > > > >
> > > > > > > > > > Basically mkfs time has increased from 7.5 seconds
> > > > > > > > > > (2min /
> > > > > > > > > > 16) to
> > > > > > > > > > 2 minutes
> > > > > > > > > ( 32 / 16).
> > > > > > > > > >
> > > > > > > > > > But is there a reason you should create osds in
> > > > > > > > > > serial? I think for mmultiple
> > > > > > > > > osds mkfs can happen in parallel?
> > > > > > > > > >
> > > > > > > > > > As a fix I am looking to batch multiple insert_free calls for
> now.
> > > > > > > > > > If still that
> > > > > > > > > does not help, thinking of doing insert_free on
> > > > > > > > > different part of device in parallel.
> > > > > > > > > >
> > > > > > > > > > -Ramesh
> > > > > > > > > >
> > > > > > > > > >> -----Original Message-----
> > > > > > > > > >> From: Ramesh Chander
> > > > > > > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > > > > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > > > > > > >> Cc: ceph-devel
> > > > > > > > > >> Subject: RE: Bluestore different allocator
> > > > > > > > > >> performance Vs FileStore
> > > > > > > > > >>
> > > > > > > > > >> I think insert_free is limited by speed of function
> > > > > > > > > >> clear_bits
> > > here.
> > > > > > > > > >>
> > > > > > > > > >> Though set_bits and clear_bits have same logic except
> > > > > > > > > >> one sets and another clears. Both of these does 64
> > > > > > > > > >> bits (bitmap size) at
> > > > > a time.
> > > > > > > > > >>
> > > > > > > > > >> I am not sure if doing memset will make it faster.
> > > > > > > > > >> But if we can do it for group of bitmaps, then it might help.
> > > > > > > > > >>
> > > > > > > > > >> I am looking in to code if we can handle mkfs and osd
> > > > > > > > > >> mount in special way to make it faster.
> > > > > > > > > >>
> > > > > > > > > >> If I don't find an easy fix, we can go to path of
> > > > > > > > > >> deferring init to later stage as and when required.
> > > > > > > > > >>
> > > > > > > > > >> -Ramesh
> > > > > > > > > >>
> > > > > > > > > >>> -----Original Message-----
> > > > > > > > > >>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > >>> [mailto:ceph-devel- owner@vger.kernel.org] On Behalf
> > > > > > > > > >>> Of Allen Samuels
> > > > > > > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > > > > > > >>> To: Sage Weil; Somnath Roy
> > > > > > > > > >>> Cc: ceph-devel
> > > > > > > > > >>> Subject: RE: Bluestore different allocator
> > > > > > > > > >>> performance Vs FileStore
> > > > > > > > > >>>
> > > > > > > > > >>> We always knew that startup time for bitmap stuff
> > > > > > > > > >>> would be somewhat longer. Still, the existing
> > > > > > > > > >>> implementation can be speeded up significantly. The
> > > > > > > > > >>> code in BitMapZone::set_blocks_used isn't very
> optimized.
> > > > > > > > > >>> Converting it to use memset for all but the
> > > > > > > > > >>> first/last bytes
> > > > > > > > > >> should significantly speed it up.
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>> Allen Samuels
> > > > > > > > > >>> SanDisk |a Western Digital brand
> > > > > > > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > > >>> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > >>> allen.samuels@SanDisk.com
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>> -----Original Message-----
> > > > > > > > > >>>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > >>>> [mailto:ceph-devel- owner@vger.kernel.org] On
> > > > > > > > > >>>> Behalf Of Sage Weil
> > > > > > > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > > > > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > > > > > >>>> Subject: RE: Bluestore different allocator
> > > > > > > > > >>>> performance Vs FileStore
> > > > > > > > > >>>>
> > > > > > > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > > >>>>> << inline with [Somnath]
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> -----Original Message-----
> > > > > > > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > > > > > > >>>>> To: Somnath Roy
> > > > > > > > > >>>>> Cc: ceph-devel
> > > > > > > > > >>>>> Subject: Re: Bluestore different allocator
> > > > > > > > > >>>>> performance Vs FileStore
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > > >>>>>> Hi, I spent some time on evaluating different
> > > > > > > > > >>>>>> Bluestore allocator and freelist performance.
> > > > > > > > > >>>>>> Also, tried to gaze the performance difference of
> > > > > > > > > >>>>>> Bluestore and filestore on the similar
> > > > > > > > > >> setup.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Setup:
> > > > > > > > > >>>>>> --------
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Disabled the exclusive lock feature so that I can
> > > > > > > > > >>>>>> run multiple write  jobs in
> > > > > > > > > >>>> parallel.
> > > > > > > > > >>>>>> rbd_cache is disabled in the client side.
> > > > > > > > > >>>>>> Each test ran for 15 mins.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Result :
> > > > > > > > > >>>>>> ---------
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Here is the detailed report on this.
> > > > > > > > > >>
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > > > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Each profile I named based on
> > > > > > > > > >>>>>> <allocator>-<freelist> , so in the graph for
> > > > > > > > > >>>> example "stupid-extent" meaning stupid allocator
> > > > > > > > > >>>> and extent
> > > > > > > freelist.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> I ran the test for each of the profile in the
> > > > > > > > > >>>>>> following order after creating a
> > > > > > > > > >>>> fresh rbd image for all the Bluestore test.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> The above are non-preconditioned case i.e ran
> > > > > > > > > >>>>>> before filling up the entire
> > > > > > > > > >>>> image. The reason is I don't see any reason of
> > > > > > > > > >>>> filling up the rbd image before like filestore case
> > > > > > > > > >>>> where it will give stable performance if we fill up the rbd
> images first.
> > > > > > > > > >>>> Filling up rbd images in case of filestore will
> > > > > > > > > >>>> create the files in
> > > > > the filesystem.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 5. Next, I did precondition the 4TB image with 1M
> > > > > > > > > >>>>>> seq
> > > write.
> > > > > > > > > >>>>>> This is
> > > > > > > > > >>>> primarily because I want to load BlueStore with more
> data.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 6. Ran 4K RW test again (this is called out
> > > > > > > > > >>>>>> preconditioned in the
> > > > > > > > > >>>>>> profile) for 15 min
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> For filestore test, I ran tests after
> > > > > > > > > >>>>>> preconditioning the entire image
> > > > > > > > > >> first.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Each sheet on the xls have different block size
> > > > > > > > > >>>>>> result , I often miss to navigate through the xls
> > > > > > > > > >>>>>> sheets , so, thought of mentioning here
> > > > > > > > > >>>>>> :-)
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> I have also captured the mkfs time , OSD startup
> > > > > > > > > >>>>>> time and the memory
> > > > > > > > > >>>> usage after the entire run.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Observation:
> > > > > > > > > >>>>>> ---------------
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 1. First of all, in case of bitmap allocator mkfs
> > > > > > > > > >>>>>> time (and thus cluster
> > > > > > > > > >>>> creation time for 16 OSDs) are ~16X slower than
> > > > > > > > > >>>> stupid allocator and
> > > > > > > > > >>> filestore.
> > > > > > > > > >>>> Each OSD creation is taking ~2min or so sometimes
> > > > > > > > > >>>> and I nailed down the
> > > > > > > > > >>>> insert_free() function call (marked ****) in the
> > > > > > > > > >>>> Bitmap allocator is causing that.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10
> > > > > > > > > >>>>>> freelist enumerate_next start
> > > > > > > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10
> > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > > > > > >>>>>> bitmapalloc:init_add_free instance
> > > > > > > > > >>>>>> 139913322803328 offset
> > > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > > > > > >>>>>> bitmapalloc:insert_free instance 139913322803328
> > > > > > > > > >>>>>> off
> > > > > > > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > > > > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10
> > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > >>>>>> end****
> > > > > > > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > > > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc
> > > > > > > > > >>>>>> loaded
> > > > > > > > > >>>>>> 6757 G in
> > > > > > > > > >>>>>> 1 extents
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > > > > > > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> > > > > > > ^A:5242880+5242880
> > > > > > > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > > > > > > > >>>>>> _read_random got
> > > > > > > > > >>>>>> 613
> > > > > > > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10
> > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > > > > > >>>>>> bitmapalloc:init_add_free instance
> > > > > > > > > >>>>>> 139913306273920 offset
> > > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > > > > > >>>>>> bitmapalloc:insert_free instance 139913306273920
> > > > > > > > > >>>>>> off
> > > > > > > > > >>>>>> 0x4663d00000 len
> > > > > > > > > >>>>>> 0x69959451000*****
> > > > > > > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10
> > > > > > > > > >>>>>> freelist enumerate_next end
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> I'm not sure there's any easy fix for this. We can
> > > > > > > > > >>>>> amortize it by feeding
> > > > > > > > > >>>> space to bluefs slowly (so that we don't have to do
> > > > > > > > > >>>> all the inserts at once), but I'm not sure that's really
> better.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> [Somnath] I don't know that part of the code, so,
> > > > > > > > > >>>>> may be a dumb
> > > > > > > > > >>> question.
> > > > > > > > > >>>> This is during mkfs() time , so, can't we say to
> > > > > > > > > >>>> bluefs entire space is free ? I can understand for
> > > > > > > > > >>>> osd mount and all other cases we need to feed the
> > > > > > > > > >>>> free space every
> > > time.
> > > > > > > > > >>>>> IMO this is critical to fix as cluster creation
> > > > > > > > > >>>>> time will be number of OSDs * 2
> > > > > > > > > >>>> min otherwise. For me creating 16 OSDs cluster is
> > > > > > > > > >>>> taking ~32min compare to
> > > > > > > > > >>>> ~2 min for stupid allocator/filestore.
> > > > > > > > > >>>>> BTW, my drive data partition is ~6.9TB , db
> > > > > > > > > >>>>> partition is ~100G and WAL is
> > > > > > > > > >>>> ~1G. I guess the time taking is dependent on data
> > > > > > > > > >>>> partition size as well
> > > > > > > > > (?
> > > > > > > > > >>>>
> > > > > > > > > >>>> Well, we're fundamentally limited by the fact that
> > > > > > > > > >>>> it's a bitmap, and a big chunk of space is
> > > > > > > > > >>>> "allocated" to bluefs and needs to have 1's
> > > > > > > > > set.
> > > > > > > > > >>>>
> > > > > > > > > >>>> sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 20:16                             ` Allen Samuels
@ 2016-08-11 20:24                               ` Sage Weil
  2016-08-11 20:28                                 ` Allen Samuels
  0 siblings, 1 reply; 34+ messages in thread
From: Sage Weil @ 2016-08-11 20:24 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Allen Samuels wrote:
> Perhaps my understanding of BlueFS is incorrect -- so please clarify 
> as needed.
> 
> I thought that the authoritative indication of space used by BlueFS was 
> contained in the snapshot/journal of BlueFS itself, NOT in the KV store 
> itself. This requires that upon startup, we replay the BlueFS 
> snapshot/journal into the FreeListManager so that it properly records 
> the consumption of BlueFS space (since that allocation MAY NOT be 
> accurate within the FreeListManager itself). But this playback need not 
> generate any KVStore operations (since those would just duplicate the 
> BlueFS metadata).
> 
> So in the code you cite:
> 
> fm->allocate(0, reserved, t);
> 
> There's no need to commit 't', and in fact, in the general case, you 
> don't want to commit 't'.
> 
> That suggests to me that a version of allocate() that doesn't take a 
> transaction could easily be created and would give us the speed we're 
> looking for (and independence from the BitMapAllocator-to-KVStore chunking).

Oh, I see.  Yeah, you're right--this step isn't really necessary, as long 
as we ensure that the auxiliary representation of what bluefs owns 
(bluefs_extents in the superblock) is still passed into the Allocator 
during initialization.  Having the freelist record that this space is 
"in use" (by bluefs) and thus off limits to bluestore is simple but not 
strictly necessary.

I'll work on a PR that avoids this...
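
What I have in mind is roughly the following toy model (illustrative 
names only, not the real Allocator interface; nothing in this path 
writes to the KV store):

  // Toy sketch: build the allocator's in-memory state from the
  // freelist's free extents, then carve out whatever the superblock's
  // bluefs_extents says bluefs owns.
  #include <cstdint>
  #include <iostream>
  #include <map>

  struct ToyAllocator {
    std::map<uint64_t, uint64_t> free_extents;  // offset -> length

    void init_add_free(uint64_t off, uint64_t len) {
      free_extents[off] = len;
    }

    // Remove a range owned by someone else (here: bluefs).  Simplified:
    // assumes the range starts exactly at an existing free extent.
    void init_rm_free(uint64_t off, uint64_t len) {
      auto it = free_extents.find(off);
      if (it != free_extents.end() && it->second >= len) {
        uint64_t remaining = it->second - len;
        free_extents.erase(it);
        if (remaining)
          free_extents[off + len] = remaining;
      }
    }
  };

  int main() {
    ToyAllocator alloc;
    uint64_t dev_size = 8ull << 40;
    // 1) feed in what the freelist reports as free
    alloc.init_add_free(0, dev_size);
    // 2) carve out what bluefs_extents says bluefs owns
    alloc.init_rm_free(0, dev_size * 2 / 100);
    for (auto& e : alloc.free_extents)
      std::cout << "free: 0x" << std::hex << e.first
                << "~0x" << e.second << std::endl;
    return 0;
  }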

> I suspect that we also have long startup times because we're doing the 
> same underlying bitmap operations, except they come from the BlueFS 
> replay code instead of the BlueFS initialization code; same problem, 
> likely the same fix.

BlueFS doesn't touch the FreelistManager (or explicitly persist the 
freelist at all)... we initialize the in-memory Allocator state from the 
metadata in the bluefs log.  I think we should be fine on this end.

Thanks!
sage


> 
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 1:03 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> > 
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > Sent: Thursday, August 11, 2016 12:34 PM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > > >
> > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > > Sent: Thursday, August 11, 2016 10:15 AM
> > > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > > devel@vger.kernel.org>
> > > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > > FileStore
> > > > > >
> > > > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > > Sent: Thursday, August 11, 2016 9:38 AM
> > > > > > > > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > > > > > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> > > > > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > > > > devel@vger.kernel.org>
> > > > > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > > > > FileStore
> > > > > > > >
> > > > > > > > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > > > > > > > I think the free list does not initialize all keys at mkfs
> > > > > > > > > time, it does sets key that has some allocations.
> > > > > > > > >
> > > > > > > > > Rest keys are assumed to have 0's if key does not exist.
> > > > > > > >
> > > > > > > > Right.. it's the region "allocated" to bluefs that is consuming the
> > time.
> > > > > > > >
> > > > > > > > > The bitmap allocator insert_free is done in group of free
> > > > > > > > > bits together(maybe more than bitmap freelist keys at a time).
> > > > > > > >
> > > > > > > > I think Allen is asking whether we are doing lots of inserts
> > > > > > > > within a single rocksdb transaction, or lots of separate
> > transactions.
> > > > > > > >
> > > > > > > > FWIW, my guess is that increasing the size of the value
> > > > > > > > (i.e., increasing
> > > > > > > >
> > > > > > > > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > > > > > > >
> > > > > > > > ) will probably speed this up.
> > > > > > >
> > > > > > > If your assumption (> Right.. it's the region "allocated" to
> > > > > > > bluefs that is consuming the time) is correct, then I don't
> > > > > > > understand why this parameter has any effect on the problem.
> > > > > > >
> > > > > > > Aren't we reading BlueFS extents and setting them in the
> > > > > > > BitMapAllocator? That doesn't care about the chunking of
> > > > > > > bitmap bits into KV keys.
> > > > > >
> > > > > > I think this is something different.  During mkfs we take ~2%
> > > > > > (or somethign like that) of the block device, mark it
> > > > > > 'allocated' (from the bluestore freelist's
> > > > > > perspective) and give it to bluefs.  On a large device that's a
> > > > > > lot of bits to
> > > > set.
> > > > > > Larger keys should speed that up.
> > > > >
> > > > > But the bits in the BitMap shouldn't be chunked up in the same
> > > > > units as the Keys. Right? Sharding of the bitmap is done for
> > > > > internal parallelism
> > > > > -- only, it has nothing to do with the persistent representation.
> > > >
> > > > I'm not really sure what the BitmapAllocator is doing, but yeah,
> > > > it's independent.  The tunable I'm talking about though is the one
> > > > that controls how many bits BitmapFreelist puts in each key/value pair.
> > >
> > > I understand, but that should be relevant only to operations that
> > > actually either read or write to the KV Store. That's not the case
> > > here, allocations by BlueFS are not recorded in the KVStore.
> > >
> > > Whatever chunking/sharding of the bitmapfreelist is present should be
> > > independent (well an integer multiple thereof....) of the number of
> > > bits that are chunked up into a single KV Key/Value pair. Hence when
> > > doing the initialization here (i.e., the marking of BlueFS allocated
> > > space in the freelist) that shouldn't involve ANY KVStore operations.
> > > I think it's worthwhile to modify the option (say make it 16 or 64x
> > > larger) and see if that actually affects the initialization time -- if
> > > it does, then there's something structurally inefficient in the code
> > > that's hopefully easy to fix.
> > 
> > This is the allocation of space *to* bluefs, not *by* bluefs.  At mkfs time, we
> > (BlueStore::mkfs() -> _open_fm()) will take 2% of the block device and mark
> > it in-use with that fm->allocate() call below, and that flips a bunch of bits in
> > the kv store.
> > 
> > > > > BlueFS allocations aren't stored in the KV database (to avoid
> > > > > circularity).
> > > > >
> > > > > So I don't see why a bitset of 2m bits should be taking so long.....
> > > > > Makes me thing that we don't really understand the problem.
> > > >
> > > > Could be, I'm just guessing.  During mkfs, _open_fm() does
> > > >
> > > >     fm->create(bdev->get_size(), t);
> > > >
> > > > and then
> > > >
> > > >     fm->allocate(0, reserved, t);
> > 
> >         ^ here.
> > 
> > > >
> > > > where the value of reserved depends on how much we give to bluefs.
> > > > I'm assuming this is the mkfs allocation that is taking time, but I
> > > > haven't looked at the allocator code at all or whether insert_free is part
> > of this path...
> > >
> > > Somnath's data clearly points to this....
> > 
> > sage
> > 
> > >
> > > >
> > > > sage
> > > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > The amount of space we start with comes from _open_db():
> > > > > >
> > > > > >       uint64_t initial =
> > > > > > 	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
> > > > > > 			    g_conf->bluestore_bluefs_gift_ratio);
> > > > > >       initial = MAX(initial, g_conf->bluestore_bluefs_min);
> > > > > >
> > > > > > Simply lowering min_ratio might also be fine.  The current value
> > > > > > of 2% is meant to be enough for most stores, and to avoid giving
> > > > > > over lots of little extents later (and making the bluefs_extents
> > > > > > list too big).  That can overflow the superblock, another
> > > > > > annoying thing we need to fix (though not a big deal to fix).
> > > > > >
> > > > > > Anyway, adjust bluestore_bluefs_min_ratio to .01 should ~halve
> > > > > > the time spent on this.. that is probably another useful test to
> > > > > > confirm this is what is going on.
> > > > >
> > > > > Yes, this should help -- but still seems like a bandaid.
> > > > >
> > > > > >
> > > > > > sage
> > > > > >
> > > > > > > I would be cautious about just changing this option to affect
> > > > > > > this problem (though as an experiment, we can change the value
> > > > > > > and see if it has ANY affect on this problem -- which I don't
> > > > > > > think it will). The value of this option really needs to be
> > > > > > > dictated by its effect on the more mainstream read/write
> > > > > > > operations not on the
> > > > initialization problem.
> > > > > > > >
> > > > > > > > sage
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > -Ramesh
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Allen Samuels
> > > > > > > > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > > > > > > > To: Ramesh Chander
> > > > > > > > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > > > > > > > Subject: Re: Bluestore different allocator performance
> > > > > > > > > > Vs FileStore
> > > > > > > > > >
> > > > > > > > > > Is the initial creation of the keys for the bitmap one
> > > > > > > > > > by one or are they batched?
> > > > > > > > > >
> > > > > > > > > > Sent from my iPhone. Please excuse all typos and
> > autocorrects.
> > > > > > > > > >
> > > > > > > > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > > > > > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Somnath,
> > > > > > > > > > >
> > > > > > > > > > > Basically mkfs time has increased from 7.5 seconds
> > > > > > > > > > > (2min /
> > > > > > > > > > > 16) to
> > > > > > > > > > > 2 minutes
> > > > > > > > > > ( 32 / 16).
> > > > > > > > > > >
> > > > > > > > > > > But is there a reason you should create osds in
> > > > > > > > > > > serial? I think for mmultiple
> > > > > > > > > > osds mkfs can happen in parallel?
> > > > > > > > > > >
> > > > > > > > > > > As a fix I am looking to batch multiple insert_free calls for
> > now.
> > > > > > > > > > > If still that
> > > > > > > > > > does not help, thinking of doing insert_free on
> > > > > > > > > > different part of device in parallel.
> > > > > > > > > > >
> > > > > > > > > > > -Ramesh
> > > > > > > > > > >
> > > > > > > > > > >> -----Original Message-----
> > > > > > > > > > >> From: Ramesh Chander
> > > > > > > > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > > > > > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > > > > > > > >> Cc: ceph-devel
> > > > > > > > > > >> Subject: RE: Bluestore different allocator
> > > > > > > > > > >> performance Vs FileStore
> > > > > > > > > > >>
> > > > > > > > > > >> I think insert_free is limited by speed of function
> > > > > > > > > > >> clear_bits
> > > > here.
> > > > > > > > > > >>
> > > > > > > > > > >> Though set_bits and clear_bits have same logic except
> > > > > > > > > > >> one sets and another clears. Both of these does 64
> > > > > > > > > > >> bits (bitmap size) at
> > > > > > a time.
> > > > > > > > > > >>
> > > > > > > > > > >> I am not sure if doing memset will make it faster.
> > > > > > > > > > >> But if we can do it for group of bitmaps, then it might help.
> > > > > > > > > > >>
> > > > > > > > > > >> I am looking in to code if we can handle mkfs and osd
> > > > > > > > > > >> mount in special way to make it faster.
> > > > > > > > > > >>
> > > > > > > > > > >> If I don't find an easy fix, we can go to path of
> > > > > > > > > > >> deferring init to later stage as and when required.
> > > > > > > > > > >>
> > > > > > > > > > >> -Ramesh
> > > > > > > > > > >>
> > > > > > > > > > >>> -----Original Message-----
> > > > > > > > > > >>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > >>> [mailto:ceph-devel- owner@vger.kernel.org] On Behalf
> > > > > > > > > > >>> Of Allen Samuels
> > > > > > > > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > > > > > > > >>> To: Sage Weil; Somnath Roy
> > > > > > > > > > >>> Cc: ceph-devel
> > > > > > > > > > >>> Subject: RE: Bluestore different allocator
> > > > > > > > > > >>> performance Vs FileStore
> > > > > > > > > > >>>
> > > > > > > > > > >>> We always knew that startup time for bitmap stuff
> > > > > > > > > > >>> would be somewhat longer. Still, the existing
> > > > > > > > > > >>> implementation can be speeded up significantly. The
> > > > > > > > > > >>> code in BitMapZone::set_blocks_used isn't very
> > optimized.
> > > > > > > > > > >>> Converting it to use memset for all but the
> > > > > > > > > > >>> first/last bytes
> > > > > > > > > > >> should significantly speed it up.
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>> Allen Samuels
> > > > > > > > > > >>> SanDisk |a Western Digital brand
> > > > > > > > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > > > >>> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > > >>> allen.samuels@SanDisk.com
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>> -----Original Message-----
> > > > > > > > > > >>>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > >>>> [mailto:ceph-devel- owner@vger.kernel.org] On
> > > > > > > > > > >>>> Behalf Of Sage Weil
> > > > > > > > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > > > > > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > > > > > > >>>> Subject: RE: Bluestore different allocator
> > > > > > > > > > >>>> performance Vs FileStore
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > >>>>> << inline with [Somnath]
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> -----Original Message-----
> > > > > > > > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > > > > > > > >>>>> To: Somnath Roy
> > > > > > > > > > >>>>> Cc: ceph-devel
> > > > > > > > > > >>>>> Subject: Re: Bluestore different allocator
> > > > > > > > > > >>>>> performance Vs FileStore
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > >>>>>> Hi, I spent some time on evaluating different
> > > > > > > > > > >>>>>> Bluestore allocator and freelist performance.
> > > > > > > > > > >>>>>> Also, tried to gaze the performance difference of
> > > > > > > > > > >>>>>> Bluestore and filestore on the similar
> > > > > > > > > > >> setup.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Setup:
> > > > > > > > > > >>>>>> --------
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Single pool and single rbd image of 4TB. 2X replication.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Disabled the exclusive lock feature so that I can
> > > > > > > > > > >>>>>> run multiple write  jobs in
> > > > > > > > > > >>>> parallel.
> > > > > > > > > > >>>>>> rbd_cache is disabled in the client side.
> > > > > > > > > > >>>>>> Each test ran for 15 mins.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Result :
> > > > > > > > > > >>>>>> ---------
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Here is the detailed report on this.
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > > > > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Each profile I named based on
> > > > > > > > > > >>>>>> <allocator>-<freelist> , so in the graph for
> > > > > > > > > > >>>> example "stupid-extent" meaning stupid allocator
> > > > > > > > > > >>>> and extent
> > > > > > > > freelist.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> I ran the test for each of the profile in the
> > > > > > > > > > >>>>>> following order after creating a
> > > > > > > > > > >>>> fresh rbd image for all the Bluestore test.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> The above are non-preconditioned case i.e ran
> > > > > > > > > > >>>>>> before filling up the entire
> > > > > > > > > > >>>> image. The reason is I don't see any reason of
> > > > > > > > > > >>>> filling up the rbd image before like filestore case
> > > > > > > > > > >>>> where it will give stable performance if we fill up the rbd
> > images first.
> > > > > > > > > > >>>> Filling up rbd images in case of filestore will
> > > > > > > > > > >>>> create the files in
> > > > > > the filesystem.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 5. Next, I did precondition the 4TB image with 1M
> > > > > > > > > > >>>>>> seq
> > > > write.
> > > > > > > > > > >>>>>> This is
> > > > > > > > > > >>>> primarily because I want to load BlueStore with more
> > data.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 6. Ran 4K RW test again (this is called out
> > > > > > > > > > >>>>>> preconditioned in the
> > > > > > > > > > >>>>>> profile) for 15 min
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> For filestore test, I ran tests after
> > > > > > > > > > >>>>>> preconditioning the entire image
> > > > > > > > > > >> first.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Each sheet on the xls have different block size
> > > > > > > > > > >>>>>> result , I often miss to navigate through the xls
> > > > > > > > > > >>>>>> sheets , so, thought of mentioning here
> > > > > > > > > > >>>>>> :-)
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> I have also captured the mkfs time , OSD startup
> > > > > > > > > > >>>>>> time and the memory
> > > > > > > > > > >>>> usage after the entire run.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Observation:
> > > > > > > > > > >>>>>> ---------------
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 1. First of all, in case of bitmap allocator mkfs
> > > > > > > > > > >>>>>> time (and thus cluster
> > > > > > > > > > >>>> creation time for 16 OSDs) are ~16X slower than
> > > > > > > > > > >>>> stupid allocator and
> > > > > > > > > > >>> filestore.
> > > > > > > > > > >>>> Each OSD creation is taking ~2min or so sometimes
> > > > > > > > > > >>>> and I nailed down the
> > > > > > > > > > >>>> insert_free() function call (marked ****) in the
> > > > > > > > > > >>>> Bitmap allocator is causing that.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10
> > > > > > > > > > >>>>>> freelist enumerate_next start
> > > > > > > > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10
> > > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > > > > > > >>>>>> bitmapalloc:init_add_free instance
> > > > > > > > > > >>>>>> 139913322803328 offset
> > > > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0 20
> > > > > > > > > > >>>>>> bitmapalloc:insert_free instance 139913322803328
> > > > > > > > > > >>>>>> off
> > > > > > > > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > > > > > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0 10
> > > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > > >>>>>> end****
> > > > > > > > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > > > > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0) _open_alloc
> > > > > > > > > > >>>>>> loaded
> > > > > > > > > > >>>>>> 6757 G in
> > > > > > > > > > >>>>>> 1 extents
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20 bluefs
> > > > > > > > > > >>>>>> _read_random read buffered 0x4a14eb~265 of
> > > > > > > > ^A:5242880+5242880
> > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20 bluefs
> > > > > > > > > > >>>>>> _read_random got
> > > > > > > > > > >>>>>> 613
> > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10
> > > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > > > > > > >>>>>> bitmapalloc:init_add_free instance
> > > > > > > > > > >>>>>> 139913306273920 offset
> > > > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0 20
> > > > > > > > > > >>>>>> bitmapalloc:insert_free instance 139913306273920
> > > > > > > > > > >>>>>> off
> > > > > > > > > > >>>>>> 0x4663d00000 len
> > > > > > > > > > >>>>>> 0x69959451000*****
> > > > > > > > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0 10
> > > > > > > > > > >>>>>> freelist enumerate_next end
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> I'm not sure there's any easy fix for this. We can
> > > > > > > > > > >>>>> amortize it by feeding
> > > > > > > > > > >>>> space to bluefs slowly (so that we don't have to do
> > > > > > > > > > >>>> all the inserts at once), but I'm not sure that's really
> > better.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> [Somnath] I don't know that part of the code, so,
> > > > > > > > > > >>>>> may be a dumb
> > > > > > > > > > >>> question.
> > > > > > > > > > >>>> This is during mkfs() time , so, can't we say to
> > > > > > > > > > >>>> bluefs entire space is free ? I can understand for
> > > > > > > > > > >>>> osd mount and all other cases we need to feed the
> > > > > > > > > > >>>> free space every
> > > > time.
> > > > > > > > > > >>>>> IMO this is critical to fix as cluster creation
> > > > > > > > > > >>>>> time will be number of OSDs * 2
> > > > > > > > > > >>>> min otherwise. For me creating 16 OSDs cluster is
> > > > > > > > > > >>>> taking ~32min compare to
> > > > > > > > > > >>>> ~2 min for stupid allocator/filestore.
> > > > > > > > > > >>>>> BTW, my drive data partition is ~6.9TB , db
> > > > > > > > > > >>>>> partition is ~100G and WAL is
> > > > > > > > > > >>>> ~1G. I guess the time taking is dependent on data
> > > > > > > > > > >>>> partition size as well
> > > > > > > > > > (?
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Well, we're fundamentally limited by the fact that
> > > > > > > > > > >>>> it's a bitmap, and a big chunk of space is
> > > > > > > > > > >>>> "allocated" to bluefs and needs to have 1's
> > > > > > > > > > set.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 20:24                               ` Sage Weil
@ 2016-08-11 20:28                                 ` Allen Samuels
  2016-08-11 21:19                                   ` Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Allen Samuels @ 2016-08-11 20:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, August 11, 2016 1:24 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > Perhaps my understanding of the blueFS is incorrect -- so please
> > clarify as needed.
> >
> > I thought that the authoritative indication of space used by BlueFS
> > was contained in the snapshot/journal of BlueFS itself, NOT in the KV
> > store itself. This requires that upon startup, we replay the BlueFS
> > snapshot/journal into the FreeListManager so that it properly records
> > the consumption of BlueFS space (since that allocation MAY NOT be
> > accurate within the FreeListmanager itself). But that this playback
> > need not generate an KVStore operations (since those are duplicates of
> > the BlueFS).
> >
> > So in the code you cite:
> >
> > fm->allocate(0, reserved, t);
> >
> > There's no need to commit 't', and in fact, in the general case, you
> > don't want to commit 't'.
> >
> > That suggests to me that a version of allocate that doesn't have a
> > transaction could be easily created would have the speed we're looking
> > for (and independence from the BitMapAllocator to KVStore chunking).
> 
> Oh, I see.  Yeah, you're right--this step isn't really necessary, as long as we
> ensure that the auxilliary representation of what bluefs owns
> (bluefs_extents in the superblock) is still passed into the Allocator during
> initialization.  Having the freelist reflect the allocator that this space was "in
> use" (by bluefs) and thus off limits to bluestore is simple but not strictly
> necessary.
> 
> I'll work on a PR that avoids this...
> 
> > I suspect that we also have long startup times because we're doing the
> > same underlying bitmap operations except they come from the BlueFS
> > replay code instead of the BlueFS initialization code, but same
> > problem with likely the same fix.
> 
> BlueFS doesn't touch the FreelistManager (or explicitly persist the freelist at
> all)... we initialize the in-memory Allocator state from the metadata in the
> bluefs log.  I think we should be fine on this end.

Likely that code suffers from the same problem -- a false need to update the KV Store (during the playback, BlueFS extents are converted to bitmap runs; it's essentially the same lower-level code as the case we're seeing now, but instead of being driven by an artificial "big run" it'll be driven from the BlueFS journal replay code). But that's just a guess; I don't have time to track down the actual code right now.

> 
> Thanks!
> sage
> 
> 
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 1:03 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > Sent: Thursday, August 11, 2016 12:34 PM
> > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > devel@vger.kernel.org>
> > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > FileStore
> > > > >
> > > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > Sent: Thursday, August 11, 2016 10:15 AM
> > > > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath
> Roy
> > > > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > > > devel@vger.kernel.org>
> > > > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > > > FileStore
> > > > > > >
> > > > > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > > > Sent: Thursday, August 11, 2016 9:38 AM
> > > > > > > > > To: Ramesh Chander <Ramesh.Chander@sandisk.com>
> > > > > > > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath
> > > > > > > > > Roy <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > > > > > devel@vger.kernel.org>
> > > > > > > > > Subject: RE: Bluestore different allocator performance
> > > > > > > > > Vs FileStore
> > > > > > > > >
> > > > > > > > > On Thu, 11 Aug 2016, Ramesh Chander wrote:
> > > > > > > > > > I think the free list does not initialize all keys at
> > > > > > > > > > mkfs time, it does sets key that has some allocations.
> > > > > > > > > >
> > > > > > > > > > Rest keys are assumed to have 0's if key does not exist.
> > > > > > > > >
> > > > > > > > > Right.. it's the region "allocated" to bluefs that is
> > > > > > > > > consuming the
> > > time.
> > > > > > > > >
> > > > > > > > > > The bitmap allocator insert_free is done in group of
> > > > > > > > > > free bits together(maybe more than bitmap freelist keys at a
> time).
> > > > > > > > >
> > > > > > > > > I think Allen is asking whether we are doing lots of
> > > > > > > > > inserts within a single rocksdb transaction, or lots of
> > > > > > > > > separate
> > > transactions.
> > > > > > > > >
> > > > > > > > > FWIW, my guess is that increasing the size of the value
> > > > > > > > > (i.e., increasing
> > > > > > > > >
> > > > > > > > > OPTION(bluestore_freelist_blocks_per_key, OPT_INT, 128)
> > > > > > > > >
> > > > > > > > > ) will probably speed this up.
> > > > > > > >
> > > > > > > > If your assumption (> Right.. it's the region "allocated"
> > > > > > > > to bluefs that is consuming the time) is correct, then I
> > > > > > > > don't understand why this parameter has any effect on the
> problem.
> > > > > > > >
> > > > > > > > Aren't we reading BlueFS extents and setting them in the
> > > > > > > > BitMapAllocator? That doesn't care about the chunking of
> > > > > > > > bitmap bits into KV keys.
> > > > > > >
> > > > > > > I think this is something different.  During mkfs we take
> > > > > > > ~2% (or somethign like that) of the block device, mark it
> > > > > > > 'allocated' (from the bluestore freelist's
> > > > > > > perspective) and give it to bluefs.  On a large device
> > > > > > > that's a lot of bits to
> > > > > set.
> > > > > > > Larger keys should speed that up.
> > > > > >
> > > > > > But the bits in the BitMap shouldn't be chunked up in the same
> > > > > > units as the Keys. Right? Sharding of the bitmap is done for
> > > > > > internal parallelism
> > > > > > -- only, it has nothing to do with the persistent representation.
> > > > >
> > > > > I'm not really sure what the BitmapAllocator is doing, but yeah,
> > > > > it's independent.  The tunable I'm talking about though is the
> > > > > one that controls how many bits BitmapFreelist puts in each
> key/value pair.
> > > >
> > > > I understand, but that should be relevant only to operations that
> > > > actually either read or write to the KV Store. That's not the case
> > > > here, allocations by BlueFS are not recorded in the KVStore.
> > > >
> > > > Whatever chunking/sharding of the bitmapfreelist is present should
> > > > be independent (well an integer multiple thereof....) of the
> > > > number of bits that are chunked up into a single KV Key/Value
> > > > pair. Hence when doing the initialization here (i.e., the marking
> > > > of BlueFS allocated space in the freelist) that shouldn't involve ANY
> KVStore operations.
> > > > I think it's worthwhile to modify the option (say make it 16 or
> > > > 64x
> > > > larger) and see if that actually affects the initialization time
> > > > -- if it does, then there's something structurally inefficient in
> > > > the code that's hopefully easy to fix.
> > >
> > > This is the allocation of space *to* bluefs, not *by* bluefs.  At
> > > mkfs time, we
> > > (BlueStore::mkfs() -> _open_fm()) will take 2% of the block device
> > > and mark it in-use with that fm->allocate() call below, and that
> > > flips a bunch of bits in the kv store.
> > >
> > > > > > BlueFS allocations aren't stored in the KV database (to avoid
> > > > > > circularity).
> > > > > >
> > > > > > So I don't see why a bitset of 2m bits should be taking so long.....
> > > > > > Makes me thing that we don't really understand the problem.
> > > > >
> > > > > Could be, I'm just guessing.  During mkfs, _open_fm() does
> > > > >
> > > > >     fm->create(bdev->get_size(), t);
> > > > >
> > > > > and then
> > > > >
> > > > >     fm->allocate(0, reserved, t);
> > >
> > >         ^ here.
> > >
> > > > >
> > > > > where the value of reserved depends on how much we give to
> bluefs.
> > > > > I'm assuming this is the mkfs allocation that is taking time,
> > > > > but I haven't looked at the allocator code at all or whether
> > > > > insert_free is part
> > > of this path...
> > > >
> > > > Somnath's data clearly points to this....
> > >
> > > sage
> > >
> > > >
> > > > >
> > > > > sage
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > The amount of space we start with comes from _open_db():
> > > > > > >
> > > > > > >       uint64_t initial =
> > > > > > > 	bdev->get_size() * (g_conf->bluestore_bluefs_min_ratio +
> > > > > > > 			    g_conf->bluestore_bluefs_gift_ratio);
> > > > > > >       initial = MAX(initial, g_conf->bluestore_bluefs_min);
> > > > > > >
> > > > > > > Simply lowering min_ratio might also be fine.  The current
> > > > > > > value of 2% is meant to be enough for most stores, and to
> > > > > > > avoid giving over lots of little extents later (and making
> > > > > > > the bluefs_extents list too big).  That can overflow the
> > > > > > > superblock, another annoying thing we need to fix (though not a
> big deal to fix).
> > > > > > >
> > > > > > > Anyway, adjust bluestore_bluefs_min_ratio to .01 should
> > > > > > > ~halve the time spent on this.. that is probably another
> > > > > > > useful test to confirm this is what is going on.
> > > > > >
> > > > > > Yes, this should help -- but still seems like a bandaid.
> > > > > >
> > > > > > >
> > > > > > > sage
> > > > > > >
> > > > > > > > I would be cautious about just changing this option to
> > > > > > > > affect this problem (though as an experiment, we can
> > > > > > > > change the value and see if it has ANY affect on this
> > > > > > > > problem -- which I don't think it will). The value of this
> > > > > > > > option really needs to be dictated by its effect on the
> > > > > > > > more mainstream read/write operations not on the
> > > > > initialization problem.
> > > > > > > > >
> > > > > > > > > sage
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -Ramesh
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Allen Samuels
> > > > > > > > > > > Sent: Thursday, August 11, 2016 9:34 PM
> > > > > > > > > > > To: Ramesh Chander
> > > > > > > > > > > Cc: Sage Weil; Somnath Roy; ceph-devel
> > > > > > > > > > > Subject: Re: Bluestore different allocator
> > > > > > > > > > > performance Vs FileStore
> > > > > > > > > > >
> > > > > > > > > > > Is the initial creation of the keys for the bitmap
> > > > > > > > > > > one by one or are they batched?
> > > > > > > > > > >
> > > > > > > > > > > Sent from my iPhone. Please excuse all typos and
> > > autocorrects.
> > > > > > > > > > >
> > > > > > > > > > > > On Aug 10, 2016, at 11:07 PM, Ramesh Chander
> > > > > > > > > > > <Ramesh.Chander@sandisk.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Somnath,
> > > > > > > > > > > >
> > > > > > > > > > > > Basically mkfs time has increased from 7.5 seconds
> > > > > > > > > > > > (2min /
> > > > > > > > > > > > 16) to
> > > > > > > > > > > > 2 minutes
> > > > > > > > > > > ( 32 / 16).
> > > > > > > > > > > >
> > > > > > > > > > > > But is there a reason you should create osds in
> > > > > > > > > > > > serial? I think for mmultiple
> > > > > > > > > > > osds mkfs can happen in parallel?
> > > > > > > > > > > >
> > > > > > > > > > > > As a fix I am looking to batch multiple
> > > > > > > > > > > > insert_free calls for
> > > now.
> > > > > > > > > > > > If still that
> > > > > > > > > > > does not help, thinking of doing insert_free on
> > > > > > > > > > > different part of device in parallel.
> > > > > > > > > > > >
> > > > > > > > > > > > -Ramesh
> > > > > > > > > > > >
> > > > > > > > > > > >> -----Original Message-----
> > > > > > > > > > > >> From: Ramesh Chander
> > > > > > > > > > > >> Sent: Thursday, August 11, 2016 10:04 AM
> > > > > > > > > > > >> To: Allen Samuels; Sage Weil; Somnath Roy
> > > > > > > > > > > >> Cc: ceph-devel
> > > > > > > > > > > >> Subject: RE: Bluestore different allocator
> > > > > > > > > > > >> performance Vs FileStore
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think insert_free is limited by speed of
> > > > > > > > > > > >> function clear_bits
> > > > > here.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Though set_bits and clear_bits have same logic
> > > > > > > > > > > >> except one sets and another clears. Both of these
> > > > > > > > > > > >> does 64 bits (bitmap size) at
> > > > > > > a time.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I am not sure if doing memset will make it faster.
> > > > > > > > > > > >> But if we can do it for group of bitmaps, then it might
> help.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I am looking in to code if we can handle mkfs and
> > > > > > > > > > > >> osd mount in special way to make it faster.
> > > > > > > > > > > >>
> > > > > > > > > > > >> If I don't find an easy fix, we can go to path of
> > > > > > > > > > > >> deferring init to later stage as and when required.
> > > > > > > > > > > >>
> > > > > > > > > > > >> -Ramesh
> > > > > > > > > > > >>
> > > > > > > > > > > >>> -----Original Message-----
> > > > > > > > > > > >>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > > >>> [mailto:ceph-devel- owner@vger.kernel.org] On
> > > > > > > > > > > >>> Behalf Of Allen Samuels
> > > > > > > > > > > >>> Sent: Thursday, August 11, 2016 4:28 AM
> > > > > > > > > > > >>> To: Sage Weil; Somnath Roy
> > > > > > > > > > > >>> Cc: ceph-devel
> > > > > > > > > > > >>> Subject: RE: Bluestore different allocator
> > > > > > > > > > > >>> performance Vs FileStore
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> We always knew that startup time for bitmap
> > > > > > > > > > > >>> stuff would be somewhat longer. Still, the
> > > > > > > > > > > >>> existing implementation can be speeded up
> > > > > > > > > > > >>> significantly. The code in
> > > > > > > > > > > >>> BitMapZone::set_blocks_used isn't very
> > > optimized.
> > > > > > > > > > > >>> Converting it to use memset for all but the
> > > > > > > > > > > >>> first/last bytes
> > > > > > > > > > > >> should significantly speed it up.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Allen Samuels
> > > > > > > > > > > >>> SanDisk |a Western Digital brand
> > > > > > > > > > > >>> 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > > > > >>> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > > > >>> allen.samuels@SanDisk.com
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>> -----Original Message-----
> > > > > > > > > > > >>>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > > >>>> [mailto:ceph-devel- owner@vger.kernel.org] On
> > > > > > > > > > > >>>> Behalf Of Sage Weil
> > > > > > > > > > > >>>> Sent: Wednesday, August 10, 2016 3:44 PM
> > > > > > > > > > > >>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > > > > >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > > > > > > > >>>> Subject: RE: Bluestore different allocator
> > > > > > > > > > > >>>> performance Vs FileStore
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > > >>>>> << inline with [Somnath]
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> -----Original Message-----
> > > > > > > > > > > >>>>> From: Sage Weil [mailto:sage@newdream.net]
> > > > > > > > > > > >>>>> Sent: Wednesday, August 10, 2016 2:31 PM
> > > > > > > > > > > >>>>> To: Somnath Roy
> > > > > > > > > > > >>>>> Cc: ceph-devel
> > > > > > > > > > > >>>>> Subject: Re: Bluestore different allocator
> > > > > > > > > > > >>>>> performance Vs FileStore
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>>> On Wed, 10 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > > >>>>>> Hi, I spent some time on evaluating different
> > > > > > > > > > > >>>>>> Bluestore allocator and freelist performance.
> > > > > > > > > > > >>>>>> Also, tried to gaze the performance
> > > > > > > > > > > >>>>>> difference of Bluestore and filestore on the
> > > > > > > > > > > >>>>>> similar
> > > > > > > > > > > >> setup.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Setup:
> > > > > > > > > > > >>>>>> --------
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 16 OSDs (8TB Flash) across 2 OSD nodes
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Single pool and single rbd image of 4TB. 2X
> replication.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Disabled the exclusive lock feature so that I
> > > > > > > > > > > >>>>>> can run multiple write  jobs in
> > > > > > > > > > > >>>> parallel.
> > > > > > > > > > > >>>>>> rbd_cache is disabled in the client side.
> > > > > > > > > > > >>>>>> Each test ran for 15 mins.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Result :
> > > > > > > > > > > >>>>>> ---------
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Here is the detailed report on this.
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://github.com/somnathr/ceph/blob/6e03a5a41fe2c9b213a610200b2e8a
> > > > > > > > > > > >>>>>> 25 0cb05986/Bluestore_allocator_comp.xlsx
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Each profile I named based on
> > > > > > > > > > > >>>>>> <allocator>-<freelist> , so in the graph for
> > > > > > > > > > > >>>> example "stupid-extent" meaning stupid
> > > > > > > > > > > >>>> allocator and extent
> > > > > > > > > freelist.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> I ran the test for each of the profile in the
> > > > > > > > > > > >>>>>> following order after creating a
> > > > > > > > > > > >>>> fresh rbd image for all the Bluestore test.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 1. 4K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 2. 16K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 3. 64K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 4. 256K RW for 15 min with 16QD and 10 jobs.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> The above are non-preconditioned case i.e ran
> > > > > > > > > > > >>>>>> before filling up the entire
> > > > > > > > > > > >>>> image. The reason is I don't see any reason of
> > > > > > > > > > > >>>> filling up the rbd image before like filestore
> > > > > > > > > > > >>>> case where it will give stable performance if
> > > > > > > > > > > >>>> we fill up the rbd
> > > images first.
> > > > > > > > > > > >>>> Filling up rbd images in case of filestore will
> > > > > > > > > > > >>>> create the files in
> > > > > > > the filesystem.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 5. Next, I did precondition the 4TB image
> > > > > > > > > > > >>>>>> with 1M seq
> > > > > write.
> > > > > > > > > > > >>>>>> This is
> > > > > > > > > > > >>>> primarily because I want to load BlueStore with
> > > > > > > > > > > >>>> more
> > > data.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 6. Ran 4K RW test again (this is called out
> > > > > > > > > > > >>>>>> preconditioned in the
> > > > > > > > > > > >>>>>> profile) for 15 min
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 7. Ran 4K Seq test for similar QD for 15 min
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 8. Ran 16K RW test again for 15min
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> For filestore test, I ran tests after
> > > > > > > > > > > >>>>>> preconditioning the entire image
> > > > > > > > > > > >> first.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Each sheet on the xls have different block
> > > > > > > > > > > >>>>>> size result , I often miss to navigate
> > > > > > > > > > > >>>>>> through the xls sheets , so, thought of
> > > > > > > > > > > >>>>>> mentioning here
> > > > > > > > > > > >>>>>> :-)
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> I have also captured the mkfs time , OSD
> > > > > > > > > > > >>>>>> startup time and the memory
> > > > > > > > > > > >>>> usage after the entire run.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Observation:
> > > > > > > > > > > >>>>>> ---------------
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 1. First of all, in case of bitmap allocator
> > > > > > > > > > > >>>>>> mkfs time (and thus cluster
> > > > > > > > > > > >>>> creation time for 16 OSDs) are ~16X slower than
> > > > > > > > > > > >>>> stupid allocator and
> > > > > > > > > > > >>> filestore.
> > > > > > > > > > > >>>> Each OSD creation is taking ~2min or so
> > > > > > > > > > > >>>> sometimes and I nailed down the
> > > > > > > > > > > >>>> insert_free() function call (marked ****) in
> > > > > > > > > > > >>>> the Bitmap allocator is causing that.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 2016-08-05 16:12:40.587148 7f4024d258c0 10
> > > > > > > > > > > >>>>>> freelist enumerate_next start
> > > > > > > > > > > >>>>>> 2016-08-05 16:12:40.975539 7f4024d258c0 10
> > > > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > > > > >>>>>> 2016-08-05 16:12:40.975555 7f4024d258c0 10
> > > > > > > > > > > >>>>>> bitmapalloc:init_add_free instance
> > > > > > > > > > > >>>>>> 139913322803328 offset
> > > > > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > > > > >>>>>> ****2016-08-05 16:12:40.975557 7f4024d258c0
> > > > > > > > > > > >>>>>> 20 bitmapalloc:insert_free instance
> > > > > > > > > > > >>>>>> 139913322803328 off
> > > > > > > > > > > >>>>>> 0x4663d00000 len 0x69959451000****
> > > > > > > > > > > >>>>>> ****2016-08-05 16:13:20.748934 7f4024d258c0
> > > > > > > > > > > >>>>>> 10 freelist enumerate_next
> > > > > > > > > > > >>>>>> end****
> > > > > > > > > > > >>>>>> 2016-08-05 16:13:20.748978 7f4024d258c0 10
> > > > > > > > > > > >>>>>> bluestore(/var/lib/ceph/osd/ceph-0)
> > > > > > > > > > > >>>>>> _open_alloc loaded
> > > > > > > > > > > >>>>>> 6757 G in
> > > > > > > > > > > >>>>>> 1 extents
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438511 7f4024d258c0 20
> > > > > > > > > > > >>>>>> bluefs _read_random read buffered
> > > > > > > > > > > >>>>>> 0x4a14eb~265 of
> > > > > > > > > ^A:5242880+5242880
> > > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438587 7f4024d258c0 20
> > > > > > > > > > > >>>>>> bluefs _read_random got
> > > > > > > > > > > >>>>>> 613
> > > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438658 7f4024d258c0 10
> > > > > > > > > > > >>>>>> freelist enumerate_next
> > > > > > > > > > > >>>>>> 0x4663d00000~69959451000
> > > > > > > > > > > >>>>>> 2016-08-05 16:13:23.438664 7f4024d258c0 10
> > > > > > > > > > > >>>>>> bitmapalloc:init_add_free instance
> > > > > > > > > > > >>>>>> 139913306273920 offset
> > > > > > > > > > > >>>>>> 0x4663d00000 length 0x69959451000
> > > > > > > > > > > >>>>>> *****2016-08-05 16:13:23.438666 7f4024d258c0
> > > > > > > > > > > >>>>>> 20 bitmapalloc:insert_free instance
> > > > > > > > > > > >>>>>> 139913306273920 off
> > > > > > > > > > > >>>>>> 0x4663d00000 len
> > > > > > > > > > > >>>>>> 0x69959451000*****
> > > > > > > > > > > >>>>>> *****2016-08-05 16:14:03.132914 7f4024d258c0
> > > > > > > > > > > >>>>>> 10 freelist enumerate_next end
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> I'm not sure there's any easy fix for this. We
> > > > > > > > > > > >>>>> can amortize it by feeding
> > > > > > > > > > > >>>> space to bluefs slowly (so that we don't have
> > > > > > > > > > > >>>> to do all the inserts at once), but I'm not
> > > > > > > > > > > >>>> sure that's really
> > > better.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> [Somnath] I don't know that part of the code,
> > > > > > > > > > > >>>>> so, may be a dumb
> > > > > > > > > > > >>> question.
> > > > > > > > > > > >>>> This is during mkfs() time , so, can't we say
> > > > > > > > > > > >>>> to bluefs entire space is free ? I can
> > > > > > > > > > > >>>> understand for osd mount and all other cases we
> > > > > > > > > > > >>>> need to feed the free space every
> > > > > time.
> > > > > > > > > > > >>>>> IMO this is critical to fix as cluster
> > > > > > > > > > > >>>>> creation time will be number of OSDs * 2
> > > > > > > > > > > >>>> min otherwise. For me creating 16 OSDs cluster
> > > > > > > > > > > >>>> is taking ~32min compare to
> > > > > > > > > > > >>>> ~2 min for stupid allocator/filestore.
> > > > > > > > > > > >>>>> BTW, my drive data partition is ~6.9TB , db
> > > > > > > > > > > >>>>> partition is ~100G and WAL is
> > > > > > > > > > > >>>> ~1G. I guess the time taking is dependent on
> > > > > > > > > > > >>>> data partition size as well
> > > > > > > > > > > (?
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Well, we're fundamentally limited by the fact
> > > > > > > > > > > >>>> that it's a bitmap, and a big chunk of space is
> > > > > > > > > > > >>>> "allocated" to bluefs and needs to have 1's
> > > > > > > > > > > set.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 20:28                                 ` Allen Samuels
@ 2016-08-11 21:19                                   ` Sage Weil
  2016-08-12  3:10                                     ` Somnath Roy
                                                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Sage Weil @ 2016-08-11 21:19 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Ramesh Chander, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 1:24 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> > 
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > clarify as needed.
> > >
> > > I thought that the authoritative indication of space used by BlueFS
> > > was contained in the snapshot/journal of BlueFS itself, NOT in the KV
> > > store itself. This requires that upon startup, we replay the BlueFS
> > > snapshot/journal into the FreeListManager so that it properly records
> > > the consumption of BlueFS space (since that allocation MAY NOT be
> > > accurate within the FreeListmanager itself). But that this playback
> > > need not generate an KVStore operations (since those are duplicates of
> > > the BlueFS).
> > >
> > > So in the code you cite:
> > >
> > > fm->allocate(0, reserved, t);
> > >
> > > There's no need to commit 't', and in fact, in the general case, you
> > > don't want to commit 't'.
> > >
> > > That suggests to me that a version of allocate that doesn't have a
> > > transaction could be easily created would have the speed we're looking
> > > for (and independence from the BitMapAllocator to KVStore chunking).
> > 
> > Oh, I see.  Yeah, you're right--this step isn't really necessary, as long as we
> > ensure that the auxilliary representation of what bluefs owns
> > (bluefs_extents in the superblock) is still passed into the Allocator during
> > initialization.  Having the freelist reflect the allocator that this space was "in
> > use" (by bluefs) and thus off limits to bluestore is simple but not strictly
> > necessary.
> > 
> > I'll work on a PR that avoids this...

https://github.com/ceph/ceph/pull/10698

Ramesh, can you give it a try?

> > > I suspect that we also have long startup times because we're doing the
> > > same underlying bitmap operations except they come from the BlueFS
> > > replay code instead of the BlueFS initialization code, but same
> > > problem with likely the same fix.
> > 
> > BlueFS doesn't touch the FreelistManager (or explicitly persist the freelist at
> > all)... we initialize the in-memory Allocator state from the metadata in the
> > bluefs log.  I think we should be fine on this end.
> 
> Likely that code suffers from the same problem -- a false need to update 
> the KV Store (During the playback, BlueFS extents are converted to 
> bitmap runs, it's essentially the same lower level code as the case 
> we're seeing now, but it instead of being driven by an artificial "big 
> run", it'sll be driven from the BlueFS Journal replay code). But that's 
> just a guess, I don't have time to track down the actual code right now.

BlueFS can't touch the freelist (or kv store, ever) since it ultimately 
backs the kv store and that would be problematic.  We do initialize the 
bluefs Allocator's in-memory state, but that's it.

The PR above changes the BlueStore::_init_alloc() so that BlueStore's 
Allocator state is initialized with both the freelist state (from the kv 
store) *and* the bluefs_extents list (from the bluestore superblock).  (From 
this Allocator's perspective, all of bluefs's space is allocated and can't be 
used.  BlueFS has its own separate instance to do its internal 
allocations.)
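
As a rough illustration of that ordering -- populate the allocator's 
in-memory free space from the persisted freelist, then carve out whatever 
bluefs owns -- here is a small self-contained C++ toy. The types and method 
names are stand-ins chosen to mirror the idea, not the actual 
BlueStore/Allocator API:

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <utility>
  #include <vector>

  // Toy in-memory allocator: free space tracked as offset -> length.
  struct ToyAllocator {
    std::map<uint64_t, uint64_t> free_extents;

    void init_add_free(uint64_t off, uint64_t len) {   // space the freelist says is free
      free_extents[off] = len;
    }
    void init_rm_free(uint64_t off, uint64_t len) {    // space bluefs owns; keep off limits
      auto it = free_extents.find(off);
      if (it != free_extents.end() && it->second == len)  // simplistic exact-match removal
        free_extents.erase(it);
    }
  };

  int main() {
    // 1. extents the persisted freelist (kv store) reports as free
    std::vector<std::pair<uint64_t, uint64_t>> freelist = {
      {0x2000, 0x100000}, {0x200000, 0x100000}};
    // 2. extents the bluestore superblock says belong to bluefs
    std::vector<std::pair<uint64_t, uint64_t>> bluefs_extents = {{0x200000, 0x100000}};

    ToyAllocator alloc;
    for (const auto& e : freelist)       alloc.init_add_free(e.first, e.second);
    for (const auto& e : bluefs_extents) alloc.init_rm_free(e.first, e.second);

    for (const auto& e : alloc.free_extents)
      std::cout << "usable by bluestore: 0x" << std::hex << e.first
                << "~0x" << e.second << std::dec << "\n";
    return 0;
  }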

sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 21:19                                   ` Sage Weil
@ 2016-08-12  3:10                                     ` Somnath Roy
  2016-08-12  3:44                                       ` Allen Samuels
  2016-08-12  6:19                                     ` Somnath Roy
  2016-08-12 15:26                                     ` Sage Weil
  2 siblings, 1 reply; 34+ messages in thread
From: Somnath Roy @ 2016-08-12  3:10 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: Ramesh Chander, ceph-devel

Sage,
I tried your PR, but it is not helping much. As you can see, each insert_free() call is taking ~40 sec to complete, and we have two calls that are taking that long.

2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free instance 140128595341440 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear bits in 0x6ab7d100000

2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free instance 140127837929472 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear bits in 0x6ab7d100000

I have also tried the following settings, and they are not helping either:

        bluestore_bluefs_min_ratio = .01
        bluestore_freelist_blocks_per_key = 512


I did some debugging to find out which call inside this function is taking the time, and I found this within BitAllocator::free_blocks:

  debug_assert(is_allocated(start_block, num_blocks));

  free_blocks_int(start_block, num_blocks);

I skipped this debug_assert and the total time went from ~80 sec down to ~49 sec, so that's a significant improvement.

Next, I found that debug_assert(is_allocated()) is called from free_blocks_int as well. I blindly commented out all of the debug_assert(is_allocated()) calls and performance became similar to stupid/filestore.
I didn't look into is_allocated() any further; my guess is that we can safely skip it at mkfs() time?
But it would be good to optimize it, as it may add latency in the IO path (?).
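
To see why that assert costs so much here: is_allocated() has to walk the 
same bit range that free_blocks_int() is about to walk, so the one huge free 
done at init time scans the run more than once. A self-contained C++ toy of 
that pattern (made-up class, not the actual BitAllocator):

  #include <cassert>
  #include <cstddef>
  #include <vector>

  // Toy bitmap allocator: true = allocated, false = free.
  struct ToyBitmap {
    std::vector<bool> bits;
    explicit ToyBitmap(std::size_t n) : bits(n, true) {}

    // Full O(num) walk of the range.
    bool is_allocated(std::size_t start, std::size_t num) const {
      for (std::size_t i = start; i < start + num; ++i)
        if (!bits[i]) return false;
      return true;
    }

    // The walk that actually clears the bits.
    void free_blocks_int(std::size_t start, std::size_t num) {
      for (std::size_t i = start; i < start + num; ++i)
        bits[i] = false;
    }

    // The validation walk runs before (and, per the finding above, again
    // inside) the walk that does the work, so one huge init-time free
    // pays for the range several times over.
    void free_blocks(std::size_t start, std::size_t num) {
      assert(is_allocated(start, num));
      free_blocks_int(start, num);
    }
  };

  int main() {
    ToyBitmap bm(1 << 24);              // ~16M blocks
    bm.free_blocks(0, bm.bits.size());  // "free the whole device" at init
    return 0;
  }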

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: Thursday, August 11, 2016 2:20 PM
To: Allen Samuels
Cc: Ramesh Chander; Somnath Roy; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 1:24 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > clarify as needed.
> > >
> > > I thought that the authoritative indication of space used by
> > > BlueFS was contained in the snapshot/journal of BlueFS itself, NOT
> > > in the KV store itself. This requires that upon startup, we replay
> > > the BlueFS snapshot/journal into the FreeListManager so that it
> > > properly records the consumption of BlueFS space (since that
> > > allocation MAY NOT be accurate within the FreeListmanager itself).
> > > But that this playback need not generate an KVStore operations
> > > (since those are duplicates of the BlueFS).
> > >
> > > So in the code you cite:
> > >
> > > fm->allocate(0, reserved, t);
> > >
> > > There's no need to commit 't', and in fact, in the general case,
> > > you don't want to commit 't'.
> > >
> > > That suggests to me that a version of allocate that doesn't have a
> > > transaction could be easily created would have the speed we're
> > > looking for (and independence from the BitMapAllocator to KVStore chunking).
> >
> > Oh, I see.  Yeah, you're right--this step isn't really necessary, as
> > long as we ensure that the auxilliary representation of what bluefs
> > owns (bluefs_extents in the superblock) is still passed into the
> > Allocator during initialization.  Having the freelist reflect the
> > allocator that this space was "in use" (by bluefs) and thus off
> > limits to bluestore is simple but not strictly necessary.
> >
> > I'll work on a PR that avoids this...

https://github.com/ceph/ceph/pull/10698

Ramesh, can you give it a try?

> > > I suspect that we also have long startup times because we're doing
> > > the same underlying bitmap operations except they come from the
> > > BlueFS replay code instead of the BlueFS initialization code, but
> > > same problem with likely the same fix.
> >
> > BlueFS doesn't touch the FreelistManager (or explicitly persist the
> > freelist at all)... we initialize the in-memory Allocator state from
> > the metadata in the bluefs log.  I think we should be fine on this end.
>
> Likely that code suffers from the same problem -- a false need to
> update the KV Store (During the playback, BlueFS extents are converted
> to bitmap runs, it's essentially the same lower level code as the case
> we're seeing now, but it instead of being driven by an artificial "big
> run", it'sll be driven from the BlueFS Journal replay code). But
> that's just a guess, I don't have time to track down the actual code right now.

BlueFS can't touch the freelist (or kv store, ever) since it ultimately backs the kv store and that would be problematic.  We do initialize the bluefs Allocator's in-memory state, but that's it.

The PR above changes the BlueStore::_init_alloc() so that BlueStore's Allocator state is initialize with both the freelist state (from kv store)
*and* the bluefs_extents list (from the bluestore superblock).  (From this Allocator's perspective, all of bluefs's space is allocated and can't be used.  BlueFS has it's own separate instance to do it's internal
allocations.)

sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-12  3:10                                     ` Somnath Roy
@ 2016-08-12  3:44                                       ` Allen Samuels
  2016-08-12  5:27                                         ` Ramesh Chander
                                                           ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Allen Samuels @ 2016-08-12  3:44 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil; +Cc: Ramesh Chander, ceph-devel

Is there a simple way to detect whether you're in initialization or not? If so, you could augment the debug_asserts to skip the is_allocated check during initialization but re-enable it during normal operation.

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 11, 2016 8:10 PM
> To: Sage Weil <sage@newdream.net>; Allen Samuels
> <Allen.Samuels@sandisk.com>
> Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> Sage,
> I tried your PR but it is not helping much. See this each insert_free() call is
> taking ~40sec to complete and we have 2 calls that is taking time..
> 
> 2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free
> instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
> 2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free instance
> 140128595341440 off 0x2000 len 0x6ab7d14f000
> 2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear bits in
> 0x6ab7d100000
> 
> 2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free
> instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
> 2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free instance
> 140127837929472 off 0x2000 len 0x6ab7d14f000
> 2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear bits in
> 0x6ab7d100000
> 
> I have also tried with the following and it is not helping either..
> 
>        bluestore_bluefs_min_ratio = .01
>         bluestore_freelist_blocks_per_key = 512
> 
> 
> I did some debugging on this to find out which call inside this function is
> taking time and I found this within BitAllocator::free_blocks
> 
>   debug_assert(is_allocated(start_block, num_blocks));
> 
>   free_blocks_int(start_block, num_blocks);
> 
> I did skip this debug_assert and total time reduced from ~80sec ~49sec , so,
> that's a significant improvement.
> 
> Next, I found out that debug_assert(is_allocated()) is called from
> free_blocks_int as well. I commented out blindly all
> debug_assert(is_allocated()) and performance became similar to
> stupid/filestore.
> I didn't bother to look into is_allocated() anymore as my guess is we can
> safely ignore this during mkfs() time ?
> But, it will be good if we can optimize this as it may induce latency in the IO
> path (?).
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, August 11, 2016 2:20 PM
> To: Allen Samuels
> Cc: Ramesh Chander; Somnath Roy; ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
> 
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by
> > > > BlueFS was contained in the snapshot/journal of BlueFS itself, NOT
> > > > in the KV store itself. This requires that upon startup, we replay
> > > > the BlueFS snapshot/journal into the FreeListManager so that it
> > > > properly records the consumption of BlueFS space (since that
> > > > allocation MAY NOT be accurate within the FreeListmanager itself).
> > > > But that this playback need not generate an KVStore operations
> > > > (since those are duplicates of the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case,
> > > > you don't want to commit 't'.
> > > >
> > > > That suggests to me that a version of allocate that doesn't have a
> > > > transaction could be easily created would have the speed we're
> > > > looking for (and independence from the BitMapAllocator to KVStore
> chunking).
> > >
> > > Oh, I see.  Yeah, you're right--this step isn't really necessary, as
> > > long as we ensure that the auxilliary representation of what bluefs
> > > owns (bluefs_extents in the superblock) is still passed into the
> > > Allocator during initialization.  Having the freelist reflect the
> > > allocator that this space was "in use" (by bluefs) and thus off
> > > limits to bluestore is simple but not strictly necessary.
> > >
> > > I'll work on a PR that avoids this...
> 
> https://github.com/ceph/ceph/pull/10698
> 
> Ramesh, can you give it a try?
> 
> > > > I suspect that we also have long startup times because we're doing
> > > > the same underlying bitmap operations except they come from the
> > > > BlueFS replay code instead of the BlueFS initialization code, but
> > > > same problem with likely the same fix.
> > >
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist the
> > > freelist at all)... we initialize the in-memory Allocator state from
> > > the metadata in the bluefs log.  I think we should be fine on this end.
> >
> > Likely that code suffers from the same problem -- a false need to
> > update the KV Store (During the playback, BlueFS extents are converted
> > to bitmap runs, it's essentially the same lower level code as the case
> > we're seeing now, but it instead of being driven by an artificial "big
> > run", it'sll be driven from the BlueFS Journal replay code). But
> > that's just a guess, I don't have time to track down the actual code right
> now.
> 
> BlueFS can't touch the freelist (or kv store, ever) since it ultimately backs the
> kv store and that would be problematic.  We do initialize the bluefs
> Allocator's in-memory state, but that's it.
> 
> The PR above changes the BlueStore::_init_alloc() so that BlueStore's
> Allocator state is initialize with both the freelist state (from kv store)
> *and* the bluefs_extents list (from the bluestore superblock).  (From this
> Allocator's perspective, all of bluefs's space is allocated and can't be used.
> BlueFS has it's own separate instance to do it's internal
> allocations.)
> 
> sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-12  3:44                                       ` Allen Samuels
@ 2016-08-12  5:27                                         ` Ramesh Chander
  2016-08-12  5:52                                         ` Ramesh Chander
  2016-08-12  5:59                                         ` Somnath Roy
  2 siblings, 0 replies; 34+ messages in thread
From: Ramesh Chander @ 2016-08-12  5:27 UTC (permalink / raw)
  To: Allen Samuels, Somnath Roy, Sage Weil; +Cc: ceph-devel

Yes, that is a good point. I will try skipping the is_allocated check and see if it improves.

I can confirm Somnath's numbers: 2G of bitmap takes around 40 sec to init, and during mkfs it is done twice, once for mkfs and then for mount.

That makes a total of ~80 secs (1 min 20 secs) out of the 120 secs Somnath is seeing.
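
(As a rough sanity check: assuming the ~6.7 TB device from the earlier logs and a 4 KB allocation granularity, that is about 1.8 billion blocks, i.e. roughly the 2G bits mentioned above, and 40 sec per pass works out to only ~45 million bit updates per second -- consistent with this being pure CPU time in the bitmap code rather than kv I/O.)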

-Ramesh

> -----Original Message-----
> From: Allen Samuels
> Sent: Friday, August 12, 2016 9:15 AM
> To: Somnath Roy; Sage Weil
> Cc: Ramesh Chander; ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> Is there a simple way to detect whether you're in initialization/not? If so, you
> could augment the debug_asserts to skip the is_allocated during initialization
> but re-enable them during normal operation.
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Thursday, August 11, 2016 8:10 PM
> > To: Sage Weil <sage@newdream.net>; Allen Samuels
> > <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > Sage,
> > I tried your PR but it is not helping much. See this each
> > insert_free() call is taking ~40sec to complete and we have 2 calls that is
> taking time..
> >
> > 2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free
> > instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
> > 2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free
> > instance
> > 140128595341440 off 0x2000 len 0x6ab7d14f000
> > 2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear
> > bits in
> > 0x6ab7d100000
> >
> > 2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free
> > instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
> > 2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free
> > instance
> > 140127837929472 off 0x2000 len 0x6ab7d14f000
> > 2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear
> > bits in
> > 0x6ab7d100000
> >
> > I have also tried with the following and it is not helping either..
> >
> >        bluestore_bluefs_min_ratio = .01
> >         bluestore_freelist_blocks_per_key = 512
> >
> >
> > I did some debugging on this to find out which call inside this
> > function is taking time and I found this within
> > BitAllocator::free_blocks
> >
> >   debug_assert(is_allocated(start_block, num_blocks));
> >
> >   free_blocks_int(start_block, num_blocks);
> >
> > I did skip this debug_assert and total time reduced from ~80sec ~49sec
> > , so, that's a significant improvement.
> >
> > Next, I found out that debug_assert(is_allocated()) is called from
> > free_blocks_int as well. I commented out blindly all
> > debug_assert(is_allocated()) and performance became similar to
> > stupid/filestore.
> > I didn't bother to look into is_allocated() anymore as my guess is we
> > can safely ignore this during mkfs() time ?
> > But, it will be good if we can optimize this as it may induce latency
> > in the IO path (?).
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 2:20 PM
> > To: Allen Samuels
> > Cc: Ramesh Chander; Somnath Roy; ceph-devel
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > Sent: Thursday, August 11, 2016 1:24 PM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > > > Subject: RE: Bluestore different allocator performance Vs
> > > > FileStore
> > > >
> > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > > clarify as needed.
> > > > >
> > > > > I thought that the authoritative indication of space used by
> > > > > BlueFS was contained in the snapshot/journal of BlueFS itself,
> > > > > NOT in the KV store itself. This requires that upon startup, we
> > > > > replay the BlueFS snapshot/journal into the FreeListManager so
> > > > > that it properly records the consumption of BlueFS space (since
> > > > > that allocation MAY NOT be accurate within the FreeListmanager
> itself).
> > > > > But that this playback need not generate an KVStore operations
> > > > > (since those are duplicates of the BlueFS).
> > > > >
> > > > > So in the code you cite:
> > > > >
> > > > > fm->allocate(0, reserved, t);
> > > > >
> > > > > There's no need to commit 't', and in fact, in the general case,
> > > > > you don't want to commit 't'.
> > > > >
> > > > > That suggests to me that a version of allocate that doesn't have
> > > > > a transaction could be easily created would have the speed we're
> > > > > looking for (and independence from the BitMapAllocator to
> > > > > KVStore
> > chunking).
> > > >
> > > > Oh, I see.  Yeah, you're right--this step isn't really necessary,
> > > > as long as we ensure that the auxilliary representation of what
> > > > bluefs owns (bluefs_extents in the superblock) is still passed
> > > > into the Allocator during initialization.  Having the freelist
> > > > reflect the allocator that this space was "in use" (by bluefs) and
> > > > thus off limits to bluestore is simple but not strictly necessary.
> > > >
> > > > I'll work on a PR that avoids this...
> >
> > https://github.com/ceph/ceph/pull/10698
> >
> > Ramesh, can you give it a try?
> >
> > > > > I suspect that we also have long startup times because we're
> > > > > doing the same underlying bitmap operations except they come
> > > > > from the BlueFS replay code instead of the BlueFS initialization
> > > > > code, but same problem with likely the same fix.
> > > >
> > > > BlueFS doesn't touch the FreelistManager (or explicitly persist
> > > > the freelist at all)... we initialize the in-memory Allocator
> > > > state from the metadata in the bluefs log.  I think we should be fine on
> this end.
> > >
> > > Likely that code suffers from the same problem -- a false need to
> > > update the KV Store (During the playback, BlueFS extents are
> > > converted to bitmap runs, it's essentially the same lower level code
> > > as the case we're seeing now, but it instead of being driven by an
> > > artificial "big run", it'sll be driven from the BlueFS Journal
> > > replay code). But that's just a guess, I don't have time to track
> > > down the actual code right
> > now.
> >
> > BlueFS can't touch the freelist (or kv store, ever) since it
> > ultimately backs the kv store and that would be problematic.  We do
> > initialize the bluefs Allocator's in-memory state, but that's it.
> >
> > The PR above changes the BlueStore::_init_alloc() so that BlueStore's
> > Allocator state is initialize with both the freelist state (from kv
> > store)
> > *and* the bluefs_extents list (from the bluestore superblock).  (From
> > this Allocator's perspective, all of bluefs's space is allocated and can't be
> used.
> > BlueFS has it's own separate instance to do it's internal
> > allocations.)
> >
> > sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-12  3:44                                       ` Allen Samuels
  2016-08-12  5:27                                         ` Ramesh Chander
@ 2016-08-12  5:52                                         ` Ramesh Chander
  2016-08-12  5:59                                         ` Somnath Roy
  2 siblings, 0 replies; 34+ messages in thread
From: Ramesh Chander @ 2016-08-12  5:52 UTC (permalink / raw)
  To: Allen Samuels, Somnath Roy, Sage Weil; +Cc: ceph-devel

Good catch, Allen :)

Removing the is_allocated call reduces the time to < 10 secs from around 40 secs.

We may not be able to live with just removing it, but we can definitely think of avoiding it or optimizing it.

One obvious option is to check the bits in batches, as we already do for set and clear; see the sketch below. This was not done earlier since we always treated it as debug code.

I am already making the code change.
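
A self-contained C++ sketch of that kind of batched check -- compare whole 
64-bit words against all-ones and only fall back to per-bit tests at the 
ragged ends. The layout and names are illustrative, not the actual 
BitAllocator structures:

  #include <cstdint>
  #include <vector>

  // Toy bitmap stored as 64-bit words; a set bit means "allocated".
  struct WordBitmap {
    std::vector<uint64_t> words;

    bool test_bit(uint64_t i) const {
      return (words[i >> 6] >> (i & 63)) & 1;
    }

    // Naive check: one bit per iteration.
    bool is_allocated_slow(uint64_t start, uint64_t num) const {
      for (uint64_t i = start; i < start + num; ++i)
        if (!test_bit(i)) return false;
      return true;
    }

    // Batched check: only the ragged ends are tested bit by bit; everything
    // in between is compared a full 64-bit word at a time.
    bool is_allocated_batched(uint64_t start, uint64_t num) const {
      uint64_t end = start + num;
      while (start < end && (start & 63) != 0) {   // leading partial word
        if (!test_bit(start)) return false;
        ++start;
      }
      while (start + 64 <= end) {                  // full words
        if (words[start >> 6] != ~uint64_t(0)) return false;
        start += 64;
      }
      while (start < end) {                        // trailing partial word
        if (!test_bit(start)) return false;
        ++start;
      }
      return true;
    }
  };

  int main() {
    WordBitmap bm;
    bm.words.assign(1 << 15, ~uint64_t(0));        // 2M bits, all "allocated"
    return bm.is_allocated_batched(5, (1u << 21) - 10) ? 0 : 1;
  }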

-Ramesh

> -----Original Message-----
> From: Ramesh Chander
> Sent: Friday, August 12, 2016 10:57 AM
> To: Allen Samuels; Somnath Roy; Sage Weil
> Cc: ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> Yes, that is good point, I will try skip the is_allocated and see if it improves.
>
> I confirm Somnath's number , 2G of bitmap takes around 40sec to init and in
> mkfs it is done two times, once for mkfs then for mount.
>
> That makes total of ~80secs ( 1 min 20 secs) out of 120 secs Somanth is
> seeing.
>
> -Ramesh
>
> > -----Original Message-----
> > From: Allen Samuels
> > Sent: Friday, August 12, 2016 9:15 AM
> > To: Somnath Roy; Sage Weil
> > Cc: Ramesh Chander; ceph-devel
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > Is there a simple way to detect whether you're in initialization/not?
> > If so, you could augment the debug_asserts to skip the is_allocated
> > during initialization but re-enable them during normal operation.
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Thursday, August 11, 2016 8:10 PM
> > > To: Sage Weil <sage@newdream.net>; Allen Samuels
> > > <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; ceph-devel
> <ceph-
> > > devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > Sage,
> > > I tried your PR but it is not helping much. See this each
> > > insert_free() call is taking ~40sec to complete and we have 2 calls
> > > that is
> > taking time..
> > >
> > > 2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free
> > > instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
> > > 2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free
> > > instance
> > > 140128595341440 off 0x2000 len 0x6ab7d14f000
> > > 2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear
> > > bits in
> > > 0x6ab7d100000
> > >
> > > 2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free
> > > instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
> > > 2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free
> > > instance
> > > 140127837929472 off 0x2000 len 0x6ab7d14f000
> > > 2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear
> > > bits in
> > > 0x6ab7d100000
> > >
> > > I have also tried with the following and it is not helping either..
> > >
> > >        bluestore_bluefs_min_ratio = .01
> > >         bluestore_freelist_blocks_per_key = 512
> > >
> > >
> > > I did some debugging on this to find out which call inside this
> > > function is taking time and I found this within
> > > BitAllocator::free_blocks
> > >
> > >   debug_assert(is_allocated(start_block, num_blocks));
> > >
> > >   free_blocks_int(start_block, num_blocks);
> > >
> > > I did skip this debug_assert and total time reduced from ~80sec
> > > ~49sec , so, that's a significant improvement.
> > >
> > > Next, I found out that debug_assert(is_allocated()) is called from
> > > free_blocks_int as well. I commented out blindly all
> > > debug_assert(is_allocated()) and performance became similar to
> > > stupid/filestore.
> > > I didn't bother to look into is_allocated() anymore as my guess is
> > > we can safely ignore this during mkfs() time ?
> > > But, it will be good if we can optimize this as it may induce
> > > latency in the IO path (?).
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 2:20 PM
> > > To: Allen Samuels
> > > Cc: Ramesh Chander; Somnath Roy; ceph-devel
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sage@newdream.net]
> > > > > Sent: Thursday, August 11, 2016 1:24 PM
> > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> > > devel@vger.kernel.org>
> > > > > Subject: RE: Bluestore different allocator performance Vs
> > > > > FileStore
> > > > >
> > > > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > > > Perhaps my understanding of the blueFS is incorrect -- so
> > > > > > please clarify as needed.
> > > > > >
> > > > > > I thought that the authoritative indication of space used by
> > > > > > BlueFS was contained in the snapshot/journal of BlueFS itself,
> > > > > > NOT in the KV store itself. This requires that upon startup,
> > > > > > we replay the BlueFS snapshot/journal into the FreeListManager
> > > > > > so that it properly records the consumption of BlueFS space
> > > > > > (since that allocation MAY NOT be accurate within the
> > > > > > FreeListmanager
> > itself).
> > > > > > But that this playback need not generate an KVStore operations
> > > > > > (since those are duplicates of the BlueFS).
> > > > > >
> > > > > > So in the code you cite:
> > > > > >
> > > > > > fm->allocate(0, reserved, t);
> > > > > >
> > > > > > There's no need to commit 't', and in fact, in the general
> > > > > > case, you don't want to commit 't'.
> > > > > >
> > > > > > That suggests to me that a version of allocate that doesn't
> > > > > > have a transaction could be easily created would have the
> > > > > > speed we're looking for (and independence from the
> > > > > > BitMapAllocator to KVStore
> > > chunking).
> > > > >
> > > > > Oh, I see.  Yeah, you're right--this step isn't really
> > > > > necessary, as long as we ensure that the auxilliary
> > > > > representation of what bluefs owns (bluefs_extents in the
> > > > > superblock) is still passed into the Allocator during
> > > > > initialization.  Having the freelist reflect the allocator that
> > > > > this space was "in use" (by bluefs) and thus off limits to bluestore is
> simple but not strictly necessary.
> > > > >
> > > > > I'll work on a PR that avoids this...
> > >
> > > https://github.com/ceph/ceph/pull/10698
> > >
> > > Ramesh, can you give it a try?
> > >
> > > > > > I suspect that we also have long startup times because we're
> > > > > > doing the same underlying bitmap operations except they come
> > > > > > from the BlueFS replay code instead of the BlueFS
> > > > > > initialization code, but same problem with likely the same fix.
> > > > >
> > > > > BlueFS doesn't touch the FreelistManager (or explicitly persist
> > > > > the freelist at all)... we initialize the in-memory Allocator
> > > > > state from the metadata in the bluefs log.  I think we should be
> > > > > fine on
> > this end.
> > > >
> > > > Likely that code suffers from the same problem -- a false need to
> > > > update the KV Store (During the playback, BlueFS extents are
> > > > converted to bitmap runs, it's essentially the same lower level
> > > > code as the case we're seeing now, but it instead of being driven
> > > > by an artificial "big run", it'sll be driven from the BlueFS
> > > > Journal replay code). But that's just a guess, I don't have time
> > > > to track down the actual code right
> > > now.
> > >
> > > BlueFS can't touch the freelist (or kv store, ever) since it
> > > ultimately backs the kv store and that would be problematic.  We do
> > > initialize the bluefs Allocator's in-memory state, but that's it.
> > >
> > > The PR above changes the BlueStore::_init_alloc() so that
> > > BlueStore's Allocator state is initialize with both the freelist
> > > state (from kv
> > > store)
> > > *and* the bluefs_extents list (from the bluestore superblock).
> > > (From this Allocator's perspective, all of bluefs's space is
> > > allocated and can't be
> > used.
> > > BlueFS has it's own separate instance to do it's internal
> > > allocations.)
> > >
> > > sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-12  3:44                                       ` Allen Samuels
  2016-08-12  5:27                                         ` Ramesh Chander
  2016-08-12  5:52                                         ` Ramesh Chander
@ 2016-08-12  5:59                                         ` Somnath Roy
  2 siblings, 0 replies; 34+ messages in thread
From: Somnath Roy @ 2016-08-12  5:59 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: Ramesh Chander, ceph-devel

Yes, probably the simplest way is to pass down a flag, since this will be called from mkfs().
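
A minimal C++ sketch of that idea, with an invented flag name and a toy 
allocator (not the actual BitAllocator plumbing): the bulk frees issued 
during mkfs/mount pass the flag, while normal I/O-path frees keep the 
validation.

  #include <cassert>
  #include <cstddef>
  #include <vector>

  struct ToyAllocator {
    std::vector<bool> allocated;
    explicit ToyAllocator(std::size_t n) : allocated(n, true) {}

    bool is_allocated(std::size_t start, std::size_t num) const {
      for (std::size_t i = start; i < start + num; ++i)
        if (!allocated[i]) return false;
      return true;
    }

    // skip_validation would be true only for the bulk frees issued while
    // building the initial state (mkfs/mount); normal I/O-path frees keep
    // the expensive sanity check.
    void free_blocks(std::size_t start, std::size_t num,
                     bool skip_validation = false) {
      if (!skip_validation)
        assert(is_allocated(start, num));
      for (std::size_t i = start; i < start + num; ++i)
        allocated[i] = false;
    }
  };

  int main() {
    ToyAllocator a(1 << 24);
    a.free_blocks(0, a.allocated.size(), /*skip_validation=*/true);  // init path
    // I/O-path callers would use free_blocks(off, len) and keep the check.
    return 0;
  }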

-----Original Message-----
From: Allen Samuels
Sent: Thursday, August 11, 2016 8:45 PM
To: Somnath Roy; Sage Weil
Cc: Ramesh Chander; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

Is there a simple way to detect whether you're in initialization/not? If so, you could augment the debug_asserts to skip the is_allocated during initialization but re-enable them during normal operation.

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 11, 2016 8:10 PM
> To: Sage Weil <sage@newdream.net>; Allen Samuels
> <Allen.Samuels@sandisk.com>
> Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> Sage,
> I tried your PR but it is not helping much. See this each
> insert_free() call is taking ~40sec to complete and we have 2 calls that is taking time..
>
> 2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free
> instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
> 2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free
> instance
> 140128595341440 off 0x2000 len 0x6ab7d14f000
> 2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear
> bits in
> 0x6ab7d100000
>
> 2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free
> instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
> 2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free
> instance
> 140127837929472 off 0x2000 len 0x6ab7d14f000
> 2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear
> bits in
> 0x6ab7d100000
>
> I have also tried with the following and it is not helping either..
>
>        bluestore_bluefs_min_ratio = .01
>         bluestore_freelist_blocks_per_key = 512
>
>
> I did some debugging on this to find out which call inside this
> function is taking time and I found this within
> BitAllocator::free_blocks
>
>   debug_assert(is_allocated(start_block, num_blocks));
>
>   free_blocks_int(start_block, num_blocks);
>
> I did skip this debug_assert and total time reduced from ~80sec ~49sec
> , so, that's a significant improvement.
>
> Next, I found out that debug_assert(is_allocated()) is called from
> free_blocks_int as well. I commented out blindly all
> debug_assert(is_allocated()) and performance became similar to
> stupid/filestore.
> I didn't bother to look into is_allocated() anymore as my guess is we
> can safely ignore this during mkfs() time ?
> But, it will be good if we can optimize this as it may induce latency
> in the IO path (?).
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, August 11, 2016 2:20 PM
> To: Allen Samuels
> Cc: Ramesh Chander; Somnath Roy; ceph-devel
> Subject: RE: Bluestore different allocator performance Vs FileStore
>
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs
> > > FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by
> > > > BlueFS was contained in the snapshot/journal of BlueFS itself,
> > > > NOT in the KV store itself. This requires that upon startup, we
> > > > replay the BlueFS snapshot/journal into the FreeListManager so
> > > > that it properly records the consumption of BlueFS space (since
> > > > that allocation MAY NOT be accurate within the FreeListmanager itself).
> > > > But that this playback need not generate an KVStore operations
> > > > (since those are duplicates of the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case,
> > > > you don't want to commit 't'.
> > > >
> > > > That suggests to me that a version of allocate that doesn't have
> > > > a transaction could be easily created would have the speed we're
> > > > looking for (and independence from the BitMapAllocator to
> > > > KVStore
> chunking).
> > >
> > > Oh, I see.  Yeah, you're right--this step isn't really necessary,
> > > as long as we ensure that the auxilliary representation of what
> > > bluefs owns (bluefs_extents in the superblock) is still passed
> > > into the Allocator during initialization.  Having the freelist
> > > reflect the allocator that this space was "in use" (by bluefs) and
> > > thus off limits to bluestore is simple but not strictly necessary.
> > >
> > > I'll work on a PR that avoids this...
>
> https://github.com/ceph/ceph/pull/10698
>
> Ramesh, can you give it a try?
>
> > > > I suspect that we also have long startup times because we're
> > > > doing the same underlying bitmap operations except they come
> > > > from the BlueFS replay code instead of the BlueFS initialization
> > > > code, but same problem with likely the same fix.
> > >
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist
> > > the freelist at all)... we initialize the in-memory Allocator
> > > state from the metadata in the bluefs log.  I think we should be fine on this end.
> >
> > Likely that code suffers from the same problem -- a false need to
> > update the KV Store (During the playback, BlueFS extents are
> > converted to bitmap runs, it's essentially the same lower level code
> > as the case we're seeing now, but it instead of being driven by an
> > artificial "big run", it'sll be driven from the BlueFS Journal
> > replay code). But that's just a guess, I don't have time to track
> > down the actual code right
> now.
>
> BlueFS can't touch the freelist (or kv store, ever) since it
> ultimately backs the kv store and that would be problematic.  We do
> initialize the bluefs Allocator's in-memory state, but that's it.
>
> The PR above changes the BlueStore::_init_alloc() so that BlueStore's
> Allocator state is initialize with both the freelist state (from kv
> store)
> *and* the bluefs_extents list (from the bluestore superblock).  (From
> this Allocator's perspective, all of bluefs's space is allocated and can't be used.
> BlueFS has it's own separate instance to do it's internal
> allocations.)
>
> sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 21:19                                   ` Sage Weil
  2016-08-12  3:10                                     ` Somnath Roy
@ 2016-08-12  6:19                                     ` Somnath Roy
  2016-08-12 15:26                                     ` Sage Weil
  2 siblings, 0 replies; 34+ messages in thread
From: Somnath Roy @ 2016-08-12  6:19 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: Ramesh Chander, ceph-devel

One more finding, Ramesh, while debugging this:
I found that in BitAllocator.cc you have used /usr/include/assert.h. This collides with dout() (which I was trying to introduce) and gives a compilation error. Eventually, I had to comment out <assert.h> and use the Ceph assert.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 11, 2016 8:10 PM
To: 'Sage Weil'; Allen Samuels
Cc: Ramesh Chander; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

Sage,
I tried your PR but it is not helping much. See this each insert_free() call is taking ~40sec to complete and we have 2 calls that is taking time..

2016-08-11 17:32:48.086109 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140128595341440 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:32:48.086111 7f7243fad8c0 20 bitmapalloc:insert_free instance 140128595341440 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:33:27.843948 7f7243fad8c0 30 freelist  no more clear bits in 0x6ab7d100000

2016-08-11 17:33:30.839093 7f7243fad8c0 10 bitmapalloc:init_add_free instance 140127837929472 offset 0x2000 length 0x6ab7d14f000
2016-08-11 17:33:30.839095 7f7243fad8c0 20 bitmapalloc:insert_free instance 140127837929472 off 0x2000 len 0x6ab7d14f000
2016-08-11 17:34:10.517809 7f7243fad8c0 30 freelist  no more clear bits in 0x6ab7d100000

I have also tried with the following and it is not helping either..

       bluestore_bluefs_min_ratio = .01
        bluestore_freelist_blocks_per_key = 512


I did some debugging on this to find out which call inside this function is taking time and I found this within BitAllocator::free_blocks

  debug_assert(is_allocated(start_block, num_blocks));

  free_blocks_int(start_block, num_blocks);

I did skip this debug_assert and total time reduced from ~80sec ~49sec , so, that's a significant improvement.

Next, I found out that debug_assert(is_allocated()) is called from free_blocks_int as well. I commented out blindly all debug_assert(is_allocated()) and performance became similar to stupid/filestore.
I didn't bother to look into is_allocated() anymore as my guess is we can safely ignore this during mkfs() time ?
But, it will be good if we can optimize this as it may induce latency in the IO path (?).

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: Thursday, August 11, 2016 2:20 PM
To: Allen Samuels
Cc: Ramesh Chander; Somnath Roy; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 1:24 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > clarify as needed.
> > >
> > > I thought that the authoritative indication of space used by
> > > BlueFS was contained in the snapshot/journal of BlueFS itself, NOT
> > > in the KV store itself. This requires that upon startup, we replay
> > > the BlueFS snapshot/journal into the FreeListManager so that it
> > > properly records the consumption of BlueFS space (since that
> > > allocation MAY NOT be accurate within the FreeListmanager itself).
> > > But that this playback need not generate an KVStore operations
> > > (since those are duplicates of the BlueFS).
> > >
> > > So in the code you cite:
> > >
> > > fm->allocate(0, reserved, t);
> > >
> > > There's no need to commit 't', and in fact, in the general case,
> > > you don't want to commit 't'.
> > >
> > > That suggests to me that a version of allocate that doesn't have a
> > > transaction could be easily created would have the speed we're
> > > looking for (and independence from the BitMapAllocator to KVStore chunking).
> >
> > Oh, I see.  Yeah, you're right--this step isn't really necessary, as
> > long as we ensure that the auxilliary representation of what bluefs
> > owns (bluefs_extents in the superblock) is still passed into the
> > Allocator during initialization.  Having the freelist reflect the
> > allocator that this space was "in use" (by bluefs) and thus off
> > limits to bluestore is simple but not strictly necessary.
> >
> > I'll work on a PR that avoids this...

https://github.com/ceph/ceph/pull/10698

Ramesh, can you give it a try?

> > > I suspect that we also have long startup times because we're doing
> > > the same underlying bitmap operations except they come from the
> > > BlueFS replay code instead of the BlueFS initialization code, but
> > > same problem with likely the same fix.
> >
> > BlueFS doesn't touch the FreelistManager (or explicitly persist the
> > freelist at all)... we initialize the in-memory Allocator state from
> > the metadata in the bluefs log.  I think we should be fine on this end.
>
> Likely that code suffers from the same problem -- a false need to
> update the KV Store (During the playback, BlueFS extents are converted
> to bitmap runs, it's essentially the same lower level code as the case
> we're seeing now, but it instead of being driven by an artificial "big
> run", it'sll be driven from the BlueFS Journal replay code). But
> that's just a guess, I don't have time to track down the actual code right now.

BlueFS can't touch the freelist (or kv store, ever) since it ultimately backs the kv store and that would be problematic.  We do initialize the bluefs Allocator's in-memory state, but that's it.

The PR above changes the BlueStore::_init_alloc() so that BlueStore's Allocator state is initialize with both the freelist state (from kv store)
*and* the bluefs_extents list (from the bluestore superblock).  (From this Allocator's perspective, all of bluefs's space is allocated and can't be used.  BlueFS has it's own separate instance to do it's internal
allocations.)

sage

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-11 21:19                                   ` Sage Weil
  2016-08-12  3:10                                     ` Somnath Roy
  2016-08-12  6:19                                     ` Somnath Roy
@ 2016-08-12 15:26                                     ` Sage Weil
  2016-08-12 15:43                                       ` Somnath Roy
  2016-08-12 20:02                                       ` Somnath Roy
  2 siblings, 2 replies; 34+ messages in thread
From: Sage Weil @ 2016-08-12 15:26 UTC (permalink / raw)
  To: Ramesh Chander; +Cc: Allen Samuels, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Sage Weil wrote:
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs FileStore
> > > 
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by BlueFS
> > > > was contained in the snapshot/journal of BlueFS itself, NOT in the KV
> > > > store itself. This requires that upon startup, we replay the BlueFS
> > > > snapshot/journal into the FreeListManager so that it properly records
> > > > the consumption of BlueFS space (since that allocation MAY NOT be
> > > > accurate within the FreeListmanager itself). But that this playback
> > > > need not generate an KVStore operations (since those are duplicates of
> > > > the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case, you
> > > > don't want to commit 't'.
> > > >
> > > > That suggests to me that a version of allocate that doesn't have a
> > > > transaction could be easily created would have the speed we're looking
> > > > for (and independence from the BitMapAllocator to KVStore chunking).
> > > 
> > > Oh, I see.  Yeah, you're right--this step isn't really necessary, as long as we
> > > ensure that the auxilliary representation of what bluefs owns
> > > (bluefs_extents in the superblock) is still passed into the Allocator during
> > > initialization.  Having the freelist reflect the allocator that this space was "in
> > > use" (by bluefs) and thus off limits to bluestore is simple but not strictly
> > > necessary.
> > > 
> > > I'll work on a PR that avoids this...
> 
> https://github.com/ceph/ceph/pull/10698
> 
> Ramesh, can you give it a try?
> 
> > > > I suspect that we also have long startup times because we're doing the
> > > > same underlying bitmap operations except they come from the BlueFS
> > > > replay code instead of the BlueFS initialization code, but same
> > > > problem with likely the same fix.
> > > 
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist the freelist at
> > > all)... we initialize the in-memory Allocator state from the metadata in the
> > > bluefs log.  I think we should be fine on this end.
> > 
> > Likely that code suffers from the same problem -- a false need to update 
> > the KV Store (During the playback, BlueFS extents are converted to 
> > bitmap runs, it's essentially the same lower level code as the case 
> > we're seeing now, but it instead of being driven by an artificial "big 
> > run", it'sll be driven from the BlueFS Journal replay code). But that's 
> > just a guess, I don't have time to track down the actual code right now.
> 
> BlueFS can't touch the freelist (or kv store, ever) since it ultimately 
> backs the kv store and that would be problematic.  We do initialize the 
> bluefs Allocator's in-memory state, but that's it.
> 
> The PR above changes the BlueStore::_init_alloc() so that BlueStore's 
> Allocator state is initialize with both the freelist state (from kv store) 
> *and* the bluefs_extents list (from the bluestore superblock).  (From this 
> Allocator's perspective, all of bluefs's space is allocated and can't be 
> used.  BlueFS has it's own separate instance to do it's internal 
> allocations.)

Ah, okay, so after our conversation in standup I went and looked at the 
code some more and realized I've been thinking about the 
BitmapFreelistManager and not the BitMapAllocator.  The ~40s is all CPU 
time spent updating in-memory bits, and has nothing to do with pushing 
updates through rocksdb.  Sorry for the confusing conversation.

So... I think there is one thing we can do: change the initialization of 
the allocator state from the freelist so that the assumption is that space 
is free and we tell it what is allocated (currently we assume everything 
is allocated and tell it what is free).  I'm not sure it's worth it, 
though: we'll just make things slower to start up on a full OSD instead of 
slower on an empty OSD.  And it seems like the CPU time really won't be 
significant anyway once the debugging stuff is taken out.
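
To make the trade-off concrete, here is a small self-contained C++ toy (not 
the BlueStore code; the function names are made up) contrasting the two 
initialization directions: bit work is proportional to free space in the 
current scheme and to used space in the inverted one, so an empty OSD is the 
worst case today and a full OSD would be the worst case after the change.

  #include <algorithm>
  #include <cstdint>
  #include <iostream>
  #include <utility>
  #include <vector>

  using Extent = std::pair<uint64_t, uint64_t>;   // offset, length (in blocks)

  // Current direction: start "all allocated", mark the free extents.
  // Init-time bit work is proportional to the amount of FREE space.
  uint64_t init_mark_free(std::vector<bool>& bits,
                          const std::vector<Extent>& free_ext) {
    uint64_t touched = 0;
    std::fill(bits.begin(), bits.end(), true);
    for (const auto& e : free_ext)
      for (uint64_t i = e.first; i < e.first + e.second; ++i, ++touched)
        bits[i] = false;
    return touched;
  }

  // Alternative: start "all free", mark the allocated extents.
  // Init-time bit work is proportional to the amount of USED space.
  uint64_t init_mark_allocated(std::vector<bool>& bits,
                               const std::vector<Extent>& used_ext) {
    uint64_t touched = 0;
    std::fill(bits.begin(), bits.end(), false);
    for (const auto& e : used_ext)
      for (uint64_t i = e.first; i < e.first + e.second; ++i, ++touched)
        bits[i] = true;
    return touched;
  }

  int main() {
    std::vector<bool> bits(1 << 20);
    std::vector<Extent> free_ext = {{0, 1 << 20}};   // empty OSD: all free...
    std::vector<Extent> used_ext;                    // ...nothing used
    std::cout << "mark-free touches      " << init_mark_free(bits, free_ext) << " bits\n"
              << "mark-allocated touches " << init_mark_allocated(bits, used_ext) << " bits\n";
    return 0;
  }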

I think this PR

	https://github.com/ceph/ceph/pull/10698

is still a good idea, though, since it avoids useless freelist kv work 
during mkfs.

Does that sound right?  Or am I still missing something?

Thanks for your patience!
sage



^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-12 15:26                                     ` Sage Weil
@ 2016-08-12 15:43                                       ` Somnath Roy
  2016-08-12 20:02                                       ` Somnath Roy
  1 sibling, 0 replies; 34+ messages in thread
From: Somnath Roy @ 2016-08-12 15:43 UTC (permalink / raw)
  To: Sage Weil, Ramesh Chander; +Cc: Allen Samuels, ceph-devel

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: Friday, August 12, 2016 8:26 AM
To: Ramesh Chander
Cc: Allen Samuels; Somnath Roy; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

On Thu, 11 Aug 2016, Sage Weil wrote:
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs
> > > FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by
> > > > BlueFS was contained in the snapshot/journal of BlueFS itself,
> > > > NOT in the KV store itself. This requires that upon startup, we
> > > > replay the BlueFS snapshot/journal into the FreeListManager so
> > > > that it properly records the consumption of BlueFS space (since
> > > > that allocation MAY NOT be accurate within the FreeListmanager
> > > > itself). But that this playback need not generate an KVStore
> > > > operations (since those are duplicates of the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case,
> > > > you don't want to commit 't'.
> > > >
> > > > That suggests to me that a version of allocate that doesn't take
> > > > a transaction could easily be created and would have the speed we're
> > > > looking for (and independence from the BitMapAllocator-to-KVStore chunking).
> > >
> > > Oh, I see.  Yeah, you're right--this step isn't really necessary,
> > > as long as we ensure that the auxiliary representation of what
> > > bluefs owns (bluefs_extents in the superblock) is still passed
> > > into the Allocator during initialization.  Having the freelist
> > > reflect to the allocator that this space was "in use" (by bluefs) and
> > > thus off limits to bluestore is simple but not strictly necessary.
> > >
> > > I'll work on a PR that avoids this...
>
> https://github.com/ceph/ceph/pull/10698
>
> Ramesh, can you give it a try?
>
> > > > I suspect that we also have long startup times because we're
> > > > doing the same underlying bitmap operations except they come
> > > > from the BlueFS replay code instead of the BlueFS initialization
> > > > code, but same problem with likely the same fix.
> > >
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist
> > > the freelist at all)... we initialize the in-memory Allocator
> > > state from the metadata in the bluefs log.  I think we should be fine on this end.
> >
> > Likely that code suffers from the same problem -- a false need to
> > update the KV Store (During the playback, BlueFS extents are
> > converted to bitmap runs, it's essentially the same lower level code
> > as the case we're seeing now, but instead of being driven by an
> > artificial "big run", it'll be driven from the BlueFS Journal
> > replay code). But that's just a guess; I don't have time to track down the actual code right now.
>
> BlueFS can't touch the freelist (or kv store, ever) since it
> ultimately backs the kv store and that would be problematic.  We do
> initialize the bluefs Allocator's in-memory state, but that's it.
>
> The PR above changes BlueStore::_init_alloc() so that BlueStore's
> Allocator state is initialized with both the freelist state (from the kv
> store) *and* the bluefs_extents list (from the bluestore superblock).
> (From this Allocator's perspective, all of bluefs's space is allocated
> and can't be used.  BlueFS has its own separate instance to do its
> internal allocations.)

Ah, okay, so after our conversation in standup I went and looked at the code some more and realized I've been thinking about the BitmapFreelistManager and not the BitMapAllocator.  The ~40s is all CPU time spent updating in-memory bits, and has nothing to do with pushing updates through rocksdb.  Sorry for the confusing conversation.

So... I think there is one thing we can do: change the initialization of the allocator state from the freelist so that the assumption is that space is free and we tell it what is allocated (currently we assume everything is allocated and tell it what is free).  I'm not sure it's worth it,
though: we'll just make things slower to start up on a full OSD instead of slower on an empty OSD.  And it seems like the CPU time really won't be significant anyway once the debugging stuff is taken out.

I think this PR

https://github.com/ceph/ceph/pull/10698

is still a good idea, though, since it avoids useless freelist kv work during mkfs.

Does that sound right?  Or am I still missing something?

Thanks for your patience!

[Somnath] Yes, I think it is a good idea; it seems it will also reduce some kv operations from the IO path, because _balance_bluefs_freespace is in the IO path (?)
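
A toy sketch of the transaction-free allocate() Allen floated earlier -- 
purely illustrative, not the real FreelistManager API -- just to show that 
the in-memory bookkeeping by itself needs no kv ops:

  #include <cstdint>
  #include <map>

  // Free space tracked as offset -> length; allocating a range only
  // mutates this map, no KeyValueDB transaction involved.
  struct InMemoryFreelistSketch {
    std::map<uint64_t, uint64_t> free_extents;

    void allocate_no_txn(uint64_t offset, uint64_t length) {
      auto p = free_extents.upper_bound(offset);
      if (p == free_extents.begin())
        return;                                   // nothing free here (toy model)
      --p;                                        // now p->first <= offset
      uint64_t start = p->first, end = p->first + p->second;
      if (end < offset + length)
        return;                                   // range not fully free (toy model)
      free_extents.erase(p);
      if (offset > start)
        free_extents[start] = offset - start;     // keep left remainder free
      if (end > offset + length)
        free_extents[offset + length] = end - (offset + length);  // right remainder
    }
  };
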
sage



^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Bluestore different allocator performance Vs FileStore
  2016-08-12 15:26                                     ` Sage Weil
  2016-08-12 15:43                                       ` Somnath Roy
@ 2016-08-12 20:02                                       ` Somnath Roy
  1 sibling, 0 replies; 34+ messages in thread
From: Somnath Roy @ 2016-08-12 20:02 UTC (permalink / raw)
  To: Sage Weil, Ramesh Chander; +Cc: Allen Samuels, ceph-devel

FYI, with the latest master (optimized is_allocated()), the OSD startup time and mkfs() time are now *almost* on par with the stupid allocator (~2.5X slower, down from ~16X).
As we discussed today, removing debug_assert(is_allocated()) altogether from the mkfs() path should close the remaining gap as well...
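
Roughly the pattern in question (sketch only, not the literal BitAllocator 
code): every large init-time free goes through a path that first 
re-verifies the whole range, e.g.:

  void free_blocks_sketch(int64_t start_block, int64_t num_blocks)
  {
    debug_assert(is_allocated(start_block, num_blocks));  // O(num_blocks) re-scan
    // ... actual bit flips for the freed range go here ...
  }

Dropping (or compiling out) that assert on the mkfs/init path removes the 
redundant scan.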

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Friday, August 12, 2016 8:44 AM
To: 'Sage Weil'; Ramesh Chander
Cc: Allen Samuels; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: Friday, August 12, 2016 8:26 AM
To: Ramesh Chander
Cc: Allen Samuels; Somnath Roy; ceph-devel
Subject: RE: Bluestore different allocator performance Vs FileStore

On Thu, 11 Aug 2016, Sage Weil wrote:
> On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, August 11, 2016 1:24 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Ramesh Chander <Ramesh.Chander@sandisk.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: RE: Bluestore different allocator performance Vs
> > > FileStore
> > >
> > > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > > Perhaps my understanding of the blueFS is incorrect -- so please
> > > > clarify as needed.
> > > >
> > > > I thought that the authoritative indication of space used by
> > > > BlueFS was contained in the snapshot/journal of BlueFS itself,
> > > > NOT in the KV store itself. This requires that upon startup, we
> > > > replay the BlueFS snapshot/journal into the FreeListManager so
> > > > that it properly records the consumption of BlueFS space (since
> > > > that allocation MAY NOT be accurate within the FreeListManager
> > > > itself). But this playback need not generate any KVStore
> > > > operations (since those are duplicates of the BlueFS).
> > > >
> > > > So in the code you cite:
> > > >
> > > > fm->allocate(0, reserved, t);
> > > >
> > > > There's no need to commit 't', and in fact, in the general case,
> > > > you don't want to commit 't'.
> > > >
> > > > That suggests to me that a version of allocate that doesn't take
> > > > a transaction could easily be created and would have the speed we're
> > > > looking for (and independence from the BitMapAllocator-to-KVStore chunking).
> > >
> > > Oh, I see.  Yeah, you're right--this step isn't really necessary,
> > > as long as we ensure that the auxiliary representation of what
> > > bluefs owns (bluefs_extents in the superblock) is still passed
> > > into the Allocator during initialization.  Having the freelist
> > > reflect to the allocator that this space was "in use" (by bluefs) and
> > > thus off limits to bluestore is simple but not strictly necessary.
> > >
> > > I'll work on a PR that avoids this...
>
> https://github.com/ceph/ceph/pull/10698
>
> Ramesh, can you give it a try?
>
> > > > I suspect that we also have long startup times because we're
> > > > doing the same underlying bitmap operations except they come
> > > > from the BlueFS replay code instead of the BlueFS initialization
> > > > code, but same problem with likely the same fix.
> > >
> > > BlueFS doesn't touch the FreelistManager (or explicitly persist
> > > the freelist at all)... we initialize the in-memory Allocator
> > > state from the metadata in the bluefs log.  I think we should be fine on this end.
> >
> > Likely that code suffers from the same problem -- a false need to
> > update the KV Store (During the playback, BlueFS extents are
> > converted to bitmap runs, it's essentially the same lower level code
> > as the case we're seeing now, but instead of being driven by an
> > artificial "big run", it'll be driven from the BlueFS Journal
> > replay code). But that's just a guess; I don't have time to track down the actual code right now.
>
> BlueFS can't touch the freelist (or kv store, ever) since it
> ultimately backs the kv store and that would be problematic.  We do
> initialize the bluefs Allocator's in-memory state, but that's it.
>
> The PR above changes BlueStore::_init_alloc() so that BlueStore's
> Allocator state is initialized with both the freelist state (from the kv
> store) *and* the bluefs_extents list (from the bluestore superblock).
> (From this Allocator's perspective, all of bluefs's space is allocated
> and can't be used.  BlueFS has its own separate instance to do its
> internal allocations.)

Ah, okay, so after our conversation in standup I went and looked at the code some more and realized I've been thinking about the BitmapFreelistManager and not the BitMapAllocator.  The ~40s is all CPU time spent updating in-memory bits, and has nothing to do with pushing updates through rocksdb.  Sorry for the confusing conversation.

So... I think there is one thing we can do: change the initialization of the allocator state from the freelist so that the assumption is that space is free and we tell it what is allocated (currently we assume everything is allocated and tell it what is free).  I'm not sure it's worth it,
though: we'll just make things slower to start up on a full OSD instead of slower on an empty OSD.  And it seems like the CPU time really won't be significant anyway once the debugging stuff is taken out.

I think this PR

https://github.com/ceph/ceph/pull/10698

is still a good idea, though, since it avoids useless freelist kv work during mkfs.

Does that sound right?  Or am I still missing something?

Thanks for your patience!

[Somnath] Yes, I think it is a good idea; it seems it will also reduce some kv operations from the IO path, because _balance_bluefs_freespace is in the IO path (?)
sage



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-08-12 20:02 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-10 16:55 Bluestore different allocator performance Vs FileStore Somnath Roy
2016-08-10 21:31 ` Sage Weil
2016-08-10 22:27   ` Somnath Roy
2016-08-10 22:44     ` Sage Weil
2016-08-10 22:58       ` Allen Samuels
2016-08-11  4:34         ` Ramesh Chander
2016-08-11  6:07         ` Ramesh Chander
2016-08-11  7:11           ` Somnath Roy
2016-08-11 11:24             ` Mark Nelson
2016-08-11 14:06               ` Ben England
2016-08-11 17:07                 ` Allen Samuels
2016-08-11 16:04           ` Allen Samuels
2016-08-11 16:35             ` Ramesh Chander
2016-08-11 16:38               ` Sage Weil
2016-08-11 17:05                 ` Allen Samuels
2016-08-11 17:15                   ` Sage Weil
2016-08-11 17:26                     ` Allen Samuels
2016-08-11 19:34                       ` Sage Weil
2016-08-11 19:45                         ` Allen Samuels
2016-08-11 20:03                           ` Sage Weil
2016-08-11 20:16                             ` Allen Samuels
2016-08-11 20:24                               ` Sage Weil
2016-08-11 20:28                                 ` Allen Samuels
2016-08-11 21:19                                   ` Sage Weil
2016-08-12  3:10                                     ` Somnath Roy
2016-08-12  3:44                                       ` Allen Samuels
2016-08-12  5:27                                         ` Ramesh Chander
2016-08-12  5:52                                         ` Ramesh Chander
2016-08-12  5:59                                         ` Somnath Roy
2016-08-12  6:19                                     ` Somnath Roy
2016-08-12 15:26                                     ` Sage Weil
2016-08-12 15:43                                       ` Somnath Roy
2016-08-12 20:02                                       ` Somnath Roy
2016-08-11 12:28       ` Milosz Tanski
