* Odd WAL traffic for BlueStore
@ 2016-08-22 13:46 Igor Fedotov
  2016-08-22 15:05 ` Somnath Roy
  2016-08-22 15:08 ` Sage Weil
  0 siblings, 2 replies; 16+ messages in thread
From: Igor Fedotov @ 2016-08-22 13:46 UTC (permalink / raw)
  To: ceph-devel

Hi All,

While testing BlueStore as standalone storage via the FIO plugin I'm 
observing huge write traffic to the WAL device.

BlueStore is configured to use two 450 GB Intel SSDs: INTEL SSDSC2BX480G4L

The first SSD is split into two partitions (200 and 250 GB) for the block 
DB and the block WAL.

The second is split similarly, with the first 200 GB partition allocated 
for the raw block data.

RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' 
post. Not much difference compared to the default settings though...

As a result, when doing 4K sequential writes (8 GB total) to fresh 
storage I'm observing (using nmon and other disk monitoring tools) 
significant write traffic to the WAL device, and it grows over time from 
~10 MB/s to ~170 MB/s. Raw block device traffic is pretty stable at 
~30 MB/s.

Additionally, I added output of the BlueFS perf counters on umount 
(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).

The resulting values are very frustrating: ~28 GB and ~4 GB for 
l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
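
(For reference, the dump I added on umount is roughly the following -- a
sketch rather than the exact patch, assuming access to the BlueFS perf
counters instance ("logger") inside BlueFS::umount():

     derr << "BlueFS bytes written to WAL: "
          << logger->get(l_bluefs_bytes_written_wal)
          << ", to SSTs: "
          << logger->get(l_bluefs_bytes_written_sst)
          << dendl;
)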


Doing 64K changes the picture dramatically:

WAL traffic is stable at 10-12 MB/s and raw block traffic is at ~400 MB/s; 
the BlueFS counters are ~140 MB and ~1 KB respectively.

Naturally, the write completes much faster in the second case.

No WAL is reported in the logs at the BlueStore level in either case.


High BlueFS WAL traffic is also observed when running a subsequent random 
4K RW workload over a store populated this way.

I'm wondering why the WAL device is involved in the process at all 
(writes happen in min_alloc_size blocks), and why the traffic and the 
volume of written data are so high.

Don't we have some fault affecting 4K performance here?


Here are my settings and FIO job specification:

###########################

[global]
         debug bluestore = 0/0
         debug bluefs = 1/0
         debug bdev = 0/0
         debug rocksdb = 0/0

         # spread objects over 8 collections
         osd pool default pg num = 32
         log to stderr = false

[osd]
         osd objectstore = bluestore
         bluestore_block_create = true
         bluestore_block_db_create = true
         bluestore_block_wal_create = true
         bluestore_min_alloc_size = 4096
         #bluestore_max_alloc_size = #or 4096
         bluestore_fsck_on_mount = false

         bluestore_block_path=/dev/sdi1
         bluestore_block_db_path=/dev/sde1
         bluestore_block_wal_path=/dev/sde2

         enable experimental unrecoverable data corrupting features = 
bluestore rocksdb memdb

         bluestore_rocksdb_options = 
"max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"

         rocksdb_cache_size = 4294967296
         bluestore_csum = false
         bluestore_csum_type = none
         bluestore_bluefs_buffered_io = false
         bluestore_max_ops = 30000
         bluestore_max_bytes = 629145600
         bluestore_buffer_cache_size = 104857600
         bluestore_block_wal_size = 0

         # use directory= option from fio job file
         osd data = ${fio_dir}

         # log inside fio_dir
         log file = ${fio_dir}/log
####################################

#FIO jobs
#################
# Runs a 4k random write test against the ceph BlueStore.
[global]
ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in 
your LD_LIBRARY_PATH

conf=ceph-bluestore-somnath.conf # must point to a valid ceph 
configuration file
directory=./fio-bluestore # directory for osd_data

rw=write
iodepth=16
size=256m

[bluestore]
nr_files=63
bs=4k        # or 64k
numjobs=32
#############
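
(If it matters, I launch the job in the usual way, roughly:

     LD_LIBRARY_PATH=/usr/local/lib fio ./bluestore.fio

where bluestore.fio is the job file above -- the file names here are just
examples.)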



^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Odd WAL traffic for BlueStore
  2016-08-22 13:46 Odd WAL traffic for BlueStore Igor Fedotov
@ 2016-08-22 15:05 ` Somnath Roy
  2016-08-22 15:08   ` Somnath Roy
  2016-08-22 15:08   ` Igor Fedotov
  2016-08-22 15:08 ` Sage Weil
  1 sibling, 2 replies; 16+ messages in thread
From: Somnath Roy @ 2016-08-22 15:05 UTC (permalink / raw)
  To: Igor Fedotov, ceph-devel

Igor,
I am always seeing this WAL traffic in my 4K tests. Initially I thought there was some faulty logic on the BlueStore side not honoring min_alloc_size, but after further debugging it seems the traffic is generated by BlueFS/RocksDB.
Regarding the RocksDB tuning, if you are not running the tests long enough (maybe >20 min) you won't see any difference from the defaults.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
Sent: Monday, August 22, 2016 6:47 AM
To: ceph-devel
Subject: Odd WAL traffic for BlueStore

Hi All,

While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.

Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L

The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.

The second is split similarly and first 200Gb partition allocated for Raw Block data.

RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
No much difference comparing to default settings though...

As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.

Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).

The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.


Doing 64K changes the picture dramatically:

WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
BlueFS counters are ~140Mb and 1K respectively.

Surely write completes much faster in the second case.

No WAL is reported in logs at BlueStore level for both cases.


High BlueFS WAL traffic is observed when running subsequent random 4K RW
over the store propagated this way too.

I'm wondering why WAL device is involved in the process at all ( writes
happen in min_alloc_size blocks) operate and why the traffic and written
data volume is so high?

Don't we have some fault affecting 4K performance here?


Here are my settings and FIO job specification:

###########################

[global]
         debug bluestore = 0/0
         debug bluefs = 1/0
         debug bdev = 0/0
         debug rocksdb = 0/0

         # spread objects over 8 collections
         osd pool default pg num = 32
         log to stderr = false

[osd]
         osd objectstore = bluestore
         bluestore_block_create = true
         bluestore_block_db_create = true
         bluestore_block_wal_create = true
         bluestore_min_alloc_size = 4096
         #bluestore_max_alloc_size = #or 4096
         bluestore_fsck_on_mount = false

         bluestore_block_path=/dev/sdi1
         bluestore_block_db_path=/dev/sde1
         bluestore_block_wal_path=/dev/sde2

         enable experimental unrecoverable data corrupting features =
bluestore rocksdb memdb

         bluestore_rocksdb_options =
"max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"

         rocksdb_cache_size = 4294967296
         bluestore_csum = false
         bluestore_csum_type = none
         bluestore_bluefs_buffered_io = false
         bluestore_max_ops = 30000
         bluestore_max_bytes = 629145600
         bluestore_buffer_cache_size = 104857600
         bluestore_block_wal_size = 0

         # use directory= option from fio job file
         osd data = ${fio_dir}

         # log inside fio_dir
         log file = ${fio_dir}/log
####################################

#FIO jobs
#################
# Runs a 4k random write test against the ceph BlueStore.
[global]
ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in
your LD_LIBRARY_PATH

conf=ceph-bluestore-somnath.conf # must point to a valid ceph
configuration file
directory=./fio-bluestore # directory for osd_data

rw=write
iodepth=16
size=256m

[bluestore]
nr_files=63
bs=4k        # or 64k
numjobs=32
#############


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Odd WAL traffic for BlueStore
  2016-08-22 15:05 ` Somnath Roy
@ 2016-08-22 15:08   ` Somnath Roy
  2016-08-22 15:12     ` Igor Fedotov
  2016-08-22 15:08   ` Igor Fedotov
  1 sibling, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-08-22 15:08 UTC (permalink / raw)
  To: Somnath Roy, Igor Fedotov, ceph-devel

Another point: I've never tried it, but there is an option to disable the WAL write during a RocksDB write. We can try this option and see whether it reduces the writes to the WAL partition.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Monday, August 22, 2016 8:06 AM
To: Igor Fedotov; ceph-devel
Subject: RE: Odd WAL traffic for BlueStore

Igor,
I am always seeing this WAL traffic in my 4k tests. Initially, I thought there are some faulty logic on Bluestore side and not honoring min_alloc_size , but, further debugging it seems the traffic is generated from BlueFS/Rocksdb.
Regarding rocksdb tuning, if you are not running the tests long enough (may be >20 min) you wouldn't be seeing any difference with default.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
Sent: Monday, August 22, 2016 6:47 AM
To: ceph-devel
Subject: Odd WAL traffic for BlueStore

Hi All,

While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.

Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L

The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.

The second is split similarly and first 200Gb partition allocated for Raw Block data.

RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
No much difference comparing to default settings though...

As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.

Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).

The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.


Doing 64K changes the picture dramatically:

WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs BlueFS counters are ~140Mb and 1K respectively.

Surely write completes much faster in the second case.

No WAL is reported in logs at BlueStore level for both cases.


High BlueFS WAL traffic is observed when running subsequent random 4K RW over the store propagated this way too.

I'm wondering why WAL device is involved in the process at all ( writes happen in min_alloc_size blocks) operate and why the traffic and written data volume is so high?

Don't we have some fault affecting 4K performance here?


Here are my settings and FIO job specification:

###########################

[global]
         debug bluestore = 0/0
         debug bluefs = 1/0
         debug bdev = 0/0
         debug rocksdb = 0/0

         # spread objects over 8 collections
         osd pool default pg num = 32
         log to stderr = false

[osd]
         osd objectstore = bluestore
         bluestore_block_create = true
         bluestore_block_db_create = true
         bluestore_block_wal_create = true
         bluestore_min_alloc_size = 4096
         #bluestore_max_alloc_size = #or 4096
         bluestore_fsck_on_mount = false

         bluestore_block_path=/dev/sdi1
         bluestore_block_db_path=/dev/sde1
         bluestore_block_wal_path=/dev/sde2

         enable experimental unrecoverable data corrupting features = bluestore rocksdb memdb

         bluestore_rocksdb_options =
"max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"

         rocksdb_cache_size = 4294967296
         bluestore_csum = false
         bluestore_csum_type = none
         bluestore_bluefs_buffered_io = false
         bluestore_max_ops = 30000
         bluestore_max_bytes = 629145600
         bluestore_buffer_cache_size = 104857600
         bluestore_block_wal_size = 0

         # use directory= option from fio job file
         osd data = ${fio_dir}

         # log inside fio_dir
         log file = ${fio_dir}/log
####################################

#FIO jobs
#################
# Runs a 4k random write test against the ceph BlueStore.
[global]
ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH

conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration file directory=./fio-bluestore # directory for osd_data

rw=write
iodepth=16
size=256m

[bluestore]
nr_files=63
bs=4k        # or 64k
numjobs=32
#############


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 15:05 ` Somnath Roy
  2016-08-22 15:08   ` Somnath Roy
@ 2016-08-22 15:08   ` Igor Fedotov
  1 sibling, 0 replies; 16+ messages in thread
From: Igor Fedotov @ 2016-08-22 15:08 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Somnath,

Yeah, looks like that's indeed non-BlueStore traffic. I did some log 
analysis and haven't seen any WAL activity at the BlueStore level.

And thanks for your comment on rocksdb tuning.

Kind regards,

Igor


On 22.08.2016 18:05, Somnath Roy wrote:
> Igor,
> I am always seeing this WAL traffic in my 4k tests. Initially, I thought there are some faulty logic on Bluestore side and not honoring min_alloc_size , but, further debugging it seems the traffic is generated from BlueFS/Rocksdb.
> Regarding rocksdb tuning, if you are not running the tests long enough (may be >20 min) you wouldn't be seeing any difference with default.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
> Sent: Monday, August 22, 2016 6:47 AM
> To: ceph-devel
> Subject: Odd WAL traffic for BlueStore
>
> Hi All,
>
> While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.
>
> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>
> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.
>
> The second is split similarly and first 200Gb partition allocated for Raw Block data.
>
> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
> No much difference comparing to default settings though...
>
> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.
>
> Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>
> The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>
>
> Doing 64K changes the picture dramatically:
>
> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
> BlueFS counters are ~140Mb and 1K respectively.
>
> Surely write completes much faster in the second case.
>
> No WAL is reported in logs at BlueStore level for both cases.
>
>
> High BlueFS WAL traffic is observed when running subsequent random 4K RW
> over the store propagated this way too.
>
> I'm wondering why WAL device is involved in the process at all ( writes
> happen in min_alloc_size blocks) operate and why the traffic and written
> data volume is so high?
>
> Don't we have some fault affecting 4K performance here?
>
>
> Here are my settings and FIO job specification:
>
> ###########################
>
> [global]
>           debug bluestore = 0/0
>           debug bluefs = 1/0
>           debug bdev = 0/0
>           debug rocksdb = 0/0
>
>           # spread objects over 8 collections
>           osd pool default pg num = 32
>           log to stderr = false
>
> [osd]
>           osd objectstore = bluestore
>           bluestore_block_create = true
>           bluestore_block_db_create = true
>           bluestore_block_wal_create = true
>           bluestore_min_alloc_size = 4096
>           #bluestore_max_alloc_size = #or 4096
>           bluestore_fsck_on_mount = false
>
>           bluestore_block_path=/dev/sdi1
>           bluestore_block_db_path=/dev/sde1
>           bluestore_block_wal_path=/dev/sde2
>
>           enable experimental unrecoverable data corrupting features =
> bluestore rocksdb memdb
>
>           bluestore_rocksdb_options =
> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>
>           rocksdb_cache_size = 4294967296
>           bluestore_csum = false
>           bluestore_csum_type = none
>           bluestore_bluefs_buffered_io = false
>           bluestore_max_ops = 30000
>           bluestore_max_bytes = 629145600
>           bluestore_buffer_cache_size = 104857600
>           bluestore_block_wal_size = 0
>
>           # use directory= option from fio job file
>           osd data = ${fio_dir}
>
>           # log inside fio_dir
>           log file = ${fio_dir}/log
> ####################################
>
> #FIO jobs
> #################
> # Runs a 4k random write test against the ceph BlueStore.
> [global]
> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in
> your LD_LIBRARY_PATH
>
> conf=ceph-bluestore-somnath.conf # must point to a valid ceph
> configuration file
> directory=./fio-bluestore # directory for osd_data
>
> rw=write
> iodepth=16
> size=256m
>
> [bluestore]
> nr_files=63
> bs=4k        # or 64k
> numjobs=32
> #############
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 13:46 Odd WAL traffic for BlueStore Igor Fedotov
  2016-08-22 15:05 ` Somnath Roy
@ 2016-08-22 15:08 ` Sage Weil
  2016-08-22 15:10   ` Igor Fedotov
  1 sibling, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-08-22 15:08 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Mon, 22 Aug 2016, Igor Fedotov wrote:
> Hi All,
> 
> While testing BlueStore as a standalone storage via FIO plugin I'm observing
> huge traffic to a WAL device.
> 
> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
> 
> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block
> WAL.
> 
> The second is split similarly and first 200Gb partition allocated for Raw
> Block data.
> 
> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' . No
> much difference comparing to default settings though...
> 
> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm
> observing (using nmon and other disk mon tools) significant write traffic to
> WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device
> traffic is pretty stable at ~30 Mbs.
> 
> Additionally I inserted an output for BlueFS perf counters on
> umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
> 
> The resulting values are very frustrating: ~28Gb and 4Gb for
> l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.

Yeah, this doesn't seem right.  Have you generated a log to see what is 
actually happening on each write?  I don't have any bright ideas about 
what is going wrong here.

sage

> Doing 64K changes the picture dramatically:
> 
> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
> BlueFS counters are ~140Mb and 1K respectively.
> 
> Surely write completes much faster in the second case.
> 
> No WAL is reported in logs at BlueStore level for both cases.
> 
> 
> High BlueFS WAL traffic is observed when running subsequent random 4K RW over
> the store propagated this way too.
> 
> I'm wondering why WAL device is involved in the process at all ( writes happen
> in min_alloc_size blocks) operate and why the traffic and written data volume
> is so high?
> 
> Don't we have some fault affecting 4K performance here?
> 
> 
> Here are my settings and FIO job specification:
> 
> ###########################
> 
> [global]
>         debug bluestore = 0/0
>         debug bluefs = 1/0
>         debug bdev = 0/0
>         debug rocksdb = 0/0
> 
>         # spread objects over 8 collections
>         osd pool default pg num = 32
>         log to stderr = false
> 
> [osd]
>         osd objectstore = bluestore
>         bluestore_block_create = true
>         bluestore_block_db_create = true
>         bluestore_block_wal_create = true
>         bluestore_min_alloc_size = 4096
>         #bluestore_max_alloc_size = #or 4096
>         bluestore_fsck_on_mount = false
> 
>         bluestore_block_path=/dev/sdi1
>         bluestore_block_db_path=/dev/sde1
>         bluestore_block_wal_path=/dev/sde2
> 
>         enable experimental unrecoverable data corrupting features = bluestore
> rocksdb memdb
> 
>         bluestore_rocksdb_options =
> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
> 
>         rocksdb_cache_size = 4294967296
>         bluestore_csum = false
>         bluestore_csum_type = none
>         bluestore_bluefs_buffered_io = false
>         bluestore_max_ops = 30000
>         bluestore_max_bytes = 629145600
>         bluestore_buffer_cache_size = 104857600
>         bluestore_block_wal_size = 0
> 
>         # use directory= option from fio job file
>         osd data = ${fio_dir}
> 
>         # log inside fio_dir
>         log file = ${fio_dir}/log
> ####################################
> 
> #FIO jobs
> #################
> # Runs a 4k random write test against the ceph BlueStore.
> [global]
> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your
> LD_LIBRARY_PATH
> 
> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration
> file
> directory=./fio-bluestore # directory for osd_data
> 
> rw=write
> iodepth=16
> size=256m
> 
> [bluestore]
> nr_files=63
> bs=4k        # or 64k
> numjobs=32
> #############
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 15:08 ` Sage Weil
@ 2016-08-22 15:10   ` Igor Fedotov
  2016-08-22 15:12     ` Sage Weil
  2016-08-22 15:17     ` Haomai Wang
  0 siblings, 2 replies; 16+ messages in thread
From: Igor Fedotov @ 2016-08-22 15:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Will prepare shortly. Any suggestions on desired levels and components?


On 22.08.2016 18:08, Sage Weil wrote:
> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>> Hi All,
>>
>> While testing BlueStore as a standalone storage via FIO plugin I'm observing
>> huge traffic to a WAL device.
>>
>> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>>
>> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block
>> WAL.
>>
>> The second is split similarly and first 200Gb partition allocated for Raw
>> Block data.
>>
>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' . No
>> much difference comparing to default settings though...
>>
>> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm
>> observing (using nmon and other disk mon tools) significant write traffic to
>> WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device
>> traffic is pretty stable at ~30 Mbs.
>>
>> Additionally I inserted an output for BlueFS perf counters on
>> umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>
>> The resulting values are very frustrating: ~28Gb and 4Gb for
>> l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
> Yeah, this doesn't seem right.  Have you generated a log to see what is
> actually happening on each write?  I don't have any bright ideas about
> what is going wrong here.
>
> sage
>
>> Doing 64K changes the picture dramatically:
>>
>> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
>> BlueFS counters are ~140Mb and 1K respectively.
>>
>> Surely write completes much faster in the second case.
>>
>> No WAL is reported in logs at BlueStore level for both cases.
>>
>>
>> High BlueFS WAL traffic is observed when running subsequent random 4K RW over
>> the store propagated this way too.
>>
>> I'm wondering why WAL device is involved in the process at all ( writes happen
>> in min_alloc_size blocks) operate and why the traffic and written data volume
>> is so high?
>>
>> Don't we have some fault affecting 4K performance here?
>>
>>
>> Here are my settings and FIO job specification:
>>
>> ###########################
>>
>> [global]
>>          debug bluestore = 0/0
>>          debug bluefs = 1/0
>>          debug bdev = 0/0
>>          debug rocksdb = 0/0
>>
>>          # spread objects over 8 collections
>>          osd pool default pg num = 32
>>          log to stderr = false
>>
>> [osd]
>>          osd objectstore = bluestore
>>          bluestore_block_create = true
>>          bluestore_block_db_create = true
>>          bluestore_block_wal_create = true
>>          bluestore_min_alloc_size = 4096
>>          #bluestore_max_alloc_size = #or 4096
>>          bluestore_fsck_on_mount = false
>>
>>          bluestore_block_path=/dev/sdi1
>>          bluestore_block_db_path=/dev/sde1
>>          bluestore_block_wal_path=/dev/sde2
>>
>>          enable experimental unrecoverable data corrupting features = bluestore
>> rocksdb memdb
>>
>>          bluestore_rocksdb_options =
>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>
>>          rocksdb_cache_size = 4294967296
>>          bluestore_csum = false
>>          bluestore_csum_type = none
>>          bluestore_bluefs_buffered_io = false
>>          bluestore_max_ops = 30000
>>          bluestore_max_bytes = 629145600
>>          bluestore_buffer_cache_size = 104857600
>>          bluestore_block_wal_size = 0
>>
>>          # use directory= option from fio job file
>>          osd data = ${fio_dir}
>>
>>          # log inside fio_dir
>>          log file = ${fio_dir}/log
>> ####################################
>>
>> #FIO jobs
>> #################
>> # Runs a 4k random write test against the ceph BlueStore.
>> [global]
>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your
>> LD_LIBRARY_PATH
>>
>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration
>> file
>> directory=./fio-bluestore # directory for osd_data
>>
>> rw=write
>> iodepth=16
>> size=256m
>>
>> [bluestore]
>> nr_files=63
>> bs=4k        # or 64k
>> numjobs=32
>> #############
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 15:08   ` Somnath Roy
@ 2016-08-22 15:12     ` Igor Fedotov
  2016-08-22 15:53       ` Somnath Roy
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Fedotov @ 2016-08-22 15:12 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Can you point it out? I don't see any...


On 22.08.2016 18:08, Somnath Roy wrote:
> Another point, I never tried but there is an option to disable WAL write during rocksdb write, we can try this option and see if that is reducing WAL partition writes or not.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Monday, August 22, 2016 8:06 AM
> To: Igor Fedotov; ceph-devel
> Subject: RE: Odd WAL traffic for BlueStore
>
> Igor,
> I am always seeing this WAL traffic in my 4k tests. Initially, I thought there are some faulty logic on Bluestore side and not honoring min_alloc_size , but, further debugging it seems the traffic is generated from BlueFS/Rocksdb.
> Regarding rocksdb tuning, if you are not running the tests long enough (may be >20 min) you wouldn't be seeing any difference with default.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
> Sent: Monday, August 22, 2016 6:47 AM
> To: ceph-devel
> Subject: Odd WAL traffic for BlueStore
>
> Hi All,
>
> While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.
>
> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>
> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.
>
> The second is split similarly and first 200Gb partition allocated for Raw Block data.
>
> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
> No much difference comparing to default settings though...
>
> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.
>
> Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>
> The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>
>
> Doing 64K changes the picture dramatically:
>
> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs BlueFS counters are ~140Mb and 1K respectively.
>
> Surely write completes much faster in the second case.
>
> No WAL is reported in logs at BlueStore level for both cases.
>
>
> High BlueFS WAL traffic is observed when running subsequent random 4K RW over the store propagated this way too.
>
> I'm wondering why WAL device is involved in the process at all ( writes happen in min_alloc_size blocks) operate and why the traffic and written data volume is so high?
>
> Don't we have some fault affecting 4K performance here?
>
>
> Here are my settings and FIO job specification:
>
> ###########################
>
> [global]
>           debug bluestore = 0/0
>           debug bluefs = 1/0
>           debug bdev = 0/0
>           debug rocksdb = 0/0
>
>           # spread objects over 8 collections
>           osd pool default pg num = 32
>           log to stderr = false
>
> [osd]
>           osd objectstore = bluestore
>           bluestore_block_create = true
>           bluestore_block_db_create = true
>           bluestore_block_wal_create = true
>           bluestore_min_alloc_size = 4096
>           #bluestore_max_alloc_size = #or 4096
>           bluestore_fsck_on_mount = false
>
>           bluestore_block_path=/dev/sdi1
>           bluestore_block_db_path=/dev/sde1
>           bluestore_block_wal_path=/dev/sde2
>
>           enable experimental unrecoverable data corrupting features = bluestore rocksdb memdb
>
>           bluestore_rocksdb_options =
> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>
>           rocksdb_cache_size = 4294967296
>           bluestore_csum = false
>           bluestore_csum_type = none
>           bluestore_bluefs_buffered_io = false
>           bluestore_max_ops = 30000
>           bluestore_max_bytes = 629145600
>           bluestore_buffer_cache_size = 104857600
>           bluestore_block_wal_size = 0
>
>           # use directory= option from fio job file
>           osd data = ${fio_dir}
>
>           # log inside fio_dir
>           log file = ${fio_dir}/log
> ####################################
>
> #FIO jobs
> #################
> # Runs a 4k random write test against the ceph BlueStore.
> [global]
> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH
>
> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration file directory=./fio-bluestore # directory for osd_data
>
> rw=write
> iodepth=16
> size=256m
>
> [bluestore]
> nr_files=63
> bs=4k        # or 64k
> numjobs=32
> #############
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 15:10   ` Igor Fedotov
@ 2016-08-22 15:12     ` Sage Weil
  2016-08-22 16:14       ` Igor Fedotov
  2016-08-22 15:17     ` Haomai Wang
  1 sibling, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-08-22 15:12 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

debug bluestore = 20
debug bluefs = 20
debug rocksdb = 5

Thanks!
sage


On Mon, 22 Aug 2016, Igor Fedotov wrote:

> Will prepare shortly. Any suggestions on desired levels and components?
> 
> 
> On 22.08.2016 18:08, Sage Weil wrote:
> > On Mon, 22 Aug 2016, Igor Fedotov wrote:
> > > Hi All,
> > > 
> > > While testing BlueStore as a standalone storage via FIO plugin I'm
> > > observing
> > > huge traffic to a WAL device.
> > > 
> > > Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
> > > 
> > > The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and
> > > Block
> > > WAL.
> > > 
> > > The second is split similarly and first 200Gb partition allocated for Raw
> > > Block data.
> > > 
> > > RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' . No
> > > much difference comparing to default settings though...
> > > 
> > > As a result when doing 4k sequential write (8Gb total) to a fresh storage
> > > I'm
> > > observing (using nmon and other disk mon tools) significant write traffic
> > > to
> > > WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block
> > > device
> > > traffic is pretty stable at ~30 Mbs.
> > > 
> > > Additionally I inserted an output for BlueFS perf counters on
> > > umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
> > > 
> > > The resulting values are very frustrating: ~28Gb and 4Gb for
> > > l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
> > Yeah, this doesn't seem right.  Have you generated a log to see what is
> > actually happening on each write?  I don't have any bright ideas about
> > what is going wrong here.
> > 
> > sage
> > 
> > > Doing 64K changes the picture dramatically:
> > > 
> > > WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
> > > BlueFS counters are ~140Mb and 1K respectively.
> > > 
> > > Surely write completes much faster in the second case.
> > > 
> > > No WAL is reported in logs at BlueStore level for both cases.
> > > 
> > > 
> > > High BlueFS WAL traffic is observed when running subsequent random 4K RW
> > > over
> > > the store propagated this way too.
> > > 
> > > I'm wondering why WAL device is involved in the process at all ( writes
> > > happen
> > > in min_alloc_size blocks) operate and why the traffic and written data
> > > volume
> > > is so high?
> > > 
> > > Don't we have some fault affecting 4K performance here?
> > > 
> > > 
> > > Here are my settings and FIO job specification:
> > > 
> > > ###########################
> > > 
> > > [global]
> > >          debug bluestore = 0/0
> > >          debug bluefs = 1/0
> > >          debug bdev = 0/0
> > >          debug rocksdb = 0/0
> > > 
> > >          # spread objects over 8 collections
> > >          osd pool default pg num = 32
> > >          log to stderr = false
> > > 
> > > [osd]
> > >          osd objectstore = bluestore
> > >          bluestore_block_create = true
> > >          bluestore_block_db_create = true
> > >          bluestore_block_wal_create = true
> > >          bluestore_min_alloc_size = 4096
> > >          #bluestore_max_alloc_size = #or 4096
> > >          bluestore_fsck_on_mount = false
> > > 
> > >          bluestore_block_path=/dev/sdi1
> > >          bluestore_block_db_path=/dev/sde1
> > >          bluestore_block_wal_path=/dev/sde2
> > > 
> > >          enable experimental unrecoverable data corrupting features =
> > > bluestore
> > > rocksdb memdb
> > > 
> > >          bluestore_rocksdb_options =
> > > "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
> > > 
> > >          rocksdb_cache_size = 4294967296
> > >          bluestore_csum = false
> > >          bluestore_csum_type = none
> > >          bluestore_bluefs_buffered_io = false
> > >          bluestore_max_ops = 30000
> > >          bluestore_max_bytes = 629145600
> > >          bluestore_buffer_cache_size = 104857600
> > >          bluestore_block_wal_size = 0
> > > 
> > >          # use directory= option from fio job file
> > >          osd data = ${fio_dir}
> > > 
> > >          # log inside fio_dir
> > >          log file = ${fio_dir}/log
> > > ####################################
> > > 
> > > #FIO jobs
> > > #################
> > > # Runs a 4k random write test against the ceph BlueStore.
> > > [global]
> > > ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your
> > > LD_LIBRARY_PATH
> > > 
> > > conf=ceph-bluestore-somnath.conf # must point to a valid ceph
> > > configuration
> > > file
> > > directory=./fio-bluestore # directory for osd_data
> > > 
> > > rw=write
> > > iodepth=16
> > > size=256m
> > > 
> > > [bluestore]
> > > nr_files=63
> > > bs=4k        # or 64k
> > > numjobs=32
> > > #############
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 15:10   ` Igor Fedotov
  2016-08-22 15:12     ` Sage Weil
@ 2016-08-22 15:17     ` Haomai Wang
  1 sibling, 0 replies; 16+ messages in thread
From: Haomai Wang @ 2016-08-22 15:17 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Sage Weil, ceph-devel

On Mon, Aug 22, 2016 at 11:10 PM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>
> Will prepare shortly. Any suggestions on desired levels and components?


Hmm, I also hit this problem during my recent WAL performance comparison
between FileJournal and RocksDB. In the normal case we reuse the recycled
log once enough data has been written, as Sage mentioned. But I found it
strange that under stress writing the BlueFS inode metadata updates are
not discarded: the RocksDB WAL log file continues to grow while hundreds
of logs are written, and I can see the BlueFS WAL log file's inode size
exceed 500 MB...

I stopped while looking at DBImpl::SwitchMemtable; maybe you can check
whether in your case the recycled log isn't being switched?

>
>
>
>
> On 22.08.2016 18:08, Sage Weil wrote:
>>
>> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>>>
>>> Hi All,
>>>
>>> While testing BlueStore as a standalone storage via FIO plugin I'm observing
>>> huge traffic to a WAL device.
>>>
>>> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>>>
>>> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block
>>> WAL.
>>>
>>> The second is split similarly and first 200Gb partition allocated for Raw
>>> Block data.
>>>
>>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' . No
>>> much difference comparing to default settings though...
>>>
>>> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm
>>> observing (using nmon and other disk mon tools) significant write traffic to
>>> WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device
>>> traffic is pretty stable at ~30 Mbs.
>>>
>>> Additionally I inserted an output for BlueFS perf counters on
>>> umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>>
>>> The resulting values are very frustrating: ~28Gb and 4Gb for
>>> l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>>
>> Yeah, this doesn't seem right.  Have you generated a log to see what is
>> actually happening on each write?  I don't have any bright ideas about
>> what is going wrong here.
>>
>> sage
>>
>>> Doing 64K changes the picture dramatically:
>>>
>>> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
>>> BlueFS counters are ~140Mb and 1K respectively.
>>>
>>> Surely write completes much faster in the second case.
>>>
>>> No WAL is reported in logs at BlueStore level for both cases.
>>>
>>>
>>> High BlueFS WAL traffic is observed when running subsequent random 4K RW over
>>> the store propagated this way too.
>>>
>>> I'm wondering why WAL device is involved in the process at all ( writes happen
>>> in min_alloc_size blocks) operate and why the traffic and written data volume
>>> is so high?
>>>
>>> Don't we have some fault affecting 4K performance here?
>>>
>>>
>>> Here are my settings and FIO job specification:
>>>
>>> ###########################
>>>
>>> [global]
>>>          debug bluestore = 0/0
>>>          debug bluefs = 1/0
>>>          debug bdev = 0/0
>>>          debug rocksdb = 0/0
>>>
>>>          # spread objects over 8 collections
>>>          osd pool default pg num = 32
>>>          log to stderr = false
>>>
>>> [osd]
>>>          osd objectstore = bluestore
>>>          bluestore_block_create = true
>>>          bluestore_block_db_create = true
>>>          bluestore_block_wal_create = true
>>>          bluestore_min_alloc_size = 4096
>>>          #bluestore_max_alloc_size = #or 4096
>>>          bluestore_fsck_on_mount = false
>>>
>>>          bluestore_block_path=/dev/sdi1
>>>          bluestore_block_db_path=/dev/sde1
>>>          bluestore_block_wal_path=/dev/sde2
>>>
>>>          enable experimental unrecoverable data corrupting features = bluestore
>>> rocksdb memdb
>>>
>>>          bluestore_rocksdb_options =
>>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>>
>>>          rocksdb_cache_size = 4294967296
>>>          bluestore_csum = false
>>>          bluestore_csum_type = none
>>>          bluestore_bluefs_buffered_io = false
>>>          bluestore_max_ops = 30000
>>>          bluestore_max_bytes = 629145600
>>>          bluestore_buffer_cache_size = 104857600
>>>          bluestore_block_wal_size = 0
>>>
>>>          # use directory= option from fio job file
>>>          osd data = ${fio_dir}
>>>
>>>          # log inside fio_dir
>>>          log file = ${fio_dir}/log
>>> ####################################
>>>
>>> #FIO jobs
>>> #################
>>> # Runs a 4k random write test against the ceph BlueStore.
>>> [global]
>>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your
>>> LD_LIBRARY_PATH
>>>
>>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration
>>> file
>>> directory=./fio-bluestore # directory for osd_data
>>>
>>> rw=write
>>> iodepth=16
>>> size=256m
>>>
>>> [bluestore]
>>> nr_files=63
>>> bs=4k        # or 64k
>>> numjobs=32
>>> #############
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Odd WAL traffic for BlueStore
  2016-08-22 15:12     ` Igor Fedotov
@ 2016-08-22 15:53       ` Somnath Roy
  2016-08-22 16:13         ` Somnath Roy
  0 siblings, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-08-22 15:53 UTC (permalink / raw)
  To: Igor Fedotov, ceph-devel

Set disableWAL=true in the rocksdb options in ceph.conf, or do this (if the former is buggy or doesn't work)...

RocksDBStore::submit_transaction and RocksDBStore::submit_transaction_sync have this option set explicitly; make it 'true':

woptions.disableWAL = true
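
In context it should look roughly like this (a sketch from memory, not
exact code -- member names in RocksDBStore may differ slightly between
versions):

     // in RocksDBStore::submit_transaction() / submit_transaction_sync()
     rocksdb::WriteOptions woptions;
     woptions.disableWAL = true;   // was false; skips the RocksDB write-ahead log
     db->Write(woptions, &bat);    // bat: the rocksdb::WriteBatch built for this transaction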

Thanks & Regards
Somnath


-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@mirantis.com] 
Sent: Monday, August 22, 2016 8:12 AM
To: Somnath Roy; ceph-devel
Subject: Re: Odd WAL traffic for BlueStore

Can you point it out? Don't see any...


On 22.08.2016 18:08, Somnath Roy wrote:
> Another point, I never tried but there is an option to disable WAL write during rocksdb write, we can try this option and see if that is reducing WAL partition writes or not.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Monday, August 22, 2016 8:06 AM
> To: Igor Fedotov; ceph-devel
> Subject: RE: Odd WAL traffic for BlueStore
>
> Igor,
> I am always seeing this WAL traffic in my 4k tests. Initially, I thought there are some faulty logic on Bluestore side and not honoring min_alloc_size , but, further debugging it seems the traffic is generated from BlueFS/Rocksdb.
> Regarding rocksdb tuning, if you are not running the tests long enough (may be >20 min) you wouldn't be seeing any difference with default.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
> Sent: Monday, August 22, 2016 6:47 AM
> To: ceph-devel
> Subject: Odd WAL traffic for BlueStore
>
> Hi All,
>
> While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.
>
> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>
> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.
>
> The second is split similarly and first 200Gb partition allocated for Raw Block data.
>
> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
> No much difference comparing to default settings though...
>
> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.
>
> Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>
> The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>
>
> Doing 64K changes the picture dramatically:
>
> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs BlueFS counters are ~140Mb and 1K respectively.
>
> Surely write completes much faster in the second case.
>
> No WAL is reported in logs at BlueStore level for both cases.
>
>
> High BlueFS WAL traffic is observed when running subsequent random 4K RW over the store propagated this way too.
>
> I'm wondering why WAL device is involved in the process at all ( writes happen in min_alloc_size blocks) operate and why the traffic and written data volume is so high?
>
> Don't we have some fault affecting 4K performance here?
>
>
> Here are my settings and FIO job specification:
>
> ###########################
>
> [global]
>           debug bluestore = 0/0
>           debug bluefs = 1/0
>           debug bdev = 0/0
>           debug rocksdb = 0/0
>
>           # spread objects over 8 collections
>           osd pool default pg num = 32
>           log to stderr = false
>
> [osd]
>           osd objectstore = bluestore
>           bluestore_block_create = true
>           bluestore_block_db_create = true
>           bluestore_block_wal_create = true
>           bluestore_min_alloc_size = 4096
>           #bluestore_max_alloc_size = #or 4096
>           bluestore_fsck_on_mount = false
>
>           bluestore_block_path=/dev/sdi1
>           bluestore_block_db_path=/dev/sde1
>           bluestore_block_wal_path=/dev/sde2
>
>           enable experimental unrecoverable data corrupting features = bluestore rocksdb memdb
>
>           bluestore_rocksdb_options =
> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>
>           rocksdb_cache_size = 4294967296
>           bluestore_csum = false
>           bluestore_csum_type = none
>           bluestore_bluefs_buffered_io = false
>           bluestore_max_ops = 30000
>           bluestore_max_bytes = 629145600
>           bluestore_buffer_cache_size = 104857600
>           bluestore_block_wal_size = 0
>
>           # use directory= option from fio job file
>           osd data = ${fio_dir}
>
>           # log inside fio_dir
>           log file = ${fio_dir}/log
> ####################################
>
> #FIO jobs
> #################
> # Runs a 4k random write test against the ceph BlueStore.
> [global]
> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH
>
> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration file directory=./fio-bluestore # directory for osd_data
>
> rw=write
> iodepth=16
> size=256m
>
> [bluestore]
> nr_files=63
> bs=4k        # or 64k
> numjobs=32
> #############
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> N     r  y   b X  ǧv ^ )޺{.n +   z ]z   {ay \x1dʇڙ ,j   f   h   z \x1e w       j:+v   w j m         zZ+     ݢj"  ! i


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Odd WAL traffic for BlueStore
  2016-08-22 15:53       ` Somnath Roy
@ 2016-08-22 16:13         ` Somnath Roy
  2016-08-22 16:21           ` Igor Fedotov
  0 siblings, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-08-22 16:13 UTC (permalink / raw)
  To: Somnath Roy, Igor Fedotov, ceph-devel

Igor,
I just verified setting 'disableWAL=true' in the ceph.conf rocksdb options and it is working as expected. I am not seeing any WAL traffic now.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Monday, August 22, 2016 8:53 AM
To: Igor Fedotov; ceph-devel
Subject: RE: Odd WAL traffic for BlueStore

Set disableWAL=true in the rocksdb options in ceph.conf, or do this (if the former is buggy or not working):

RocksDBStore::submit_transaction and RocksDBStore::submit_transaction_sync have this option set explicitly; make it 'true'.

woptions.disableWAL = true
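
For reference, a self-contained sketch of the flag itself, using the plain RocksDB C++ API rather than the Ceph wrapper (the database path and the key/value below are made up for illustration):

    #include <cassert>
    #include <rocksdb/db.h>

    int main() {
      rocksdb::Options opts;
      opts.create_if_missing = true;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/waltest", &db);
      assert(s.ok());

      rocksdb::WriteOptions wopts;
      wopts.disableWAL = true;   // skip the write-ahead log for this write

      // The update is durable only once the memtable is flushed to an SST.
      s = db->Put(wopts, "key", "value");
      assert(s.ok());

      delete db;
      return 0;
    }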

Thanks & Regards
Somnath


-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@mirantis.com] 
Sent: Monday, August 22, 2016 8:12 AM
To: Somnath Roy; ceph-devel
Subject: Re: Odd WAL traffic for BlueStore

Can you point it out? Don't see any...


On 22.08.2016 18:08, Somnath Roy wrote:
> Another point, I never tried but there is an option to disable WAL write during rocksdb write, we can try this option and see if that is reducing WAL partition writes or not.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Monday, August 22, 2016 8:06 AM
> To: Igor Fedotov; ceph-devel
> Subject: RE: Odd WAL traffic for BlueStore
>
> Igor,
> I am always seeing this WAL traffic in my 4k tests. Initially, I thought there are some faulty logic on Bluestore side and not honoring min_alloc_size , but, further debugging it seems the traffic is generated from BlueFS/Rocksdb.
> Regarding rocksdb tuning, if you are not running the tests long enough (may be >20 min) you wouldn't be seeing any difference with default.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
> Sent: Monday, August 22, 2016 6:47 AM
> To: ceph-devel
> Subject: Odd WAL traffic for BlueStore
>
> Hi All,
>
> While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.
>
> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>
> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.
>
> The second is split similarly and first 200Gb partition allocated for Raw Block data.
>
> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
> No much difference comparing to default settings though...
>
> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.
>
> Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>
> The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>
>
> Doing 64K changes the picture dramatically:
>
> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs BlueFS counters are ~140Mb and 1K respectively.
>
> Surely write completes much faster in the second case.
>
> No WAL is reported in logs at BlueStore level for both cases.
>
>
> High BlueFS WAL traffic is observed when running subsequent random 4K RW over the store propagated this way too.
>
> I'm wondering why WAL device is involved in the process at all ( writes happen in min_alloc_size blocks) operate and why the traffic and written data volume is so high?
>
> Don't we have some fault affecting 4K performance here?
>
>
> Here are my settings and FIO job specification:
>
> ###########################
>
> [global]
>           debug bluestore = 0/0
>           debug bluefs = 1/0
>           debug bdev = 0/0
>           debug rocksdb = 0/0
>
>           # spread objects over 8 collections
>           osd pool default pg num = 32
>           log to stderr = false
>
> [osd]
>           osd objectstore = bluestore
>           bluestore_block_create = true
>           bluestore_block_db_create = true
>           bluestore_block_wal_create = true
>           bluestore_min_alloc_size = 4096
>           #bluestore_max_alloc_size = #or 4096
>           bluestore_fsck_on_mount = false
>
>           bluestore_block_path=/dev/sdi1
>           bluestore_block_db_path=/dev/sde1
>           bluestore_block_wal_path=/dev/sde2
>
>           enable experimental unrecoverable data corrupting features = bluestore rocksdb memdb
>
>           bluestore_rocksdb_options =
> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>
>           rocksdb_cache_size = 4294967296
>           bluestore_csum = false
>           bluestore_csum_type = none
>           bluestore_bluefs_buffered_io = false
>           bluestore_max_ops = 30000
>           bluestore_max_bytes = 629145600
>           bluestore_buffer_cache_size = 104857600
>           bluestore_block_wal_size = 0
>
>           # use directory= option from fio job file
>           osd data = ${fio_dir}
>
>           # log inside fio_dir
>           log file = ${fio_dir}/log
> ####################################
>
> #FIO jobs
> #################
> # Runs a 4k random write test against the ceph BlueStore.
> [global]
> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH
>
> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration file directory=./fio-bluestore # directory for osd_data
>
> rw=write
> iodepth=16
> size=256m
>
> [bluestore]
> nr_files=63
> bs=4k        # or 64k
> numjobs=32
> #############
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 15:12     ` Sage Weil
@ 2016-08-22 16:14       ` Igor Fedotov
  2016-08-22 16:55         ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Fedotov @ 2016-08-22 16:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage,

here it is

https://drive.google.com/open?id=0B-4q9QFReegLZmxrd19VYTc2aVU


debug bluestore was set to 10 to reduce the log's size.

nr_files=8

numjobs=1


A total of 89 MB was written by fio.

Please note the following lines at the end:

2016-08-22 15:53:01.717433 7fbf42ffd700  0 bluefs umount
2016-08-22 15:53:01.717440 7fbf42ffd700  0 bluefs 859013499 1069409

These are the bluefs perf counters mentioned above.

859 MB for 'wal_bytes_written'!
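
(For scale, assuming the counter is in bytes, that is 859013499 bytes of BlueFS WAL writes for the 89 MB submitted by fio, i.e. roughly a 9-10x write amplification on the WAL alone, on top of the data-device traffic.)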

Please let me know if you need anything else.

Thanks,
Igor

On 22.08.2016 18:12, Sage Weil wrote:
> debug bluestore = 20
> debug bluefs = 20
> debug rocksdb = 5
>
> Thanks!
> sage
>
>
> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>
>> Will prepare shortly. Any suggestions on desired levels and components?
>>
>>
>> On 22.08.2016 18:08, Sage Weil wrote:
>>> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>>>> Hi All,
>>>>
>>>> While testing BlueStore as a standalone storage via FIO plugin I'm
>>>> observing
>>>> huge traffic to a WAL device.
>>>>
>>>> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>>>>
>>>> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and
>>>> Block
>>>> WAL.
>>>>
>>>> The second is split similarly and first 200Gb partition allocated for Raw
>>>> Block data.
>>>>
>>>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' . No
>>>> much difference comparing to default settings though...
>>>>
>>>> As a result when doing 4k sequential write (8Gb total) to a fresh storage
>>>> I'm
>>>> observing (using nmon and other disk mon tools) significant write traffic
>>>> to
>>>> WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block
>>>> device
>>>> traffic is pretty stable at ~30 Mbs.
>>>>
>>>> Additionally I inserted an output for BlueFS perf counters on
>>>> umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>>>
>>>> The resulting values are very frustrating: ~28Gb and 4Gb for
>>>> l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>>> Yeah, this doesn't seem right.  Have you generated a log to see what is
>>> actually happening on each write?  I don't have any bright ideas about
>>> what is going wrong here.
>>>
>>> sage
>>>
>>>> Doing 64K changes the picture dramatically:
>>>>
>>>> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
>>>> BlueFS counters are ~140Mb and 1K respectively.
>>>>
>>>> Surely write completes much faster in the second case.
>>>>
>>>> No WAL is reported in logs at BlueStore level for both cases.
>>>>
>>>>
>>>> High BlueFS WAL traffic is observed when running subsequent random 4K RW
>>>> over
>>>> the store propagated this way too.
>>>>
>>>> I'm wondering why WAL device is involved in the process at all ( writes
>>>> happen
>>>> in min_alloc_size blocks) operate and why the traffic and written data
>>>> volume
>>>> is so high?
>>>>
>>>> Don't we have some fault affecting 4K performance here?
>>>>
>>>>
>>>> Here are my settings and FIO job specification:
>>>>
>>>> ###########################
>>>>
>>>> [global]
>>>>           debug bluestore = 0/0
>>>>           debug bluefs = 1/0
>>>>           debug bdev = 0/0
>>>>           debug rocksdb = 0/0
>>>>
>>>>           # spread objects over 8 collections
>>>>           osd pool default pg num = 32
>>>>           log to stderr = false
>>>>
>>>> [osd]
>>>>           osd objectstore = bluestore
>>>>           bluestore_block_create = true
>>>>           bluestore_block_db_create = true
>>>>           bluestore_block_wal_create = true
>>>>           bluestore_min_alloc_size = 4096
>>>>           #bluestore_max_alloc_size = #or 4096
>>>>           bluestore_fsck_on_mount = false
>>>>
>>>>           bluestore_block_path=/dev/sdi1
>>>>           bluestore_block_db_path=/dev/sde1
>>>>           bluestore_block_wal_path=/dev/sde2
>>>>
>>>>           enable experimental unrecoverable data corrupting features =
>>>> bluestore
>>>> rocksdb memdb
>>>>
>>>>           bluestore_rocksdb_options =
>>>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>>>
>>>>           rocksdb_cache_size = 4294967296
>>>>           bluestore_csum = false
>>>>           bluestore_csum_type = none
>>>>           bluestore_bluefs_buffered_io = false
>>>>           bluestore_max_ops = 30000
>>>>           bluestore_max_bytes = 629145600
>>>>           bluestore_buffer_cache_size = 104857600
>>>>           bluestore_block_wal_size = 0
>>>>
>>>>           # use directory= option from fio job file
>>>>           osd data = ${fio_dir}
>>>>
>>>>           # log inside fio_dir
>>>>           log file = ${fio_dir}/log
>>>> ####################################
>>>>
>>>> #FIO jobs
>>>> #################
>>>> # Runs a 4k random write test against the ceph BlueStore.
>>>> [global]
>>>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your
>>>> LD_LIBRARY_PATH
>>>>
>>>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph
>>>> configuration
>>>> file
>>>> directory=./fio-bluestore # directory for osd_data
>>>>
>>>> rw=write
>>>> iodepth=16
>>>> size=256m
>>>>
>>>> [bluestore]
>>>> nr_files=63
>>>> bs=4k        # or 64k
>>>> numjobs=32
>>>> #############
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 16:13         ` Somnath Roy
@ 2016-08-22 16:21           ` Igor Fedotov
  2016-08-22 16:31             ` Somnath Roy
  0 siblings, 1 reply; 16+ messages in thread
From: Igor Fedotov @ 2016-08-22 16:21 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

yeah, it's much better.

Then the questions are:

- Is it safe to disable WAL?

- Should we do that by default?


Thanks,

Igor.


On 22.08.2016 19:13, Somnath Roy wrote:
> Igor,
> I just verified setting 'disableWAL= true' in the ceph.conf rocksdb option and it is working as expected. I am not seeing any WAL traffic now.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Monday, August 22, 2016 8:53 AM
> To: Igor Fedotov; ceph-devel
> Subject: RE: Odd WAL traffic for BlueStore
>
> disableWAL=true in the rocksdb option in ceph.conf or do this (if previous is buggy or not working)..
>
> RocksDBStore::submit_transaction and RocksDBStore::submit_transaction_sync  has this option set explicitly , make it 'true'.
>
> woptions.disableWAL = true
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Monday, August 22, 2016 8:12 AM
> To: Somnath Roy; ceph-devel
> Subject: Re: Odd WAL traffic for BlueStore
>
> Can you point it out? Don't see any...
>
>
> On 22.08.2016 18:08, Somnath Roy wrote:
>> Another point, I never tried but there is an option to disable WAL write during rocksdb write, we can try this option and see if that is reducing WAL partition writes or not.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Monday, August 22, 2016 8:06 AM
>> To: Igor Fedotov; ceph-devel
>> Subject: RE: Odd WAL traffic for BlueStore
>>
>> Igor,
>> I am always seeing this WAL traffic in my 4k tests. Initially, I thought there are some faulty logic on Bluestore side and not honoring min_alloc_size , but, further debugging it seems the traffic is generated from BlueFS/Rocksdb.
>> Regarding rocksdb tuning, if you are not running the tests long enough (may be >20 min) you wouldn't be seeing any difference with default.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
>> Sent: Monday, August 22, 2016 6:47 AM
>> To: ceph-devel
>> Subject: Odd WAL traffic for BlueStore
>>
>> Hi All,
>>
>> While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.
>>
>> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL SSDSC2BX480G4L
>>
>> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.
>>
>> The second is split similarly and first 200Gb partition allocated for Raw Block data.
>>
>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
>> No much difference comparing to default settings though...
>>
>> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.
>>
>> Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>
>> The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>>
>>
>> Doing 64K changes the picture dramatically:
>>
>> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs BlueFS counters are ~140Mb and 1K respectively.
>>
>> Surely write completes much faster in the second case.
>>
>> No WAL is reported in logs at BlueStore level for both cases.
>>
>>
>> High BlueFS WAL traffic is observed when running subsequent random 4K RW over the store propagated this way too.
>>
>> I'm wondering why WAL device is involved in the process at all ( writes happen in min_alloc_size blocks) operate and why the traffic and written data volume is so high?
>>
>> Don't we have some fault affecting 4K performance here?
>>
>>
>> Here are my settings and FIO job specification:
>>
>> ###########################
>>
>> [global]
>>            debug bluestore = 0/0
>>            debug bluefs = 1/0
>>            debug bdev = 0/0
>>            debug rocksdb = 0/0
>>
>>            # spread objects over 8 collections
>>            osd pool default pg num = 32
>>            log to stderr = false
>>
>> [osd]
>>            osd objectstore = bluestore
>>            bluestore_block_create = true
>>            bluestore_block_db_create = true
>>            bluestore_block_wal_create = true
>>            bluestore_min_alloc_size = 4096
>>            #bluestore_max_alloc_size = #or 4096
>>            bluestore_fsck_on_mount = false
>>
>>            bluestore_block_path=/dev/sdi1
>>            bluestore_block_db_path=/dev/sde1
>>            bluestore_block_wal_path=/dev/sde2
>>
>>            enable experimental unrecoverable data corrupting features = bluestore rocksdb memdb
>>
>>            bluestore_rocksdb_options =
>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>
>>            rocksdb_cache_size = 4294967296
>>            bluestore_csum = false
>>            bluestore_csum_type = none
>>            bluestore_bluefs_buffered_io = false
>>            bluestore_max_ops = 30000
>>            bluestore_max_bytes = 629145600
>>            bluestore_buffer_cache_size = 104857600
>>            bluestore_block_wal_size = 0
>>
>>            # use directory= option from fio job file
>>            osd data = ${fio_dir}
>>
>>            # log inside fio_dir
>>            log file = ${fio_dir}/log
>> ####################################
>>
>> #FIO jobs
>> #################
>> # Runs a 4k random write test against the ceph BlueStore.
>> [global]
>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in your LD_LIBRARY_PATH
>>
>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph configuration file directory=./fio-bluestore # directory for osd_data
>>
>> rw=write
>> iodepth=16
>> size=256m
>>
>> [bluestore]
>> nr_files=63
>> bs=4k        # or 64k
>> numjobs=32
>> #############
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Odd WAL traffic for BlueStore
  2016-08-22 16:21           ` Igor Fedotov
@ 2016-08-22 16:31             ` Somnath Roy
  0 siblings, 0 replies; 16+ messages in thread
From: Somnath Roy @ 2016-08-22 16:31 UTC (permalink / raw)
  To: Igor Fedotov, ceph-devel

Yes, it is not safe.

-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@mirantis.com] 
Sent: Monday, August 22, 2016 9:21 AM
To: Somnath Roy; ceph-devel
Subject: Re: Odd WAL traffic for BlueStore

yeah, it's much better.

Then the questions are:

- Is it safe to disable WAL?

- Should we do that by default?


Thanks,

Igor.


On 22.08.2016 19:13, Somnath Roy wrote:
> Igor,
> I just verified setting 'disableWAL= true' in the ceph.conf rocksdb option and it is working as expected. I am not seeing any WAL traffic now.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Monday, August 22, 2016 8:53 AM
> To: Igor Fedotov; ceph-devel
> Subject: RE: Odd WAL traffic for BlueStore
>
> disableWAL=true in the rocksdb option in ceph.conf or do this (if previous is buggy or not working)..
>
> RocksDBStore::submit_transaction and RocksDBStore::submit_transaction_sync  has this option set explicitly , make it 'true'.
>
> woptions.disableWAL = true
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Monday, August 22, 2016 8:12 AM
> To: Somnath Roy; ceph-devel
> Subject: Re: Odd WAL traffic for BlueStore
>
> Can you point it out? Don't see any...
>
>
> On 22.08.2016 18:08, Somnath Roy wrote:
>> Another point, I never tried but there is an option to disable WAL write during rocksdb write, we can try this option and see if that is reducing WAL partition writes or not.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Monday, August 22, 2016 8:06 AM
>> To: Igor Fedotov; ceph-devel
>> Subject: RE: Odd WAL traffic for BlueStore
>>
>> Igor,
>> I am always seeing this WAL traffic in my 4k tests. Initially, I thought there are some faulty logic on Bluestore side and not honoring min_alloc_size , but, further debugging it seems the traffic is generated from BlueFS/Rocksdb.
>> Regarding rocksdb tuning, if you are not running the tests long enough (may be >20 min) you wouldn't be seeing any difference with default.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
>> Sent: Monday, August 22, 2016 6:47 AM
>> To: ceph-devel
>> Subject: Odd WAL traffic for BlueStore
>>
>> Hi All,
>>
>> While testing BlueStore as a standalone storage via FIO plugin I'm observing huge traffic to a WAL device.
>>
>> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL 
>> SSDSC2BX480G4L
>>
>> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB and Block WAL.
>>
>> The second is split similarly and first 200Gb partition allocated for Raw Block data.
>>
>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning' .
>> No much difference comparing to default settings though...
>>
>> As a result when doing 4k sequential write (8Gb total) to a fresh storage I'm observing (using nmon and other disk mon tools) significant write traffic to WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block device traffic is pretty stable at ~30 Mbs.
>>
>> Additionally I inserted an output for BlueFS perf counters on umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>
>> The resulting values are very frustrating: ~28Gb and 4Gb for l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>>
>>
>> Doing 64K changes the picture dramatically:
>>
>> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs BlueFS counters are ~140Mb and 1K respectively.
>>
>> Surely write completes much faster in the second case.
>>
>> No WAL is reported in logs at BlueStore level for both cases.
>>
>>
>> High BlueFS WAL traffic is observed when running subsequent random 4K RW over the store propagated this way too.
>>
>> I'm wondering why WAL device is involved in the process at all ( writes happen in min_alloc_size blocks) operate and why the traffic and written data volume is so high?
>>
>> Don't we have some fault affecting 4K performance here?
>>
>>
>> Here are my settings and FIO job specification:
>>
>> ###########################
>>
>> [global]
>>            debug bluestore = 0/0
>>            debug bluefs = 1/0
>>            debug bdev = 0/0
>>            debug rocksdb = 0/0
>>
>>            # spread objects over 8 collections
>>            osd pool default pg num = 32
>>            log to stderr = false
>>
>> [osd]
>>            osd objectstore = bluestore
>>            bluestore_block_create = true
>>            bluestore_block_db_create = true
>>            bluestore_block_wal_create = true
>>            bluestore_min_alloc_size = 4096
>>            #bluestore_max_alloc_size = #or 4096
>>            bluestore_fsck_on_mount = false
>>
>>            bluestore_block_path=/dev/sdi1
>>            bluestore_block_db_path=/dev/sde1
>>            bluestore_block_wal_path=/dev/sde2
>>
>>            enable experimental unrecoverable data corrupting features 
>> = bluestore rocksdb memdb
>>
>>            bluestore_rocksdb_options = 
>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>
>>            rocksdb_cache_size = 4294967296
>>            bluestore_csum = false
>>            bluestore_csum_type = none
>>            bluestore_bluefs_buffered_io = false
>>            bluestore_max_ops = 30000
>>            bluestore_max_bytes = 629145600
>>            bluestore_buffer_cache_size = 104857600
>>            bluestore_block_wal_size = 0
>>
>>            # use directory= option from fio job file
>>            osd data = ${fio_dir}
>>
>>            # log inside fio_dir
>>            log file = ${fio_dir}/log
>> ####################################
>>
>> #FIO jobs
>> #################
>> # Runs a 4k random write test against the ceph BlueStore.
>> [global]
>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in 
>> your LD_LIBRARY_PATH
>>
>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph 
>> configuration file directory=./fio-bluestore # directory for osd_data
>>
>> rw=write
>> iodepth=16
>> size=256m
>>
>> [bluestore]
>> nr_files=63
>> bs=4k        # or 64k
>> numjobs=32
>> #############
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 16:14       ` Igor Fedotov
@ 2016-08-22 16:55         ` Sage Weil
  2016-08-24 18:28           ` Igor Fedotov
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-08-22 16:55 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Mon, 22 Aug 2016, Igor Fedotov wrote:
> Sage,
> 
> here it is
> 
> https://drive.google.com/open?id=0B-4q9QFReegLZmxrd19VYTc2aVU
> 
> 
> debug bluestore was set to 10 to reduce the log's size.
> 
> nr_files=8
> 
> numjobs=1
> 
> 
> Total 89Mb was written from fio.
> 
> Please note following lines at the end
> 
> 2016-08-22 15:53:01.717433 7fbf42ffd700  0 bluefs umount
> 2016-08-22 15:53:01.717440 7fbf42ffd700  0 bluefs 859013499 1069409
> 
> These are mentioned bluefs perf counters.
> 
> 859Mb for 'wal_bytes_written'!
> 
> Please let me know if you need anything else.

1) We get about 60% of the way through the workload before rocksdb logs 
start getting recycled.  I just pushed a PR that preconditions rocksdb on 
mkfs to get rid of this weirdness:

	https://github.com/ceph/ceph/pull/10814

2) Each 4K is generating a ~10K rocksdb write.  I think this is just the 
size of the onode.  So, same thing we've been working on optimizing.
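
(Back-of-the-envelope, assuming one onode update per 4 KB client write: 89 MB / 4 KB ≈ 22.8K transactions, times ~10 KB each ≈ 230 MB of KV traffic from onode rewrites alone.)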

I don't think there is anything else odd going on...

sage



> 
> Thanks,
> Igor
> 
> On 22.08.2016 18:12, Sage Weil wrote:
> > debug bluestore = 20
> > debug bluefs = 20
> > debug rocksdb = 5
> > 
> > Thanks!
> > sage
> > 
> > 
> > On Mon, 22 Aug 2016, Igor Fedotov wrote:
> > 
> > > Will prepare shortly. Any suggestions on desired levels and components?
> > > 
> > > 
> > > On 22.08.2016 18:08, Sage Weil wrote:
> > > > On Mon, 22 Aug 2016, Igor Fedotov wrote:
> > > > > Hi All,
> > > > > 
> > > > > While testing BlueStore as a standalone storage via FIO plugin I'm
> > > > > observing
> > > > > huge traffic to a WAL device.
> > > > > 
> > > > > Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL
> > > > > SSDSC2BX480G4L
> > > > > 
> > > > > The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB
> > > > > and
> > > > > Block
> > > > > WAL.
> > > > > 
> > > > > The second is split similarly and first 200Gb partition allocated for
> > > > > Raw
> > > > > Block data.
> > > > > 
> > > > > RocksDB settings are set as Somnath suggested in his 'RocksDB tuning'
> > > > > . No
> > > > > much difference comparing to default settings though...
> > > > > 
> > > > > As a result when doing 4k sequential write (8Gb total) to a fresh
> > > > > storage
> > > > > I'm
> > > > > observing (using nmon and other disk mon tools) significant write
> > > > > traffic
> > > > > to
> > > > > WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block
> > > > > device
> > > > > traffic is pretty stable at ~30 Mbs.
> > > > > 
> > > > > Additionally I inserted an output for BlueFS perf counters on
> > > > > umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
> > > > > 
> > > > > The resulting values are very frustrating: ~28Gb and 4Gb for
> > > > > l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
> > > > Yeah, this doesn't seem right.  Have you generated a log to see what is
> > > > actually happening on each write?  I don't have any bright ideas about
> > > > what is going wrong here.
> > > > 
> > > > sage
> > > > 
> > > > > Doing 64K changes the picture dramatically:
> > > > > 
> > > > > WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
> > > > > BlueFS counters are ~140Mb and 1K respectively.
> > > > > 
> > > > > Surely write completes much faster in the second case.
> > > > > 
> > > > > No WAL is reported in logs at BlueStore level for both cases.
> > > > > 
> > > > > 
> > > > > High BlueFS WAL traffic is observed when running subsequent random 4K
> > > > > RW
> > > > > over
> > > > > the store propagated this way too.
> > > > > 
> > > > > I'm wondering why WAL device is involved in the process at all (
> > > > > writes
> > > > > happen
> > > > > in min_alloc_size blocks) operate and why the traffic and written data
> > > > > volume
> > > > > is so high?
> > > > > 
> > > > > Don't we have some fault affecting 4K performance here?
> > > > > 
> > > > > 
> > > > > Here are my settings and FIO job specification:
> > > > > 
> > > > > ###########################
> > > > > 
> > > > > [global]
> > > > >           debug bluestore = 0/0
> > > > >           debug bluefs = 1/0
> > > > >           debug bdev = 0/0
> > > > >           debug rocksdb = 0/0
> > > > > 
> > > > >           # spread objects over 8 collections
> > > > >           osd pool default pg num = 32
> > > > >           log to stderr = false
> > > > > 
> > > > > [osd]
> > > > >           osd objectstore = bluestore
> > > > >           bluestore_block_create = true
> > > > >           bluestore_block_db_create = true
> > > > >           bluestore_block_wal_create = true
> > > > >           bluestore_min_alloc_size = 4096
> > > > >           #bluestore_max_alloc_size = #or 4096
> > > > >           bluestore_fsck_on_mount = false
> > > > > 
> > > > >           bluestore_block_path=/dev/sdi1
> > > > >           bluestore_block_db_path=/dev/sde1
> > > > >           bluestore_block_wal_path=/dev/sde2
> > > > > 
> > > > >           enable experimental unrecoverable data corrupting features =
> > > > > bluestore
> > > > > rocksdb memdb
> > > > > 
> > > > >           bluestore_rocksdb_options =
> > > > > "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
> > > > > 
> > > > >           rocksdb_cache_size = 4294967296
> > > > >           bluestore_csum = false
> > > > >           bluestore_csum_type = none
> > > > >           bluestore_bluefs_buffered_io = false
> > > > >           bluestore_max_ops = 30000
> > > > >           bluestore_max_bytes = 629145600
> > > > >           bluestore_buffer_cache_size = 104857600
> > > > >           bluestore_block_wal_size = 0
> > > > > 
> > > > >           # use directory= option from fio job file
> > > > >           osd data = ${fio_dir}
> > > > > 
> > > > >           # log inside fio_dir
> > > > >           log file = ${fio_dir}/log
> > > > > ####################################
> > > > > 
> > > > > #FIO jobs
> > > > > #################
> > > > > # Runs a 4k random write test against the ceph BlueStore.
> > > > > [global]
> > > > > ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in
> > > > > your
> > > > > LD_LIBRARY_PATH
> > > > > 
> > > > > conf=ceph-bluestore-somnath.conf # must point to a valid ceph
> > > > > configuration
> > > > > file
> > > > > directory=./fio-bluestore # directory for osd_data
> > > > > 
> > > > > rw=write
> > > > > iodepth=16
> > > > > size=256m
> > > > > 
> > > > > [bluestore]
> > > > > nr_files=63
> > > > > bs=4k        # or 64k
> > > > > numjobs=32
> > > > > #############
> > > > > 
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Odd WAL traffic for BlueStore
  2016-08-22 16:55         ` Sage Weil
@ 2016-08-24 18:28           ` Igor Fedotov
  0 siblings, 0 replies; 16+ messages in thread
From: Igor Fedotov @ 2016-08-24 18:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



On 22.08.2016 19:55, Sage Weil wrote:
>
> 2) Each 4K is generating a ~10K rocksdb write.  I think this is just the
> size of the onode.  So, same thing we've been working on optimizing.
>
> I don't think there is anything else odd going on...
>
Sage, thanks for the diagnosis. Indeed, the rocksdb traffic is huge.

Now let me share some thoughts w.r.t. onode/blob size reduction.
Currently I'm getting 26 KB per onode for a 4 MB object filled with 4 KB 
sequential writes and 4 KB min_alloc_size. Csum is off.
Hence a subsequent random 4 KB overwrite on such an object 
triggers a 4 KB disk write and a 26 KB RocksDB overwrite.
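
(For reference, a 4 MB object written in 4 KB units carries 4 MB / 4 KB = 1024 allocation units, so 26 KB of onode works out to roughly 26 bytes of encoded metadata per unit.)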

Changing min_alloc_size to 64K for both test cases gives 1.6 KB per onode!
And the 4 KB overwrite case then triggers a 4 KB disk write plus a 4 KB (WAL) + 1.6 KB 
RocksDB overwrite.
And I can indeed see a performance gain in the second setup for both 
sequential and random writes.

Hence I'm wondering whether the onode/blob diet makes much sense at all. One has 
to cut the onode/blob size by a factor of 4.6 (= 26 / 5.6) to win against a simple 
min_alloc_size increase.
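
(The 5.6 here is the per-4 KB-write RocksDB traffic in the 64K min_alloc_size case: 4 KB of WAL data plus the 1.6 KB onode, versus the 26 KB onode rewrite with 4 KB min_alloc_size.)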

And another point: speaking of cutting the onode size for large objects, 
wouldn't it be simpler just to split such an object into 4 MB (or 
whatever value) shards and handle them separately as standalone objects? IMHO 
we just need a simple name+offset -> new name mapper on top of the 
current code for that.
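
For what it's worth, a minimal sketch of the kind of name+offset -> shard-name mapper meant here (the naming scheme and the 4 MB shard size are illustrative assumptions, not existing BlueStore code):

    #include <cstdint>
    #include <string>
    #include <utility>

    // Map a logical (object name, offset) pair to the name of the shard
    // object holding that offset, plus the offset within that shard.
    static std::pair<std::string, uint64_t>
    map_to_shard(const std::string& name, uint64_t offset,
                 uint64_t shard_size = 4ull << 20 /* 4 MB */)
    {
      const uint64_t shard_no = offset / shard_size;
      return { name + ".shard." + std::to_string(shard_no),
               offset % shard_size };
    }

A write at (name, offset, length) crossing a shard boundary would simply be split into per-shard sub-writes before hitting the existing object code.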


> sage
>
>
>
>> Thanks,
>> Igor
>>
>> On 22.08.2016 18:12, Sage Weil wrote:
>>> debug bluestore = 20
>>> debug bluefs = 20
>>> debug rocksdb = 5
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>>>
>>>> Will prepare shortly. Any suggestions on desired levels and components?
>>>>
>>>>
>>>> On 22.08.2016 18:08, Sage Weil wrote:
>>>>> On Mon, 22 Aug 2016, Igor Fedotov wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> While testing BlueStore as a standalone storage via FIO plugin I'm
>>>>>> observing
>>>>>> huge traffic to a WAL device.
>>>>>>
>>>>>> Bluestore is configured to use 2 450 Gb Intel's SSD: INTEL
>>>>>> SSDSC2BX480G4L
>>>>>>
>>>>>> The first SSD is split into 2 partitions (200 & 250 Gb) for Block DB
>>>>>> and
>>>>>> Block
>>>>>> WAL.
>>>>>>
>>>>>> The second is split similarly and first 200Gb partition allocated for
>>>>>> Raw
>>>>>> Block data.
>>>>>>
>>>>>> RocksDB settings are set as Somnath suggested in his 'RocksDB tuning'
>>>>>> . No
>>>>>> much difference comparing to default settings though...
>>>>>>
>>>>>> As a result when doing 4k sequential write (8Gb total) to a fresh
>>>>>> storage
>>>>>> I'm
>>>>>> observing (using nmon and other disk mon tools) significant write
>>>>>> traffic
>>>>>> to
>>>>>> WAL device. And it grows eventually from ~10Mbs to ~170Mbs. Raw Block
>>>>>> device
>>>>>> traffic is pretty stable at ~30 Mbs.
>>>>>>
>>>>>> Additionally I inserted an output for BlueFS perf counters on
>>>>>> umount(l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst).
>>>>>>
>>>>>> The resulting values are very frustrating: ~28Gb and 4Gb for
>>>>>> l_bluefs_bytes_written_wal & l_bluefs_bytes_written_sst respectively.
>>>>> Yeah, this doesn't seem right.  Have you generated a log to see what is
>>>>> actually happening on each write?  I don't have any bright ideas about
>>>>> what is going wrong here.
>>>>>
>>>>> sage
>>>>>
>>>>>> Doing 64K changes the picture dramatically:
>>>>>>
>>>>>> WAL traffic is stable at 10-12 Mbs and RAW Block one is at ~400Mbs
>>>>>> BlueFS counters are ~140Mb and 1K respectively.
>>>>>>
>>>>>> Surely write completes much faster in the second case.
>>>>>>
>>>>>> No WAL is reported in logs at BlueStore level for both cases.
>>>>>>
>>>>>>
>>>>>> High BlueFS WAL traffic is observed when running subsequent random 4K
>>>>>> RW
>>>>>> over
>>>>>> the store propagated this way too.
>>>>>>
>>>>>> I'm wondering why WAL device is involved in the process at all (
>>>>>> writes
>>>>>> happen
>>>>>> in min_alloc_size blocks) operate and why the traffic and written data
>>>>>> volume
>>>>>> is so high?
>>>>>>
>>>>>> Don't we have some fault affecting 4K performance here?
>>>>>>
>>>>>>
>>>>>> Here are my settings and FIO job specification:
>>>>>>
>>>>>> ###########################
>>>>>>
>>>>>> [global]
>>>>>>            debug bluestore = 0/0
>>>>>>            debug bluefs = 1/0
>>>>>>            debug bdev = 0/0
>>>>>>            debug rocksdb = 0/0
>>>>>>
>>>>>>            # spread objects over 8 collections
>>>>>>            osd pool default pg num = 32
>>>>>>            log to stderr = false
>>>>>>
>>>>>> [osd]
>>>>>>            osd objectstore = bluestore
>>>>>>            bluestore_block_create = true
>>>>>>            bluestore_block_db_create = true
>>>>>>            bluestore_block_wal_create = true
>>>>>>            bluestore_min_alloc_size = 4096
>>>>>>            #bluestore_max_alloc_size = #or 4096
>>>>>>            bluestore_fsck_on_mount = false
>>>>>>
>>>>>>            bluestore_block_path=/dev/sdi1
>>>>>>            bluestore_block_db_path=/dev/sde1
>>>>>>            bluestore_block_wal_path=/dev/sde2
>>>>>>
>>>>>>            enable experimental unrecoverable data corrupting features =
>>>>>> bluestore
>>>>>> rocksdb memdb
>>>>>>
>>>>>>            bluestore_rocksdb_options =
>>>>>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>>>>>
>>>>>>            rocksdb_cache_size = 4294967296
>>>>>>            bluestore_csum = false
>>>>>>            bluestore_csum_type = none
>>>>>>            bluestore_bluefs_buffered_io = false
>>>>>>            bluestore_max_ops = 30000
>>>>>>            bluestore_max_bytes = 629145600
>>>>>>            bluestore_buffer_cache_size = 104857600
>>>>>>            bluestore_block_wal_size = 0
>>>>>>
>>>>>>            # use directory= option from fio job file
>>>>>>            osd data = ${fio_dir}
>>>>>>
>>>>>>            # log inside fio_dir
>>>>>>            log file = ${fio_dir}/log
>>>>>> ####################################
>>>>>>
>>>>>> #FIO jobs
>>>>>> #################
>>>>>> # Runs a 4k random write test against the ceph BlueStore.
>>>>>> [global]
>>>>>> ioengine=/usr/local/lib/libfio_ceph_objectstore.so # must be found in
>>>>>> your
>>>>>> LD_LIBRARY_PATH
>>>>>>
>>>>>> conf=ceph-bluestore-somnath.conf # must point to a valid ceph
>>>>>> configuration
>>>>>> file
>>>>>> directory=./fio-bluestore # directory for osd_data
>>>>>>
>>>>>> rw=write
>>>>>> iodepth=16
>>>>>> size=256m
>>>>>>
>>>>>> [bluestore]
>>>>>> nr_files=63
>>>>>> bs=4k        # or 64k
>>>>>> numjobs=32
>>>>>> #############
>>>>>>
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-08-24 18:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-22 13:46 Odd WAL traffic for BlueStore Igor Fedotov
2016-08-22 15:05 ` Somnath Roy
2016-08-22 15:08   ` Somnath Roy
2016-08-22 15:12     ` Igor Fedotov
2016-08-22 15:53       ` Somnath Roy
2016-08-22 16:13         ` Somnath Roy
2016-08-22 16:21           ` Igor Fedotov
2016-08-22 16:31             ` Somnath Roy
2016-08-22 15:08   ` Igor Fedotov
2016-08-22 15:08 ` Sage Weil
2016-08-22 15:10   ` Igor Fedotov
2016-08-22 15:12     ` Sage Weil
2016-08-22 16:14       ` Igor Fedotov
2016-08-22 16:55         ` Sage Weil
2016-08-24 18:28           ` Igor Fedotov
2016-08-22 15:17     ` Haomai Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.