* Ceph space amplification ratio with write is more than 2
@ 2016-02-26 17:46 James (Fei) Liu-SSI
2016-02-26 21:44 ` Somnath Roy
2016-02-27 1:27 ` Haomai Wang
0 siblings, 2 replies; 8+ messages in thread
From: James (Fei) Liu-SSI @ 2016-02-26 17:46 UTC (permalink / raw)
To: ceph-devel
Hi Cephers,
We recently tested Ceph space amplification with FileStore by writing data to a ramdisk with rados bench. However, rados bench wrote only
3584 MB, while 8658 MB was consumed on the ramdisk (1 GB of the 9 GB ramdisk was used for the journal).
The total space amplification is 7658/3584 = 2.14, which is a surprisingly large factor.
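Restating the arithmetic as a quick sketch (numbers from the ceph df output in Appendix 0; the on-disk journal file is 1024 MB, so the result differs slightly from the 2.14 above, which rounds the journal to 1000 MB):

```python
# Sanity check of the reported space amplification (all sizes in MB).
# Figures are copied from the "ceph df" output in Appendix 0 below.
raw_used = 8658      # RAW USED on the ramdisk
journal = 1024       # FileStore journal file (osd_journal_size = 1024)
client_data = 3584   # data actually written by rados bench

amp_total = raw_used / client_data                  # includes the journal
amp_no_journal = (raw_used - journal) / client_data

print(f"total:           {amp_total:.2f}")    # 2.42
print(f"without journal: {amp_no_journal:.2f}")  # 2.13
```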
1. Cluster configuration:
. One OSD and one MON run on the same machine as rados bench. The replication factor was set to 1.
. Ceph version: master commit 45979c8e34fa2f3d7efa28c29fb90758b3f9f818
2. The rados command we used:
rados bench -p rbd -b 4096 --max-objects 1048576 300 write --no-cleanup
3. ceph cluster configuration:
Please see appendix 1.
4. Results investigation.
Please see appendix 0.
Could anyone help explain why the space amplification with FileStore is so large? Thanks a lot.
Regards,
James
Appendix 0:
ssd@OptiPlex-9020-1:~/src/bluestore$ ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
9206M 547M 8658M 94.05
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 3584M 38.93 547M 917505
ssd@OptiPlex-9020-1:~/src/bluestore$ ceph -s
cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
health HEALTH_WARN
1 near full osd(s)
monmap e1: 1 mons at {localhost=127.0.0.1:6789/0}
election epoch 3, quorum 0 localhost
osdmap e7: 1 osds: 1 up, 1 in
flags sortbitwise
pgmap v31: 64 pgs, 1 pools, 3584 MB data, 896 kobjects
8658 MB used, 547 MB / 9206 MB avail
64 active+clean
ssd@OptiPlex-9020-1:~/src/bluestore$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 2.7T 1.4T 1.2T 55% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 5.9G 4.0K 5.9G 1% /dev
tmpfs 1.2G 720K 1.2G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 5.9G 81M 5.8G 2% /run/shm
none 100M 20K 100M 1% /run/user
/dev/ram1 9.0G 8.4G 618M 94% /home/ssd/src/bluestore/ceph-deploy/osd/myosddata
Journal size:
-rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 ceph_fsid
drwxr-xr-x 132 ssd ssd 4096 Feb 25 15:53 current
-rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 fsid
-rw-r--r-- 1 ssd ssd 21 Feb 25 15:53 magic
-rw-r--r-- 1 ssd ssd 1073741824 Feb 25 15:53 myosdjournal
-rw-r--r-- 1 ssd ssd 6 Feb 25 15:53 ready
-rw-r--r-- 1 ssd ssd 4 Feb 25 15:53 store_version
-rw-r--r-- 1 ssd ssd 53 Feb 25 15:53 superblock
-rw-r--r-- 1 ssd ssd 10 Feb 25 15:53 type
-rw-r--r-- 1 ssd ssd 2 Feb 25 15:53 whoami
du -sh on the current/ directory shows about 8 GB in use.
Appendix 1:
[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
filestore_max_sync_interval=10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
filestore_op_threads = 6
filestore_queue_max_ops=5000
filestore_queue_committing_max_ops=5000
journal_max_write_entries=1000
journal_queue_max_ops=3000
filestore_wbthrottle_enable=false
filestore_queue_max_bytes=1048576000
filestore_queue_committing_max_bytes=1048576000
journal_max_write_bytes=1048576000
journal_queue_max_bytes=1048576000
osd_journal_size = 1024
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
debug_newstore = 0/0
debug_keyvaluestore = 0/0
osd_tracing = true
osd_objectstore_tracing = true
rados_tracing = true
rbd_tracing = true
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs = -f -i size=2048
osd_op_threads = 32
objecter_inflight_ops=102400
ms_dispatch_throttle_bytes=1048576000
objecter_infilght_op_bytes=1048576000
osd_mkfs_type = xfs
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
mon_initial_members = localhost
mon_host = 127.0.0.1
mon data = '$CEPH_DEPLOY'/mon/mymondata
mon cluster log file = '$CEPH_DEPLOY'/mon/mon.log
keyring='$CEPH_DEPLOY'/ceph.client.admin.keyring
run dir = '$CEPH_DEPLOY'/run
[osd.0]
osd data = '$CEPH_DEPLOY'/osd/myosddata
osd journal = '$CEPH_DEPLOY'/osd/myosddata/myosdjournal
#osd journal = '$WORKSPACE'/myosdjournal/myosdjournal
log file = '$CEPH_DEPLOY'/osd/osd.log'
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* RE: Ceph space amplification ratio with write is more than 2
2016-02-26 17:46 Ceph space amplification ratio with write is more than 2 James (Fei) Liu-SSI
@ 2016-02-26 21:44 ` Somnath Roy
2016-03-01 1:10 ` James (Fei) Liu-SSI
2016-02-27 1:27 ` Haomai Wang
1 sibling, 1 reply; 8+ messages in thread
From: Somnath Roy @ 2016-02-26 21:44 UTC (permalink / raw)
To: James (Fei) Liu-SSI, ceph-devel
Hmm...not *true* in my setup, and in fact this can't be true with Ceph. Ceph's WA should be ~2.3-2.4 considering the journal write; excluding the journal write it should be well below 1.
A single OSD, with the journal (10 GB) and data on the same SSD but in different partitions; one rbd image of 200 GB.
root@emsnode10:~/sjust# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 304G 168G 121G 59% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 32G 12K 32G 1% /dev
tmpfs 6.3G 1.3M 6.3G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 32G 8.0K 32G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/sdj2 7.0T 201G 6.8T 3% /var/lib/ceph/osd/ceph-0
root@emsnode10:~/sjust# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
7141G 6941G 200G 2.80
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 1 200G 2.80 6941G 51203
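For comparison, the same ratio computed from the ceph df output above (a sketch; sizes in GB — on this setup the 10 GB journal sits on its own partition, so RAW USED reflects only the object data):

```python
# Space amplification on this setup, from "ceph df" (sizes in GB).
raw_used = 200.0   # RAW USED
pool_used = 200.0  # USED for the rbd pool (one 200 GB rbd image)

print(raw_used / pool_used)  # 1.0 -- essentially no space amplification
```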
There must be something we are missing in your case.
Thanks & Regards
Somnath
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Friday, February 26, 2016 9:47 AM
To: ceph-devel@vger.kernel.org
Subject: Ceph space amplification ratio with write is more than 2
[original message and appendices quoted in full; snipped]
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
* Re: Ceph space amplification ratio with write is more than 2
2016-02-26 17:46 Ceph space amplification ratio with write is more than 2 James (Fei) Liu-SSI
2016-02-26 21:44 ` Somnath Roy
@ 2016-02-27 1:27 ` Haomai Wang
2016-02-27 1:58 ` Somnath Roy
1 sibling, 1 reply; 8+ messages in thread
From: Haomai Wang @ 2016-02-27 1:27 UTC (permalink / raw)
To: James (Fei) Liu-SSI; +Cc: ceph-devel
It should be as expected. The reason is that FileJournal prepends a
header and appends a footer to the transaction data, and when using
aio+dio we need to align these. So normally a 4k IO becomes a 5-6k
FileJournal write, IIRC. Add the FileStore writeback to the filesystem
and it should be about 2.5x. But the local filesystem and disk can do
some merging tricks, so it should come out below 2.5x.
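A toy model of the effect described above (the header and footer sizes here are made-up round numbers for illustration, not the real FileJournal struct sizes):

```python
# Toy model: how FileJournal padding inflates a small write under
# aio+dio. Header/footer sizes are illustrative assumptions only.
PAGE = 4096

def align_up(n: int, a: int) -> int:
    return (n + a - 1) // a * a

def journal_write_size(payload: int, header: int = 40, footer: int = 8,
                       dio: bool = True) -> int:
    entry = header + payload + footer
    # With aio+dio the journal write must be page-aligned, so a lone
    # 4 KiB payload pads out to two full pages.
    return align_up(entry, PAGE) if dio else entry

jw = journal_write_size(4096)
print(jw)                  # 8192 for an unbatched 4 KiB entry
# Batching several entries into one aio write amortizes the padding,
# which is roughly how a 4 KiB IO ends up costing ~5-6 KiB in practice.
print((jw + 4096) / 4096)  # journal + filestore writeback: 3.0 worst case
```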
On Sat, Feb 27, 2016 at 1:46 AM, James (Fei) Liu-SSI
<james.liu@ssi.samsung.com> wrote:
> [original message and appendices snipped]
* RE: Ceph space amplification ratio with write is more than 2
2016-02-27 1:27 ` Haomai Wang
@ 2016-02-27 1:58 ` Somnath Roy
2016-02-27 5:50 ` Haomai Wang
0 siblings, 1 reply; 8+ messages in thread
From: Somnath Roy @ 2016-02-27 1:58 UTC (permalink / raw)
To: Haomai Wang, James (Fei) Liu-SSI; +Cc: ceph-devel
Haomai,
I think what James is seeing, excluding the journal write, is >2x write amp. There must be something fishy in his setup..
Thanks & Regards
Somnath
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Friday, February 26, 2016 5:27 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph space amplification ratio with write is more than 2
[quoted message and earlier thread snipped]
* Re: Ceph space amplification ratio with write is more than 2
2016-02-27 1:58 ` Somnath Roy
@ 2016-02-27 5:50 ` Haomai Wang
2016-02-29 6:47 ` Ning Yao
2016-03-10 0:29 ` Jianjian Huo
0 siblings, 2 replies; 8+ messages in thread
From: Haomai Wang @ 2016-02-27 5:50 UTC (permalink / raw)
To: Somnath Roy; +Cc: James (Fei) Liu-SSI, ceph-devel
On Sat, Feb 27, 2016 at 9:58 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Haomai,
> I think what James is seeing excluding journal write it is > 2X write amp. Must be something fishy on his setup..
Oh, thanks for the reminder. Hmm, I double-checked the current
FileJournal; it should be aligned to 4k... So it could be possible
too......
> [rest of quoted thread snipped]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Ceph space amplification ratio with write is more than 2
2016-02-27 5:50 ` Haomai Wang
@ 2016-02-29 6:47 ` Ning Yao
2016-03-10 0:29 ` Jianjian Huo
1 sibling, 0 replies; 8+ messages in thread
From: Ning Yao @ 2016-02-29 6:47 UTC (permalink / raw)
To: Haomai Wang; +Cc: Somnath Roy, James (Fei) Liu-SSI, ceph-devel
Yeah, I think James is referring to space amplification, not write amplification,
so this has nothing to do with the journal.
It may instead relate to the XFS filesystem layout: there are about 1M
objects, and we currently run mkfs.xfs with -i size=2048, which explicitly
makes each inode 2 KB. Also, the directory current/omap will consume some
space, which is not counted in the pool stats.
You may want to try again with a large disk instead of the ramdisk; I
think the same thing will happen.
Regards
Ning Yao
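Ning Yao's inode-size point can be sanity-checked with quick arithmetic. This is a rough sketch using the object count and sizes from the `ceph df` output earlier in the thread; directory blocks, the omap leveldb, and XFS allocation-group metadata are not modeled and show up as the unexplained remainder.

```python
# Back-of-the-envelope check of the inode-size overhead: with
# mkfs.xfs -i size=2048, every one of the ~917k 4 KB objects pays for
# a 2 KB inode on top of its 4 KB data block.
OBJECTS = 917505            # rbd pool object count from `ceph df`
DATA_BLOCK = 4096           # one 4 KB object -> one 4 KB XFS block
INODE = 2048                # mkfs.xfs -i size=2048
JOURNAL_MIB = 1024          # osd_journal_size = 1024
MIB = 1024 * 1024

data_mib = OBJECTS * DATA_BLOCK // MIB    # ~3584 MiB, matches ceph df
inode_mib = OBJECTS * INODE // MIB        # ~1792 MiB of pure inode space
accounted = data_mib + inode_mib + JOURNAL_MIB
print(f"data={data_mib} MiB, inodes={inode_mib} MiB, accounted={accounted} MiB")

# The cluster reports 8658 MiB used; the remainder is directory blocks,
# the omap leveldb under current/omap, and XFS metadata.
print(f"unexplained={8658 - accounted} MiB")
```

Even before omap and XFS metadata, the 2 KB inodes alone add roughly 50% on top of the 4 KB objects, so a large space ratio on tiny objects is not surprising.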
2016-02-27 13:50 GMT+08:00 Haomai Wang <haomai@xsky.com>:
> On Sat, Feb 27, 2016 at 9:58 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> Haomai,
>> I think what James is seeing, excluding the journal write, is > 2X write amp. Must be something fishy in his setup..
>
> Oh, thanks for reminder. Hmm, I double checked the current
> FileJournal, it should be aligned with 4k... So it could be possible
> too......
>
>>
>> Thanks & Regards
>> Somnath
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>> Sent: Friday, February 26, 2016 5:27 PM
>> To: James (Fei) Liu-SSI
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: Ceph space amplification ratio with write is more than 2
>>
>> It should be as expected. The reason is that FileJournal prepends a header and appends a footer to the transaction data, and when using
>> aio+dio, we need to align these. So a 4k io normally becomes a
>> 5-6k filejournal write, IIRC. Adding the filestore writeback to the filesystem, that comes to about 2.5x. But the local filesystem and disk have some merge tricks, so it should end up a bit below 2.5x.
>>
>> On Sat, Feb 27, 2016 at 1:46 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>> Hi Cepher,
>>> We recently have tested the Ceph space amplification with filestore
>>> through writing data to ramdisk with rados bench. However, we found
>>> only
>>> 3584 MB was written from rados bench but totally 8658M(1G of total 9G of ramdisk was used for journal) was used in the ramdisk.
>>>
>>> The total space amplification is 7658/3584 = 2.14. (It is a
>>> surprisingly huge factor.)
>>> 1. Cluster configuration:
>>> . One OSD and one MON are in the same machine with rados bench. Replication factor was set as 1.
>>> . ceph version from ceph master
>>> 5979c8e34fa2f3d7efa28c29fb90758b3f9f818
>>> (45979c8e34fa2f3d7efa28c29fb90758b3f9f818)
>>> 2. Rados command we have been used :
>>> rados bench -p rbd -b 4096 --max-objects 1048576 300 write --no-cleanup
>>> 3. ceph cluster configuration:
>>> Please see appendix 1.
>>> 4. Results investigation.
>>> Please see appendix 0.
>>>
>>> Could anyone help to explain why the space amplification with filestore is huge? Thanks a lot.
>>>
>>> Regards,
>>> James
>>>
>>> Appendix 0:
>>>
>>> ssd@OptiPlex-9020-1:~/src/bluestore$ ceph df
>>> GLOBAL:
>>> SIZE AVAIL RAW USED %RAW USED
>>> 9206M 547M 8658M 94.05
>>> POOLS:
>>> NAME ID USED %USED MAX AVAIL OBJECTS
>>> rbd 0 3584M 38.93 547M 917505
>>> ssd@OptiPlex-9020-1:~/src/bluestore$ ceph -s
>>> cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
>>> health HEALTH_WARN
>>> 1 near full osd(s)
>>> monmap e1: 1 mons at {localhost=127.0.0.1:6789/0}
>>> election epoch 3, quorum 0 localhost
>>> osdmap e7: 1 osds: 1 up, 1 in
>>> flags sortbitwise
>>> pgmap v31: 64 pgs, 1 pools, 3584 MB data, 896 kobjects
>>> 8658 MB used, 547 MB / 9206 MB avail
>>> 64 active+clean
>>> ssd@OptiPlex-9020-1:~/src/bluestore$ df -h
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/sda2 2.7T 1.4T 1.2T 55% /
>>> none 4.0K 0 4.0K 0% /sys/fs/cgroup
>>> udev 5.9G 4.0K 5.9G 1% /dev
>>> tmpfs 1.2G 720K 1.2G 1% /run
>>> none 5.0M 0 5.0M 0% /run/lock
>>> none 5.9G 81M 5.8G 2% /run/shm
>>> none 100M 20K 100M 1% /run/user
>>> /dev/ram1 9.0G 8.4G 618M 94% /home/ssd/src/bluestore/ceph-deploy/osd/myosddata
>>>
>>> Journal size:
>>>
>>> -rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 ceph_fsid
>>> drwxr-xr-x 132 ssd ssd 4096 Feb 25 15:53 current
>>> -rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 fsid
>>> -rw-r--r-- 1 ssd ssd 21 Feb 25 15:53 magic
>>> -rw-r--r-- 1 ssd ssd 1073741824 Feb 25 15:53 myosdjournal
>>> -rw-r--r-- 1 ssd ssd 6 Feb 25 15:53 ready
>>> -rw-r--r-- 1 ssd ssd 4 Feb 25 15:53 store_version
>>> -rw-r--r-- 1 ssd ssd 53 Feb 25 15:53 superblock
>>> -rw-r--r-- 1 ssd ssd 10 Feb 25 15:53 type
>>> -rw-r--r-- 1 ssd ssd 2 Feb 25 15:53 whoami
>>>
>>> du -sh
>>> It shows current use 8G.
>>>
>>> Appendix 1:
>>> [global]
>>> fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
>>> auth_cluster_required = none
>>> auth_service_required = none
>>> auth_client_required = none
>>> filestore_xattr_use_omap = true
>>> filestore_max_sync_interval=10
>>> filestore_fd_cache_size = 64
>>> filestore_fd_cache_shards = 32
>>> filestore_op_threads = 6
>>> filestore_queue_max_ops=5000
>>> filestore_queue_committing_max_ops=5000
>>> journal_max_write_entries=1000
>>> journal_queue_max_ops=3000
>>> filestore_wbthrottle_enable=false
>>> filestore_queue_max_bytes=1048576000
>>> filestore_queue_committing_max_bytes=1048576000
>>> journal_max_write_bytes=1048576000
>>> journal_queue_max_bytes=1048576000
>>>
>>> osd_journal_size = 1024
>>> debug_lockdep = 0/0
>>> debug_context = 0/0
>>> debug_crush = 0/0
>>> debug_buffer = 0/0
>>> debug_timer = 0/0
>>> debug_filer = 0/0
>>> debug_objecter = 0/0
>>> debug_rados = 0/0
>>> debug_rbd = 0/0
>>> debug_journaler = 0/0
>>> debug_objectcatcher = 0/0
>>> debug_client = 0/0
>>> debug_osd = 0/0
>>> debug_optracker = 0/0
>>> debug_objclass = 0/0
>>> debug_filestore = 0/0
>>> debug_journal = 0/0
>>> debug_ms = 0/0
>>> debug_monc = 0/0
>>> debug_tp = 0/0
>>> debug_auth = 0/0
>>> debug_finisher = 0/0
>>> debug_heartbeatmap = 0/0
>>> debug_perfcounter = 0/0
>>> debug_asok = 0/0
>>> debug_throttle = 0/0
>>> debug_mon = 0/0
>>> debug_paxos = 0/0
>>> debug_rgw = 0/0
>>> debug_newstore = 0/0
>>> debug_keyvaluestore = 0/0
>>> osd_tracing = true
>>> osd_objectstore_tracing = true
>>> rados_tracing = true
>>> rbd_tracing = true
>>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
>>> osd_mkfs_options_xfs = -f -i size=2048
>>> osd_op_threads = 32
>>> objecter_inflight_ops=102400
>>> ms_dispatch_throttle_bytes=1048576000
>>> objecter_infilght_op_bytes=1048576000
>>> osd_mkfs_type = xfs
>>> osd_client_message_size_cap = 0
>>> osd_client_message_cap = 0
>>> osd_enable_op_tracker = false
>>> mon_initial_members = localhost
>>> mon_host = 127.0.0.1
>>> mon data = '$CEPH_DEPLOY'/mon/mymondata
>>> mon cluster log file = '$CEPH_DEPLOY'/mon/mon.log
>>> keyring='$CEPH_DEPLOY'/ceph.client.admin.keyring
>>> run dir = '$CEPH_DEPLOY'/run
>>> [osd.0]
>>> osd data = '$CEPH_DEPLOY'/osd/myosddata
>>> osd journal = '$CEPH_DEPLOY'/osd/myosddata/myosdjournal
>>> #osd journal = '$WORKSPACE'/myosdjournal/myosdjournal
>>> log file = '$CEPH_DEPLOY'/osd/osd.log
* RE: Ceph space amplification ratio with write is more than 2
2016-02-26 21:44 ` Somnath Roy
@ 2016-03-01 1:10 ` James (Fei) Liu-SSI
0 siblings, 0 replies; 8+ messages in thread
From: James (Fei) Liu-SSI @ 2016-03-01 1:10 UTC (permalink / raw)
To: Somnath Roy, ceph-devel
Hi Somnath,
Thanks for your quick answer. Would you mind sharing the amount of data you issued with the rados bench command below?
rados bench -p rbd -b 4096 --max-objects 1048576 $2 300 write --no-cleanup
Regards,
James
-----Original Message-----
From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
Sent: Friday, February 26, 2016 1:44 PM
To: James (Fei) Liu-SSI; ceph-devel@vger.kernel.org
Subject: RE: Ceph space amplification ratio with write is more than 2
Hmm... Not *true* in my setup, and in fact this can't be true with Ceph. Ceph's WA should be ~2.3-2.4 considering the journal write; excluding the journal write it should be well below 1.
Single OSD, journal (10G) and data on the same SSD but in different partitions. One rbd image with 200G.
root@emsnode10:~/sjust# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 304G 168G 121G 59% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 32G 12K 32G 1% /dev
tmpfs 6.3G 1.3M 6.3G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 32G 8.0K 32G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/sdj2 7.0T 201G 6.8T 3% /var/lib/ceph/osd/ceph-0
root@emsnode10:~/sjust# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
7141G 6941G 200G 2.80
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 1 200G 2.80 6941G 51203
Must be something we are missing in your case.
Thanks & Regards
Somnath
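Somnath's journal-inclusive figure can be sketched with a simple model. The 1 KB per-entry journal overhead below is an assumed round number for illustration, not FileJournal's real header/footer size, and leveldb and XFS metadata writes are ignored, which is why the sketch gives ~2x rather than his measured 2.3-2.4x.

```python
# Rough model of filestore write amplification with the journal included:
# each client write is journaled once (entry metadata + payload, rounded
# up to the 4 KB dio block size) and then written back once to the data
# file. The 1 KB overhead is an assumed figure for illustration.
def device_bytes(payload, journal_overhead=1024, block=4096):
    """Approximate bytes hitting the device for one client write."""
    journal = -(-(payload + journal_overhead) // block) * block  # ceil to dio boundary
    data = -(-payload // block) * block                          # data file writeback
    return journal + data

# Large writes amortize the overhead toward ~2x; 4 KB writes do worse.
print(device_bytes(4 * 1024 * 1024) / (4 * 1024 * 1024))  # just over 2.0
print(device_bytes(4096) / 4096)                          # 3.0
```

Leveldb (pg log, object info) and XFS journal/inode traffic push the real total above this model, toward the 2.3-2.4x quoted above.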
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Friday, February 26, 2016 9:47 AM
To: ceph-devel@vger.kernel.org
Subject: Ceph space amplification ratio with write is more than 2
Hi Cepher,
We recently tested Ceph space amplification with filestore by writing data to a ramdisk with rados bench. However, we found that only
3584 MB was written by rados bench, while 8658 MB in total was used on the ramdisk (1 GB of the 9 GB ramdisk went to the journal).
The total space amplification is 7658/3584 = 2.14. (It is a surprisingly huge factor.)
1. Cluster configuration:
. One OSD and one MON are in the same machine with rados bench. Replication factor was set as 1.
. ceph version from ceph master 5979c8e34fa2f3d7efa28c29fb90758b3f9f818 (45979c8e34fa2f3d7efa28c29fb90758b3f9f818)
2. Rados command we have been used :
rados bench -p rbd -b 4096 --max-objects 1048576 300 write --no-cleanup
3. ceph cluster configuration:
Please see appendix 1.
4. Results investigation.
Please see appendix 0.
Could anyone help to explain why the space amplification with filestore is huge? Thanks a lot.
Regards,
James
Appendix 0:
ssd@OptiPlex-9020-1:~/src/bluestore$ ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
9206M 547M 8658M 94.05
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 3584M 38.93 547M 917505
ssd@OptiPlex-9020-1:~/src/bluestore$ ceph -s
cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
health HEALTH_WARN
1 near full osd(s)
monmap e1: 1 mons at {localhost=127.0.0.1:6789/0}
election epoch 3, quorum 0 localhost
osdmap e7: 1 osds: 1 up, 1 in
flags sortbitwise
pgmap v31: 64 pgs, 1 pools, 3584 MB data, 896 kobjects
8658 MB used, 547 MB / 9206 MB avail
64 active+clean
ssd@OptiPlex-9020-1:~/src/bluestore$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 2.7T 1.4T 1.2T 55% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 5.9G 4.0K 5.9G 1% /dev
tmpfs 1.2G 720K 1.2G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 5.9G 81M 5.8G 2% /run/shm
none 100M 20K 100M 1% /run/user
/dev/ram1 9.0G 8.4G 618M 94% /home/ssd/src/bluestore/ceph-deploy/osd/myosddata
Journal size:
-rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 ceph_fsid
drwxr-xr-x 132 ssd ssd 4096 Feb 25 15:53 current
-rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 fsid
-rw-r--r-- 1 ssd ssd 21 Feb 25 15:53 magic
-rw-r--r-- 1 ssd ssd 1073741824 Feb 25 15:53 myosdjournal
-rw-r--r-- 1 ssd ssd 6 Feb 25 15:53 ready
-rw-r--r-- 1 ssd ssd 4 Feb 25 15:53 store_version
-rw-r--r-- 1 ssd ssd 53 Feb 25 15:53 superblock
-rw-r--r-- 1 ssd ssd 10 Feb 25 15:53 type
-rw-r--r-- 1 ssd ssd 2 Feb 25 15:53 whoami
du -sh
It shows current use 8G.
Appendix 1:
[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
filestore_max_sync_interval=10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
filestore_op_threads = 6
filestore_queue_max_ops=5000
filestore_queue_committing_max_ops=5000
journal_max_write_entries=1000
journal_queue_max_ops=3000
filestore_wbthrottle_enable=false
filestore_queue_max_bytes=1048576000
filestore_queue_committing_max_bytes=1048576000
journal_max_write_bytes=1048576000
journal_queue_max_bytes=1048576000
osd_journal_size = 1024
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
debug_newstore = 0/0
debug_keyvaluestore = 0/0
osd_tracing = true
osd_objectstore_tracing = true
rados_tracing = true
rbd_tracing = true
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs = -f -i size=2048
osd_op_threads = 32
objecter_inflight_ops=102400
ms_dispatch_throttle_bytes=1048576000
objecter_infilght_op_bytes=1048576000
osd_mkfs_type = xfs
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
mon_initial_members = localhost
mon_host = 127.0.0.1
mon data = '$CEPH_DEPLOY'/mon/mymondata
mon cluster log file = '$CEPH_DEPLOY'/mon/mon.log
keyring='$CEPH_DEPLOY'/ceph.client.admin.keyring
run dir = '$CEPH_DEPLOY'/run
[osd.0]
osd data = '$CEPH_DEPLOY'/osd/myosddata
osd journal = '$CEPH_DEPLOY'/osd/myosddata/myosdjournal
#osd journal = '$WORKSPACE'/myosdjournal/myosdjournal
log file = '$CEPH_DEPLOY'/osd/osd.log
* Re: Ceph space amplification ratio with write is more than 2
2016-02-27 5:50 ` Haomai Wang
2016-02-29 6:47 ` Ning Yao
@ 2016-03-10 0:29 ` Jianjian Huo
1 sibling, 0 replies; 8+ messages in thread
From: Jianjian Huo @ 2016-03-10 0:29 UTC (permalink / raw)
To: Haomai Wang; +Cc: Somnath Roy, James (Fei) Liu-SSI, ceph-devel
Just saw this thread...
If we are talking about application write amp, I observed different
WAF values for different Ceph workloads. For object workloads (rados
bench writing an empty drive to full), 4MB object writes show about 2X
write amp, with ceph journal writes just a little more than the data;
4KB object writes show more than 10X write amp. For block
workloads (fio_rbd), 4KB block writes show about 5-6X write
amp (including ceph journal, data files, leveldb, xfs journal, xfs
inodes, ...), depending on the ceph and xfs configs.
For space amp, the block workload is close to 1.
This is based on Hammer, 480GB SSD, single OSD, data and 5G journal on
the same drive, 400G rbd image.
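The ~5-6x figure for 4 KB block writes can be decomposed along the components listed above. Only the component names come from the message; the per-component byte counts below are assumed round numbers for illustration, not measured values.

```python
# Illustrative fan-out of a single 4 KB rbd block write under filestore.
# Byte counts are assumptions chosen to land in the observed range.
CLIENT_WRITE = 4096
components = {
    "ceph journal entry (dio-aligned)": 8192,  # header/footer push past one block
    "data file writeback":              4096,
    "leveldb (pg log / object info)":   4096,
    "xfs log + inode update":           4096,
}
waf = sum(components.values()) / CLIENT_WRITE
print(f"estimated WAF ~= {waf:.1f}x")  # within the observed 5-6x range
```

The exact split shifts with ceph and xfs configs (sync interval, leveldb compaction, logbsize), which is why a range rather than a single number is observed.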
On Fri, Feb 26, 2016 at 9:50 PM, Haomai Wang <haomai@xsky.com> wrote:
> On Sat, Feb 27, 2016 at 9:58 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> Haomai,
>> I think what James is seeing, excluding the journal write, is > 2X write amp. Must be something fishy in his setup..
>
> Oh, thanks for reminder. Hmm, I double checked the current
> FileJournal, it should be aligned with 4k... So it could be possible
> too......
>
>>
>> Thanks & Regards
>> Somnath
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>> Sent: Friday, February 26, 2016 5:27 PM
>> To: James (Fei) Liu-SSI
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: Ceph space amplification ratio with write is more than 2
>>
>> It should be as expected. The reason is that FileJournal prepends a header and appends a footer to the transaction data, and when using
>> aio+dio, we need to align these. So a 4k io normally becomes a
>> 5-6k filejournal write, IIRC. Adding the filestore writeback to the filesystem, that comes to about 2.5x. But the local filesystem and disk have some merge tricks, so it should end up a bit below 2.5x.
>>
>> On Sat, Feb 27, 2016 at 1:46 AM, James (Fei) Liu-SSI <james.liu@ssi.samsung.com> wrote:
>>> Hi Cepher,
>>> We recently have tested the Ceph space amplification with filestore
>>> through writing data to ramdisk with rados bench. However, we found
>>> only
>>> 3584 MB was written from rados bench but totally 8658M(1G of total 9G of ramdisk was used for journal) was used in the ramdisk.
>>>
>>> The total space amplification is 7658/3584 = 2.14. (It is a
>>> surprisingly huge factor.)
>>> 1. Cluster configuration:
>>> . One OSD and one MON are in the same machine with rados bench. Replication factor was set as 1.
>>> . ceph version from ceph master
>>> 5979c8e34fa2f3d7efa28c29fb90758b3f9f818
>>> (45979c8e34fa2f3d7efa28c29fb90758b3f9f818)
>>> 2. Rados command we have been used :
>>> rados bench -p rbd -b 4096 --max-objects 1048576 300 write --no-cleanup
>>> 3. ceph cluster configuration:
>>> Please see appendix 1.
>>> 4. Results investigation.
>>> Please see appendix 0.
>>>
>>> Could anyone help to explain why the space amplification with filestore is huge? Thanks a lot.
>>>
>>> Regards,
>>> James
>>>
>>> Appendix 0:
>>>
>>> ssd@OptiPlex-9020-1:~/src/bluestore$ ceph df
>>> GLOBAL:
>>> SIZE AVAIL RAW USED %RAW USED
>>> 9206M 547M 8658M 94.05
>>> POOLS:
>>> NAME ID USED %USED MAX AVAIL OBJECTS
>>> rbd 0 3584M 38.93 547M 917505
>>> ssd@OptiPlex-9020-1:~/src/bluestore$ ceph -s
>>> cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
>>> health HEALTH_WARN
>>> 1 near full osd(s)
>>> monmap e1: 1 mons at {localhost=127.0.0.1:6789/0}
>>> election epoch 3, quorum 0 localhost
>>> osdmap e7: 1 osds: 1 up, 1 in
>>> flags sortbitwise
>>> pgmap v31: 64 pgs, 1 pools, 3584 MB data, 896 kobjects
>>> 8658 MB used, 547 MB / 9206 MB avail
>>> 64 active+clean
>>> ssd@OptiPlex-9020-1:~/src/bluestore$ df -h
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/sda2 2.7T 1.4T 1.2T 55% /
>>> none 4.0K 0 4.0K 0% /sys/fs/cgroup
>>> udev 5.9G 4.0K 5.9G 1% /dev
>>> tmpfs 1.2G 720K 1.2G 1% /run
>>> none 5.0M 0 5.0M 0% /run/lock
>>> none 5.9G 81M 5.8G 2% /run/shm
>>> none 100M 20K 100M 1% /run/user
>>> /dev/ram1 9.0G 8.4G 618M 94% /home/ssd/src/bluestore/ceph-deploy/osd/myosddata
>>>
>>> Journal size:
>>>
>>> -rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 ceph_fsid
>>> drwxr-xr-x 132 ssd ssd 4096 Feb 25 15:53 current
>>> -rw-r--r-- 1 ssd ssd 37 Feb 25 15:53 fsid
>>> -rw-r--r-- 1 ssd ssd 21 Feb 25 15:53 magic
>>> -rw-r--r-- 1 ssd ssd 1073741824 Feb 25 15:53 myosdjournal
>>> -rw-r--r-- 1 ssd ssd 6 Feb 25 15:53 ready
>>> -rw-r--r-- 1 ssd ssd 4 Feb 25 15:53 store_version
>>> -rw-r--r-- 1 ssd ssd 53 Feb 25 15:53 superblock
>>> -rw-r--r-- 1 ssd ssd 10 Feb 25 15:53 type
>>> -rw-r--r-- 1 ssd ssd 2 Feb 25 15:53 whoami
>>>
>>> du -sh
>>> It shows current use 8G.
>>>
>>> Appendix 1:
>>> [global]
>>> fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
>>> auth_cluster_required = none
>>> auth_service_required = none
>>> auth_client_required = none
>>> filestore_xattr_use_omap = true
>>> filestore_max_sync_interval=10
>>> filestore_fd_cache_size = 64
>>> filestore_fd_cache_shards = 32
>>> filestore_op_threads = 6
>>> filestore_queue_max_ops=5000
>>> filestore_queue_committing_max_ops=5000
>>> journal_max_write_entries=1000
>>> journal_queue_max_ops=3000
>>> filestore_wbthrottle_enable=false
>>> filestore_queue_max_bytes=1048576000
>>> filestore_queue_committing_max_bytes=1048576000
>>> journal_max_write_bytes=1048576000
>>> journal_queue_max_bytes=1048576000
>>>
>>> osd_journal_size = 1024
>>> debug_lockdep = 0/0
>>> debug_context = 0/0
>>> debug_crush = 0/0
>>> debug_buffer = 0/0
>>> debug_timer = 0/0
>>> debug_filer = 0/0
>>> debug_objecter = 0/0
>>> debug_rados = 0/0
>>> debug_rbd = 0/0
>>> debug_journaler = 0/0
>>> debug_objectcatcher = 0/0
>>> debug_client = 0/0
>>> debug_osd = 0/0
>>> debug_optracker = 0/0
>>> debug_objclass = 0/0
>>> debug_filestore = 0/0
>>> debug_journal = 0/0
>>> debug_ms = 0/0
>>> debug_monc = 0/0
>>> debug_tp = 0/0
>>> debug_auth = 0/0
>>> debug_finisher = 0/0
>>> debug_heartbeatmap = 0/0
>>> debug_perfcounter = 0/0
>>> debug_asok = 0/0
>>> debug_throttle = 0/0
>>> debug_mon = 0/0
>>> debug_paxos = 0/0
>>> debug_rgw = 0/0
>>> debug_newstore = 0/0
>>> debug_keyvaluestore = 0/0
>>> osd_tracing = true
>>> osd_objectstore_tracing = true
>>> rados_tracing = true
>>> rbd_tracing = true
>>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
>>> osd_mkfs_options_xfs = -f -i size=2048
>>> osd_op_threads = 32
>>> objecter_inflight_ops=102400
>>> ms_dispatch_throttle_bytes=1048576000
>>> objecter_infilght_op_bytes=1048576000
>>> osd_mkfs_type = xfs
>>> osd_client_message_size_cap = 0
>>> osd_client_message_cap = 0
>>> osd_enable_op_tracker = false
>>> mon_initial_members = localhost
>>> mon_host = 127.0.0.1
>>> mon data = '$CEPH_DEPLOY'/mon/mymondata
>>> mon cluster log file = '$CEPH_DEPLOY'/mon/mon.log
>>> keyring='$CEPH_DEPLOY'/ceph.client.admin.keyring
>>> run dir = '$CEPH_DEPLOY'/run
>>> [osd.0]
>>> osd data = '$CEPH_DEPLOY'/osd/myosddata
>>> osd journal = '$CEPH_DEPLOY'/osd/myosddata/myosdjournal
>>> #osd journal = '$WORKSPACE'/myosdjournal/myosdjournal
>>> log file = '$CEPH_DEPLOY'/osd/osd.log
end of thread, other threads:[~2016-03-10 0:29 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-26 17:46 Ceph space amplification ratio with write is more than 2 James (Fei) Liu-SSI
2016-02-26 21:44 ` Somnath Roy
2016-03-01 1:10 ` James (Fei) Liu-SSI
2016-02-27 1:27 ` Haomai Wang
2016-02-27 1:58 ` Somnath Roy
2016-02-27 5:50 ` Haomai Wang
2016-02-29 6:47 ` Ning Yao
2016-03-10 0:29 ` Jianjian Huo