* Ceph space amplification ratio with write is more than 2
@ 2016-02-26 17:46 James (Fei) Liu-SSI
  2016-02-26 21:44 ` Somnath Roy
  2016-02-27  1:27 ` Haomai Wang
  0 siblings, 2 replies; 8+ messages in thread
From: James (Fei) Liu-SSI @ 2016-02-26 17:46 UTC (permalink / raw)
  To: ceph-devel

Hi Cephers,
   We recently tested Ceph space amplification with FileStore by writing data to a ramdisk with rados bench. However, rados bench wrote only 3584 MB, yet 8658 MB of the ramdisk was used (1 GB of the 9 GB ramdisk is used for the journal).

The total space amplification is 7658/3584 = 2.14, which is a surprisingly large factor.
1. Cluster configuration:
. One OSD and one MON are on the same machine as rados bench. The replication factor was set to 1.
. ceph version from ceph master 5979c8e34fa2f3d7efa28c29fb90758b3f9f818 (45979c8e34fa2f3d7efa28c29fb90758b3f9f818)
2. The rados command we used:
rados bench -p rbd -b 4096 --max-objects 1048576 300 write --no-cleanup
3. ceph cluster configuration:
Please see appendix 1.
4. Results investigation.
Please see appendix 0.
    
Could anyone help explain why the space amplification with FileStore is so huge? Thanks a lot.

Regards,
James

Appendix 0:

ssd@OptiPlex-9020-1:~/src/bluestore$ ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    9206M      547M        8658M         94.05
POOLS:
    NAME     ID     USED      %USED     MAX AVAIL     OBJECTS
    rbd      0      3584M     38.93          547M      917505
ssd@OptiPlex-9020-1:~/src/bluestore$ ceph -s
    cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
     health HEALTH_WARN
            1 near full osd(s)
     monmap e1: 1 mons at {localhost=127.0.0.1:6789/0}
            election epoch 3, quorum 0 localhost
     osdmap e7: 1 osds: 1 up, 1 in
            flags sortbitwise
      pgmap v31: 64 pgs, 1 pools, 3584 MB data, 896 kobjects
            8658 MB used, 547 MB / 9206 MB avail
                  64 active+clean
ssd@OptiPlex-9020-1:~/src/bluestore$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       2.7T  1.4T  1.2T  55% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            5.9G  4.0K  5.9G   1% /dev
tmpfs           1.2G  720K  1.2G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            5.9G   81M  5.8G   2% /run/shm
none            100M   20K  100M   1% /run/user
/dev/ram1       9.0G  8.4G  618M  94% /home/ssd/src/bluestore/ceph-deploy/osd/myosddata

Journal size (ls -l of the OSD data directory):

-rw-r--r--   1 ssd ssd         37 Feb 25 15:53 ceph_fsid
drwxr-xr-x 132 ssd ssd       4096 Feb 25 15:53 current
-rw-r--r--   1 ssd ssd         37 Feb 25 15:53 fsid
-rw-r--r--   1 ssd ssd         21 Feb 25 15:53 magic
-rw-r--r--   1 ssd ssd 1073741824 Feb 25 15:53 myosdjournal
-rw-r--r--   1 ssd ssd          6 Feb 25 15:53 ready
-rw-r--r--   1 ssd ssd          4 Feb 25 15:53 store_version
-rw-r--r--   1 ssd ssd         53 Feb 25 15:53 superblock
-rw-r--r--   1 ssd ssd         10 Feb 25 15:53 type
-rw-r--r--   1 ssd ssd          2 Feb 25 15:53 whoami

du -sh current
It shows the current/ directory uses about 8 GB.
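
To see where the rest of the space goes, it can help to break the OSD data directory down by subdirectory. A rough sketch (the path matches the /dev/ram1 mount point above; the subdirectory names assume the usual FileStore layout and may differ on other versions):

cd /home/ssd/src/bluestore/ceph-deploy/osd/myosddata
du -sh myosdjournal                  # the 1 GB journal file
du -sh current/omap                  # leveldb used for omap data
du -sh current/meta                  # osdmap, pg metadata, etc.
du -sh current/*_head 2>/dev/null | sort -h | tail -5   # largest PG directories
df -i .                              # ~917k object files also consume ~917k inodes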

Appendix 1:
[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
filestore_max_sync_interval=10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
filestore_op_threads = 6
filestore_queue_max_ops=5000
filestore_queue_committing_max_ops=5000
journal_max_write_entries=1000
journal_queue_max_ops=3000
filestore_wbthrottle_enable=false
filestore_queue_max_bytes=1048576000
filestore_queue_committing_max_bytes=1048576000
journal_max_write_bytes=1048576000
journal_queue_max_bytes=1048576000

osd_journal_size = 1024
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
debug_newstore = 0/0
debug_keyvaluestore = 0/0
osd_tracing = true
osd_objectstore_tracing = true
rados_tracing = true
rbd_tracing = true
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs = -f -i size=2048
osd_op_threads = 32
objecter_inflight_ops=102400
ms_dispatch_throttle_bytes=1048576000
objecter_infilght_op_bytes=1048576000
osd_mkfs_type = xfs
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
mon_initial_members = localhost
mon_host = 127.0.0.1
mon data = '$CEPH_DEPLOY'/mon/mymondata
mon cluster log file = '$CEPH_DEPLOY'/mon/mon.log
keyring='$CEPH_DEPLOY'/ceph.client.admin.keyring
run dir = '$CEPH_DEPLOY'/run
[osd.0]
osd data = '$CEPH_DEPLOY'/osd/myosddata
osd journal = '$CEPH_DEPLOY'/osd/myosddata/myosdjournal
#osd journal = '$WORKSPACE'/myosdjournal/myosdjournal
log file = '$CEPH_DEPLOY'/osd/osd.log'
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Ceph space amplification ratio with write is more than 2
  2016-02-26 17:46 Ceph space amplification ratio with write is more than 2 James (Fei) Liu-SSI
@ 2016-02-26 21:44 ` Somnath Roy
  2016-03-01  1:10   ` James (Fei) Liu-SSI
  2016-02-27  1:27 ` Haomai Wang
  1 sibling, 1 reply; 8+ messages in thread
From: Somnath Roy @ 2016-02-26 21:44 UTC (permalink / raw)
  To: James (Fei) Liu-SSI, ceph-devel

Hmm... not *true* in my setup, and in fact this can't be true with Ceph. Ceph's WA should be ~2.3-2.4 considering the journal write; excluding the journal write it should be well below 1.

Single OSD, journal (10G) and data on the same SSD but in different partitions; one 200G rbd image.

root@emsnode10:~/sjust# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       304G  168G  121G  59% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev             32G   12K   32G   1% /dev
tmpfs           6.3G  1.3M  6.3G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none             32G  8.0K   32G   1% /run/shm
none            100M     0  100M   0% /run/user

/dev/sdj2       7.0T  201G  6.8T   3% /var/lib/ceph/osd/ceph-0
root@emsnode10:~/sjust# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    7141G     6941G         200G          2.80
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      1      200G      2.80         6941G       51203
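
From those numbers the space amplification excluding the journal is roughly 201G / 200G ~= 1.0. A quick way to sanity-check the ratio on any setup (a rough sketch; plug in the RAW USED and pool USED values from your own ceph df, and subtract the journal only if it lives on the same filesystem):

raw_used_gib=201     # RAW USED from 'ceph df' (journal is on a separate partition here)
pool_data_gib=200    # USED for the rbd pool
echo "scale=2; $raw_used_gib / $pool_data_gib" | bc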


Must be something we are missing in your case.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Friday, February 26, 2016 9:47 AM
To: ceph-devel@vger.kernel.org
Subject: Ceph space amplification ratio with write is more than 2

[original message quoted in full; trimmed]


* Re: Ceph space amplification ratio with write is more than 2
  2016-02-26 17:46 Ceph space amplification ratio with write is more than 2 James (Fei) Liu-SSI
  2016-02-26 21:44 ` Somnath Roy
@ 2016-02-27  1:27 ` Haomai Wang
  2016-02-27  1:58   ` Somnath Roy
  1 sibling, 1 reply; 8+ messages in thread
From: Haomai Wang @ 2016-02-27  1:27 UTC (permalink / raw)
  To: James (Fei) Liu-SSI; +Cc: ceph-devel

It should be as expected. The reason is that FileJournal prepends a header and appends a footer to the transaction data, and when using aio+dio we need to align these. So normally a 4k IO becomes a 5-6k FileJournal write, IIRC. Add the FileStore writeback to the filesystem and it should be about 2.5x. But the local filesystem and disk should do some merging tricks, so it should end up less than 2.5x.
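
A rough sketch of that arithmetic, purely for illustration (the per-entry overhead and the alignment granularity are assumptions, not the exact FileJournal on-disk format):

client_write=4096
overhead=100        # assumed journal entry header + footer bytes
for align in 512 4096; do
  entry=$(( ( (client_write + overhead + align - 1) / align ) * align ))
  total=$(( entry + client_write ))          # journal entry + FileStore writeback
  amp=$(echo "scale=2; $total / $client_write" | bc)
  echo "align=${align}B: journal entry=${entry}B, total=${total}B, amp=${amp}x"
done

With 512-byte alignment a 4k write costs roughly 2.1x; with 4k alignment it comes closer to 3x, which is the case discussed further down the thread.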

On Sat, Feb 27, 2016 at 1:46 AM, James (Fei) Liu-SSI
<james.liu@ssi.samsung.com> wrote:
> [original message quoted in full; trimmed]


* RE: Ceph space amplification ratio with write is more than 2
  2016-02-27  1:27 ` Haomai Wang
@ 2016-02-27  1:58   ` Somnath Roy
  2016-02-27  5:50     ` Haomai Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Somnath Roy @ 2016-02-27  1:58 UTC (permalink / raw)
  To: Haomai Wang, James (Fei) Liu-SSI; +Cc: ceph-devel

Haomai,
I think what James is seeing is > 2X write amp even excluding the journal write. There must be something fishy in his setup.

Thanks & Regards
Somnath
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Friday, February 26, 2016 5:27 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ceph space amplification ratio with write is more than 2

[quoted messages trimmed]


* Re: Ceph space amplification ratio with write is more than 2
  2016-02-27  1:58   ` Somnath Roy
@ 2016-02-27  5:50     ` Haomai Wang
  2016-02-29  6:47       ` Ning Yao
  2016-03-10  0:29       ` Jianjian Huo
  0 siblings, 2 replies; 8+ messages in thread
From: Haomai Wang @ 2016-02-27  5:50 UTC (permalink / raw)
  To: Somnath Roy; +Cc: James (Fei) Liu-SSI, ceph-devel

On Sat, Feb 27, 2016 at 9:58 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Haomai,
> I think what James is seeing is > 2X write amp even excluding the journal write. There must be something fishy in his setup.

Oh, thanks for the reminder. Hmm, I double-checked the current FileJournal; it should be aligned to 4k... so it could be possible too...

> [remainder of quoted message trimmed]


* Re: Ceph space amplification ratio with write is more than 2
  2016-02-27  5:50     ` Haomai Wang
@ 2016-02-29  6:47       ` Ning Yao
  2016-03-10  0:29       ` Jianjian Huo
  1 sibling, 0 replies; 8+ messages in thread
From: Ning Yao @ 2016-02-29  6:47 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Somnath Roy, James (Fei) Liu-SSI, ceph-devel

Yeah, I think James is referring to space amplification, not write amplification, so this has nothing to do with the journal.
It may instead relate to the xfs filesystem layout: there are about 1M objects, and we currently run mkfs.xfs with -i size=2048, which explicitly sets each inode size to 2k. Also, the current/omap directory costs some space that is not counted in the pool stats.
You may want to try a larger real disk instead of a ramdisk; I think the same thing will happen again.
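
As a rough back-of-the-envelope check of that theory (illustrative assumptions only; real xfs usage also includes directory blocks, the xfs log, free-space metadata and so on):

objects=917505           # object count from 'ceph df' above
data_per_obj=4096        # one 4 KiB data block per 4 KiB object file
inode_per_obj=2048       # from osd_mkfs_options_xfs = -f -i size=2048
echo "$(( objects * (data_per_obj + inode_per_obj) / 1024 / 1024 )) MiB for data blocks + inodes"
du -sh /home/ssd/src/bluestore/ceph-deploy/osd/myosddata/current/omap   # leveldb omap size

That alone is roughly 5.3 GiB; add the 1 GiB journal, omap, pg metadata and filesystem overhead, and it goes a long way toward the 8.4 GB reported by df.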
Regards
Ning Yao


2016-02-27 13:50 GMT+08:00 Haomai Wang <haomai@xsky.com>:
> [quoted message trimmed]


* RE: Ceph space amplification ratio with write is more than 2
  2016-02-26 21:44 ` Somnath Roy
@ 2016-03-01  1:10   ` James (Fei) Liu-SSI
  0 siblings, 0 replies; 8+ messages in thread
From: James (Fei) Liu-SSI @ 2016-03-01  1:10 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Hi Somnath,
   Thanks for your quick answer. Would you mind sharing the amount of data you issued with the rados bench command below?

   rados bench -p rbd -b 4096 --max-objects 1048576 $2 300  write --no-cleanup

   Regards,
   James
-----Original Message-----
From: Somnath Roy [mailto:Somnath.Roy@sandisk.com] 
Sent: Friday, February 26, 2016 1:44 PM
To: James (Fei) Liu-SSI; ceph-devel@vger.kernel.org
Subject: RE: Ceph space amplification ratio with write is more than 2 

[quoted message trimmed]


* Re: Ceph space amplification ratio with write is more than 2
  2016-02-27  5:50     ` Haomai Wang
  2016-02-29  6:47       ` Ning Yao
@ 2016-03-10  0:29       ` Jianjian Huo
  1 sibling, 0 replies; 8+ messages in thread
From: Jianjian Huo @ 2016-03-10  0:29 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Somnath Roy, James (Fei) Liu-SSI, ceph-devel

Just saw this thread...

If we are talking about application write amp, I observed different WAF values for different Ceph workloads. For object workloads (rados bench writing an empty drive to full), 4MB object writes have about 2X write amp, with ceph journal writes just a little more than the data; 4KB object writes have more than 10X write amp. For block workloads (fio_rbd), 4KB block writes have about 5~6X write amp (including ceph journal, data files, leveldb, xfs journal, xfs inodes, ...), depending on the ceph and xfs configs.

For space amp, the block workload is close to 1.

This is based on Hammer, a 480GB SSD, a single OSD with data and a 5G journal on the same drive, and a 400G rbd image.
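
For anyone who wants to reproduce this kind of measurement, a rough sketch (the device name is a placeholder; the accounting uses /proc/diskstats field 10, which reports sectors written in 512-byte units; adjust for your setup):

dev=sdX                                   # device hosting OSD data + journal
before=$(awk -v d="$dev" '$3==d {print $10}' /proc/diskstats)   # sectors written so far
rados bench -p rbd -b 4096 --max-objects 1048576 300 write --no-cleanup
after=$(awk -v d="$dev" '$3==d {print $10}' /proc/diskstats)
echo "device bytes written: $(( (after - before) * 512 ))"
# divide by (total writes made * 4096) from the rados bench summary to get the WAF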

On Fri, Feb 26, 2016 at 9:50 PM, Haomai Wang <haomai@xsky.com> wrote:
> [quoted message trimmed]


Thread overview: 8+ messages
2016-02-26 17:46 Ceph space amplification ratio with write is more than 2 James (Fei) Liu-SSI
2016-02-26 21:44 ` Somnath Roy
2016-03-01  1:10   ` James (Fei) Liu-SSI
2016-02-27  1:27 ` Haomai Wang
2016-02-27  1:58   ` Somnath Roy
2016-02-27  5:50     ` Haomai Wang
2016-02-29  6:47       ` Ning Yao
2016-03-10  0:29       ` Jianjian Huo
