* RE: Regarding newstore performance
From: Somnath Roy @ 2015-04-15  6:01 UTC
  To: ceph-devel

Hi Sage/Mark,
I ran a write-amplification (WA) experiment with newstore using the settings I mentioned yesterday.

Test:
-------

64K random writes at QD 64, writing a total of 1 TB of data.
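
For reference, here is a minimal fio job sketch matching the run above and the output header below; the pool, image, and client names are placeholders rather than my exact file:

[global]
; 64K random writes at QD 64 through librbd, per the test description above
ioengine=rbd
clientname=admin
pool=rbd
rbdname=test_img
rw=randwrite
bs=64k
iodepth=64

[rbd_iodepth32]
; stop after 1 TB has been written, matching io=1000.0GB in the output
size=1000g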


Newstore:
------------

Fio output at the end of 1 TB write.
-------------------------------------------

rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64
fio-2.1.11-20-g9a44
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
  write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
    slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
    clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
     lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
    clat percentiles (msec):
     |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
     | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
     | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
     | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
     | 99.99th=[ 1270]
    bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
    lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
    lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
  cpu          : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, mint=21421419msec, maxt=21421419msec


So, the IOPS we are getting is ~764.
The 99th percentile latency is about 100 ms.

Write amplification at disk level:
--------------------------------------

SanDisk SSDs have disk-level counters that report the number of host writes in units of the flash logical page size and the number of actual flash writes in the same units. Comparing the two gives the actual WA the workload is causing at the disk.

Please find the data in the following xls.

https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlxcX5TLMRzdXyJE/edit?usp=sharing

Total host writes in this period = 923896266

Total flash writes in this period = 1465339040


FileStore:
-------------

Fio output at the end of 1 TB write.
-------------------------------------------

rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64
fio-2.1.11-20-g9a44
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 00m:01s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
  write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
    slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
    clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
     lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
     | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
     | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
     | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
     | 99.99th=[ 1647]
    bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
    lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
    lat (msec) : 2000=0.06%, >=2000=0.01%
  cpu          : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, mint=10636117msec, maxt=10636117msec

Disk stats (read/write):
  sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%

So, the IOPS here is ~1540.
The 99th percentile latency should be within 50 ms.


Write amplification at disk level:
--------------------------------------

Total host writes in this period = 643611346

Total flash writes in this period = 1157304512



https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61Fz49CLH8WPh7Q/edit?pli=1#gid=95373000





Summary:
------------

1. The performance is roughly doubled with filestore, and the latency is almost half.

2. The total number of flash writes is affected by both the application write pattern and the FTL logic, etc., so I am not going into that. The thing to note is the significant increase in host writes with newstore, which is definitely causing extra WA compared to filestore.

3. Considering a flash page size of 4K, the total host writes for a 1000 GB fio write come to ~2455 GB with filestore vs ~3524 GB with newstore (the arithmetic is sketched right after this list). So the WA of filestore is ~2.4 vs ~3.5 for newstore. Considering filestore's inherent 2X WA from journaling, it is doing pretty well here.
     Now, newstore is not supposed to write a WAL for new (non-overwrite) writes, so it will be interesting to see what percentage of the incoming writes are new. I will analyze that.

4. If you open my xls and the graphs above, you can see that for newstore the host writes and flash writes are initially very similar, and then they jump higher. I am not sure why. I will rerun the tests to confirm the phenomenon.

5. The cumulative flash write vs. cumulative host write graph shows the actual WA (host + firmware) caused by the workload.
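
Here is the arithmetic behind point 3 as a small Python sketch; the only assumption is that the disk counters are in units of a 4 KiB flash logical page:

    # Convert the raw disk counters (4 KiB flash logical pages) into GiB and WA.
    PAGE = 4096
    GIB = 1024 ** 3
    fio_written_gb = 1000.0                      # data written by fio

    for store, host_pages in (("newstore", 923896266), ("filestore", 643611346)):
        host_gb = host_pages * PAGE / GIB        # what the host actually wrote to the disk
        wa = host_gb / fio_written_gb
        print(f"{store}: host writes ~{host_gb:.0f} GB, WA ~{wa:.2f}")

    # newstore:  host writes ~3524 GB, WA ~3.52
    # filestore: host writes ~2455 GB, WA ~2.46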

What's next:
---------------

1. Understand why newstore shows a WA of ~3.5.

2. Try different RocksDB tunings and record the impact.


Any feedback/suggestion is much appreciated.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Monday, April 13, 2015 4:54 PM
To: ceph-devel
Subject: Regarding newstore performance

Sage,
I have been doing some preliminary performance testing of newstore on a single-OSD (SSD), single-replication setup. Here are my findings so far.

Test:
-----

        64K random writes with QD= 64 using fio_rbd.

Results :
----------

        1. With all default settings, I am seeing very spiky performance. fio reports anywhere between 0 and ~1K random write IOPS, and IO frequently stalls at 0 for stretches. I tried a bigger overlay max size value, but the results are similar.

        2. Next I set newstore_overlay_max = 0 and got pretty stable performance of ~800-900 IOPS (though the write duration was short).

        3. I tried tweaking all the settings one by one, but saw little benefit anywhere.

        4. One interesting observation: in my setup, if I set newstore_sync_queue_transaction = true, I get ~600-700 IOPS, which is ~100 less (the knobs involved are sketched in the config snippet after this list).
             This is quite contrary to my keyvaluestore experiment, where I got a ~3X improvement by doing sync writes!

        5. Filestore performance in a similar setup is ~1.6K IOPS after writing 1 TB of data.
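
As referenced in point 4, the experiment looks roughly like this in ceph.conf. This is a sketch of the knobs named above, not my exact config, and the osd objectstore line is my assumption about how the backend is selected on the newstore branch:

        [osd]
            ; assumption: this is how the newstore branch selects the backend
            osd objectstore = newstore
            ; point 2: overlay writes disabled
            newstore overlay max = 0
            ; point 4: setting this to true was ~100 IOPS slower here
            newstore sync queue transaction = false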

I am trying to figure out from the code what exactly these overlay writes do. Any insight/explanation would be helpful here.

I am planning to do some more experiments with newstore, including a WA comparison between filestore and newstore. I will publish the results soon.

Thanks & Regards
Somnath








* Re: Regarding newstore performance
From: Haomai Wang @ 2015-04-15 12:23 UTC
  To: Somnath Roy; +Cc: ceph-devel

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Sage/Mark,
> I did some WA experiment with newstore with the similar settings I mentioned yesterday.
>
> Test:
> -------
>
> 64K Random write with 64 QD and writing total of 1 TB of data.
>
>
> Newstore:
> ------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
>   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
>     slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
>     clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
>      lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
>     clat percentiles (msec):
>      |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
>      | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
>      | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
>      | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
>      | 99.99th=[ 1270]
>     bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
>     lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
>     lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
>   cpu          : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, mint=21421419msec, maxt=21421419msec
>
>
> So, iops getting is ~764.
> 99th percentile latency should be 100ms.
>
> Write amplification at disk level:
> --------------------------------------
>
> SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk.
>
> Please find the data in the following xls.
>
> https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlxcX5TLMRzdXyJE/edit?usp=sharing
>
> Total host writes in this period = 923896266
>
> Total flash writes in this period = 1465339040
>
>
> FileStore:
> -------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 00m:01s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
>   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
>     slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
>     clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
>      lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
>     clat percentiles (msec):
>      |  1.00th=[    7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
>      | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
>      | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
>      | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
>      | 99.99th=[ 1647]
>     bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00
>     lat (usec) : 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
>     lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
>     lat (msec) : 2000=0.06%, >=2000=0.01%
>   cpu          : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, mint=10636117msec, maxt=10636117msec
>
> Disk stats (read/write):
>   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%
>
> So, iops here is ~1500.
> 99th percentile latency should be within 50ms
>
>
> Write amplification at disk level:
> --------------------------------------
>
> Total host writes in this period = 643611346
>
> Total flash writes in this period = 1157304512
>
>
>
> https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61Fz49CLH8WPh7Q/edit?pli=1#gid=95373000
>
>
>
>
>
> Summary:
> ------------
>
> 1.  The performance is doubled in case of filestore and latency is almost half.
>
> 2. Total number of flash writes is impacted by by both application write pattern + FTL logic etc. etc. So, I am not going into that.  Things to note the significant increase of host writes with newstore and that's definitely causing extra WA compare to  filestore.
>

Yeah, it seems that xfs handles the writeback well.

> 3. Considering flash page size = 4K, the total writes in case of filestore = 2455 GB with a 1000 GB fio write vs 3524 GB with newstore. So, WA of filestore is ~2.4 vs ~3.5 in case of newstore. Considering inherent 2X WA for filestore, it is doing pretty good here.
>      Now, in case of newstore , it is not supposed to write WAL in case of new writes. It will be interesting to see % of new writes coming..Will analyze that..
>

I think it probably comes from the kvdb. Maybe we can separate newstore's
data dir and kvdb dir so that we can measure the difference with separate
disk counters.

> 4. If you can open my xls and graphs above, you can see initially host writes and flash writes are very similar in case of newstore and then it jumps high. Not sure why though. I will rerun the tests to confirm similar phenomenon.
>
> 5. The cumulative flash write  and cumulative host write graph is the actual WA (host + FW) caused by the write.
>

I'm interested in the flash write and disk write counters. Is it an
internal tool you use to inspect them, or an open-source tool?

> What's next:
> ---------------
>
> 1. Need to understand why 3.5 WA for newstore.
>
> 2. Try with different  Rocksdb tuning and record the impact.
>
>
> Any feedback/suggestion is much appreciated.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, April 13, 2015 4:54 PM
> To: ceph-devel
> Subject: Regarding newstore performance
>
> Sage,
> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>
> Test:
> -----
>
>         64K random writes with QD= 64 using fio_rbd.
>
> Results :
> ----------
>
>         1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>
>         2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>
>         3. I tried to tweak all the settings one by one but not much benefit anywhere.
>
>         4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>              This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>
>         5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>
> I am trying to figure out from the code what exactly this overlay writes does. Any insight/explanation would be helpful here.
>
> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>
> Thanks & Regards
> Somnath
>
>
>
>
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


* RE: Regarding newstore performance
From: Somnath Roy @ 2015-04-15 16:07 UTC
  To: Haomai Wang; +Cc: ceph-devel

Haomai,
Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA.
This is a tool I have written on top of those disk counters. I can share it, but you need a SanDisk Optimus Echo (or Max) drive to make it work :-)

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com] 
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Sage/Mark,
> I did some WA experiment with newstore with the similar settings I mentioned yesterday.
>
> Test:
> -------
>
> 64K Random write with 64 QD and writing total of 1 TB of data.
>
>
> Newstore:
> ------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
> iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
>   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
>     slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
>     clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
>      lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
>     clat percentiles (msec):
>      |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
>      | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
>      | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
>      | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
>      | 99.99th=[ 1270]
>     bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
>     lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
>     lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
>   cpu          : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
> mint=21421419msec, maxt=21421419msec
>
>
> So, iops getting is ~764.
> 99th percentile latency should be 100ms.
>
> Write amplification at disk level:
> --------------------------------------
>
> SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk.
>
> Please find the data in the following xls.
>
> https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
> cX5TLMRzdXyJE/edit?usp=sharing
>
> Total host writes in this period = 923896266
>
> Total flash writes in this period = 1465339040
>
>
> FileStore:
> -------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
> iops] [eta 00m:01s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
>   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
>     slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
>     clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
>      lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
>     clat percentiles (msec):
>      |  1.00th=[    7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
>      | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
>      | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
>      | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
>      | 99.99th=[ 1647]
>     bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00
>     lat (usec) : 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
>     lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
>     lat (msec) : 2000=0.06%, >=2000=0.01%
>   cpu          : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
> mint=10636117msec, maxt=10636117msec
>
> Disk stats (read/write):
>   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%
>
> So, iops here is ~1500.
> 99th percentile latency should be within 50ms
>
>
> Write amplification at disk level:
> --------------------------------------
>
> Total host writes in this period = 643611346
>
> Total flash writes in this period = 1157304512
>
>
>
> https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61
> Fz49CLH8WPh7Q/edit?pli=1#gid=95373000
>
>
>
>
>
> Summary:
> ------------
>
> 1.  The performance is doubled in case of filestore and latency is almost half.
>
> 2. Total number of flash writes is impacted by by both application write pattern + FTL logic etc. etc. So, I am not going into that.  Things to note the significant increase of host writes with newstore and that's definitely causing extra WA compare to  filestore.
>

Yeah, it seemed that xfs plays well when writing back.

> 3. Considering flash page size = 4K, the total writes in case of filestore = 2455 GB with a 1000 GB fio write vs 3524 GB with newstore. So, WA of filestore is ~2.4 vs ~3.5 in case of newstore. Considering inherent 2X WA for filestore, it is doing pretty good here.
>      Now, in case of newstore , it is not supposed to write WAL in case of new writes. It will be interesting to see % of new writes coming..Will analyze that..
>

I think it should result from kvdb. Maybe we can separate newstore's data dir and kvdb dir. So we can measure the difference with different disk counter.

> 4. If you can open my xls and graphs above, you can see initially host writes and flash writes are very similar in case of newstore and then it jumps high. Not sure why though. I will rerun the tests to confirm similar phenomenon.
>
> 5. The cumulative flash write  and cumulative host write graph is the actual WA (host + FW) caused by the write.
>

I'm interested in the flash write and disk write counter, is it a internal tool to inspect or it's a opensource tool?

> What's next:
> ---------------
>
> 1. Need to understand why 3.5 WA for newstore.
>
> 2. Try with different  Rocksdb tuning and record the impact.
>
>
> Any feedback/suggestion is much appreciated.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, April 13, 2015 4:54 PM
> To: ceph-devel
> Subject: Regarding newstore performance
>
> Sage,
> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>
> Test:
> -----
>
>         64K random writes with QD= 64 using fio_rbd.
>
> Results :
> ----------
>
>         1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>
>         2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>
>         3. I tried to tweak all the settings one by one but not much benefit anywhere.
>
>         4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>              This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>
>         5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>
> I am trying to figure out from the code what exactly this overlay writes does. Any insight/explanation would be helpful here.
>
> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>
> Thanks & Regards
> Somnath
>
>
>
>
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html



--
Best Regards,

Wheat


* RE: Regarding newstore performance
From: Chen, Xiaoxi @ 2015-04-16  1:47 UTC
  To: Somnath Roy, Haomai Wang; +Cc: ceph-devel

Hi Somnath,
     You could try applying this one :)
     https://github.com/ceph/ceph/pull/4356

      By the way, the previous RocksDB configuration had a bug that set rocksdb_disableDataSync to true by default, which may cause data loss on failure. So please update newstore to the latest, or manually set it to false. I suspect the KVDB performance will be worse after doing this... but that's the way we need to go.
										Xiaoxi

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, April 16, 2015 12:07 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Hoamai,
Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA.
This tool I have written on top of these disk counters. I can share that but you need SanDisk optimus echo (or max) drive to make it work :-)

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Sage/Mark,
> I did some WA experiment with newstore with the similar settings I mentioned yesterday.
>
> Test:
> -------
>
> 64K Random write with 64 QD and writing total of 1 TB of data.
>
>
> Newstore:
> ------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
> iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
>   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
>     slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
>     clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
>      lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
>     clat percentiles (msec):
>      |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
>      | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
>      | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
>      | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
>      | 99.99th=[ 1270]
>     bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
>     lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
>     lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
>   cpu          : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
> mint=21421419msec, maxt=21421419msec
>
>
> So, iops getting is ~764.
> 99th percentile latency should be 100ms.
>
> Write amplification at disk level:
> --------------------------------------
>
> SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk.
>
> Please find the data in the following xls.
>
> https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
> cX5TLMRzdXyJE/edit?usp=sharing
>
> Total host writes in this period = 923896266
>
> Total flash writes in this period = 1465339040
>
>
> FileStore:
> -------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
> iops] [eta 00m:01s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
>   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
>     slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
>     clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
>      lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
>     clat percentiles (msec):
>      |  1.00th=[    7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
>      | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
>      | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
>      | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
>      | 99.99th=[ 1647]
>     bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00
>     lat (usec) : 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
>     lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
>     lat (msec) : 2000=0.06%, >=2000=0.01%
>   cpu          : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
> mint=10636117msec, maxt=10636117msec
>
> Disk stats (read/write):
>   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%
>
> So, iops here is ~1500.
> 99th percentile latency should be within 50ms
>
>
> Write amplification at disk level:
> --------------------------------------
>
> Total host writes in this period = 643611346
>
> Total flash writes in this period = 1157304512
>
>
>
> https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61
> Fz49CLH8WPh7Q/edit?pli=1#gid=95373000
>
>
>
>
>
> Summary:
> ------------
>
> 1.  The performance is doubled in case of filestore and latency is almost half.
>
> 2. Total number of flash writes is impacted by by both application write pattern + FTL logic etc. etc. So, I am not going into that.  Things to note the significant increase of host writes with newstore and that's definitely causing extra WA compare to  filestore.
>

Yeah, it seemed that xfs plays well when writing back.

> 3. Considering flash page size = 4K, the total writes in case of filestore = 2455 GB with a 1000 GB fio write vs 3524 GB with newstore. So, WA of filestore is ~2.4 vs ~3.5 in case of newstore. Considering inherent 2X WA for filestore, it is doing pretty good here.
>      Now, in case of newstore , it is not supposed to write WAL in case of new writes. It will be interesting to see % of new writes coming..Will analyze that..
>

I think it should result from kvdb. Maybe we can separate newstore's data dir and kvdb dir. So we can measure the difference with different disk counter.

> 4. If you can open my xls and graphs above, you can see initially host writes and flash writes are very similar in case of newstore and then it jumps high. Not sure why though. I will rerun the tests to confirm similar phenomenon.
>
> 5. The cumulative flash write  and cumulative host write graph is the actual WA (host + FW) caused by the write.
>

I'm interested in the flash write and disk write counter, is it a internal tool to inspect or it's a opensource tool?

> What's next:
> ---------------
>
> 1. Need to understand why 3.5 WA for newstore.
>
> 2. Try with different  Rocksdb tuning and record the impact.
>
>
> Any feedback/suggestion is much appreciated.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, April 13, 2015 4:54 PM
> To: ceph-devel
> Subject: Regarding newstore performance
>
> Sage,
> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>
> Test:
> -----
>
>         64K random writes with QD= 64 using fio_rbd.
>
> Results :
> ----------
>
>         1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>
>         2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>
>         3. I tried to tweak all the settings one by one but not much benefit anywhere.
>
>         4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>              This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>
>         5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>
> I am trying to figure out from the code what exactly this overlay writes does. Any insight/explanation would be helpful here.
>
> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>
> Thanks & Regards
> Somnath
>
>
>
>
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html



--
Best Regards,

Wheat


* RE: Regarding newstore performance
From: Somnath Roy @ 2015-04-16  4:22 UTC
  To: Chen, Xiaoxi, Haomai Wang; +Cc: ceph-devel

Thanks Xiaoxi..
But I have already started a test with db/ symlinked to another SSD. I will share the results soon.

Regards
Somnath

-----Original Message-----
From: Chen, Xiaoxi [mailto:xiaoxi.chen@intel.com] 
Sent: Wednesday, April 15, 2015 6:48 PM
To: Somnath Roy; Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Hi Somnath,
     You could try apply this one:)
     https://github.com/ceph/ceph/pull/4356

      BTW the previous RocksDB configuration has a bug that set rocksdb_disableDataSync to true by default, which may cause data loss in failure. So pls update the newstore to latest or manually set it to false. I suspect the KVDB performance will be worse after doing this...but that's the way we need to go.
										Xiaoxi

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, April 16, 2015 12:07 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Hoamai,
Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA.
This tool I have written on top of these disk counters. I can share that but you need SanDisk optimus echo (or max) drive to make it work :-)

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Sage/Mark,
> I did some WA experiment with newstore with the similar settings I mentioned yesterday.
>
> Test:
> -------
>
> 64K Random write with 64 QD and writing total of 1 TB of data.
>
>
> Newstore:
> ------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
> iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
>   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
>     slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
>     clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
>      lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
>     clat percentiles (msec):
>      |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
>      | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
>      | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
>      | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
>      | 99.99th=[ 1270]
>     bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
>     lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
>     lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
>   cpu          : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
> mint=21421419msec, maxt=21421419msec
>
>
> So, iops getting is ~764.
> 99th percentile latency should be 100ms.
>
> Write amplification at disk level:
> --------------------------------------
>
> SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk.
>
> Please find the data in the following xls.
>
> https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
> cX5TLMRzdXyJE/edit?usp=sharing
>
> Total host writes in this period = 923896266
>
> Total flash writes in this period = 1465339040
>
>
> FileStore:
> -------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
> iops] [eta 00m:01s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
>   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
>     slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
>     clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
>      lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
>     clat percentiles (msec):
>      |  1.00th=[    7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
>      | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
>      | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
>      | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
>      | 99.99th=[ 1647]
>     bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00
>     lat (usec) : 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
>     lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
>     lat (msec) : 2000=0.06%, >=2000=0.01%
>   cpu          : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
> mint=10636117msec, maxt=10636117msec
>
> Disk stats (read/write):
>   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%
>
> So, iops here is ~1500.
> 99th percentile latency should be within 50ms
>
>
> Write amplification at disk level:
> --------------------------------------
>
> Total host writes in this period = 643611346
>
> Total flash writes in this period = 1157304512
>
>
>
> https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61
> Fz49CLH8WPh7Q/edit?pli=1#gid=95373000
>
>
>
>
>
> Summary:
> ------------
>
> 1.  The performance is doubled in case of filestore and latency is almost half.
>
> 2. Total number of flash writes is impacted by by both application write pattern + FTL logic etc. etc. So, I am not going into that.  Things to note the significant increase of host writes with newstore and that's definitely causing extra WA compare to  filestore.
>

Yeah, it seemed that xfs plays well when writing back.

> 3. Considering flash page size = 4K, the total writes in case of filestore = 2455 GB with a 1000 GB fio write vs 3524 GB with newstore. So, WA of filestore is ~2.4 vs ~3.5 in case of newstore. Considering inherent 2X WA for filestore, it is doing pretty good here.
>      Now, in case of newstore , it is not supposed to write WAL in case of new writes. It will be interesting to see % of new writes coming..Will analyze that..
>

I think it should result from kvdb. Maybe we can separate newstore's data dir and kvdb dir. So we can measure the difference with different disk counter.

> 4. If you can open my xls and graphs above, you can see initially host writes and flash writes are very similar in case of newstore and then it jumps high. Not sure why though. I will rerun the tests to confirm similar phenomenon.
>
> 5. The cumulative flash write  and cumulative host write graph is the actual WA (host + FW) caused by the write.
>

I'm interested in the flash write and disk write counter, is it a internal tool to inspect or it's a opensource tool?

> What's next:
> ---------------
>
> 1. Need to understand why 3.5 WA for newstore.
>
> 2. Try with different  Rocksdb tuning and record the impact.
>
>
> Any feedback/suggestion is much appreciated.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, April 13, 2015 4:54 PM
> To: ceph-devel
> Subject: Regarding newstore performance
>
> Sage,
> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>
> Test:
> -----
>
>         64K random writes with QD= 64 using fio_rbd.
>
> Results :
> ----------
>
>         1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>
>         2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>
>         3. I tried to tweak all the settings one by one but not much benefit anywhere.
>
>         4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>              This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>
>         5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>
> I am trying to figure out from the code what exactly this overlay writes does. Any insight/explanation would be helpful here.
>
> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>
> Thanks & Regards
> Somnath
>
>
>
>
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html



--
Best Regards,

Wheat


* RE: Regarding newstore performance
From: Somnath Roy @ 2015-04-16  6:17 UTC
  To: Chen, Xiaoxi, Haomai Wang; +Cc: ceph-devel

Here is the data with the omap separated onto another SSD, after 1000 GB of fio writes (same profile).

omap writes:
-------------

Total host writes in this period = 551020111 ------ ~2101 GB

Total flash writes in this period = 1150679336

data writes:
-----------

Total host writes in this period = 302550388 --- ~1154 GB

Total flash writes in this period = 600238328

So, the actual data-write WA is ~1.1, but the omap overhead is ~2.1; adding those gives ~3.2 WA overall.
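
(Same arithmetic as the sketch in my earlier mail, applied per device; again assuming the counters are in 4 KiB flash logical pages:)

    PAGE, GIB = 4096, 1024 ** 3
    fio_written_gb = 1000.0
    omap_gb = 551020111 * PAGE / GIB     # ~2101 GB written to the kvdb/omap SSD
    data_gb = 302550388 * PAGE / GIB     # ~1154 GB written to the data SSD
    print(f"data WA ~{data_gb / fio_written_gb:.2f}, "
          f"omap overhead ~{omap_gb / fio_written_gb:.2f}, "
          f"overall ~{(data_gb + omap_gb) / fio_written_gb:.2f}")
    # data WA ~1.15, omap overhead ~2.10, overall ~3.26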

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Wednesday, April 15, 2015 9:22 PM
To: Chen, Xiaoxi; Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Thanks Xiaoxi..
But, I have already initiated test by making db/ a symbolic link to another SSD..Will share the result soon.

Regards
Somnath

-----Original Message-----
From: Chen, Xiaoxi [mailto:xiaoxi.chen@intel.com]
Sent: Wednesday, April 15, 2015 6:48 PM
To: Somnath Roy; Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Hi Somnath,
     You could try apply this one:)
     https://github.com/ceph/ceph/pull/4356

      BTW the previous RocksDB configuration has a bug that set rocksdb_disableDataSync to true by default, which may cause data loss in failure. So pls update the newstore to latest or manually set it to false. I suspect the KVDB performance will be worse after doing this...but that's the way we need to go.
										Xiaoxi

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, April 16, 2015 12:07 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Hoamai,
Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA.
This tool I have written on top of these disk counters. I can share that but you need SanDisk optimus echo (or max) drive to make it work :-)

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Sage/Mark,
> I did some WA experiment with newstore with the similar settings I mentioned yesterday.
>
> Test:
> -------
>
> 64K Random write with 64 QD and writing total of 1 TB of data.
>
>
> Newstore:
> ------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
> iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
>   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
>     slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
>     clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
>      lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
>     clat percentiles (msec):
>      |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
>      | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
>      | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
>      | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
>      | 99.99th=[ 1270]
>     bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
>     lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
>     lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
>   cpu          : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
> mint=21421419msec, maxt=21421419msec
>
>
> So, iops getting is ~764.
> 99th percentile latency should be 100ms.
>
> Write amplification at disk level:
> --------------------------------------
>
> SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk.
>
> Please find the data in the following xls.
>
> https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
> cX5TLMRzdXyJE/edit?usp=sharing
>
> Total host writes in this period = 923896266
>
> Total flash writes in this period = 1465339040
>
>
> FileStore:
> -------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
> iops] [eta 00m:01s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
>   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
>     slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
>     clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
>      lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
>     clat percentiles (msec):
>      |  1.00th=[    7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
>      | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
>      | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
>      | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
>      | 99.99th=[ 1647]
>     bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00
>     lat (usec) : 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
>     lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
>     lat (msec) : 2000=0.06%, >=2000=0.01%
>   cpu          : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
> mint=10636117msec, maxt=10636117msec
>
> Disk stats (read/write):
>   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%
>
> So, iops here is ~1500.
> 99th percentile latency should be within 50ms
>
>
> Write amplification at disk level:
> --------------------------------------
>
> Total host writes in this period = 643611346
>
> Total flash writes in this period = 1157304512
>
>
>
> https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61
> Fz49CLH8WPh7Q/edit?pli=1#gid=95373000
>
>
>
>
>
> Summary:
> ------------
>
> 1.  The performance is doubled in case of filestore and latency is almost half.
>
> 2. Total number of flash writes is impacted by both the application write pattern and FTL logic etc., so I am not going into that. The thing to note is the significant increase of host writes with newstore, and that is definitely causing extra WA compared to filestore.
>

Yeah, it seems that xfs plays well when writing back.

> 3. Considering a flash page size of 4K, the total writes in case of filestore = 2455 GB for a 1000 GB fio write vs 3524 GB with newstore. So, the WA of filestore is ~2.4 vs ~3.5 for newstore. Considering the inherent 2X WA of filestore, it is doing pretty well here.
>      Now, in case of newstore, it is not supposed to write the WAL for new writes. It will be interesting to see the % of new writes coming in. Will analyze that.
>

I think it probably results from the kvdb. Maybe we can separate newstore's data dir and kvdb dir, so we can measure the difference with separate disk counters.

> 4. If you can open my xls and the graphs above, you can see that initially host writes and flash writes are very similar in case of newstore, and then it jumps high. Not sure why though. I will rerun the tests to confirm the same phenomenon.
>
> 5. The cumulative flash write and cumulative host write graph shows the actual WA (host + FW) caused by the write.
>

I'm interested in the flash write and disk write counters. Is it an internal tool, or is it an open-source tool?

> What's next:
> ---------------
>
> 1. Need to understand why 3.5 WA for newstore.
>
> 2. Try with different  Rocksdb tuning and record the impact.
>
>
> Any feedback/suggestion is much appreciated.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, April 13, 2015 4:54 PM
> To: ceph-devel
> Subject: Regarding newstore performance
>
> Sage,
> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>
> Test:
> -----
>
>         64K random writes with QD= 64 using fio_rbd.
>
> Results :
> ----------
>
>         1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>
>         2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>
>         3. I tried to tweak all the settings one by one but not much benefit anywhere.
>
>         4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>              This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>
>         5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>
> I am trying to figure out from the code what exactly these overlay writes do. Any insight/explanation would be helpful here.
>
> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>
> Thanks & Regards
> Somnath
>
>
>
>
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html



--
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-16  6:17         ` Somnath Roy
@ 2015-04-16 18:17           ` Mark Nelson
  2015-04-17  0:38             ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-16 18:17 UTC (permalink / raw)
  To: Somnath Roy, Chen, Xiaoxi, Haomai Wang; +Cc: ceph-devel

On 04/16/2015 01:17 AM, Somnath Roy wrote:
> Here is the data with omap separated to another SSD and after 1000GB of fio writes (same profile)..
>
> omap writes:
> -------------
>
> Total host writes in this period = 551020111 ------ ~2101 GB
>
> Total flash writes in this period = 1150679336
>
> data writes:
> -----------
>
> Total host writes in this period = 302550388 --- ~1154 GB
>
> Total flash writes in this period = 600238328
>
> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those getting ~3.2 WA overall.

Looks like we can get quite a bit of data out of the rocksdb log as 
well.  Here's a stats dump after a full benchmark run from an SSD-backed 
OSD with newstore, fdatasync, and Xiaoxi's tunables to increase buffer sizes:

http://www.fpaste.org/212007/raw/

It appears that in this test at least, a lot of data gets moved to L3 
and L4 with associated WA.  Notice the crazy amount of reads as well!
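
For reference, a hedged sketch of the RocksDB-level knobs that produce a dump like that (these are stock RocksDB API names, not ceph config keys, so treat it as illustration only):

// Hedged sketch: get RocksDB to emit compaction/DB stats like the paste above.
#include <rocksdb/db.h>
#include <rocksdb/statistics.h>
#include <cstdio>
#include <string>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.statistics = rocksdb::CreateDBStatistics();  // tick counters (ToString() to dump)
  opts.stats_dump_period_sec = 60;                  // periodic stats into the LOG file
  rocksdb::DB *db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/rocksdb-stats-test", &db);
  if (!s.ok()) return 1;
  // ... run a workload ...
  std::string stats;
  db->GetProperty("rocksdb.stats", &stats);          // same text as the periodic dump
  printf("%s\n", stats.c_str());
  delete db;
}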

Mark

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-16 18:17           ` Mark Nelson
@ 2015-04-17  0:38             ` Sage Weil
  2015-04-17  0:47               ` Gregory Farnum
                                 ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Sage Weil @ 2015-04-17  0:38 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Somnath Roy, Chen, Xiaoxi, Haomai Wang, ceph-devel

On Thu, 16 Apr 2015, Mark Nelson wrote:
> On 04/16/2015 01:17 AM, Somnath Roy wrote:
> > Here is the data with omap separated to another SSD and after 1000GB of fio
> > writes (same profile)..
> > 
> > omap writes:
> > -------------
> > 
> > Total host writes in this period = 551020111 ------ ~2101 GB
> > 
> > Total flash writes in this period = 1150679336
> > 
> > data writes:
> > -----------
> > 
> > Total host writes in this period = 302550388 --- ~1154 GB
> > 
> > Total flash writes in this period = 600238328
> > 
> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
> > getting ~3.2 WA overall.

This all suggests that getting rocksdb to not rewrite the wal 
entries at all will be the big win.  I think Xiaoxi had tunable 
suggestions for that?  I didn't grok the rocksdb terms immediately so 
they didn't make a lot of sense at the time.. this is probably a good 
place to focus, though.  The rocksdb compaction stats should help out 
there.

But... today I ignored this entirely and put rocksdb in tmpfs and focused 
just on the actual wal IOs done to the fragments files after the fact.  
For simplicity I focused just on 128k random writes into 4mb objects.

fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting 
iodepth=16 makes no difference *until* I also set thinktime=10 (us, or 
almost any value really) and thinktime_blocks=16, at which point it goes 
up with the iodepth.  I'm not quite sure what is going on there but it 
seems to be preventing the elevator and/or disk from reordering writes and 
making more efficient sweeps across the disk.  In any case, though, with 
that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.  
Similarly, with qd 1 and thinktime of 250us, it drops to like 15mb/sec, 
which is basically what I was getting from newstore.  Here's my fio 
config:

	http://fpaste.org/212110/42923089/

Conclusion: we need multiple threads (or libaio) to get lots of IOs in 
flight so that the block layer and/or disk can reorder and be efficient.  
I added a threadpool for doing wal work (newstore wal threads = 8 by 
default) and it makes a big difference.  Now I am getting more like 
19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up 
much from there as I scale threads or qd, strangely; not sure why yet.
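
For clarity, the wal threadpool is conceptually just N threads draining a queue of deferred wal items so the block layer sees several IOs at once; a minimal sketch (hypothetical names, not the actual newstore code) looks like:

// Hedged sketch of a WAL work thread pool; names here are made up.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WalThreadPool {
  std::vector<std::thread> workers;
  std::queue<std::function<void()>> q;   // pending WAL replay items
  std::mutex m;
  std::condition_variable cv;
  bool stopping = false;
public:
  explicit WalThreadPool(int n = 8) {    // cf. "newstore wal threads = 8"
    for (int i = 0; i < n; ++i)
      workers.emplace_back([this] {
        for (;;) {
          std::function<void()> work;
          {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [this] { return stopping || !q.empty(); });
            if (stopping && q.empty()) return;
            work = std::move(q.front());
            q.pop();
          }
          work();                        // e.g. write + fsync one fragment file
        }
      });
  }
  void queue_wal_item(std::function<void()> fn) {
    { std::lock_guard<std::mutex> l(m); q.push(std::move(fn)); }
    cv.notify_one();
  }
  ~WalThreadPool() {
    { std::lock_guard<std::mutex> l(m); stopping = true; }
    cv.notify_all();
    for (auto &t : workers) t.join();
  }
};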

But... that's a big improvement over a few days ago (~8mb/sec).  And on 
this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're 
winning, yay!

I tabled the libaio patch for now since it was getting spurious EINVAL and 
would consistently SIGBUS from io_getevents() when ceph-osd did dlopen() 
on the rados plugins (weird!).

Mark, at this point it is probably worth checking that you can reproduce 
these results?  If so, we can redo the io size sweep.  I picked 8 wal 
threads since that was enough to help and going higher didn't seem to make 
much difference, but at some point we'll want to be more careful about 
picking that number.  We could also use libaio here, but I'm not sure it's 
worth it.  And this approach is somewhat orthogonal to the idea of 
efficiently passing the kernel things to fdatasync.

Anyway, next up is probably wrangling rocksdb's log!

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17  0:38             ` Sage Weil
@ 2015-04-17  0:47               ` Gregory Farnum
  2015-04-17  0:53                 ` Sage Weil
  2015-04-17  0:55                 ` Chen, Xiaoxi
  2015-04-17  4:53               ` Haomai Wang
  2015-04-17 12:10               ` Mark Nelson
  2 siblings, 2 replies; 28+ messages in thread
From: Gregory Farnum @ 2015-04-17  0:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, Somnath Roy, Chen, Xiaoxi, Haomai Wang, ceph-devel

On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>> > Here is the data with omap separated to another SSD and after 1000GB of fio
>> > writes (same profile)..
>> >
>> > omap writes:
>> > -------------
>> >
>> > Total host writes in this period = 551020111 ------ ~2101 GB
>> >
>> > Total flash writes in this period = 1150679336
>> >
>> > data writes:
>> > -----------
>> >
>> > Total host writes in this period = 302550388 --- ~1154 GB
>> >
>> > Total flash writes in this period = 600238328
>> >
>> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
>> > getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal
> entries at all will be the big win.  I think Xiaoxi had tunable
> suggestions for that?  I didn't grok the rocksdb terms immediately so
> they didn't make a lot of sense at the time.. this is probably a good
> place to focus, though.  The rocksdb compaction stats should help out
> there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and focused
> just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> iodepth=16 makes no different *until* I also set thinktime=10 (us, or
> almost any value really) and thinktime_blocks=16, at which point it goes
> up with the iodepth.  I'm not quite sure what is going on there but it
> seems to be preventing the elevator and/or disk from reordering writes and
> make more efficient sweeps across the disk.  In any case, though, with
> that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec,
> which is basically what I was getting from newstore.  Here's my fio
> config:
>
>         http://fpaste.org/212110/42923089/
>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> much from there as I scale threads or qd, strangely; not sure why yet.
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And on
> this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL and
> would consistently SIGBUG from io_getevents() when ceph-osd did dlopen()
> on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can reproduce
> these results?  If so, we can redo the io size sweep.  I picked 8 wal
> threads since that was enough to help and going higher didn't seem to make
> much difference, but at some point we'll want to be more careful about
> picking that number.  We could also use libaio here, but I'm not sure it's
> worth it.  And this approach is somewhat orthogonal to the idea of
> efficiently passing the kernel things to fdatasync.

Adding another thread switch to the IO path is going to make us very
sad in the future, so I think this'd be a bad prototype version to
have escape into the wild. I keep hearing Sam's talk about needing to
get down to 1 thread switch if we're ever to hope for 100usec writes.

So consider this one vote for making libaio work, and sooner rather
than later. :)
-Greg

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17  0:47               ` Gregory Farnum
@ 2015-04-17  0:53                 ` Sage Weil
  2015-04-17  0:55                 ` Chen, Xiaoxi
  1 sibling, 0 replies; 28+ messages in thread
From: Sage Weil @ 2015-04-17  0:53 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Mark Nelson, Somnath Roy, Chen, Xiaoxi, Haomai Wang, ceph-devel

On Thu, 16 Apr 2015, Gregory Farnum wrote:
> On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 16 Apr 2015, Mark Nelson wrote:
> >> On 04/16/2015 01:17 AM, Somnath Roy wrote:
> >> > Here is the data with omap separated to another SSD and after 1000GB of fio
> >> > writes (same profile)..
> >> >
> >> > omap writes:
> >> > -------------
> >> >
> >> > Total host writes in this period = 551020111 ------ ~2101 GB
> >> >
> >> > Total flash writes in this period = 1150679336
> >> >
> >> > data writes:
> >> > -----------
> >> >
> >> > Total host writes in this period = 302550388 --- ~1154 GB
> >> >
> >> > Total flash writes in this period = 600238328
> >> >
> >> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
> >> > getting ~3.2 WA overall.
> >
> > This all suggests that getting rocksdb to not rewrite the wal
> > entries at all will be the big win.  I think Xiaoxi had tunable
> > suggestions for that?  I didn't grok the rocksdb terms immediately so
> > they didn't make a lot of sense at the time.. this is probably a good
> > place to focus, though.  The rocksdb compaction stats should help out
> > there.
> >
> > But... today I ignored this entirely and put rocksdb in tmpfs and focused
> > just on the actual wal IOs done to the fragments files after the fact.
> > For simplicity I focused just on 128k random writes into 4mb objects.
> >
> > fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> > iodepth=16 makes no different *until* I also set thinktime=10 (us, or
> > almost any value really) and thinktime_blocks=16, at which point it goes
> > up with the iodepth.  I'm not quite sure what is going on there but it
> > seems to be preventing the elevator and/or disk from reordering writes and
> > make more efficient sweeps across the disk.  In any case, though, with
> > that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> > Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec,
> > which is basically what I was getting from newstore.  Here's my fio
> > config:
> >
> >         http://fpaste.org/212110/42923089/
> >
> > Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> > flight so that the block layer and/or disk can reorder and be efficient.
> > I added a threadpool for doing wal work (newstore wal threads = 8 by
> > default) and it makes a big difference.  Now I am getting more like
> > 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> > much from there as I scale threads or qd, strangely; not sure why yet.
> >
> > But... that's a big improvement over a few days ago (~8mb/sec).  And on
> > this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> > winning, yay!
> >
> > I tabled the libaio patch for now since it was getting spurious EINVAL and
> > would consistently SIGBUG from io_getevents() when ceph-osd did dlopen()
> > on the rados plugins (weird!).
> >
> > Mark, at this point it is probably worth checking that you can reproduce
> > these results?  If so, we can redo the io size sweep.  I picked 8 wal
> > threads since that was enough to help and going higher didn't seem to make
> > much difference, but at some point we'll want to be more careful about
> > picking that number.  We could also use libaio here, but I'm not sure it's
> > worth it.  And this approach is somewhat orthogonal to the idea of
> > efficiently passing the kernel things to fdatasync.
> 
> Adding another thread switch to the IO path is going to make us very
> sad in the future, so I think this'd be a bad prototype version to
> have escape into the wild. I keep hearing Sam's talk about needing to
> get down to 1 thread switch if we're ever to hope for 100usec writes.

Yeah, for fast memory we'll want to take a totally different synchronous 
path through the code.  Right now I'm targeting general purpose (spinning 
disk and current-generation SSDs) usage (and this is the async post-commit 
cleanup work).

But yeah... I'll bite the bullet and do aio soon.  I suspect I just 
screwed up the buffer alignment and that's where EINVAL was coming from 
before.
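
For the record, the alignment rules are the usual O_DIRECT ones: buffer address, length, and file offset all need to be multiples of the logical block size, otherwise you get EINVAL back (either from io_submit() or in the completion's res).  A minimal hedged sketch, not the actual patch:

/* Hedged sketch: one aligned O_DIRECT write via libaio (link with -laio).
 * With O_DIRECT, buf, count and offset must all be multiples of the
 * device's logical block size or the write fails with EINVAL. */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main() {
  const size_t len = 4096;                  /* assume 4K logical blocks */
  void *buf = NULL;
  if (posix_memalign(&buf, 4096, len)) return 1;   /* aligned buffer */
  memset(buf, 0xab, len);

  int fd = open("/tmp/aio-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) return 1;

  io_context_t ctx = 0;
  if (io_setup(16, &ctx)) return 1;

  struct iocb cb, *cbs[1] = { &cb };
  io_prep_pwrite(&cb, fd, buf, len, 0);     /* offset 0: also aligned */
  if (io_submit(ctx, 1, cbs) != 1) return 1;

  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, NULL);       /* ev.res == len on success */

  io_destroy(ctx);
  close(fd);
  free(buf);
  return 0;
}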

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Regarding newstore performance
  2015-04-17  0:47               ` Gregory Farnum
  2015-04-17  0:53                 ` Sage Weil
@ 2015-04-17  0:55                 ` Chen, Xiaoxi
  1 sibling, 0 replies; 28+ messages in thread
From: Chen, Xiaoxi @ 2015-04-17  0:55 UTC (permalink / raw)
  To: Gregory Farnum, Sage Weil
  Cc: Mark Nelson, Somnath Roy, Haomai Wang, ceph-devel

Agreed. Threadpool/queue/locking is generally bad for latency. Can we just make the newstore backend as synchronous as possible and get the parallelism from a higher #OSD_OP_THREAD? Hopefully we could have better latency in the low-#QD case.
 

-----Original Message-----
From: Gregory Farnum [mailto:greg@gregs42.com] 
Sent: Friday, April 17, 2015 8:48 AM
To: Sage Weil
Cc: Mark Nelson; Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
Subject: Re: Regarding newstore performance

On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>> > Here is the data with omap separated to another SSD and after 
>> > 1000GB of fio writes (same profile)..
>> >
>> > omap writes:
>> > -------------
>> >
>> > Total host writes in this period = 551020111 ------ ~2101 GB
>> >
>> > Total flash writes in this period = 1150679336
>> >
>> > data writes:
>> > -----------
>> >
>> > Total host writes in this period = 302550388 --- ~1154 GB
>> >
>> > Total flash writes in this period = 600238328
>> >
>> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and 
>> > adding those getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal entries 
> at all will be the big win.  I think Xiaoxi had tunable suggestions 
> for that?  I didn't grok the rocksdb terms immediately so they didn't 
> make a lot of sense at the time.. this is probably a good place to 
> focus, though.  The rocksdb compaction stats should help out there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and 
> focused just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, 
> setting
> iodepth=16 makes no different *until* I also set thinktime=10 (us, or 
> almost any value really) and thinktime_blocks=16, at which point it 
> goes up with the iodepth.  I'm not quite sure what is going on there 
> but it seems to be preventing the elevator and/or disk from reordering 
> writes and make more efficient sweeps across the disk.  In any case, 
> though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qa 1 and thinktime of 250us, it drops to like 
> 15mb/sec, which is basically what I was getting from newstore.  Here's 
> my fio
> config:
>
>         http://fpaste.org/212110/42923089/
>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in 
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like 
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going 
> up much from there as I scale threads or qd, strangely; not sure why yet.
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And 
> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're 
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL 
> and would consistently SIGBUG from io_getevents() when ceph-osd did 
> dlopen() on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can 
> reproduce these results?  If so, we can redo the io size sweep.  I 
> picked 8 wal threads since that was enough to help and going higher 
> didn't seem to make much difference, but at some point we'll want to 
> be more careful about picking that number.  We could also use libaio 
> here, but I'm not sure it's worth it.  And this approach is somewhat 
> orthogonal to the idea of efficiently passing the kernel things to fdatasync.

Adding another thread switch to the IO path is going to make us very sad in the future, so I think this'd be a bad prototype version to have escape into the wild. I keep hearing Sam's talk about needing to get down to 1 thread switch if we're ever to hope for 100usec writes.

So consider this one vote for making libaio work, and sooner rather than later. :) -Greg

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17  0:38             ` Sage Weil
  2015-04-17  0:47               ` Gregory Farnum
@ 2015-04-17  4:53               ` Haomai Wang
  2015-04-17 15:28                 ` Sage Weil
  2015-04-17 12:10               ` Mark Nelson
  2 siblings, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-04-17  4:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, Somnath Roy, Chen, Xiaoxi, ceph-devel

On Fri, Apr 17, 2015 at 8:38 AM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>> > Here is the data with omap separated to another SSD and after 1000GB of fio
>> > writes (same profile)..
>> >
>> > omap writes:
>> > -------------
>> >
>> > Total host writes in this period = 551020111 ------ ~2101 GB
>> >
>> > Total flash writes in this period = 1150679336
>> >
>> > data writes:
>> > -----------
>> >
>> > Total host writes in this period = 302550388 --- ~1154 GB
>> >
>> > Total flash writes in this period = 600238328
>> >
>> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
>> > getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal
> entries at all will be the big win.  I think Xiaoxi had tunable
> suggestions for that?  I didn't grok the rocksdb terms immediately so
> they didn't make a lot of sense at the time.. this is probably a good
> place to focus, though.  The rocksdb compaction stats should help out
> there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and focused
> just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> iodepth=16 makes no different *until* I also set thinktime=10 (us, or
> almost any value really) and thinktime_blocks=16, at which point it goes
> up with the iodepth.  I'm not quite sure what is going on there but it
> seems to be preventing the elevator and/or disk from reordering writes and
> make more efficient sweeps across the disk.  In any case, though, with
> that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec,
> which is basically what I was getting from newstore.  Here's my fio
> config:
>
>         http://fpaste.org/212110/42923089/
>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> much from there as I scale threads or qd, strangely; not sure why yet.

Do you mean this PR (https://github.com/ceph/ceph/pull/4318)? I have a
simple benchmark in the comments of that PR.

>
> But... that's a big improvement over a few days ago (~8mb/sec).  And on
> this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL and
> would consistently SIGBUG from io_getevents() when ceph-osd did dlopen()
> on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can reproduce
> these results?  If so, we can redo the io size sweep.  I picked 8 wal
> threads since that was enough to help and going higher didn't seem to make
> much difference, but at some point we'll want to be more careful about
> picking that number.  We could also use libaio here, but I'm not sure it's
> worth it.  And this approach is somewhat orthogonal to the idea of
> efficiently passing the kernel things to fdatasync.

Agreed, this time I think we need to focus on the data store only. Maybe I'm
missing it, but what's your overlay config value in this test?

>
> Anyway, next up is probably wrangling rocksdb's log!
>
> sage



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17  0:38             ` Sage Weil
  2015-04-17  0:47               ` Gregory Farnum
  2015-04-17  4:53               ` Haomai Wang
@ 2015-04-17 12:10               ` Mark Nelson
  2015-04-17 14:08                 ` Chen, Xiaoxi
                                   ` (2 more replies)
  2 siblings, 3 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-17 12:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Chen, Xiaoxi, Haomai Wang, ceph-devel



On 04/16/2015 07:38 PM, Sage Weil wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>> Here is the data with omap separated to another SSD and after 1000GB of fio
>>> writes (same profile)..
>>>
>>> omap writes:
>>> -------------
>>>
>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>
>>> Total flash writes in this period = 1150679336
>>>
>>> data writes:
>>> -----------
>>>
>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>
>>> Total flash writes in this period = 600238328
>>>
>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
>>> getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal
> entries at all will be the big win.  I think Xiaoxi had tunable
> suggestions for that?  I didn't grok the rocksdb terms immediately so
> they didn't make a lot of sense at the time.. this is probably a good
> place to focus, though.  The rocksdb compaction stats should help out
> there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and focused
> just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> iodepth=16 makes no different *until* I also set thinktime=10 (us, or
> almost any value really) and thinktime_blocks=16, at which point it goes
> up with the iodepth.  I'm not quite sure what is going on there but it
> seems to be preventing the elevator and/or disk from reordering writes and
> make more efficient sweeps across the disk.  In any case, though, with
> that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec,
> which is basically what I was getting from newstore.  Here's my fio
> config:
>
> 	http://fpaste.org/212110/42923089/


Yikes!  That is a great observation Sage!

>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> much from there as I scale threads or qd, strangely; not sure why yet.
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And on
> this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL and
> would consistently SIGBUG from io_getevents() when ceph-osd did dlopen()
> on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can reproduce
> these results?  If so, we can redo the io size sweep.  I picked 8 wal
> threads since that was enough to help and going higher didn't seem to make
> much difference, but at some point we'll want to be more careful about
> picking that number.  We could also use libaio here, but I'm not sure it's
> worth it.  And this approach is somewhat orthogonal to the idea of
> efficiently passing the kernel things to fdatasync.

Absolutely!  I'll get some tests running now.  Looks like everyone is 
jumping on the libaio bandwagon which naively seems like the right way 
to me too.  Can you talk a little bit more about how you'd see fdatasync 
work in this case though vs the threaded implementation?

>
> Anyway, next up is probably wrangling rocksdb's log!

I jumped on #rocksdb on freenode yesterday to ask about it, but I think 
we'll probably just need to hit the mailing list.

>
> sage
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Regarding newstore performance
  2015-04-17 12:10               ` Mark Nelson
@ 2015-04-17 14:08                 ` Chen, Xiaoxi
  2015-04-17 14:20                   ` Haomai Wang
  2015-04-17 14:40                 ` Chen, Xiaoxi
  2015-04-17 15:46                 ` Sage Weil
  2 siblings, 1 reply; 28+ messages in thread
From: Chen, Xiaoxi @ 2015-04-17 14:08 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: Somnath Roy, Haomai Wang, ceph-devel

I tried to split the DB/data/WAL onto 3 different SSDs; the iostat output looks like below:

sdb is the data, while sdc is the db and sdd is the WAL of RocksDB.
The IO pattern is 4KB random writes (QD=8) on top of a pre-filled RBD, using fio-librbd.

The result looks strange:
1. On sdb (the data part), we are expecting 4KB IOs but actually we only get 2KB (4 sectors).
2. There is not that much data written to Level 0+, only 0.53MB/s.
3. Note that the avgqu-sz is very low compared to QD=8 in fio; it seems the problem is that we cannot commit the WAL fast enough.


My code base is 6e9b2fce30cf297e60454689c6fb406b6e786889,

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.77    0.00    8.87    2.06    0.00   73.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    10.60    0.00   49.60     0.00    21.56   890.39     6.68  134.76    0.00  134.76   1.16   5.76
sdb               0.00     0.00    0.00 1627.30     0.00     3.22     4.05     0.11    0.07    0.00    0.07   0.06  10.52
sdc               0.00     0.00    0.20    4.30     0.00     0.53   239.33     0.00    1.07    2.00    1.02   0.71   0.32
sdd               0.00   612.00    0.00 1829.50     0.00     9.41    10.53     0.85    0.46    0.00    0.46   0.46  84.68


/dev/sdc1      156172796  2740620 153432176   2% /root/ceph-0-db
/dev/sdd1      195264572    41940 195222632   1% /root/ceph-0-db-wal
/dev/sdb1      156172796 10519532 145653264   7% /var/lib/ceph/osd/ceph-0
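
In case it helps anyone reproducing the split, at the RocksDB API level the DB dir and WAL dir can be pointed at different mounts like below (paths reused from the df output above; how newstore actually wires this up internally may differ):

// Hedged sketch: separate RocksDB data and WAL onto different mounts.
#include <rocksdb/db.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.wal_dir = "/root/ceph-0-db-wal";          // WAL on its own SSD (sdd)
  rocksdb::DB *db = nullptr;
  // sst files / db dir on another SSD (sdc); object data stays on the OSD mount (sdb)
  rocksdb::Status s = rocksdb::DB::Open(opts, "/root/ceph-0-db", &db);
  if (!s.ok()) return 1;
  delete db;
}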

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Friday, April 17, 2015 8:11 PM
To: Sage Weil
Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
Subject: Re: Regarding newstore performance



On 04/16/2015 07:38 PM, Sage Weil wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>> Here is the data with omap separated to another SSD and after 1000GB 
>>> of fio writes (same profile)..
>>>
>>> omap writes:
>>> -------------
>>>
>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>
>>> Total flash writes in this period = 1150679336
>>>
>>> data writes:
>>> -----------
>>>
>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>
>>> Total flash writes in this period = 600238328
>>>
>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and 
>>> adding those getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal entries 
> at all will be the big win.  I think Xiaoxi had tunable suggestions 
> for that?  I didn't grok the rocksdb terms immediately so they didn't 
> make a lot of sense at the time.. this is probably a good place to 
> focus, though.  The rocksdb compaction stats should help out there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and 
> focused just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, 
> setting
> iodepth=16 makes no different *until* I also set thinktime=10 (us, or 
> almost any value really) and thinktime_blocks=16, at which point it 
> goes up with the iodepth.  I'm not quite sure what is going on there 
> but it seems to be preventing the elevator and/or disk from reordering 
> writes and make more efficient sweeps across the disk.  In any case, 
> though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qa 1 and thinktime of 250us, it drops to like 
> 15mb/sec, which is basically what I was getting from newstore.  Here's 
> my fio
> config:
>
> 	http://fpaste.org/212110/42923089/


Yikes!  That is a great observation Sage!

>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in 
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like 
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going 
> up much from there as I scale threads or qd, strangely; not sure why yet.
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And 
> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're 
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL 
> and would consistently SIGBUG from io_getevents() when ceph-osd did 
> dlopen() on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can 
> reproduce these results?  If so, we can redo the io size sweep.  I 
> picked 8 wal threads since that was enough to help and going higher 
> didn't seem to make much difference, but at some point we'll want to 
> be more careful about picking that number.  We could also use libaio 
> here, but I'm not sure it's worth it.  And this approach is somewhat 
> orthogonal to the idea of efficiently passing the kernel things to fdatasync.

Absolutely!  I'll get some tests running now.  Looks like everyone is jumping on the libaio bandwagon which naively seems like the right way to me too.  Can you talk a little bit more about how you'd see fdatasync work in this case though vs the threaded implementation?

>
> Anyway, next up is probably wrangling rocksdb's log!

I jumped on #rocksdb on freenode yesterday to ask about it, but I think we'll probably just need to hit the mailing list.

>
> sage
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17 14:08                 ` Chen, Xiaoxi
@ 2015-04-17 14:20                   ` Haomai Wang
  2015-04-17 14:29                     ` Chen, Xiaoxi
  0 siblings, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-04-17 14:20 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Mark Nelson, Sage Weil, Somnath Roy, ceph-devel

On Fri, Apr 17, 2015 at 10:08 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> I tried to spilit the DB/data/WAL into 3 different SSD, the IOSTAT looks like below:
>
> SDB is the data while SDC is db and SDD is the WAL of RocksDB.
> The IO pattern is 4KB random write(QD=8) ontop of a pre-filled RBD, using fio-librbd.
>
> The result looks strange,
> 1. in SDB(data part), we are expecting 4KB IO but actually we only get 2KB(4Sector).
> 2. There are not that much data written to Level 0+, only 0.53MB/s
> 3. Note that the avgqu-sz is very low compared to QD=8 in FIO, seems the problem is that we cannot commit the WAL fast enough.

Are you using the default io scheduler for these ssds? I'm not sure whether the
linux cfq scheduler will queue fsync/fdatasync behind all in-progress
write ops. So if we always issue fsync in the rocksdb layer, will it try to
merge more fsync requests? Maybe you could move to deadline or noop?

>
>
> My code base is 6e9b2fce30cf297e60454689c6fb406b6e786889,
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           15.77    0.00    8.87    2.06    0.00   73.30
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00    10.60    0.00   49.60     0.00    21.56   890.39     6.68  134.76    0.00  134.76   1.16   5.76
> sdb               0.00     0.00    0.00 1627.30     0.00     3.22     4.05     0.11    0.07    0.00    0.07   0.06  10.52
> sdc               0.00     0.00    0.20    4.30     0.00     0.53   239.33     0.00    1.07    2.00    1.02   0.71   0.32
> sdd               0.00   612.00    0.00 1829.50     0.00     9.41    10.53     0.85    0.46    0.00    0.46   0.46  84.68
>
>
> /dev/sdc1      156172796  2740620 153432176   2% /root/ceph-0-db
> /dev/sdd1      195264572    41940 195222632   1% /root/ceph-0-db-wal
> /dev/sdb1      156172796 10519532 145653264   7% /var/lib/ceph/osd/ceph-0
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Friday, April 17, 2015 8:11 PM
> To: Sage Weil
> Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
> Subject: Re: Regarding newstore performance
>
>
>
> On 04/16/2015 07:38 PM, Sage Weil wrote:
>> On Thu, 16 Apr 2015, Mark Nelson wrote:
>>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>>> Here is the data with omap separated to another SSD and after 1000GB
>>>> of fio writes (same profile)..
>>>>
>>>> omap writes:
>>>> -------------
>>>>
>>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>>
>>>> Total flash writes in this period = 1150679336
>>>>
>>>> data writes:
>>>> -----------
>>>>
>>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>>
>>>> Total flash writes in this period = 600238328
>>>>
>>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and
>>>> adding those getting ~3.2 WA overall.
>>
>> This all suggests that getting rocksdb to not rewrite the wal entries
>> at all will be the big win.  I think Xiaoxi had tunable suggestions
>> for that?  I didn't grok the rocksdb terms immediately so they didn't
>> make a lot of sense at the time.. this is probably a good place to
>> focus, though.  The rocksdb compaction stats should help out there.
>>
>> But... today I ignored this entirely and put rocksdb in tmpfs and
>> focused just on the actual wal IOs done to the fragments files after the fact.
>> For simplicity I focused just on 128k random writes into 4mb objects.
>>
>> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
>> setting
>> iodepth=16 makes no different *until* I also set thinktime=10 (us, or
>> almost any value really) and thinktime_blocks=16, at which point it
>> goes up with the iodepth.  I'm not quite sure what is going on there
>> but it seems to be preventing the elevator and/or disk from reordering
>> writes and make more efficient sweeps across the disk.  In any case,
>> though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
>> Similarly, with qa 1 and thinktime of 250us, it drops to like
>> 15mb/sec, which is basically what I was getting from newstore.  Here's
>> my fio
>> config:
>>
>>       http://fpaste.org/212110/42923089/
>
>
> Yikes!  That is a great observation Sage!
>
>>
>> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
>> flight so that the block layer and/or disk can reorder and be efficient.
>> I added a threadpool for doing wal work (newstore wal threads = 8 by
>> default) and it makes a big difference.  Now I am getting more like
>> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going
>> up much from there as I scale threads or qd, strangely; not sure why yet.
>>
>> But... that's a big improvement over a few days ago (~8mb/sec).  And
>> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
>> winning, yay!
>>
>> I tabled the libaio patch for now since it was getting spurious EINVAL
>> and would consistently SIGBUG from io_getevents() when ceph-osd did
>> dlopen() on the rados plugins (weird!).
>>
>> Mark, at this point it is probably worth checking that you can
>> reproduce these results?  If so, we can redo the io size sweep.  I
>> picked 8 wal threads since that was enough to help and going higher
>> didn't seem to make much difference, but at some point we'll want to
>> be more careful about picking that number.  We could also use libaio
>> here, but I'm not sure it's worth it.  And this approach is somewhat
>> orthogonal to the idea of efficiently passing the kernel things to fdatasync.
>
> Absolutely!  I'll get some tests running now.  Looks like everyone is jumping on the libaio bandwagon which naively seems like the right way to me too.  Can you talk a little bit more about how you'd see fdatasync work in this case though vs the threaded implementation?
>
>>
>> Anyway, next up is probably wrangling rocksdb's log!
>
> I jumped on #rocksdb on freenode yesterday to ask about it, but I think we'll probably just need to hit the mailing list.
>
>>
>> sage
>>



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Regarding newstore performance
  2015-04-17 14:20                   ` Haomai Wang
@ 2015-04-17 14:29                     ` Chen, Xiaoxi
  2015-04-17 14:34                       ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Chen, Xiaoxi @ 2015-04-17 14:29 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Mark Nelson, Sage Weil, Somnath Roy, ceph-devel

I use deadline.

Yes, in RocksDB every commit is followed by an fsync/fdatasync so the WAL log data is safe. Not sure if they could write the WAL log with O_DIRECT to avoid tons of fsyncs?

Here are the DB stats that I printed every 5s, showing 1 write/sync.

** DB Stats **
Uptime(secs): 1127.6 total, 5.9 interval
Cumulative writes: 1723086 writes, 8251002 keys, 1723002 batches, 1.0 writes per batch, 14.46 GB user ingest, stall time: 0 us
Cumulative WAL: 1723087 writes, 1723001 syncs, 1.00 writes per sync, 14.46 GB written
Interval writes: 15179 writes, 77017 keys, 15179 batches, 1.0 writes per batch, 29.4 MB user ingest, stall time: 0 us
Interval WAL: 15180 writes, 15179 syncs, 1.00 writes per sync, 0.03 MB written
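
The 1.00 writes per sync matches the WAL being synced on every commit. At the RocksDB API level, the relevant per-write knobs look roughly like this (a hedged sketch; what ceph actually passes may differ):

// Hedged sketch: RocksDB WAL durability knobs per write batch.
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

void commit(rocksdb::DB *db, rocksdb::WriteBatch &batch) {
  rocksdb::WriteOptions wo;
  wo.sync = true;          // fsync/fdatasync the WAL on every commit (what we see above)
  // wo.disableWAL = true; // skip the WAL entirely (not crash safe)
  db->Write(wo, &batch);
}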


-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com] 
Sent: Friday, April 17, 2015 10:20 PM
To: Chen, Xiaoxi
Cc: Mark Nelson; Sage Weil; Somnath Roy; ceph-devel
Subject: Re: Regarding newstore performance

On Fri, Apr 17, 2015 at 10:08 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> I tried to spilit the DB/data/WAL into 3 different SSD, the IOSTAT looks like below:
>
> SDB is the data while SDC is db and SDD is the WAL of RocksDB.
> The IO pattern is 4KB random write(QD=8) ontop of a pre-filled RBD, using fio-librbd.
>
> The result looks strange,
> 1. in SDB(data part), we are expecting 4KB IO but actually we only get 2KB(4Sector).
> 2. There are not that much data written to Level 0+, only 0.53MB/s 3. 
> Note that the avgqu-sz is very low compared to QD=8 in FIO, seems the problem is that we cannot commit the WAL fast enough.

Are you using default io scheduler for these ssd? I'm not sure that linux cfq scheduler will make fsync/fdatasync behind all inprogress write op. So if we always issue fsync in rocksdb layer, it will try to merge more fsync requests? Maybe you could move to deadline or noop?

>
>
> My code base is 6e9b2fce30cf297e60454689c6fb406b6e786889,
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           15.77    0.00    8.87    2.06    0.00   73.30
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00    10.60    0.00   49.60     0.00    21.56   890.39     6.68  134.76    0.00  134.76   1.16   5.76
> sdb               0.00     0.00    0.00 1627.30     0.00     3.22     4.05     0.11    0.07    0.00    0.07   0.06  10.52
> sdc               0.00     0.00    0.20    4.30     0.00     0.53   239.33     0.00    1.07    2.00    1.02   0.71   0.32
> sdd               0.00   612.00    0.00 1829.50     0.00     9.41    10.53     0.85    0.46    0.00    0.46   0.46  84.68
>
>
> /dev/sdc1      156172796  2740620 153432176   2% /root/ceph-0-db
> /dev/sdd1      195264572    41940 195222632   1% /root/ceph-0-db-wal
> /dev/sdb1      156172796 10519532 145653264   7% /var/lib/ceph/osd/ceph-0
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Friday, April 17, 2015 8:11 PM
> To: Sage Weil
> Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
> Subject: Re: Regarding newstore performance
>
>
>
> On 04/16/2015 07:38 PM, Sage Weil wrote:
>> On Thu, 16 Apr 2015, Mark Nelson wrote:
>>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>>> Here is the data with omap separated to another SSD and after 
>>>> 1000GB of fio writes (same profile)..
>>>>
>>>> omap writes:
>>>> -------------
>>>>
>>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>>
>>>> Total flash writes in this period = 1150679336
>>>>
>>>> data writes:
>>>> -----------
>>>>
>>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>>
>>>> Total flash writes in this period = 600238328
>>>>
>>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and 
>>>> adding those getting ~3.2 WA overall.
>>
>> This all suggests that getting rocksdb to not rewrite the wal entries 
>> at all will be the big win.  I think Xiaoxi had tunable suggestions 
>> for that?  I didn't grok the rocksdb terms immediately so they didn't 
>> make a lot of sense at the time.. this is probably a good place to 
>> focus, though.  The rocksdb compaction stats should help out there.
>>
>> But... today I ignored this entirely and put rocksdb in tmpfs and 
>> focused just on the actual wal IOs done to the fragments files after the fact.
>> For simplicity I focused just on 128k random writes into 4mb objects.
>>
>> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, 
>> setting
>> iodepth=16 makes no different *until* I also set thinktime=10 (us, or 
>> almost any value really) and thinktime_blocks=16, at which point it 
>> goes up with the iodepth.  I'm not quite sure what is going on there 
>> but it seems to be preventing the elevator and/or disk from 
>> reordering writes and make more efficient sweeps across the disk.  In 
>> any case, though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
>> Similarly, with qa 1 and thinktime of 250us, it drops to like 
>> 15mb/sec, which is basically what I was getting from newstore.  
>> Here's my fio
>> config:
>>
>>       http://fpaste.org/212110/42923089/
>
>
> Yikes!  That is a great observation Sage!
>
>>
>> Conclusion: we need multiple threads (or libaio) to get lots of IOs 
>> in flight so that the block layer and/or disk can reorder and be efficient.
>> I added a threadpool for doing wal work (newstore wal threads = 8 by
>> default) and it makes a big difference.  Now I am getting more like 
>> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not 
>> going up much from there as I scale threads or qd, strangely; not sure why yet.
>>
>> But... that's a big improvement over a few days ago (~8mb/sec).  And 
>> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So 
>> we're winning, yay!
>>
>> I tabled the libaio patch for now since it was getting spurious 
>> EINVAL and would consistently SIGBUG from io_getevents() when 
>> ceph-osd did
>> dlopen() on the rados plugins (weird!).
>>
>> Mark, at this point it is probably worth checking that you can 
>> reproduce these results?  If so, we can redo the io size sweep.  I 
>> picked 8 wal threads since that was enough to help and going higher 
>> didn't seem to make much difference, but at some point we'll want to 
>> be more careful about picking that number.  We could also use libaio 
>> here, but I'm not sure it's worth it.  And this approach is somewhat 
>> orthogonal to the idea of efficiently passing the kernel things to fdatasync.
>
> Absolutely!  I'll get some tests running now.  Looks like everyone is jumping on the libaio bandwagon which naively seems like the right way to me too.  Can you talk a little bit more about how you'd see fdatasync work in this case though vs the threaded implementation?
>
>>
>> Anyway, next up is probably wrangling rocksdb's log!
>
> I jumped on #rocksdb on freenode yesterday to ask about it, but I think we'll probably just need to hit the mailing list.
>
>>
>> sage
>>



--
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17 14:29                     ` Chen, Xiaoxi
@ 2015-04-17 14:34                       ` Mark Nelson
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-17 14:34 UTC (permalink / raw)
  To: Chen, Xiaoxi, Haomai Wang; +Cc: Sage Weil, Somnath Roy, ceph-devel



On 04/17/2015 09:29 AM, Chen, Xiaoxi wrote:
> I use deadline.
>
> Yes in RocksDB every commit will follow by a fsync/fdatasync for WAL log data safely. Not sure if they could write the WAL log by O_DIRECT to avoid tons of fsync?
>
> Here is the DB stats that I printed every 5s,showing 1 write/sync.
>
> ** DB Stats **
> Uptime(secs): 1127.6 total, 5.9 interval
> Cumulative writes: 1723086 writes, 8251002 keys, 1723002 batches, 1.0 writes per batch, 14.46 GB user ingest, stall time: 0 us
> Cumulative WAL: 1723087 writes, 1723001 syncs, 1.00 writes per sync, 14.46 GB written
> Interval writes: 15179 writes, 77017 keys, 15179 batches, 1.0 writes per batch, 29.4 MB user ingest, stall time: 0 us
> Interval WAL: 15180 writes, 15179 syncs, 1.00 writes per sync, 0.03 MB written

Yes, the dbstats for the test I did yesterday also show 1 write/sync:

http://www.fpaste.org/212007/raw/

>
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Friday, April 17, 2015 10:20 PM
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Sage Weil; Somnath Roy; ceph-devel
> Subject: Re: Regarding newstore performance
>
> On Fri, Apr 17, 2015 at 10:08 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
>> I tried to spilit the DB/data/WAL into 3 different SSD, the IOSTAT looks like below:
>>
>> SDB is the data while SDC is db and SDD is the WAL of RocksDB.
>> The IO pattern is 4KB random write(QD=8) ontop of a pre-filled RBD, using fio-librbd.
>>
>> The result looks strange,
>> 1. in SDB(data part), we are expecting 4KB IO but actually we only get 2KB(4Sector).
>> 2. There are not that much data written to Level 0+, only 0.53MB/s 3.
>> Note that the avgqu-sz is very low compared to QD=8 in FIO, seems the problem is that we cannot commit the WAL fast enough.
>
> Are you using default io scheduler for these ssd? I'm not sure that linux cfq scheduler will make fsync/fdatasync behind all inprogress write op. So if we always issue fsync in rocksdb layer, it will try to merge more fsync requests? Maybe you could move to deadline or noop?
>
>>
>>
>> My code base is 6e9b2fce30cf297e60454689c6fb406b6e786889,
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            15.77    0.00    8.87    2.06    0.00   73.30
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda               0.00    10.60    0.00   49.60     0.00    21.56   890.39     6.68  134.76    0.00  134.76   1.16   5.76
>> sdb               0.00     0.00    0.00 1627.30     0.00     3.22     4.05     0.11    0.07    0.00    0.07   0.06  10.52
>> sdc               0.00     0.00    0.20    4.30     0.00     0.53   239.33     0.00    1.07    2.00    1.02   0.71   0.32
>> sdd               0.00   612.00    0.00 1829.50     0.00     9.41    10.53     0.85    0.46    0.00    0.46   0.46  84.68
>>
>>
>> /dev/sdc1      156172796  2740620 153432176   2% /root/ceph-0-db
>> /dev/sdd1      195264572    41940 195222632   1% /root/ceph-0-db-wal
>> /dev/sdb1      156172796 10519532 145653264   7% /var/lib/ceph/osd/ceph-0
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Friday, April 17, 2015 8:11 PM
>> To: Sage Weil
>> Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
>> Subject: Re: Regarding newstore performance
>>
>>
>>
>> On 04/16/2015 07:38 PM, Sage Weil wrote:
>>> On Thu, 16 Apr 2015, Mark Nelson wrote:
>>>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>>>> Here is the data with omap separated to another SSD and after
>>>>> 1000GB of fio writes (same profile)..
>>>>>
>>>>> omap writes:
>>>>> -------------
>>>>>
>>>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>>>
>>>>> Total flash writes in this period = 1150679336
>>>>>
>>>>> data writes:
>>>>> -----------
>>>>>
>>>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>>>
>>>>> Total flash writes in this period = 600238328
>>>>>
>>>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and
>>>>> adding those getting ~3.2 WA overall.
>>>
>>> This all suggests that getting rocksdb to not rewrite the wal entries
>>> at all will be the big win.  I think Xiaoxi had tunable suggestions
>>> for that?  I didn't grok the rocksdb terms immediately so they didn't
>>> make a lot of sense at the time.. this is probably a good place to
>>> focus, though.  The rocksdb compaction stats should help out there.
>>>
>>> But... today I ignored this entirely and put rocksdb in tmpfs and
>>> focused just on the actual wal IOs done to the fragments files after the fact.
>>> For simplicity I focused just on 128k random writes into 4mb objects.
>>>
>>> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
>>> setting
>>> iodepth=16 makes no different *until* I also set thinktime=10 (us, or
>>> almost any value really) and thinktime_blocks=16, at which point it
>>> goes up with the iodepth.  I'm not quite sure what is going on there
>>> but it seems to be preventing the elevator and/or disk from
>>> reordering writes and make more efficient sweeps across the disk.  In
>>> any case, though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
>>> Similarly, with qa 1 and thinktime of 250us, it drops to like
>>> 15mb/sec, which is basically what I was getting from newstore.
>>> Here's my fio
>>> config:
>>>
>>>        http://fpaste.org/212110/42923089/
>>
>>
>> Yikes!  That is a great observation Sage!
>>
>>>
>>> Conclusion: we need multiple threads (or libaio) to get lots of IOs
>>> in flight so that the block layer and/or disk can reorder and be efficient.
>>> I added a threadpool for doing wal work (newstore wal threads = 8 by
>>> default) and it makes a big difference.  Now I am getting more like
>>> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not
>>> going up much from there as I scale threads or qd, strangely; not sure why yet.
>>>
>>> But... that's a big improvement over a few days ago (~8mb/sec).  And
>>> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So
>>> we're winning, yay!
>>>
>>> I tabled the libaio patch for now since it was getting spurious
>>> EINVAL and would consistently SIGBUG from io_getevents() when
>>> ceph-osd did
>>> dlopen() on the rados plugins (weird!).
>>>
>>> Mark, at this point it is probably worth checking that you can
>>> reproduce these results?  If so, we can redo the io size sweep.  I
>>> picked 8 wal threads since that was enough to help and going higher
>>> didn't seem to make much difference, but at some point we'll want to
>>> be more careful about picking that number.  We could also use libaio
>>> here, but I'm not sure it's worth it.  And this approach is somewhat
>>> orthogonal to the idea of efficiently passing the kernel things to fdatasync.
>>
>> Absolutely!  I'll get some tests running now.  Looks like everyone is jumping on the libaio bandwagon which naively seems like the right way to me too.  Can you talk a little bit more about how you'd see fdatasync work in this case though vs the threaded implementation?
>>
>>>
>>> Anyway, next up is probably wrangling rocksdb's log!
>>
>> I jumped on #rocksdb on freenode yesterday to ask about it, but I think we'll probably just need to hit the mailing list.
>>
>>>
>>> sage
>>>
>
>
>
> --
> Best Regards,
>
> Wheat
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Regarding newstore performance
  2015-04-17 12:10               ` Mark Nelson
  2015-04-17 14:08                 ` Chen, Xiaoxi
@ 2015-04-17 14:40                 ` Chen, Xiaoxi
  2015-04-17 15:25                   ` Mark Nelson
  2015-04-17 15:46                 ` Sage Weil
  2 siblings, 1 reply; 28+ messages in thread
From: Chen, Xiaoxi @ 2015-04-17 14:40 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: Somnath Roy, Haomai Wang, ceph-devel

Hi Mark,

     These two tunables should help keep the WAL log alive long enough. By default both are 0, which means the WAL log files are deleted ASAP; that is definitely not what we want. Sadly, neither is exposed by the RocksDB store, so they need to be wired in by hand in os/RocksDBStore.cc::do_open (a rough sketch follows the field comments below).

     Since all the problems now seem to center on the KV DB, does it make sense for us to have a small benchmark tool that simulates the newstore workload against RocksDB? The pattern looks like one WAL item (4KB or so) per commit in the 4KB random write case; then we can play with the tunables outside of Ceph.

       // The following two fields affect how archived logs will be deleted.
  // 1. If both set to 0, logs will be deleted asap and will not get into
  //    the archive.
  // 2. If WAL_ttl_seconds is 0 and WAL_size_limit_MB is not 0,
  //    WAL files will be checked every 10 min and if total size is greater
  //    then WAL_size_limit_MB, they will be deleted starting with the
  //    earliest until size_limit is met. All empty files will be deleted.
  // 3. If WAL_ttl_seconds is not 0 and WAL_size_limit_MB is 0, then
  //    WAL files will be checked every WAL_ttl_secondsi / 2 and those that
  //    are older than WAL_ttl_seconds will be deleted.
  // 4. If both are not 0, WAL files will be checked every 10 min and both
  //    checks will be performed with ttl being first.
  uint64_t WAL_ttl_seconds;
  uint64_t WAL_size_limit_MB;
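
     As a rough sketch only (untested; the field names are from rocksdb::Options, and I'm assuming do_open ends up in a plain rocksdb::DB::Open -- the path and all values below are examples, not recommendations), the wiring plus the kind of micro-benchmark loop mentioned above could look like:

  #include <string>
  #include "rocksdb/db.h"
  #include "rocksdb/options.h"
  #include "rocksdb/write_batch.h"

  int main() {
    rocksdb::Options opt;
    opt.create_if_missing = true;
    opt.WAL_ttl_seconds   = 600;    // example only: keep archived WAL files ~10 minutes
    opt.WAL_size_limit_MB = 1024;   // or cap the WAL archive by size instead
    rocksdb::DB *db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opt, "/tmp/newstore-kv-bench", &db);  // hypothetical path

    // Simulate roughly "one WAL item per commit": one ~4KB value per synced batch.
    rocksdb::WriteOptions wo;
    wo.sync = true;
    std::string value(4096, 'x');
    for (int i = 0; i < 100000 && s.ok(); i++) {
      rocksdb::WriteBatch batch;
      batch.Put("wal_" + std::to_string(i), value);
      s = db->Write(wo, &batch);
    }
    delete db;
    return s.ok() ? 0 : 1;
  }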

							Xiaoxi

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Friday, April 17, 2015 8:11 PM
To: Sage Weil
Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
Subject: Re: Regarding newstore performance



On 04/16/2015 07:38 PM, Sage Weil wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>> Here is the data with omap separated to another SSD and after 1000GB 
>>> of fio writes (same profile)..
>>>
>>> omap writes:
>>> -------------
>>>
>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>
>>> Total flash writes in this period = 1150679336
>>>
>>> data writes:
>>> -----------
>>>
>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>
>>> Total flash writes in this period = 600238328
>>>
>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and 
>>> adding those getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal entries 
> at all will be the big win.  I think Xiaoxi had tunable suggestions 
> for that?  I didn't grok the rocksdb terms immediately so they didn't 
> make a lot of sense at the time.. this is probably a good place to 
> focus, though.  The rocksdb compaction stats should help out there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and 
> focused just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, 
> setting
> iodepth=16 makes no different *until* I also set thinktime=10 (us, or 
> almost any value really) and thinktime_blocks=16, at which point it 
> goes up with the iodepth.  I'm not quite sure what is going on there 
> but it seems to be preventing the elevator and/or disk from reordering 
> writes and make more efficient sweeps across the disk.  In any case, 
> though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qa 1 and thinktime of 250us, it drops to like 
> 15mb/sec, which is basically what I was getting from newstore.  Here's 
> my fio
> config:
>
> 	http://fpaste.org/212110/42923089/


Yikes!  That is a great observation Sage!

>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in 
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like 
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going 
> up much from there as I scale threads or qd, strangely; not sure why yet.
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And 
> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're 
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL 
> and would consistently SIGBUG from io_getevents() when ceph-osd did 
> dlopen() on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can 
> reproduce these results?  If so, we can redo the io size sweep.  I 
> picked 8 wal threads since that was enough to help and going higher 
> didn't seem to make much difference, but at some point we'll want to 
> be more careful about picking that number.  We could also use libaio 
> here, but I'm not sure it's worth it.  And this approach is somewhat 
> orthogonal to the idea of efficiently passing the kernel things to fdatasync.

Absolutely!  I'll get some tests running now.  Looks like everyone is jumping on the libaio bandwagon which naively seems like the right way to me too.  Can you talk a little bit more about how you'd see fdatasync work in this case though vs the threaded implementation?

>
> Anyway, next up is probably wrangling rocksdb's log!

I jumped on #rocksdb on freenode yesterday to ask about it, but I think we'll probably just need to hit the mailing list.

>
> sage
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17 14:40                 ` Chen, Xiaoxi
@ 2015-04-17 15:25                   ` Mark Nelson
  2015-04-17 16:05                     ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-17 15:25 UTC (permalink / raw)
  To: Chen, Xiaoxi, Sage Weil; +Cc: Somnath Roy, Haomai Wang, ceph-devel

Hi Xiaoxi,

I may not be understanding correctly, but doesn't this just control how
long the archive of old logs is kept around, rather than how long
writes live in the log?

Mark

On 04/17/2015 09:40 AM, Chen, Xiaoxi wrote:
> Hi Mark,
>
>       These two tunings should help on keeping the WAL log live long enough. By default the value is 0/0, that means the WAL log file will be deleted ASAP, this is definitely not the way we want. Sadly these two is not exposed by RocksDB store, need hand writing to  os/RocksDBStore.cc:: do_open.
>
>       Seems all the problem now is focusing on KV-DB, is that make sense for us to have a small benchmark tool that simulate newstore workload to RocksDB? The pattern seems like 1WAP item(4KB or something) per commit , in the 4KB random write case. then we can play with the tuning out of Ceph.
>
>         // The following two fields affect how archived logs will be deleted.
>    // 1. If both set to 0, logs will be deleted asap and will not get into
>    //    the archive.
>    // 2. If WAL_ttl_seconds is 0 and WAL_size_limit_MB is not 0,
>    //    WAL files will be checked every 10 min and if total size is greater
>    //    then WAL_size_limit_MB, they will be deleted starting with the
>    //    earliest until size_limit is met. All empty files will be deleted.
>    // 3. If WAL_ttl_seconds is not 0 and WAL_size_limit_MB is 0, then
>    //    WAL files will be checked every WAL_ttl_secondsi / 2 and those that
>    //    are older than WAL_ttl_seconds will be deleted.
>    // 4. If both are not 0, WAL files will be checked every 10 min and both
>    //    checks will be performed with ttl being first.
>    uint64_t WAL_ttl_seconds;
>    uint64_t WAL_size_limit_MB;
>
> 							Xiaoxi
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Friday, April 17, 2015 8:11 PM
> To: Sage Weil
> Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
> Subject: Re: Regarding newstore performance
>
>
>
> On 04/16/2015 07:38 PM, Sage Weil wrote:
>> On Thu, 16 Apr 2015, Mark Nelson wrote:
>>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>>> Here is the data with omap separated to another SSD and after 1000GB
>>>> of fio writes (same profile)..
>>>>
>>>> omap writes:
>>>> -------------
>>>>
>>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>>
>>>> Total flash writes in this period = 1150679336
>>>>
>>>> data writes:
>>>> -----------
>>>>
>>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>>
>>>> Total flash writes in this period = 600238328
>>>>
>>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and
>>>> adding those getting ~3.2 WA overall.
>>
>> This all suggests that getting rocksdb to not rewrite the wal entries
>> at all will be the big win.  I think Xiaoxi had tunable suggestions
>> for that?  I didn't grok the rocksdb terms immediately so they didn't
>> make a lot of sense at the time.. this is probably a good place to
>> focus, though.  The rocksdb compaction stats should help out there.
>>
>> But... today I ignored this entirely and put rocksdb in tmpfs and
>> focused just on the actual wal IOs done to the fragments files after the fact.
>> For simplicity I focused just on 128k random writes into 4mb objects.
>>
>> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
>> setting
>> iodepth=16 makes no different *until* I also set thinktime=10 (us, or
>> almost any value really) and thinktime_blocks=16, at which point it
>> goes up with the iodepth.  I'm not quite sure what is going on there
>> but it seems to be preventing the elevator and/or disk from reordering
>> writes and make more efficient sweeps across the disk.  In any case,
>> though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
>> Similarly, with qa 1 and thinktime of 250us, it drops to like
>> 15mb/sec, which is basically what I was getting from newstore.  Here's
>> my fio
>> config:
>>
>> 	http://fpaste.org/212110/42923089/
>
>
> Yikes!  That is a great observation Sage!
>
>>
>> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
>> flight so that the block layer and/or disk can reorder and be efficient.
>> I added a threadpool for doing wal work (newstore wal threads = 8 by
>> default) and it makes a big difference.  Now I am getting more like
>> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going
>> up much from there as I scale threads or qd, strangely; not sure why yet.
>>
>> But... that's a big improvement over a few days ago (~8mb/sec).  And
>> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
>> winning, yay!
>>
>> I tabled the libaio patch for now since it was getting spurious EINVAL
>> and would consistently SIGBUG from io_getevents() when ceph-osd did
>> dlopen() on the rados plugins (weird!).
>>
>> Mark, at this point it is probably worth checking that you can
>> reproduce these results?  If so, we can redo the io size sweep.  I
>> picked 8 wal threads since that was enough to help and going higher
>> didn't seem to make much difference, but at some point we'll want to
>> be more careful about picking that number.  We could also use libaio
>> here, but I'm not sure it's worth it.  And this approach is somewhat
>> orthogonal to the idea of efficiently passing the kernel things to fdatasync.
>
> Absolutely!  I'll get some tests running now.  Looks like everyone is jumping on the libaio bandwagon which naively seems like the right way to me too.  Can you talk a little bit more about how you'd see fdatasync work in this case though vs the threaded implementation?
>
>>
>> Anyway, next up is probably wrangling rocksdb's log!
>
> I jumped on #rocksdb on freenode yesterday to ask about it, but I think we'll probably just need to hit the mailing list.
>
>>
>> sage
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17  4:53               ` Haomai Wang
@ 2015-04-17 15:28                 ` Sage Weil
  0 siblings, 0 replies; 28+ messages in thread
From: Sage Weil @ 2015-04-17 15:28 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Mark Nelson, Somnath Roy, Chen, Xiaoxi, ceph-devel

On Fri, 17 Apr 2015, Haomai Wang wrote:
> On Fri, Apr 17, 2015 at 8:38 AM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 16 Apr 2015, Mark Nelson wrote:
> >> On 04/16/2015 01:17 AM, Somnath Roy wrote:
> >> > Here is the data with omap separated to another SSD and after 1000GB of fio
> >> > writes (same profile)..
> >> >
> >> > omap writes:
> >> > -------------
> >> >
> >> > Total host writes in this period = 551020111 ------ ~2101 GB
> >> >
> >> > Total flash writes in this period = 1150679336
> >> >
> >> > data writes:
> >> > -----------
> >> >
> >> > Total host writes in this period = 302550388 --- ~1154 GB
> >> >
> >> > Total flash writes in this period = 600238328
> >> >
> >> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
> >> > getting ~3.2 WA overall.
> >
> > This all suggests that getting rocksdb to not rewrite the wal
> > entries at all will be the big win.  I think Xiaoxi had tunable
> > suggestions for that?  I didn't grok the rocksdb terms immediately so
> > they didn't make a lot of sense at the time.. this is probably a good
> > place to focus, though.  The rocksdb compaction stats should help out
> > there.
> >
> > But... today I ignored this entirely and put rocksdb in tmpfs and focused
> > just on the actual wal IOs done to the fragments files after the fact.
> > For simplicity I focused just on 128k random writes into 4mb objects.
> >
> > fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> > iodepth=16 makes no different *until* I also set thinktime=10 (us, or
> > almost any value really) and thinktime_blocks=16, at which point it goes
> > up with the iodepth.  I'm not quite sure what is going on there but it
> > seems to be preventing the elevator and/or disk from reordering writes and
> > make more efficient sweeps across the disk.  In any case, though, with
> > that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> > Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec,
> > which is basically what I was getting from newstore.  Here's my fio
> > config:
> >
> >         http://fpaste.org/212110/42923089/
> >
> > Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> > flight so that the block layer and/or disk can reorder and be efficient.
> > I added a threadpool for doing wal work (newstore wal threads = 8 by
> > default) and it makes a big difference.  Now I am getting more like
> > 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> > much from there as I scale threads or qd, strangely; not sure why yet.
> 
> Do you mean this PR(https://github.com/ceph/ceph/pull/4318)? I have a
> simple benchmark at the comment of PR.

Sorry no, this is talking about the aio kernel interface (and the libaio 
wrapper for it) that newstore is/will use instead of the usual 
synchronous write(2) etc calls.

> > But... that's a big improvement over a few days ago (~8mb/sec).  And on
> > this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> > winning, yay!
> >
> > I tabled the libaio patch for now since it was getting spurious EINVAL and
> > would consistently SIGBUG from io_getevents() when ceph-osd did dlopen()
> > on the rados plugins (weird!).
> >
> > Mark, at this point it is probably worth checking that you can reproduce
> > these results?  If so, we can redo the io size sweep.  I picked 8 wal
> > threads since that was enough to help and going higher didn't seem to make
> > much difference, but at some point we'll want to be more careful about
> > picking that number.  We could also use libaio here, but I'm not sure it's
> > worth it.  And this approach is somewhat orthogonal to the idea of
> > efficiently passing the kernel things to fdatasync.
> 
> Agreed, this time I think we need to focus data store only. Maybe I'm
> missing, what's your overlay config value in this test?

For these tests I had overlay disabled to focus on the WAL behavior 
(newstore overlay max = 0).

FWIW I think we'll need to be really careful with the overlay max extent 
size too as it tends to shovel lots of data into rocksdb that is 
inevitably write amplified.  The expected net result is that WA will be 
overall higher but latency will be lower because of fewer seeks when we go 
off to do the random io to the fragment file.

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17 12:10               ` Mark Nelson
  2015-04-17 14:08                 ` Chen, Xiaoxi
  2015-04-17 14:40                 ` Chen, Xiaoxi
@ 2015-04-17 15:46                 ` Sage Weil
  2015-04-18  3:34                   ` Mark Nelson
  2 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2015-04-17 15:46 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Somnath Roy, Chen, Xiaoxi, Haomai Wang, ceph-devel

On Fri, 17 Apr 2015, Mark Nelson wrote:
> On 04/16/2015 07:38 PM, Sage Weil wrote:
> > On Thu, 16 Apr 2015, Mark Nelson wrote:
> > > On 04/16/2015 01:17 AM, Somnath Roy wrote:
> > > > Here is the data with omap separated to another SSD and after 1000GB of
> > > > fio
> > > > writes (same profile)..
> > > > 
> > > > omap writes:
> > > > -------------
> > > > 
> > > > Total host writes in this period = 551020111 ------ ~2101 GB
> > > > 
> > > > Total flash writes in this period = 1150679336
> > > > 
> > > > data writes:
> > > > -----------
> > > > 
> > > > Total host writes in this period = 302550388 --- ~1154 GB
> > > > 
> > > > Total flash writes in this period = 600238328
> > > > 
> > > > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding
> > > > those
> > > > getting ~3.2 WA overall.
> > 
> > This all suggests that getting rocksdb to not rewrite the wal
> > entries at all will be the big win.  I think Xiaoxi had tunable
> > suggestions for that?  I didn't grok the rocksdb terms immediately so
> > they didn't make a lot of sense at the time.. this is probably a good
> > place to focus, though.  The rocksdb compaction stats should help out
> > there.
> > 
> > But... today I ignored this entirely and put rocksdb in tmpfs and focused
> > just on the actual wal IOs done to the fragments files after the fact.
> > For simplicity I focused just on 128k random writes into 4mb objects.
> > 
> > fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> > iodepth=16 makes no different *until* I also set thinktime=10 (us, or
> > almost any value really) and thinktime_blocks=16, at which point it goes
> > up with the iodepth.  I'm not quite sure what is going on there but it
> > seems to be preventing the elevator and/or disk from reordering writes and
> > make more efficient sweeps across the disk.  In any case, though, with
> > that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> > Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec,
> > which is basically what I was getting from newstore.  Here's my fio
> > config:
> > 
> > 	http://fpaste.org/212110/42923089/
> 
> 
> Yikes!  That is a great observation Sage!
> 
> > 
> > Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> > flight so that the block layer and/or disk can reorder and be efficient.
> > I added a threadpool for doing wal work (newstore wal threads = 8 by
> > default) and it makes a big difference.  Now I am getting more like
> > 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> > much from there as I scale threads or qd, strangely; not sure why yet.
> > 
> > But... that's a big improvement over a few days ago (~8mb/sec).  And on
> > this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> > winning, yay!
> > 
> > I tabled the libaio patch for now since it was getting spurious EINVAL and
> > would consistently SIGBUG from io_getevents() when ceph-osd did dlopen()
> > on the rados plugins (weird!).
> > 
> > Mark, at this point it is probably worth checking that you can reproduce
> > these results?  If so, we can redo the io size sweep.  I picked 8 wal
> > threads since that was enough to help and going higher didn't seem to make
> > much difference, but at some point we'll want to be more careful about
> > picking that number.  We could also use libaio here, but I'm not sure it's
> > worth it.  And this approach is somewhat orthogonal to the idea of
> > efficiently passing the kernel things to fdatasync.
> 
> Absolutely!  I'll get some tests running now.  Looks like everyone is jumping
> on the libaio bandwagon which naively seems like the right way to me too.  Can
> you talk a little bit more about how you'd see fdatasync work in this case
> though vs the threaded implementation?

That I'm not certain about. I'm not sure whether I need O_DSYNC or
whether the libaio fsync hook actually works; the docs are ambiguous.
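
For what it's worth, a minimal sketch of the hook in question, assuming the plain libaio API (io_prep_fdsync and friends); whether the kernel/filesystem actually honours the aio fsync command is exactly the ambiguity, many setups just fail it with EINVAL:

  #include <libaio.h>
  #include <unistd.h>

  // Try to queue an fdatasync through libaio (ctx must already be set up
  // with io_setup); fall back to a plain synchronous fdatasync(2) if the
  // kernel rejects the request.
  int aio_fdatasync(io_context_t ctx, int fd)
  {
      struct iocb cb, *cbs[1] = { &cb };
      struct io_event ev;
      io_prep_fdsync(&cb, fd);
      if (io_submit(ctx, 1, cbs) != 1)
          return fdatasync(fd);
      return io_getevents(ctx, 1, 1, &ev, NULL) == 1 ? 0 : -1;
  }

The other alternative would be opening the fragment files with O_DSYNC so each completed write is already durable and no separate sync is needed, at the cost of making every write synchronous with respect to the media.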

> > Anyway, next up is probably wrangling rocksdb's log!
> 
> I jumped on #rocksdb on freenode yesterday to ask about it, but I think we'll
> probably just need to hit the mailing list.

This appears to be the place to reach rocksdb folks:

	https://www.facebook.com/groups/rocksdb.dev/

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17 15:25                   ` Mark Nelson
@ 2015-04-17 16:05                     ` Sage Weil
  2015-04-17 16:59                       ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2015-04-17 16:05 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Chen, Xiaoxi, Somnath Roy, Haomai Wang, ceph-devel

On Fri, 17 Apr 2015, Mark Nelson wrote:
> Hi Xioxi,
> 
> I may not be understanding correctly, but doesn't this just control how long
> the archive of old logs are kept around for rather than how long writes live
> in the log?

FWIW here's a recommendation from rocksdb folks:

Igor Canadi: If you set your write_buffer_size to be big and 
purge_redundant_kvs_while_flush to true (this is the default) then your deleted 
keys should never be flushed to disk.

Have you guys managed to adjust these tunables to avoid any rewrites of 
wal keys?  Once we see an improvement we should change the defaults 
accordingly.  Hopefully we can get the log to be really big without 
adverse effects (e.g. we still want the keys to be rewritten in smallish 
chunks so there isn't a big spike)...
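
For reference, a minimal sketch of what that tuning looks like on rocksdb::Options of this vintage (field names assumed from the 2015-era API; the values just mirror the ones discussed in this thread, not recommendations):

  rocksdb::Options opt;
  opt.write_buffer_size = 512ULL << 20;          // big memtable so short-lived wal keys die in memory
  opt.max_write_buffer_number = 6;
  opt.min_write_buffer_number_to_merge = 2;
  opt.purge_redundant_kvs_while_flush = true;    // the default Igor refers to, stated explicitly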

sage


> 
> Mark
> 
> On 04/17/2015 09:40 AM, Chen, Xiaoxi wrote:
> > Hi Mark,
> > 
> >       These two tunings should help on keeping the WAL log live long enough.
> > By default the value is 0/0, that means the WAL log file will be deleted
> > ASAP, this is definitely not the way we want. Sadly these two is not exposed
> > by RocksDB store, need hand writing to  os/RocksDBStore.cc:: do_open.
> > 
> >       Seems all the problem now is focusing on KV-DB, is that make sense for
> > us to have a small benchmark tool that simulate newstore workload to
> > RocksDB? The pattern seems like 1WAP item(4KB or something) per commit , in
> > the 4KB random write case. then we can play with the tuning out of Ceph.
> > 
> >         // The following two fields affect how archived logs will be
> > deleted.
> >    // 1. If both set to 0, logs will be deleted asap and will not get into
> >    //    the archive.
> >    // 2. If WAL_ttl_seconds is 0 and WAL_size_limit_MB is not 0,
> >    //    WAL files will be checked every 10 min and if total size is greater
> >    //    then WAL_size_limit_MB, they will be deleted starting with the
> >    //    earliest until size_limit is met. All empty files will be deleted.
> >    // 3. If WAL_ttl_seconds is not 0 and WAL_size_limit_MB is 0, then
> >    //    WAL files will be checked every WAL_ttl_secondsi / 2 and those that
> >    //    are older than WAL_ttl_seconds will be deleted.
> >    // 4. If both are not 0, WAL files will be checked every 10 min and both
> >    //    checks will be performed with ttl being first.
> >    uint64_t WAL_ttl_seconds;
> >    uint64_t WAL_size_limit_MB;
> > 
> > 							Xiaoxi
> > 
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@redhat.com]
> > Sent: Friday, April 17, 2015 8:11 PM
> > To: Sage Weil
> > Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
> > Subject: Re: Regarding newstore performance
> > 
> > 
> > 
> > On 04/16/2015 07:38 PM, Sage Weil wrote:
> > > On Thu, 16 Apr 2015, Mark Nelson wrote:
> > > > On 04/16/2015 01:17 AM, Somnath Roy wrote:
> > > > > Here is the data with omap separated to another SSD and after 1000GB
> > > > > of fio writes (same profile)..
> > > > > 
> > > > > omap writes:
> > > > > -------------
> > > > > 
> > > > > Total host writes in this period = 551020111 ------ ~2101 GB
> > > > > 
> > > > > Total flash writes in this period = 1150679336
> > > > > 
> > > > > data writes:
> > > > > -----------
> > > > > 
> > > > > Total host writes in this period = 302550388 --- ~1154 GB
> > > > > 
> > > > > Total flash writes in this period = 600238328
> > > > > 
> > > > > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and
> > > > > adding those getting ~3.2 WA overall.
> > > 
> > > This all suggests that getting rocksdb to not rewrite the wal entries
> > > at all will be the big win.  I think Xiaoxi had tunable suggestions
> > > for that?  I didn't grok the rocksdb terms immediately so they didn't
> > > make a lot of sense at the time.. this is probably a good place to
> > > focus, though.  The rocksdb compaction stats should help out there.
> > > 
> > > But... today I ignored this entirely and put rocksdb in tmpfs and
> > > focused just on the actual wal IOs done to the fragments files after the
> > > fact.
> > > For simplicity I focused just on 128k random writes into 4mb objects.
> > > 
> > > fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
> > > setting
> > > iodepth=16 makes no different *until* I also set thinktime=10 (us, or
> > > almost any value really) and thinktime_blocks=16, at which point it
> > > goes up with the iodepth.  I'm not quite sure what is going on there
> > > but it seems to be preventing the elevator and/or disk from reordering
> > > writes and make more efficient sweeps across the disk.  In any case,
> > > though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec
> > > with qd 64.
> > > Similarly, with qa 1 and thinktime of 250us, it drops to like
> > > 15mb/sec, which is basically what I was getting from newstore.  Here's
> > > my fio
> > > config:
> > > 
> > > 	http://fpaste.org/212110/42923089/
> > 
> > 
> > Yikes!  That is a great observation Sage!
> > 
> > > 
> > > Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> > > flight so that the block layer and/or disk can reorder and be efficient.
> > > I added a threadpool for doing wal work (newstore wal threads = 8 by
> > > default) and it makes a big difference.  Now I am getting more like
> > > 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going
> > > up much from there as I scale threads or qd, strangely; not sure why yet.
> > > 
> > > But... that's a big improvement over a few days ago (~8mb/sec).  And
> > > on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> > > winning, yay!
> > > 
> > > I tabled the libaio patch for now since it was getting spurious EINVAL
> > > and would consistently SIGBUG from io_getevents() when ceph-osd did
> > > dlopen() on the rados plugins (weird!).
> > > 
> > > Mark, at this point it is probably worth checking that you can
> > > reproduce these results?  If so, we can redo the io size sweep.  I
> > > picked 8 wal threads since that was enough to help and going higher
> > > didn't seem to make much difference, but at some point we'll want to
> > > be more careful about picking that number.  We could also use libaio
> > > here, but I'm not sure it's worth it.  And this approach is somewhat
> > > orthogonal to the idea of efficiently passing the kernel things to
> > > fdatasync.
> > 
> > Absolutely!  I'll get some tests running now.  Looks like everyone is
> > jumping on the libaio bandwagon which naively seems like the right way to me
> > too.  Can you talk a little bit more about how you'd see fdatasync work in
> > this case though vs the threaded implementation?
> > 
> > > 
> > > Anyway, next up is probably wrangling rocksdb's log!
> > 
> > I jumped on #rocksdb on freenode yesterday to ask about it, but I think
> > we'll probably just need to hit the mailing list.
> > 
> > > 
> > > sage
> > > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17 16:05                     ` Sage Weil
@ 2015-04-17 16:59                       ` Mark Nelson
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-17 16:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chen, Xiaoxi, Somnath Roy, Haomai Wang, ceph-devel



On 04/17/2015 11:05 AM, Sage Weil wrote:
> On Fri, 17 Apr 2015, Mark Nelson wrote:
>> Hi Xioxi,
>>
>> I may not be understanding correctly, but doesn't this just control how long
>> the archive of old logs are kept around for rather than how long writes live
>> in the log?
>
> FWIW here's a recommendation from rocksdb folks:
>
> Igor Canadi: If you set your write_buffer_size to be big and
> purge_redundant_kvs_while_flush to true (this is defaul) then your deleted
> keys should never be flushed to disk.
>
> Have you guys managed to adjust these tunables to avoid any rewrites of
> wal keys?  Once we see an improvement we should change the defaults
> accordingly.  Hopefully we can get the log to be really big without
> adverse effects (e.g. we still want the keys to be rewritten in smallish
> chunks so there isn't a big spike)...

So I'm using Xiaoxi's tunables for all of the recent tests:

write_buffer_size = 512M
max_write_buffer_number = 6
min_write_buffer_number_to_merge = 2

This is what we saw on SSD at least:

http://nhm.ceph.com/newstore_xiaoxi_fdatasync.pdf

Basically Xiaoxi's tunables help a decent amount, especially at 512k-2MB 
IO sizes.  fdatasync helps a little more, especially at smaller IO sizes 
that are hard to see in that graph.

So far, the new threaded WAL implementation gets us a little more yet, 
maybe another 0-10%.  So we keep making little steps.

Going to go back and see how spinning disks do now.

>
> sage
>
>
>>
>> Mark
>>
>> On 04/17/2015 09:40 AM, Chen, Xiaoxi wrote:
>>> Hi Mark,
>>>
>>>        These two tunings should help on keeping the WAL log live long enough.
>>> By default the value is 0/0, that means the WAL log file will be deleted
>>> ASAP, this is definitely not the way we want. Sadly these two is not exposed
>>> by RocksDB store, need hand writing to  os/RocksDBStore.cc:: do_open.
>>>
>>>        Seems all the problem now is focusing on KV-DB, is that make sense for
>>> us to have a small benchmark tool that simulate newstore workload to
>>> RocksDB? The pattern seems like 1WAP item(4KB or something) per commit , in
>>> the 4KB random write case. then we can play with the tuning out of Ceph.
>>>
>>>          // The following two fields affect how archived logs will be
>>> deleted.
>>>     // 1. If both set to 0, logs will be deleted asap and will not get into
>>>     //    the archive.
>>>     // 2. If WAL_ttl_seconds is 0 and WAL_size_limit_MB is not 0,
>>>     //    WAL files will be checked every 10 min and if total size is greater
>>>     //    then WAL_size_limit_MB, they will be deleted starting with the
>>>     //    earliest until size_limit is met. All empty files will be deleted.
>>>     // 3. If WAL_ttl_seconds is not 0 and WAL_size_limit_MB is 0, then
>>>     //    WAL files will be checked every WAL_ttl_secondsi / 2 and those that
>>>     //    are older than WAL_ttl_seconds will be deleted.
>>>     // 4. If both are not 0, WAL files will be checked every 10 min and both
>>>     //    checks will be performed with ttl being first.
>>>     uint64_t WAL_ttl_seconds;
>>>     uint64_t WAL_size_limit_MB;
>>>
>>> 							Xiaoxi
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>> Sent: Friday, April 17, 2015 8:11 PM
>>> To: Sage Weil
>>> Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
>>> Subject: Re: Regarding newstore performance
>>>
>>>
>>>
>>> On 04/16/2015 07:38 PM, Sage Weil wrote:
>>>> On Thu, 16 Apr 2015, Mark Nelson wrote:
>>>>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>>>>>> Here is the data with omap separated to another SSD and after 1000GB
>>>>>> of fio writes (same profile)..
>>>>>>
>>>>>> omap writes:
>>>>>> -------------
>>>>>>
>>>>>> Total host writes in this period = 551020111 ------ ~2101 GB
>>>>>>
>>>>>> Total flash writes in this period = 1150679336
>>>>>>
>>>>>> data writes:
>>>>>> -----------
>>>>>>
>>>>>> Total host writes in this period = 302550388 --- ~1154 GB
>>>>>>
>>>>>> Total flash writes in this period = 600238328
>>>>>>
>>>>>> So, actual data write WA is ~1.1 but omap overhead is ~2.1 and
>>>>>> adding those getting ~3.2 WA overall.
>>>>
>>>> This all suggests that getting rocksdb to not rewrite the wal entries
>>>> at all will be the big win.  I think Xiaoxi had tunable suggestions
>>>> for that?  I didn't grok the rocksdb terms immediately so they didn't
>>>> make a lot of sense at the time.. this is probably a good place to
>>>> focus, though.  The rocksdb compaction stats should help out there.
>>>>
>>>> But... today I ignored this entirely and put rocksdb in tmpfs and
>>>> focused just on the actual wal IOs done to the fragments files after the
>>>> fact.
>>>> For simplicity I focused just on 128k random writes into 4mb objects.
>>>>
>>>> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
>>>> setting
>>>> iodepth=16 makes no different *until* I also set thinktime=10 (us, or
>>>> almost any value really) and thinktime_blocks=16, at which point it
>>>> goes up with the iodepth.  I'm not quite sure what is going on there
>>>> but it seems to be preventing the elevator and/or disk from reordering
>>>> writes and make more efficient sweeps across the disk.  In any case,
>>>> though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec
>>>> with qd 64.
>>>> Similarly, with qa 1 and thinktime of 250us, it drops to like
>>>> 15mb/sec, which is basically what I was getting from newstore.  Here's
>>>> my fio
>>>> config:
>>>>
>>>> 	http://fpaste.org/212110/42923089/
>>>
>>>
>>> Yikes!  That is a great observation Sage!
>>>
>>>>
>>>> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
>>>> flight so that the block layer and/or disk can reorder and be efficient.
>>>> I added a threadpool for doing wal work (newstore wal threads = 8 by
>>>> default) and it makes a big difference.  Now I am getting more like
>>>> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going
>>>> up much from there as I scale threads or qd, strangely; not sure why yet.
>>>>
>>>> But... that's a big improvement over a few days ago (~8mb/sec).  And
>>>> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
>>>> winning, yay!
>>>>
>>>> I tabled the libaio patch for now since it was getting spurious EINVAL
>>>> and would consistently SIGBUG from io_getevents() when ceph-osd did
>>>> dlopen() on the rados plugins (weird!).
>>>>
>>>> Mark, at this point it is probably worth checking that you can
>>>> reproduce these results?  If so, we can redo the io size sweep.  I
>>>> picked 8 wal threads since that was enough to help and going higher
>>>> didn't seem to make much difference, but at some point we'll want to
>>>> be more careful about picking that number.  We could also use libaio
>>>> here, but I'm not sure it's worth it.  And this approach is somewhat
>>>> orthogonal to the idea of efficiently passing the kernel things to
>>>> fdatasync.
>>>
>>> Absolutely!  I'll get some tests running now.  Looks like everyone is
>>> jumping on the libaio bandwagon which naively seems like the right way to me
>>> too.  Can you talk a little bit more about how you'd see fdatasync work in
>>> this case though vs the threaded implementation?
>>>
>>>>
>>>> Anyway, next up is probably wrangling rocksdb's log!
>>>
>>> I jumped on #rocksdb on freenode yesterday to ask about it, but I think
>>> we'll probably just need to hit the mailing list.
>>>
>>>>
>>>> sage
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-17 15:46                 ` Sage Weil
@ 2015-04-18  3:34                   ` Mark Nelson
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-18  3:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Chen, Xiaoxi, Haomai Wang, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 945 bytes --]

On 04/17/2015 10:46 AM, Sage Weil wrote:
> On Fri, 17 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 07:38 PM, Sage Weil wrote:
>> Absolutely!  I'll get some tests running now.  Looks like everyone is jumping
>> on the libaio bandwagon which naively seems like the right way to me too.  Can
>> you talk a little bit more about how you'd see fdatasync work in this case
>> though vs the threaded implementation?
>
> That I'm not certain about, not sure if I need O_DSYNC or if the libaio
> fsync hook actually works; the docs are ambiguous.

Ran tests with 8 WAL threads and compared to filestore numbers.  This is 
using the Xiaoxi tunables.  Things are definitely looking better on 
spinning disks.  Random writes even without the SSD WAL are far more 
comparable to filestore than previously.  Sequential writes are better 
but still sometimes quite a bit worse.

PCIe SSD numbers have improved marginally, but they are still far below filestore.

Mark

[-- Attachment #2: newstore_wal_threads.pdf --]
[-- Type: application/pdf, Size: 72069 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-14  0:12   ` Somnath Roy
@ 2015-04-14  0:21     ` Mark Nelson
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-14  0:21 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

It's trying to keep writes (and now appends) to objects in the K/V 
store for a while, so that if there are multiple writes to the same 
object it can write them out and fsync to the FS just once.  The 
problem is that rocksdb has no idea these are short-lived entries and, 
with default tunables, seems to want to move them out of level 0 
quickly, so there's tons of write amplification going on, even with 
the WAL on the SSD.
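
To make that concrete, here is a toy sketch of the overlay idea only (not the actual newstore code; std::map stands in for the K/V store and a std::string for the object's backing file):

  #include <cstdint>
  #include <map>
  #include <string>

  struct ToyOverlayStore {
    std::map<uint64_t, std::string> overlay;   // offset -> deferred small write
    std::string object;                        // pretend fragment file contents

    void write(uint64_t off, const std::string& data, size_t overlay_max = 64 * 1024) {
      if (data.size() <= overlay_max)
        overlay[off] = data;                   // small write: just a K/V put, no file fsync yet
      else
        apply(off, data);                      // big write goes straight to the "file"
    }
    void flush() {                             // later: merge everything; real code fsyncs once here
      for (auto& kv : overlay)
        apply(kv.first, kv.second);
      overlay.clear();
    }
    void apply(uint64_t off, const std::string& data) {
      if (object.size() < off + data.size())
        object.resize(off + data.size());
      object.replace(off, data.size(), data);
    }
  };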

Mark

On 04/13/2015 07:12 PM, Somnath Roy wrote:
> Thanks Mark..Let me know if you are doing any tuning in rocksdb layer.
> BTW, do you know what this overlay does ? why it is impacting performance so much ?
> By looking at the code I am seeing lot of extra K/v operation in case of overlay writes.
> Waiting for Sage's reply on that part..
>
> Regards
> Somnath
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Monday, April 13, 2015 5:07 PM
> To: Somnath Roy; ceph-devel
> Subject: Re: Regarding newstore performance
>
> Hi Somnath,  I'm running similar tests right now looking at newstore with 8m and no overlay on spinning disk, spinning disk + SSD WAL, and SSD.  Should have results in the next hour or two.
>
> Mark
>
> On 04/13/2015 06:53 PM, Somnath Roy wrote:
>> Sage,
>> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>>
>> Test:
>> -----
>>
>>           64K random writes with QD= 64 using fio_rbd.
>>
>> Results :
>> ----------
>>
>>           1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>>
>>           2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>>
>>           3. I tried to tweak all the settings one by one but not much benefit anywhere.
>>
>>           4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>>                This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>>
>>           5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>>
>> I am trying to figure out from the code what exactly this overlay writes does. Any insight/explanation would be helpful here.
>>
>> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>>
>> Thanks & Regards
>> Somnath
>>
>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Regarding newstore performance
  2015-04-14  0:06 ` Mark Nelson
@ 2015-04-14  0:12   ` Somnath Roy
  2015-04-14  0:21     ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2015-04-14  0:12 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

Thanks Mark. Let me know if you are doing any tuning in the rocksdb layer.
BTW, do you know what this overlay does? Why is it impacting performance so much?
By looking at the code I am seeing a lot of extra K/V operations in the case of overlay writes.
Waiting for Sage's reply on that part.

Regards
Somnath

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Monday, April 13, 2015 5:07 PM
To: Somnath Roy; ceph-devel
Subject: Re: Regarding newstore performance

Hi Somnath,  I'm running similar tests right now looking at newstore with 8m and no overlay on spinning disk, spinning disk + SSD WAL, and SSD.  Should have results in the next hour or two.

Mark

On 04/13/2015 06:53 PM, Somnath Roy wrote:
> Sage,
> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>
> Test:
> -----
>
>          64K random writes with QD= 64 using fio_rbd.
>
> Results :
> ----------
>
>          1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>
>          2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>
>          3. I tried to tweak all the settings one by one but not much benefit anywhere.
>
>          4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>               This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>
>          5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>
> I am trying to figure out from the code what exactly this overlay writes does. Any insight/explanation would be helpful here.
>
> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>
> Thanks & Regards
> Somnath
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regarding newstore performance
  2015-04-13 23:53 Somnath Roy
@ 2015-04-14  0:06 ` Mark Nelson
  2015-04-14  0:12   ` Somnath Roy
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-14  0:06 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Hi Somnath,  I'm running similar tests right now looking at newstore 
with 8m and no overlay on spinning disk, spinning disk + SSD WAL, and 
SSD.  Should have results in the next hour or two.

Mark

On 04/13/2015 06:53 PM, Somnath Roy wrote:
> Sage,
> I was doing some preliminary performance testing of newstore on a single OSD (SSD) , single replication setup. Here is my findings so far.
>
> Test:
> -----
>
>          64K random writes with QD= 64 using fio_rbd.
>
> Results :
> ----------
>
>          1. With all default settings, I am seeing very spiky performance. FIO is reporting between 0-~1K random write IOPS with many times IO stops at 0s...Tried with bigger overlay max size value but results are similar...
>
>          2. Next I set the newstore_overlay_max = 0 and I got pretty stable performance ~800-900 IOPS (write duration is short though).
>
>          3. I tried to tweak all the settings one by one but not much benefit anywhere.
>
>          4. One interesting observation here, in my setup if I set newstore_sync_queue_transaction = true , I am getting iops ~600-700..Which is ~100 less.
>               This is quite contrary to my keyvaluestore experiment where I got ~3X improvement by doing sync  writes !
>
>          5. Filestore performance in the similar setup is ~1.6K after 1 TB of data write.
>
> I am trying to figure out from the code what exactly this overlay writes does. Any insight/explanation would be helpful here.
>
> I am planning to do some more experiment with newstore including WA comparison between filestore vs newstore. Will publish the result soon.
>
> Thanks & Regards
> Somnath
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Regarding newstore performance
@ 2015-04-13 23:53 Somnath Roy
  2015-04-14  0:06 ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2015-04-13 23:53 UTC (permalink / raw)
  To: ceph-devel

Sage,
I was doing some preliminary performance testing of newstore on a single OSD (SSD), single replication setup. Here are my findings so far.

Test:
-----

        64K random writes with QD= 64 using fio_rbd.

Results :
----------

        1. With all default settings, I am seeing very spiky performance. fio is reporting between 0 and ~1K random write IOPS, and IO often stalls at 0 for a while... I tried bigger overlay max size values but the results are similar...

        2. Next I set newstore_overlay_max = 0 and got pretty stable performance of ~800-900 IOPS (the write duration is short, though).

        3. I tried tweaking all the settings one by one, but there was not much benefit anywhere.

        4. One interesting observation here: in my setup, if I set newstore_sync_queue_transaction = true, I am getting ~600-700 IOPS, which is ~100 less.
             This is quite contrary to my keyvaluestore experiment, where I got a ~3X improvement by doing sync writes!

        5. Filestore performance in a similar setup is ~1.6K IOPS after writing 1 TB of data.

I am trying to figure out from the code what exactly these overlay writes do. Any insight/explanation would be helpful here.

I am planning to do some more experiments with newstore, including a WA comparison between filestore and newstore. Will publish the results soon.

Thanks & Regards
Somnath







^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-04-18  3:34 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-15  6:01 Regarding newstore performance Somnath Roy
2015-04-15 12:23 ` Haomai Wang
2015-04-15 16:07   ` Somnath Roy
2015-04-16  1:47     ` Chen, Xiaoxi
2015-04-16  4:22       ` Somnath Roy
2015-04-16  6:17         ` Somnath Roy
2015-04-16 18:17           ` Mark Nelson
2015-04-17  0:38             ` Sage Weil
2015-04-17  0:47               ` Gregory Farnum
2015-04-17  0:53                 ` Sage Weil
2015-04-17  0:55                 ` Chen, Xiaoxi
2015-04-17  4:53               ` Haomai Wang
2015-04-17 15:28                 ` Sage Weil
2015-04-17 12:10               ` Mark Nelson
2015-04-17 14:08                 ` Chen, Xiaoxi
2015-04-17 14:20                   ` Haomai Wang
2015-04-17 14:29                     ` Chen, Xiaoxi
2015-04-17 14:34                       ` Mark Nelson
2015-04-17 14:40                 ` Chen, Xiaoxi
2015-04-17 15:25                   ` Mark Nelson
2015-04-17 16:05                     ` Sage Weil
2015-04-17 16:59                       ` Mark Nelson
2015-04-17 15:46                 ` Sage Weil
2015-04-18  3:34                   ` Mark Nelson
  -- strict thread matches above, loose matches on Subject: below --
2015-04-13 23:53 Somnath Roy
2015-04-14  0:06 ` Mark Nelson
2015-04-14  0:12   ` Somnath Roy
2015-04-14  0:21     ` Mark Nelson
