From: "Chen, Xiaoxi" <xiaoxi.chen@intel.com>
To: Somnath Roy <Somnath.Roy@sandisk.com>,
	Haomai Wang <haomaiwang@gmail.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: Regarding newstore performance
Date: Thu, 16 Apr 2015 01:47:38 +0000
Message-ID: <6F3FA899187F0043BA1827A69DA2F7CC021CE207@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <755F6B91B3BE364F9BCA11EA3F9E0C6F2CD7A0A1@SACMBXIP01.sdcorp.global.sandisk.com>

Hi Somnath,
     You could try applying this one :)
     https://github.com/ceph/ceph/pull/4356

      BTW, the previous RocksDB configuration had a bug that set rocksdb_disableDataSync to true by default, which may cause data loss on failure. So please update newstore to the latest code, or manually set it to false. I suspect the KVDB performance will be worse after doing this... but that's the way we need to go.
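
      For reference, a minimal ceph.conf sketch of the safe setting (a sketch only, assuming the option is exposed under [osd] with exactly the name used in this thread; verify against your build):

          [osd]
          # assumption: option name/section as discussed in this thread
          rocksdb_disableDataSync = false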
										Xiaoxi

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, April 16, 2015 12:07 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Haomai,
Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA.
I have written this tool on top of these disk counters. I can share it, but you need a SanDisk Optimus Echo (or Max) drive to make it work :-)

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Sage/Mark,
> I did some write amplification (WA) experiments with newstore using the same settings I mentioned yesterday.
>
> Test:
> -------
>
> 64K Random write with 64 QD and writing total of 1 TB of data.
>
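> For reference, a fio job file along these lines would reproduce this run
> (a sketch; the pool/image/client names are placeholders, not the ones
> actually used):
>
>     [rbd-randwrite-64k]
>     ; clientname/pool/rbdname below are placeholders
>     ioengine=rbd
>     clientname=admin
>     pool=rbd
>     rbdname=testimage
>     rw=randwrite
>     bs=64k
>     iodepth=64
>     size=1000g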
>
> Newstore:
> ------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
> iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
>   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
>     slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
>     clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
>      lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
>     clat percentiles (msec):
>      |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
>      | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
>      | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
>      | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
>      | 99.99th=[ 1270]
>     bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
>     lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
>     lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
>   cpu          : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
> mint=21421419msec, maxt=21421419msec
>
>
> So, the iops achieved is ~764.
> The 99th percentile latency is ~109 ms (per the clat percentiles above).
>
> Write amplification at disk level:
> --------------------------------------
>
> SanDisk SSDs have disk-level counters that measure the number of host writes and the number of actual flash writes, both in units of the flash logical page size. The difference between these two reflects the extra WA incurred inside the disk.
>
> Please find the data in the following xls.
>
> https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
> cX5TLMRzdXyJE/edit?usp=sharing
>
> Total host writes in this period = 923896266
>
> Total flash writes in this period = 1465339040
>
>
> FileStore:
> -------------
>
> Fio output at the end of 1 TB write.
> -------------------------------------------
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
> ioengine=rbd, iodepth=64
> fio-2.1.11-20-g9a44
> Starting 1 process
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
> iops] [eta 00m:01s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
>   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
>     slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
>     clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
>      lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
>     clat percentiles (msec):
>      |  1.00th=[    7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
>      | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
>      | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
>      | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
>      | 99.99th=[ 1647]
>     bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00
>     lat (usec) : 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
>     lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
>     lat (msec) : 2000=0.06%, >=2000=0.01%
>   cpu          : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
>   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
> mint=10636117msec, maxt=10636117msec
>
> Disk stats (read/write):
>   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%
>
> So, iops here is ~1540.
> Average latency is ~41 ms (the 99th percentile above is ~433 ms).
>
>
> Write amplification at disk level:
> --------------------------------------
>
> Total host writes in this period = 643611346
>
> Total flash writes in this period = 1157304512
>
>
>
> https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61
> Fz49CLH8WPh7Q/edit?pli=1#gid=95373000
>
>
>
>
>
> Summary:
> ------------
>
> 1. Performance is doubled with filestore, and latency is almost halved.
>
> 2. The total number of flash writes is affected by both the application write pattern and the FTL logic, etc., so I am not going into that. The thing to note is the significant increase in host writes with newstore, which is definitely causing extra WA compared to filestore.
>

Yeah, it seems that xfs plays well with writeback.

> 3. Considering a flash page size of 4K, the total host writes for the 1000 GB fio write come to 2455 GB with filestore vs 3524 GB with newstore. So the WA of filestore is ~2.4 vs ~3.5 for newstore (see the sketch below for the arithmetic). Considering the inherent 2X WA of filestore, it is doing pretty well here.
>      Now, newstore is not supposed to write a WAL for new writes, so it will be interesting to see what % of the writes are new. Will analyze that.
>
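> A quick sketch of that arithmetic in Python (the raw counters are in units
> of the 4K flash logical page size, per the counter description above):
>
>     # convert raw page counters to GB and derive WA vs the 1000 GB fio write
>     PAGE_KB = 4            # flash logical page size in KB
>     FIO_WRITE_GB = 1000.0  # total data written by fio
>
>     def pages_to_gb(pages):
>         return pages * PAGE_KB / 1024.0 / 1024.0
>
>     for name, host_pages in [("newstore", 923896266), ("filestore", 643611346)]:
>         host_gb = pages_to_gb(host_pages)
>         print("%-9s host writes %.0f GB -> WA ~%.2f"
>               % (name, host_gb, host_gb / FIO_WRITE_GB))
>
>     # prints: newstore  host writes 3524 GB -> WA ~3.52
>     #         filestore host writes 2455 GB -> WA ~2.46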

I think it should result from the kvdb. Maybe we can separate newstore's data dir and kvdb dir, so we can measure the difference with separate disk counters.
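
Something along these lines would do it (a sketch only: this assumes newstore keeps its kv database in a "db" subdirectory under the OSD data dir, and the paths are illustrative):

    # with the OSD stopped, relocate the kv dir to another device and symlink it back
    mv /var/lib/ceph/osd/ceph-0/db /mnt/otherssd/newstore-db
    ln -s /mnt/otherssd/newstore-db /var/lib/ceph/osd/ceph-0/db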

> 4. If you open my xls and graphs above, you can see that host writes and flash writes are initially very similar for newstore, and then the gap jumps high. Not sure why. I will rerun the tests to confirm the phenomenon.
>
> 5. The cumulative flash write vs. cumulative host write graph shows the actual WA (host plus FTL) caused by the writes.
>

I'm interested in the flash write and disk write counters. Is it an internal tool, or an open-source one?

> What's next:
> ---------------
>
> 1. Need to understand why newstore shows ~3.5 WA.
>
> 2. Try different RocksDB tunings and record the impact.
>
>
> Any feedback/suggestion is much appreciated.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, April 13, 2015 4:54 PM
> To: ceph-devel
> Subject: Regarding newstore performance
>
> Sage,
> I was doing some preliminary performance testing of newstore on a single-OSD (SSD), single-replication setup. Here are my findings so far.
>
> Test:
> -----
>
>         64K random writes with QD=64 using fio_rbd.
>
> Results :
> ----------
>
>         1. With all default settings, I am seeing very spiky performance. fio reports between 0 and ~1K random write IOPS, with IO often stalling at 0. I tried a bigger overlay max size value, but the results are similar.
>
>         2. Next I set newstore_overlay_max = 0 and got pretty stable performance of ~800-900 IOPS (though the write duration was short). See the config sketch after this list.
>
>         3. I tried tweaking all the settings one by one, but saw no real benefit anywhere.
>
>         4. One interesting observation: in my setup, if I set newstore_sync_queue_transaction = true, I get ~600-700 iops, which is ~100 less.
>              This is quite contrary to my keyvaluestore experiments, where I got ~3X improvement by doing sync writes!
>
>         5. Filestore performance in a similar setup is ~1.6K IOPS after writing 1 TB of data.
>
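> For reference, the settings from points 2 and 4 above go in ceph.conf; a
> sketch, assuming they are read from the [osd] section:
>
>     [osd]
>     # point 2: disabling overlay writes gave stable performance here
>     newstore_overlay_max = 0
>     # point 4: sync queue_transaction was ~100 iops slower in this setup
>     newstore_sync_queue_transaction = false
>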
> I am trying to figure out from the code what exactly these overlay writes do. Any insight/explanation would be helpful here.
>
> I am planning to do more experiments with newstore, including a WA comparison between filestore and newstore. Will publish the results soon.
>
> Thanks & Regards
> Somnath



--
Best Regards,

Wheat
