From: Somnath Roy
Subject: RE: Bluestore tuning
Date: Thu, 28 Jul 2016 19:05:17 +0000
To: Allen Samuels, Evgeniy Firsov, "Kamble, Nitin A"
Cc: "Mark Nelson (mnelson@redhat.com)", ceph-devel@vger.kernel.org

My bad, Allen/EF correctly pointed out that we do have an onode cache as well (along with the buffer cache) within the TwoQ cache, and that one is based on the number of onodes.
But, EF, the cache is not per collection, it is per shard, with collections hashed into the shards. It is still bad, and since I am running with a high number of shards, that may be what I am seeing. Will verify.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 28, 2016 8:17 AM
To: Allen Samuels; Evgeniy Firsov; Kamble, Nitin A
Cc: Mark Nelson (mnelson@redhat.com); ceph-devel@vger.kernel.org
Subject: RE: Bluestore tuning

I don't think the cache is based on count; it is based on size and the number of shards you are running with.
If you look at my ceph.conf, I have limited the cache size to 100MB (bluestore_buffer_cache_size = 104857600) and I have 25 shards. So each OSD will use at most 2.5GB of memory as cache, and with 8 OSDs running the total OSD RSS for cache could be ~20GB. Anything above that will be trimmed. This is what I am seeing in the read path, and after limiting this in the write path as well I am seeing less growth. But it seems a small leak is still happening in the write path (unless we are missing a cache->trim() somewhere in the path).

Thanks & Regards
Somnath

-----Original Message-----
From: Allen Samuels
Sent: Thursday, July 28, 2016 7:11 AM
To: Evgeniy Firsov; Somnath Roy; Kamble, Nitin A
Cc: Mark Nelson (mnelson@redhat.com); ceph-devel@vger.kernel.org
Subject: RE: Bluestore tuning

If the oNode cache is based on oNode count, then we might want to rethink the accounting, since oNode size is likely to be highly variable, meaning that the memory consumption of the cache will be highly variable too. Users would then have to size the cache for the worst-case oNode size, so most of the time the actual oNode cache would be much smaller than desired for the resources (DRAM) involved.

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Evgeniy Firsov
> Sent: Thursday, July 28, 2016 1:35 AM
> To: Somnath Roy; Kamble, Nitin A
> Cc: Mark Nelson (mnelson@redhat.com); ceph-devel@vger.kernel.org
> Subject: Re: Bluestore tuning
>
> Somnath,
>
> In my opinion, the "memory leak" may just be onode cache size growth.
> By default it is 16K entries per PG (8 PGs by default), and onode size is ~38K for a 4M RBD object, so it is 5.1G by default.
> Likely you use many more PGs.
> Disabling checksums and reducing the RBD object size will reduce the cache size.
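>
> A rough back-of-envelope check of those two estimates, with the numbers
> taken straight from this thread (a Python sketch for illustration only;
> real onode sizes vary, so this is not exact accounting):
>
>     # Onode cache estimate per OSD, using the defaults mentioned above.
>     onode_entries = 16 * 1024 * 8           # 16K entries per PG, 8 PGs by default
>     onode_size = 38 * 1024                  # ~38K per onode for a 4M RBD object
>     print(onode_entries * onode_size / 1e9)       # ~5.1 GB per OSD
>
>     # Buffer cache bound from the ceph.conf further down (Somnath's math above).
>     per_osd_buffer_cache = 104857600 * 25   # bluestore_buffer_cache_size x 25 shards
>     print(per_osd_buffer_cache / 2**30)     # ~2.4 GiB per OSD
>     print(per_osd_buffer_cache * 8 / 1e9)   # ~21 GB across 8 OSDs (the ~20GB figure above)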
>
> On 7/27/16, 10:26 PM, "ceph-devel-owner@vger.kernel.org on behalf of Somnath Roy" <Somnath.Roy@sandisk.com> wrote:
>
> >My ceph version is 11.0.0-811-g278ea12; it is a ~3-4 day old master.
> >Regarding stability, it is getting there, no more easy crashes seen
> >:-) I am seeing a memory leak in the write path though, and after 1
> >hour of continuous 4K RW memory started swapping for me. I am trying
> >to nail it down.
> >
> >Thanks & Regards
> >Somnath
> >
> >
> >-----Original Message-----
> >From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
> >Sent: Wednesday, July 27, 2016 10:19 PM
> >To: Somnath Roy
> >Cc: Mark Nelson (mnelson@redhat.com); ceph-devel@vger.kernel.org
> >Subject: Re: Bluestore tuning
> >
> >Hi Somnath,
> >  Thanks for sharing this information, and great to see bluestore
> >with improved stability and performance. Which version of ceph were
> >you running in this environment, the latest master?
> >Also it would be good to know the level of stability of the environment.
> >Did the ceph cluster break after collection of this data?
> >
> >Thanks,
> >Nitin
> >
> >> On Jul 27, 2016, at 8:40 AM, Somnath Roy wrote:
> >>
> >> As discussed in the performance meeting, I am sharing the latest
> >> Bluestore tuning findings that are giving me better and, most
> >> importantly, stable results in my environment.
> >>
> >> Setup:
> >> -------
> >>
> >> 2 OSD nodes with 8 OSDs (on 8 TB SSDs) each.
> >> Single 4TB image (with exclusive lock disabled) from a single client
> >> running 10 fio jobs, each job with QD 128.
> >> Replication = 2.
> >> Fio rbd ran for 30 min.
> >>
> >> Ceph.conf
> >> ------------
> >> osd_op_num_threads_per_shard = 2
> >> osd_op_num_shards = 25
> >>
> >> bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
> >>
> >> rocksdb_cache_size = 4294967296
> >> bluestore_csum = false
> >> bluestore_csum_type = none
> >> bluestore_bluefs_buffered_io = false
> >> bluestore_max_ops = 30000
> >> bluestore_max_bytes = 629145600
> >> bluestore_buffer_cache_size = 104857600
> >> bluestore_block_wal_size = 0
> >>
> >> [osd.0]
> >> host = emsnode12
> >> devs = /dev/sdb1
> >> #osd_journal = /dev/sdb1
> >> bluestore_block_db_path = /dev/sdb2
> >> #bluestore_block_wal_path = /dev/nvme0n1p1
> >> bluestore_block_wal_path = /dev/sdb3
> >> bluestore_block_path = /dev/sdb4
> >>
> >> I have separate partitions for block/db/wal.
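> >>
> >> (A small hypothetical helper, in Python, just to expand that long
> >> rocksdb option string into readable lines; the string is exactly the
> >> one from the conf above:)
> >>
> >>     opts = ("max_write_buffer_number=16,min_write_buffer_number_to_merge=2,"
> >>             "recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,"
> >>             "max_background_compactions=32,max_background_flushes=8,"
> >>             "max_bytes_for_level_base=5368709120,write_buffer_size=83886080,"
> >>             "level0_file_num_compaction_trigger=4,"
> >>             "level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800")
> >>     for name, value in (kv.split("=", 1) for kv in opts.split(",")):
> >>         print(f"{name:40s} {value}")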
> >>
> >> Result:
> >> --------
> >> No preconditioning of rbd images; started writing 4K RW from the beginning.
> >>
> >> Jobs: 10 (f=10): [w(10)] [100.0% done] [0KB/150.3MB/0KB /s] [0/38.5K/0 iops] [eta 00m:00s]
> >> rbd_iodepth32: (groupid=0, jobs=10): err= 0: pid=883598: Fri Jul 22 19:43:41 2016
> >>   write: io=282082MB, bw=160473KB/s, iops=40118, runt=1800007msec
> >>     slat (usec): min=25, max=2578, avg=51.73, stdev=15.99
> >>     clat (usec): min=585, max=2096.7K, avg=3913.59, stdev=9871.73
> >>      lat (usec): min=806, max=2096.7K, avg=3965.32, stdev=9871.71
> >>     clat percentiles (usec):
> >>      |  1.00th=[ 1208],  5.00th=[ 1480], 10.00th=[ 1672], 20.00th=[ 1992],
> >>      | 30.00th=[ 2288], 40.00th=[ 2608], 50.00th=[ 2992], 60.00th=[ 3440],
> >>      | 70.00th=[ 4048], 80.00th=[ 4960], 90.00th=[ 6624], 95.00th=[ 8384],
> >>      | 99.00th=[15680], 99.50th=[25984], 99.90th=[55552], 99.95th=[64256],
> >>      | 99.99th=[87552]
> >>     bw (KB  /s): min=    7, max=33864, per=10.08%, avg=16183.08, stdev=1401.82
> >>     lat (usec) : 750=0.01%, 1000=0.10%
> >>     lat (msec) : 2=20.39%, 4=48.81%, 10=27.70%, 20=2.30%, 50=0.55%
> >>     lat (msec) : 100=0.14%, 250=0.01%, 2000=0.01%, >=2000=0.01%
> >>   cpu          : usr=20.18%, sys=3.67%, ctx=96626924, majf=0, minf=166692
> >>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=27.0%, 16=73.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued    : total=r=0/w=72213031/d=0, short=r=0/w=0/d=0
> >>      latency   : target=0, window=0, percentile=100.00%, depth=16
> >>
> >> *Significantly better latency/throughput than a similar filestore setup.*
> >>
> >> This is based on my experiment on all SSDs; the HDD case will be different.
> >> Tuning also depends on your CPU complex/memory; I am running with
> >> 48-core (HT enabled) dual-socket Xeons on each node, with 64GB of memory.
> >>
> >> Thanks & Regards
> >> Somnath
> >>
> >> -----Original Message-----
> >> From: Somnath Roy
> >> Sent: Monday, July 11, 2016 8:04 AM
> >> To: Mark Nelson (mnelson@redhat.com)
> >> Cc: 'ceph-devel@vger.kernel.org'
> >> Subject: Rocksdb tuning on Bluestore
> >>
> >> Mark,
> >> With the following tuning it seems rocksdb is performing better in
> >> my environment; basically, doing aggressive compaction to reduce the
> >> write stalls.
> >>
> >> bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
> >>
> >> BTW, I am not able to run BlueStore for more than 2 hours at a stretch
> >> due to memory issues. It is filling up my system memory (2 nodes with
> >> 64GB of memory each, running 8 OSDs on each) fast.
> >> I did the following operations and it started swapping:
> >>
> >> 1. Created a 4TB image and did 1M sequential preconditioning (took ~1 hour).
> >>
> >> 2. Followed by two 30-min 4K RW runs with QD 128 (numjobs = 10); in the
> >> 2nd run memory started swapping.
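> >>
> >> For context on the numbers involved, a rough upper bound on the rocksdb
> >> memtable footprint implied by these options (an assumption-heavy Python
> >> sketch: it assumes a single column family and that all write buffers can
> >> be resident at once, and it is not a diagnosis of the swapping described
> >> above):
> >>
> >>     write_buffer_size = 83886080            # 80 MiB, from the options string
> >>     max_write_buffer_number = 16            # also from the options string
> >>     per_osd = write_buffer_size * max_write_buffer_number
> >>     print(per_osd / 2**30)                  # ~1.25 GiB per OSD, worst case
> >>     print(per_osd * 8 / 2**30)              # ~10 GiB per node with 8 OSDs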
> >> Let me know how these rocksdb options work for you.
> >>
> >> Thanks & Regards
> >> Somnath