From: Warren Wang - ISD <Warren.Wang-dFwxUrggiyBBDgjK7y7TUQ@public.gmane.org>
To: Wade Holler <wade.holler-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Somnath Roy <Somnath.Roy-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Blair Bethwaite
	<blair.bethwaite-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org"
	<ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>,
	Ceph Development
	<ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: Dramatic performance drop at certain number of objects in pool
Date: Fri, 24 Jun 2016 16:24:44 +0000
Message-ID: <D392D6EB.146C6%warren.wang@walmart.com>
In-Reply-To: <CA+e22SdmGJVzJX9+63T41UGsfFcxs9R=xZqniQyTgu-yG=h0cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Oops, that reminds me: do you have min_free_kbytes set to something
reasonable, like at least 2-4 GB?
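
For anyone collecting these, the VM sysctls mentioned in this thread end
up looking something like the sketch below. The values are only the ones
quoted in the thread (not a tested recommendation), and the sysctl.d
filename is made up:

  # /etc/sysctl.d/99-ceph-osd-vm.conf  (hypothetical)
  vm.min_free_kbytes = 4194304   # ~4 GB reserve, per the question above
  vm.swappiness = 1
  vm.vfs_cache_pressure = 10     # suggestions in this thread range from 1 to 10

  # load it with: sysctl --system   (or: sysctl -p /etc/sysctl.d/99-ceph-osd-vm.conf)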

Warren Wang



On 6/24/16, 10:23 AM, "Wade Holler" <wade.holler@gmail.com> wrote:

>On vm.vfs_cache_pressure = 1: we had this initially and I still think
>it is the best choice for most configs. However, with our large memory
>footprint, vfs_cache_pressure=1 increased the likelihood of hitting an
>issue where our write response time would double; a drop of caches
>would then return response time to normal. I don't claim to totally
>understand this and only have speculation at the moment. Again, thanks
>for this suggestion; I do think it is best for boxes that don't have
>very large memory.
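>
>If it helps anyone chase the same thing, here is a rough sketch of how
>one could watch the dentry cache alongside the latency (the
>/proc/sys/fs/dentry-state field layout is from memory, so treat it as
>an assumption):
>
>import time
>
>def dentry_state():
>    # first two fields are nr_dentry (total) and nr_unused
>    with open('/proc/sys/fs/dentry-state') as f:
>        fields = f.read().split()
>    return int(fields[0]), int(fields[1])
>
>while True:
>    total, unused = dentry_state()
>    print('%s dentries total=%d unused=%d'
>          % (time.strftime('%H:%M:%S'), total, unused))
>    time.sleep(60)
>
>If that count swings right around the time the write response time
>doubles (or right after a drop of caches), it at least tells us whether
>the dentry cache is involved.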
>
>@Christian - reformatting to btrfs or ext4 is an option in my test
>cluster. I thought about that but needed to sort out xfs first (that's
>what production will run right now); you all have helped me do that,
>and thank you again. I will circle back and test btrfs under the same
>conditions. I suspect it will behave similarly, but it's only a day
>and a half's work or so to test.
>
>Best Regards,
>Wade
>
>
>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <Somnath.Roy@sandisk.com>
>wrote:
>> Oops, typo, 128 GB :-)...
>>
>> -----Original Message-----
>> From: Christian Balzer [mailto:chibi@gol.com]
>> Sent: Thursday, June 23, 2016 5:08 PM
>> To: ceph-users@lists.ceph.com
>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>Development
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>of objects in pool
>>
>>
>> Hello,
>>
>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>>
>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>> *pin* inodes/dentries in memory. We have been using that for a long
>>> time now (with 128 TB node memory) and it seems to help, especially
>>> for random write workloads, by saving the xattr reads in between.
>>>
>> 128TB node memory, really?
>> Can I have some of those, too? ^o^
>> And here I was thinking that Wade's 660GB machines were on the
>>excessive side.
>>
>> There's something to be said (and optimized) when your storage nodes
>> have as much RAM as your compute nodes, or more...
>>
>> As for Warren: well spotted.
>> I personally use vm.vfs_cache_pressure = 1; this avoids the potential
>> fireworks if your memory is really needed elsewhere, while still
>> keeping things in memory normally.
>>
>> Christian
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf
>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>> To: Wade Holler; Blair Bethwaite
>>> Cc: Ceph Development; ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>> of objects in pool
>>>
>>> vm.vfs_cache_pressure = 100
>>>
>>> Go the other direction on that. You'll want to keep it low to help
>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>>
>>>
>>> Warren Wang
>>>
>>>
>>>
>>>
>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.holler@gmail.com> wrote:
>>>
>>> >Blairo,
>>> >
>>> >We'll speak in pre-replication numbers; replication for this pool is 3.
>>> >
>>> >23.3 Million Objects / OSD
>>> >pg_num 2048
>>> >16 OSDs / Server
>>> >3 Servers
>>> >660 GB RAM Total, 179 GB Used (free -t) / Server
>>> >vm.swappiness = 1
>>> >vm.vfs_cache_pressure = 100
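>>> >
>>> >(If I have the math right, that is 2048 PGs x 3 replicas / 48 OSDs
>>> >= 128 PGs per OSD for this pool.)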
>>> >
>>> >Workload is native librados with python.  ALL 4k objects.
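>>> >
>>> >For context, a minimal sketch of what the write path boils down to
>>> >(pool name and object naming are made up for the example):
>>> >
>>> >import rados
>>> >
>>> ># connect via the local ceph.conf and write 4 KiB objects
>>> >cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>>> >cluster.connect()
>>> >ioctx = cluster.open_ioctx('testpool')
>>> >payload = b'\0' * 4096
>>> >for i in range(1000000):
>>> >    ioctx.write_full('obj-%010d' % i, payload)
>>> >ioctx.close()
>>> >cluster.shutdown()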
>>> >
>>> >Best Regards,
>>> >Wade
>>> >
>>> >
>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>> ><blair.bethwaite@gmail.com> wrote:
>>> >> Wade, good to know.
>>> >>
>>> >> For the record, what does this work out to roughly per OSD? And how
>>> >> much RAM and how many PGs per OSD do you have?
>>> >>
>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>> >> RBD) it's better to increase default object size somewhat before
>>> >> pushing the split/merge up a lot...
>>> >>
>>> >> Cheers,
>>> >>
>>> >> On 23 June 2016 at 11:26, Wade Holler <wade.holler@gmail.com> wrote:
>>> >>> Based on everyone's suggestions: the first modification, to 50 / 16,
>>> >>> enabled our config to get to ~645 million objects before the behavior
>>> >>> in question was observed (~330 million was the previous ceiling).
>>> >>> A subsequent modification to 50 / 24 has enabled us to get to 1.1
>>> >>> billion+.
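>>> >>>
>>> >>> In ceph.conf terms I believe that corresponds to something like the
>>> >>> snippet below (I'm assuming the first number is the merge threshold
>>> >>> and the second the split multiple; the split point works out the
>>> >>> same either way round):
>>> >>>
>>> >>> [osd]
>>> >>> filestore merge threshold = 50
>>> >>> filestore split multiple = 24
>>> >>> # a PG subdirectory splits once it holds more than roughly
>>> >>> # split_multiple * abs(merge_threshold) * 16 objects
>>> >>> # (about 19,200 here, vs 12,800 at 50 / 16), and it only merges
>>> >>> # back once a later write finds it under the merge threshold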
>>> >>>
>>> >>> Thank you all very much for your support and assistance.
>>> >>>
>>> >>> Best Regards,
>>> >>> Wade
>>> >>>
>>> >>>
>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <chibi@gol.com>
>>> >>>wrote:
>>> >>>>
>>> >>>> Hello,
>>> >>>>
>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>> >>>>
>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>>> >>>>>here.
>>> >>>>> One of those things you just have to find out as an operator
>>> >>>>>since it's  not well documented :(
>>> >>>>>
>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>> >>>>>
>>> >>>>> We have over 200 million objects in this cluster, and it's still
>>> >>>>>doing  over 15000 write IOPS all day long with 302 spinning
>>> >>>>>drives
>>> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
>>> >>>>>vfs_cache_pressure  should also help.
>>> >>>>>
>>> >>>> Indeed.
>>> >>>>
>>> >>>> Since it was asked in that bug report and was also my first
>>> >>>> suspicion, it would probably be a good time to clarify that it
>>> >>>> isn't the splits themselves that cause the performance degradation,
>>> >>>> but the resulting inflation of dir entries and exhaustion of SLAB,
>>> >>>> and thus having to go to disk for things that would normally be
>>> >>>> in memory.
>>> >>>>
>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>>> >>>> clear: a purely split-caused degradation should have relented
>>> >>>> much quicker.
>>> >>>>
>>> >>>>
>>> >>>>> Keep in mind that if you change the values, it won't take effect
>>> >>>>> immediately. It only merges them back if the directory is under
>>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>>> >>>>> forget).
>>> >>>>>
>>> >>>> If it's a read a plain scrub might do the trick.
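>>> >>>> (If someone wanted to force that rather than wait, something like
>>> >>>> "ceph pg scrub <pgid>" or "ceph osd deep-scrub <osd-id>" should at
>>> >>>> least make the OSD walk its PG directories; whether a read pass
>>> >>>> alone triggers the merge is exactly the question above.)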
>>> >>>>
>>> >>>> Christian
>>> >>>>> Warren
>>> >>>>>
>>> >>>>>
>>> >>>>> From: ceph-users <ceph-users-bounces@lists.ceph.com>
>>> >>>>>   on behalf of Wade Holler <wade.holler@gmail.com>
>>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
>>> >>>>> To: Blair Bethwaite <blair.bethwaite@gmail.com>,
>>> >>>>>   Wido den Hollander <wido@42on.com>
>>> >>>>> Cc: Ceph Development <ceph-devel@vger.kernel.org>,
>>> >>>>>   "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
>>> >>>>>   number of objects in pool
>>> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>>> >>>>> are testing with different pg_num and filestore_split_multiple
>>> >>>>> settings. Early indications are ... well, not great. Regardless,
>>> >>>>> it is nice to understand the symptoms better so we can try to
>>> >>>>> design around it.
>>> >>>>>
>>> >>>>> Best Regards,
>>> >>>>> Wade
>>> >>>>>
>>> >>>>>
>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>> >>>>> <blair.bethwaite@gmail.com> wrote:
>>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite
>>> >>>>> <blair.bethwaite@gmail.com> wrote:
>>> >>>>> > slow request issues). If you watch your xfs stats you'll
>>> >>>>> > likely get further confirmation. In my experience
>>> >>>>> > xs_dir_lookups balloons
>>> >>>>>(which
>>> >>>>> > means directory lookups are missing cache and going to disk).
>>> >>>>>
>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>>> >>>>> very problem we had only ephemerally set the new filestore
>>> >>>>> merge/split values - oops. Here's what started happening when we
>>> >>>>> upgraded and restarted a bunch of OSDs:
>>> >>>>>
>>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>> >>>>>
>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>> >>>>> 12:30, then still took a while to settle.
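>>> >>>>>
>>> >>>>> For anyone without that graphing in place, the counter is easy to
>>> >>>>> sample directly. A rough sketch (the /proc/fs/xfs/stat "dir" line
>>> >>>>> layout is assumed from the XFS docs):
>>> >>>>>
>>> >>>>>import time
>>> >>>>>
>>> >>>>>def xs_dir_lookup():
>>> >>>>>    # global XFS stats; the first value on the "dir" line is
>>> >>>>>    # xs_dir_lookup
>>> >>>>>    with open('/proc/fs/xfs/stat') as f:
>>> >>>>>        for line in f:
>>> >>>>>            if line.startswith('dir '):
>>> >>>>>                return int(line.split()[1])
>>> >>>>>    return 0
>>> >>>>>
>>> >>>>>prev = xs_dir_lookup()
>>> >>>>>while True:
>>> >>>>>    time.sleep(10)
>>> >>>>>    cur = xs_dir_lookup()
>>> >>>>>    print('xs_dir_lookup/s ~ %.1f' % ((cur - prev) / 10.0))
>>> >>>>>    prev = cur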
>>> >>>>>
>>> >>>>> --
>>> >>>>> Cheers,
>>> >>>>> ~Blairo
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Christian Balzer        Network/Systems Engineer
>>> >>>> chibi@gol.com           Global OnLine Japan/Rakuten Communications
>>> >>>> http://www.gol.com/
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Cheers,
>>> >> ~Blairo
>>>
>>>
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@gol.com   Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/


