From: "Darius Kasparavičius" <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: M Ranga Swami Reddy <swamireddy-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: ceph-users <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>,
	ceph-devel <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: Ceph cluster stability
Date: Fri, 22 Feb 2019 13:55:20 +0200
Message-ID: <CANrNMwUuODV3Ju+TxosZE0hM9qwAU1Rk0efST6QmzmWXW+hXFA@mail.gmail.com>
In-Reply-To: <CANA9Uk6rHoAtvAUGtW1VqZnXDz6EjaFOUfxmYUocChRe5mZwDw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

If you're using HDDs for the monitor servers, check their load. The
issue might be there.
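
For example (just a sketch; the mon data lives under /var/lib/ceph/mon
by default, so watch whichever device backs that path):

    iostat -x 1    # check %util and await on the mon store disk

A saturated HDD under the mon store can stall the monitors and slow
client IO during recovery.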

On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
<swamireddy@gmail.com> wrote:
>
> The ceph-mon disk is a 500G HDD (no journals/SSDs). Yes, the mon uses
> a folder on a filesystem on a disk.
>
> On Fri, Feb 22, 2019 at 5:13 PM David Turner <drakonstein@gmail.com> wrote:
> >
> > Mon disks don't have journals; the mon data is just a folder on a filesystem on a disk.
> >
> > On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy <swamireddy@gmail.com> wrote:
> >>
> >> The ceph mons look fine during recovery. We're using HDDs with SSD
> >> journals, with the recommended CPU and RAM numbers.
> >>
> >> On Fri, Feb 22, 2019 at 4:40 PM David Turner <drakonstein@gmail.com> wrote:
> >> >
> >> > What about the system stats on your mons during recovery? If they are having a hard time keeping up with requests during a recovery, I could see that impacting client IO. What disks are they running on? CPU? Etc.
> >> >
> >> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy <swamireddy@gmail.com> wrote:
> >> >>
> >> >> We're using the debug setting defaults, like 1/5 and 0/5 for almost
> >> >> everything. Shall I try 0 for all debug settings?
> >> >>
> >> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius <daznis@gmail.com> wrote:
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> >
> >> >> > Check your CPU usage when you are doing those kinds of operations. We
> >> >> > had a similar issue where our CPU monitoring was reporting fine (< 40%
> >> >> > usage), but the load on the nodes was high, mid 60-80. If possible,
> >> >> > try disabling hyper-threading and check the actual CPU usage.
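> >> >> >
> >> >> > For example (a rough sketch, assuming the sysstat package is installed):
> >> >> >
> >> >> >     mpstat -P ALL 5   # per-core usage; averages can hide saturated cores
> >> >> >     uptime            # compare load average against core count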
> >> >> > If you are hitting CPU limits, you can try disabling CRC on messages:
> >> >> > ms_nocrc
> >> >> > ms_crc_data
> >> >> > ms_crc_header
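> >> >> >
> >> >> > For example, in ceph.conf (a sketch; if I remember right, ms_nocrc is
> >> >> > the old pre-Jewel name that the two ms_crc_* options replaced):
> >> >> >
> >> >> >     [global]
> >> >> >     ms_crc_data = false
> >> >> >     ms_crc_header = false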
> >> >> >
> >> >> > Also set all your debug levels to 0.
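> >> >> >
> >> >> > For example (a sketch covering just a few of the debug options; this
> >> >> > applies at runtime, no restart needed):
> >> >> >
> >> >> >     ceph tell osd.* injectargs '--debug_ms 0 --debug_osd 0 --debug_filestore 0'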
> >> >> > If you haven't already, you can also lower your recovery settings a little:
> >> >> > osd recovery max active
> >> >> > osd max backfills
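> >> >> >
> >> >> > For example (a sketch; 1 and 1 are the gentlest values):
> >> >> >
> >> >> >     ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'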
> >> >> >
> >> >> > You can also lower your filestore op threads:
> >> >> > filestore op threads
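> >> >> >
> >> >> > For example, in ceph.conf (a sketch; the default is 2 as far as I
> >> >> > know, so this mostly matters if you have raised it):
> >> >> >
> >> >> >     [osd]
> >> >> >     filestore_op_threads = 2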
> >> >> >
> >> >> >
> >> >> > If you can, also switch from filestore to bluestore. This will also
> >> >> > lower your CPU usage. I'm not sure it's bluestore itself that does it,
> >> >> > but I'm seeing lower CPU usage after moving to bluestore + rocksdb
> >> >> > compared to filestore + leveldb.
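> >> >> >
> >> >> > For example, the usual per-OSD replacement loop on Luminous looks
> >> >> > roughly like this (a sketch; osd.12 and /dev/sdb are placeholders,
> >> >> > and the exact steps depend on your deployment tool):
> >> >> >
> >> >> >     ceph osd out 12                              # drain; wait for HEALTH_OK
> >> >> >     systemctl stop ceph-osd@12
> >> >> >     ceph osd destroy 12 --yes-i-really-mean-it   # keep the id for reuse
> >> >> >     ceph-volume lvm zap /dev/sdb --destroy
> >> >> >     ceph-volume lvm create --bluestore --data /dev/sdb --osd-id 12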
> >> >> >
> >> >> >
> >> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >> >> > <swamireddy@gmail.com> wrote:
> >> >> > >
> >> >> > > That's expected from Ceph by design. But in our case, we are following
> >> >> > > all the recommendations, like a rack failure domain, a separate
> >> >> > > replication network, etc., and still face client IO performance issues
> >> >> > > when one OSD is down.
> >> >> > >
> >> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner <drakonstein@gmail.com> wrote:
> >> >> > > >
> >> >> > > > With a RACK failure domain, you should be able to have an entire rack powered down without noticing any major impact on the clients.  I regularly take down OSDs and nodes for maintenance and upgrades without seeing any problems with client IO.
> >> >> > > >
> >> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <swamireddy@gmail.com> wrote:
> >> >> > > >>
> >> >> > > >> Hello - I have a couple of questions about ceph cluster stability,
> >> >> > > >> even though we follow all the recommendations below:
> >> >> > > >> - Separate replication n/w and data n/w
> >> >> > > >> - RACK is the failure domain
> >> >> > > >> - SSDs for journals (1:4 ratio)
> >> >> > > >>
> >> >> > > >> Q1 - Why does cluster IO drop drastically and impact customer apps
> >> >> > > >> when one OSD goes down?
> >> >> > > >> Q2 - What is the stability ratio? That is, with the above setup, is
> >> >> > > >> the cluster still in a workable condition if one OSD or one node is
> >> >> > > >> down, etc.?
> >> >> > > >>
> >> >> > > >> Thanks
> >> >> > > >> Swami
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Thread overview: 19+ messages

2019-02-12 10:01 Ceph cluster stability M Ranga Swami Reddy
2019-02-19 17:26 ` David Turner
2019-02-20 14:27   ` M Ranga Swami Reddy
2019-02-20 15:47     ` Darius Kasparavičius
2019-02-20 15:55       ` Alexandru Cucu
2019-02-22 10:23         ` M Ranga Swami Reddy
2019-02-22 10:58       ` M Ranga Swami Reddy
2019-02-22 11:59         ` Janne Johansson
2019-02-22 12:14           ` M Ranga Swami Reddy
2019-02-22 11:01       ` M Ranga Swami Reddy
2019-02-22 11:10         ` David Turner
2019-02-22 11:39           ` M Ranga Swami Reddy
2019-02-22 11:43             ` David Turner
2019-02-22 11:50               ` M Ranga Swami Reddy
2019-02-22 11:55                 ` Darius Kasparavičius [this message]
2019-02-22 12:14                   ` M Ranga Swami Reddy
2019-02-23  1:01                     ` Anthony D'Atri
2019-02-25  9:33                       ` M Ranga Swami Reddy
2019-02-25 11:23                         ` Darius Kasparavičius