From: Wyllys Ingersoll <wyllys.ingersoll@keepertech.com>
To: Ceph Development <ceph-devel@vger.kernel.org>
Subject: stuck recovery for many days, help needed
Date: Thu, 21 Sep 2017 10:08:40 -0400
Message-ID: <CAGbvivL+rMz439OZV2R1UGAaN2A5uOMiMe1ykd88nOJ=MqnFcQ@mail.gmail.com>
I have a damaged cluster that has been recovering for over a week and
is still not getting healthy. It gets to a certain point, then the
"degraded" object count stops going down, eventually the
"misplaced" object count also stops going down, and recovery basically
stops.
Problems noted:
- Memory exhaustion on storage servers. We have 192GB RAM and 64TB of
disks (though only 40TB of disks are marked "up/in" the cluster
currently to avoid crashing issues and some suspected bad disks).
- OSD crashes. We have a number of OSDs that repeatedly crash on or
shortly after starting up and rejoining the cluster (crash
logs were already sent to this list earlier this week). Possibly due to
hard drive issues, but none of the drives are marked as failing by SMART
utilities.
- Too many cephfs snapshots. We have a cephfs with over 4800
snapshots. cephfs is currently unavailable during the recovery, but
when it *was* available, deleting a single snapshot threw the system
into a bad state - thousands of requests would become blocked, cephfs
would become blocked and the entire cluster basically went to hell. I
believe a bug has been filed for this, but I think the impact is more
severe and critical than originally suspected.
Fixes attempted:
- Upgraded everything to ceph 10.2.9 (was originally 10.2.7)
- Upgraded kernels on storage servers to 4.13.1 to get around XFS problems.
- disabled scrub and deep scrub
- attempting to bring more OSDs online, but it's tricky because we
either run into memory exhaustion problems or the OSDs crash
shortly after starting, making them essentially useless.
Currently our status looks like this (MDSs are disabled intentionally
for now, having them online makes no difference for recovery or cephfs
availability):
health HEALTH_ERR
25 pgs are stuck inactive for more than 300 seconds
1398 pgs backfill_wait
72 pgs backfilling
38 pgs degraded
13 pgs down
1 pgs incomplete
2 pgs inconsistent
13 pgs peering
35 pgs recovering
37 pgs stuck degraded
25 pgs stuck inactive
1519 pgs stuck unclean
33 pgs stuck undersized
34 pgs undersized
81 requests are blocked > 32 sec
recovery 351883/51815427 objects degraded (0.679%)
recovery 4920116/51815427 objects misplaced (9.495%)
recovery 152/17271809 unfound (0.001%)
15 scrub errors
mds rank 0 has failed
mds cluster is degraded
noscrub,nodeep-scrub flag(s) set
monmap e1: 3 mons at
{mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
election epoch 192, quorum 0,1,2 mon01,mon02,mon03
fsmap e18157: 0/1/1 up, 1 failed
osdmap e254054: 93 osds: 77 up, 76 in; 1511 remapped pgs
flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
pgmap v36166916: 16200 pgs, 13 pools, 25494 GB data, 16867 kobjects
86259 GB used, 139 TB / 223 TB avail
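
For anyone who wants to track these numbers between runs rather than eyeball them, here is a minimal Python sketch that pulls the PG state counts and recovery ratios out of status text like the above. The names `parse_status` and `percent`, and the regular expressions, are illustrative assumptions and not part of any Ceph tooling; the sample lines are copied from the output above, and the format may differ in other Ceph releases.

```python
import re

# A few lines copied verbatim from the status output above.
STATUS = """\
1398 pgs backfill_wait
72 pgs backfilling
13 pgs down
1 pgs incomplete
recovery 351883/51815427 objects degraded (0.679%)
recovery 4920116/51815427 objects misplaced (9.495%)
recovery 152/17271809 unfound (0.001%)
"""

# "<count> pgs <state>" lines; hypothetical pattern for this output format.
PG_RE = re.compile(r"^\s*(\d+) pgs (?:are )?(.+)$")
# "recovery <count>/<total> [objects] <kind> (...)" lines.
RECOVERY_RE = re.compile(r"^\s*recovery (\d+)/(\d+) (?:objects )?(\w+)")


def parse_status(text):
    """Return (pg_states, recovery) dictionaries parsed from status text."""
    pg_states = {}
    recovery = {}
    for line in text.splitlines():
        m = RECOVERY_RE.match(line)
        if m:
            recovery[m.group(3)] = (int(m.group(1)), int(m.group(2)))
            continue
        m = PG_RE.match(line)
        if m:
            pg_states[m.group(2)] = int(m.group(1))
    return pg_states, recovery


def percent(count, total):
    """Ratio as a percentage, rounded to three decimals like the status text."""
    return round(100.0 * count / total, 3)


if __name__ == "__main__":
    pg_states, recovery = parse_status(STATUS)
    for kind, (count, total) in recovery.items():
        print(kind, count, total, percent(count, total))
```

Saving the parsed counts from successive runs makes it easier to see exactly when the degraded and misplaced counts stop moving.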
Any suggestions as to what to look for or how to get this
cluster healthy soon would be much appreciated. It's literally been
more than 2 weeks of battling with various issues and we are no closer
to a healthy, usable cluster.