* stuck recovery for many days, help needed
@ 2017-09-21 14:08 Wyllys Ingersoll
  2017-09-21 17:47 ` Vincent Godin
  2017-09-21 21:27 ` Mustafa Muhammad
  0 siblings, 2 replies; 6+ messages in thread
From: Wyllys Ingersoll @ 2017-09-21 14:08 UTC (permalink / raw)
  To: Ceph Development

I have a damaged cluster that has been recovering for over a week and
is still not getting healthy.  It reaches a point where the "degraded"
object count stops going down, then the "misplaced" object count also
stops going down, and recovery essentially stalls.

Problems noted:

- Memory exhaustion on storage servers. We have 192GB RAM and 64TB of
disks (though only 40TB of those disks are currently marked "up/in"
in the cluster, to avoid crashing issues and some suspected bad
disks).

- OSD crashes.  We have a number of OSDs that repeatedly crash on, or
shortly after, starting up and rejoining the cluster (crash logs were
already sent to this list earlier this week).  Possibly due to hard
drive issues, but none of the drives are marked as failing by SMART
utilities.

- Too many cephfs snapshots.  We have a cephfs filesystem with over
4800 snapshots.  cephfs is currently unavailable during the recovery,
but when it *was* available, deleting a single snapshot threw the
system into a bad state - thousands of requests became blocked, cephfs
itself became blocked, and the entire cluster basically went to hell.
I believe a bug has been filed for this, but I think the impact is
more severe and critical than originally suspected.


Fixes attempted:
- Upgraded everything to ceph 10.2.9 (was originally 10.2.7)
- Upgraded kernels on storage servers to 4.13.1 to get around XFS problems.
- Disabled scrub and deep-scrub (flag commands sketched below for
reference).
- Attempting to bring more OSDs online, but it's tricky because we
end up either running into memory exhaustion problems or the OSDs
crash shortly after starting, making them essentially useless.
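
For reference, a rough sketch of the flag commands involved in the
scrub step (just the standard ceph CLI, nothing cluster-specific):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # and to re-enable scrubbing once the cluster is healthy again:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub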


Currently our status looks like this (the MDSs are disabled
intentionally for now; having them online makes no difference to
recovery or cephfs availability):

     health HEALTH_ERR
            25 pgs are stuck inactive for more than 300 seconds
            1398 pgs backfill_wait
            72 pgs backfilling
            38 pgs degraded
            13 pgs down
            1 pgs incomplete
            2 pgs inconsistent
            13 pgs peering
            35 pgs recovering
            37 pgs stuck degraded
            25 pgs stuck inactive
            1519 pgs stuck unclean
            33 pgs stuck undersized
            34 pgs undersized
            81 requests are blocked > 32 sec
            recovery 351883/51815427 objects degraded (0.679%)
            recovery 4920116/51815427 objects misplaced (9.495%)
            recovery 152/17271809 unfound (0.001%)
            15 scrub errors
            mds rank 0 has failed
            mds cluster is degraded
            noscrub,nodeep-scrub flag(s) set
     monmap e1: 3 mons at
{mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
            election epoch 192, quorum 0,1,2 mon01,mon02,mon03
      fsmap e18157: 0/1/1 up, 1 failed
     osdmap e254054: 93 osds: 77 up, 76 in; 1511 remapped pgs
            flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
      pgmap v36166916: 16200 pgs, 13 pools, 25494 GB data, 16867 kobjects
            86259 GB used, 139 TB / 223 TB avail


Any suggestions as to what to look for, or how to try to get this
cluster healthy soon, would be much appreciated.  It's been more than
two weeks of battling various issues and we are no closer to a
healthy, usable cluster.


* Re: stuck recovery for many days, help needed
  2017-09-21 14:08 stuck recovery for many days, help needed Wyllys Ingersoll
@ 2017-09-21 17:47 ` Vincent Godin
  2017-09-21 18:07   ` Wyllys Ingersoll
  2017-09-21 21:27 ` Mustafa Muhammad
  1 sibling, 1 reply; 6+ messages in thread
From: Vincent Godin @ 2017-09-21 17:47 UTC (permalink / raw)
  To: Wyllys Ingersoll; +Cc: Ceph Development

Hello,

You should first investigate the 13 PGs that refuse to peer. They
probably refuse to peer because they're waiting for some OSDs with
more up-to-date data. Try to focus on one PG and restart the OSD that
the PG is waiting for.
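
For example, something along these lines (a rough sketch - the PG id
and OSD id are placeholders, and the exact field names in the query
output vary a little between releases):

  # list the PGs that are down / stuck peering
  ceph health detail | grep -E 'down|peering|incomplete'

  # ask one of them what it is blocked on; the recovery_state section
  # normally names the OSDs it is waiting for (look for
  # "peering_blocked_by" / "down_osds_we_would_probe")
  ceph pg 1.2ab query

  # then restart the OSD it points at
  systemctl restart ceph-osd@17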

I don't really understand your memory problem: my hosts have 64GB of
RAM and (20 x 6TB SATA + 5 x 400GB SSD) each, and I have encountered
no memory problems (I'm on 10.2.7). An OSD normally consumes about
1GB of RAM. How many OSD processes are running on each of your hosts,
and how much RAM is used by each OSD process? That may be your main
problem.
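
If it helps, a quick way to get that picture on one host (RSS is in
KiB here):

  # resident memory of every ceph-osd process, largest first
  ps -o pid,rss,vsz,comm -C ceph-osd --sort=-rss

  # overall memory pressure on the host
  free -h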



* Re: stuck recovery for many days, help needed
  2017-09-21 17:47 ` Vincent Godin
@ 2017-09-21 18:07   ` Wyllys Ingersoll
  2017-09-21 20:20     ` Vincent Godin
  0 siblings, 1 reply; 6+ messages in thread
From: Wyllys Ingersoll @ 2017-09-21 18:07 UTC (permalink / raw)
  To: Vincent Godin; +Cc: Ceph Development

I have investigated the peering issues (down to 3 now); mostly it's
because the OSDs they are waiting on refuse to come up and stay up
long enough to complete the requested operation, due to the OSD-crash
issue from my original message: ceph-osd assertion errors causing
crashes.

During heavy recovery, and after running for long periods of time,
the OSDs consume far more than 1GB of RAM.  Here is an example
(clipped from 'top'); this server has 10 ceph-osd processes, not all
shown here, but you get the idea.  They all consume 10-20+GB of
memory.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 699905 ceph      20   0 25.526g 0.021t 125472 S  70.4 17.3  37:31.26 ceph-osd
 662712 ceph      20   0 10.958g 6.229g 238392 S  39.9  5.0  98:34.80 ceph-osd
 692981 ceph      20   0 14.940g 5.845g  84408 S  39.9  4.6  89:36.22 ceph-osd
 553786 ceph      20   0 29.059g 0.011t 231992 S  35.5  9.1 612:15.30 ceph-osd
 656799 ceph      20   0 27.610g 0.014t 197704 S  25.9 11.5 399:02.59 ceph-osd
 662727 ceph      20   0 18.703g 0.013t 105012 S   4.7 10.9  90:20.22 ceph-osd



* Re: stuck recovery for many days, help needed
  2017-09-21 18:07   ` Wyllys Ingersoll
@ 2017-09-21 20:20     ` Vincent Godin
  0 siblings, 0 replies; 6+ messages in thread
From: Vincent Godin @ 2017-09-21 20:20 UTC (permalink / raw)
  To: Wyllys Ingersoll; +Cc: Ceph Development

10GB of RAM per OSD process is huge!!! (It looks like a very old bug
from Hammer.)
You should provide more information: your ceph.conf, OS version,
hardware configuration, and debug levels.
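
Something like this would give a good starting picture (osd.0 is just
a placeholder - run it on the host of one of the problem OSDs):

  cat /etc/ceph/ceph.conf
  lsb_release -a; uname -r
  # current debug levels and map-cache settings on a running OSD
  ceph daemon osd.0 config show | grep -E 'debug_osd|debug_ms|osd_map_cache'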




* Re: stuck recovery for many days, help needed
  2017-09-21 14:08 stuck recovery for many days, help needed Wyllys Ingersoll
  2017-09-21 17:47 ` Vincent Godin
@ 2017-09-21 21:27 ` Mustafa Muhammad
  2017-09-22  2:04   ` Xiaoxi Chen
  1 sibling, 1 reply; 6+ messages in thread
From: Mustafa Muhammad @ 2017-09-21 21:27 UTC (permalink / raw)
  To: Wyllys Ingersoll; +Cc: Ceph Development

Hello,

We had a similar issue 6 weeks ago; you can find some details in this thread:
https://marc.info/?t=150297924500005&r=1&w=2

There were multiple problems all at once; mainly, osdmap updates are
very slow and peering takes a huge amount of memory in that version
(fixed in 12.2).
I think you should first set the "pause" and "notieragent" flags.
Also set noup and nodown so your osdmap doesn't change rapidly with
every OSD going down and up, and only unset them for maybe 10 seconds
at a time when you want newly started OSDs to come up.
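
Concretely, that would be something like this (just the standard flag
commands, nothing cluster-specific):

  ceph osd set pause
  ceph osd set notieragent
  ceph osd set noup
  ceph osd set nodown

  # when you want freshly started OSDs to be marked up,
  # open a short window:
  ceph osd unset noup
  sleep 10
  ceph osd set noup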

For us, the memory usage issue was fixed by upgrading to Luminous
(12.2.0 is available); after that we could start the whole cluster
with a fraction of the memory (no more than 15GB per node, with 12
OSDs each).

This should let the peering and recovery proceed, and hopefully
you'll get your cluster healthy soon.

We hit another bug during recovery - hopefully you won't. My
colleague wrote a patch for it and sent it to this mailing list, but
I hope you won't need it.

Feel free to ask for any more info

Regards
Mustafa Muhammad




* Re: stuck recovery for many days, help needed
  2017-09-21 21:27 ` Mustafa Muhammad
@ 2017-09-22  2:04   ` Xiaoxi Chen
  0 siblings, 0 replies; 6+ messages in thread
From: Xiaoxi Chen @ 2017-09-22  2:04 UTC (permalink / raw)
  To: Mustafa Muhammad; +Cc: Wyllys Ingersoll, Ceph Development

Luminous recovery also eats up a lot of memory; I consistently see
5GB+ RSS for my OSDs during recovery.

The mempool stats show that the pglog eats up most of the memory:

>     "osd_pglog": {
>         "items": 7834058,
>         "bytes": 3025235100
>     },


>     "total": {
>         "items": 23999967,
>         "bytes": 3820337626
>     }
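
That breakdown comes from the admin socket; on Luminous, something
like this should reproduce it (run on the host carrying the OSD, with
your own OSD id):

  ceph daemon osd.428 --cluster pre-prod dump_mempools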


Also, the cause of the huge gap in memory consumption between the
mempool stats and the heap stats is unknown:

[19:02:15 pts/0]root@slx03c-6rqx:~# ceph daemon osd.428 --cluster pre-prod heap stats
osd.428 tcmalloc heap stats:------------------------------------------------
MALLOC:     6418067864 ( 6120.7 MiB) Bytes in use by application
MALLOC: +     20635648 (   19.7 MiB) Bytes in page heap freelist
MALLOC: +    806292256 (  768.9 MiB) Bytes in central cache freelist
MALLOC: +     26934096 (   25.7 MiB) Bytes in transfer cache freelist
MALLOC: +     86640632 (   82.6 MiB) Bytes in thread cache freelists
MALLOC: +     33353880 (   31.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   7391924376 ( 7049.5 MiB) Actual memory used (physical + swap)
MALLOC: +    907812864 (  865.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   8299737240 ( 7915.2 MiB) Virtual address space used
MALLOC:
MALLOC:         399235              Spans in use
MALLOC:             34              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
{
    "error": "(0) Success",
    "success": true
}
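
As the tcmalloc note above says, the freelist can be handed back to
the OS; if needed, something along these lines should trigger it
(again, substitute your own OSD id and cluster name):

  ceph daemon osd.428 --cluster pre-prod heap release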


