* Troubleshooting remapped PG's + OSD flaps
From: nokia ceph @ 2017-05-18 18:37 UTC
  To: Ceph Users, ceph-devel



Hello,


Env: Bluestore, EC 4+1, v11.2.0, RHEL 7.3, 16383 PGs


During our resiliency testing we found that the OSDs kept flapping and the
cluster went into an error state.

What we did:

1. We have a 5-node cluster.
2. Powered off / stopped ceph.target on the last node and waited until
everything appeared to return to normal (the rough command sequence is
sketched after this list).
3. Powered the last node back up, after which recovery appeared stuck on
remapped PGs:
~~~
 osdmap e4829: 340 osds: 101 up, 112 in; 15011 remapped pgs
~~~
4. Initially all 340 OSDs came back up; at the same time the remapped count
reached 16384, at OSD map epoch e818.
5. After 1-2 hours the remapped PG count kept incrementing/decrementing and
the OSDs started failing one by one. We also tested with the patch below,
still with no change.
patch -
https://github.com/ceph/ceph-ci/commit/wip-prune-past-intervals-kraken
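
For reference, the sequence in steps 2-3 was roughly the following (a sketch
only; the hostname and the wait step are illustrative, not the exact commands
we ran):

~~~
# Stop all Ceph daemons on the last node (hostname illustrative):
ssh PL6-CN5 systemctl stop ceph.target

# Watch the cluster until recovery settles and status looks normal again:
watch ceph -s

# Power the node back on and restart its daemons:
ssh PL6-CN5 systemctl start ceph.target
~~~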


# ceph -s
2017-05-18 18:07:45.876586 7fd6bb87e700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
2017-05-18 18:07:45.900045 7fd6bb87e700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
    cluster cb55baa8-d5a5-442e-9aae-3fd83553824e
     health HEALTH_ERR
            27056 pgs are stuck inactive for more than 300 seconds
            744 pgs degraded
            10944 pgs down
            3919 pgs peering
            11416 pgs stale
            744 pgs stuck degraded
            15640 pgs stuck inactive
            11416 pgs stuck stale
            16384 pgs stuck unclean
            744 pgs stuck undersized
            744 pgs undersized
            recovery 1279809/135206985 objects degraded (0.947%)
            too many PGs per OSD (731 > max 300)
            11/112 in osds are down
     monmap e3: 5 mons at {PL6-CN1=
10.50.62.151:6789/0,PL6-CN2=10.50.62.152:6789/0,PL6-CN3=10.50.62.153:6789/0,PL6-CN4=10.50.62.154:6789/0,PL6-CN5=1
0.50.62.155:6789/0}
            election epoch 22, quorum 0,1,2,3,4
PL6-CN1,PL6-CN2,PL6-CN3,PL6-CN4,PL6-CN5
        mgr no daemons active
     osdmap e4827: 340 osds: 101 up, 112 in; 15011 remapped pgs
            flags sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v83202: 16384 pgs, 1 pools, 52815 GB data, 26407 kobjects
            12438 GB used, 331 TB / 343 TB avail
            1279809/135206985 objects degraded (0.947%)
                4512 stale+down+remapped
                3060 down+remapped
                2204 stale+down
                2000 stale+remapped+peering
                1259 stale+peering
                1167 down
                 739 stale+active+undersized+degraded
                 702 stale+remapped
                 557 peering
                 102 remapped+peering



# ceph pg stat
2017-05-18 18:09:18.345865 7fe2f72ec700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
2017-05-18 18:09:18.368566 7fe2f72ec700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
v83204: 16384 pgs: 1 inactive, 1259 stale+peering, 75 remapped, 2000
stale+remapped+peering, 102 remapped+peering, 2204 stale+down, 739
stale+active+undersized+degraded, 1 down+remapped+peering, 702
stale+remapped, 557 peering, 4512 stale+down+remapped, 3060 down+remapped,
5 active+undersized+degraded, 1167 down; 52815 GB data, 12438 GB used, 331
TB / 343 TB avail; 1279809/135206985 objects degraded (0.947%)



Randomly captured PG entries (one record per line):
~~~
3.3ffc 1646 0 1715 0 0 3451912192 1646 1646 stale+active+undersized+degraded 2017-05-18 11:06:32.453158 846'1646 872:1634 [36,NONE,278,219,225] 36 [36,NONE,278,219,225] 36 0'0 2017-05-18 07:14:30.303859 0'0 2017-05-18 07:14:30.303859
3.3ffb 1711 0 0 0 0 3588227072 1711 1711 down 2017-05-18 15:20:52.858840 846'1711 1602:1708 [150,161,NONE,NONE,83] 150 [150,161,NONE,NONE,83] 150 0'0 2017-05-18 07:14:30.303838 0'0 2017-05-18 07:14:30.303838
3.3ffa 1617 0 0 0 0 3391094784 1617 1617 down+remapped 2017-05-18 17:12:54.943317 846'1617 2525:1637 [48,292,77,277,49] 48 [48,NONE,NONE,277,49] 48 0'0 2017-05-18 07:14:30.303807 0'0 2017-05-18 07:14:30.303807
3.3ff9 1682 0 0 0 0 3527409664 1682 1682 down+remapped 2017-05-18 16:16:42.223632 846'1682 2195:1678 [266,79,NONE,309,258] 266 [NONE,NONE,NONE,NONE,258] 258 0'0 2017-05-18 07:14:30.303793 0'0 2017-05-18 07:14:30.303793
~~~
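
For completeness, individual PGs from the sample above can be inspected
directly (a sketch; PG id taken from the listing, output omitted):

~~~
# Detailed peering/recovery state of a single PG:
ceph pg 3.3ffa query

# Current up/acting OSD mapping of the same PG:
ceph pg map 3.3ffa
~~~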

ceph.conf

[mon]
mon_osd_down_out_interval = 3600
mon_osd_reporter_subtree_level=host
mon_osd_down_out_subtree_limit=host
mon_osd_min_down_reporters = 4
mon_allow_pool_delete = true
[osd]
bluestore = true
bluestore_cache_size = 107374182
bluefs_buffered_io = true
osd_op_threads = 24
osd_op_num_shards = 5
osd_op_num_threads_per_shard = 2
osd_enable_op_tracker = false
osd_scrub_begin_hour = 1
osd_scrub_end_hour = 7
osd_deep_scrub_interval = 3.154e+9
osd_max_backfills = 3
osd_recovery_max_active = 3
osd_recovery_op_priority = 1
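
For what it's worth, the value a running OSD has actually applied for any of
these options can be checked or adjusted at runtime (a sketch; the OSD id and
option are only examples):

~~~
# Read the live value from an OSD's admin socket (run on that OSD's host):
ceph daemon osd.0 config get osd_max_backfills

# Inject a new value into all OSDs without restarting them:
ceph tell osd.* injectargs '--osd_max_backfills 1'
~~~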



# ceph osd stat
2017-05-18 18:10:11.864303 7fedc5a98700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
2017-05-18 18:10:11.887182 7fedc5a98700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
     osdmap e4829: 340 osds: 101 up, 112 in; 15011 remapped pgs   <<<<< note the remapped PG count
            flags sortbitwise,require_jewel_osds,require_kraken_osds

Is there any config directive that helps reduce or skip the remapped PG count
during the recovery process? And does Luminous v12.0.3 fix the OSD flap issue?

Awaiting your suggestions.

Thanks
Jayaram



* Re: Troubleshooting remapped PG's + OSD flaps
From: David Turner @ 2017-05-18 19:28 UTC
  To: nokia ceph, Ceph Users, ceph-devel



"340 osds: 101 up, 112 in" This is going to be your culprit.  Your CRUSH
map is in a really weird state.  How many OSDs do you have in this
cluster?  When OSDs go down, secondary OSDs take over for them, but when OSDs
get marked out, the cluster rebalances to distribute the data according to
how many replicas your settings say it should have (remapped PGs).  Your
cluster thinks it has 340 OSDs in total, it believes that 112 of them are
in the cluster, but only 101 of them are currently up and running.
That means that it is trying to put all of your data onto those 101 OSDs.
Your setting of 16k PGs is fine for 340 OSDs, but with only 101 OSDs up
you're getting the error of too many PGs per OSD.
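
As a rough sanity check (this assumes the warning counts EC shards, 5 per PG
for your 4+1 profile, against the 112 "in" OSDs; that is my reading, not
something you stated), the numbers line up with the health warning:

~~~
# 16384 PGs x 5 shards spread over the OSDs currently "in":
echo $(( 16384 * 5 / 112 ))   # -> 731, matching "too many PGs per OSD (731 > max 300)"
echo $(( 16384 * 5 / 340 ))   # -> 240, roughly what it would be with all OSDs in
~~~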

So next steps:
1) How many OSDs do you expect to be in your Ceph cluster?
2) Did you bring your OSDs back up during your rolling restart testing
BEFORE
    a) They were marked down in the cluster?
    b) You moved on to the next node?  Additionally, did you wait for all
backfilling to finish before proceeding to the next node?
3) Do you have enough memory in your nodes, or are your OSDs being killed by
the OOM killer (a quick check is sketched below)?  I see that you have a lot
of peering PGs in your status output.  That is indicative of the OSDs
continually restarting or being marked down for not responding.
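
A minimal way to check each node for that, assuming RHEL 7.3 as stated in your
environment (commands are generic, not taken from your cluster):

~~~
# OOM-killer entries in the kernel ring buffer since boot:
dmesg -T | grep -iE 'out of memory|oom-killer'

# Same check against the persistent syslog on RHEL:
grep -iE 'out of memory|oom-killer' /var/log/messages
~~~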

On Thu, May 18, 2017 at 2:41 PM nokia ceph <nokiacephusers-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> [full original message quoted above - snipped]

