* RE: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                               ` <f9adb4b2dcada947f418b6f95ad7a8d1@mail.meizo.com>
@ 2015-04-28 20:19                                 ` Sage Weil
       [not found]                                   ` <alpine.DEB.2.00.1504281256440.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2015-04-28 20:19 UTC (permalink / raw)
  To: Tuomas Juntunen; +Cc: ceph-users, ceph-devel

[adding ceph-devel]

Okay, I see the problem.  This seems to be unrelated to the giant -> 
hammer move... it's a result of the tiering changes you made:

> > > > > > The following:
> > > > > > 
> > > > > > ceph osd tier add img images --force-nonempty
> > > > > > ceph osd tier cache-mode images forward 
> > > > > > ceph osd tier set-overlay img images

Specifically, --force-nonempty bypassed important safety checks.

1. images had snapshots (and removed_snaps)

2. images was added as a tier *of* img, and img's removed_snaps was copied 
to images, clobbering the removed_snaps value (see 
OSDMap::Incremental::propagate_snaps_to_tiers)

3. tiering relation was undone, but removed_snaps was still gone

4. on OSD startup, when we load the PG, removed_snaps is initialized with 
the older map.  Later, in PGPool::update(), we assume that removed_snaps 
always grows (never shrinks) and we trigger an assert (see the sketch below).
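
To make that concrete, here is a rough standalone sketch of the failing 
assumption (a toy, *not* our actual interval_set; ToyIntervalSet and its 
members are invented for illustration): the PG's cached removed_snaps is 
subtracted from the newer map's set, and that subtraction only works if 
the cached set is still a subset.

// Toy model (NOT Ceph's interval_set) of the "removed_snaps only grows"
// assumption.  erase() insists the erased range is already present, which
// is the analogue of the assert in include/interval_set.h.
#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>

struct ToyIntervalSet {
  std::map<uint64_t, uint64_t> m;   // start -> length, sorted, non-overlapping
  int64_t size = 0;

  void insert(uint64_t start, uint64_t len) { m[start] = len; size += len; }

  // Erase [start, start+len); the range must already be covered.
  void erase(uint64_t start, uint64_t len) {
    std::map<uint64_t, uint64_t>::iterator it = m.find(start);
    assert(it != m.end() && it->second >= len);  // range not present -> boom
    size -= len;
    assert(size >= 0);                           // cf. assert(_size >= 0)
    if (it->second > len)
      m[start + len] = it->second - len;         // keep the uncovered tail
    m.erase(start);
  }

  // Remove every interval of 'other' from this set; only valid when 'other'
  // is a subset of this set -- exactly the assumption that breaks here.
  void subtract(const ToyIntervalSet& other) {
    for (std::map<uint64_t, uint64_t>::const_iterator p = other.m.begin();
         p != other.m.end(); ++p)
      erase(p->first, p->second);
  }
};

int main() {
  ToyIntervalSet cached;            // removed_snaps the PG remembered (older map)
  cached.insert(1, 23);             // some previously removed snaps

  ToyIntervalSet from_new_map;      // removed_snaps in the newer OSDMap: empty,
                                    // because the tier add clobbered it

  ToyIntervalSet newly_removed = from_new_map;
  newly_removed.subtract(cached);   // cached is no longer a subset -> assert fires
  std::cout << "not reached" << std::endl;
  return 0;
}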

To fix this I think we need to do 2 things:

1. make the OSD forgiving of removed_snaps getting smaller (sketched below, 
after this list).  This is probably a good thing anyway: once we know snaps 
are removed on all OSDs we can prune the interval_set in the OSDMap.  Maybe.

2. Fix the mon to prevent this from happening, *even* when 
--force-nonempty is specified.  (This is the root cause.)
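
For (1), one possible shape of the change, sketched on the same toy model 
as above (tolerant_subtract and Intervals are invented names, not the actual 
fix; that will be done properly as part of the tracker issue below): compute 
the newly removed snaps with a subtraction that tolerates the cached set no 
longer being a subset, instead of asserting.

// "a minus b" without requiring b to be a subset of a.  Plain std::map
// intervals (start -> length), not Ceph code.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

typedef std::map<uint64_t, uint64_t> Intervals;  // start -> length, sorted

Intervals tolerant_subtract(const Intervals& a, const Intervals& b) {
  Intervals out;
  for (Intervals::const_iterator ai = a.begin(); ai != a.end(); ++ai) {
    uint64_t cur = ai->first, end = ai->first + ai->second;
    for (Intervals::const_iterator bi = b.begin(); bi != b.end(); ++bi) {
      uint64_t bs = bi->first, bend = bi->first + bi->second;
      if (bend <= cur || bs >= end)
        continue;                    // no overlap with [cur, end)
      if (bs > cur)
        out[cur] = bs - cur;         // keep the uncovered prefix
      cur = std::max(cur, bend);
      if (cur >= end)
        break;                       // this interval is fully covered
    }
    if (cur < end)
      out[cur] = end - cur;          // keep whatever is left
  }
  return out;
}

int main() {
  Intervals cached;                  // what the PG remembered from the older map
  cached[1] = 23;
  Intervals newer;                   // the pool's removed_snaps after it shrank

  // Nothing is newly removed, and nothing asserts, even though 'cached' is
  // no longer a subset of 'newer'.
  Intervals newly_removed = tolerant_subtract(newer, cached);
  assert(newly_removed.empty());
  return 0;
}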

I've opened http://tracker.ceph.com/issues/11493 to track this.

sage

    

> > > > > > 
> > > > > > Idea was to make images as a tier to img, move data to img 
> > > > > > then change
> > > > > clients to use the new img pool.
> > > > > > 
> > > > > > Br,
> > > > > > Tuomas
> > > > > > 
> > > > > > > Can you explain exactly what you mean by:
> > > > > > >
> > > > > > > "Also I created one pool for tier to be able to move data 
> > > > > > > without
> > > > > outage."
> > > > > > >
> > > > > > > -Sam
> > > > > > > ----- Original Message -----
> > > > > > > From: "tuomas juntunen" <tuomas.juntunen@databasement.fi>
> > > > > > > To: "Ian Colle" <icolle@redhat.com>
> > > > > > > Cc: ceph-users@lists.ceph.com
> > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer and 
> > > > > > > after some basic operations most of the OSD's went down
> > > > > > >
> > > > > > > Hi
> > > > > > >
> > > > > > > Any solution for this yet?
> > > > > > >
> > > > > > > Br,
> > > > > > > Tuomas
> > > > > > >
> > > > > > >> It looks like you may have hit
> > > > > > >> http://tracker.ceph.com/issues/7915
> > > > > > >>
> > > > > > >> Ian R. Colle
> > > > > > >> Global Director
> > > > > > >> of Software Engineering
> > > > > > >> Red Hat (Inktank is now part of Red Hat!) 
> > > > > > >> http://www.linkedin.com/in/ircolle
> > > > > > >> http://www.twitter.com/ircolle
> > > > > > >> Cell: +1.303.601.7713
> > > > > > >> Email: icolle@redhat.com
> > > > > > >>
> > > > > > >> ----- Original Message -----
> > > > > > >> From: "tuomas juntunen" <tuomas.juntunen@databasement.fi>
> > > > > > >> To: ceph-users@lists.ceph.com
> > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and 
> > > > > > >> after some basic operations most of the OSD's went down
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> > > > > > >>
> > > > > > >> Then created new pools and deleted some old ones. Also I 
> > > > > > >> created one pool for tier to be able to move data without
> outage.
> > > > > > >>
> > > > > > >> After these operations all but 10 OSD's are down and 
> > > > > > >> creating this kind of messages to logs, I get more than 
> > > > > > >> 100gb of these in a
> > > > night:
> > > > > > >>
> > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23
> pg_epoch:
> > 
> > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > >> ec=1 les/c
> > > > > > >> 16609/16659
> > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > >> pi=15659-16589/42
> > > > > > >> crt=8480'7 lcod
> > > > > > >> 0'0 inactive NOTIFY] enter Started
> > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > >> ec=1 les/c
> > > > > > >> 16609/16659
> > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > >> pi=15659-16589/42
> > > > > > >> crt=8480'7 lcod
> > > > > > >> 0'0 inactive NOTIFY] enter Start
> > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > >> ec=1 les/c
> > > > > > >> 16609/16659
> > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > >> pi=15659-16589/42
> > > > > > >> crt=8480'7 lcod
> > > > > > >> 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > >> ec=1 les/c
> > > > > > >> 16609/16659
> > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > >> pi=15659-16589/42
> > > > > > >> crt=8480'7 lcod
> > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > >> ec=1 les/c
> > > > > > >> 16609/16659
> > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > >> pi=15659-16589/42
> > > > > > >> crt=8480'7 lcod
> > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray
> > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > >> 17879/17879
> > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > >> NOTIFY] exit Reset 0.119467 4 0.000037
> > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > >> 17879/17879
> > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > >> NOTIFY] enter Started
> > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > >> 17879/17879
> > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > >> NOTIFY] enter Start
> > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > >> 17879/17879
> > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > >> NOTIFY]
> > > > > > >> state<Start>: transitioning to Stray
> > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > >> 17879/17879
> > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > >> NOTIFY] exit Start 0.000020 0 0.000000
> > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > >> 17879/17879
> > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > >> NOTIFY] enter Started/Stray
> > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> inactive] exit Reset 7.511623 45 0.000165
> > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> inactive] enter Started
> > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> inactive] enter Start
> > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> inactive]
> > > > > > >> state<Start>: transitioning to Primary
> > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> inactive] exit Start 0.000023 0 0.000000
> > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> inactive] enter Started/Primary
> > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> inactive] enter Started/Primary/Peering
> > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > >> 16127/16344
> > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > >> peering] enter Started/Primary/Peering/GetInfo
> > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1
> > > > > ./include/interval_set.h:
> > > > > > >> In
> > > > > > >> function 'void interval_set<T>::erase(T, T) [with T =
> snapid_t]' 
> > > > > > >> thread
> > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 0)
> > > > > > >>
> > > > > > >>  ceph version 0.94.1
> > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > >> int, char
> > > > > > >> const*)+0x8b)
> > > > > > >> [0xbc271b]
> > > > > > >>  2: 
> > > > > > >> (interval_set<snapid_t>::subtract(interval_set<snapid_t>
> > > > > > >> const&)+0xb0) [0x82cd50]
> > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap
> > > > > > >> const>)+0x52e) [0x80113e]
> > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>, 
> > > > > > >> const>std::vector<int,
> > > > > > >> std::allocator<int> >&, int, std::vector<int, 
> > > > > > >> std::allocator<int>
> > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, 
> > > > > > >> ThreadPool::TPHandle&, PG::RecoveryCtx*, 
> > > > > > >> std::set<boost::intrusive_ptr<PG>,
> > > > > > >> std::less<boost::intrusive_ptr<PG> >, 
> > > > > > >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) 
> > > > > > >> [0x6b0e43]
> > > > > > >>  6: (OSD::process_peering_events(std::list<PG*,
> > > > > > >> std::allocator<PG*>
> > > > > > >> > const&,
> > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*,
> > > > > > >> std::allocator<PG*>
> > > > > > >> > const&,
> > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]
> > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> > > > > > >> [0xbb38ae]
> > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> > > > > > >>  10: (()+0x8182) [0x7fd906946182]
> > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> > > > > > >>
> > > > > > >> Also by monitoring (ceph -w) I get the following messages, 
> > > > > > >> also lots of
> > > > > them.
> > > > > > >>
> > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
> > > > > 10.20.0.13:0/1174409'
> > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush create-or-move",
> > "args":
> > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]: 
> > > > > > >> dispatch
> > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
> > > > > 10.20.0.13:0/1174483'
> > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush create-or-move",
> > "args":
> > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]: 
> > > > > > >> dispatch
> > > > > > >>
> > > > > > >>
> > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are also 
> > > > > > >> mons and mds's to save servers. All run Ubuntu 14.04.2.
> > > > > > >>
> > > > > > >> I have pretty much tried everything I could think of.
> > > > > > >>
> > > > > > >> Restarting daemons doesn't help.
> > > > > > >>
> > > > > > >> Any help would be appreciated. I can also provide more logs 
> > > > > > >> if necessary. They just seem to get pretty large in few
> moments.
> > > > > > >>
> > > > > > >> Thank you
> > > > > > >> Tuomas
> > > > > > >>
> > > > > > >>
> > > > > > >> _______________________________________________
> > > > > > >> ceph-users mailing list
> > > > > > >> ceph-users@lists.ceph.com
> > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users@lists.ceph.com
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > 
> > > > > > 
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-users@lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-users@lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> 
> 


* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                                   ` <alpine.DEB.2.00.1504281256440.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-04-28 20:57                                     ` Sage Weil
       [not found]                                       ` <alpine.DEB.2.00.1504281355130.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2015-04-28 20:57 UTC (permalink / raw)
  To: Tuomas Juntunen
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi Tuomas,

I've pushed an updated wip-hammer-snaps branch.  Can you please try it?  
The build will appear here

	http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286

(or a similar url; adjust for your distro).

Thanks!
sage


On Tue, 28 Apr 2015, Sage Weil wrote:

> [adding ceph-devel]
> 
> Okay, I see the problem.  This seems to be unrelated to the giant -> 
> hammer move... it's a result of the tiering changes you made:
> 
> > > > > > > The following:
> > > > > > > 
> > > > > > > ceph osd tier add img images --force-nonempty
> > > > > > > ceph osd tier cache-mode images forward 
> > > > > > > ceph osd tier set-overlay img images
> 
> Specifically, --force-nonempty bypassed important safety checks.
> 
> 1. images had snapshots (and removed_snaps)
> 
> 2. images was added as a tier *of* img, and img's removed_snaps was copied 
> to images, clobbering the removed_snaps value (see 
> OSDMap::Incremental::propagate_snaps_to_tiers)
> 
> 3. tiering relation was undone, but removed_snaps was still gone
> 
> 4. on OSD startup, when we load the PG, removed_snaps is initialized with 
> the older map.  later, in PGPool::update(), we assume that removed_snaps 
> always grows (never shrinks) and we trigger an assert.
> 
> To fix this I think we need to do 2 things:
> 
> 1. make the OSD forgiving of removed_snaps getting smaller.  This is 
> probably a good thing anyway: once we know snaps are removed on all OSDs 
> we can prune the interval_set in the OSDMap.  Maybe.
> 
> 2. Fix the mon to prevent this from happening, *even* when 
> --force-nonempty is specified.  (This is the root cause.)
> 
> I've opened http://tracker.ceph.com/issues/11493 to track this.
> 
> sage
> 
>     
> 
> > > > > > > 
> > > > > > > Idea was to make images as a tier to img, move data to img 
> > > > > > > then change
> > > > > > clients to use the new img pool.
> > > > > > > 
> > > > > > > Br,
> > > > > > > Tuomas
> > > > > > > 
> > > > > > > > Can you explain exactly what you mean by:
> > > > > > > >
> > > > > > > > "Also I created one pool for tier to be able to move data 
> > > > > > > > without
> > > > > > outage."
> > > > > > > >
> > > > > > > > -Sam
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "tuomas juntunen" <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer and 
> > > > > > > > after some basic operations most of the OSD's went down
> > > > > > > >
> > > > > > > > Hi
> > > > > > > >
> > > > > > > > Any solution for this yet?
> > > > > > > >
> > > > > > > > Br,
> > > > > > > > Tuomas
> > > > > > > >
> > > > > > > >> It looks like you may have hit
> > > > > > > >> http://tracker.ceph.com/issues/7915
> > > > > > > >>
> > > > > > > >> Ian R. Colle
> > > > > > > >> Global Director
> > > > > > > >> of Software Engineering
> > > > > > > >> Red Hat (Inktank is now part of Red Hat!) 
> > > > > > > >> http://www.linkedin.com/in/ircolle
> > > > > > > >> http://www.twitter.com/ircolle
> > > > > > > >> Cell: +1.303.601.7713
> > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > >>
> > > > > > > >> ----- Original Message -----
> > > > > > > >> From: "tuomas juntunen" <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> > > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and 
> > > > > > > >> after some basic operations most of the OSD's went down
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> > > > > > > >>
> > > > > > > >> Then created new pools and deleted some old ones. Also I 
> > > > > > > >> created one pool for tier to be able to move data without
> > outage.
> > > > > > > >>
> > > > > > > >> After these operations all but 10 OSD's are down and 
> > > > > > > >> creating this kind of messages to logs, I get more than 
> > > > > > > >> 100gb of these in a
> > > > > night:
> > > > > > > >>
> > > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] enter Started
> > > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] enter Start
> > > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> > > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838 
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray
> > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > > >> NOTIFY] exit Reset 0.119467 4 0.000037
> > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > > >> NOTIFY] enter Started
> > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > > >> NOTIFY] enter Start
> > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > > >> NOTIFY]
> > > > > > > >> state<Start>: transitioning to Stray
> > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > > >> NOTIFY] exit Start 0.000020 0 0.000000
> > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive 
> > > > > > > >> NOTIFY] enter Started/Stray
> > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> inactive] exit Reset 7.511623 45 0.000165
> > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> inactive] enter Started
> > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> inactive] enter Start
> > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> inactive]
> > > > > > > >> state<Start>: transitioning to Primary
> > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> inactive] exit Start 0.000023 0 0.000000
> > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> inactive] enter Started/Primary
> > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> inactive] enter Started/Primary/Peering
> > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 
> > > > > > > >> peering] enter Started/Primary/Peering/GetInfo
> > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1
> > > > > > ./include/interval_set.h:
> > > > > > > >> In
> > > > > > > >> function 'void interval_set<T>::erase(T, T) [with T =
> > snapid_t]' 
> > > > > > > >> thread
> > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 0)
> > > > > > > >>
> > > > > > > >>  ceph version 0.94.1
> > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > >> int, char
> > > > > > > >> const*)+0x8b)
> > > > > > > >> [0xbc271b]
> > > > > > > >>  2: 
> > > > > > > >> (interval_set<snapid_t>::subtract(interval_set<snapid_t>
> > > > > > > >> const&)+0xb0) [0x82cd50]
> > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap
> > > > > > > >> const>)+0x52e) [0x80113e]
> > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>, 
> > > > > > > >> const>std::vector<int,
> > > > > > > >> std::allocator<int> >&, int, std::vector<int, 
> > > > > > > >> std::allocator<int>
> > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, 
> > > > > > > >> ThreadPool::TPHandle&, PG::RecoveryCtx*, 
> > > > > > > >> std::set<boost::intrusive_ptr<PG>,
> > > > > > > >> std::less<boost::intrusive_ptr<PG> >, 
> > > > > > > >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) 
> > > > > > > >> [0x6b0e43]
> > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*,
> > > > > > > >> std::allocator<PG*>
> > > > > > > >> > const&,
> > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*,
> > > > > > > >> std::allocator<PG*>
> > > > > > > >> > const&,
> > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]
> > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> > > > > > > >> [0xbb38ae]
> > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
> > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> > > > > > > >>
> > > > > > > >> Also by monitoring (ceph -w) I get the following messages, 
> > > > > > > >> also lots of
> > > > > > them.
> > > > > > > >>
> > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
> > > > > > 10.20.0.13:0/1174409'
> > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush create-or-move",
> > > "args":
> > > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]: 
> > > > > > > >> dispatch
> > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
> > > > > > 10.20.0.13:0/1174483'
> > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush create-or-move",
> > > "args":
> > > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]: 
> > > > > > > >> dispatch
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are also 
> > > > > > > >> mons and mds's to save servers. All run Ubuntu 14.04.2.
> > > > > > > >>
> > > > > > > >> I have pretty much tried everything I could think of.
> > > > > > > >>
> > > > > > > >> Restarting daemons doesn't help.
> > > > > > > >>
> > > > > > > >> Any help would be appreciated. I can also provide more logs 
> > > > > > > >> if necessary. They just seem to get pretty large in few
> > moments.
> > > > > > > >>
> > > > > > > >> Thank you
> > > > > > > >> Tuomas
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> _______________________________________________
> > > > > > > >> ceph-users mailing list
> > > > > > > >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > ceph-users mailing list
> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                                       ` <alpine.DEB.2.00.1504281355130.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-04-29  4:16                                         ` Tuomas Juntunen
       [not found]                                           ` <81216125e573cf00539f61cc090b282b-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Tuomas Juntunen @ 2015-04-29  4:16 UTC (permalink / raw)
  To: 'Sage Weil'
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 17530 bytes --]

Hi

I updated to that version and it seems that something did happen: the osd's
stayed up for a while and 'ceph status' got updated. But then, in a couple of
minutes, they all went down the same way.

I have attached a new 'ceph osd dump -f json-pretty' and got a new log from
one of the osd's with osd debug = 20:
http://beta.xaasbox.com/ceph/ceph-osd.15.log

Thank you!

Br,
Tuomas



-----Original Message-----
From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org] 
Sent: 28. huhtikuuta 2015 23:57
To: Tuomas Juntunen
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
operations most of the OSD's went down

Hi Tuomas,

I've pushed an updated wip-hammer-snaps branch.  Can you please try it?  
The build will appear here

	http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286

(or a similar url; adjust for your distro).

Thanks!
sage


On Tue, 28 Apr 2015, Sage Weil wrote:

> [adding ceph-devel]
> 
> Okay, I see the problem.  This seems to be unrelated to the giant -> 
> hammer move... it's a result of the tiering changes you made:
> 
> > > > > > > The following:
> > > > > > > 
> > > > > > > ceph osd tier add img images --force-nonempty ceph osd 
> > > > > > > tier cache-mode images forward ceph osd tier set-overlay 
> > > > > > > img images
> 
> Specifically, --force-nonempty bypassed important safety checks.
> 
> 1. images had snapshots (and removed_snaps)
> 
> 2. images was added as a tier *of* img, and img's removed_snaps was 
> copied to images, clobbering the removed_snaps value (see
> OSDMap::Incremental::propagate_snaps_to_tiers)
> 
> 3. tiering relation was undone, but removed_snaps was still gone
> 
> 4. on OSD startup, when we load the PG, removed_snaps is initialized 
> with the older map.  later, in PGPool::update(), we assume that 
> removed_snaps always grows (never shrinks) and we trigger an assert.
> 
> To fix this I think we need to do 2 things:
> 
> 1. make the OSD forgiving of removed_snaps getting smaller.  This is 
> probably a good thing anyway: once we know snaps are removed on all 
> OSDs we can prune the interval_set in the OSDMap.  Maybe.
> 
> 2. Fix the mon to prevent this from happening, *even* when 
> --force-nonempty is specified.  (This is the root cause.)
> 
> I've opened http://tracker.ceph.com/issues/11493 to track this.
> 
> sage
> 
>     
> 
> > > > > > > 
> > > > > > > Idea was to make images as a tier to img, move data to img 
> > > > > > > then change
> > > > > > clients to use the new img pool.
> > > > > > > 
> > > > > > > Br,
> > > > > > > Tuomas
> > > > > > > 
> > > > > > > > Can you explain exactly what you mean by:
> > > > > > > >
> > > > > > > > "Also I created one pool for tier to be able to move 
> > > > > > > > data without
> > > > > > outage."
> > > > > > > >
> > > > > > > > -Sam
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "tuomas juntunen" 
> > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer 
> > > > > > > > and after some basic operations most of the OSD's went 
> > > > > > > > down
> > > > > > > >
> > > > > > > > Hi
> > > > > > > >
> > > > > > > > Any solution for this yet?
> > > > > > > >
> > > > > > > > Br,
> > > > > > > > Tuomas
> > > > > > > >
> > > > > > > >> It looks like you may have hit
> > > > > > > >> http://tracker.ceph.com/issues/7915
> > > > > > > >>
> > > > > > > >> Ian R. Colle
> > > > > > > >> Global Director
> > > > > > > >> of Software Engineering Red Hat (Inktank is now part of 
> > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
> > > > > > > >> http://www.twitter.com/ircolle
> > > > > > > >> Cell: +1.303.601.7713
> > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > >>
> > > > > > > >> ----- Original Message -----
> > > > > > > >> From: "tuomas juntunen" 
> > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> > > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and 
> > > > > > > >> after some basic operations most of the OSD's went down
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> > > > > > > >>
> > > > > > > >> Then created new pools and deleted some old ones. Also 
> > > > > > > >> I created one pool for tier to be able to move data 
> > > > > > > >> without
> > outage.
> > > > > > > >>
> > > > > > > >> After these operations all but 10 OSD's are down and 
> > > > > > > >> creating this kind of messages to logs, I get more than 
> > > > > > > >> 100gb of these in a
> > > > > night:
> > > > > > > >>
> > > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23
> > pg_epoch:
> > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > >> n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] enter Started
> > > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > >> n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] enter Start
> > > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > >> n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> > > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > >> n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > >> n=0
> > > > > > > >> ec=1 les/c
> > > > > > > >> 16609/16659
> > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > >> pi=15659-16589/42
> > > > > > > >> crt=8480'7 lcod
> > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray
> > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > >> les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > >> inactive NOTIFY] exit Reset 0.119467 4 0.000037
> > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > >> les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > >> inactive NOTIFY] enter Started
> > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > >> les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > >> inactive NOTIFY] enter Start
> > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > >> les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > >> inactive NOTIFY]
> > > > > > > >> state<Start>: transitioning to Stray
> > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > >> les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > >> inactive NOTIFY] exit Start 0.000020 0 0.000000
> > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > >> les/c
> > > > > > > >> 17879/17879
> > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > >> inactive NOTIFY] enter Started/Stray
> > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 0.000165
> > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 inactive] enter Started
> > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 inactive] enter Start
> > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 inactive]
> > > > > > > >> state<Start>: transitioning to Primary
> > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000
> > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 inactive] enter Started/Primary
> > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 inactive] enter Started/Primary/Peering
> > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5 
> > > > > > > >> osd.23
> > > pg_epoch:
> > > > 
> > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > >> 16127/16344
> > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > >> 0'0 peering] enter Started/Primary/Peering/GetInfo
> > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1
> > > > > > ./include/interval_set.h:
> > > > > > > >> In
> > > > > > > >> function 'void interval_set<T>::erase(T, T) [with T =
> > snapid_t]' 
> > > > > > > >> thread
> > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 
> > > > > > > >> 0)
> > > > > > > >>
> > > > > > > >>  ceph version 0.94.1
> > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > >> int, char
> > > > > > > >> const*)+0x8b)
> > > > > > > >> [0xbc271b]
> > > > > > > >>  2: 
> > > > > > > >> (interval_set<snapid_t>::subtract(interval_set<snapid_t
> > > > > > > >> >
> > > > > > > >> const&)+0xb0) [0x82cd50]
> > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap
> > > > > > > >> const>)+0x52e) [0x80113e]
> > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>, 
> > > > > > > >> const>std::vector<int,
> > > > > > > >> std::allocator<int> >&, int, std::vector<int, 
> > > > > > > >> std::allocator<int>
> > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, 
> > > > > > > >> ThreadPool::TPHandle&, PG::RecoveryCtx*, 
> > > > > > > >> std::set<boost::intrusive_ptr<PG>,
> > > > > > > >> std::less<boost::intrusive_ptr<PG> >, 
> > > > > > > >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) 
> > > > > > > >> [0x6b0e43]
> > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*,
> > > > > > > >> std::allocator<PG*>
> > > > > > > >> > const&,
> > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*,
> > > > > > > >> std::allocator<PG*>
> > > > > > > >> > const&,
> > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]
> > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> > > > > > > >> [0xbb38ae]
> > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
> > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> > > > > > > >>
> > > > > > > >> Also by monitoring (ceph -w) I get the following 
> > > > > > > >> messages, also lots of
> > > > > > them.
> > > > > > > >>
> > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
> > > > > > 10.20.0.13:0/1174409'
> > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush 
> > > > > > > >> create-or-move",
> > > "args":
> > > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]:

> > > > > > > >> dispatch
> > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
> > > > > > 10.20.0.13:0/1174483'
> > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush 
> > > > > > > >> create-or-move",
> > > "args":
> > > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]:

> > > > > > > >> dispatch
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are 
> > > > > > > >> also mons and mds's to save servers. All run Ubuntu
14.04.2.
> > > > > > > >>
> > > > > > > >> I have pretty much tried everything I could think of.
> > > > > > > >>
> > > > > > > >> Restarting daemons doesn't help.
> > > > > > > >>
> > > > > > > >> Any help would be appreciated. I can also provide more 
> > > > > > > >> logs if necessary. They just seem to get pretty large 
> > > > > > > >> in few
> > moments.
> > > > > > > >>
> > > > > > > >> Thank you
> > > > > > > >> Tuomas
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> _______________________________________________
> > > > > > > >> ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > ceph-users mailing list
> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 

[-- Attachment #2: 18610json.pretty.txt --]
[-- Type: text/plain, Size: 93942 bytes --]


{
    "epoch": 18610,
    "fsid": "a2974742-3805-4cd3-bc79-765f2bddaefe",
    "created": "2014-10-15 20:43:45.186949",
    "modified": "2015-04-29 06:49:32.691995",
    "flags": "",
    "cluster_snapshot": "",
    "pool_max": 17,
    "max_osd": 71,
    "pools": [
        {
            "pool": 0,
            "pool_name": "data",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 1,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 4096,
            "pg_placement_num": 4096,
            "crash_replay_interval": 45,
            "last_change": "1112",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 0,
            "cache_target_full_ratio_micro": 0,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 1,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 1,
            "pool_name": "metadata",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 1,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 4096,
            "pg_placement_num": 4096,
            "crash_replay_interval": 0,
            "last_change": "1114",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 0,
            "cache_target_full_ratio_micro": 0,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 1,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 2,
            "pool_name": "rbd",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 2,
            "min_size": 1,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 4096,
            "pg_placement_num": 4096,
            "crash_replay_interval": 0,
            "last_change": "1116",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 0,
            "cache_target_full_ratio_micro": 0,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 1,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 3,
            "pool_name": "volumes",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 1,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 4096,
            "pg_placement_num": 4096,
            "crash_replay_interval": 0,
            "last_change": "9974",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 23,
            "snap_epoch": 9974,
            "pool_snaps": [],
            "removed_snaps": "[1~17]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "default",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 1,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 4,
            "pool_name": "images",
            "flags": 9,
            "flags_names": "hashpspool,incomplete_clones",
            "type": 1,
            "size": 3,
            "min_size": 1,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 4096,
            "pg_placement_num": 4096,
            "crash_replay_interval": 0,
            "last_change": "17905",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 17882,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 0,
            "cache_target_full_ratio_micro": 0,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "default",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 1,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 6,
            "pool_name": "vms",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 1,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 4096,
            "pg_placement_num": 4096,
            "crash_replay_interval": 0,
            "last_change": "1122",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "default",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 1,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 7,
            "pool_name": "san",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 1,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 4096,
            "pg_placement_num": 4096,
            "crash_replay_interval": 0,
            "last_change": "14096",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 0,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 8,
            "pool_name": "vol-ssd-accelerated",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 2,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 1024,
            "pg_placement_num": 1024,
            "crash_replay_interval": 0,
            "last_change": "17861",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 0,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 14,
            "pool_name": "backup",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 2,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 128,
            "pg_placement_num": 128,
            "crash_replay_interval": 0,
            "last_change": "18018",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 0,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 15,
            "pool_name": "img",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 2,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 256,
            "pg_placement_num": 256,
            "crash_replay_interval": 0,
            "last_change": "18019",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 0,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 16,
            "pool_name": "vm",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 2,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 1024,
            "pg_placement_num": 1024,
            "crash_replay_interval": 0,
            "last_change": "18020",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 0,
            "stripe_width": 0,
            "expected_num_objects": 0
        },
        {
            "pool": 17,
            "pool_name": "infradisks",
            "flags": 1,
            "flags_names": "hashpspool",
            "type": 1,
            "size": 3,
            "min_size": 2,
            "crush_ruleset": 0,
            "object_hash": 2,
            "pg_num": 256,
            "pg_placement_num": 256,
            "crash_replay_interval": 0,
            "last_change": "18021",
            "last_force_op_resend": "0",
            "auid": 0,
            "snap_mode": "selfmanaged",
            "snap_seq": 0,
            "snap_epoch": 0,
            "pool_snaps": [],
            "removed_snaps": "[]",
            "quota_max_bytes": 0,
            "quota_max_objects": 0,
            "tiers": [],
            "tier_of": -1,
            "read_tier": -1,
            "write_tier": -1,
            "cache_mode": "none",
            "target_max_bytes": 0,
            "target_max_objects": 0,
            "cache_target_dirty_ratio_micro": 400000,
            "cache_target_full_ratio_micro": 800000,
            "cache_min_flush_age": 0,
            "cache_min_evict_age": 0,
            "erasure_code_profile": "",
            "hit_set_params": {
                "type": "none"
            },
            "hit_set_period": 0,
            "hit_set_count": 0,
            "min_read_recency_for_promote": 0,
            "stripe_width": 0,
            "expected_num_objects": 0
        }
    ],
    "osds": [
        {
            "osd": 0,
            "uuid": "757c3bc5-4d00-4344-8de4-82f5379c96af",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15738,
            "last_clean_end": 17882,
            "up_from": 18352,
            "up_thru": 18353,
            "down_at": 18415,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6833\/2259607",
            "cluster_addr": "10.20.0.11:6836\/2259607",
            "heartbeat_back_addr": "10.20.0.11:6853\/2259607",
            "heartbeat_front_addr": "10.20.0.11:6856\/2259607",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 1,
            "uuid": "c7eaa4ac-99fc-46db-84aa-a67274896ec8",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15740,
            "last_clean_end": 17882,
            "up_from": 18350,
            "up_thru": 18352,
            "down_at": 18403,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6813\/2259893",
            "cluster_addr": "10.20.0.11:6814\/2259893",
            "heartbeat_back_addr": "10.20.0.11:6815\/2259893",
            "heartbeat_front_addr": "10.20.0.11:6825\/2259893",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 2,
            "uuid": "206b2949-4adf-4789-8e06-f68a8ee819c9",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15739,
            "last_clean_end": 17882,
            "up_from": 18348,
            "up_thru": 18348,
            "down_at": 18415,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6809\/2259657",
            "cluster_addr": "10.20.0.11:6810\/2259657",
            "heartbeat_back_addr": "10.20.0.11:6811\/2259657",
            "heartbeat_front_addr": "10.20.0.11:6812\/2259657",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 3,
            "uuid": "90b7c219-4dcd-48ea-a24d-f3b796a521e4",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15736,
            "last_clean_end": 17882,
            "up_from": 18346,
            "up_thru": 18346,
            "down_at": 18412,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6829\/2257497",
            "cluster_addr": "10.20.0.11:6830\/2257497",
            "heartbeat_back_addr": "10.20.0.11:6831\/2257497",
            "heartbeat_front_addr": "10.20.0.11:6832\/2257497",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 4,
            "uuid": "049ef94f-121a-4e71-8ba6-27eaebf0a569",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15737,
            "last_clean_end": 17883,
            "up_from": 18342,
            "up_thru": 18345,
            "down_at": 18415,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6861\/2257349",
            "cluster_addr": "10.20.0.11:6862\/2257349",
            "heartbeat_back_addr": "10.20.0.11:6863\/2257349",
            "heartbeat_front_addr": "10.20.0.11:6864\/2257349",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 5,
            "uuid": "2437a53b-339e-45af-b4de-0fc675d27405",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15734,
            "last_clean_end": 17882,
            "up_from": 18347,
            "up_thru": 18347,
            "down_at": 18403,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6821\/2256278",
            "cluster_addr": "10.20.0.11:6822\/2256278",
            "heartbeat_back_addr": "10.20.0.11:6823\/2256278",
            "heartbeat_front_addr": "10.20.0.11:6824\/2256278",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 6,
            "uuid": "f117ceed-b1fd-4069-99fe-b7aba9f3ef8d",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15738,
            "last_clean_end": 17882,
            "up_from": 18349,
            "up_thru": 18349,
            "down_at": 18415,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6854\/2257155",
            "cluster_addr": "10.20.0.11:6855\/2257155",
            "heartbeat_back_addr": "10.20.0.11:6857\/2257155",
            "heartbeat_front_addr": "10.20.0.11:6859\/2257155",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 7,
            "uuid": "e98e9b8a-9c62-4e3e-bdb4-c2c30103c0c1",
            "up": 0,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15730,
            "last_clean_end": 17883,
            "up_from": 18345,
            "up_thru": 18345,
            "down_at": 18419,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6873\/2258645",
            "cluster_addr": "10.20.0.11:6874\/2258645",
            "heartbeat_back_addr": "10.20.0.11:6875\/2258645",
            "heartbeat_front_addr": "10.20.0.11:6876\/2258645",
            "state": [
                "exists"
            ]
        },
        {
            "osd": 8,
            "uuid": "41e471cd-fafe-4422-8bf5-22018bbe1375",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 17795,
            "last_clean_end": 17882,
            "up_from": 18346,
            "up_thru": 18346,
            "down_at": 18412,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6877\/2258943",
            "cluster_addr": "10.20.0.11:6878\/2258943",
            "heartbeat_back_addr": "10.20.0.11:6879\/2258943",
            "heartbeat_front_addr": "10.20.0.11:6880\/2258943",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 9,
            "uuid": "d68eeebd-d058-4b1c-a30a-994bf8fc8030",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15733,
            "last_clean_end": 17882,
            "up_from": 18347,
            "up_thru": 18347,
            "down_at": 18410,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6849\/2258152",
            "cluster_addr": "10.20.0.11:6850\/2258152",
            "heartbeat_back_addr": "10.20.0.11:6851\/2258152",
            "heartbeat_front_addr": "10.20.0.11:6852\/2258152",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 10,
            "uuid": "660747d6-3f47-449a-bc69-5399b0d54ff6",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15736,
            "last_clean_end": 17883,
            "up_from": 18345,
            "up_thru": 18346,
            "down_at": 18403,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6841\/2256646",
            "cluster_addr": "10.20.0.11:6842\/2256646",
            "heartbeat_back_addr": "10.20.0.11:6843\/2256646",
            "heartbeat_front_addr": "10.20.0.11:6844\/2256646",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 11,
            "uuid": "805965b1-127f-44a6-9a05-8a643eb7a512",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 17801,
            "last_clean_end": 17882,
            "up_from": 18349,
            "up_thru": 18351,
            "down_at": 18439,
            "lost_at": 0,
            "public_addr": "10.20.0.11:6801\/2257816",
            "cluster_addr": "10.20.0.11:6802\/2257816",
            "heartbeat_back_addr": "10.20.0.11:6803\/2257816",
            "heartbeat_front_addr": "10.20.0.11:6804\/2257816",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 12,
            "uuid": "61fbfcbe-d642-478f-9620-f9d72ee96238",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18162,
            "up_thru": 18162,
            "down_at": 18208,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6833\/3261949",
            "cluster_addr": "10.20.0.12:6834\/3261949",
            "heartbeat_back_addr": "10.20.0.12:6835\/3261949",
            "heartbeat_front_addr": "10.20.0.12:6836\/3261949",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 13,
            "uuid": "6faad33b-00be-42a4-92ba-08be5ab7f995",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18164,
            "up_thru": 18166,
            "down_at": 18206,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6885\/3262416",
            "cluster_addr": "10.20.0.12:6886\/3262416",
            "heartbeat_back_addr": "10.20.0.12:6887\/3262416",
            "heartbeat_front_addr": "10.20.0.12:6888\/3262416",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 14,
            "uuid": "f301705b-e725-443d-96e1-d9ec9aafe657",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17883,
            "up_from": 18164,
            "up_thru": 18164,
            "down_at": 18352,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6861\/3261624",
            "cluster_addr": "10.20.0.12:6862\/3261624",
            "heartbeat_back_addr": "10.20.0.12:6863\/3261624",
            "heartbeat_front_addr": "10.20.0.12:6864\/3261624",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 15,
            "uuid": "536bf483-10de-44b0-8e1e-4f349fbe572a",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15785,
            "last_clean_end": 17882,
            "up_from": 18168,
            "up_thru": 18168,
            "down_at": 18352,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6805\/3262650",
            "cluster_addr": "10.20.0.12:6806\/3262650",
            "heartbeat_back_addr": "10.20.0.12:6807\/3262650",
            "heartbeat_front_addr": "10.20.0.12:6808\/3262650",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 16,
            "uuid": "4185bd20-8eb0-4616-b36e-bacb181ae40e",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18164,
            "up_thru": 18164,
            "down_at": 18352,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6849\/3261188",
            "cluster_addr": "10.20.0.12:6850\/3261188",
            "heartbeat_back_addr": "10.20.0.12:6851\/3261188",
            "heartbeat_front_addr": "10.20.0.12:6852\/3261188",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 17,
            "uuid": "a6f2f5b4-477f-48f9-9acf-d5b7a6c88b98",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18164,
            "up_thru": 18166,
            "down_at": 18352,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6857\/3261610",
            "cluster_addr": "10.20.0.12:6858\/3261610",
            "heartbeat_back_addr": "10.20.0.12:6859\/3261610",
            "heartbeat_front_addr": "10.20.0.12:6860\/3261610",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 18,
            "uuid": "b31b0bd8-938a-496d-91bc-19bf4f794f82",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17883,
            "up_from": 18164,
            "up_thru": 18166,
            "down_at": 18352,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6869\/3261788",
            "cluster_addr": "10.20.0.12:6870\/3261788",
            "heartbeat_back_addr": "10.20.0.12:6871\/3261788",
            "heartbeat_front_addr": "10.20.0.12:6872\/3261788",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 19,
            "uuid": "d76b6bd5-1ef3-436c-a75d-3587c515eb56",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18150,
            "up_thru": 18150,
            "down_at": 18203,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6865\/3261778",
            "cluster_addr": "10.20.0.12:6866\/3261778",
            "heartbeat_back_addr": "10.20.0.12:6867\/3261778",
            "heartbeat_front_addr": "10.20.0.12:6868\/3261778",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 20,
            "uuid": "8e4dd982-a4c5-4ca5-9fc5-243f55c4db57",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18151,
            "up_thru": 18151,
            "down_at": 18239,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6881\/3262190",
            "cluster_addr": "10.20.0.12:6882\/3262190",
            "heartbeat_back_addr": "10.20.0.12:6883\/3262190",
            "heartbeat_front_addr": "10.20.0.12:6884\/3262190",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 21,
            "uuid": "760aaf28-0a34-4bbc-af0c-2654b0a43fff",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18150,
            "up_thru": 18150,
            "down_at": 18203,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6845\/3261106",
            "cluster_addr": "10.20.0.12:6846\/3261106",
            "heartbeat_back_addr": "10.20.0.12:6847\/3261106",
            "heartbeat_front_addr": "10.20.0.12:6848\/3261106",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 22,
            "uuid": "40322a34-ab31-4760-b71e-a7672f812cb3",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18161,
            "up_thru": 18161,
            "down_at": 18352,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6853\/3261379",
            "cluster_addr": "10.20.0.12:6854\/3261379",
            "heartbeat_back_addr": "10.20.0.12:6855\/3261379",
            "heartbeat_front_addr": "10.20.0.12:6856\/3261379",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 23,
            "uuid": "e1d81949-f4b5-4cf2-b6af-dccaaeb30ed7",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15783,
            "last_clean_end": 17882,
            "up_from": 18165,
            "up_thru": 18166,
            "down_at": 18352,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6873\/3262047",
            "cluster_addr": "10.20.0.12:6874\/3262047",
            "heartbeat_back_addr": "10.20.0.12:6875\/3262047",
            "heartbeat_front_addr": "10.20.0.12:6876\/3262047",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 24,
            "uuid": "ede77283-a423-4c6b-9c6e-b0e807c63cb5",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 18520,
            "last_clean_end": 18582,
            "up_from": 18589,
            "up_thru": 18592,
            "down_at": 18588,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6801\/3842583",
            "cluster_addr": "10.20.0.13:6839\/3842583",
            "heartbeat_back_addr": "10.20.0.13:6840\/3842583",
            "heartbeat_front_addr": "10.20.0.13:6841\/3842583",
            "state": [
                "exists",
                "up"
            ]
        },
        {
            "osd": 25,
            "uuid": "7cfe85f8-3ae9-493d-9801-025ff6c6265d",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15686,
            "last_clean_end": 17883,
            "up_from": 18426,
            "up_thru": 18426,
            "down_at": 18518,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6829\/3788954",
            "cluster_addr": "10.20.0.13:6830\/3788954",
            "heartbeat_back_addr": "10.20.0.13:6831\/3788954",
            "heartbeat_front_addr": "10.20.0.13:6832\/3788954",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 26,
            "uuid": "266f6d70-519f-4c24-bca2-236495a600a7",
            "up": 0,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15692,
            "last_clean_end": 17883,
            "up_from": 18420,
            "up_thru": 18421,
            "down_at": 18542,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6873\/3788357",
            "cluster_addr": "10.20.0.13:6874\/3788357",
            "heartbeat_back_addr": "10.20.0.13:6875\/3788357",
            "heartbeat_front_addr": "10.20.0.13:6876\/3788357",
            "state": [
                "exists"
            ]
        },
        {
            "osd": 27,
            "uuid": "68644fa9-9459-4db0-a6c9-01661645038b",
            "up": 0,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15689,
            "last_clean_end": 17883,
            "up_from": 18420,
            "up_thru": 18421,
            "down_at": 18527,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6813\/3788083",
            "cluster_addr": "10.20.0.13:6814\/3788083",
            "heartbeat_back_addr": "10.20.0.13:6815\/3788083",
            "heartbeat_front_addr": "10.20.0.13:6816\/3788083",
            "state": [
                "exists"
            ]
        },
        {
            "osd": 28,
            "uuid": "fc3d5749-7673-4100-a0d4-f25e9cc0bc88",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15688,
            "last_clean_end": 17882,
            "up_from": 18424,
            "up_thru": 18424,
            "down_at": 18518,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6825\/3789248",
            "cluster_addr": "10.20.0.13:6826\/3789248",
            "heartbeat_back_addr": "10.20.0.13:6827\/3789248",
            "heartbeat_front_addr": "10.20.0.13:6828\/3789248",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 29,
            "uuid": "cb5feda9-de3f-4e42-bb73-7945b4928b22",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 18511,
            "last_clean_end": 18534,
            "up_from": 18544,
            "up_thru": 18571,
            "down_at": 18543,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6817\/3815548",
            "cluster_addr": "10.20.0.13:6868\/3815548",
            "heartbeat_back_addr": "10.20.0.13:6869\/3815548",
            "heartbeat_front_addr": "10.20.0.13:6870\/3815548",
            "state": [
                "exists",
                "up"
            ]
        },
        {
            "osd": 30,
            "uuid": "ef1e65bb-a634-4096-9466-1262af55db01",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15693,
            "last_clean_end": 17882,
            "up_from": 18437,
            "up_thru": 18585,
            "down_at": 17884,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6833\/3787367",
            "cluster_addr": "10.20.0.13:6834\/3787367",
            "heartbeat_back_addr": "10.20.0.13:6835\/3787367",
            "heartbeat_front_addr": "10.20.0.13:6836\/3787367",
            "state": [
                "exists",
                "up"
            ]
        },
        {
            "osd": 31,
            "uuid": "3dad6393-67a8-43d4-ba8d-ffd320827396",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 18534,
            "last_clean_end": 18551,
            "up_from": 18562,
            "up_thru": 18581,
            "down_at": 18561,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6842\/3819894",
            "cluster_addr": "10.20.0.13:6864\/3819894",
            "heartbeat_back_addr": "10.20.0.13:6865\/3819894",
            "heartbeat_front_addr": "10.20.0.13:6871\/3819894",
            "state": [
                "exists",
                "up"
            ]
        },
        {
            "osd": 32,
            "uuid": "db6f3afa-53ed-453a-97e3-861e88cb818f",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15684,
            "last_clean_end": 17882,
            "up_from": 18419,
            "up_thru": 18420,
            "down_at": 18523,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6809\/3786362",
            "cluster_addr": "10.20.0.13:6810\/3786362",
            "heartbeat_back_addr": "10.20.0.13:6811\/3786362",
            "heartbeat_front_addr": "10.20.0.13:6812\/3786362",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 33,
            "uuid": "d5e59852-06b4-4a30-8c5e-ff7e328b5455",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 18508,
            "last_clean_end": 18534,
            "up_from": 18551,
            "up_thru": 18577,
            "down_at": 18550,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6809\/3817103",
            "cluster_addr": "10.20.0.13:6810\/3817103",
            "heartbeat_back_addr": "10.20.0.13:6811\/3817103",
            "heartbeat_front_addr": "10.20.0.13:6812\/3817103",
            "state": [
                "exists",
                "up"
            ]
        },
        {
            "osd": 34,
            "uuid": "f35a10c5-217a-4cfb-88b9-7334bda441b8",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 18521,
            "last_clean_end": 18572,
            "up_from": 18592,
            "up_thru": 18592,
            "down_at": 18591,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6805\/3842840",
            "cluster_addr": "10.20.0.13:6819\/3842840",
            "heartbeat_back_addr": "10.20.0.13:6820\/3842840",
            "heartbeat_front_addr": "10.20.0.13:6821\/3842840",
            "state": [
                "exists",
                "up"
            ]
        },
        {
            "osd": 35,
            "uuid": "335e797f-a390-4f08-9da6-9ab76ffb12ae",
            "up": 0,
            "in": 0,
            "weight": 0.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 15687,
            "last_clean_end": 17882,
            "up_from": 18424,
            "up_thru": 18424,
            "down_at": 18498,
            "lost_at": 0,
            "public_addr": "10.20.0.13:6861\/3787537",
            "cluster_addr": "10.20.0.13:6862\/3787537",
            "heartbeat_back_addr": "10.20.0.13:6863\/3787537",
            "heartbeat_front_addr": "10.20.0.13:6864\/3787537",
            "state": [
                "autoout",
                "exists"
            ]
        },
        {
            "osd": 36,
            "uuid": "33c11fa1-1b03-42e4-8296-dc55ba052b35",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 18599,
            "last_clean_end": 18600,
            "up_from": 18606,
            "up_thru": 18609,
            "down_at": 18605,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6829\/3479135",
            "cluster_addr": "10.20.0.12:6830\/3479135",
            "heartbeat_back_addr": "10.20.0.12:6831\/3479135",
            "heartbeat_front_addr": "10.20.0.12:6832\/3479135",
            "state": [
                "exists",
                "up"
            ]
        },
        {
            "osd": 37,
            "uuid": "a97a791a-fe36-438b-80e2-db2a0d5e8e27",
            "up": 1,
            "in": 1,
            "weight": 1.000000,
            "primary_affinity": 1.000000,
            "last_clean_begin": 18596,
            "last_clean_end": 18600,
            "up_from": 18609,
            "up_thru": 18609,
            "down_at": 18608,
            "lost_at": 0,
            "public_addr": "10.20.0.12:6889\/3481637",
            "cluster_addr": "10.20.0.12:6890\/3481637",
            "heartbeat_back_addr": "10.20.0.12:6891\/3481637",
            "heartbeat_front_addr": "10.20.0.12:6894\/3481637",
            "state": [
                "exists",
                "up"
            ]
        }
    ],
    "osd_xinfo": [
        {
            "osd": 0,
            "down_stamp": "2015-04-29 06:36:11.510911",
            "laggy_probability": 0.648970,
            "laggy_interval": 32,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 1,
            "down_stamp": "2015-04-29 06:35:39.342646",
            "laggy_probability": 0.627290,
            "laggy_interval": 30,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 2,
            "down_stamp": "2015-04-29 06:36:11.510911",
            "laggy_probability": 0.617737,
            "laggy_interval": 47,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 3,
            "down_stamp": "2015-04-29 06:36:06.479824",
            "laggy_probability": 0.660475,
            "laggy_interval": 28,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 4,
            "down_stamp": "2015-04-29 06:36:11.510911",
            "laggy_probability": 0.642416,
            "laggy_interval": 39,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 5,
            "down_stamp": "2015-04-29 06:35:39.342646",
            "laggy_probability": 0.617737,
            "laggy_interval": 10,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 6,
            "down_stamp": "2015-04-29 06:36:11.510911",
            "laggy_probability": 0.642416,
            "laggy_interval": 41,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 7,
            "down_stamp": "2015-04-29 06:38:05.135599",
            "laggy_probability": 0.642416,
            "laggy_interval": 66,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 8,
            "down_stamp": "2015-04-29 06:36:06.479824",
            "laggy_probability": 0.449691,
            "laggy_interval": 40,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 9,
            "down_stamp": "2015-04-29 06:35:59.293041",
            "laggy_probability": 0.643535,
            "laggy_interval": 16,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 10,
            "down_stamp": "2015-04-29 06:35:39.342646",
            "laggy_probability": 0.616699,
            "laggy_interval": 48,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 11,
            "down_stamp": "2015-04-29 06:38:34.318677",
            "laggy_probability": 0.422864,
            "laggy_interval": 22,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 12,
            "down_stamp": "2015-04-29 06:30:10.761975",
            "laggy_probability": 0.594721,
            "laggy_interval": 41,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 13,
            "down_stamp": "2015-04-29 06:30:08.803695",
            "laggy_probability": 0.601756,
            "laggy_interval": 29,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 14,
            "down_stamp": "2015-04-29 06:34:32.372745",
            "laggy_probability": 0.663821,
            "laggy_interval": 21,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 15,
            "down_stamp": "2015-04-29 06:34:32.372745",
            "laggy_probability": 0.661855,
            "laggy_interval": 18,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 16,
            "down_stamp": "2015-04-29 06:34:32.372745",
            "laggy_probability": 0.663889,
            "laggy_interval": 13,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 17,
            "down_stamp": "2015-04-29 06:34:32.372745",
            "laggy_probability": 0.541368,
            "laggy_interval": 50,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 18,
            "down_stamp": "2015-04-29 06:34:32.372745",
            "laggy_probability": 0.622311,
            "laggy_interval": 35,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 19,
            "down_stamp": "2015-04-29 06:30:02.919322",
            "laggy_probability": 0.651860,
            "laggy_interval": 20,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 20,
            "down_stamp": "2015-04-29 06:30:45.855010",
            "laggy_probability": 0.626463,
            "laggy_interval": 30,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 21,
            "down_stamp": "2015-04-29 06:30:02.919322",
            "laggy_probability": 0.653627,
            "laggy_interval": 9,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 22,
            "down_stamp": "2015-04-29 06:34:32.372745",
            "laggy_probability": 0.666169,
            "laggy_interval": 12,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 23,
            "down_stamp": "2015-04-29 06:34:32.372745",
            "laggy_probability": 0.594888,
            "laggy_interval": 45,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 24,
            "down_stamp": "2015-04-29 06:45:16.246255",
            "laggy_probability": 0.193668,
            "laggy_interval": 10,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 25,
            "down_stamp": "2015-04-29 06:40:01.722875",
            "laggy_probability": 0.567685,
            "laggy_interval": 36,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 26,
            "down_stamp": "2015-04-29 06:40:42.614902",
            "laggy_probability": 0.601077,
            "laggy_interval": 92,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 27,
            "down_stamp": "2015-04-29 06:40:14.223004",
            "laggy_probability": 0.557502,
            "laggy_interval": 49,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 28,
            "down_stamp": "2015-04-29 06:40:01.722875",
            "laggy_probability": 0.635835,
            "laggy_interval": 27,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 29,
            "down_stamp": "2015-04-29 06:40:43.818245",
            "laggy_probability": 0.251127,
            "laggy_interval": 17,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 30,
            "down_stamp": "2015-04-26 14:21:27.940755",
            "laggy_probability": 0.606626,
            "laggy_interval": 30,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 31,
            "down_stamp": "2015-04-29 06:41:16.132199",
            "laggy_probability": 0.145557,
            "laggy_interval": 7,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 32,
            "down_stamp": "2015-04-29 06:40:06.732853",
            "laggy_probability": 0.568801,
            "laggy_interval": 37,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 33,
            "down_stamp": "2015-04-29 06:40:52.364979",
            "laggy_probability": 0.273623,
            "laggy_interval": 21,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 34,
            "down_stamp": "2015-04-29 06:45:19.569449",
            "laggy_probability": 0.233592,
            "laggy_interval": 36,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 35,
            "down_stamp": "2015-04-29 06:39:41.678784",
            "laggy_probability": 0.492127,
            "laggy_interval": 32,
            "features": 1125899906842623,
            "old_weight": 65536
        },
        {
            "osd": 36,
            "down_stamp": "2015-04-29 06:49:26.582575",
            "laggy_probability": 0.048084,
            "laggy_interval": 0,
            "features": 1125899906842623,
            "old_weight": 0
        },
        {
            "osd": 37,
            "down_stamp": "2015-04-29 06:49:30.662891",
            "laggy_probability": 0.140542,
            "laggy_interval": 7,
            "features": 1125899906842623,
            "old_weight": 65536
        }
    ],
    "pg_temp": [
        {
            "pgid": "0.31",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "0.ac",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "0.d4",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "0.169",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "0.1b8",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "0.1d2",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "0.1f0",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "0.600",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "0.855",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "0.87a",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "0.8d8",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "0.97a",
            "osds": [
                37,
                22,
                36
            ]
        },
        {
            "pgid": "0.a12",
            "osds": [
                36,
                28,
                37
            ]
        },
        {
            "pgid": "0.a1c",
            "osds": [
                37,
                14,
                36
            ]
        },
        {
            "pgid": "0.ad4",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "0.aef",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "0.b30",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "0.b7f",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "0.ba5",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "0.bc3",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "0.d9c",
            "osds": [
                37,
                9,
                36
            ]
        },
        {
            "pgid": "0.e16",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "0.e71",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "0.eab",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "0.ef9",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "0.f09",
            "osds": [
                36,
                35,
                37
            ]
        },
        {
            "pgid": "0.f32",
            "osds": [
                37,
                26,
                36
            ]
        },
        {
            "pgid": "0.f37",
            "osds": [
                37,
                18,
                36
            ]
        },
        {
            "pgid": "0.fc0",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "0.fd4",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "0.fdf",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "1.31",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "1.39",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "1.4c",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "1.68",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "1.6b",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "1.9f",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "1.dd",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "1.174",
            "osds": [
                37,
                23,
                36
            ]
        },
        {
            "pgid": "1.178",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "1.1c2",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "1.1f8",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "1.1fc",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "1.40a",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "1.4b3",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "1.53f",
            "osds": [
                37,
                16,
                36
            ]
        },
        {
            "pgid": "1.5cc",
            "osds": [
                37,
                16,
                36
            ]
        },
        {
            "pgid": "1.82b",
            "osds": [
                36,
                25,
                37
            ]
        },
        {
            "pgid": "1.90d",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "1.9ec",
            "osds": [
                36,
                32,
                37
            ]
        },
        {
            "pgid": "1.9ff",
            "osds": [
                34,
                37
            ]
        },
        {
            "pgid": "1.a6d",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "1.b76",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "1.b8a",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "1.c7a",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "1.cb9",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "1.ced",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "1.d05",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "1.d30",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "1.d7a",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "1.ddf",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "1.e0f",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "1.e4f",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "1.e97",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "1.efd",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "1.f2c",
            "osds": [
                37,
                22,
                36
            ]
        },
        {
            "pgid": "1.f3d",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "1.f4b",
            "osds": [
                34,
                31,
                36
            ]
        },
        {
            "pgid": "1.f9a",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "1.fca",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "2.76",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "2.c4",
            "osds": [
                34,
                37
            ]
        },
        {
            "pgid": "2.150",
            "osds": [
                33,
                24
            ]
        },
        {
            "pgid": "2.159",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "2.1b4",
            "osds": [
                34,
                37
            ]
        },
        {
            "pgid": "2.1cf",
            "osds": [
                29,
                24
            ]
        },
        {
            "pgid": "2.1fa",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "2.545",
            "osds": [
                33,
                24
            ]
        },
        {
            "pgid": "2.7e4",
            "osds": [
                31,
                24
            ]
        },
        {
            "pgid": "2.ab7",
            "osds": [
                29,
                24
            ]
        },
        {
            "pgid": "2.d25",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "2.dbd",
            "osds": [
                36,
                24
            ]
        },
        {
            "pgid": "2.e69",
            "osds": [
                34,
                24
            ]
        },
        {
            "pgid": "2.e8d",
            "osds": [
                31,
                24
            ]
        },
        {
            "pgid": "2.ef9",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "2.f50",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "2.f5f",
            "osds": [
                30,
                24
            ]
        },
        {
            "pgid": "2.f9b",
            "osds": [
                30,
                24
            ]
        },
        {
            "pgid": "2.fea",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "3.64",
            "osds": [
                37,
                18,
                36
            ]
        },
        {
            "pgid": "3.c6",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "3.f8",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "3.194",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "3.1a9",
            "osds": [
                36,
                27,
                37
            ]
        },
        {
            "pgid": "3.686",
            "osds": [
                37,
                16,
                36
            ]
        },
        {
            "pgid": "3.98f",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "3.a88",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "3.acb",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "3.ae0",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "3.b74",
            "osds": [
                37,
                18,
                36
            ]
        },
        {
            "pgid": "3.c0f",
            "osds": [
                37,
                14,
                36
            ]
        },
        {
            "pgid": "3.c50",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "3.c65",
            "osds": [
                37,
                9,
                36
            ]
        },
        {
            "pgid": "3.d05",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "3.d8f",
            "osds": [
                0,
                37,
                36
            ]
        },
        {
            "pgid": "3.de5",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "3.edd",
            "osds": [
                37,
                1,
                36
            ]
        },
        {
            "pgid": "3.ef5",
            "osds": [
                34,
                31,
                36
            ]
        },
        {
            "pgid": "3.ef6",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "3.ef7",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "3.f01",
            "osds": [
                37,
                26,
                36
            ]
        },
        {
            "pgid": "3.f34",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "3.f35",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "3.f47",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "3.f8f",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "3.fb6",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "3.fdb",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "4.5",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "4.34",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "4.3f",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "4.84",
            "osds": [
                34,
                37
            ]
        },
        {
            "pgid": "4.93",
            "osds": [
                37,
                32,
                36
            ]
        },
        {
            "pgid": "4.156",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "4.165",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "4.17b",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "4.17d",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "4.17e",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "4.182",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "4.194",
            "osds": [
                37,
                26,
                36
            ]
        },
        {
            "pgid": "4.1a3",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "4.1aa",
            "osds": [
                37,
                18,
                36
            ]
        },
        {
            "pgid": "4.1c1",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "4.1c2",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "4.1d6",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "4.649",
            "osds": [
                37,
                7,
                36
            ]
        },
        {
            "pgid": "4.703",
            "osds": [
                36,
                32,
                37
            ]
        },
        {
            "pgid": "4.73d",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "4.787",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "4.90e",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "4.a5a",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "4.ab2",
            "osds": [
                32,
                37,
                36
            ]
        },
        {
            "pgid": "4.ab3",
            "osds": [
                0,
                37,
                36
            ]
        },
        {
            "pgid": "4.ae8",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "4.bc7",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "4.c04",
            "osds": [
                33,
                24,
                37
            ]
        },
        {
            "pgid": "4.c10",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "4.c33",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "4.c46",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "4.d1b",
            "osds": [
                9,
                36,
                37
            ]
        },
        {
            "pgid": "4.d66",
            "osds": [
                37,
                23,
                36
            ]
        },
        {
            "pgid": "4.d73",
            "osds": [
                9,
                37,
                36
            ]
        },
        {
            "pgid": "4.dc4",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "4.e1a",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "4.e3c",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "4.e60",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "4.e80",
            "osds": [
                37,
                8,
                36
            ]
        },
        {
            "pgid": "4.e92",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "4.eb6",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "4.f08",
            "osds": [
                37,
                34
            ]
        },
        {
            "pgid": "4.f2e",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "4.f44",
            "osds": [
                37,
                2,
                36
            ]
        },
        {
            "pgid": "4.f46",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "4.f6f",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "4.fbc",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "4.ff1",
            "osds": [
                37,
                14,
                36
            ]
        },
        {
            "pgid": "4.ff6",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "4.ffc",
            "osds": [
                29,
                37
            ]
        },
        {
            "pgid": "6.62",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "6.90",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "6.191",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "6.2f5",
            "osds": [
                37,
                22,
                36
            ]
        },
        {
            "pgid": "6.6b8",
            "osds": [
                37,
                11,
                36
            ]
        },
        {
            "pgid": "6.6d1",
            "osds": [
                0,
                37,
                36
            ]
        },
        {
            "pgid": "6.809",
            "osds": [
                37,
                7,
                36
            ]
        },
        {
            "pgid": "6.968",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "6.996",
            "osds": [
                37,
                23,
                36
            ]
        },
        {
            "pgid": "6.99e",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "6.a2a",
            "osds": [
                37,
                14,
                36
            ]
        },
        {
            "pgid": "6.a35",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "6.aa5",
            "osds": [
                37,
                15,
                36
            ]
        },
        {
            "pgid": "6.aef",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "6.b3b",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "6.b41",
            "osds": [
                2,
                37,
                36
            ]
        },
        {
            "pgid": "6.bdc",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "6.c6b",
            "osds": [
                27,
                37,
                36
            ]
        },
        {
            "pgid": "6.cb1",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "6.cbb",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "6.dbd",
            "osds": [
                37,
                27,
                36
            ]
        },
        {
            "pgid": "6.e9f",
            "osds": [
                37,
                18,
                36
            ]
        },
        {
            "pgid": "6.ec5",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "6.f26",
            "osds": [
                33,
                37
            ]
        },
        {
            "pgid": "6.f7b",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "6.f8b",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "6.fda",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "6.fdc",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "6.fe0",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "7.3b",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "7.52",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "7.7e",
            "osds": [
                30
            ]
        },
        {
            "pgid": "7.87",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "7.a5",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "7.ea",
            "osds": [
                37,
                23,
                36
            ]
        },
        {
            "pgid": "7.161",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "7.163",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "7.1d4",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "7.1da",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "7.1dd",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "7.1f0",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "7.1fd",
            "osds": [
                34,
                37,
                24
            ]
        },
        {
            "pgid": "7.374",
            "osds": [
                37,
                22,
                36
            ]
        },
        {
            "pgid": "7.5ea",
            "osds": [
                37,
                23,
                36
            ]
        },
        {
            "pgid": "7.7f4",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "7.a31",
            "osds": [
                37,
                22,
                36
            ]
        },
        {
            "pgid": "7.a93",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "7.b2b",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "7.c34",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "7.c50",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "7.cd9",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "7.d1b",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "7.d66",
            "osds": [
                34,
                36
            ]
        },
        {
            "pgid": "7.e20",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "7.e8f",
            "osds": [
                29,
                36
            ]
        },
        {
            "pgid": "7.eaa",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "7.f0b",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "7.f48",
            "osds": [
                30,
                37
            ]
        },
        {
            "pgid": "7.fc2",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "7.fdd",
            "osds": [
                37,
                24
            ]
        },
        {
            "pgid": "8.11",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "8.12",
            "osds": [
                33,
                31
            ]
        },
        {
            "pgid": "8.18",
            "osds": [
                24,
                31
            ]
        },
        {
            "pgid": "8.1d",
            "osds": [
                30,
                31
            ]
        },
        {
            "pgid": "8.37",
            "osds": [
                31,
                34
            ]
        },
        {
            "pgid": "8.5a",
            "osds": [
                34,
                33
            ]
        },
        {
            "pgid": "8.7c",
            "osds": [
                31,
                36
            ]
        },
        {
            "pgid": "8.c0",
            "osds": [
                24,
                34
            ]
        },
        {
            "pgid": "8.c2",
            "osds": [
                34,
                24
            ]
        },
        {
            "pgid": "8.d3",
            "osds": [
                37,
                17,
                36
            ]
        },
        {
            "pgid": "8.e3",
            "osds": [
                29,
                24
            ]
        },
        {
            "pgid": "8.ed",
            "osds": [
                29,
                34
            ]
        },
        {
            "pgid": "8.103",
            "osds": [
                29,
                33
            ]
        },
        {
            "pgid": "8.146",
            "osds": [
                29,
                30
            ]
        },
        {
            "pgid": "8.160",
            "osds": [
                31,
                33
            ]
        },
        {
            "pgid": "8.16f",
            "osds": [
                29,
                24
            ]
        },
        {
            "pgid": "8.171",
            "osds": [
                24,
                37
            ]
        },
        {
            "pgid": "8.175",
            "osds": [
                34,
                30
            ]
        },
        {
            "pgid": "8.17e",
            "osds": [
                31,
                37
            ]
        },
        {
            "pgid": "8.182",
            "osds": [
                34,
                31
            ]
        },
        {
            "pgid": "8.18a",
            "osds": [
                30,
                36
            ]
        },
        {
            "pgid": "8.1a1",
            "osds": [
                33,
                34
            ]
        },
        {
            "pgid": "8.1a4",
            "osds": [
                24,
                30
            ]
        },
        {
            "pgid": "8.1ae",
            "osds": [
                30,
                31
            ]
        },
        {
            "pgid": "8.1b4",
            "osds": [
                33,
                31
            ]
        },
        {
            "pgid": "8.1c3",
            "osds": [
                30,
                24
            ]
        },
        {
            "pgid": "8.1c7",
            "osds": [
                33,
                31
            ]
        },
        {
            "pgid": "8.1ce",
            "osds": [
                24,
                36
            ]
        },
        {
            "pgid": "8.1d0",
            "osds": [
                29,
                30
            ]
        },
        {
            "pgid": "8.1f1",
            "osds": [
                31,
                29
            ]
        },
        {
            "pgid": "8.1f3",
            "osds": [
                24,
                29
            ]
        },
        {
            "pgid": "8.1f5",
            "osds": [
                34,
                29
            ]
        },
        {
            "pgid": "8.240",
            "osds": [
                30,
                24
            ]
        },
        {
            "pgid": "8.25d",
            "osds": [
                34,
                24
            ]
        },
        {
            "pgid": "8.26b",
            "osds": [
                31,
                33
            ]
        },
        {
            "pgid": "8.2ad",
            "osds": [
                34,
                24
            ]
        },
        {
            "pgid": "8.2c5",
            "osds": [
                31,
                24
            ]
        },
        {
            "pgid": "8.2e3",
            "osds": [
                33,
                36
            ]
        },
        {
            "pgid": "8.31b",
            "osds": [
                31,
                24
            ]
        },
        {
            "pgid": "8.36e",
            "osds": [
                37,
                11,
                36
            ]
        },
        {
            "pgid": "8.3c1",
            "osds": [
                34,
                24
            ]
        },
        {
            "pgid": "8.3c7",
            "osds": [
                33,
                24
            ]
        },
        {
            "pgid": "16.32a",
            "osds": [
                33,
                36
            ]
        }
    ],
    "primary_temp": [],
    "blacklist": [
        "2015-04-29 07:13:18.543543",
        "2015-04-29 07:11:05.620929",
        "2015-04-29 07:07:39.090155"
    ],
    "erasure_code_profiles": {
        "default": {
            "directory": "\/usr\/lib\/ceph\/erasure-code",
            "k": "2",
            "m": "1",
            "plugin": "jerasure",
            "technique": "reed_sol_van"
        }
    }
}

[-- Attachment #3: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                                           ` <81216125e573cf00539f61cc090b282b-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
@ 2015-04-29 15:38                                             ` Sage Weil
       [not found]                                               ` <alpine.DEB.2.00.1504290838060.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2015-04-29 15:38 UTC (permalink / raw)
  To: Tuomas Juntunen
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> Hi
> 
> I updated to that version and it seems that something did happen: the OSDs
> stayed up for a while and 'ceph status' got updated. But then in a couple
> of minutes, they all went down the same way.
> 
> I have attached a new 'ceph osd dump -f json-pretty' and got a new log from
> one of the OSDs with osd debug = 20:
> http://beta.xaasbox.com/ceph/ceph-osd.15.log

Sam mentioned that you had said earlier that this was not critical data?  
If not, I think the simplest thing is to just drop those pools.  The 
important thing (from my perspective at least :) is that we understand the 
root cause and can prevent this in the future.

sage
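
A concrete sketch of what 'drop those pools' could look like on the CLI,
assuming the img/images pools from the commands quoted below are the ones
affected; double-check the pool names and any remaining tier relationship
against 'ceph osd dump' before running anything like this:

  # undo any leftover cache-tier wiring, if it is still in place
  ceph osd tier remove-overlay img
  ceph osd tier remove img images

  # hammer asks for the pool name twice plus the confirmation flag
  ceph osd pool delete images images --yes-i-really-really-mean-it
  ceph osd pool delete img img --yes-i-really-really-mean-it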


> 
> Thank you!
> 
> Br,
> Tuomas
> 
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org] 
> Sent: 28. huhtikuuta 2015 23:57
> To: Tuomas Juntunen
> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
> operations most of the OSD's went down
> 
> Hi Tuomas,
> 
> I've pushed an updated wip-hammer-snaps branch.  Can you please try it?  
> The build will appear here
> 
> 	
> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
> 
> (or a similar url; adjust for your distro).
> 
> Thanks!
> sage
> 
> 
> On Tue, 28 Apr 2015, Sage Weil wrote:
> 
> > [adding ceph-devel]
> > 
> > Okay, I see the problem.  This seems to be unrelated ot the giant -> 
> > hammer move... it's a result of the tiering changes you made:
> > 
> > > > > > > > The following:
> > > > > > > > 
> > > > > > > > ceph osd tier add img images --force-nonempty
> > > > > > > > ceph osd tier cache-mode images forward
> > > > > > > > ceph osd tier set-overlay img images
> > 
> > Specifically, --force-nonempty bypassed important safety checks.
> > 
> > 1. images had snapshots (and removed_snaps)
> > 
> > 2. images was added as a tier *of* img, and img's removed_snaps was 
> > copied to images, clobbering the removed_snaps value (see
> > OSDMap::Incremental::propagate_snaps_to_tiers)
> > 
> > 3. tiering relation was undone, but removed_snaps was still gone
> > 
> > 4. on OSD startup, when we load the PG, removed_snaps is initialized 
> > with the older map.  later, in PGPool::update(), we assume that 
> > removed_snaps alwasy grows (never shrinks) and we trigger an assert.
> > 
> > To fix this I think we need to do 2 things:
> > 
> > 1. make the OSD forgiving out removed_snaps getting smaller.  This is 
> > probably a good thing anyway: once we know snaps are removed on all 
> > OSDs we can prune the interval_set in the OSDMap.  Maybe.
> > 
> > 2. Fix the mon to prevent this from happening, *even* when 
> > --force-nonempty is specified.  (This is the root cause.)
> > 
> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> > 
> > sage
> > 
> >     
> > 
> > > > > > > > 
> > > > > > > > Idea was to make images as a tier to img, move data to img 
> > > > > > > > then change
> > > > > > > clients to use the new img pool.
> > > > > > > > 
> > > > > > > > Br,
> > > > > > > > Tuomas
> > > > > > > > 
> > > > > > > > > Can you explain exactly what you mean by:
> > > > > > > > >
> > > > > > > > > "Also I created one pool for tier to be able to move 
> > > > > > > > > data without
> > > > > > > outage."
> > > > > > > > >
> > > > > > > > > -Sam
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "tuomas juntunen" 
> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer 
> > > > > > > > > and after some basic operations most of the OSD's went 
> > > > > > > > > down
> > > > > > > > >
> > > > > > > > > Hi
> > > > > > > > >
> > > > > > > > > Any solution for this yet?
> > > > > > > > >
> > > > > > > > > Br,
> > > > > > > > > Tuomas
> > > > > > > > >
> > > > > > > > >> It looks like you may have hit
> > > > > > > > >> http://tracker.ceph.com/issues/7915
> > > > > > > > >>
> > > > > > > > >> Ian R. Colle
> > > > > > > > >> Global Director
> > > > > > > > >> of Software Engineering Red Hat (Inktank is now part of 
> > > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
> > > > > > > > >> http://www.twitter.com/ircolle
> > > > > > > > >> Cell: +1.303.601.7713
> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > > > > > > > >>
> > > > > > > > >> ----- Original Message -----
> > > > > > > > >> From: "tuomas juntunen" 
> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and 
> > > > > > > > >> after some basic operations most of the OSD's went down
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> > > > > > > > >>
> > > > > > > > >> Then created new pools and deleted some old ones. Also 
> > > > > > > > >> I created one pool for tier to be able to move data 
> > > > > > > > >> without
> > > outage.
> > > > > > > > >>
> > > > > > > > >> After these operations all but 10 OSD's are down and 
> > > > > > > > >> creating this kind of messages to logs, I get more than 
> > > > > > > > >> 100gb of these in a
> > > > > > night:
> > > > > > > > >>
> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23
> > > pg_epoch:
> > > > 
> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > > >> n=0
> > > > > > > > >> ec=1 les/c
> > > > > > > > >> 16609/16659
> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > > >> pi=15659-16589/42
> > > > > > > > >> crt=8480'7 lcod
> > > > > > > > >> 0'0 inactive NOTIFY] enter Started
> > > > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > > >> n=0
> > > > > > > > >> ec=1 les/c
> > > > > > > > >> 16609/16659
> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > > >> pi=15659-16589/42
> > > > > > > > >> crt=8480'7 lcod
> > > > > > > > >> 0'0 inactive NOTIFY] enter Start
> > > > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > > >> n=0
> > > > > > > > >> ec=1 les/c
> > > > > > > > >> 16609/16659
> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > > >> pi=15659-16589/42
> > > > > > > > >> crt=8480'7 lcod
> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > > >> n=0
> > > > > > > > >> ec=1 les/c
> > > > > > > > >> 16609/16659
> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > > >> pi=15659-16589/42
> > > > > > > > >> crt=8480'7 lcod
> > > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> > > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 
> > > > > > > > >> n=0
> > > > > > > > >> ec=1 les/c
> > > > > > > > >> 16609/16659
> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > > > > > > > >> pi=15659-16589/42
> > > > > > > > >> crt=8480'7 lcod
> > > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray
> > > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > > >> les/c
> > > > > > > > >> 17879/17879
> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > > >> inactive NOTIFY] exit Reset 0.119467 4 0.000037
> > > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > > >> les/c
> > > > > > > > >> 17879/17879
> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > > >> inactive NOTIFY] enter Started
> > > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > > >> les/c
> > > > > > > > >> 17879/17879
> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > > >> inactive NOTIFY] enter Start
> > > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > > >> les/c
> > > > > > > > >> 17879/17879
> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > > >> inactive NOTIFY]
> > > > > > > > >> state<Start>: transitioning to Stray
> > > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > > >> les/c
> > > > > > > > >> 17879/17879
> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > > >> inactive NOTIFY] exit Start 0.000020 0 0.000000
> > > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 
> > > > > > > > >> les/c
> > > > > > > > >> 17879/17879
> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 
> > > > > > > > >> inactive NOTIFY] enter Started/Stray
> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 0.000165
> > > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 inactive] enter Started
> > > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 inactive] enter Start
> > > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 inactive]
> > > > > > > > >> state<Start>: transitioning to Primary
> > > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000
> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 inactive] enter Started/Primary
> > > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 inactive] enter Started/Primary/Peering
> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5 
> > > > > > > > >> osd.23
> > > > pg_epoch:
> > > > > 
> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> > > > > > > > >> 16127/16344
> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 
> > > > > > > > >> 0'0 peering] enter Started/Primary/Peering/GetInfo
> > > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1
> > > > > > > ./include/interval_set.h:
> > > > > > > > >> In
> > > > > > > > >> function 'void interval_set<T>::erase(T, T) [with T =
> > > snapid_t]' 
> > > > > > > > >> thread
> > > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> > > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 
> > > > > > > > >> 0)
> > > > > > > > >>
> > > > > > > > >>  ceph version 0.94.1
> > > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > >> int, char
> > > > > > > > >> const*)+0x8b)
> > > > > > > > >> [0xbc271b]
> > > > > > > > >>  2: 
> > > > > > > > >> (interval_set<snapid_t>::subtract(interval_set<snapid_t
> > > > > > > > >> >
> > > > > > > > >> const&)+0xb0) [0x82cd50]
> > > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap
> > > > > > > > >> const>)+0x52e) [0x80113e]
> > > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> > > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>, 
> > > > > > > > >> const>std::vector<int,
> > > > > > > > >> std::allocator<int> >&, int, std::vector<int, 
> > > > > > > > >> std::allocator<int>
> > > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, 
> > > > > > > > >> ThreadPool::TPHandle&, PG::RecoveryCtx*, 
> > > > > > > > >> std::set<boost::intrusive_ptr<PG>,
> > > > > > > > >> std::less<boost::intrusive_ptr<PG> >, 
> > > > > > > > >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) 
> > > > > > > > >> [0x6b0e43]
> > > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*,
> > > > > > > > >> std::allocator<PG*>
> > > > > > > > >> > const&,
> > > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> > > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*,
> > > > > > > > >> std::allocator<PG*>
> > > > > > > > >> > const&,
> > > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]
> > > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> > > > > > > > >> [0xbb38ae]
> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> > > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> > > > > > > > >>
> > > > > > > > >> Also by monitoring (ceph -w) I get the following 
> > > > > > > > >> messages, also lots of
> > > > > > > them.
> > > > > > > > >>
> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
> > > > > > > 10.20.0.13:0/1174409'
> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush 
> > > > > > > > >> create-or-move",
> > > > "args":
> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]:
> 
> > > > > > > > >> dispatch
> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
> > > > > > > 10.20.0.13:0/1174483'
> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush 
> > > > > > > > >> create-or-move",
> > > > "args":
> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]:
> 
> > > > > > > > >> dispatch
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are 
> > > > > > > > >> also mons and mds's to save servers. All run Ubuntu
> 14.04.2.
> > > > > > > > >>
> > > > > > > > >> I have pretty much tried everything I could think of.
> > > > > > > > >>
> > > > > > > > >> Restarting daemons doesn't help.
> > > > > > > > >>
> > > > > > > > >> Any help would be appreciated. I can also provide more 
> > > > > > > > >> logs if necessary. They just seem to get pretty large 
> > > > > > > > >> in few
> > > moments.
> > > > > > > > >>
> > > > > > > > >> Thank you
> > > > > > > > >> Tuomas
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> _______________________________________________
> > > > > > > > >> ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > _______________________________________________
> > > > > > > > > ceph-users mailing list
> > > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > 
> > > > > > > > 
> > > > > > > > _______________________________________________
> > > > > > > > ceph-users mailing list
> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > _______________________________________________
> > > > > > > > ceph-users mailing list
> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                                               ` <alpine.DEB.2.00.1504290838060.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-04-30  3:31                                                 ` tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g
       [not found]                                                   ` <928ebb7320e4eb07f14071e997ed7be2-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g @ 2015-04-30  3:31 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hey

Yes, I can drop the images data. Do you think this will fix it?


Br,

Tuomas
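
A rough sketch of how to see which pools actually lost their removed_snaps,
before and after dropping anything (the exact field layout is an assumption
based on typical hammer 'ceph osd dump' output):

  # plain dump: each pool line should end with its removed_snaps intervals
  ceph osd dump | grep '^pool'

  # or pull the same fields out of a json-pretty dump like the one attached
  ceph osd dump -f json-pretty > osd-dump.json
  grep -E '"pool_name"|"removed_snaps"' osd-dump.json

  # once the pools are gone, watch the OSDs come back up and peer
  ceph -w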

> On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
>> Hi
>>
>> I updated to that version and it seems that something did happen: the OSDs
>> stayed up for a while and 'ceph status' got updated. But then in a couple
>> of minutes, they all went down the same way.
>>
>> I have attached a new 'ceph osd dump -f json-pretty' and got a new log from
>> one of the OSDs with osd debug = 20:
>> http://beta.xaasbox.com/ceph/ceph-osd.15.log
>
> Sam mentioned that you had said earlier that this was not critical data?
> If not, I think the simplest thing is to just drop those pools.  The
> important thing (from my perspective at least :) is that we understand the
> root cause and can prevent this in the future.
>
> sage
>
>
>>
>> Thank you!
>>
>> Br,
>> Tuomas
>>
>>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
>> Sent: 28. huhtikuuta 2015 23:57
>> To: Tuomas Juntunen
>> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
>> operations most of the OSD's went down
>>
>> Hi Tuomas,
>>
>> I've pushed an updated wip-hammer-snaps branch.  Can you please try it?
>> The build will appear here
>>
>>
>> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
>>
>> (or a similar url; adjust for your distro).
>>
>> Thanks!
>> sage
>>
>>
>> On Tue, 28 Apr 2015, Sage Weil wrote:
>>
>> > [adding ceph-devel]
>> >
>> > Okay, I see the problem.  This seems to be unrelated ot the giant ->
>> > hammer move... it's a result of the tiering changes you made:
>> >
>> > > > > > > > The following:
>> > > > > > > >
>> > > > > > > > ceph osd tier add img images --force-nonempty
>> > > > > > > > ceph osd tier cache-mode images forward
>> > > > > > > > ceph osd tier set-overlay img images
>> >
>> > Specifically, --force-nonempty bypassed important safety checks.
>> >
>> > 1. images had snapshots (and removed_snaps)
>> >
>> > 2. images was added as a tier *of* img, and img's removed_snaps was
>> > copied to images, clobbering the removed_snaps value (see
>> > OSDMap::Incremental::propagate_snaps_to_tiers)
>> >
>> > 3. tiering relation was undone, but removed_snaps was still gone
>> >
>> > 4. on OSD startup, when we load the PG, removed_snaps is initialized
>> > with the older map.  later, in PGPool::update(), we assume that
>> > removed_snaps alwasy grows (never shrinks) and we trigger an assert.
>> >
>> > To fix this I think we need to do 2 things:
>> >
>> > 1. make the OSD forgiving out removed_snaps getting smaller.  This is
>> > probably a good thing anyway: once we know snaps are removed on all
>> > OSDs we can prune the interval_set in the OSDMap.  Maybe.
>> >
>> > 2. Fix the mon to prevent this from happening, *even* when
>> > --force-nonempty is specified.  (This is the root cause.)
>> >
>> > I've opened http://tracker.ceph.com/issues/11493 to track this.
>> >
>> > sage
>> >
>> >
>> >
>> > > > > > > >
>> > > > > > > > Idea was to make images as a tier to img, move data to img
>> > > > > > > > then change
>> > > > > > > clients to use the new img pool.
>> > > > > > > >
>> > > > > > > > Br,
>> > > > > > > > Tuomas
>> > > > > > > >
>> > > > > > > > > Can you explain exactly what you mean by:
>> > > > > > > > >
>> > > > > > > > > "Also I created one pool for tier to be able to move
>> > > > > > > > > data without
>> > > > > > > outage."
>> > > > > > > > >
>> > > > > > > > > -Sam
>> > > > > > > > > ----- Original Message -----
>> > > > > > > > > From: "tuomas juntunen"
>> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
>> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
>> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer
>> > > > > > > > > and after some basic operations most of the OSD's went
>> > > > > > > > > down
>> > > > > > > > >
>> > > > > > > > > Hi
>> > > > > > > > >
>> > > > > > > > > Any solution for this yet?
>> > > > > > > > >
>> > > > > > > > > Br,
>> > > > > > > > > Tuomas
>> > > > > > > > >
>> > > > > > > > >> It looks like you may have hit
>> > > > > > > > >> http://tracker.ceph.com/issues/7915
>> > > > > > > > >>
>> > > > > > > > >> Ian R. Colle
>> > > > > > > > >> Global Director
>> > > > > > > > >> of Software Engineering Red Hat (Inktank is now part of
>> > > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
>> > > > > > > > >> http://www.twitter.com/ircolle
>> > > > > > > > >> Cell: +1.303.601.7713
>> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>> > > > > > > > >>
>> > > > > > > > >> ----- Original Message -----
>> > > > > > > > >> From: "tuomas juntunen"
>> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
>> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
>> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and
>> > > > > > > > >> after some basic operations most of the OSD's went down
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
>> > > > > > > > >>
>> > > > > > > > >> Then created new pools and deleted some old ones. Also
>> > > > > > > > >> I created one pool for tier to be able to move data
>> > > > > > > > >> without
>> > > outage.
>> > > > > > > > >>
>> > > > > > > > >> After these operations all but 10 OSD's are down and
>> > > > > > > > >> creating this kind of messages to logs, I get more than
>> > > > > > > > >> 100gb of these in a
>> > > > > > night:
>> > > > > > > > >>
>> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23
>> > > pg_epoch:
>> > > >
>> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
>> > > > > > > > >> n=0
>> > > > > > > > >> ec=1 les/c
>> > > > > > > > >> 16609/16659
>> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> > > > > > > > >> pi=15659-16589/42
>> > > > > > > > >> crt=8480'7 lcod
>> > > > > > > > >> 0'0 inactive NOTIFY] enter Started
>> > > > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
>> > > > > > > > >> n=0
>> > > > > > > > >> ec=1 les/c
>> > > > > > > > >> 16609/16659
>> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> > > > > > > > >> pi=15659-16589/42
>> > > > > > > > >> crt=8480'7 lcod
>> > > > > > > > >> 0'0 inactive NOTIFY] enter Start
>> > > > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
>> > > > > > > > >> n=0
>> > > > > > > > >> ec=1 les/c
>> > > > > > > > >> 16609/16659
>> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> > > > > > > > >> pi=15659-16589/42
>> > > > > > > > >> crt=8480'7 lcod
>> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
>> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
>> > > > > > > > >> n=0
>> > > > > > > > >> ec=1 les/c
>> > > > > > > > >> 16609/16659
>> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> > > > > > > > >> pi=15659-16589/42
>> > > > > > > > >> crt=8480'7 lcod
>> > > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
>> > > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
>> > > > > > > > >> n=0
>> > > > > > > > >> ec=1 les/c
>> > > > > > > > >> 16609/16659
>> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> > > > > > > > >> pi=15659-16589/42
>> > > > > > > > >> crt=8480'7 lcod
>> > > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray
>> > > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
>> > > > > > > > >> les/c
>> > > > > > > > >> 17879/17879
>> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
>> > > > > > > > >> inactive NOTIFY] exit Reset 0.119467 4 0.000037
>> > > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
>> > > > > > > > >> les/c
>> > > > > > > > >> 17879/17879
>> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
>> > > > > > > > >> inactive NOTIFY] enter Started
>> > > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
>> > > > > > > > >> les/c
>> > > > > > > > >> 17879/17879
>> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
>> > > > > > > > >> inactive NOTIFY] enter Start
>> > > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
>> > > > > > > > >> les/c
>> > > > > > > > >> 17879/17879
>> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
>> > > > > > > > >> inactive NOTIFY]
>> > > > > > > > >> state<Start>: transitioning to Stray
>> > > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
>> > > > > > > > >> les/c
>> > > > > > > > >> 17879/17879
>> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
>> > > > > > > > >> inactive NOTIFY] exit Start 0.000020 0 0.000000
>> > > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
>> > > > > > > > >> les/c
>> > > > > > > > >> 17879/17879
>> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
>> > > > > > > > >> inactive NOTIFY] enter Started/Stray
>> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 0.000165
>> > > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 inactive] enter Started
>> > > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 inactive] enter Start
>> > > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 inactive]
>> > > > > > > > >> state<Start>: transitioning to Primary
>> > > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000
>> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 inactive] enter Started/Primary
>> > > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 inactive] enter Started/Primary/Peering
>> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5
>> > > > > > > > >> osd.23
>> > > > pg_epoch:
>> > > > >
>> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
>> > > > > > > > >> 16127/16344
>> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
>> > > > > > > > >> 0'0 peering] enter Started/Primary/Peering/GetInfo
>> > > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1
>> > > > > > > ./include/interval_set.h:
>> > > > > > > > >> In
>> > > > > > > > >> function 'void interval_set<T>::erase(T, T) [with T =
>> > > snapid_t]'
>> > > > > > > > >> thread
>> > > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
>> > > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >=
>> > > > > > > > >> 0)
>> > > > > > > > >>
>> > > > > > > > >>  ceph version 0.94.1
>> > > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*,
>> > > > > > > > >> int, char
>> > > > > > > > >> const*)+0x8b)
>> > > > > > > > >> [0xbc271b]
>> > > > > > > > >>  2:
>> > > > > > > > >> (interval_set<snapid_t>::subtract(interval_set<snapid_t
>> > > > > > > > >> >
>> > > > > > > > >> const&)+0xb0) [0x82cd50]
>> > > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap
>> > > > > > > > >> const>)+0x52e) [0x80113e]
>> > > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
>> > > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>,
>> > > > > > > > >> const>std::vector<int,
>> > > > > > > > >> std::allocator<int> >&, int, std::vector<int,
>> > > > > > > > >> std::allocator<int>
>> > > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
>> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*,
>> > > > > > > > >> ThreadPool::TPHandle&, PG::RecoveryCtx*,
>> > > > > > > > >> std::set<boost::intrusive_ptr<PG>,
>> > > > > > > > >> std::less<boost::intrusive_ptr<PG> >,
>> > > > > > > > >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3)
>> > > > > > > > >> [0x6b0e43]
>> > > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*,
>> > > > > > > > >> std::allocator<PG*>
>> > > > > > > > >> > const&,
>> > > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]
>> > > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*,
>> > > > > > > > >> std::allocator<PG*>
>> > > > > > > > >> > const&,
>> > > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]
>> > > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
>> > > > > > > > >> [0xbb38ae]
>> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
>> > > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
>> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
>> > > > > > > > >>
>> > > > > > > > >> Also by monitoring (ceph -w) I get the following
>> > > > > > > > >> messages, also lots of
>> > > > > > > them.
>> > > > > > > > >>
>> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
>> > > > > > > 10.20.0.13:0/1174409'
>> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush
>> > > > > > > > >> create-or-move",
>> > > > "args":
>> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]:
>>
>> > > > > > > > >> dispatch
>> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
>> > > > > > > 10.20.0.13:0/1174483'
>> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush
>> > > > > > > > >> create-or-move",
>> > > > "args":
>> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]:
>>
>> > > > > > > > >> dispatch
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are
>> > > > > > > > >> also mons and mds's to save servers. All run Ubuntu
>> 14.04.2.
>> > > > > > > > >>
>> > > > > > > > >> I have pretty much tried everything I could think of.
>> > > > > > > > >>
>> > > > > > > > >> Restarting daemons doesn't help.
>> > > > > > > > >>
>> > > > > > > > >> Any help would be appreciated. I can also provide more
>> > > > > > > > >> logs if necessary. They just seem to get pretty large
>> > > > > > > > >> in few
>> > > moments.
>> > > > > > > > >>
>> > > > > > > > >> Thank you
>> > > > > > > > >> Tuomas
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >> _______________________________________________
>> > > > > > > > >> ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > _______________________________________________
>> > > > > > > > > ceph-users mailing list
>> > > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > _______________________________________________
>> > > > > > > > ceph-users mailing list
>> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > _______________________________________________
>> > > > > > > > ceph-users mailing list
>> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                                                   ` <928ebb7320e4eb07f14071e997ed7be2-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
@ 2015-04-30 15:23                                                     ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2015-04-30 15:23 UTC (permalink / raw)
  To: Tuomas Juntunen
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Thu, 30 Apr 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> Hey
> 
> Yes I can drop the images data, you think this will fix it?

It's a slightly different assert that (I believe) should not trigger once 
the pool is deleted.  Please give that a try and if you still hit it I'll 
whip up a workaround.
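
For reference, dropping the pools would look something like this (just a sketch,
using the pool names img and images from earlier in the thread; pool deletion is
irreversible, so double-check the names):

  ceph osd pool delete images images --yes-i-really-really-mean-it
  ceph osd pool delete img img --yes-i-really-really-mean-it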

Thanks!
sage

 > 
> 
> Br,
> 
> Tuomas
> 
> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> >> Hi
> >>
> >> I updated that version and it seems that something did happen, the osd's
> >> stayed up for a while and 'ceph status' got updated. But then in couple of
> >> minutes, they all went down the same way.
> >>
> >> I have attached new 'ceph osd dump -f json-pretty' and got a new log from
> >> one of the osd's with osd debug = 20,
> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> >
> > Sam mentioned that you had said earlier that this was not critical data?
> > If not, I think the simplest thing is to just drop those pools.  The
> > important thing (from my perspective at least :) is that we understand the
> > root cause and can prevent this in the future.
> >
> > sage
> >
> >
> >>
> >> Thank you!
> >>
> >> Br,
> >> Tuomas
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
> >> Sent: 28. huhtikuuta 2015 23:57
> >> To: Tuomas Juntunen
> >> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
> >> operations most of the OSD's went down
> >>
> >> Hi Tuomas,
> >>
> >> I've pushed an updated wip-hammer-snaps branch.  Can you please try it?
> >> The build will appear here
> >>
> >>
> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
> >>
> >> (or a similar url; adjust for your distro).
> >>
> >> Thanks!
> >> sage
> >>
> >>
> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> >>
> >> > [adding ceph-devel]
> >> >
> >> > Okay, I see the problem.  This seems to be unrelated ot the giant ->
> >> > hammer move... it's a result of the tiering changes you made:
> >> >
> >> > > > > > > > The following:
> >> > > > > > > >
> >> > > > > > > > ceph osd tier add img images --force-nonempty ceph osd
> >> > > > > > > > tier cache-mode images forward ceph osd tier set-overlay
> >> > > > > > > > img images
> >> >
> >> > Specifically, --force-nonempty bypassed important safety checks.
> >> >
> >> > 1. images had snapshots (and removed_snaps)
> >> >
> >> > 2. images was added as a tier *of* img, and img's removed_snaps was
> >> > copied to images, clobbering the removed_snaps value (see
> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
> >> >
> >> > 3. tiering relation was undone, but removed_snaps was still gone
> >> >
> >> > 4. on OSD startup, when we load the PG, removed_snaps is initialized
> >> > with the older map.  later, in PGPool::update(), we assume that
> >> > removed_snaps alwasy grows (never shrinks) and we trigger an assert.
> >> >
> >> > To fix this I think we need to do 2 things:
> >> >
> >> > 1. make the OSD forgiving out removed_snaps getting smaller.  This is
> >> > probably a good thing anyway: once we know snaps are removed on all
> >> > OSDs we can prune the interval_set in the OSDMap.  Maybe.
> >> >
> >> > 2. Fix the mon to prevent this from happening, *even* when
> >> > --force-nonempty is specified.  (This is the root cause.)
> >> >
> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> >> >
> >> > sage
> >> >
> >> >
> >> >
> >> > > > > > > >
> >> > > > > > > > Idea was to make images as a tier to img, move data to img
> >> > > > > > > > then change
> >> > > > > > > clients to use the new img pool.
> >> > > > > > > >
> >> > > > > > > > Br,
> >> > > > > > > > Tuomas
> >> > > > > > > >
> >> > > > > > > > > Can you explain exactly what you mean by:
> >> > > > > > > > >
> >> > > > > > > > > "Also I created one pool for tier to be able to move
> >> > > > > > > > > data without
> >> > > > > > > outage."
> >> > > > > > > > >
> >> > > > > > > > > -Sam
> >> > > > > > > > > ----- Original Message -----
> >> > > > > > > > > From: "tuomas juntunen"
> >> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer
> >> > > > > > > > > and after some basic operations most of the OSD's went
> >> > > > > > > > > down
> >> > > > > > > > >
> >> > > > > > > > > Hi
> >> > > > > > > > >
> >> > > > > > > > > Any solution for this yet?
> >> > > > > > > > >
> >> > > > > > > > > Br,
> >> > > > > > > > > Tuomas
> >> > > > > > > > >
> >> > > > > > > > >> It looks like you may have hit
> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> >> > > > > > > > >>
> >> > > > > > > > >> Ian R. Colle
> >> > > > > > > > >> Global Director
> >> > > > > > > > >> of Software Engineering Red Hat (Inktank is now part of
> >> > > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
> >> > > > > > > > >> http://www.twitter.com/ircolle
> >> > > > > > > > >> Cell: +1.303.601.7713
> >> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >> > > > > > > > >>
> >> > > > > > > > >> ----- Original Message -----
> >> > > > > > > > >> From: "tuomas juntunen"
> >> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and
> >> > > > > > > > >> after some basic operations most of the OSD's went down
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> >> > > > > > > > >>
> >> > > > > > > > >> Then created new pools and deleted some old ones. Also
> >> > > > > > > > >> I created one pool for tier to be able to move data
> >> > > > > > > > >> without
> >> > > outage.
> >> > > > > > > > >>
> >> > > > > > > > >> After these operations all but 10 OSD's are down and
> >> > > > > > > > >> creating this kind of messages to logs, I get more than
> >> > > > > > > > >> 100gb of these in a
> >> > > > > > night:
> >> > > > > > > > >>
> >> > > > > > > > >> [osd log excerpt and interval_set.h assert backtrace snipped; quoted in full earlier in the thread]
> >> > > > > > > > >>
> >> > > > > > > > >> Also by monitoring (ceph -w) I get the following
> >> > > > > > > > >> messages, also lots of
> >> > > > > > > them.
> >> > > > > > > > >>
> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
> >> > > > > > > 10.20.0.13:0/1174409'
> >> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush
> >> > > > > > > > >> create-or-move",
> >> > > > "args":
> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]:
> >>
> >> > > > > > > > >> dispatch
> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
> >> > > > > > > 10.20.0.13:0/1174483'
> >> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush
> >> > > > > > > > >> create-or-move",
> >> > > > "args":
> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]:
> >>
> >> > > > > > > > >> dispatch
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are
> >> > > > > > > > >> also mons and mds's to save servers. All run Ubuntu
> >> 14.04.2.
> >> > > > > > > > >>
> >> > > > > > > > >> I have pretty much tried everything I could think of.
> >> > > > > > > > >>
> >> > > > > > > > >> Restarting daemons doesn't help.
> >> > > > > > > > >>
> >> > > > > > > > >> Any help would be appreciated. I can also provide more
> >> > > > > > > > >> logs if necessary. They just seem to get pretty large
> >> > > > > > > > >> in few
> >> > > moments.
> >> > > > > > > > >>
> >> > > > > > > > >> Thank you
> >> > > > > > > > >> Tuomas
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> _______________________________________________
> >> > > > > > > > >> ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > _______________________________________________
> >> > > > > > > > > ceph-users mailing list
> >> > > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > _______________________________________________
> >> > > > > > > > ceph-users mailing list
> >> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > _______________________________________________
> >> > > > > > > > ceph-users mailing list
> >> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >> >
> >>
> >
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                       ` <alpine.DEB.2.00.1505041019300.24939-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-05-04 17:28                         ` Tuomas Juntunen
  0 siblings, 0 replies; 13+ messages in thread
From: Tuomas Juntunen @ 2015-05-04 17:28 UTC (permalink / raw)
  To: 'Sage Weil'
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi

Ok, restarting the osd's did it. I thought I had restarted the daemons after it
was almost clean, but it seems I didn't.


Now everything is running fine.

Thanks again!

Br,
Tuomas


-----Original Message-----
From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org] 
Sent: 4. toukokuuta 2015 20:21
To: Tuomas Juntunen
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic
operations most of the OSD's went down

On Mon, 4 May 2015, Tuomas Juntunen wrote:
> 5827504:        10.20.0.11:6800/3382530 'ceph1' mds.0.262 up:rejoin seq 33159

This is why it is 'degraded'... stuck in up:rejoin state.
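
(For reference, re-running 'ceph mds dump', e.g. 'ceph mds dump | grep up:', is a
quick way to watch whether the mds moves from up:rejoin to up:active once the
stuck pgs clear.)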

> The active+clean+replay has been there for a day now, so there must be 
> something that is not ok, if it should've gone away in a couple of minutes.

...possibly because the pg is stuck in replay state.  Can you do 'ceph pg
<pgid> query' on one of them?  And maybe see if bouncing one of the OSDs
for the pg clears it up (ceph osd down $osdid).
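
Concretely, something like this (pg 2.189 and osd.23 below are just examples
taken from the log quoted earlier in the thread):

  ceph pg dump | grep replay     # list the pgs stuck in the replay state
  ceph pg 2.189 query            # inspect one of them
  ceph osd down 23               # mark one of its OSDs down so it re-peers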

sage


> 
> 
> Thanks
> 
> Tuomas
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
> Sent: 4. toukokuuta 2015 18:29
> To: Tuomas Juntunen
> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some 
> basic operations most of the OSD's went down
> 
> On Mon, 4 May 2015, Tuomas Juntunen wrote:
> > Hi
> > 
> > Thanks Sage, I got it working now. Everything else seems to be ok, 
> > except mds is reporting "mds cluster is degraded", not sure what could be wrong.
> > Mds is running and all osds are up and pg's are active+clean and
> > active+clean+replay.
> 
> Great!  The 'replay' part should clear after a minute or two.
> 
> > Had to delete some empty pools which were created while the osd's 
> > were not working and recovery started to go through.
> > 
> > Seems mds is not that stable, this isn't the first time it goes degraded.
> > Before it just started to work, but now I just can't get it back working.
> 
> What does 'ceph mds dump' say?
> 
> sage
> 
> > 
> > Thanks
> > 
> > Br,
> > Tuomas
> > 
> > 
> > -----Original Message-----
> > From: tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org 
> > [mailto:tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org]
> > Sent: 1. toukokuuta 2015 21:14
> > To: Sage Weil
> > Cc: tuomas.juntunen; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; 
> > ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after 
> > some basic operations most of the OSD's went down
> > 
> > Thanks, I'll do this when the commit is available and report back.
> > 
> > And indeed, I'll change to the official ones after everything is ok.
> > 
> > Br,
> > Tuomas
> > 
> > > On Fri, 1 May 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> > >> Hi
> > >>
> > >> I deleted the images and img pools and started osd's, they still die.
> > >>
> > >> Here's a log of one of the osd's after this, if you need it.
> > >>
> > >> http://beta.xaasbox.com/ceph/ceph-osd.19.log
> > >
> > > I've pushed another commit that should avoid this case, sha1 
> > > 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
> > >
> > > Note that once the pools are fully deleted (shouldn't take too 
> > > long once the osds are up and stabilize) you should switch back to 
> > > the normal packages that don't have these workarounds.
> > >
> > > sage
> > >
> > >
> > >
> > >>
> > >> Br,
> > >> Tuomas
> > >>
> > >>
> > >> > Thanks man. I'll try it tomorrow. Have a good one.
> > >> >
> > >> > Br,T
> > >> >
> > >> > -------- Original message --------
> > >> > From: Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
> > >> > Date: 30/04/2015  18:23  (GMT+02:00)
> > >> > To: Tuomas Juntunen <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > >> > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > >> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and 
> > >> > after some basic
> > >>
> > >> > operations most of the OSD's went down
> > >> >
> > >> > On Thu, 30 Apr 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> > >> >> Hey
> > >> >>
> > >> >> Yes I can drop the images data, you think this will fix it?
> > >> >
> > >> > It's a slightly different assert that (I believe) should not 
> > >> > trigger once the pool is deleted.  Please give that a try and 
> > >> > if you still hit it I'll whip up a workaround.
> > >> >
> > >> > Thanks!
> > >> > sage
> > >> >
> > >> >  >
> > >> >>
> > >> >> Br,
> > >> >>
> > >> >> Tuomas
> > >> >>
> > >> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> > >> >> >> Hi
> > >> >> >>
> > >> >> >> I updated that version and it seems that something did 
> > >> >> >> happen, the osd's stayed up for a while and 'ceph status' 
> > >> >> >> got
> updated.
> > >> >> >> But then in couple
> > >> of
> > >> >> >> minutes, they all went down the same way.
> > >> >> >>
> > >> >> >> I have attached new 'ceph osd dump -f json-pretty' and got 
> > >> >> >> a new log
> > >> from
> > >> >> >> one of the osd's with osd debug = 20, 
> > >> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> > >> >> >
> > >> >> > Sam mentioned that you had said earlier that this was not 
> > >> >> > critical
> > data?
> > >> >> > If not, I think the simplest thing is to just drop those 
> > >> >> > pools. The important thing (from my perspective at least :) 
> > >> >> > is that we understand
> > >> the
> > >> >> > root cause and can prevent this in the future.
> > >> >> >
> > >> >> > sage
> > >> >> >
> > >> >> >
> > >> >> >>
> > >> >> >> Thank you!
> > >> >> >>
> > >> >> >> Br,
> > >> >> >> Tuomas
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> -----Original Message-----
> > >> >> >> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
> > >> >> >> Sent: 28. huhtikuuta 2015 23:57
> > >> >> >> To: Tuomas Juntunen
> > >> >> >> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > >> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and 
> > >> >> >> after some
> > >> basic
> > >> >> >> operations most of the OSD's went down
> > >> >> >>
> > >> >> >> Hi Tuomas,
> > >> >> >>
> > >> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you 
> > >> >> >> please
> > try it?
> > >> >> >> The build will appear here
> > >> >> >>
> > >> >> >>
> > >> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
> > >> >> >>
> > >> >> >> (or a similar url; adjust for your distro).
> > >> >> >>
> > >> >> >> Thanks!
> > >> >> >> sage
> > >> >> >>
> > >> >> >>
> > >> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> > >> >> >>
> > >> >> >> > [adding ceph-devel]
> > >> >> >> >
> > >> >> >> > Okay, I see the problem.  This seems to be unrelated ot 
> > >> >> >> > the giant -> hammer move... it's a result of the tiering 
> > >> >> >> > changes you
> > made:
> > >> >> >> >
> > >> >> >> > > > > > > > The following:
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > ceph osd tier add img images --force-nonempty 
> > >> >> >> > > > > > > > ceph osd tier cache-mode images forward ceph 
> > >> >> >> > > > > > > > osd tier set-overlay img images
> > >> >> >> >
> > >> >> >> > Specifically, --force-nonempty bypassed important safety
> checks.
> > >> >> >> >
> > >> >> >> > 1. images had snapshots (and removed_snaps)
> > >> >> >> >
> > >> >> >> > 2. images was added as a tier *of* img, and img's 
> > >> >> >> > removed_snaps was copied to images, clobbering the 
> > >> >> >> > removed_snaps value (see
> > >> >> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
> > >> >> >> >
> > >> >> >> > 3. tiering relation was undone, but removed_snaps was 
> > >> >> >> > still gone
> > >> >> >> >
> > >> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is 
> > >> >> >> > initialized with the older map.  later, in 
> > >> >> >> > PGPool::update(), we assume that removed_snaps alwasy 
> > >> >> >> > grows (never shrinks) and we
> > trigger an assert.
> > >> >> >> >
> > >> >> >> > To fix this I think we need to do 2 things:
> > >> >> >> >
> > >> >> >> > 1. make the OSD forgiving out removed_snaps getting 
> > >> >> >> > smaller. This is probably a good thing anyway: once we 
> > >> >> >> > know snaps are removed on all OSDs we can prune the 
> > >> >> >> > interval_set in the
> > OSDMap.  Maybe.
> > >> >> >> >
> > >> >> >> > 2. Fix the mon to prevent this from happening, *even* 
> > >> >> >> > when --force-nonempty is specified.  (This is the root 
> > >> >> >> > cause.)
> > >> >> >> >
> > >> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track
this.
> > >> >> >> >
> > >> >> >> > sage
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Idea was to make images as a tier to img, 
> > >> >> >> > > > > > > > move data to img then change
> > >> >> >> > > > > > > clients to use the new img pool.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Br,
> > >> >> >> > > > > > > > Tuomas
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > > Can you explain exactly what you mean by:
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > "Also I created one pool for tier to be 
> > >> >> >> > > > > > > > > able to move data without
> > >> >> >> > > > > > > outage."
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > -Sam
> > >> >> >> > > > > > > > > ----- Original Message -----
> > >> >> >> > > > > > > > > From: "tuomas juntunen"
> > >> >> >> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > >> >> >> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > >> >> >> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > >> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> > >> >> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from 
> > >> >> >> > > > > > > > > Giant to Hammer and after some basic 
> > >> >> >> > > > > > > > > operations most of the OSD's went down
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Hi
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Any solution for this yet?
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Br,
> > >> >> >> > > > > > > > > Tuomas
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >> It looks like you may have hit
> > >> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Ian R. Colle Global Director of Software 
> > >> >> >> > > > > > > > >> Engineering Red Hat (Inktank is now part 
> > >> >> >> > > > > > > > >> of Red Hat!) 
> > >> >> >> > > > > > > > >> http://www.linkedin.com/in/ircolle
> > >> >> >> > > > > > > > >> http://www.twitter.com/ircolle
> > >> >> >> > > > > > > > >> Cell: +1.303.601.7713
> > >> >> >> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> ----- Original Message -----
> > >> >> >> > > > > > > > >> From: "tuomas juntunen"
> > >> >> >> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > >> >> >> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > >> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> > >> >> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant 
> > >> >> >> > > > > > > > >> to Hammer and after some basic operations 
> > >> >> >> > > > > > > > >> most of the OSD's went down
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 
> > >> >> >> > > > > > > > >> Hammer
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Then created new pools and deleted some 
> > >> >> >> > > > > > > > >> old ones. Also I created one pool for tier 
> > >> >> >> > > > > > > > >> to be able to move data without
> > >> >> >> > > outage.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> After these operations all but 10 OSD's 
> > >> >> >> > > > > > > > >> are down and creating this kind of 
> > >> >> >> > > > > > > > >> messages to logs, I get more than 100gb of 
> > >> >> >> > > > > > > > >> these in a
> > >> >> >> > > > > > night:
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> [osd log excerpt and interval_set.h assert backtrace snipped; quoted in full earlier in the thread]
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the 
> > >> >> >> > > > > > > > >> following messages, also lots of
> > >> >> >> > > > > > > them.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF]
> > from='client.?
> > >> >> >> > > > > > > 10.20.0.13:0/1174409'
> > >> >> >> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush 
> > >> >> >> > > > > > > > >> create-or-move",
> > >> >> >> > > > "args":
> > >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30,
> > "weight":
> > >> >> 1.82}]:
> > >> >> >>
> > >> >> >> > > > > > > > >> dispatch
> > >> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF]
> > from='client.?
> > >> >> >> > > > > > > 10.20.0.13:0/1174483'
> > >> >> >> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush 
> > >> >> >> > > > > > > > >> create-or-move",
> > >> >> >> > > > "args":
> > >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26,
> > "weight":
> > >> >> 1.82}]:
> > >> >> >>
> > >> >> >> > > > > > > > >> dispatch
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 
> > >> >> >> > > > > > > > >> OSD's, nodes are also mons and mds's to save
servers.
> > >> >> >> > > > > > > > >> All run Ubuntu
> > >> >> >> 14.04.2.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> I have pretty much tried everything I 
> > >> >> >> > > > > > > > >> could think
> > of.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Restarting daemons doesn't help.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Any help would be appreciated. I can also 
> > >> >> >> > > > > > > > >> provide more logs if necessary. They just 
> > >> >> >> > > > > > > > >> seem to get pretty large in few
> > >> >> >> > > moments.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Thank you
> > >> >> >> > > > > > > > >> Tuomas
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > >
> > >> >> >> > > > >
> > >> >> >> > > >
> > >> >> >> > >
> > >> >> >> > >
> > >> >> >> >
> > >> >> >> >
> > >> >> >>
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >
> > >>
> > >>
> > >>
> > 
> > 
> > 
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]                   ` <f0d4624d313c49cf355543dbf52d6561-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
@ 2015-05-04 17:20                     ` Sage Weil
       [not found]                       ` <alpine.DEB.2.00.1505041019300.24939-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2015-05-04 17:20 UTC (permalink / raw)
  To: Tuomas Juntunen
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: TEXT/PLAIN, Size: 31769 bytes --]

On Mon, 4 May 2015, Tuomas Juntunen wrote:
> 5827504:        10.20.0.11:6800/3382530 'ceph1' mds.0.262 up:rejoin seq 33159

This is why it is 'degraded'... stuck in up:rejoin state.
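
If you want to keep an eye on it while trying things, a couple of stock 
commands (just a sketch) will show the state:

  ceph health detail
  ceph mds dump | grep up:

Both keep reporting 'mds cluster is degraded' / up:rejoin until the rejoin 
completes.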

> The active+clean+replay has been there for a day now, so there must be
> something that is not ok, if it should've gone away in a couple of minutes.

...possibly because the pg is stuck in replay state.  Can you do 'ceph pg 
<pgid> query' on one of them?  And maybe see if bouncing one of the OSDs 
for the pg clears it up (ceph osd down $osdid).
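
Something like this, as a rough sketch (pg 2.189 and osd.23 below are just 
examples taken from the earlier log excerpt; substitute whichever pg is 
stuck in replay and one of the OSDs in its acting set):

  # list pgs and their states, pick one still showing 'replay'
  ceph pg dump pgs_brief | grep replay

  # detailed peering/recovery state for that pg
  ceph pg 2.189 query

  # see which OSDs serve the pg, then mark the primary down so it re-peers
  ceph pg map 2.189
  ceph osd down 23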

sage


> 
> 
> Thanks
> 
> Tuomas
> 
> -----Original Message-----
> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org] 
> Sent: 4. toukokuuta 2015 18:29
> To: Tuomas Juntunen
> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic
> operations most of the OSD's went down
> 
> On Mon, 4 May 2015, Tuomas Juntunen wrote:
> > Hi
> > 
> > Thanks Sage, I got it working now. Everything else seems to be ok, 
> > except mds is reporting "mds cluster is degraded", not sure what could be
> wrong.
> > Mds is running and all osds are up and pg's are active+clean and
> > active+clean+replay.
> 
> Great!  The 'replay' part should clear after a minute or two.
> 
> > Had to delete some empty pools which were created while the osd's were 
> > not working and recovery started to go through.
> > 
> > Seems mds is not that stable, this isn't the first time it goes degraded.
> > Before it just started to work, but now I just can't get it back working.
> 
> What does 'ceph mds dump' say?
> 
> sage
> 
> > 
> > Thanks
> > 
> > Br,
> > Tuomas
> > 
> > 
> > -----Original Message-----
> > From: tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org
> > [mailto:tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org]
> > Sent: 1. toukokuuta 2015 21:14
> > To: Sage Weil
> > Cc: tuomas.juntunen; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; 
> > ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some 
> > basic operations most of the OSD's went down
> > 
> > Thanks, I'll do this when the commit is available and report back.
> > 
> > And indeed, I'll change to the official ones after everything is ok.
> > 
> > Br,
> > Tuomas
> > 
> > > On Fri, 1 May 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> > >> Hi
> > >>
> > >> I deleted the images and img pools and started osd's, they still die.
> > >>
> > >> Here's a log of one of the osd's after this, if you need it.
> > >>
> > >> http://beta.xaasbox.com/ceph/ceph-osd.19.log
> > >
> > > I've pushed another commit that should avoid this case, sha1 
> > > 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
> > >
> > > Note that once the pools are fully deleted (shouldn't take too long 
> > > once the osds are up and stabilize) you should switch back to the 
> > > normal packages that don't have these workarounds.
> > >
> > > sage
> > >
> > >
> > >
> > >>
> > >> Br,
> > >> Tuomas
> > >>
> > >>
> > >> > Thanks man. I'll try it tomorrow. Have a good one.
> > >> >
> > >> > Br,T
> > >> >
> > >> > -------- Original message --------
> > >> > From: Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
> > >> > Date: 30/04/2015  18:23  (GMT+02:00)
> > >> > To: Tuomas Juntunen <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > >> > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > >> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after 
> > >> > some basic
> > >>
> > >> > operations most of the OSD's went down
> > >> >
> > >> > On Thu, 30 Apr 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> > >> >> Hey
> > >> >>
> > >> >> Yes I can drop the images data, you think this will fix it?
> > >> >
> > >> > It's a slightly different assert that (I believe) should not 
> > >> > trigger once the pool is deleted.  Please give that a try and if 
> > >> > you still hit it I'll whip up a workaround.
> > >> >
> > >> > Thanks!
> > >> > sage
> > >> >
> > >> >  >
> > >> >>
> > >> >> Br,
> > >> >>
> > >> >> Tuomas
> > >> >>
> > >> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> > >> >> >> Hi
> > >> >> >>
> > >> >> >> I updated that version and it seems that something did 
> > >> >> >> happen, the osd's stayed up for a while and 'ceph status' got
> updated.
> > >> >> >> But then in couple
> > >> of
> > >> >> >> minutes, they all went down the same way.
> > >> >> >>
> > >> >> >> I have attached new 'ceph osd dump -f json-pretty' and got a 
> > >> >> >> new log
> > >> from
> > >> >> >> one of the osd's with osd debug = 20, 
> > >> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> > >> >> >
> > >> >> > Sam mentioned that you had said earlier that this was not 
> > >> >> > critical
> > data?
> > >> >> > If not, I think the simplest thing is to just drop those 
> > >> >> > pools. The important thing (from my perspective at least :) 
> > >> >> > is that we understand
> > >> the
> > >> >> > root cause and can prevent this in the future.
> > >> >> >
> > >> >> > sage
> > >> >> >
> > >> >> >
> > >> >> >>
> > >> >> >> Thank you!
> > >> >> >>
> > >> >> >> Br,
> > >> >> >> Tuomas
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> -----Original Message-----
> > >> >> >> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
> > >> >> >> Sent: 28. huhtikuuta 2015 23:57
> > >> >> >> To: Tuomas Juntunen
> > >> >> >> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > >> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and 
> > >> >> >> after some
> > >> basic
> > >> >> >> operations most of the OSD's went down
> > >> >> >>
> > >> >> >> Hi Tuomas,
> > >> >> >>
> > >> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you 
> > >> >> >> please
> > try it?
> > >> >> >> The build will appear here
> > >> >> >>
> > >> >> >>
> > >> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/
> > >> >> >> 08
> > >> >> >> bf531331afd5e
> > >> >> >> 2eb514067f72afda11bcde286
> > >> >> >>
> > >> >> >> (or a similar url; adjust for your distro).
> > >> >> >>
> > >> >> >> Thanks!
> > >> >> >> sage
> > >> >> >>
> > >> >> >>
> > >> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> > >> >> >>
> > >> >> >> > [adding ceph-devel]
> > >> >> >> >
> > >> >> >> > Okay, I see the problem.  This seems to be unrelated ot 
> > >> >> >> > the giant -> hammer move... it's a result of the tiering 
> > >> >> >> > changes you
> > made:
> > >> >> >> >
> > >> >> >> > > > > > > > The following:
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > ceph osd tier add img images --force-nonempty 
> > >> >> >> > > > > > > > ceph osd tier cache-mode images forward ceph 
> > >> >> >> > > > > > > > osd tier set-overlay img images
> > >> >> >> >
> > >> >> >> > Specifically, --force-nonempty bypassed important safety
> checks.
> > >> >> >> >
> > >> >> >> > 1. images had snapshots (and removed_snaps)
> > >> >> >> >
> > >> >> >> > 2. images was added as a tier *of* img, and img's 
> > >> >> >> > removed_snaps was copied to images, clobbering the 
> > >> >> >> > removed_snaps value (see
> > >> >> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
> > >> >> >> >
> > >> >> >> > 3. tiering relation was undone, but removed_snaps was still 
> > >> >> >> > gone
> > >> >> >> >
> > >> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is 
> > >> >> >> > initialized with the older map.  later, in 
> > >> >> >> > PGPool::update(), we assume that removed_snaps alwasy grows 
> > >> >> >> > (never shrinks) and we
> > trigger an assert.
> > >> >> >> >
> > >> >> >> > To fix this I think we need to do 2 things:
> > >> >> >> >
> > >> >> >> > 1. make the OSD forgiving out removed_snaps getting 
> > >> >> >> > smaller. This is probably a good thing anyway: once we 
> > >> >> >> > know snaps are removed on all OSDs we can prune the 
> > >> >> >> > interval_set in the
> > OSDMap.  Maybe.
> > >> >> >> >
> > >> >> >> > 2. Fix the mon to prevent this from happening, *even* when 
> > >> >> >> > --force-nonempty is specified.  (This is the root cause.)
> > >> >> >> >
> > >> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> > >> >> >> >
> > >> >> >> > sage
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Idea was to make images as a tier to img, move 
> > >> >> >> > > > > > > > data to img then change
> > >> >> >> > > > > > > clients to use the new img pool.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Br,
> > >> >> >> > > > > > > > Tuomas
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > > Can you explain exactly what you mean by:
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > "Also I created one pool for tier to be able 
> > >> >> >> > > > > > > > > to move data without
> > >> >> >> > > > > > > outage."
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > -Sam
> > >> >> >> > > > > > > > > ----- Original Message -----
> > >> >> >> > > > > > > > > From: "tuomas juntunen"
> > >> >> >> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > >> >> >> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > >> >> >> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > >> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> > >> >> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant 
> > >> >> >> > > > > > > > > to Hammer and after some basic operations 
> > >> >> >> > > > > > > > > most of the OSD's went down
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Hi
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Any solution for this yet?
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Br,
> > >> >> >> > > > > > > > > Tuomas
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >> It looks like you may have hit
> > >> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Ian R. Colle Global Director of Software 
> > >> >> >> > > > > > > > >> Engineering Red Hat (Inktank is now part of 
> > >> >> >> > > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
> > >> >> >> > > > > > > > >> http://www.twitter.com/ircolle
> > >> >> >> > > > > > > > >> Cell: +1.303.601.7713
> > >> >> >> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> ----- Original Message -----
> > >> >> >> > > > > > > > >> From: "tuomas juntunen"
> > >> >> >> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > >> >> >> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > >> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> > >> >> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to 
> > >> >> >> > > > > > > > >> Hammer and after some basic operations most 
> > >> >> >> > > > > > > > >> of the OSD's went down
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 
> > >> >> >> > > > > > > > >> Hammer
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Then created new pools and deleted some old 
> > >> >> >> > > > > > > > >> ones. Also I created one pool for tier to be 
> > >> >> >> > > > > > > > >> able to move data without
> > >> >> >> > > outage.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> After these operations all but 10 OSD's are 
> > >> >> >> > > > > > > > >> down and creating this kind of messages to 
> > >> >> >> > > > > > > > >> logs, I get more than 100gb of these in a
> > >> >> >> > > > > > night:
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 
> > >> >> >> > > > > > > > >>7fd8e748d700  5
> > >> osd.23
> > >> >> >> > > pg_epoch:
> > >> >> >> > > >
> > >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> > >> >> >> > > > > > > > >>local-les=16609
> > >> >> >> > > > > > > > >> n=0
> > >> >> >> > > > > > > > >> ec=1 les/c
> > >> >> >> > > > > > > > >> 16609/16659
> > >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > >> >> >> > > > > > > > >> pi=15659-16589/42
> > >> >> >> > > > > > > > >> crt=8480'7 lcod
> > >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started     
> > >> >> >> > > > > > > > >>-18>
> > >> >> >> > > > > > > > >>2015-04-27 10:17:08.808596 7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> > >> >> >> > > > > > > > >>local-les=16609
> > >> >> >> > > > > > > > >> n=0
> > >> >> >> > > > > > > > >> ec=1 les/c
> > >> >> >> > > > > > > > >> 16609/16659
> > >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > >> >> >> > > > > > > > >> pi=15659-16589/42
> > >> >> >> > > > > > > > >> crt=8480'7 lcod
> > >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Start     -17>
> > >> >> >> > > > > > > > >>2015-04-27 10:17:08.808608 7fd8e748d700  1
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> > >> >> >> > > > > > > > >> local-les=16609
> > >> >> >> > > > > > > > >> n=0
> > >> >> >> > > > > > > > >> ec=1 les/c
> > >> >> >> > > > > > > > >> 16609/16659
> > >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > >> >> >> > > > > > > > >> pi=15659-16589/42
> > >> >> >> > > > > > > > >> crt=8480'7 lcod
> > >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: 
> > >> >> >> > > > > > > > >> transitioning to
> > >> Stray
> > >> >> >> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 
> > >> >> >> > > > > > > > >>7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> > >> >> >> > > > > > > > >>local-les=16609
> > >> >> >> > > > > > > > >> n=0
> > >> >> >> > > > > > > > >> ec=1 les/c
> > >> >> >> > > > > > > > >> 16609/16659
> > >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > >> >> >> > > > > > > > >> pi=15659-16589/42
> > >> >> >> > > > > > > > >> crt=8480'7 lcod
> > >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0
> > >> >> >> > > > > > > > >>0.000000     -15> 2015-04-27 
> > >> >> >> > > > > > > > >>10:17:08.808637 7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> > >> >> >> > > > > > > > >>local-les=16609
> > >> >> >> > > > > > > > >> n=0
> > >> >> >> > > > > > > > >> ec=1 les/c
> > >> >> >> > > > > > > > >> 16609/16659
> > >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> > >> >> >> > > > > > > > >> pi=15659-16589/42
> > >> >> >> > > > > > > > >> crt=8480'7 lcod
> > >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray   
> > >> >> >> > > > > > > > >>Â
> > >> >> >> > > > > > > > >>-14> 2015-04-27 10:17:08.808796 7fd8e748d700Â
> > >> >> >> > > > > > > > >>5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> > >> >> >> > > > > > > > >>ec=17863  les/c
> > >> >> >> > > > > > > > >> 17879/17879
> > >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> > >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] exit Reset 0.119467 
> > >> >> >> > > > > > > > >>4
> > >> >> >> > > > > > > > >>0.000037     -13> 2015-04-27 
> > >> >> >> > > > > > > > >>10:17:08.808817 7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> > >> >> >> > > > > > > > >>ec=17863  les/c
> > >> >> >> > > > > > > > >> 17879/17879
> > >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> > >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Started   Â
> > >> >> >> > > > > > > > >>-12> 2015-04-27 10:17:08.808828 7fd8e748d700Â
> > >> >> >> > > > > > > > >>5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> > >> >> >> > > > > > > > >>ec=17863  les/c
> > >> >> >> > > > > > > > >> 17879/17879
> > >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> > >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Start   Â
> > >> >> >> > > > > > > > >>-11> 2015-04-27 10:17:08.808838 7fd8e748d700Â
> > >> >> >> > > > > > > > >>1
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> > >> >> >> > > > > > > > >>ec=17863  les/c
> > >> >> >> > > > > > > > >> 17879/17879
> > >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> > >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY]
> > >> >> >> > > > > > > > >> state<Start>: transitioning to Stray   Â
> > >> >> >> > > > > > > > >>-10> 2015-04-27 10:17:08.808849 7fd8e748d700Â
> > >> >> >> > > > > > > > >>5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> > >> >> >> > > > > > > > >>ec=17863  les/c
> > >> >> >> > > > > > > > >> 17879/17879
> > >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> > >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] exit Start 0.000020 
> > >> >> >> > > > > > > > >>0
> > >> >> >> > > > > > > > >>0.000000      -9> 2015-04-27
> > >> >> >> > > > > > > > >>10:17:08.808861 7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> > >> >> >> > > > > > > > >>ec=17863  les/c
> > >> >> >> > > > > > > > >> 17879/17879
> > >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> > >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Started/Stray 
> > >> >> >> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 
> > >> >> >> > > > > > > > >>7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 
> > >> >> >> > > > > > > > >>0.000165      -7> 2015-04-27 
> > >> >> >> > > > > > > > >>10:17:08.809445 7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 inactive] enter Started      -6>
> > >> >> >> > > > > > > > >>2015-04-27 10:17:08.809456 7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 inactive] enter Start      -5>
> > >> >> >> > > > > > > > >>2015-04-27 10:17:08.809468 7fd8e748d700  1
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 inactive]
> > >> >> >> > > > > > > > >> state<Start>: transitioning to Primary    
> > >> >> >> > > > > > > > >>Â
> > >> >> >> > > > > > > > >>-4> 2015-04-27 10:17:08.809479 7fd8e748d700  
> > >> >> >> > > > > > > > >>-4> 5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000 
> > >> >> >> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 
> > >> >> >> > > > > > > > >>7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary    Â
> > >> >> >> > > > > > > > >>-2> 2015-04-27 10:17:08.809502 7fd8e748d700  
> > >> >> >> > > > > > > > >>-2> 5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary/Peering 
> > >> >> >> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 
> > >> >> >> > > > > > > > >>7fd8e748d700  5
> > >> >> >> > > > > > > > >> osd.23
> > >> >> >> > > > pg_epoch:
> > >> >> >> > > > >
> > >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> > >> >> >> > > > > > > > >>ec=1 les/c
> > >> >> >> > > > > > > > >> 16127/16344
> > >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> > >> >> >> > > > > > > > >>crt=0'0 mlcod
> > >> >> >> > > > > > > > >> 0'0 peering] enter 
> > >> >> >> > > > > > > > >>Started/Primary/Peering/GetInfo       0>
> > >> >> >> > > > > > > > >>2015-04-27 10:17:08.813837 7fd8e748d700 -1
> > >> >> >> > > > > > > ./include/interval_set.h:
> > >> >> >> > > > > > > > >> In
> > >> >> >> > > > > > > > >> function 'void interval_set<T>::erase(T, T) 
> > >> >> >> > > > > > > > >> [with T =
> > >> >> >> > > snapid_t]'
> > >> >> >> > > > > > > > >> thread
> > >> >> >> > > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> > >> >> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED 
> > >> >> >> > > > > > > > >> assert(_size >=
> > >> >> >> > > > > > > > >> 0)
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>  ceph version 0.94.1
> > >> >> >> > > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> > >> >> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, 
> > >> >> >> > > > > > > > >>char
> > >> const*,
> > >> >> >> > > > > > > > >> int, char
> > >> >> >> > > > > > > > >> const*)+0x8b)  [0xbc271b]   2:
> > >> >> >> > > > > > > > >> 
> > >> >> >> > > > > > > > >>(interval_set<snapid_t>::subtract(interval_se
> > >> >> >> > > > > > > > >>t<
> > >> >> >> > > > > > > > >>snapid_t
> > >> >> >> > > > > > > > >> >
> > >> >> >> > > > > > > > >> const&)+0xb0) [0x82cd50]   3: 
> > >> >> >> > > > > > > > >>(PGPool::update(std::tr1::shared_ptr<OSDMap
> > >> >> >> > > > > > > > >> const>)+0x52e) [0x80113e]
> > >> >> >> > > > > > > > >>  4:
> > >> (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> > >> >> >> > > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>, 
> > >> >> >> > > > > > > > >> const>std::vector<int,
> > >> >> >> > > > > > > > >> std::allocator<int> >&, int, 
> > >> >> >> > > > > > > > >> std::vector<int, std::allocator<int>
> > >> >> >> > > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> > >> >> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, 
> > >> >> >> > > > > > > > >>ThreadPool::TPHandle&, PG::RecoveryCtx*, 
> > >> >> >> > > > > > > > >>std::set<boost::intrusive_ptr<PG>,
> > >> >> >> > > > > > > > >> std::less<boost::intrusive_ptr<PG> >, 
> > >> >> >> > > > > > > > >>std::allocator<boost::intrusive_ptr<PG> >
> > >> >> >> > > > > > > > >>>*)+0x2c3)  [0x6b0e43]   6: 
> > >> >> >> > > > > > > > >>(OSD::process_peering_events(std::list<PG*,
> > >> >> >> > > > > > > > >> std::allocator<PG*>
> > >> >> >> > > > > > > > >> > const&,
> > >> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]   7: 
> > >> >> >> > > > > > > > >>(OSD::PeeringWQ::_process(std::list<PG*,
> > >> >> >> > > > > > > > >> std::allocator<PG*>
> > >> >> >> > > > > > > > >> > const&,
> > >> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]   8:
> > >> (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> > >> >> >> > > > > > > > >> [0xbb38ae]
> > >> >> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10)
> > >> >> >> > > > > > > > >>[0xbb4950]   10: (()+0x8182) 
> > >> >> >> > > > > > > > >>[0x7fd906946182]   11: (clone()+0x6d) 
> > >> >> >> > > > > > > > >>[0x7fd904eb147d]
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the 
> > >> >> >> > > > > > > > >> following messages, also lots of
> > >> >> >> > > > > > > them.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF]
> > from='client.?
> > >> >> >> > > > > > > 10.20.0.13:0/1174409'
> > >> >> >> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush 
> > >> >> >> > > > > > > > >> create-or-move",
> > >> >> >> > > > "args":
> > >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30,
> > "weight":
> > >> >> 1.82}]:
> > >> >> >>
> > >> >> >> > > > > > > > >> dispatch
> > >> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF]
> > from='client.?
> > >> >> >> > > > > > > 10.20.0.13:0/1174483'
> > >> >> >> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush 
> > >> >> >> > > > > > > > >> create-or-move",
> > >> >> >> > > > "args":
> > >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26,
> > "weight":
> > >> >> 1.82}]:
> > >> >> >>
> > >> >> >> > > > > > > > >> dispatch
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, 
> > >> >> >> > > > > > > > >> nodes are also mons and mds's to save servers.
> > >> >> >> > > > > > > > >> All run Ubuntu
> > >> >> >> 14.04.2.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> I have pretty much tried everything I could 
> > >> >> >> > > > > > > > >> think
> > of.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Restarting daemons doesn't help.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Any help would be appreciated. I can also 
> > >> >> >> > > > > > > > >> provide more logs if necessary. They just 
> > >> >> >> > > > > > > > >> seem to get pretty large in few
> > >> >> >> > > moments.
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >> Thank you
> > >> >> >> > > > > > > > >> Tuomas
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >>
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > >
> > >> >> >> > > > >
> > >> >> >> > > >
> > >> >> >> > >
> > >> >> >> > >
> > >> >> >> >
> > >> >> >> >
> > >> >> >>
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >
> > >>
> > >>
> > >>
> > 
> > 
> > 
> > 
> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]               ` <alpine.DEB.2.00.1505040828590.24939-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-05-04 17:17                 ` Tuomas Juntunen
       [not found]                   ` <f0d4624d313c49cf355543dbf52d6561-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Tuomas Juntunen @ 2015-05-04 17:17 UTC (permalink / raw)
  To: 'Sage Weil'
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi, below is the mds dump:

dumped mdsmap epoch 1799
epoch   1799
flags   0
created 2014-12-10 12:44:34.188118
modified        2015-05-04 07:16:37.205350
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    1794
last_failure_osd_epoch  21750
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds 1
in      0
up      {0=5827504}
failed
stopped
data_pools      0
metadata_pool   1
inline_data     disabled
5827504:        10.20.0.11:6800/3382530 'ceph1' mds.0.262 up:rejoin seq
33159

The active+clean+replay has been there for a day now, so there must be
something that is not ok, if it should've gone away in a couple of minutes.


Thanks

Tuomas

-----Original Message-----
From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org] 
Sent: 4. toukokuuta 2015 18:29
To: Tuomas Juntunen
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic
operations most of the OSD's went down

On Mon, 4 May 2015, Tuomas Juntunen wrote:
> Hi
> 
> Thanks Sage, I got it working now. Everything else seems to be ok, 
> except mds is reporting "mds cluster is degraded", not sure what could be
wrong.
> Mds is running and all osds are up and pg's are active+clean and
> active+clean+replay.

Great!  The 'replay' part should clear after a minute or two.

> Had to delete some empty pools which were created while the osd's were 
> not working and recovery started to go through.
> 
> Seems mds is not that stable, this isn't the first time it goes degraded.
> Before it just started to work, but now I just can't get it back working.

What does 'ceph mds dump' say?

sage

> 
> Thanks
> 
> Br,
> Tuomas
> 
> 
> -----Original Message-----
> From: tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org
> [mailto:tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org]
> Sent: 1. toukokuuta 2015 21:14
> To: Sage Weil
> Cc: tuomas.juntunen; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; 
> ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some 
> basic operations most of the OSD's went down
> 
> Thanks, I'll do this when the commit is available and report back.
> 
> And indeed, I'll change to the official ones after everything is ok.
> 
> Br,
> Tuomas
> 
> > On Fri, 1 May 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> >> Hi
> >>
> >> I deleted the images and img pools and started osd's, they still die.
> >>
> >> Here's a log of one of the osd's after this, if you need it.
> >>
> >> http://beta.xaasbox.com/ceph/ceph-osd.19.log
> >
> > I've pushed another commit that should avoid this case, sha1 
> > 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
> >
> > Note that once the pools are fully deleted (shouldn't take too long 
> > once the osds are up and stabilize) you should switch back to the 
> > normal packages that don't have these workarounds.
> >
> > sage
> >
> >
> >
> >>
> >> Br,
> >> Tuomas
> >>
> >>
> >> > Thanks man. I'll try it tomorrow. Have a good one.
> >> >
> >> > Br,T
> >> >
> >> > -------- Original message --------
> >> > From: Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
> >> > Date: 30/04/2015  18:23  (GMT+02:00)
> >> > To: Tuomas Juntunen <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after 
> >> > some basic
> >>
> >> > operations most of the OSD's went down
> >> >
> >> > On Thu, 30 Apr 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> >> >> Hey
> >> >>
> >> >> Yes I can drop the images data, you think this will fix it?
> >> >
> >> > It's a slightly different assert that (I believe) should not 
> >> > trigger once the pool is deleted.  Please give that a try and if 
> >> > you still hit it I'll whip up a workaround.
> >> >
> >> > Thanks!
> >> > sage
> >> >
> >> >  >
> >> >>
> >> >> Br,
> >> >>
> >> >> Tuomas
> >> >>
> >> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> >> >> >> Hi
> >> >> >>
> >> >> >> I updated that version and it seems that something did 
> >> >> >> happen, the osd's stayed up for a while and 'ceph status' got
updated.
> >> >> >> But then in couple
> >> of
> >> >> >> minutes, they all went down the same way.
> >> >> >>
> >> >> >> I have attached new 'ceph osd dump -f json-pretty' and got a 
> >> >> >> new log
> >> from
> >> >> >> one of the osd's with osd debug = 20, 
> >> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> >> >> >
> >> >> > Sam mentioned that you had said earlier that this was not 
> >> >> > critical
> data?
> >> >> > If not, I think the simplest thing is to just drop those 
> >> >> > pools. The important thing (from my perspective at least :) 
> >> >> > is that we understand
> >> the
> >> >> > root cause and can prevent this in the future.
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Thank you!
> >> >> >>
> >> >> >> Br,
> >> >> >> Tuomas
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
> >> >> >> Sent: 28. huhtikuuta 2015 23:57
> >> >> >> To: Tuomas Juntunen
> >> >> >> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and 
> >> >> >> after some
> >> basic
> >> >> >> operations most of the OSD's went down
> >> >> >>
> >> >> >> Hi Tuomas,
> >> >> >>
> >> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you 
> >> >> >> please
> try it?
> >> >> >> The build will appear here
> >> >> >>
> >> >> >>
> >> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/
> >> >> >> 08
> >> >> >> bf531331afd5e
> >> >> >> 2eb514067f72afda11bcde286
> >> >> >>
> >> >> >> (or a similar url; adjust for your distro).
> >> >> >>
> >> >> >> Thanks!
> >> >> >> sage
> >> >> >>
> >> >> >>
> >> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> >> >> >>
> >> >> >> > [adding ceph-devel]
> >> >> >> >
> >> >> >> > Okay, I see the problem.  This seems to be unrelated ot 
> >> >> >> > the giant -> hammer move... it's a result of the tiering 
> >> >> >> > changes you
> made:
> >> >> >> >
> >> >> >> > > > > > > > The following:
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > ceph osd tier add img images --force-nonempty 
> >> >> >> > > > > > > > ceph osd tier cache-mode images forward ceph 
> >> >> >> > > > > > > > osd tier set-overlay img images
> >> >> >> >
> >> >> >> > Specifically, --force-nonempty bypassed important safety
checks.
> >> >> >> >
> >> >> >> > 1. images had snapshots (and removed_snaps)
> >> >> >> >
> >> >> >> > 2. images was added as a tier *of* img, and img's 
> >> >> >> > removed_snaps was copied to images, clobbering the 
> >> >> >> > removed_snaps value (see
> >> >> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
> >> >> >> >
> >> >> >> > 3. tiering relation was undone, but removed_snaps was still 
> >> >> >> > gone
> >> >> >> >
> >> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is 
> >> >> >> > initialized with the older map.  later, in 
> >> >> >> > PGPool::update(), we assume that removed_snaps alwasy grows 
> >> >> >> > (never shrinks) and we
> trigger an assert.
> >> >> >> >
> >> >> >> > To fix this I think we need to do 2 things:
> >> >> >> >
> >> >> >> > 1. make the OSD forgiving out removed_snaps getting 
> >> >> >> > smaller. This is probably a good thing anyway: once we 
> >> >> >> > know snaps are removed on all OSDs we can prune the 
> >> >> >> > interval_set in the
> OSDMap.  Maybe.
> >> >> >> >
> >> >> >> > 2. Fix the mon to prevent this from happening, *even* when 
> >> >> >> > --force-nonempty is specified.  (This is the root cause.)
> >> >> >> >
> >> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> >> >> >> >
> >> >> >> > sage
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Idea was to make images as a tier to img, move 
> >> >> >> > > > > > > > data to img then change
> >> >> >> > > > > > > clients to use the new img pool.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Br,
> >> >> >> > > > > > > > Tuomas
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > > Can you explain exactly what you mean by:
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > "Also I created one pool for tier to be able 
> >> >> >> > > > > > > > > to move data without
> >> >> >> > > > > > > outage."
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > -Sam
> >> >> >> > > > > > > > > ----- Original Message -----
> >> >> >> > > > > > > > > From: "tuomas juntunen"
> >> >> >> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> >> >> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> >> >> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> >> >> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant 
> >> >> >> > > > > > > > > to Hammer and after some basic operations 
> >> >> >> > > > > > > > > most of the OSD's went down
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Hi
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Any solution for this yet?
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Br,
> >> >> >> > > > > > > > > Tuomas
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >> It looks like you may have hit
> >> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Ian R. Colle Global Director of Software 
> >> >> >> > > > > > > > >> Engineering Red Hat (Inktank is now part of 
> >> >> >> > > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
> >> >> >> > > > > > > > >> http://www.twitter.com/ircolle
> >> >> >> > > > > > > > >> Cell: +1.303.601.7713
> >> >> >> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> ----- Original Message -----
> >> >> >> > > > > > > > >> From: "tuomas juntunen"
> >> >> >> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> >> >> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> >> >> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to 
> >> >> >> > > > > > > > >> Hammer and after some basic operations most 
> >> >> >> > > > > > > > >> of the OSD's went down
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 
> >> >> >> > > > > > > > >> Hammer
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Then created new pools and deleted some old 
> >> >> >> > > > > > > > >> ones. Also I created one pool for tier to be 
> >> >> >> > > > > > > > >> able to move data without
> >> >> >> > > outage.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> After these operations all but 10 OSD's are 
> >> >> >> > > > > > > > >> down and creating this kind of messages to 
> >> >> >> > > > > > > > >> logs, I get more than 100gb of these in a
> >> >> >> > > > > > night:
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> osd.23
> >> >> >> > > pg_epoch:
> >> >> >> > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started     
> >> >> >> > > > > > > > >>-18>
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.808596 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Start     -17>
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.808608 7fd8e748d700  1
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> >> >> >> > > > > > > > >> local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: 
> >> >> >> > > > > > > > >> transitioning to
> >> Stray
> >> >> >> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0
> >> >> >> > > > > > > > >>0.000000     -15> 2015-04-27 
> >> >> >> > > > > > > > >>10:17:08.808637 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7]
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray   
> >> >> >> > > > > > > > >>Â
> >> >> >> > > > > > > > >>-14> 2015-04-27 10:17:08.808796 7fd8e748d700Â
> >> >> >> > > > > > > > >>5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] exit Reset 0.119467 
> >> >> >> > > > > > > > >>4
> >> >> >> > > > > > > > >>0.000037     -13> 2015-04-27 
> >> >> >> > > > > > > > >>10:17:08.808817 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Started   Â
> >> >> >> > > > > > > > >>-12> 2015-04-27 10:17:08.808828 7fd8e748d700Â
> >> >> >> > > > > > > > >>5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Start   Â
> >> >> >> > > > > > > > >>-11> 2015-04-27 10:17:08.808838 7fd8e748d700Â
> >> >> >> > > > > > > > >>1
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY]
> >> >> >> > > > > > > > >> state<Start>: transitioning to Stray   Â
> >> >> >> > > > > > > > >>-10> 2015-04-27 10:17:08.808849 7fd8e748d700Â
> >> >> >> > > > > > > > >>5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] exit Start 0.000020 
> >> >> >> > > > > > > > >>0
> >> >> >> > > > > > > > >>0.000000      -9> 2015-04-27
> >> >> >> > > > > > > > >>10:17:08.808861 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Started/Stray 
> >> >> >> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 
> >> >> >> > > > > > > > >>0.000165      -7> 2015-04-27 
> >> >> >> > > > > > > > >>10:17:08.809445 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Started      -6>
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.809456 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Start      -5>
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.809468 7fd8e748d700  1
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive]
> >> >> >> > > > > > > > >> state<Start>: transitioning to Primary
> >> >> >> > > > > > > > >>-4> 2015-04-27 10:17:08.809479 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000 
> >> >> >> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary
> >> >> >> > > > > > > > >>-2> 2015-04-27 10:17:08.809502 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary/Peering 
> >> >> >> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 
> >> >> >> > > > > > > > >>ec=1 les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 peering] enter 
> >> >> >> > > > > > > > >>Started/Primary/Peering/GetInfo       0>
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.813837 7fd8e748d700 -1
> >> >> >> > > > > > > ./include/interval_set.h:
> >> >> >> > > > > > > > >> In
> >> >> >> > > > > > > > >> function 'void interval_set<T>::erase(T, T) 
> >> >> >> > > > > > > > >> [with T =
> >> >> >> > > snapid_t]'
> >> >> >> > > > > > > > >> thread
> >> >> >> > > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> >> >> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED 
> >> >> >> > > > > > > > >> assert(_size >=
> >> >> >> > > > > > > > >> 0)
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>  ceph version 0.94.1
> >> >> >> > > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> >> >> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, 
> >> >> >> > > > > > > > >>char
> >> const*,
> >> >> >> > > > > > > > >> int, char
> >> >> >> > > > > > > > >> const*)+0x8b)  [0xbc271b]   2:
> >> >> >> > > > > > > > >> 
> >> >> >> > > > > > > > >>(interval_set<snapid_t>::subtract(interval_se
> >> >> >> > > > > > > > >>t<
> >> >> >> > > > > > > > >>snapid_t
> >> >> >> > > > > > > > >> >
> >> >> >> > > > > > > > >> const&)+0xb0) [0x82cd50]   3: 
> >> >> >> > > > > > > > >>(PGPool::update(std::tr1::shared_ptr<OSDMap
> >> >> >> > > > > > > > >> const>)+0x52e) [0x80113e]
> >> >> >> > > > > > > > >>  4:
> >> (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> >> >> >> > > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>, 
> >> >> >> > > > > > > > >> const>std::vector<int,
> >> >> >> > > > > > > > >> std::allocator<int> >&, int, 
> >> >> >> > > > > > > > >> std::vector<int, std::allocator<int>
> >> >> >> > > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> >> >> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, 
> >> >> >> > > > > > > > >>ThreadPool::TPHandle&, PG::RecoveryCtx*, 
> >> >> >> > > > > > > > >>std::set<boost::intrusive_ptr<PG>,
> >> >> >> > > > > > > > >> std::less<boost::intrusive_ptr<PG> >, 
> >> >> >> > > > > > > > >>std::allocator<boost::intrusive_ptr<PG> >
> >> >> >> > > > > > > > >>>*)+0x2c3)  [0x6b0e43]   6: 
> >> >> >> > > > > > > > >>(OSD::process_peering_events(std::list<PG*,
> >> >> >> > > > > > > > >> std::allocator<PG*>
> >> >> >> > > > > > > > >> > const&,
> >> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]   7: 
> >> >> >> > > > > > > > >>(OSD::PeeringWQ::_process(std::list<PG*,
> >> >> >> > > > > > > > >> std::allocator<PG*>
> >> >> >> > > > > > > > >> > const&,
> >> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]   8:
> >> (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> >> >> >> > > > > > > > >> [0xbb38ae]
> >> >> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10)
> >> >> >> > > > > > > > >>[0xbb4950]   10: (()+0x8182) 
> >> >> >> > > > > > > > >>[0x7fd906946182]   11: (clone()+0x6d) 
> >> >> >> > > > > > > > >>[0x7fd904eb147d]
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the 
> >> >> >> > > > > > > > >> following messages, also lots of
> >> >> >> > > > > > > them.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF]
> from='client.?
> >> >> >> > > > > > > 10.20.0.13:0/1174409'
> >> >> >> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush 
> >> >> >> > > > > > > > >> create-or-move",
> >> >> >> > > > "args":
> >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30,
> "weight":
> >> >> 1.82}]:
> >> >> >>
> >> >> >> > > > > > > > >> dispatch
> >> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF]
> from='client.?
> >> >> >> > > > > > > 10.20.0.13:0/1174483'
> >> >> >> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush 
> >> >> >> > > > > > > > >> create-or-move",
> >> >> >> > > > "args":
> >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26,
> "weight":
> >> >> 1.82}]:
> >> >> >>
> >> >> >> > > > > > > > >> dispatch
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, 
> >> >> >> > > > > > > > >> nodes are also mons and mds's to save servers.
> >> >> >> > > > > > > > >> All run Ubuntu
> >> >> >> 14.04.2.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> I have pretty much tried everything I could 
> >> >> >> > > > > > > > >> think
> of.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Restarting daemons doesn't help.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Any help would be appreciated. I can also 
> >> >> >> > > > > > > > >> provide more logs if necessary. They just 
> >> >> >> > > > > > > > >> seem to get pretty large in few
> >> >> >> > > moments.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Thank you
> >> >> >> > > > > > > > >> Tuomas
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> ____________________________________________
> >> >> >> > > > > > > > >> __ _ ceph-users mailing list 
> >> >> >> > > > > > > > >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> >> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-user
> >> >> >> > > > > > > > >> s-
> >> >> >> > > > > > > > >> ceph.com
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > _____________________________________________
> >> >> >> > > > > > > > > __ ceph-users mailing list 
> >> >> >> > > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> >> >> >> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users
> >> >> >> > > > > > > > > -c
> >> >> >> > > > > > > > > eph.com
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > _______________________________________________
> >> >> >> > > > > > > > ceph-users mailing list 
> >> >> >> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> >> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-c
> >> >> >> > > > > > > > ep
> >> >> >> > > > > > > > h.com
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > _______________________________________________
> >> >> >> > > > > > > > ceph-users mailing list 
> >> >> >> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> >> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-c
> >> >> >> > > > > > > > ep
> >> >> >> > > > > > > > h.com
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > >
> >> >> >> > > > > >
> >> >> >> > > > >
> >> >> >> > > > >
> >> >> >> > > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > _______________________________________________
> >> >> >> > ceph-users mailing list
> >> >> >> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe 
> >> >> ceph-devel" in the body of a message to 
> >> >> majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at 
> >> >> http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >>
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >>
> >>
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]           ` <90c912f778464020445a8a09c7d8c7f5-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
@ 2015-05-04 15:29             ` Sage Weil
       [not found]               ` <alpine.DEB.2.00.1505040828590.24939-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2015-05-04 15:29 UTC (permalink / raw)
  To: Tuomas Juntunen
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: TEXT/PLAIN, Size: 28763 bytes --]

On Mon, 4 May 2015, Tuomas Juntunen wrote:
> Hi
> 
> Thanks Sage, I got it working now. Everything else seems to be ok, except
> that mds is reporting "mds cluster is degraded" and I'm not sure what could
> be wrong. The mds is running, all osds are up, and the pgs are active+clean
> and active+clean+replay.

Great!  The 'replay' part should clear after a minute or two.
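
A minimal way to watch that state clear, assuming the stock Hammer CLI (output
omitted here since it varies per cluster):

  ceph -s        # overall cluster status, including the pg state summary
  ceph pg stat   # one-line pg summary; the active+clean+replay count should drop away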

> I had to delete some empty pools that had been created while the osd's were
> not working, and then recovery started to go through.
> 
> It seems mds is not that stable; this isn't the first time it has gone
> degraded. Before, it just started working again on its own, but now I can't
> get it back to a working state.

What does 'ceph mds dump' say?

sage
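
For reference, a minimal sketch of the checks being asked for above (Hammer-era
commands; the output depends entirely on the cluster):

  ceph mds dump        # dump the mdsmap: epoch, data/metadata pools, which ranks are up/in/failed
  ceph mds stat        # one-line mdsmap summary, e.g. up:replay vs up:active
  ceph health detail   # detailed health output, including the mds degraded warning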

> 
> Thanks
> 
> Br,
> Tuomas
> 
> 
> -----Original Message-----
> From: tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org
> [mailto:tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org] 
> Sent: 1 May 2015 21:14
> To: Sage Weil
> Cc: tuomas.juntunen; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
> operations most of the OSD's went down
> 
> Thanks, I'll do this when the commit is available and report back.
> 
> And indeed, I'll change to the official ones after everything is ok.
> 
> Br,
> Tuomas
> 
> > On Fri, 1 May 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> >> Hi
> >>
> >> I deleted the images and img pools and started osd's, they still die.
> >>
> >> Here's a log of one of the osd's after this, if you need it.
> >>
> >> http://beta.xaasbox.com/ceph/ceph-osd.19.log
> >
> > I've pushed another commit that should avoid this case, sha1 
> > 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
> >
> > Note that once the pools are fully deleted (shouldn't take too long 
> > once the osds are up and stabilize) you should switch back to the 
> > normal packages that don't have these workarounds.
> >
> > sage
> >
> >
> >
> >>
> >> Br,
> >> Tuomas
> >>
> >>
> >> > Thanks man. I'll try it tomorrow. Have a good one.
> >> >
> >> > Br,T
> >> >
> >> > -------- Original message --------
> >> > From: Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
> >> > Date: 30/04/2015  18:23  (GMT+02:00)
> >> > To: Tuomas Juntunen <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after 
> >> > some basic
> >>
> >> > operations most of the OSD's went down
> >> >
> >> > On Thu, 30 Apr 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> >> >> Hey
> >> >>
> >> >> Yes I can drop the images data, you think this will fix it?
> >> >
> >> > It's a slightly different assert that (I believe) should not 
> >> > trigger once the pool is deleted.  Please give that a try and if 
> >> > you still hit it I'll whip up a workaround.
> >> >
> >> > Thanks!
> >> > sage
> >> >
> >> >  >
> >> >>
> >> >> Br,
> >> >>
> >> >> Tuomas
> >> >>
> >> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> >> >> >> Hi
> >> >> >>
> >> >> >> I updated that version and it seems that something did happen, 
> >> >> >> the osd's stayed up for a while and 'ceph status' got updated. 
> >> >> >> But then in couple
> >> of
> >> >> >> minutes, they all went down the same way.
> >> >> >>
> >> >> >> I have attached new 'ceph osd dump -f json-pretty' and got a 
> >> >> >> new log
> >> from
> >> >> >> one of the osd's with osd debug = 20, 
> >> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> >> >> >
> >> >> > Sam mentioned that you had said earlier that this was not critical
> data?
> >> >> > If not, I think the simplest thing is to just drop those pools.  
> >> >> > The important thing (from my perspective at least :) is that we 
> >> >> > understand
> >> the
> >> >> > root cause and can prevent this in the future.
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Thank you!
> >> >> >>
> >> >> >> Br,
> >> >> >> Tuomas
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
> >> >> >> Sent: 28 April 2015 23:57
> >> >> >> To: Tuomas Juntunen
> >> >> >> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and 
> >> >> >> after some
> >> basic
> >> >> >> operations most of the OSD's went down
> >> >> >>
> >> >> >> Hi Tuomas,
> >> >> >>
> >> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you please
> try it?
> >> >> >> The build will appear here
> >> >> >>
> >> >> >>
> >> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08
> >> >> >> bf531331afd5e
> >> >> >> 2eb514067f72afda11bcde286
> >> >> >>
> >> >> >> (or a similar url; adjust for your distro).
> >> >> >>
> >> >> >> Thanks!
> >> >> >> sage
> >> >> >>
> >> >> >>
> >> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> >> >> >>
> >> >> >> > [adding ceph-devel]
> >> >> >> >
> >> >> >> > Okay, I see the problem.  This seems to be unrelated ot the 
> >> >> >> > giant -> hammer move... it's a result of the tiering changes you
> made:
> >> >> >> >
> >> >> >> > > > > > > > The following:
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > ceph osd tier add img images --force-nonempty 
> >> >> >> > > > > > > > ceph osd tier cache-mode images forward ceph osd 
> >> >> >> > > > > > > > tier set-overlay img images
> >> >> >> >
> >> >> >> > Specifically, --force-nonempty bypassed important safety checks.
> >> >> >> >
> >> >> >> > 1. images had snapshots (and removed_snaps)
> >> >> >> >
> >> >> >> > 2. images was added as a tier *of* img, and img's 
> >> >> >> > removed_snaps was copied to images, clobbering the 
> >> >> >> > removed_snaps value (see
> >> >> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
> >> >> >> >
> >> >> >> > 3. tiering relation was undone, but removed_snaps was still 
> >> >> >> > gone
> >> >> >> >
> >> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is 
> >> >> >> > initialized with the older map.  later, in PGPool::update(), 
> >> >> >> > we assume that removed_snaps alwasy grows (never shrinks) and we
> trigger an assert.
> >> >> >> >
> >> >> >> > To fix this I think we need to do 2 things:
> >> >> >> >
> >> >> >> > 1. make the OSD forgiving out removed_snaps getting smaller.  
> >> >> >> > This is probably a good thing anyway: once we know snaps are 
> >> >> >> > removed on all OSDs we can prune the interval_set in the
> OSDMap.  Maybe.
> >> >> >> >
> >> >> >> > 2. Fix the mon to prevent this from happening, *even* when 
> >> >> >> > --force-nonempty is specified.  (This is the root cause.)
> >> >> >> >
> >> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> >> >> >> >
> >> >> >> > sage
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Idea was to make images as a tier to img, move 
> >> >> >> > > > > > > > data to img then change
> >> >> >> > > > > > > clients to use the new img pool.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Br,
> >> >> >> > > > > > > > Tuomas
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > > Can you explain exactly what you mean by:
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > "Also I created one pool for tier to be able to 
> >> >> >> > > > > > > > > move data without
> >> >> >> > > > > > > outage."
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > -Sam
> >> >> >> > > > > > > > > ----- Original Message -----
> >> >> >> > > > > > > > > From: "tuomas juntunen"
> >> >> >> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> >> >> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> >> >> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> >> >> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to 
> >> >> >> > > > > > > > > Hammer and after some basic operations most of 
> >> >> >> > > > > > > > > the OSD's went down
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Hi
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Any solution for this yet?
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Br,
> >> >> >> > > > > > > > > Tuomas
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >> It looks like you may have hit
> >> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Ian R. Colle
> >> >> >> > > > > > > > >> Global Director of Software Engineering Red 
> >> >> >> > > > > > > > >> Hat (Inktank is now part of Red Hat!) 
> >> >> >> > > > > > > > >> http://www.linkedin.com/in/ircolle
> >> >> >> > > > > > > > >> http://www.twitter.com/ircolle
> >> >> >> > > > > > > > >> Cell: +1.303.601.7713
> >> >> >> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> ----- Original Message -----
> >> >> >> > > > > > > > >> From: "tuomas juntunen"
> >> >> >> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> >> >> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> >> >> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to 
> >> >> >> > > > > > > > >> Hammer and after some basic operations most of 
> >> >> >> > > > > > > > >> the OSD's went down
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 
> >> >> >> > > > > > > > >> Hammer
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Then created new pools and deleted some old 
> >> >> >> > > > > > > > >> ones. Also I created one pool for tier to be 
> >> >> >> > > > > > > > >> able to move data without
> >> >> >> > > outage.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> After these operations all but 10 OSD's are 
> >> >> >> > > > > > > > >> down and creating this kind of messages to 
> >> >> >> > > > > > > > >> logs, I get more than 100gb of these in a
> >> >> >> > > > > > night:
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> osd.23
> >> >> >> > > pg_epoch:
> >> >> >> > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started     -18> 
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.808596 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Start     -17> 
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.808608 7fd8e748d700  1
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
> >> >> >> > > > > > > > >> local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: 
> >> >> >> > > > > > > > >> transitioning to
> >> Stray
> >> >> >> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 
> >> >> >> > > > > > > > >>0.000000     -15> 2015-04-27 10:17:08.808637 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
> >> >> >> > > > > > > > >>local-les=16609
> >> >> >> > > > > > > > >> n=0
> >> >> >> > > > > > > > >> ec=1 les/c
> >> >> >> > > > > > > > >> 16609/16659
> >> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray     
> >> >> >> > > > > > > > >>-14> 2015-04-27 10:17:08.808796 7fd8e748d700  
> >> >> >> > > > > > > > >>5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] exit Reset 0.119467 4 
> >> >> >> > > > > > > > >>0.000037     -13> 2015-04-27 10:17:08.808817 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Started     
> >> >> >> > > > > > > > >>-12> 2015-04-27 10:17:08.808828 7fd8e748d700  
> >> >> >> > > > > > > > >>5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Start     
> >> >> >> > > > > > > > >>-11> 2015-04-27 10:17:08.808838 7fd8e748d700  
> >> >> >> > > > > > > > >>1
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY]
> >> >> >> > > > > > > > >> state<Start>: transitioning to Stray     
> >> >> >> > > > > > > > >>-10> 2015-04-27 10:17:08.808849 7fd8e748d700  
> >> >> >> > > > > > > > >>5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] exit Start 0.000020 0 
> >> >> >> > > > > > > > >>0.000000      -9> 2015-04-27 
> >> >> >> > > > > > > > >>10:17:08.808861 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 
> >> >> >> > > > > > > > >>ec=17863  les/c
> >> >> >> > > > > > > > >> 17879/17879
> >> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 
> >> >> >> > > > > > > > >>crt=0'0  inactive NOTIFY] enter Started/Stray  
> >> >> >> > > > > > > > >>    -8> 2015-04-27 10:17:08.809427 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 0.000165 
> >> >> >> > > > > > > > >>     -7> 2015-04-27 10:17:08.809445 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Started      -6> 
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.809456 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Start      -5> 
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.809468 7fd8e748d700  1
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive]
> >> >> >> > > > > > > > >> state<Start>: transitioning to Primary      
> >> >> >> > > > > > > > >>-4> 2015-04-27 10:17:08.809479 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000  
> >> >> >> > > > > > > > >>    -3> 2015-04-27 10:17:08.809492 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary      
> >> >> >> > > > > > > > >>-2> 2015-04-27 10:17:08.809502 7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary/Peering  
> >> >> >> > > > > > > > >>    -1> 2015-04-27 10:17:08.809513 
> >> >> >> > > > > > > > >>7fd8e748d700  5
> >> >> >> > > > > > > > >> osd.23
> >> >> >> > > > pg_epoch:
> >> >> >> > > > >
> >> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 
> >> >> >> > > > > > > > >>les/c
> >> >> >> > > > > > > > >> 16127/16344
> >> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 
> >> >> >> > > > > > > > >>crt=0'0 mlcod
> >> >> >> > > > > > > > >> 0'0 peering] enter 
> >> >> >> > > > > > > > >>Started/Primary/Peering/GetInfo       0> 
> >> >> >> > > > > > > > >>2015-04-27 10:17:08.813837 7fd8e748d700 -1
> >> >> >> > > > > > > ./include/interval_set.h:
> >> >> >> > > > > > > > >> In
> >> >> >> > > > > > > > >> function 'void interval_set<T>::erase(T, T) 
> >> >> >> > > > > > > > >> [with T =
> >> >> >> > > snapid_t]'
> >> >> >> > > > > > > > >> thread
> >> >> >> > > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> >> >> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED 
> >> >> >> > > > > > > > >> assert(_size >=
> >> >> >> > > > > > > > >> 0)
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>  ceph version 0.94.1
> >> >> >> > > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> >> >> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, 
> >> >> >> > > > > > > > >>char
> >> const*,
> >> >> >> > > > > > > > >> int, char
> >> >> >> > > > > > > > >> const*)+0x8b)
> >> >> >> > > > > > > > >> [0xbc271b]
> >> >> >> > > > > > > > >>  2:
> >> >> >> > > > > > > > >> 
> >> >> >> > > > > > > > >>(interval_set<snapid_t>::subtract(interval_set<
> >> >> >> > > > > > > > >>snapid_t
> >> >> >> > > > > > > > >> >
> >> >> >> > > > > > > > >> const&)+0xb0) [0x82cd50]   3: 
> >> >> >> > > > > > > > >>(PGPool::update(std::tr1::shared_ptr<OSDMap
> >> >> >> > > > > > > > >> const>)+0x52e) [0x80113e]
> >> >> >> > > > > > > > >>  4:
> >> (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> >> >> >> > > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>, 
> >> >> >> > > > > > > > >> const>std::vector<int,
> >> >> >> > > > > > > > >> std::allocator<int> >&, int, std::vector<int, 
> >> >> >> > > > > > > > >> std::allocator<int>
> >> >> >> > > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> >> >> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*,  
> >> >> >> > > > > > > > >>ThreadPool::TPHandle&, PG::RecoveryCtx*,  
> >> >> >> > > > > > > > >>std::set<boost::intrusive_ptr<PG>,
> >> >> >> > > > > > > > >> std::less<boost::intrusive_ptr<PG> >,  
> >> >> >> > > > > > > > >>std::allocator<boost::intrusive_ptr<PG> > 
> >> >> >> > > > > > > > >>>*)+0x2c3)  [0x6b0e43]   6: 
> >> >> >> > > > > > > > >>(OSD::process_peering_events(std::list<PG*,
> >> >> >> > > > > > > > >> std::allocator<PG*>
> >> >> >> > > > > > > > >> > const&,
> >> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]   7: 
> >> >> >> > > > > > > > >>(OSD::PeeringWQ::_process(std::list<PG*,
> >> >> >> > > > > > > > >> std::allocator<PG*>
> >> >> >> > > > > > > > >> > const&,
> >> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]   8:
> >> (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> >> >> >> > > > > > > > >> [0xbb38ae]
> >> >> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) 
> >> >> >> > > > > > > > >>[0xbb4950]   10: (()+0x8182) [0x7fd906946182] 
> >> >> >> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the 
> >> >> >> > > > > > > > >> following messages, also lots of
> >> >> >> > > > > > > them.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF]
> from='client.?
> >> >> >> > > > > > > 10.20.0.13:0/1174409'
> >> >> >> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush 
> >> >> >> > > > > > > > >> create-or-move",
> >> >> >> > > > "args":
> >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30,
> "weight":
> >> >> 1.82}]:
> >> >> >>
> >> >> >> > > > > > > > >> dispatch
> >> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF]
> from='client.?
> >> >> >> > > > > > > 10.20.0.13:0/1174483'
> >> >> >> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush 
> >> >> >> > > > > > > > >> create-or-move",
> >> >> >> > > > "args":
> >> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26,
> "weight":
> >> >> 1.82}]:
> >> >> >>
> >> >> >> > > > > > > > >> dispatch
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, 
> >> >> >> > > > > > > > >> nodes are also mons and mds's to save servers. 
> >> >> >> > > > > > > > >> All run Ubuntu
> >> >> >> 14.04.2.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> I have pretty much tried everything I could think
> of.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Restarting daemons doesn't help.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Any help would be appreciated. I can also 
> >> >> >> > > > > > > > >> provide more logs if necessary. They just seem 
> >> >> >> > > > > > > > >> to get pretty large in few
> >> >> >> > > moments.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Thank you
> >> >> >> > > > > > > > >> Tuomas
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> ______________________________________________
> >> >> >> > > > > > > > >> _ ceph-users mailing list 
> >> >> >> > > > > > > > >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> >> >> >> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-
> >> >> >> > > > > > > > >> ceph.com
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > _______________________________________________
> >> >> >> > > > > > > > > ceph-users mailing list 
> >> >> >> > > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
> >> >> >> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-c
> >> >> >> > > > > > > > > eph.com
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > _______________________________________________
> >> >> >> > > > > > > > ceph-users mailing list ceph-users-idqoXFIVOFLNfb0M+mGrxg@public.gmane.org 
> >> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-cep
> >> >> >> > > > > > > > h.com
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > _______________________________________________
> >> >> >> > > > > > > > ceph-users mailing list ceph-users-idqoXFIVOFLNfb0M+mGrxg@public.gmane.org 
> >> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-cep
> >> >> >> > > > > > > > h.com
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > >
> >> >> >> > > > > >
> >> >> >> > > > >
> >> >> >> > > > >
> >> >> >> > > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > _______________________________________________
> >> >> >> > ceph-users mailing list
> >> >> >> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe 
> >> >> ceph-devel" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org 
> >> >> More majordomo info at  
> >> >> http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >>
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >>
> >>
> 
> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]         ` <56c96198e8b0d8e70fbf96fdd209d70a-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
@ 2015-05-04  4:11           ` Tuomas Juntunen
  0 siblings, 0 replies; 13+ messages in thread
From: Tuomas Juntunen @ 2015-05-04  4:11 UTC (permalink / raw)
  To: 'Sage Weil'
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hi

Thanks Sage, I got it working now. Everything else seems to be ok, except
that mds is reporting "mds cluster is degraded" and I'm not sure what could
be wrong. The mds is running, all osds are up, and the pgs are active+clean
and active+clean+replay.

I had to delete some empty pools that had been created while the osd's were
not working, and then recovery started to go through.
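
A rough sketch of that kind of cleanup, assuming the stock Hammer CLI (the pool
name "emptypool" is just a placeholder):

  ceph osd lspools     # list pools and spot the ones created while the osds were down
  ceph df              # confirm the candidate pools really hold no data
  ceph osd pool delete emptypool emptypool --yes-i-really-really-mean-it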

It seems mds is not that stable; this isn't the first time it has gone
degraded. Before, it just started working again on its own, but now I can't
get it back to a working state.
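
A minimal sketch for prodding the mds along, assuming the upstart jobs shipped
with the Hammer packages on Ubuntu 14.04 (the restart syntax depends on the
init system in use):

  ceph mds stat                # current mdsmap state, e.g. up:replay vs up:active
  sudo restart ceph-mds-all    # restart every mds instance on this node (upstart)
  ceph -w                      # watch the mds move through replay/reconnect/rejoin to active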

Thanks

Br,
Tuomas


-----Original Message-----
From: tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org
[mailto:tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org] 
Sent: 1 May 2015 21:14
To: Sage Weil
Cc: tuomas.juntunen; ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
operations most of the OSD's went down

Thanks, I'll do this when the commit is available and report back.

And indeed, I'll change to the official ones after everything is ok.
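
A rough sketch of switching back to the release packages on Ubuntu 14.04 once
the cleanup is done (the repository line and package list here are assumptions;
match them to whatever is actually in sources.list):

  # point apt back at the official hammer repository instead of the gitbuilder URL
  echo "deb http://ceph.com/debian-hammer/ trusty main" | sudo tee /etc/apt/sources.list.d/ceph.list
  sudo apt-get update
  sudo apt-get install --reinstall ceph ceph-common
  sudo restart ceph-osd-all    # restart the osds so they run the reinstalled binaries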

Br,
Tuomas

> On Fri, 1 May 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
>> Hi
>>
>> I deleted the images and img pools and started osd's, they still die.
>>
>> Here's a log of one of the osd's after this, if you need it.
>>
>> http://beta.xaasbox.com/ceph/ceph-osd.19.log
>
> I've pushed another commit that should avoid this case, sha1 
> 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
>
> Note that once the pools are fully deleted (shouldn't take too long 
> once the osds are up and stabilize) you should switch back to the 
> normal packages that don't have these workarounds.
>
> sage
>
>
>
>>
>> Br,
>> Tuomas
>>
>>
>> > Thanks man. I'll try it tomorrow. Have a good one.
>> >
>> > Br,T
>> >
>> > -------- Original message --------
>> > From: Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
>> > Date: 30/04/2015  18:23  (GMT+02:00)
>> > To: Tuomas Juntunen <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
>> > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after 
>> > some basic
>>
>> > operations most of the OSD's went down
>> >
>> > On Thu, 30 Apr 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
>> >> Hey
>> >>
>> >> Yes I can drop the images data, you think this will fix it?
>> >
>> > It's a slightly different assert that (I believe) should not 
>> > trigger once the pool is deleted.  Please give that a try and if 
>> > you still hit it I'll whip up a workaround.
>> >
>> > Thanks!
>> > sage
>> >
>> >  >
>> >>
>> >> Br,
>> >>
>> >> Tuomas
>> >>
>> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
>> >> >> Hi
>> >> >>
>> >> >> I updated that version and it seems that something did happen, 
>> >> >> the osd's stayed up for a while and 'ceph status' got updated. 
>> >> >> But then in couple
>> of
>> >> >> minutes, they all went down the same way.
>> >> >>
>> >> >> I have attached new 'ceph osd dump -f json-pretty' and got a 
>> >> >> new log
>> from
>> >> >> one of the osd's with osd debug = 20, 
>> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
>> >> >
>> >> > Sam mentioned that you had said earlier that this was not critical
data?
>> >> > If not, I think the simplest thing is to just drop those pools.  
>> >> > The important thing (from my perspective at least :) is that we 
>> >> > understand
>> the
>> >> > root cause and can prevent this in the future.
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >> Thank you!
>> >> >>
>> >> >> Br,
>> >> >> Tuomas
>> >> >>
>> >> >>
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
>> >> >> Sent: 28 April 2015 23:57
>> >> >> To: Tuomas Juntunen
>> >> >> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and 
>> >> >> after some
>> basic
>> >> >> operations most of the OSD's went down
>> >> >>
>> >> >> Hi Tuomas,
>> >> >>
>> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you please
try it?
>> >> >> The build will appear here
>> >> >>
>> >> >>
>> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08
>> >> >> bf531331afd5e
>> >> >> 2eb514067f72afda11bcde286
>> >> >>
>> >> >> (or a similar url; adjust for your distro).
>> >> >>
>> >> >> Thanks!
>> >> >> sage
>> >> >>
>> >> >>
>> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
>> >> >>
>> >> >> > [adding ceph-devel]
>> >> >> >
>> >> >> > Okay, I see the problem.  This seems to be unrelated ot the 
>> >> >> > giant -> hammer move... it's a result of the tiering changes you
made:
>> >> >> >
>> >> >> > > > > > > > The following:
>> >> >> > > > > > > >
>> >> >> > > > > > > > ceph osd tier add img images --force-nonempty 
>> >> >> > > > > > > > ceph osd tier cache-mode images forward ceph osd 
>> >> >> > > > > > > > tier set-overlay img images
>> >> >> >
>> >> >> > Specifically, --force-nonempty bypassed important safety checks.
>> >> >> >
>> >> >> > 1. images had snapshots (and removed_snaps)
>> >> >> >
>> >> >> > 2. images was added as a tier *of* img, and img's 
>> >> >> > removed_snaps was copied to images, clobbering the 
>> >> >> > removed_snaps value (see
>> >> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
>> >> >> >
>> >> >> > 3. tiering relation was undone, but removed_snaps was still 
>> >> >> > gone
>> >> >> >
>> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is 
>> >> >> > initialized with the older map.  later, in PGPool::update(), 
>> >> >> > we assume that removed_snaps alwasy grows (never shrinks) and we
trigger an assert.
>> >> >> >
>> >> >> > To fix this I think we need to do 2 things:
>> >> >> >
>> >> >> > 1. make the OSD forgiving out removed_snaps getting smaller.  
>> >> >> > This is probably a good thing anyway: once we know snaps are 
>> >> >> > removed on all OSDs we can prune the interval_set in the
OSDMap.  Maybe.
>> >> >> >
>> >> >> > 2. Fix the mon to prevent this from happening, *even* when 
>> >> >> > --force-nonempty is specified.  (This is the root cause.)
>> >> >> >
>> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
>> >> >> >
>> >> >> > sage
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Idea was to make images as a tier to img, move 
>> >> >> > > > > > > > data to img then change
>> >> >> > > > > > > clients to use the new img pool.
>> >> >> > > > > > > >
>> >> >> > > > > > > > Br,
>> >> >> > > > > > > > Tuomas
>> >> >> > > > > > > >
>> >> >> > > > > > > > > Can you explain exactly what you mean by:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > "Also I created one pool for tier to be able to 
>> >> >> > > > > > > > > move data without
>> >> >> > > > > > > outage."
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > -Sam
>> >> >> > > > > > > > > ----- Original Message -----
>> >> >> > > > > > > > > From: "tuomas juntunen"
>> >> >> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
>> >> >> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> >> >> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
>> >> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to 
>> >> >> > > > > > > > > Hammer and after some basic operations most of 
>> >> >> > > > > > > > > the OSD's went down
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Hi
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Any solution for this yet?
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Br,
>> >> >> > > > > > > > > Tuomas
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >> It looks like you may have hit
>> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Ian R. Colle
>> >> >> > > > > > > > >> Global Director of Software Engineering Red 
>> >> >> > > > > > > > >> Hat (Inktank is now part of Red Hat!) 
>> >> >> > > > > > > > >> http://www.linkedin.com/in/ircolle
>> >> >> > > > > > > > >> http://www.twitter.com/ircolle
>> >> >> > > > > > > > >> Cell: +1.303.601.7713
>> >> >> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> ----- Original Message -----
>> >> >> > > > > > > > >> From: "tuomas juntunen"
>> >> >> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
>> >> >> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
>> >> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to 
>> >> >> > > > > > > > >> Hammer and after some basic operations most of 
>> >> >> > > > > > > > >> the OSD's went down
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 
>> >> >> > > > > > > > >> Hammer
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Then created new pools and deleted some old 
>> >> >> > > > > > > > >> ones. Also I created one pool for tier to be 
>> >> >> > > > > > > > >> able to move data without
>> >> >> > > outage.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> After these operations all but 10 OSD's are 
>> >> >> > > > > > > > >> down and creating this kind of messages to 
>> >> >> > > > > > > > >> logs, I get more than 100gb of these in a
>> >> >> > > > > > night:
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 
>> >> >> > > > > > > > >>7fd8e748d700  5
>> osd.23
>> >> >> > > pg_epoch:
>> >> >> > > >
>> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
>> >> >> > > > > > > > >>local-les=16609
>> >> >> > > > > > > > >> n=0
>> >> >> > > > > > > > >> ec=1 les/c
>> >> >> > > > > > > > >> 16609/16659
>> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> >> >> > > > > > > > >> pi=15659-16589/42
>> >> >> > > > > > > > >> crt=8480'7 lcod
>> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started     -18> 
>> >> >> > > > > > > > >>2015-04-27 10:17:08.808596 7fd8e748d700  5
>> >> >> > > > > > > > >> osd.23
>> >> >> > > > pg_epoch:
>> >> >> > > > >
>> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
>> >> >> > > > > > > > >>local-les=16609
>> >> >> > > > > > > > >> n=0
>> >> >> > > > > > > > >> ec=1 les/c
>> >> >> > > > > > > > >> 16609/16659
>> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> >> >> > > > > > > > >> pi=15659-16589/42
>> >> >> > > > > > > > >> crt=8480'7 lcod
>> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Start     -17> 
>> >> >> > > > > > > > >>2015-04-27 10:17:08.808608 7fd8e748d700  1
>> >> >> > > > > > > > >> osd.23
>> >> >> > > > pg_epoch:
>> >> >> > > > >
>> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
>> >> >> > > > > > > > >> local-les=16609
>> >> >> > > > > > > > >> n=0
>> >> >> > > > > > > > >> ec=1 les/c
>> >> >> > > > > > > > >> 16609/16659
>> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
>> >> >> > > > > > > > >> pi=15659-16589/42
>> >> >> > > > > > > > >> crt=8480'7 lcod
>> >> >> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: 
>> >> >> > > > > > > > >> transitioning to
>> Stray
>> >> >> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 
>> >> >> > > > > > > > >>7fd8e748d700  5
>> >> >> > > > > > > > >> osd.23
>> >> >> > > > pg_epoch:
>> >> >> > > > >
>> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] 
>> >> >> > > > > > > > >>local-les=16609
>> >> >> > > > > > > > >> n=0
>> >> >> > > > > > > > >> ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
>> >> >> > > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Started/Stray
>> >> >> > > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Reset 0.119467 4 0.000037
>> >> >> > > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started
>> >> >> > > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Start
>> >> >> > > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
>> >> >> > > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Start 0.000020 0 0.000000
>> >> >> > > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started/Stray
>> >> >> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Reset 7.511623 45 0.000165
>> >> >> > > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started
>> >> >> > > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Start
>> >> >> > > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
>> >> >> > > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Start 0.000023 0 0.000000
>> >> >> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
>> >> >> > > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
>> >> >> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
>> >> >> > > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7fd8e748d700 time 2015-04-27 10:17:08.809899
>> >> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 0)
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>> >> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b]
>> >> >> > > > > > > > >>  2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xb0) [0x82cd50]
>> >> >> > > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x52e) [0x80113e]
>> >> >> > > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
>> >> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) [0x6b0e43]
>> >> >> > > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x21c) [0x6b191c]
>> >> >> > > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x709278]
>> >> >> > > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
>> >> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
>> >> >> > > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
>> >> >> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the following messages, also lots of them.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.? 10.20.0.13:0/1174409' entity='osd.30' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]: dispatch
>> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.? 10.20.0.13:0/1174483' entity='osd.26' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]: dispatch
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are also mons and mds's to save servers. All run Ubuntu 14.04.2.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> I have pretty much tried everything I could think of.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Restarting daemons doesn't help.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Any help would be appreciated. I can also provide more logs if necessary. They just seem to get pretty large in few moments.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Thank you
>> >> >> > > > > > > > >> Tuomas
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> ______________________________________________
>> >> >> > > > > > > > >> _ ceph-users mailing list 
>> >> >> > > > > > > > >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
>> >> >> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-
>> >> >> > > > > > > > >> ceph.com
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > _______________________________________________
>> >> >> > > > > > > > > ceph-users mailing list 
>> >> >> > > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
>> >> >> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-c
>> >> >> > > > > > > > > eph.com
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > _______________________________________________
>> >> >> > > > > > > > ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
>> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-cep
>> >> >> > > > > > > > h.com
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > _______________________________________________
>> >> >> > > > > > > > ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org 
>> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-cep
>> >> >> > > > > > > > h.com
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > >
>> >> >> > >
>> >> >> > >
>> >> >> > _______________________________________________
>> >> >> > ceph-users mailing list
>> >> >> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe 
>> >> ceph-devel" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org 
>> >> More majordomo info at  
>> >> http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
       [not found]   ` <dc614dd85caf6cecfd59897b96a019ad-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
@ 2015-05-01 16:04     ` Sage Weil
  2015-05-01 18:13       ` [ceph-users] " tuomas.juntunen
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2015-05-01 16:04 UTC (permalink / raw)
  To: tuomas.juntunen
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: TEXT/PLAIN, Size: 24118 bytes --]

On Fri, 1 May 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> Hi
> 
> I deleted the images and img pools and started osd's, they still die.
> 
> Here's a log of one of the osd's after this, if you need it.
> 
> http://beta.xaasbox.com/ceph/ceph-osd.19.log

I've pushed another commit that should avoid this case, sha1
425bd4e1dba00cc2243b0c27232d1f9740b04e34.
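
A minimal sketch of how that exact build could be picked up on a trusty node, following the gitbuilder URL pattern quoted earlier in this thread (the repo layout, package set, and upstart job name are assumptions, not something stated in the thread; adjust for your distro):

  # Point apt at the gitbuilder repo for that sha1 (layout assumed from the
  # earlier wip-hammer-snaps link), then upgrade only the ceph packages.
  echo "deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/425bd4e1dba00cc2243b0c27232d1f9740b04e34 trusty main" \
    > /etc/apt/sources.list.d/ceph-wip.list
  apt-get update
  apt-get install --only-upgrade ceph ceph-common
  restart ceph-osd-all   # upstart job name on Ubuntu 14.04 (assumed)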

Note that once the pools are fully deleted (shouldn't take too long once 
the osds are up and stabilize) you should switch back to the normal 
packages that don't have these workarounds.
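
A rough way to check for that point before switching back (a sketch; the ceph and apt commands are standard, but the exact release version string and the sources.list filename from the sketch above are assumptions):

  ceph osd lspools   # the deleted pools should no longer be listed
  ceph -s            # wait until the OSDs stay up and the PG states settle
  # then drop the workaround repo and return to the stock hammer packages
  rm /etc/apt/sources.list.d/ceph-wip.list
  apt-get update
  apt-get install ceph=0.94.1-1trusty ceph-common=0.94.1-1trusty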

sage



> 
> Br,
> Tuomas
> 
> 
> > Thanks man. I'll try it tomorrow. Have a good one.
> >
> > Br,T
> >
> > -------- Original message --------
> > From: Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
> > Date: 30/04/2015  18:23  (GMT+02:00)
> > To: Tuomas Juntunen <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic
> 
> > operations most of the OSD's went down
> >
> > On Thu, 30 Apr 2015, tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org wrote:
> >> Hey
> >>
> >> Yes I can drop the images data, you think this will fix it?
> >
> > It's a slightly different assert that (I believe) should not trigger once
> > the pool is deleted.  Please give that a try and if you still hit it I'll
> > whip up a workaround.
> >
> > Thanks!
> > sage
> >
> >  >
> >>
> >> Br,
> >>
> >> Tuomas
> >>
> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> >> >> Hi
> >> >>
> >> >> I updated that version and it seems that something did happen, the osd's
> >> >> stayed up for a while and 'ceph status' got updated. But then in couple of
> >> >> minutes, they all went down the same way.
> >> >>
> >> >> I have attached new 'ceph osd dump -f json-pretty' and got a new log from
> >> >> one of the osd's with osd debug = 20,
> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> >> >
> >> > Sam mentioned that you had said earlier that this was not critical data?
> >> > If not, I think the simplest thing is to just drop those pools.  The
> >> > important thing (from my perspective at least :) is that we understand the
> >> > root cause and can prevent this in the future.
> >> >
> >> > sage
> >> >
> >> >
> >> >>
> >> >> Thank you!
> >> >>
> >> >> Br,
> >> >> Tuomas
> >> >>
> >> >>
> >> >>
> >> >> -----Original Message-----
> >> >> From: Sage Weil [mailto:sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org]
> >> >> Sent: 28. huhtikuuta 2015 23:57
> >> >> To: Tuomas Juntunen
> >> >> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
> >> >> operations most of the OSD's went down
> >> >>
> >> >> Hi Tuomas,
> >> >>
> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you please try it?
> >> >> The build will appear here
> >> >>
> >> >>
> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e
> >> >> 2eb514067f72afda11bcde286
> >> >>
> >> >> (or a similar url; adjust for your distro).
> >> >>
> >> >> Thanks!
> >> >> sage
> >> >>
> >> >>
> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> >> >>
> >> >> > [adding ceph-devel]
> >> >> >
> >> >> > Okay, I see the problem.  This seems to be unrelated ot the giant ->
> >> >> > hammer move... it's a result of the tiering changes you made:
> >> >> >
> >> >> > > > > > > > The following:
> >> >> > > > > > > >
> >> >> > > > > > > > ceph osd tier add img images --force-nonempty ceph osd
> >> >> > > > > > > > tier cache-mode images forward ceph osd tier set-overlay
> >> >> > > > > > > > img images
> >> >> >
> >> >> > Specifically, --force-nonempty bypassed important safety checks.
> >> >> >
> >> >> > 1. images had snapshots (and removed_snaps)
> >> >> >
> >> >> > 2. images was added as a tier *of* img, and img's removed_snaps was
> >> >> > copied to images, clobbering the removed_snaps value (see
> >> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
> >> >> >
> >> >> > 3. tiering relation was undone, but removed_snaps was still gone
> >> >> >
> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is initialized
> >> >> > with the older map.  later, in PGPool::update(), we assume that
> >> >> > removed_snaps alwasy grows (never shrinks) and we trigger an assert.
> >> >> >
> >> >> > To fix this I think we need to do 2 things:
> >> >> >
> >> >> > 1. make the OSD forgiving out removed_snaps getting smaller.  This is
> >> >> > probably a good thing anyway: once we know snaps are removed on all
> >> >> > OSDs we can prune the interval_set in the OSDMap.  Maybe.
> >> >> >
> >> >> > 2. Fix the mon to prevent this from happening, *even* when
> >> >> > --force-nonempty is specified.  (This is the root cause.)
> >> >> >
> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >
> >> >> > > > > > > >
> >> >> > > > > > > > Idea was to make images as a tier to img, move data to img
> >> >> > > > > > > > then change
> >> >> > > > > > > clients to use the new img pool.
> >> >> > > > > > > >
> >> >> > > > > > > > Br,
> >> >> > > > > > > > Tuomas
> >> >> > > > > > > >
> >> >> > > > > > > > > Can you explain exactly what you mean by:
> >> >> > > > > > > > >
> >> >> > > > > > > > > "Also I created one pool for tier to be able to move
> >> >> > > > > > > > > data without
> >> >> > > > > > > outage."
> >> >> > > > > > > > >
> >> >> > > > > > > > > -Sam
> >> >> > > > > > > > > ----- Original Message -----
> >> >> > > > > > > > > From: "tuomas juntunen"
> >> >> > > > > > > > > <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> >> > > > > > > > > To: "Ian Colle" <icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> >> > > > > > > > > Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> >> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer
> >> >> > > > > > > > > and after some basic operations most of the OSD's went
> >> >> > > > > > > > > down
> >> >> > > > > > > > >
> >> >> > > > > > > > > Hi
> >> >> > > > > > > > >
> >> >> > > > > > > > > Any solution for this yet?
> >> >> > > > > > > > >
> >> >> > > > > > > > > Br,
> >> >> > > > > > > > > Tuomas
> >> >> > > > > > > > >
> >> >> > > > > > > > >> It looks like you may have hit
> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> Ian R. Colle
> >> >> > > > > > > > >> Global Director
> >> >> > > > > > > > >> of Software Engineering Red Hat (Inktank is now part of
> >> >> > > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
> >> >> > > > > > > > >> http://www.twitter.com/ircolle
> >> >> > > > > > > > >> Cell: +1.303.601.7713
> >> >> > > > > > > > >> Email: icolle-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> ----- Original Message -----
> >> >> > > > > > > > >> From: "tuomas juntunen"
> >> >> > > > > > > > >> <tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g@public.gmane.org>
> >> >> > > > > > > > >> To: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> >> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and
> >> >> > > > > > > > >> after some basic operations most of the OSD's went down
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> Then created new pools and deleted some old ones. Also
> >> >> > > > > > > > >> I created one pool for tier to be able to move data
> >> >> > > > > > > > >> without
> >> >> > > outage.
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> After these operations all but 10 OSD's are down and
> >> >> > > > > > > > >> creating this kind of messages to logs, I get more than
> >> >> > > > > > > > >> 100gb of these in a
> >> >> > > > > > night:
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23
> >> >> > > pg_epoch:
> >> >> > > >
> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> >> > > > > > > > >> n=0
> >> >> > > > > > > > >> ec=1 les/c
> >> >> > > > > > > > >> 16609/16659
> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started
> >> >> > > > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> >> > > > > > > > >> n=0
> >> >> > > > > > > > >> ec=1 les/c
> >> >> > > > > > > > >> 16609/16659
> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Start
> >> >> > > > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> >> > > > > > > > >> n=0
> >> >> > > > > > > > >> ec=1 les/c
> >> >> > > > > > > > >> 16609/16659
> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> >> >> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> >> > > > > > > > >> n=0
> >> >> > > > > > > > >> ec=1 les/c
> >> >> > > > > > > > >> 16609/16659
> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> > > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> >> >> > > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> >> > > > > > > > >> n=0
> >> >> > > > > > > > >> ec=1 les/c
> >> >> > > > > > > > >> 16609/16659
> >> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> >> > > > > > > > >> pi=15659-16589/42
> >> >> > > > > > > > >> crt=8480'7 lcod
> >> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray
> >> >> > > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> >> > > > > > > > >> les/c
> >> >> > > > > > > > >> 17879/17879
> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> >> > > > > > > > >> inactive NOTIFY] exit Reset 0.119467 4 0.000037
> >> >> > > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> >> > > > > > > > >> les/c
> >> >> > > > > > > > >> 17879/17879
> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> >> > > > > > > > >> inactive NOTIFY] enter Started
> >> >> > > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> >> > > > > > > > >> les/c
> >> >> > > > > > > > >> 17879/17879
> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> >> > > > > > > > >> inactive NOTIFY] enter Start
> >> >> > > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> >> > > > > > > > >> les/c
> >> >> > > > > > > > >> 17879/17879
> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> >> > > > > > > > >> inactive NOTIFY]
> >> >> > > > > > > > >> state<Start>: transitioning to Stray
> >> >> > > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> >> > > > > > > > >> les/c
> >> >> > > > > > > > >> 17879/17879
> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> >> > > > > > > > >> inactive NOTIFY] exit Start 0.000020 0 0.000000
> >> >> > > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> >> > > > > > > > >> les/c
> >> >> > > > > > > > >> 17879/17879
> >> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> >> > > > > > > > >> inactive NOTIFY] enter Started/Stray
> >> >> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 0.000165
> >> >> > > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 inactive] enter Started
> >> >> > > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 inactive] enter Start
> >> >> > > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 inactive]
> >> >> > > > > > > > >> state<Start>: transitioning to Primary
> >> >> > > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000
> >> >> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary
> >> >> > > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 inactive] enter Started/Primary/Peering
> >> >> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5
> >> >> > > > > > > > >> osd.23
> >> >> > > > pg_epoch:
> >> >> > > > >
> >> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> >> > > > > > > > >> 16127/16344
> >> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> >> > > > > > > > >> 0'0 peering] enter Started/Primary/Peering/GetInfo
> >> >> > > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1
> >> >> > > > > > > ./include/interval_set.h:
> >> >> > > > > > > > >> In
> >> >> > > > > > > > >> function 'void interval_set<T>::erase(T, T) [with T =
> >> >> > > snapid_t]'
> >> >> > > > > > > > >> thread
> >> >> > > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> >> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >=
> >> >> > > > > > > > >> 0)
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>  ceph version 0.94.1
> >> >> > > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> >> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*,
> >> >> > > > > > > > >> int, char
> >> >> > > > > > > > >> const*)+0x8b)
> >> >> > > > > > > > >> [0xbc271b]
> >> >> > > > > > > > >>  2:
> >> >> > > > > > > > >> (interval_set<snapid_t>::subtract(interval_set<snapid_t
> >> >> > > > > > > > >> >
> >> >> > > > > > > > >> const&)+0xb0) [0x82cd50]
> >> >> > > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap
> >> >> > > > > > > > >> const>)+0x52e) [0x80113e]
> >> >> > > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> >> >> > > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>,
> >> >> > > > > > > > >> const>std::vector<int,
> >> >> > > > > > > > >> std::allocator<int> >&, int, std::vector<int,
> >> >> > > > > > > > >> std::allocator<int>
> >> >> > > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> >> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*,
> >> >> > > > > > > > >> ThreadPool::TPHandle&, PG::RecoveryCtx*,
> >> >> > > > > > > > >> std::set<boost::intrusive_ptr<PG>,
> >> >> > > > > > > > >> std::less<boost::intrusive_ptr<PG> >,
> >> >> > > > > > > > >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3)
> >> >> > > > > > > > >> [0x6b0e43]
> >> >> > > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*,
> >> >> > > > > > > > >> std::allocator<PG*>
> >> >> > > > > > > > >> > const&,
> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> >> >> > > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*,
> >> >> > > > > > > > >> std::allocator<PG*>
> >> >> > > > > > > > >> > const&,
> >> >> > > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]
> >> >> > > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> >> >> > > > > > > > >> [0xbb38ae]
> >> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> >> >> > > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
> >> >> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the following
> >> >> > > > > > > > >> messages, also lots of
> >> >> > > > > > > them.
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
> >> >> > > > > > > 10.20.0.13:0/1174409'
> >> >> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush
> >> >> > > > > > > > >> create-or-move",
> >> >> > > > "args":
> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight":
> >> 1.82}]:
> >> >>
> >> >> > > > > > > > >> dispatch
> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
> >> >> > > > > > > 10.20.0.13:0/1174483'
> >> >> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush
> >> >> > > > > > > > >> create-or-move",
> >> >> > > > "args":
> >> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight":
> >> 1.82}]:
> >> >>
> >> >> > > > > > > > >> dispatch
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are
> >> >> > > > > > > > >> also mons and mds's to save servers. All run Ubuntu
> >> >> 14.04.2.
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> I have pretty much tried everything I could think of.
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> Restarting daemons doesn't help.
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> Any help would be appreciated. I can also provide more
> >> >> > > > > > > > >> logs if necessary. They just seem to get pretty large
> >> >> > > > > > > > >> in few
> >> >> > > moments.
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> Thank you
> >> >> > > > > > > > >> Tuomas
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>
> >> >> > > > > > > > >> _______________________________________________
> >> >> > > > > > > > >> ceph-users mailing list ceph-users-idqoXFIVOFLNfb0M+mGrxg@public.gmane.orgm
> >> >> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>
> >> >> > > > > > > > >>
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > > _______________________________________________
> >> >> > > > > > > > > ceph-users mailing list
> >> >> > > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > _______________________________________________
> >> >> > > > > > > > ceph-users mailing list
> >> >> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > _______________________________________________
> >> >> > > > > > > > ceph-users mailing list
> >> >> > > > > > > > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> > >
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list
> >> >> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >
> >> >> >
> >> >>
> >> >
> >>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
@ 2015-04-30 17:27 tuomas.juntunen
  2015-05-01 15:10 ` [ceph-users] " tuomas.juntunen
  0 siblings, 1 reply; 13+ messages in thread
From: tuomas.juntunen @ 2015-04-30 17:27 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA


[-- Attachment #1.1: Type: text/plain, Size: 20729 bytes --]

Thanks man. I'll try it tomorrow. Have a good one.

Br,T

-------- Original message --------
From: Sage Weil <sage@newdream.net> 
Date: 30/04/2015  18:23  (GMT+02:00) 
To: Tuomas Juntunen <tuomas.juntunen@databasement.fi> 
Cc: ceph-users@lists.ceph.com, ceph-devel@vger.kernel.org 
Subject: RE: [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down 

On Thu, 30 Apr 2015, tuomas.juntunen@databasement.fi wrote:
> Hey
> 
> Yes I can drop the images data, you think this will fix it?

It's a slightly different assert that (I believe) should not trigger once 
the pool is deleted.  Please give that a try and if you still hit it I'll 
whip up a workaround.

Thanks!
sage
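
The pool removal being discussed would look roughly like the following; the pool names come from earlier in the thread, and the hammer-era command syntax is assumed rather than confirmed here:

  ceph osd tier remove-overlay img       # only if the overlay is still set
  ceph osd tier remove img images        # only if the tier relation still exists
  ceph osd pool delete images images --yes-i-really-really-mean-it
  ceph osd pool delete img img --yes-i-really-really-mean-it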

 > 
> 
> Br,
> 
> Tuomas
> 
> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> >> Hi
> >>
> >> I updated that version and it seems that something did happen, the osd's
> >> stayed up for a while and 'ceph status' got updated. But then in couple of
> >> minutes, they all went down the same way.
> >>
> >> I have attached new 'ceph osd dump -f json-pretty' and got a new log from
> >> one of the osd's with osd debug = 20,
> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> >
> > Sam mentioned that you had said earlier that this was not critical data?
> > If not, I think the simplest thing is to just drop those pools.  The
> > important thing (from my perspective at least :) is that we understand the
> > root cause and can prevent this in the future.
> >
> > sage
> >
> >
> >>
> >> Thank you!
> >>
> >> Br,
> >> Tuomas
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Sage Weil [mailto:sage@newdream.net]
> >> Sent: 28. huhtikuuta 2015 23:57
> >> To: Tuomas Juntunen
> >> Cc: ceph-users@lists.ceph.com; ceph-devel@vger.kernel.org
> >> Subject: Re: [ceph-users] Upgrade from Giant to Hammer and after some basic
> >> operations most of the OSD's went down
> >>
> >> Hi Tuomas,
> >>
> >> I've pushed an updated wip-hammer-snaps branch.  Can you please try it?
> >> The build will appear here
> >>
> >>
> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e
> >> 2eb514067f72afda11bcde286
> >>
> >> (or a similar url; adjust for your distro).
> >>
> >> Thanks!
> >> sage
> >>
> >>
> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> >>
> >> > [adding ceph-devel]
> >> >
> >> > Okay, I see the problem.  This seems to be unrelated ot the giant ->
> >> > hammer move... it's a result of the tiering changes you made:
> >> >
> >> > > > > > > > The following:
> >> > > > > > > >
> >> > > > > > > > ceph osd tier add img images --force-nonempty ceph osd
> >> > > > > > > > tier cache-mode images forward ceph osd tier set-overlay
> >> > > > > > > > img images
> >> >
> >> > Specifically, --force-nonempty bypassed important safety checks.
> >> >
> >> > 1. images had snapshots (and removed_snaps)
> >> >
> >> > 2. images was added as a tier *of* img, and img's removed_snaps was
> >> > copied to images, clobbering the removed_snaps value (see
> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
> >> >
> >> > 3. tiering relation was undone, but removed_snaps was still gone
> >> >
> >> > 4. on OSD startup, when we load the PG, removed_snaps is initialized
> >> > with the older map.  later, in PGPool::update(), we assume that
> >> > removed_snaps alwasy grows (never shrinks) and we trigger an assert.
> >> >
> >> > To fix this I think we need to do 2 things:
> >> >
> >> > 1. make the OSD forgiving out removed_snaps getting smaller.  This is
> >> > probably a good thing anyway: once we know snaps are removed on all
> >> > OSDs we can prune the interval_set in the OSDMap.  Maybe.
> >> >
> >> > 2. Fix the mon to prevent this from happening, *even* when
> >> > --force-nonempty is specified.  (This is the root cause.)
> >> >
> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> >> >
> >> > sage
> >> >
> >> >
> >> >
> >> > > > > > > >
> >> > > > > > > > Idea was to make images as a tier to img, move data to img
> >> > > > > > > > then change
> >> > > > > > > clients to use the new img pool.
> >> > > > > > > >
> >> > > > > > > > Br,
> >> > > > > > > > Tuomas
> >> > > > > > > >
> >> > > > > > > > > Can you explain exactly what you mean by:
> >> > > > > > > > >
> >> > > > > > > > > "Also I created one pool for tier to be able to move
> >> > > > > > > > > data without
> >> > > > > > > outage."
> >> > > > > > > > >
> >> > > > > > > > > -Sam
> >> > > > > > > > > ----- Original Message -----
> >> > > > > > > > > From: "tuomas juntunen"
> >> > > > > > > > > <tuomas.juntunen@databasement.fi>
> >> > > > > > > > > To: "Ian Colle" <icolle@redhat.com>
> >> > > > > > > > > Cc: ceph-users@lists.ceph.com
> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> >> > > > > > > > > Subject: Re: [ceph-users] Upgrade from Giant to Hammer
> >> > > > > > > > > and after some basic operations most of the OSD's went
> >> > > > > > > > > down
> >> > > > > > > > >
> >> > > > > > > > > Hi
> >> > > > > > > > >
> >> > > > > > > > > Any solution for this yet?
> >> > > > > > > > >
> >> > > > > > > > > Br,
> >> > > > > > > > > Tuomas
> >> > > > > > > > >
> >> > > > > > > > >> It looks like you may have hit
> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> >> > > > > > > > >>
> >> > > > > > > > >> Ian R. Colle
> >> > > > > > > > >> Global Director
> >> > > > > > > > >> of Software Engineering Red Hat (Inktank is now part of
> >> > > > > > > > >> Red Hat!) http://www.linkedin.com/in/ircolle
> >> > > > > > > > >> http://www.twitter.com/ircolle
> >> > > > > > > > >> Cell: +1.303.601.7713
> >> > > > > > > > >> Email: icolle@redhat.com
> >> > > > > > > > >>
> >> > > > > > > > >> ----- Original Message -----
> >> > > > > > > > >> From: "tuomas juntunen"
> >> > > > > > > > >> <tuomas.juntunen@databasement.fi>
> >> > > > > > > > >> To: ceph-users@lists.ceph.com
> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> >> > > > > > > > >> Subject: [ceph-users] Upgrade from Giant to Hammer and
> >> > > > > > > > >> after some basic operations most of the OSD's went down
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> >> > > > > > > > >>
> >> > > > > > > > >> Then created new pools and deleted some old ones. Also
> >> > > > > > > > >> I created one pool for tier to be able to move data
> >> > > > > > > > >> without
> >> > > outage.
> >> > > > > > > > >>
> >> > > > > > > > >> After these operations all but 10 OSD's are down and
> >> > > > > > > > >> creating this kind of messages to logs, I get more than
> >> > > > > > > > >> 100gb of these in a
> >> > > > > > night:
> >> > > > > > > > >>
> >> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23
> >> > > pg_epoch:
> >> > > >
> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> > > > > > > > >> n=0
> >> > > > > > > > >> ec=1 les/c
> >> > > > > > > > >> 16609/16659
> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> > > > > > > > >> pi=15659-16589/42
> >> > > > > > > > >> crt=8480'7 lcod
> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started
> >> > > > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> > > > > > > > >> n=0
> >> > > > > > > > >> ec=1 les/c
> >> > > > > > > > >> 16609/16659
> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> > > > > > > > >> pi=15659-16589/42
> >> > > > > > > > >> crt=8480'7 lcod
> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Start
> >> > > > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> > > > > > > > >> n=0
> >> > > > > > > > >> ec=1 les/c
> >> > > > > > > > >> 16609/16659
> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> > > > > > > > >> pi=15659-16589/42
> >> > > > > > > > >> crt=8480'7 lcod
> >> > > > > > > > >> 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> >> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> > > > > > > > >> n=0
> >> > > > > > > > >> ec=1 les/c
> >> > > > > > > > >> 16609/16659
> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> > > > > > > > >> pi=15659-16589/42
> >> > > > > > > > >> crt=8480'7 lcod
> >> > > > > > > > >> 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> >> > > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609
> >> > > > > > > > >> n=0
> >> > > > > > > > >> ec=1 les/c
> >> > > > > > > > >> 16609/16659
> >> > > > > > > > >> 16590/16590/16590) [24,3,23] r=2 lpr=17838
> >> > > > > > > > >> pi=15659-16589/42
> >> > > > > > > > >> crt=8480'7 lcod
> >> > > > > > > > >> 0'0 inactive NOTIFY] enter Started/Stray
> >> > > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> > > > > > > > >> les/c
> >> > > > > > > > >> 17879/17879
> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> > > > > > > > >> inactive NOTIFY] exit Reset 0.119467 4 0.000037
> >> > > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> > > > > > > > >> les/c
> >> > > > > > > > >> 17879/17879
> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> > > > > > > > >> inactive NOTIFY] enter Started
> >> > > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> > > > > > > > >> les/c
> >> > > > > > > > >> 17879/17879
> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> > > > > > > > >> inactive NOTIFY] enter Start
> >> > > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> > > > > > > > >> les/c
> >> > > > > > > > >> 17879/17879
> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> > > > > > > > >> inactive NOTIFY]
> >> > > > > > > > >> state<Start>: transitioning to Stray
> >> > > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> > > > > > > > >> les/c
> >> > > > > > > > >> 17879/17879
> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> > > > > > > > >> inactive NOTIFY] exit Start 0.000020 0 0.000000
> >> > > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[10.181( empty local-les=17879 n=0 ec=17863
> >> > > > > > > > >> les/c
> >> > > > > > > > >> 17879/17879
> >> > > > > > > > >> 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0
> >> > > > > > > > >> inactive NOTIFY] enter Started/Stray
> >> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 inactive] exit Reset 7.511623 45 0.000165
> >> > > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 inactive] enter Started
> >> > > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 inactive] enter Start
> >> > > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 inactive]
> >> > > > > > > > >> state<Start>: transitioning to Primary
> >> > > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 inactive] exit Start 0.000023 0 0.000000
> >> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 inactive] enter Started/Primary
> >> > > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 inactive] enter Started/Primary/Peering
> >> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5
> >> > > > > > > > >> osd.23
> >> > > > pg_epoch:
> >> > > > >
> >> > > > > > > > >> 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c
> >> > > > > > > > >> 16127/16344
> >> > > > > > > > >> 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod
> >> > > > > > > > >> 0'0 peering] enter Started/Primary/Peering/GetInfo
> >> > > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1
> >> > > > > > > ./include/interval_set.h:
> >> > > > > > > > >> In
> >> > > > > > > > >> function 'void interval_set<T>::erase(T, T) [with T =
> >> > > snapid_t]'
> >> > > > > > > > >> thread
> >> > > > > > > > >> 7fd8e748d700 time 2015-04-27 10:17:08.809899
> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >=
> >> > > > > > > > >> 0)
> >> > > > > > > > >>
> >> > > > > > > > >>  ceph version 0.94.1
> >> > > > > > > > >> (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*,
> >> > > > > > > > >> int, char
> >> > > > > > > > >> const*)+0x8b)
> >> > > > > > > > >> [0xbc271b]
> >> > > > > > > > >>  2:
> >> > > > > > > > >> (interval_set<snapid_t>::subtract(interval_set<snapid_t
> >> > > > > > > > >> >
> >> > > > > > > > >> const&)+0xb0) [0x82cd50]
> >> > > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap
> >> > > > > > > > >> const>)+0x52e) [0x80113e]
> >> > > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap
> >> > > > > > > > >> const>, std::tr1::shared_ptr<OSDMap const>,
> >> > > > > > > > >> const>std::vector<int,
> >> > > > > > > > >> std::allocator<int> >&, int, std::vector<int,
> >> > > > > > > > >> std::allocator<int>
> >> > > > > > > > >> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*,
> >> > > > > > > > >> ThreadPool::TPHandle&, PG::RecoveryCtx*,
> >> > > > > > > > >> std::set<boost::intrusive_ptr<PG>,
> >> > > > > > > > >> std::less<boost::intrusive_ptr<PG> >,
> >> > > > > > > > >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3)
> >> > > > > > > > >> [0x6b0e43]
> >> > > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*,
> >> > > > > > > > >> std::allocator<PG*>
> >> > > > > > > > >> > const&,
> >> > > > > > > > >> ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> >> > > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*,
> >> > > > > > > > >> std::allocator<PG*>
> >> > > > > > > > >> > const&,
> >> > > > > > > > >> ThreadPool::TPHandle&)+0x18) [0x709278]
> >> > > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e)
> >> > > > > > > > >> [0xbb38ae]
> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> >> > > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
> >> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> >> > > > > > > > >>
> >> > > > > > > > >> Also by monitoring (ceph -w) I get the following
> >> > > > > > > > >> messages, also lots of
> >> > > > > > > them.
> >> > > > > > > > >>
> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.?
> >> > > > > > > 10.20.0.13:0/1174409'
> >> > > > > > > > >> entity='osd.30' cmd=[{"prefix": "osd crush
> >> > > > > > > > >> create-or-move",
> >> > > > "args":
> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]:
> >>
> >> > > > > > > > >> dispatch
> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.?
> >> > > > > > > 10.20.0.13:0/1174483'
> >> > > > > > > > >> entity='osd.26' cmd=[{"prefix": "osd crush
> >> > > > > > > > >> create-or-move",
> >> > > > "args":
> >> > > > > > > > >> ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]:
> >>
> >> > > > > > > > >> dispatch
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are
> >> > > > > > > > >> also mons and mds's to save servers. All run Ubuntu
> >> 14.04.2.
> >> > > > > > > > >>
> >> > > > > > > > >> I have pretty much tried everything I could think of.
> >> > > > > > > > >>
> >> > > > > > > > >> Restarting daemons doesn't help.
> >> > > > > > > > >>
> >> > > > > > > > >> Any help would be appreciated. I can also provide more
> >> > > > > > > > >> logs if necessary. They just seem to get pretty large
> >> > > > > > > > >> in few
> >> > > moments.
> >> > > > > > > > >>
> >> > > > > > > > >> Thank you
> >> > > > > > > > >> Tuomas
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> _______________________________________________
> >> > > > > > > > >> ceph-users mailing list ceph-users@lists.ceph.com
> >> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > _______________________________________________
> >> > > > > > > > > ceph-users mailing list
> >> > > > > > > > > ceph-users@lists.ceph.com
> >> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > _______________________________________________
> >> > > > > > > > ceph-users mailing list
> >> > > > > > > > ceph-users@lists.ceph.com
> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > _______________________________________________
> >> > > > > > > > ceph-users mailing list
> >> > > > > > > > ceph-users@lists.ceph.com
> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >> >
> >>
> >
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

[-- Attachment #1.2: Type: text/html, Size: 36302 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


end of thread, other threads:[~2015-05-04 17:28 UTC | newest]

Thread overview: 13+ messages
-- links below jump to the message on this page --
     [not found] <479273764e377f37b81dc6b0ccd55fb3@mail.meizo.com>
     [not found] ` <770484917.5624554.1430133524268.JavaMail.zimbra@redhat.com>
     [not found]   ` <813bbcbbf7d7e7ab4a8e2dba2e5cf6a2@mail.meizo.com>
     [not found]     ` <1551034631.7094890.1430134900209.JavaMail.zimbra@redhat.com>
     [not found]       ` <964da36ebed90592d8f5794ac2617a36@mail.meizo.com>
     [not found]         ` <1226598674.7136470.1430138991322.JavaMail.zimbra@redhat.com>
     [not found]           ` <76bac95ebd000308018bf900d11fae1e@mail.meizo.com>
     [not found]             ` <alpine.DEB.2.00.1504270919020.5458@cobra.newdream.net>
     [not found]               ` <03cd5dfba8f5fec3f80458a92d377a60@mail.meizo.com>
     [not found]                 ` <alpine.DEB.2.00.1504271034560.5458@cobra.newdream.net>
     [not found]                   ` <a06d58aa527edec6225737f18abb055b@mail.meizo.com>
     [not found]                     ` <alpine.DEB.2.00.1504271222002.5458@cobra.newdream.net>
     [not found]                       ` <8bed4ff8a05a8b96ed848e9f1aafa576@mail.meizo.com>
     [not found]                         ` <alpine.DEB.2.00.1504280959280.5458@cobra.newdream.net>
     [not found]                           ` <bb760e0f01a667a582f6bda67cc31684@mail.meizo.com>
     [not found]                             ` <alpine.DEB.2.00.1504281155530.5458@cobra.newdream.net>
     [not found]                               ` <f9adb4b2dcada947f418b6f95ad7a8d1@mail.meizo.com>
2015-04-28 20:19                                 ` [ceph-users] Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down Sage Weil
     [not found]                                   ` <alpine.DEB.2.00.1504281256440.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-04-28 20:57                                     ` Sage Weil
     [not found]                                       ` <alpine.DEB.2.00.1504281355130.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-04-29  4:16                                         ` Tuomas Juntunen
     [not found]                                           ` <81216125e573cf00539f61cc090b282b-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
2015-04-29 15:38                                             ` Sage Weil
     [not found]                                               ` <alpine.DEB.2.00.1504290838060.5458-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-04-30  3:31                                                 ` tuomas.juntunen-TGwGjfj4lcphU2BovMVX9g
     [not found]                                                   ` <928ebb7320e4eb07f14071e997ed7be2-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
2015-04-30 15:23                                                     ` Sage Weil
2015-04-30 17:27 tuomas.juntunen
2015-05-01 15:10 ` [ceph-users] " tuomas.juntunen
     [not found]   ` <dc614dd85caf6cecfd59897b96a019ad-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
2015-05-01 16:04     ` Sage Weil
2015-05-01 18:13       ` [ceph-users] " tuomas.juntunen
     [not found]         ` <56c96198e8b0d8e70fbf96fdd209d70a-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
2015-05-04  4:11           ` Tuomas Juntunen
     [not found]         ` <90c912f778464020445a8a09c7d8c7f5@mail.meizo.com>
     [not found]           ` <90c912f778464020445a8a09c7d8c7f5-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
2015-05-04 15:29             ` Sage Weil
     [not found]               ` <alpine.DEB.2.00.1505040828590.24939-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-05-04 17:17                 ` Tuomas Juntunen
     [not found]                   ` <f0d4624d313c49cf355543dbf52d6561-Mp+lKDbUk+6SvdrsE3bNcA@public.gmane.org>
2015-05-04 17:20                     ` Sage Weil
     [not found]                       ` <alpine.DEB.2.00.1505041019300.24939-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-05-04 17:28                         ` Tuomas Juntunen
