* 7915 is not resolved
@ 2016-01-11 16:52 Boris Lukashev
  2016-01-11 17:25 ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-11 16:52 UTC (permalink / raw)
  To: ceph-devel

I ran into an incredibly unpleasant loss of a 5 node, 10 OSD Ceph
cluster backing our OpenStack Glance and Cinder services, just by
asking RBD to snapshot one of the volumes.
The conditions under which this occurred are as follows: a bash script
asked Cinder to snapshot RBD volumes in rapid succession (2 of them),
which either caused a nova host (also a Ceph OSD holder) to crash, or
simply coincided with that crash. On reboot of the host, RBD started
throwing errors, and once all OSDs were restarted they all fail,
crashing with the following:

    -1> 2016-01-11 16:37:35.401002 7f16f8449700  5 osd.6 pg_epoch:
84269 pg[2.2c( empty local-les=84219 n=0 ec=1 les/c 84219/84219
84218/84218/84193) [6,8] r=0 lpr=84261 crt=0'0 mlcod 0'0 peering]
enter Started/Primary/Peering/GetInfo
     0> 2016-01-11 16:37:35.401057 7f16f7c48700 -1
./include/interval_set.h: In function 'void interval_set<T>::erase(T,
T) [with T = snapid_t]' thread 7f16f7c48700 time 2016-01-11
16:37:35.398335
./include/interval_set.h: 386: FAILED assert(_size >= 0)

 ceph version 0.80.11-19-g130b0f7 (130b0f748332851eb2e3789e2b2fa4d3d08f3006)
 1: (interval_set<snapid_t>::subtract(interval_set<snapid_t>
const&)+0xb0) [0x79d140]
 2: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x656) [0x772856]
 3: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>,
std::tr1::shared_ptr<OSDMap const>, std::vector<int,
std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&,
int, PG::RecoveryCtx*)+0x282) [0x772c22]
 4: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
std::less<boost::intrusive_ptr<PG> >,
std::allocator<boost::intrusive_ptr<PG> > >*)+0x292) [0x6548e2]
 5: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x20c) [0x6553cc]
 6: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x18) [0x69c858]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb01) [0xa5ac71]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xa5bb60]
 9: (()+0x8182) [0x7f170def5182]
 10: (clone()+0x6d) [0x7f170c51447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

To me, this looks like the snapshot that was being created when the
nova host died is causing the assert to fail, since the snap was never
completed and is broken.

http://tracker.ceph.com/issues/11493, which appears very similar, is
marked as resolved, but with current firefly (deployed via Fuel and
updated in place with the 0.80.11 debs) this issue hit us on Saturday.

What's the way around this? I imagine commenting out that assert may
cause more damage, but we need to get our OSDs, and the RBD data in
them, back online. Is there a permanent fix in any branch we can
backport? We built this cluster using Fuel, so this affects every
Mirantis user if not every Ceph user out there, and the vector into
this catastrophic bug is normal daily operations (snapshots,
apparently)....

Thank you all for looking over this, advice would be greatly appreciated.


* Re: 7915 is not resolved
  2016-01-11 16:52 7915 is not resolved Boris Lukashev
@ 2016-01-11 17:25 ` Sage Weil
  2016-01-11 18:15   ` Boris Lukashev
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2016-01-11 17:25 UTC (permalink / raw)
  To: Boris Lukashev; +Cc: ceph-devel

On Mon, 11 Jan 2016, Boris Lukashev wrote:
> [original report and OSD stack trace snipped]
>
> http://tracker.ceph.com/issues/11493 which appears very similar is
> marked as resolved, but with firefly current (deployed via Fuel and
> updated in place with 0.80.11 debs) this issue hit us on Saturday.

You can try cherry-picking the two commits in wip-11493-b, which make the
OSD semi-gracefully tolerate this situation.  This is a bug that's been
fixed in hammer, but since the inconsistency has already been introduced,
simply upgrading probably won't resolve it.  Nevertheless, after working
around this, I'd encourage you to move to hammer, as firefly is at end of
life.
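
A minimal sketch of that cherry-pick flow, assuming a local clone of the
ceph tree checked out on a firefly-based build branch; the branch name and
the two commit ids below are placeholders that have to be looked up in
wip-11493-b first:

  # fetch the wip branch and identify the two commits on top of firefly
  git fetch https://github.com/ceph/ceph.git wip-11493-b
  git log --oneline -n 5 FETCH_HEAD

  # apply them to the local build branch and rebuild the debs
  git checkout my-firefly-0.80.11          # placeholder branch name
  git cherry-pick <commit-id-1> <commit-id-2>
  dpkg-buildpackage -us -uc                # or your usual packaging flow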

sage



* Re: 7915 is not resolved
  2016-01-11 17:25 ` Sage Weil
@ 2016-01-11 18:15   ` Boris Lukashev
  2016-01-11 19:06     ` Dmitry Borodaenko
  0 siblings, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-11 18:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Thank you, pulling those into my branch currently and kicking off a build.
In terms of upgrading to Hammer - the documentation looks straightforward
enough, but given that this is a Fuel-based OpenStack deployment, I'm
wondering if you've heard of any potential compatibility issues from
doing so.

-Boris

On Mon, Jan 11, 2016 at 12:25 PM, Sage Weil <sage@newdream.net> wrote:
> [quoted earlier messages snipped]


* Re: 7915 is not resolved
  2016-01-11 18:15   ` Boris Lukashev
@ 2016-01-11 19:06     ` Dmitry Borodaenko
  2016-01-11 20:13       ` Boris Lukashev
  0 siblings, 1 reply; 15+ messages in thread
From: Dmitry Borodaenko @ 2016-01-11 19:06 UTC (permalink / raw)
  To: Boris Lukashev; +Cc: Sage Weil, ceph-devel

Fuel 8.0 will support Hammer; you can grab the packages from:
http://mirror.fuel-infra.org/mos-repos/ubuntu/8.0/pool/main/c/ceph/

or, if you build your own packages with the extra patches, grab the
Debian build scripts from:
https://review.fuel-infra.org/#/c/13879/

That will make sure your packages work with Fuel.
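
A rough sketch of pulling those debs onto a node by hand, assuming an
Ubuntu/amd64 host and that the package set below matches what the pool
directory actually lists (check the listing first):

  wget -r -np -nd -A 'deb' \
      http://mirror.fuel-infra.org/mos-repos/ubuntu/8.0/pool/main/c/ceph/
  sudo dpkg -i ceph_*.deb ceph-common_*.deb librados2_*.deb librbd1_*.deb
  sudo apt-get -f install   # pull in any missing dependencies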

-- 
Dmitry Borodaenko

On Mon, Jan 11, 2016 at 01:15:50PM -0500, Boris Lukashev wrote:
> [quoted earlier messages snipped]


* Re: 7915 is not resolved
  2016-01-11 19:06     ` Dmitry Borodaenko
@ 2016-01-11 20:13       ` Boris Lukashev
  2016-01-12  2:00         ` Boris Lukashev
  0 siblings, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-11 20:13 UTC (permalink / raw)
  To: Dmitry Borodaenko; +Cc: Sage Weil, ceph-devel

Thank you much - I'll get Hammer built out that way in a bit.

In the meantime, it looks like things are getting a bit better, but two
OSDs are still giving me trouble.
Sage - your patches appear to have saved 4/5 nodes (8/10 OSDs). Ceph is
currently resolving state, but osd tree shows the 4 nodes up with
their respective pairs of OSDs. Thank you ever so much.
Dmitry/Mirantis - you might want to merge 0.80.11 + these patches into
your MOS repo for Fuel 7 systems, as it may save someone from
catastrophe down the line (plus you're a bit out of date anyway).

On to the last node: the error logs from its OSDs still show a very
similar, but slightly different, stack trace:
   -22> 2016-01-11 20:02:37.929902 7f26dadff700  5 osd.6 pg_epoch:
84290 pg[11.2d( v 84268'523421 (84261'520420,84268'523421]
local-les=84266 n=508 ec=38 les/c 84266/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84265-84288/6 crt=84268'523415 lcod 0'0 mlcod 0'0
inactive] enter Started/Primary/Active
   -21> 2016-01-11 20:02:37.930111 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.25( v 84268'783451 (84261'780450,84268'783451]
local-les=84268 n=500 ec=38 les/c 84268/84268 84288/84288/84288) [4]
r=-1 lpr=84289 pi=84197-84287/7 crt=84268'783449 lcod 0'0 inactive
NOTIFY] exit Reset 0.193889 2 0.000031
   -20> 2016-01-11 20:02:37.930134 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.25( v 84268'783451 (84261'780450,84268'783451]
local-les=84268 n=500 ec=38 les/c 84268/84268 84288/84288/84288) [4]
r=-1 lpr=84289 pi=84197-84287/7 crt=84268'783449 lcod 0'0 inactive
NOTIFY] enter Started
   -19> 2016-01-11 20:02:37.930146 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.25( v 84268'783451 (84261'780450,84268'783451]
local-les=84268 n=500 ec=38 les/c 84268/84268 84288/84288/84288) [4]
r=-1 lpr=84289 pi=84197-84287/7 crt=84268'783449 lcod 0'0 inactive
NOTIFY] enter Start
   -18> 2016-01-11 20:02:37.930155 7f26da5fe700  1 osd.6 pg_epoch:
84290 pg[11.25( v 84268'783451 (84261'780450,84268'783451]
local-les=84268 n=500 ec=38 les/c 84268/84268 84288/84288/84288) [4]
r=-1 lpr=84289 pi=84197-84287/7 crt=84268'783449 lcod 0'0 inactive
NOTIFY] state<Start>: transitioning to Stray
   -17> 2016-01-11 20:02:37.930168 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.25( v 84268'783451 (84261'780450,84268'783451]
local-les=84268 n=500 ec=38 les/c 84268/84268 84288/84288/84288) [4]
r=-1 lpr=84289 pi=84197-84287/7 crt=84268'783449 lcod 0'0 inactive
NOTIFY] exit Start 0.000022 0 0.000000
   -16> 2016-01-11 20:02:37.930181 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.25( v 84268'783451 (84261'780450,84268'783451]
local-les=84268 n=500 ec=38 les/c 84268/84268 84288/84288/84288) [4]
r=-1 lpr=84289 pi=84197-84287/7 crt=84268'783449 lcod 0'0 inactive
NOTIFY] enter Started/Stray
   -15> 2016-01-11 20:02:37.930428 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] exit Reset 0.047014 2 0.000041
   -14> 2016-01-11 20:02:37.930449 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] enter Started
   -13> 2016-01-11 20:02:37.930460 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] enter Start
   -12> 2016-01-11 20:02:37.930471 7f26da5fe700  1 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] state<Start>: transitioning to Primary
   -11> 2016-01-11 20:02:37.930484 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] exit Start 0.000024 0 0.000000
   -10> 2016-01-11 20:02:37.930497 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] enter Started/Primary
    -9> 2016-01-11 20:02:37.930510 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] enter Started/Primary/Peering
    -8> 2016-01-11 20:02:37.930521 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
peering] enter Started/Primary/Peering/GetInfo
    -7> 2016-01-11 20:02:37.930543 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
peering] exit Started/Primary/Peering/GetInfo 0.000022 0 0.000000
    -6> 2016-01-11 20:02:37.930558 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
peering] enter Started/Primary/Peering/GetLog
    -5> 2016-01-11 20:02:37.930592 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
peering] exit Started/Primary/Peering/GetLog 0.000035 0 0.000000
    -4> 2016-01-11 20:02:37.930607 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
peering] enter Started/Primary/Peering/GetMissing
    -3> 2016-01-11 20:02:37.930620 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
peering] exit Started/Primary/Peering/GetMissing 0.000013 0 0.000000
    -2> 2016-01-11 20:02:37.930633 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
peering] exit Started/Primary/Peering 0.000124 0 0.000000
    -1> 2016-01-11 20:02:37.930646 7f26da5fe700  5 osd.6 pg_epoch:
84290 pg[11.3b( v 84268'1233293 (84261'1230292,84268'1233293]
local-les=84268 n=511 ec=38 les/c 84268/84268 84289/84289/84289) [6]
r=0 lpr=84289 pi=84267-84288/5 crt=84268'1233291 lcod 0'0 mlcod 0'0
inactive] enter Started/Primary/Active
     0> 2016-01-11 20:02:37.934944 7f26da5fe700 -1
./include/interval_set.h: In function 'void interval_set<T>::erase(T,
T) [with T = snapid_t]' thread 7f26da5fe700 time 2016-01-11
20:02:37.930666
./include/interval_set.h: 389: FAILED assert(_size >= 0)

 ceph version 0.80.11-21-gf526309 (f526309b51b1760df193fca5fc39b4913e890bab)
 1: (interval_set<snapid_t>::subtract(interval_set<snapid_t>
const&)+0xb0) [0x79d4e0]
 2: (PG::activate(ObjectStore::Transaction&, unsigned int,
std::list<Context*, std::allocator<Context*> >&, std::map<int,
std::map<spg_t, pg_query_t, std::less<spg_t>,
std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>,
std::allocator<std::pair<int const, std::map<spg_t, pg_query_t,
std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> >
> > > >&, std::map<int, std::vector<std::pair<pg_notify_t,
std::map<unsigned int, pg_interval_t, std::less<unsigned int>,
std::allocator<std::pair<unsigned int const, pg_interval_t> > > >,
std::allocator<std::pair<pg_notify_t, std::map<unsigned int,
pg_interval_t, std::less<unsigned int>,
std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >,
std::less<int>, std::allocator<std::pair<int const,
std::vector<std::pair<pg_notify_t, std::map<unsigned int,
pg_interval_t, std::less<unsigned int>,
std::allocator<std::pair<unsigned int const, pg_interval_t> > > >,
std::allocator<std::pair<pg_notify_t, std::map<unsigned int,
pg_interval_t, std::less<unsigned int>,
std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >
> > >*, PG::RecoveryCtx*)+0x846) [0x7742e6]
 3: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active,
PG::RecoveryState::Primary, PG::RecoveryState::Activating,
(boost::statechart::history_mode)0>::my_context)+0x3fc) [0x77671c]
 4: (boost::statechart::detail::safe_reaction_result
boost::statechart::simple_state<PG::RecoveryState::Peering,
PG::RecoveryState::Primary, PG::RecoveryState::GetInfo,
(boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active,
PG::RecoveryState::RecoveryMachine,
boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function
const&)+0xa8) [0x7a7928]
 5: (boost::statechart::simple_state<PG::RecoveryState::Peering,
PG::RecoveryState::Primary, PG::RecoveryState::GetInfo,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x13a) [0x7a7d9a]
 6: (boost::statechart::simple_state<PG::RecoveryState::GetMissing,
PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xc0) [0x7a8a20]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
const&)+0x5b) [0x79185b]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_queued_events()+0xd4)
[0x7919d4]
 9: (PG::handle_activate_map(PG::RecoveryCtx*)+0x11b) [0x74092b]
 10: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
std::less<boost::intrusive_ptr<PG> >,
std::allocator<boost::intrusive_ptr<PG> > >*)+0x6c2) [0x654ca2]
 11: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x20c) [0x65535c]
 12: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x18) [0x69c8c8]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb01) [0xa5af11]
 14: (ThreadPool::WorkThread::entry()+0x10) [0xa5be00]
 15: (()+0x8182) [0x7f26f08ab182]
 16: (clone()+0x6d) [0x7f26eeeca47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


This is running the same packages as all the others, and it is not the
node which originally crashed on us during the snapshot.
Some of our pools are only 2x replication due to the small cluster
size. Is there any way to ensure that killing both OSDs won't affect
both replicas of some bits? I'm still trying to get the hang of all
the Ceph CLIs.

So far rbd and rados don't seem to work (assuming the pool is figuring
itself out), but at least the OSDs are coming up... Is this expected
behavior, or do I need to hit it with the proverbial wrench a few more
times?
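
One rough way to answer the replica question from a monitor node, assuming
for illustration that the two problem OSDs are osd.6 and osd.7 (substitute
the real ids): list any PGs whose up/acting set consists only of those two,
and check where a specific object maps.

  # PGs mapped entirely onto the two suspect OSDs
  ceph pg dump 2>/dev/null | egrep '\[(6,7|7,6|6|7)\]'

  # where a particular object lands (pool and object name are placeholders)
  ceph osd map <pool> <object>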

Thank you much,
-Boris Lukashev

On Mon, Jan 11, 2016 at 2:06 PM, Dmitry Borodaenko
<dborodaenko@mirantis.com> wrote:
> [quoted earlier messages snipped]


* Re: 7915 is not resolved
  2016-01-11 20:13       ` Boris Lukashev
@ 2016-01-12  2:00         ` Boris Lukashev
  2016-01-12  6:53           ` Mykola Golub
  0 siblings, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-12  2:00 UTC (permalink / raw)
  To: ceph-devel; +Cc: Sage Weil

In case anyone is following the mailing list later on: we spoke in IRC
and Sage provided a patch - http://fpaste.org/309609/52550203/ - which
allows 8/10 OSDs to start (or at least not crash; they just sort of hang).
One node still refuses to start, and it has two OSDs.
All OSDs which DO start produce log output until 'done with init,
starting boot process' and seem to speak with the mon, but don't do
anything - all PGs are still marked down, while osd tree shows the
proper OSDs in and up (with two out and down on the "bad node").
RBD doesn't work, rados ls hangs, and the mon log shows stale messages a la
"http://pastebin.com/BE98a2HP" which seem to indicate that it's still
trying to commit the snapshot which caused all of this in the first
place (the same one preventing the "bad node" from running its OSDs
for more than a few seconds before they crash out on the same assert).

If there is an outstanding message on the queue attempting to commit the
bad snapshot, and the receiving OSDs won't come up, is that supposed to
freeze all of the other ones? Is there no way to drop the message and
"safe boot" (read-only or whatnot) the OSDs to extract data from them?
Is there a "formal" procedure for saving data from this sort of
craziness? I'm happy to throw it all in a giant zpool and rebuild the
OSDs under Hammer in some clean way; it would just be rather nice to
get back the months of images and volumes in there...
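
On the "saving data" question, once the cluster is serving I/O again the
simplest escape hatch is usually rbd export. A sketch, assuming the typical
Fuel pool names 'volumes' and 'images' (adjust to whatever 'ceph osd
lspools' reports) and enough local space under /backup:

  for pool in volumes images; do
      for img in $(rbd -p "$pool" ls); do
          rbd -p "$pool" export "$img" "/backup/${pool}-${img}.raw"
      done
  done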

Thank you,
-Boris Lukashev


On Mon, Jan 11, 2016 at 3:13 PM, Boris Lukashev
<blukashev@sempervictus.com> wrote:
> [quoted earlier messages snipped]


* Re: 7915 is not resolved
  2016-01-12  2:00         ` Boris Lukashev
@ 2016-01-12  6:53           ` Mykola Golub
  2016-01-12 13:24             ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Mykola Golub @ 2016-01-12  6:53 UTC (permalink / raw)
  To: Boris Lukashev; +Cc: ceph-devel, Sage Weil

On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
> In case anyone is following the mailing list later on, we spoke in IRC
> and Sage provided a patch - http://fpaste.org/309609/52550203/

> diff --git a/src/osd/PG.cc b/src/osd/PG.cc
> index dc18aec..f9ee23c 100644
> --- a/src/osd/PG.cc
> +++ b/src/osd/PG.cc
> @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
>    name = map->get_pool_name(id);
>    if (pi->get_snap_epoch() == map->get_epoch()) {
>      pi->build_removed_snaps(newly_removed_snaps);
> -    newly_removed_snaps.subtract(cached_removed_snaps);
> -    cached_removed_snaps.union_of(newly_removed_snaps);
> +    interval_set<snapid_t> intersection;
> +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
> +    if (!(intersection == cached_removed_snaps)) {
> +      newly_removed_snaps.subtract(cached_removed_snaps);

Sage, won't it still violate the assert?
"intersection != cached_removed_snaps" means that cached_removed_snaps
contains snapshots missing from newly_removed_snaps, so we can't subtract?

> +      cached_removed_snaps.union_of(newly_removed_snaps);
> +    } else {
> +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
> +      cached_removed_snaps = newly_removed_snaps;
> +      newly_removed_snaps.clear();
> +    }
>      snapc = pi->get_snap_context();
>    } else {
>      newly_removed_snaps.clear();

-- 
Mykola Golub
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-12  6:53           ` Mykola Golub
@ 2016-01-12 13:24             ` Sage Weil
  2016-01-12 17:43               ` Boris Lukashev
  2016-01-13 16:49               ` Alexey Sheplyakov
  0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2016-01-12 13:24 UTC (permalink / raw)
  To: Mykola Golub; +Cc: Boris Lukashev, ceph-devel

On Tue, 12 Jan 2016, Mykola Golub wrote:
> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
> > In case anyone is following the mailing list later on, we spoke in IRC
> > and Sage provided a patch - http://fpaste.org/309609/52550203/
> 
> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
> > index dc18aec..f9ee23c 100644
> > --- a/src/osd/PG.cc
> > +++ b/src/osd/PG.cc
> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
> >    name = map->get_pool_name(id);
> >    if (pi->get_snap_epoch() == map->get_epoch()) {
> >      pi->build_removed_snaps(newly_removed_snaps);
> > -    newly_removed_snaps.subtract(cached_removed_snaps);
> > -    cached_removed_snaps.union_of(newly_removed_snaps);
> > +    interval_set<snapid_t> intersection;
> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
> > +    if (!(intersection == cached_removed_snaps)) {
> > +      newly_removed_snaps.subtract(cached_removed_snaps);
> 
> Sage, won't it still violate the assert?
> "intersection != cached_removed_snaps" means that cached_removed_snaps
> contains snapshots missed in newly_removed_snaps, and we can't subtract?

Oops, yeah, just remove the !.

As you can see the problem is that the OSDMap's removed snaps shrank 
somehow.  If you crank up logging you can see what the competing sets 
are.

An alternative fix/hack would be to modify the monitor to allow the 
snapids that were previously in the set to be added back into the OSDMap.  
That's arguably a better fix, although it's a bit more work.  But, even 
then, something like the above will be needed since there are still 
OSDMaps in the history where the set is smaller.
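
For anyone following the interval_set logic at home, here is a minimal
standalone sketch of the containment check being discussed. It is not Ceph
code: it models the removed-snap sets as a flat std::set<uint64_t> of snap ids
(Ceph's interval_set stores ranges) and uses made-up snap values, purely to
show why subtract() is only safe when the cached set is fully contained in
the newly built one, and what the guarded fallback in the patch does instead:

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <set>

using SnapSet = std::set<uint64_t>;  // toy stand-in for interval_set<snapid_t>

// Mimics interval_set::subtract(): every element of 'other' must already be
// present in 'from'; the real implementation asserts inside erase() otherwise.
static void subtract(SnapSet& from, const SnapSet& other) {
  for (uint64_t s : other) {
    assert(from.count(s));  // the analogue of the assert that fires on the broken cluster
    from.erase(s);
  }
}

int main() {
  SnapSet cached = {0x19, 0x1a, 0x24, 0x25};  // what the PG remembers (toy values)
  SnapSet newly  = {0x19, 0x1a};              // rebuilt from an OSDMap that "shrank"

  SnapSet intersection;
  std::set_intersection(newly.begin(), newly.end(),
                        cached.begin(), cached.end(),
                        std::inserter(intersection, intersection.begin()));

  if (intersection == cached) {
    // Normal case: cached is a subset of newly, so subtracting is safe.
    subtract(newly, cached);
  } else {
    // Divergence: subtracting would trip the assert, so reset the cache instead.
    std::cout << "removed snaps shrank; resetting cache instead of asserting\n";
    cached = newly;
    newly.clear();
  }
  return 0;
}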

sage

> 
> > +      cached_removed_snaps.union_of(newly_removed_snaps);
> > +    } else {
> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
> > +      cached_removed_snaps = newly_removed_snaps;
> > +      newly_removed_snaps.clear();
> > +    }
> >      snapc = pi->get_snap_context();
> >    } else {
> >      newly_removed_snaps.clear();
> 
> -- 
> Mykola Golub
>  
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-12 13:24             ` Sage Weil
@ 2016-01-12 17:43               ` Boris Lukashev
  2016-01-12 17:47                 ` Sage Weil
  2016-01-13 16:49               ` Alexey Sheplyakov
  1 sibling, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-12 17:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mykola Golub, ceph-devel

Should I try altering the patch with the ! removed and reloading the
OSDs? (It would then read as 'if (intersection == cached_removed_snaps) {'.)
Just for my own education on the matter: if there is disagreement
between the contents of an OSD and the map, like in this case where a
pending request seems to be outstanding, is there no mediation process
between the on-disk data (OSD) and metadata (map) services? With XFS
being used underneath most of the time, that strikes me as somewhat
scary - it's not the most consistent of filesystems on the best of
days.

With the mismatch between the OSD and the map, but xfs_check coming back
clean, should I be worried about a corrupt cluster in the event that I
can somehow get it running?
I figure that with firefly dying and hammer available from Mirantis, I
should upgrade the cluster, but I would like to know what the safest
way forward is. I'd really prefer to keep using Ceph - it's been
educational and quite handy - but if I have to rebuild the cluster
it'll need to keep playing nice with the Fuel-deployed OpenStack. If I
can get access to the images stored by Glance and the Swift metadata, I'll
gladly export and rebuild clean, presuming I can figure out how. The
RBD images are already saved (manual export by tracking the rbd
segment hashes from the metadata files bearing volume-UUID
designations matching what I saw in Cinder, and dd-ing chunks into
flat files for raw images). Worst case, if the cluster won't come back
up and give me access to the data, what's the process for getting it
to a "clean" state such that I can upgrade to hammer and reseed my
Glance, Swift, and volume data from backups/exports? Do I need to
remove and re-add OSDs, or is there some darker magic at play to
ensure there are no remnants of bad data/messages?

Thank you all
-Boris

On Tue, Jan 12, 2016 at 8:24 AM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 12 Jan 2016, Mykola Golub wrote:
>> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
>> > In case anyone is following the mailing list later on, we spoke in IRC
>> > and Sage provided a patch - http://fpaste.org/309609/52550203/
>>
>> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
>> > index dc18aec..f9ee23c 100644
>> > --- a/src/osd/PG.cc
>> > +++ b/src/osd/PG.cc
>> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
>> >    name = map->get_pool_name(id);
>> >    if (pi->get_snap_epoch() == map->get_epoch()) {
>> >      pi->build_removed_snaps(newly_removed_snaps);
>> > -    newly_removed_snaps.subtract(cached_removed_snaps);
>> > -    cached_removed_snaps.union_of(newly_removed_snaps);
>> > +    interval_set<snapid_t> intersection;
>> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
>> > +    if (!(intersection == cached_removed_snaps)) {
>> > +      newly_removed_snaps.subtract(cached_removed_snaps);
>>
>> Sage, won't it still violate the assert?
>> "intersection != cached_removed_snaps" means that cached_removed_snaps
>> contains snapshots missed in newly_removed_snaps, and we can't subtract?
>
> Oops, yeah, just remote the !.
>
> As you can see the problem is that the OSDMap's removed snaps shrank
> somehow.  If you crank up logging you can see what the competing sets
> are.
>
> An alternative fix/hack would be to modify the monitor to allow the
> snapids that were previously in the set to be added back into the OSDMap.
> That's arguably a better fix, although it's a bit more work.  But, even
> then, something like the above will be needed since there are still
> OSDMaps in the history where the set is smaller.
>
> sage
>
>>
>> > +      cached_removed_snaps.union_of(newly_removed_snaps);
>> > +    } else {
>> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
>> > +      cached_removed_snaps = newly_removed_snaps;
>> > +      newly_removed_snaps.clear();
>> > +    }
>> >      snapc = pi->get_snap_context();
>> >    } else {
>> >      newly_removed_snaps.clear();
>>
>> --
>> Mykola Golub
>>
>>
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-12 17:43               ` Boris Lukashev
@ 2016-01-12 17:47                 ` Sage Weil
  2016-01-12 18:03                   ` Boris Lukashev
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2016-01-12 17:47 UTC (permalink / raw)
  To: Boris Lukashev; +Cc: Mykola Golub, ceph-devel

On Tue, 12 Jan 2016, Boris Lukashev wrote:
> Should i try altering the patch with the ! removed and reloading the
> OSDs? (would read as ' if (intersection == cached_removed_snaps) { ')
> Just for my own education on the matter - if there is disagreement
> between the contents of an OSD and the map, like in this case where a
> pending request seems to be outstanding, is there no mediation process
> between the on-disk data (OSD) and metadata (map) services? With XFS
> being used underneath most of the time, that strikes me as somewhat
> scary - its not the most consistent of filesystems on the best of
> days.
> 
> With the mismatch between the OSD and map, but xfs_check coming back
> clean, should i be worried about a corrupt cluster in the event that i
> can somehow get it running?
> I figure that with FireFly dying and Hammer available from Mirantis, i
> should upgrade the cluster, but i would like to know what the safest
> way forward is - i'd really prefer to keep using Ceph, its been
> educational and quite handy, but if i have to rebuild the cluster
> it'll need to keep playing nice with the Fuel deployed OpenStack. If i
> can get access to the images stored by Glance and Swift metadata, i'll
> gladly export and rebuild clean presuming i can figure out how. The
> RBD images are already saved (manual export by tracking the rbd
> segment hashes from the metadata files bearing volume-UUID
> designations matching what i saw in Cinder, and dd-ing chunks into
> flat files for raw images). Worst case, if the cluster wont come back
> up and give me access to the data, what's the process for getting it
> to a "clean" state such that i can upgrade to hammer and reseed my
> glance, swift, and volume data from backups/exports? Do i need to
> remove and re-add OSDs, or is there some darker magic at play to
> ensure there are no remnants of bad data/messages?

I think the safe path is to

(1) reproduce with debug osd = 20, so we can see what the removed_snaps is
on the pg vs the one in the osdmap.

(2) fix the ! in the patch and restart the osds

(3) re-add the deleted snaps to the osdmap so that things are back in 
sync.  This is possible through the librados API so it should be pretty 
simple to fix.  But, let's look at what (1) shows first.
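
For step (1), a minimal way to get that logging (a sketch only; the exact
mechanics depend on how the cluster was deployed) is to add the following to
ceph.conf on the affected OSD host and restart that OSD:

[osd]
    debug osd = 20

or, for an OSD that is still running, something along the lines of
"ceph tell osd.N injectargs '--debug-osd 20'" should bump the level at
runtime without a restart.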

sage


> 
> Thank you all
> -Boris
> 
> On Tue, Jan 12, 2016 at 8:24 AM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 12 Jan 2016, Mykola Golub wrote:
> >> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
> >> > In case anyone is following the mailing list later on, we spoke in IRC
> >> > and Sage provided a patch - http://fpaste.org/309609/52550203/
> >>
> >> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
> >> > index dc18aec..f9ee23c 100644
> >> > --- a/src/osd/PG.cc
> >> > +++ b/src/osd/PG.cc
> >> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
> >> >    name = map->get_pool_name(id);
> >> >    if (pi->get_snap_epoch() == map->get_epoch()) {
> >> >      pi->build_removed_snaps(newly_removed_snaps);
> >> > -    newly_removed_snaps.subtract(cached_removed_snaps);
> >> > -    cached_removed_snaps.union_of(newly_removed_snaps);
> >> > +    interval_set<snapid_t> intersection;
> >> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
> >> > +    if (!(intersection == cached_removed_snaps)) {
> >> > +      newly_removed_snaps.subtract(cached_removed_snaps);
> >>
> >> Sage, won't it still violate the assert?
> >> "intersection != cached_removed_snaps" means that cached_removed_snaps
> >> contains snapshots missed in newly_removed_snaps, and we can't subtract?
> >
> > Oops, yeah, just remote the !.
> >
> > As you can see the problem is that the OSDMap's removed snaps shrank
> > somehow.  If you crank up logging you can see what the competing sets
> > are.
> >
> > An alternative fix/hack would be to modify the monitor to allow the
> > snapids that were previously in the set to be added back into the OSDMap.
> > That's arguably a better fix, although it's a bit more work.  But, even
> > then, something like the above will be needed since there are still
> > OSDMaps in the history where the set is smaller.
> >
> > sage
> >
> >>
> >> > +      cached_removed_snaps.union_of(newly_removed_snaps);
> >> > +    } else {
> >> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
> >> > +      cached_removed_snaps = newly_removed_snaps;
> >> > +      newly_removed_snaps.clear();
> >> > +    }
> >> >      snapc = pi->get_snap_context();
> >> >    } else {
> >> >      newly_removed_snaps.clear();
> >>
> >> --
> >> Mykola Golub
> >>
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-12 17:47                 ` Sage Weil
@ 2016-01-12 18:03                   ` Boris Lukashev
  2016-01-12 20:21                     ` Boris Lukashev
  0 siblings, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-12 18:03 UTC (permalink / raw)
  To: ceph-devel

I've put some of the output from debug osd 20 at
http://pastebin.com/he5snqwF; it seems one of the last operations is
in fact "activate - purged_snaps [1~5,8~2,b~d,19~f]
cached_removed_snaps [1~5,8~2,b~d,19~b]", which seems to make sense in
the context of this mismatch...
There is an ungodly amount of output at level 20; is there anything
specific you'd like me to grep for?
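
(A quick decode of that line, assuming the usual interval_set print format of
[start~length] in hex: [19~b] covers snap ids 0x19 through 0x19+0xb-1 = 0x23,
while [19~f] covers 0x19 through 0x19+0xf-1 = 0x27, and the other intervals
match. So the divergence is exactly the four snap ids 0x24 through 0x27, i.e.
0x19+0xb through 0x19+0xe, which is what the selfmanaged_snap_remove() calls
later in the thread target.)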

On Tue, Jan 12, 2016 at 12:47 PM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 12 Jan 2016, Boris Lukashev wrote:
>> Should i try altering the patch with the ! removed and reloading the
>> OSDs? (would read as ' if (intersection == cached_removed_snaps) { ')
>> Just for my own education on the matter - if there is disagreement
>> between the contents of an OSD and the map, like in this case where a
>> pending request seems to be outstanding, is there no mediation process
>> between the on-disk data (OSD) and metadata (map) services? With XFS
>> being used underneath most of the time, that strikes me as somewhat
>> scary - its not the most consistent of filesystems on the best of
>> days.
>>
>> With the mismatch between the OSD and map, but xfs_check coming back
>> clean, should i be worried about a corrupt cluster in the event that i
>> can somehow get it running?
>> I figure that with FireFly dying and Hammer available from Mirantis, i
>> should upgrade the cluster, but i would like to know what the safest
>> way forward is - i'd really prefer to keep using Ceph, its been
>> educational and quite handy, but if i have to rebuild the cluster
>> it'll need to keep playing nice with the Fuel deployed OpenStack. If i
>> can get access to the images stored by Glance and Swift metadata, i'll
>> gladly export and rebuild clean presuming i can figure out how. The
>> RBD images are already saved (manual export by tracking the rbd
>> segment hashes from the metadata files bearing volume-UUID
>> designations matching what i saw in Cinder, and dd-ing chunks into
>> flat files for raw images). Worst case, if the cluster wont come back
>> up and give me access to the data, what's the process for getting it
>> to a "clean" state such that i can upgrade to hammer and reseed my
>> glance, swift, and volume data from backups/exports? Do i need to
>> remove and re-add OSDs, or is there some darker magic at play to
>> ensure there are no remnants of bad data/messages?
>
> I think teh safe path is to
>
> (1) reproduce with debug osd = 20, so we can see what the removed_snaps is
> on the pg vs hte one in the osdmap.
>
> (2) fix the ! in the patch and restart the osds
>
> (3) re-add the deleted snaps to the osdmap so that things are back in
> sync.  This is possible through the librados API so it should be pretty
> simple to fix.  But, let's look at what (1) shows first.
>
> sage
>
>
>>
>> Thank you all
>> -Boris
>>
>> On Tue, Jan 12, 2016 at 8:24 AM, Sage Weil <sage@newdream.net> wrote:
>> > On Tue, 12 Jan 2016, Mykola Golub wrote:
>> >> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
>> >> > In case anyone is following the mailing list later on, we spoke in IRC
>> >> > and Sage provided a patch - http://fpaste.org/309609/52550203/
>> >>
>> >> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
>> >> > index dc18aec..f9ee23c 100644
>> >> > --- a/src/osd/PG.cc
>> >> > +++ b/src/osd/PG.cc
>> >> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
>> >> >    name = map->get_pool_name(id);
>> >> >    if (pi->get_snap_epoch() == map->get_epoch()) {
>> >> >      pi->build_removed_snaps(newly_removed_snaps);
>> >> > -    newly_removed_snaps.subtract(cached_removed_snaps);
>> >> > -    cached_removed_snaps.union_of(newly_removed_snaps);
>> >> > +    interval_set<snapid_t> intersection;
>> >> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
>> >> > +    if (!(intersection == cached_removed_snaps)) {
>> >> > +      newly_removed_snaps.subtract(cached_removed_snaps);
>> >>
>> >> Sage, won't it still violate the assert?
>> >> "intersection != cached_removed_snaps" means that cached_removed_snaps
>> >> contains snapshots missed in newly_removed_snaps, and we can't subtract?
>> >
>> > Oops, yeah, just remote the !.
>> >
>> > As you can see the problem is that the OSDMap's removed snaps shrank
>> > somehow.  If you crank up logging you can see what the competing sets
>> > are.
>> >
>> > An alternative fix/hack would be to modify the monitor to allow the
>> > snapids that were previously in the set to be added back into the OSDMap.
>> > That's arguably a better fix, although it's a bit more work.  But, even
>> > then, something like the above will be needed since there are still
>> > OSDMaps in the history where the set is smaller.
>> >
>> > sage
>> >
>> >>
>> >> > +      cached_removed_snaps.union_of(newly_removed_snaps);
>> >> > +    } else {
>> >> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
>> >> > +      cached_removed_snaps = newly_removed_snaps;
>> >> > +      newly_removed_snaps.clear();
>> >> > +    }
>> >> >      snapc = pi->get_snap_context();
>> >> >    } else {
>> >> >      newly_removed_snaps.clear();
>> >>
>> >> --
>> >> Mykola Golub
>> >>
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-12 18:03                   ` Boris Lukashev
@ 2016-01-12 20:21                     ` Boris Lukashev
  2016-01-13  1:06                       ` Boris Lukashev
  0 siblings, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-12 20:21 UTC (permalink / raw)
  To: ceph-devel

Having added the following diff to the patch stack and rebuilt, I'm
still seeing the two OSDs fail to come up.
However, with Sage's help in IRC, I now have the following patch and
changes to the librados hello world example to clean out the bad snaps
found via the OSD log at debug level 20:

diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index d7174af..d78ee31 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -137,7 +137,7 @@ void PGPool::update(OSDMapRef map)
     pi->build_removed_snaps(newly_removed_snaps);
     interval_set<snapid_t> intersection;
     intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
-    if (!(intersection == cached_removed_snaps)) {
+    if (intersection == cached_removed_snaps) {
       newly_removed_snaps.subtract(cached_removed_snaps);
       cached_removed_snaps.union_of(newly_removed_snaps);
     } else {


In hello world, remove everything after io_ctx is initialized up to
the out: section, and just before it add
  /*
   * remove snapshots
   */
  {
    io_ctx.selfmanaged_snap_remove(0x19+0xb);
    io_ctx.selfmanaged_snap_remove(0x19+0xc);
    io_ctx.selfmanaged_snap_remove(0x19+0xd);
    io_ctx.selfmanaged_snap_remove(0x19+0xe);
  }

with appropriate pointers to the snaps.
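
For reference, a more self-contained version of that modified hello_world
might look roughly like the sketch below. The pool name ("compute"), the
client id ("admin"), and the snap id range are assumptions taken from this
thread and would need to be adjusted for another cluster; error handling is
deliberately minimal.

#include <rados/librados.hpp>
#include <cstdint>
#include <iostream>

int main() {
  librados::Rados cluster;

  // Connect as client.admin using the local ceph.conf, as hello_world does.
  if (cluster.init("admin") < 0) { std::cerr << "init failed\n"; return 1; }
  if (cluster.conf_read_file(NULL) < 0) { std::cerr << "conf_read_file failed\n"; return 1; }
  if (cluster.connect() < 0) { std::cerr << "connect failed\n"; return 1; }

  // Pool holding the broken snapshots (assumed; use the affected pool).
  librados::IoCtx io_ctx;
  if (cluster.ioctx_create("compute", io_ctx) < 0) {
    std::cerr << "ioctx_create failed\n";
    return 1;
  }

  // Snap ids 0x24..0x27, i.e. 0x19+0xb through 0x19+0xe from the log above.
  for (uint64_t snap = 0x19 + 0xb; snap <= 0x19 + 0xe; ++snap) {
    int ret = io_ctx.selfmanaged_snap_remove(snap);
    std::cout << "selfmanaged_snap_remove(0x" << std::hex << snap << std::dec
              << ") returned " << ret << std::endl;
  }

  io_ctx.close();
  cluster.shutdown();
  return 0;
}

It builds and links the same way as the hello_world example (against
librados).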

So, with the OSDs still not starting, I'm curious what the next
step is: should I keep trying to get the OSDs up, or should I try to
remove the snaps with the librados binary first and then bring them up?
Since I'm manually deleting things from Ceph, I figure I only get one
shot, so suggestions are very welcome :).

Thanks as always,
-Boris

On Tue, Jan 12, 2016 at 1:03 PM, Boris Lukashev
<blukashev@sempervictus.com> wrote:
> I've put some of the output from debug osd 20 at
> http://pastebin.com/he5snqwF, it seems one of the last operations is
> in fact " activate - purged_snaps [1~5,8~2,b~d,19~f]
> cached_removed_snaps [1~5,8~2,b~d,19~b]" which seems to make sense in
> the context of this mismatch...
> There is an ungodly amount of output from level 20, anything specific
> you'd like me to grep for?
>
> On Tue, Jan 12, 2016 at 12:47 PM, Sage Weil <sage@newdream.net> wrote:
>> On Tue, 12 Jan 2016, Boris Lukashev wrote:
>>> Should i try altering the patch with the ! removed and reloading the
>>> OSDs? (would read as ' if (intersection == cached_removed_snaps) { ')
>>> Just for my own education on the matter - if there is disagreement
>>> between the contents of an OSD and the map, like in this case where a
>>> pending request seems to be outstanding, is there no mediation process
>>> between the on-disk data (OSD) and metadata (map) services? With XFS
>>> being used underneath most of the time, that strikes me as somewhat
>>> scary - its not the most consistent of filesystems on the best of
>>> days.
>>>
>>> With the mismatch between the OSD and map, but xfs_check coming back
>>> clean, should i be worried about a corrupt cluster in the event that i
>>> can somehow get it running?
>>> I figure that with FireFly dying and Hammer available from Mirantis, i
>>> should upgrade the cluster, but i would like to know what the safest
>>> way forward is - i'd really prefer to keep using Ceph, its been
>>> educational and quite handy, but if i have to rebuild the cluster
>>> it'll need to keep playing nice with the Fuel deployed OpenStack. If i
>>> can get access to the images stored by Glance and Swift metadata, i'll
>>> gladly export and rebuild clean presuming i can figure out how. The
>>> RBD images are already saved (manual export by tracking the rbd
>>> segment hashes from the metadata files bearing volume-UUID
>>> designations matching what i saw in Cinder, and dd-ing chunks into
>>> flat files for raw images). Worst case, if the cluster wont come back
>>> up and give me access to the data, what's the process for getting it
>>> to a "clean" state such that i can upgrade to hammer and reseed my
>>> glance, swift, and volume data from backups/exports? Do i need to
>>> remove and re-add OSDs, or is there some darker magic at play to
>>> ensure there are no remnants of bad data/messages?
>>
>> I think teh safe path is to
>>
>> (1) reproduce with debug osd = 20, so we can see what the removed_snaps is
>> on the pg vs hte one in the osdmap.
>>
>> (2) fix the ! in the patch and restart the osds
>>
>> (3) re-add the deleted snaps to the osdmap so that things are back in
>> sync.  This is possible through the librados API so it should be pretty
>> simple to fix.  But, let's look at what (1) shows first.
>>
>> sage
>>
>>
>>>
>>> Thank you all
>>> -Boris
>>>
>>> On Tue, Jan 12, 2016 at 8:24 AM, Sage Weil <sage@newdream.net> wrote:
>>> > On Tue, 12 Jan 2016, Mykola Golub wrote:
>>> >> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
>>> >> > In case anyone is following the mailing list later on, we spoke in IRC
>>> >> > and Sage provided a patch - http://fpaste.org/309609/52550203/
>>> >>
>>> >> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
>>> >> > index dc18aec..f9ee23c 100644
>>> >> > --- a/src/osd/PG.cc
>>> >> > +++ b/src/osd/PG.cc
>>> >> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
>>> >> >    name = map->get_pool_name(id);
>>> >> >    if (pi->get_snap_epoch() == map->get_epoch()) {
>>> >> >      pi->build_removed_snaps(newly_removed_snaps);
>>> >> > -    newly_removed_snaps.subtract(cached_removed_snaps);
>>> >> > -    cached_removed_snaps.union_of(newly_removed_snaps);
>>> >> > +    interval_set<snapid_t> intersection;
>>> >> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
>>> >> > +    if (!(intersection == cached_removed_snaps)) {
>>> >> > +      newly_removed_snaps.subtract(cached_removed_snaps);
>>> >>
>>> >> Sage, won't it still violate the assert?
>>> >> "intersection != cached_removed_snaps" means that cached_removed_snaps
>>> >> contains snapshots missed in newly_removed_snaps, and we can't subtract?
>>> >
>>> > Oops, yeah, just remote the !.
>>> >
>>> > As you can see the problem is that the OSDMap's removed snaps shrank
>>> > somehow.  If you crank up logging you can see what the competing sets
>>> > are.
>>> >
>>> > An alternative fix/hack would be to modify the monitor to allow the
>>> > snapids that were previously in the set to be added back into the OSDMap.
>>> > That's arguably a better fix, although it's a bit more work.  But, even
>>> > then, something like the above will be needed since there are still
>>> > OSDMaps in the history where the set is smaller.
>>> >
>>> > sage
>>> >
>>> >>
>>> >> > +      cached_removed_snaps.union_of(newly_removed_snaps);
>>> >> > +    } else {
>>> >> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
>>> >> > +      cached_removed_snaps = newly_removed_snaps;
>>> >> > +      newly_removed_snaps.clear();
>>> >> > +    }
>>> >> >      snapc = pi->get_snap_context();
>>> >> >    } else {
>>> >> >      newly_removed_snaps.clear();
>>> >>
>>> >> --
>>> >> Mykola Golub
>>> >>
>>> >>
>>> >>
>>>
>>>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-12 20:21                     ` Boris Lukashev
@ 2016-01-13  1:06                       ` Boris Lukashev
  2016-01-13 11:35                         ` Mykola Golub
  0 siblings, 1 reply; 15+ messages in thread
From: Boris Lukashev @ 2016-01-13  1:06 UTC (permalink / raw)
  To: ceph-devel

Thank you, Sage! The patch below allowed one of the two stuck OSDs to start up.
What's the procedure from here? Is there a way to make Ceph consistent again?
OSD 4 is still refusing to start, though with a new error message.

Should I drop OSD 4 and rebuild? If so, what's the proper way to do
that? I haven't started any VMs from the volumes yet, but I can now run
rbd ls -p compute (which is new), and ceph -s shows "health HEALTH_WARN 245 pgs
degraded; 245 pgs stuck unclean; recovery 23261/258270 objects
degraded (9.006%); 1/10 in osds are down".
How can I ensure everything is consistent before upgrading the cluster
with the hammer debs from the Mirantis repo?

Thank you very much for the assistance,
-Boris

Patch diff (slightly modified to compile):

--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -135,8 +135,22 @@ void PGPool::update(OSDMapRef map)
   name = map->get_pool_name(id);
   if (pi->get_snap_epoch() == map->get_epoch()) {
     pi->build_removed_snaps(newly_removed_snaps);
-    newly_removed_snaps.subtract(cached_removed_snaps);
-    cached_removed_snaps.union_of(newly_removed_snaps);
+    lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " removed_snaps " << newly_removed_snaps << " cached_removed " << cached_removed_snaps << dendl;
+    interval_set<snapid_t> intersection;
+    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
+    lgeneric_subdout(g_ceph_context, osd, 0)
+      << __func__ << " removed_snaps " << newly_removed_snaps
+      << " cached_removed " << cached_removed_snaps
+      << " interaction " << intersection
+      << dendl;
+    if (intersection == cached_removed_snaps) {
+      newly_removed_snaps.subtract(cached_removed_snaps);
+      cached_removed_snaps.union_of(newly_removed_snaps);
+    } else {
+      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
+      cached_removed_snaps = newly_removed_snaps;
+      newly_removed_snaps.clear();
+    }
     snapc = pi->get_snap_context();
   } else {
     newly_removed_snaps.clear();
@@ -1473,7 +1487,9 @@ void PG::activate(ObjectStore::Transaction& t,
     dout(20) << "activate - purged_snaps " << info.purged_snaps
             << " cached_removed_snaps " << pool.cached_removed_snaps << dendl;
     snap_trimq = pool.cached_removed_snaps;
-    snap_trimq.subtract(info.purged_snaps);
+    interval_set<snapid_t> intersection;
+    intersection.intersection_of(info.purged_snaps, snap_trimq);
+    snap_trimq.subtract(intersection);
     dout(10) << "activate - snap_trimq " << snap_trimq << dendl;
     if (!snap_trimq.empty() && is_clean())
       queue_snap_trim();

On Tue, Jan 12, 2016 at 3:21 PM, Boris Lukashev
<blukashev@sempervictus.com> wrote:
> Having added the following diff to the patch stack and rebuilt, i'm
> still seeing the two OSD not come up.
> However, with sage's help in IRC, i now have the following patch and
> changes to the librados hello world example to clean out bad snaps
> found via osd log 20:
>
> diff --git a/src/osd/PG.cc b/src/osd/PG.cc
> index d7174af..d78ee31 100644
> --- a/src/osd/PG.cc
> +++ b/src/osd/PG.cc
> @@ -137,7 +137,7 @@ void PGPool::update(OSDMapRef map)
>      pi->build_removed_snaps(newly_removed_snaps);
>      interval_set<snapid_t> intersection;
>      intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
> -    if (!(intersection == cached_removed_snaps)) {
> +    if (intersection == cached_removed_snaps) {
>        newly_removed_snaps.subtract(cached_removed_snaps);
>        cached_removed_snaps.union_of(newly_removed_snaps);
>      } else {
>
>
> in hello world, remove everything after io_ctx is initialized until
> the out: section and just before it, add
>   /*
>    * remove snapshots
>    */
>   {
>     io_ctx.selfmanaged_snap_remove(0x19+0xb);
>     io_ctx.selfmanaged_snap_remove(0x19+0xc);
>     io_ctx.selfmanaged_snap_remove(0x19+0xd);
>     io_ctx.selfmanaged_snap_remove(0x19+0xe);
>   }
>
> with appropriate pointers to the snaps.
>
> So, with the OSDs still not starting, i'm curious as to what the next
> step is - should i keep trying to get the OSDs up, or should i try to
> remove the snaps with the librados bin and then try to bring them up.
> Since i'm manually deleting things from Ceph, i figure i only get one
> shot so suggestions are very welcome :).
>
> Thanks as always,
> -Boris
>
> On Tue, Jan 12, 2016 at 1:03 PM, Boris Lukashev
> <blukashev@sempervictus.com> wrote:
>> I've put some of the output from debug osd 20 at
>> http://pastebin.com/he5snqwF, it seems one of the last operations is
>> in fact " activate - purged_snaps [1~5,8~2,b~d,19~f]
>> cached_removed_snaps [1~5,8~2,b~d,19~b]" which seems to make sense in
>> the context of this mismatch...
>> There is an ungodly amount of output from level 20, anything specific
>> you'd like me to grep for?
>>
>> On Tue, Jan 12, 2016 at 12:47 PM, Sage Weil <sage@newdream.net> wrote:
>>> On Tue, 12 Jan 2016, Boris Lukashev wrote:
>>>> Should i try altering the patch with the ! removed and reloading the
>>>> OSDs? (would read as ' if (intersection == cached_removed_snaps) { ')
>>>> Just for my own education on the matter - if there is disagreement
>>>> between the contents of an OSD and the map, like in this case where a
>>>> pending request seems to be outstanding, is there no mediation process
>>>> between the on-disk data (OSD) and metadata (map) services? With XFS
>>>> being used underneath most of the time, that strikes me as somewhat
>>>> scary - its not the most consistent of filesystems on the best of
>>>> days.
>>>>
>>>> With the mismatch between the OSD and map, but xfs_check coming back
>>>> clean, should i be worried about a corrupt cluster in the event that i
>>>> can somehow get it running?
>>>> I figure that with FireFly dying and Hammer available from Mirantis, i
>>>> should upgrade the cluster, but i would like to know what the safest
>>>> way forward is - i'd really prefer to keep using Ceph, its been
>>>> educational and quite handy, but if i have to rebuild the cluster
>>>> it'll need to keep playing nice with the Fuel deployed OpenStack. If i
>>>> can get access to the images stored by Glance and Swift metadata, i'll
>>>> gladly export and rebuild clean presuming i can figure out how. The
>>>> RBD images are already saved (manual export by tracking the rbd
>>>> segment hashes from the metadata files bearing volume-UUID
>>>> designations matching what i saw in Cinder, and dd-ing chunks into
>>>> flat files for raw images). Worst case, if the cluster wont come back
>>>> up and give me access to the data, what's the process for getting it
>>>> to a "clean" state such that i can upgrade to hammer and reseed my
>>>> glance, swift, and volume data from backups/exports? Do i need to
>>>> remove and re-add OSDs, or is there some darker magic at play to
>>>> ensure there are no remnants of bad data/messages?
>>>
>>> I think teh safe path is to
>>>
>>> (1) reproduce with debug osd = 20, so we can see what the removed_snaps is
>>> on the pg vs hte one in the osdmap.
>>>
>>> (2) fix the ! in the patch and restart the osds
>>>
>>> (3) re-add the deleted snaps to the osdmap so that things are back in
>>> sync.  This is possible through the librados API so it should be pretty
>>> simple to fix.  But, let's look at what (1) shows first.
>>>
>>> sage
>>>
>>>
>>>>
>>>> Thank you all
>>>> -Boris
>>>>
>>>> On Tue, Jan 12, 2016 at 8:24 AM, Sage Weil <sage@newdream.net> wrote:
>>>> > On Tue, 12 Jan 2016, Mykola Golub wrote:
>>>> >> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
>>>> >> > In case anyone is following the mailing list later on, we spoke in IRC
>>>> >> > and Sage provided a patch - http://fpaste.org/309609/52550203/
>>>> >>
>>>> >> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
>>>> >> > index dc18aec..f9ee23c 100644
>>>> >> > --- a/src/osd/PG.cc
>>>> >> > +++ b/src/osd/PG.cc
>>>> >> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
>>>> >> >    name = map->get_pool_name(id);
>>>> >> >    if (pi->get_snap_epoch() == map->get_epoch()) {
>>>> >> >      pi->build_removed_snaps(newly_removed_snaps);
>>>> >> > -    newly_removed_snaps.subtract(cached_removed_snaps);
>>>> >> > -    cached_removed_snaps.union_of(newly_removed_snaps);
>>>> >> > +    interval_set<snapid_t> intersection;
>>>> >> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
>>>> >> > +    if (!(intersection == cached_removed_snaps)) {
>>>> >> > +      newly_removed_snaps.subtract(cached_removed_snaps);
>>>> >>
>>>> >> Sage, won't it still violate the assert?
>>>> >> "intersection != cached_removed_snaps" means that cached_removed_snaps
>>>> >> contains snapshots missed in newly_removed_snaps, and we can't subtract?
>>>> >
>>>> > Oops, yeah, just remote the !.
>>>> >
>>>> > As you can see the problem is that the OSDMap's removed snaps shrank
>>>> > somehow.  If you crank up logging you can see what the competing sets
>>>> > are.
>>>> >
>>>> > An alternative fix/hack would be to modify the monitor to allow the
>>>> > snapids that were previously in the set to be added back into the OSDMap.
>>>> > That's arguably a better fix, although it's a bit more work.  But, even
>>>> > then, something like the above will be needed since there are still
>>>> > OSDMaps in the history where the set is smaller.
>>>> >
>>>> > sage
>>>> >
>>>> >>
>>>> >> > +      cached_removed_snaps.union_of(newly_removed_snaps);
>>>> >> > +    } else {
>>>> >> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
>>>> >> > +      cached_removed_snaps = newly_removed_snaps;
>>>> >> > +      newly_removed_snaps.clear();
>>>> >> > +    }
>>>> >> >      snapc = pi->get_snap_context();
>>>> >> >    } else {
>>>> >> >      newly_removed_snaps.clear();
>>>> >>
>>>> >> --
>>>> >> Mykola Golub
>>>> >>
>>>> >>
>>>> >>
>>>>
>>>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-13  1:06                       ` Boris Lukashev
@ 2016-01-13 11:35                         ` Mykola Golub
  0 siblings, 0 replies; 15+ messages in thread
From: Mykola Golub @ 2016-01-13 11:35 UTC (permalink / raw)
  To: Boris Lukashev; +Cc: ceph-devel

On Tue, Jan 12, 2016 at 08:06:03PM -0500, Boris Lukashev wrote:
> Thank you sage! The patch below allowed one of the two stuck OSDs to start up.
> Whats the procedure from here? Is there a way to make Ceph consistent?
> OSD 4 is still refusing to start, new error message however.

Could you please post the log for this case? It might be helpful to
set 'debug osd = 20' before starting.

-- 
Mykola Golub

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 7915 is not resolved
  2016-01-12 13:24             ` Sage Weil
  2016-01-12 17:43               ` Boris Lukashev
@ 2016-01-13 16:49               ` Alexey Sheplyakov
  1 sibling, 0 replies; 15+ messages in thread
From: Alexey Sheplyakov @ 2016-01-13 16:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mykola Golub, Boris Lukashev, ceph-devel

Sage,

> As you can see the problem is that the OSDMap's removed snaps shrank somehow.

Should we teach master (infernalis, hammer) to handle such a situation
more gracefully too?
Or at least make the OSD fail with a clearer message?

Best regards,
     Alexey
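
One possible shape for that (a sketch only, reusing the constructs from the
patches earlier in this thread rather than anything that exists in master)
would be to keep the containment check but log both sets and fail with an
explicit message instead of the bare interval_set assert, e.g. in
PGPool::update():

    interval_set<snapid_t> intersection;
    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
    if (intersection == cached_removed_snaps) {
      newly_removed_snaps.subtract(cached_removed_snaps);
      cached_removed_snaps.union_of(newly_removed_snaps);
    } else {
      lgeneric_subdout(g_ceph_context, osd, 0)
        << __func__ << " removed_snaps shrank: map has " << newly_removed_snaps
        << " but this pg cached " << cached_removed_snaps << dendl;
      assert(0 == "pool removed_snaps shrank; OSDMap and PG snap metadata diverged");
    }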


On Tue, Jan 12, 2016 at 4:24 PM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 12 Jan 2016, Mykola Golub wrote:
>> On Mon, Jan 11, 2016 at 09:00:18PM -0500, Boris Lukashev wrote:
>> > In case anyone is following the mailing list later on, we spoke in IRC
>> > and Sage provided a patch - http://fpaste.org/309609/52550203/
>>
>> > diff --git a/src/osd/PG.cc b/src/osd/PG.cc
>> > index dc18aec..f9ee23c 100644
>> > --- a/src/osd/PG.cc
>> > +++ b/src/osd/PG.cc
>> > @@ -135,8 +135,16 @@ void PGPool::update(OSDMapRef map)
>> >    name = map->get_pool_name(id);
>> >    if (pi->get_snap_epoch() == map->get_epoch()) {
>> >      pi->build_removed_snaps(newly_removed_snaps);
>> > -    newly_removed_snaps.subtract(cached_removed_snaps);
>> > -    cached_removed_snaps.union_of(newly_removed_snaps);
>> > +    interval_set<snapid_t> intersection;
>> > +    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
>> > +    if (!(intersection == cached_removed_snaps)) {
>> > +      newly_removed_snaps.subtract(cached_removed_snaps);
>>
>> Sage, won't it still violate the assert?
>> "intersection != cached_removed_snaps" means that cached_removed_snaps
>> contains snapshots missed in newly_removed_snaps, and we can't subtract?
>
> Oops, yeah, just remote the !.
>
> As you can see the problem is that the OSDMap's removed snaps shrank
> somehow.  If you crank up logging you can see what the competing sets
> are.
>
> An alternative fix/hack would be to modify the monitor to allow the
> snapids that were previously in the set to be added back into the OSDMap.
> That's arguably a better fix, although it's a bit more work.  But, even
> then, something like the above will be needed since there are still
> OSDMaps in the history where the set is smaller.
>
> sage
>
>>
>> > +      cached_removed_snaps.union_of(newly_removed_snaps);
>> > +    } else {
>> > +      lgeneric_subdout(g_ceph_context, osd, 0) << __func__ << " cached_removed_snaps shrank from " << cached_removed_snaps << dendl;
>> > +      cached_removed_snaps = newly_removed_snaps;
>> > +      newly_removed_snaps.clear();
>> > +    }
>> >      snapc = pi->get_snap_context();
>> >    } else {
>> >      newly_removed_snaps.clear();
>>
>> --
>> Mykola Golub
>>
>>
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-01-13 16:49 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-11 16:52 7915 is not resolved Boris Lukashev
2016-01-11 17:25 ` Sage Weil
2016-01-11 18:15   ` Boris Lukashev
2016-01-11 19:06     ` Dmitry Borodaenko
2016-01-11 20:13       ` Boris Lukashev
2016-01-12  2:00         ` Boris Lukashev
2016-01-12  6:53           ` Mykola Golub
2016-01-12 13:24             ` Sage Weil
2016-01-12 17:43               ` Boris Lukashev
2016-01-12 17:47                 ` Sage Weil
2016-01-12 18:03                   ` Boris Lukashev
2016-01-12 20:21                     ` Boris Lukashev
2016-01-13  1:06                       ` Boris Lukashev
2016-01-13 11:35                         ` Mykola Golub
2016-01-13 16:49               ` Alexey Sheplyakov
