* Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
@ 2012-01-14 14:40 Martin Mailand
  2012-01-15  2:45 ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Mailand @ 2012-01-14 14:40 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 287 bytes --]

Hi
one of four OSD died during the update to v0.40 with an Assertion 
os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
Even after a complete shutdown of the cluster and a new start with all 
OSD at the same version, this osd did not start.

The OSD Log is attached.

-martin

[-- Attachment #2: osd.0.log.bz2 --]
[-- Type: application/x-bzip, Size: 23701 bytes --]


* Re: Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
  2012-01-14 14:40 Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error") Martin Mailand
@ 2012-01-15  2:45 ` Sage Weil
  2012-01-15  5:52   ` Sage Weil
  2012-01-15 11:39   ` Martin Mailand
  0 siblings, 2 replies; 6+ messages in thread
From: Sage Weil @ 2012-01-15  2:45 UTC (permalink / raw)
  To: Martin Mailand; +Cc: ceph-devel

Hi Martin-

On Sat, 14 Jan 2012, Martin Mailand wrote:

> Hi
> one of four OSD died during the update to v0.40 with an Assertion
> os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
> Even after a complete shutdown of the cluster and a new start with all OSD at
> the same version, this osd did not start.
> 
> The OSD Log is attached.

It's trying to replay a transaction that appears to be invalid because the 
.2 clone is smaller than it thinks.  Is this the first time the OSD 
crashed, or did it crash once, and you cranked up logs and generated 
this one?  If you have the previous log, that would be helpful... it 
should have a similar transaction dump but a different stack trace.

Also, are any of the 6 patches on top of 0.40 related to the filestore or 
osd?

Thanks!
sage



* Re: Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
  2012-01-15  2:45 ` Sage Weil
@ 2012-01-15  5:52   ` Sage Weil
  2012-01-15 11:41     ` Martin Mailand
  2012-01-15 11:39   ` Martin Mailand
  1 sibling, 1 reply; 6+ messages in thread
From: Sage Weil @ 2012-01-15  5:52 UTC (permalink / raw)
  To: Martin Mailand; +Cc: ceph-devel

Hi Martin-

On Sat, 14 Jan 2012, Sage Weil wrote:
> Hi Martin-
> 
> On Sat, 14 Jan 2012, Martin Mailand wrote:
> 
> > Hi
> > one of four OSD died during the update to v0.40 with an Assertion
> > os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
> > Even after a complete shutdown of the cluster and a new start with all OSD at
> > the same version, this osd did not start.
> > 
> > The OSD Log is attached.
> 
> It's trying to replay a transaction that appears to be invalid because the 
> .2 clone is smaller than it thinks.  Is this the first time the OSD 
> crashed, or did it crash once, and you cranked up logs and generated 
> this one?  If you have the previous log, that would be helpful... it 
> should have a similar transaction dump but a different stack trace.

I pushed a wip-osd-dump-journal branch to git that will make

	ceph-osd -i <whatever> --dump-journal > /tmp/foo.txt

dump the contents of your entire osd journal (sans data) to a text file.  
Do you mind sending that along as well?  I'd like to see what is in the 
journal _after_ the event that is failing (if anything).

Thanks!
sage


> 
> Also, are any of the 6 patches on top of 0.40 related to the filestore or 
> osd?
> 
> Thanks!
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
  2012-01-15  2:45 ` Sage Weil
  2012-01-15  5:52   ` Sage Weil
@ 2012-01-15 11:39   ` Martin Mailand
  2012-01-16  6:14     ` Sage Weil
  1 sibling, 1 reply; 6+ messages in thread
From: Martin Mailand @ 2012-01-15 11:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,
that's exactly what I did, the first two crashes are in this log, 
unfortunately there was no debug level set.

http://85.214.49.87/ceph/osd.0.full.log.bz2

-martin



On 15.01.2012 03:45, Sage Weil wrote:
> Hi Martin-
>
> On Sat, 14 Jan 2012, Martin Mailand wrote:
>
>> Hi
>> one of four OSD died during the update to v0.40 with an Assertion
>> os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
>> Even after a complete shutdown of the cluster and a new start with all OSD at
>> the same version, this osd did not start.
>>
>> The OSD Log is attached.
>
> It's trying to replay a transaction that appears to be invalid because the
> .2 clone is smaller than it thinks.  Is this the first time the OSD
> crashed, or did it crash once, and you cranked up logs and generated
> this one?  If you have the previous log, that would be helpful... it
> should have a similar transaction dump but a different stack trace.
>
> Also, are any of the 6 patches on top of 0.40 related to the filestore or
> osd?
>
> Thanks!
> sage
>


* Re: Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
  2012-01-15  5:52   ` Sage Weil
@ 2012-01-15 11:41     ` Martin Mailand
  0 siblings, 0 replies; 6+ messages in thread
From: Martin Mailand @ 2012-01-15 11:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

here is the requested dump file.

http://85.214.49.87/ceph/foo.txt.bz2

-martin


On 15.01.2012 06:52, Sage Weil wrote:
> Hi Martin-
>
> On Sat, 14 Jan 2012, Sage Weil wrote:
>> Hi Martin-
>>
>> On Sat, 14 Jan 2012, Martin Mailand wrote:
>>
>>> Hi
>>> one of four OSD died during the update to v0.40 with an Assertion
>>> os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
>>> Even after a complete shutdown of the cluster and a new start with all OSD at
>>> the same version, this osd did not start.
>>>
>>> The OSD Log is attached.
>>
>> It's trying to replay a transaction that appears to be invalid because the
>> .2 clone is smaller than it thinks.  Is this the first time the OSD
>> crashed, or did it crash once, and you cranked up logs and generated
>> this one?  If you have the previous log, that would be helpful... it
>> should have a similar transaction dump but a different stack trace.
>
> I pushed a wip-osd-dump-journal branch to git that will make
>
> 	ceph-osd -i <whatever> --dump-journal > /tmp/foo.txt
>
> dump the contents of your entire osd journal (sans data) to a text file.
> Do you mind sending that along as well?  I'd like to see what is in the
> journal _after_ the event that is failing (if anything).
>
> Thanks!
> sage
>
>
>>
>> Also, are any of the 6 patches on top of 0.40 related to the filestore or
>> osd?
>>
>> Thanks!
>> sage
>>


* Re: Assertion in v0.40 - os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
  2012-01-15 11:39   ` Martin Mailand
@ 2012-01-16  6:14     ` Sage Weil
  0 siblings, 0 replies; 6+ messages in thread
From: Sage Weil @ 2012-01-16  6:14 UTC (permalink / raw)
  To: Martin Mailand; +Cc: ceph-devel

On Sun, 15 Jan 2012, Martin Mailand wrote:
> Hi Sage,
> that's exactly what I did, the first two crashes are in this log,
> unfortunately there was no debug level set.

Whoops, right.. the (old) replay messages above confused me.  

There are a couple possibilities here.  One is that the recovery code went 
in the wrong order.  I'm a bit skeptical, though, and even if it did, this 
was mostly just rewritten in wip-backfill for 0.41, so I don't think it's 
worth debugging.  We have some tools to hammer on the snapshot + 
recovery code, but they aren't in the regular qa rotation yet.

More likely is that the SnapSet notion of clone_overlap got out of sync 
with the actual clones.  To check that, we need a dump of the xattrs on 
the _head object.  File sizes and attrs for the clones would help too.

Are the _2 clone objects on other replicas 4MB or 23 bytes?
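
A minimal sketch of gathering the sizes and xattrs asked for above. This is
not a Ceph tool: describe_object is a hypothetical helper, and the paths
passed to it are placeholders for wherever the _head and _2 object files
live in each OSD's data directory. It relies on Python's Linux-only
os.listxattr and os.getxattr calls.

```python
import os


def describe_object(path):
    """Return the size and extended attributes of one object file.

    `path` is a placeholder for an object file in the OSD data directory;
    nothing here is specific to Ceph.
    """
    info = {"path": path, "size": os.stat(path).st_size, "xattrs": {}}
    try:
        names = os.listxattr(path)
    except OSError:
        # Filesystem without xattr support, or insufficient permissions:
        # report the size only.
        return info
    for name in names:
        try:
            info["xattrs"][name] = os.getxattr(path, name)
        except OSError:
            # Attribute is unreadable; record that instead of failing.
            info["xattrs"][name] = None
    return info
```

Running this on the _head and _2 files on each replica and comparing the
"size" fields would show whether a given clone is 4MB or 23 bytes.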

Is this keeping your cluster down?  If so, you can work around the problem 
by making the _2 clone object 4MB so that replay will succeed.  Just be
aware that that rbd image's content will be corrupted.  :/
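
The workaround above could be sketched as follows, under stated
assumptions: pad_object is a hypothetical helper, "4MB" is taken to mean
4 * 1024 * 1024 bytes, and the path is a placeholder for the _2 clone's
file on the crashing OSD. Extending a file with os.truncate fills the new
range with zeros, so the padded range holds zeros rather than the image's
real data. Keeping a copy of the original file first seems prudent.

```python
import os

# Assumption: "4MB" in the message means 4 MiB.
FOUR_MIB = 4 * 1024 * 1024


def pad_object(path, target_size=FOUR_MIB):
    """Grow the file at `path` to `target_size` bytes if it is smaller.

    Existing bytes are left untouched and the added range reads back as
    zeros. Files already at or above `target_size` are not modified.
    Returns the resulting file size.
    """
    current = os.stat(path).st_size
    if current < target_size:
        os.truncate(path, target_size)
    return os.stat(path).st_size
```

Calling pad_object on the 23-byte _2 file would leave its first 23 bytes
in place and extend it to the full size so that replay has enough data to
read.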

sage


> 
> http://85.214.49.87/ceph/osd.0.full.log.bz2
> 
> -martin
> 
> 
> 
> On 15.01.2012 03:45, Sage Weil wrote:
> > Hi Martin-
> > 
> > On Sat, 14 Jan 2012, Martin Mailand wrote:
> > 
> > > Hi
> > > one of four OSD died during the update to v0.40 with an Assertion
> > > os/FileStore.cc: 2438: FAILED assert(0 == "unexpected error")
> > > Even after a complete shutdown of the cluster and a new start with all OSD at
> > > the same version, this osd did not start.
> > > 
> > > The OSD Log is attached.
> > 
> > It's trying to replay a transaction that appears to be invalid because the
> > .2 clone is smaller than it thinks.  Is this the first time the OSD
> > crashed, or did it crash once, and you cranked up logs and generated
> > this one?  If you have the previous log, that would be helpful... it
> > should have a similar transaction dump but a different stack trace.
> > 
> > Also, are any of the 6 patches on top of 0.40 related to the filestore or
> > osd?
> > 
> > Thanks!
> > sage
> > 


