All of lore.kernel.org
 help / color / mirror / Atom feed
* Assertion error in librados
@ 2014-02-25 14:49 Filippos Giannakos
  2014-02-25 15:28 ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Filippos Giannakos @ 2014-02-25 14:49 UTC (permalink / raw)
  To: ceph-devel; +Cc: synnefo-devel

Hello all,

We recently bumped into the following assertion error in librados on our
production service:


common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa2c2ccf700 time 2014-02-21 07:23:26.340791
common/Mutex.cc: 93: FAILED assert(r == 0)
 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (Mutex::Lock(bool)+0x131) [0x7fa2c7707431]
 2: (SimpleMessenger::submit_message(Message*, Connection*, entity_addr_t const&, int, bool)+0x52) [0x7fa2c7863172]
 3: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0x23e) [0x7fa2c7863bfe]
 4: (Objecter::send_op(Objecter::Op*)+0x32c) [0x7fa2c76b317c]
 5: (Objecter::handle_osd_map(MOSDMap*)+0x365) [0x7fa2c76b7805]
 6: (librados::RadosClient::_dispatch(Message*)+0x7c) [0x7fa2c768c70c]
 7: (librados::RadosClient::ms_dispatch(Message*)+0x9b) [0x7fa2c768c82b]
 8: (DispatchQueue::entry()+0x4eb) [0x7fa2c7800d2b]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa2c78666ad]
 10: (()+0x6b50) [0x7fa2c7203b50]
 11: (clone()+0x6d) [0x7fa2c6b570ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'


From what I can tell, there were some network problems on our RADOS cluster,
after which many of our librados clients failed with the above assertion error.

Do you have any ideas of what might went wrong ?

Kind Regards,
-- 
Filippos
<philipgian@grnet.gr>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 14:49 Assertion error in librados Filippos Giannakos
@ 2014-02-25 15:28 ` Gregory Farnum
  2014-02-25 15:41   ` Yehuda Sadeh
  2014-02-25 15:58   ` Filippos Giannakos
  0 siblings, 2 replies; 10+ messages in thread
From: Gregory Farnum @ 2014-02-25 15:28 UTC (permalink / raw)
  To: Filippos Giannakos; +Cc: ceph-devel, synnefo-devel

Do you have logs? The assert indicates that the messenger got back
something other than "okay" when trying to grab a local Mutex, which
shouldn't be able to happen. It may be that some error-handling path
didn't drop it (within the same thread that later tried to grab it
again), but we'll need more details to track it down.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Feb 25, 2014 at 6:49 AM, Filippos Giannakos <philipgian@grnet.gr> wrote:
> Hello all,
>
> We recently bumped into the following assertion error in librados on our
> production service:
>
>
> common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa2c2ccf700 time 2014-02-21 07:23:26.340791
> common/Mutex.cc: 93: FAILED assert(r == 0)
>  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>  1: (Mutex::Lock(bool)+0x131) [0x7fa2c7707431]
>  2: (SimpleMessenger::submit_message(Message*, Connection*, entity_addr_t const&, int, bool)+0x52) [0x7fa2c7863172]
>  3: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0x23e) [0x7fa2c7863bfe]
>  4: (Objecter::send_op(Objecter::Op*)+0x32c) [0x7fa2c76b317c]
>  5: (Objecter::handle_osd_map(MOSDMap*)+0x365) [0x7fa2c76b7805]
>  6: (librados::RadosClient::_dispatch(Message*)+0x7c) [0x7fa2c768c70c]
>  7: (librados::RadosClient::ms_dispatch(Message*)+0x9b) [0x7fa2c768c82b]
>  8: (DispatchQueue::entry()+0x4eb) [0x7fa2c7800d2b]
>  9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa2c78666ad]
>  10: (()+0x6b50) [0x7fa2c7203b50]
>  11: (clone()+0x6d) [0x7fa2c6b570ed]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
>
>
> From what I can tell, there were some network problems on our RADOS cluster,
> after which many of our librados clients failed with the above assertion error.
>
> Do you have any ideas of what might went wrong ?
>
> Kind Regards,
> --
> Filippos
> <philipgian@grnet.gr>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 15:28 ` Gregory Farnum
@ 2014-02-25 15:41   ` Yehuda Sadeh
  2014-02-25 15:58   ` Filippos Giannakos
  1 sibling, 0 replies; 10+ messages in thread
From: Yehuda Sadeh @ 2014-02-25 15:41 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Filippos Giannakos, ceph-devel, synnefo-devel

Looks to me like we try to send a message in the handle_osd_map when
we are still under the lock that we try to grab.

Yehuda

On Tue, Feb 25, 2014 at 7:28 AM, Gregory Farnum <greg@inktank.com> wrote:
> Do you have logs? The assert indicates that the messenger got back
> something other than "okay" when trying to grab a local Mutex, which
> shouldn't be able to happen. It may be that some error-handling path
> didn't drop it (within the same thread that later tried to grab it
> again), but we'll need more details to track it down.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Feb 25, 2014 at 6:49 AM, Filippos Giannakos <philipgian@grnet.gr> wrote:
>> Hello all,
>>
>> We recently bumped into the following assertion error in librados on our
>> production service:
>>
>>
>> common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa2c2ccf700 time 2014-02-21 07:23:26.340791
>> common/Mutex.cc: 93: FAILED assert(r == 0)
>>  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>>  1: (Mutex::Lock(bool)+0x131) [0x7fa2c7707431]
>>  2: (SimpleMessenger::submit_message(Message*, Connection*, entity_addr_t const&, int, bool)+0x52) [0x7fa2c7863172]
>>  3: (SimpleMessenger::_send_message(Message*, Connection*, bool)+0x23e) [0x7fa2c7863bfe]
>>  4: (Objecter::send_op(Objecter::Op*)+0x32c) [0x7fa2c76b317c]
>>  5: (Objecter::handle_osd_map(MOSDMap*)+0x365) [0x7fa2c76b7805]
>>  6: (librados::RadosClient::_dispatch(Message*)+0x7c) [0x7fa2c768c70c]
>>  7: (librados::RadosClient::ms_dispatch(Message*)+0x9b) [0x7fa2c768c82b]
>>  8: (DispatchQueue::entry()+0x4eb) [0x7fa2c7800d2b]
>>  9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa2c78666ad]
>>  10: (()+0x6b50) [0x7fa2c7203b50]
>>  11: (clone()+0x6d) [0x7fa2c6b570ed]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>>
>>
>> From what I can tell, there were some network problems on our RADOS cluster,
>> after which many of our librados clients failed with the above assertion error.
>>
>> Do you have any ideas of what might went wrong ?
>>
>> Kind Regards,
>> --
>> Filippos
>> <philipgian@grnet.gr>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 15:28 ` Gregory Farnum
  2014-02-25 15:41   ` Yehuda Sadeh
@ 2014-02-25 15:58   ` Filippos Giannakos
  2014-02-25 17:33     ` Noah Watkins
  2014-02-25 17:51     ` Gregory Farnum
  1 sibling, 2 replies; 10+ messages in thread
From: Filippos Giannakos @ 2014-02-25 15:58 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, synnefo-devel

Hi Greg,

Unfortunately we don't keep any Ceph related logs on the client side. On the
server side, we kept the default log settings to avoid overlogging.
Do you think that there might be something usefull on the OSD side ?

On Tue, Feb 25, 2014 at 07:28:30AM -0800, Gregory Farnum wrote:
> Do you have logs? The assert indicates that the messenger got back
> something other than "okay" when trying to grab a local Mutex, which
> shouldn't be able to happen. It may be that some error-handling path
> didn't drop it (within the same thread that later tried to grab it
> again), but we'll need more details to track it down.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 

Kind Regards,
-- 
Filippos
<philipgian@grnet.gr>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 15:58   ` Filippos Giannakos
@ 2014-02-25 17:33     ` Noah Watkins
  2014-02-25 17:51       ` Josh Durgin
  2014-02-25 17:51     ` Gregory Farnum
  1 sibling, 1 reply; 10+ messages in thread
From: Noah Watkins @ 2014-02-25 17:33 UTC (permalink / raw)
  To: Filippos Giannakos; +Cc: Gregory Farnum, ceph-devel, synnefo-devel

Perhaps using gtest-style asserts (ASSERT_EQ(r, 0)) in Ceph would be
useful so we can see parameter values to the assertion in the log. In
this case, the return value from pthread_mutex_lock is almost
certainly EINVAL, but it'd be informative to know for sure.

On Tue, Feb 25, 2014 at 7:58 AM, Filippos Giannakos <philipgian@grnet.gr> wrote:
> Hi Greg,
>
> Unfortunately we don't keep any Ceph related logs on the client side. On the
> server side, we kept the default log settings to avoid overlogging.
> Do you think that there might be something usefull on the OSD side ?
>
> On Tue, Feb 25, 2014 at 07:28:30AM -0800, Gregory Farnum wrote:
>> Do you have logs? The assert indicates that the messenger got back
>> something other than "okay" when trying to grab a local Mutex, which
>> shouldn't be able to happen. It may be that some error-handling path
>> didn't drop it (within the same thread that later tried to grab it
>> again), but we'll need more details to track it down.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>
> Kind Regards,
> --
> Filippos
> <philipgian@grnet.gr>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 15:58   ` Filippos Giannakos
  2014-02-25 17:33     ` Noah Watkins
@ 2014-02-25 17:51     ` Gregory Farnum
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2014-02-25 17:51 UTC (permalink / raw)
  To: Filippos Giannakos; +Cc: ceph-devel, synnefo-devel

Nope; it's an entirely local problem. I'm kind of surprised there
wasn't more logging available in the same location you got the core
dump, but it's possible the log generation is turned off (in addition
to the log dumping).

The Dispatch lock and the messenger lock are distinct, Yehuda, and the
messenger and pipe locks are the only ones we should be grabbing in
submit_message. Unless you were referring to something else?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Feb 25, 2014 at 7:58 AM, Filippos Giannakos <philipgian@grnet.gr> wrote:
> Hi Greg,
>
> Unfortunately we don't keep any Ceph related logs on the client side. On the
> server side, we kept the default log settings to avoid overlogging.
> Do you think that there might be something usefull on the OSD side ?
>
> On Tue, Feb 25, 2014 at 07:28:30AM -0800, Gregory Farnum wrote:
>> Do you have logs? The assert indicates that the messenger got back
>> something other than "okay" when trying to grab a local Mutex, which
>> shouldn't be able to happen. It may be that some error-handling path
>> didn't drop it (within the same thread that later tried to grab it
>> again), but we'll need more details to track it down.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>
> Kind Regards,
> --
> Filippos
> <philipgian@grnet.gr>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 17:33     ` Noah Watkins
@ 2014-02-25 17:51       ` Josh Durgin
  2014-02-25 19:26         ` Noah Watkins
  0 siblings, 1 reply; 10+ messages in thread
From: Josh Durgin @ 2014-02-25 17:51 UTC (permalink / raw)
  To: Noah Watkins, Filippos Giannakos
  Cc: Gregory Farnum, ceph-devel, synnefo-devel

That's a good idea. This particular assert in a Mutex is almost always
a use-after-free of the Mutex or structure containing it though.

On 02/25/2014 09:33 AM, Noah Watkins wrote:
> Perhaps using gtest-style asserts (ASSERT_EQ(r, 0)) in Ceph would be
> useful so we can see parameter values to the assertion in the log. In
> this case, the return value from pthread_mutex_lock is almost
> certainly EINVAL, but it'd be informative to know for sure.
>
> On Tue, Feb 25, 2014 at 7:58 AM, Filippos Giannakos <philipgian@grnet.gr> wrote:
>> Hi Greg,
>>
>> Unfortunately we don't keep any Ceph related logs on the client side. On the
>> server side, we kept the default log settings to avoid overlogging.
>> Do you think that there might be something usefull on the OSD side ?
>>
>> On Tue, Feb 25, 2014 at 07:28:30AM -0800, Gregory Farnum wrote:
>>> Do you have logs? The assert indicates that the messenger got back
>>> something other than "okay" when trying to grab a local Mutex, which
>>> shouldn't be able to happen. It may be that some error-handling path
>>> didn't drop it (within the same thread that later tried to grab it
>>> again), but we'll need more details to track it down.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 17:51       ` Josh Durgin
@ 2014-02-25 19:26         ` Noah Watkins
  2014-03-28  9:54           ` Filippos Giannakos
  0 siblings, 1 reply; 10+ messages in thread
From: Noah Watkins @ 2014-02-25 19:26 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Filippos Giannakos, Gregory Farnum, ceph-devel, synnefo-devel

On Tue, Feb 25, 2014 at 9:51 AM, Josh Durgin <josh.durgin@inktank.com> wrote:
> That's a good idea. This particular assert in a Mutex is almost always
> a use-after-free of the Mutex or structure containing it though.

I think that a use-after-free will also throw an EINVAL (assuming it
isn't a pathalogical case) as pthread_mutex_lock checks an
initialization magic variable. I think that particular mutex isn't
initialized with flags that would cause any of the other possible
return values.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-02-25 19:26         ` Noah Watkins
@ 2014-03-28  9:54           ` Filippos Giannakos
  2014-03-31 17:10             ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Filippos Giannakos @ 2014-03-28  9:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Josh Durgin, Gregory Farnum, Noah Watkins, synnefo-devel

Hello,

We recently bumped again into the same assertion error.
Do you have any indications or update regarding the cause ?

On Tue, Feb 25, 2014 at 11:26:15AM -0800, Noah Watkins wrote:
> On Tue, Feb 25, 2014 at 9:51 AM, Josh Durgin <josh.durgin@inktank.com> wrote:
> > That's a good idea. This particular assert in a Mutex is almost always
> > a use-after-free of the Mutex or structure containing it though.
> 
> I think that a use-after-free will also throw an EINVAL (assuming it
> isn't a pathalogical case) as pthread_mutex_lock checks an
> initialization magic variable. I think that particular mutex isn't
> initialized with flags that would cause any of the other possible
> return values.

Kind regards,
-- 
Filippos
<philipgian@grnet.gr>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Assertion error in librados
  2014-03-28  9:54           ` Filippos Giannakos
@ 2014-03-31 17:10             ` Gregory Farnum
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2014-03-31 17:10 UTC (permalink / raw)
  To: Filippos Giannakos; +Cc: ceph-devel, Josh Durgin, Noah Watkins, synnefo-devel

Nope, I don't think anybody's looked into it. If you have core dumps
you could get a backtrace and the return value referenced.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Mar 28, 2014 at 2:54 AM, Filippos Giannakos <philipgian@grnet.gr> wrote:
> Hello,
>
> We recently bumped again into the same assertion error.
> Do you have any indications or update regarding the cause ?
>
> On Tue, Feb 25, 2014 at 11:26:15AM -0800, Noah Watkins wrote:
>> On Tue, Feb 25, 2014 at 9:51 AM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> > That's a good idea. This particular assert in a Mutex is almost always
>> > a use-after-free of the Mutex or structure containing it though.
>>
>> I think that a use-after-free will also throw an EINVAL (assuming it
>> isn't a pathalogical case) as pthread_mutex_lock checks an
>> initialization magic variable. I think that particular mutex isn't
>> initialized with flags that would cause any of the other possible
>> return values.
>
> Kind regards,
> --
> Filippos
> <philipgian@grnet.gr>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-03-31 17:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-25 14:49 Assertion error in librados Filippos Giannakos
2014-02-25 15:28 ` Gregory Farnum
2014-02-25 15:41   ` Yehuda Sadeh
2014-02-25 15:58   ` Filippos Giannakos
2014-02-25 17:33     ` Noah Watkins
2014-02-25 17:51       ` Josh Durgin
2014-02-25 19:26         ` Noah Watkins
2014-03-28  9:54           ` Filippos Giannakos
2014-03-31 17:10             ` Gregory Farnum
2014-02-25 17:51     ` Gregory Farnum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.