* all OSDs crash at more or less the same time
@ 2016-03-06 23:37 Willem Jan Withagen
From: Willem Jan Withagen @ 2016-03-06 23:37 UTC (permalink / raw)
  To: Ceph Development

Hi,

While running cephtool-test-rados.sh, "all of a sudden" the OSDs
disappear. I had one of the logs open, which contained at the end:

    -2> 2016-03-06 21:56:02.073226 80569ed00  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x806795200' had timed out after 15
    -1> 2016-03-06 21:56:02.073248 80569ed00  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x806795200' had suicide timed out after 150
     0> 2016-03-06 21:56:02.113948 80569ed00 -1 common/HeartbeatMap.cc:
In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d
*, const char *, time_t)' thread 80569ed00 time 2016-03-06 21:56:02.073269
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

The monitor is still running. It claims the heartbeat_map is valid, but
still the OSD suicides??

And what messages would prevent this from happening?
Receiving heartbeats from other OSDs?

If so, how would a 2-OSD server even survive if its connection were
split for longer than 2.5 minutes?

--WjW
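
The two numbers in the log excerpt above are the interesting part: after 15
seconds of inactivity the heartbeat check only logs a warning, and only after
150 seconds does the assert in common/HeartbeatMap.cc fire and kill the
daemon. Below is a minimal sketch of that two-threshold watchdog; it is
illustrative Python, not Ceph's actual C++ code, and the class and method
names are placeholders.

    import time

    class HeartbeatHandle:
        """One per worker thread: the thread 'touches' it whenever it makes progress."""
        def __init__(self, name, grace, suicide_grace):
            self.name = name
            self.grace = grace                  # warn after this many idle seconds
            self.suicide_grace = suicide_grace  # abort after this many idle seconds
            self.last_touch = time.time()

        def touch(self):
            self.last_touch = time.time()

    class HeartbeatMap:
        """Checked periodically by a separate thread."""
        def __init__(self):
            self.handles = []

        def add(self, handle):
            self.handles.append(handle)

        def is_healthy(self, now=None):
            now = time.time() if now is None else now
            healthy = True
            for h in self.handles:
                idle = now - h.last_touch
                if idle > h.grace:
                    print("heartbeat_map is_healthy '%s' had timed out after %d"
                          % (h.name, h.grace))
                    healthy = False
                if idle > h.suicide_grace:
                    # Ceph's real code hits FAILED assert(0 == "hit suicide timeout")
                    # here and the OSD process dies.
                    raise AssertionError("hit suicide timeout")
            return healthy

    if __name__ == "__main__":
        hm = HeartbeatMap()
        h = HeartbeatHandle("OSD::osd_op_tp thread 0x806795200",
                            grace=15, suicide_grace=150)
        hm.add(h)
        h.last_touch -= 151      # pretend the thread has been stuck for 151 s
        hm.is_healthy()          # logs the grace warning, then raises

Running it reproduces the -2/-1/0 sequence from the log: a grace warning
first, then the suicide abort.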


* Re: all OSDs crash at more or less the same time
From: Sage Weil @ 2016-03-07 11:08 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Ceph Development

On Mon, 7 Mar 2016, Willem Jan Withagen wrote:
> Hi,
> 
> While running cephtool-test-rados.sh, "all of a sudden" the OSDs
> disappear. I had one of the logs open, which contained at the end:
> 
>     -2> 2016-03-06 21:56:02.073226 80569ed00  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x806795200' had timed out after 15
>     -1> 2016-03-06 21:56:02.073248 80569ed00  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x806795200' had suicide timed out after 150
>      0> 2016-03-06 21:56:02.113948 80569ed00 -1 common/HeartbeatMap.cc:
> In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d
> *, const char *, time_t)' thread 80569ed00 time 2016-03-06 21:56:02.073269
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
> 
> The monitor is still running. It claims the heartbeat_map is valid, but
> still the OSD suicides??
> 
> And what messages would prevent this from happening?
> Receiving heartbeats from other OSDs?
> 
> If so, how would a 2-OSD server even survive if its connection were
> split for longer than 2.5 minutes?

This is an internal heartbeat indicating that the osd_op_tp thread got 
stuck somewhere.  Search backward in the log for the thread id 806795200 
to see the last thing that it did...

sage
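
To follow that suggestion, the last activity of thread 806795200 can be
pulled out of the OSD log with grep or less, or with a short script such as
the sketch below (the log path is a placeholder; point it at the actual OSD
log file):

    import sys

    THREAD_ID = "806795200"       # thread id from the heartbeat messages above
    LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "ceph-osd.0.log"  # placeholder

    # Collect every log line that mentions the thread, keep only the tail.
    with open(LOG_PATH, errors="replace") as f:
        matches = [line.rstrip("\n") for line in f if THREAD_ID in line]

    # The last entries before the assert show what the thread was doing
    # when it stopped touching its heartbeat handle.
    for line in matches[-20:]:
        print(line)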

