From mboxrd@z Thu Jan 1 00:00:00 1970
From: Willem Jan Withagen
Subject: Re: OSD not coming back up again
Date: Thu, 11 Aug 2016 13:13:54 +0200
Message-ID:
References: <453729567.2335.1470896803201@ox.pcextreme.nl> <301395781.2383.1470913324040@ox.pcextreme.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Return-path:
Received: from smtp.digiware.nl ([176.74.240.9]:50081 "EHLO smtp.digiware.nl" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752100AbcHKLOI (ORCPT ); Thu, 11 Aug 2016 07:14:08 -0400
In-Reply-To: <301395781.2383.1470913324040@ox.pcextreme.nl>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Wido den Hollander , Ceph Development

On 11-8-2016 13:02, Wido den Hollander wrote:
>> Right before setting the osd to down, the newest map is 173.
>> So some maps have been exchanged....
>>
>> How does the OSD decide that it is healthy?
>> If it gets peer (ping) messages that it is up?
>>
>
> Not 100% sure, but waiting for healthy is something I haven't seen before.
>
> Is it incrementing the newest_map when the cluster advances?
>
>>> Maybe try debug_osd = 20
>>
>> But then still I need to know what to look for, since 20 generates
>> serious output.
>>
> True, it will. But I don't know exactly what to look for. debug_osd = 20 might reveal more information there.
>
> It might be a very simple log line which tells you what is going on.

See my other mail; the log did not reveal much, other than that it made
me look at the sockets.

Looking at the socket states, I think the sockets on the OSD that is
going down are not closed correctly, so osd.0 thinks it is still
connected. osd.1 and osd.2 have no connection to osd.0, so they are
correct in reporting that it is dead.

--WjW
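
A minimal sketch of the socket-state comparison described above, not the
procedure actually used in this thread. It assumes netstat-like text with
six whitespace-separated columns (proto, recv-q, send-q, local, remote,
state); real netstat output differs between platforms. The function names
and the sample addresses (documentation range 192.0.2.0/24) are
hypothetical. It lists connections that one side still holds as
ESTABLISHED while the peers have no matching socket.

```python
def parse_netstat(text):
    """Return {(local, remote): state} for TCP lines of netstat-like text.

    Assumed column layout (an assumption, adjust per platform):
    proto recv-q send-q local-address remote-address state
    """
    conns = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0].startswith("tcp"):
            conns[(fields[3], fields[4])] = fields[5]
    return conns


def half_open(side_a, side_b):
    """Connections ESTABLISHED on side A with no mirror socket on side B."""
    return sorted(
        (local, remote)
        for (local, remote), state in side_a.items()
        if state == "ESTABLISHED" and (remote, local) not in side_b
    )


# Hypothetical sample data: the view from the OSD that went down ...
OSD0_VIEW = """\
tcp 0 0 192.0.2.10:6800 192.0.2.11:40001 ESTABLISHED
tcp 0 0 192.0.2.10:6800 192.0.2.12:40002 ESTABLISHED
"""

# ... and the merged view from its peers, where one socket is already gone.
PEERS_VIEW = """\
tcp 0 0 192.0.2.11:40001 192.0.2.10:6800 ESTABLISHED
"""

stale = half_open(parse_netstat(OSD0_VIEW), parse_netstat(PEERS_VIEW))
for local, remote in stale:
    print("stale on side A:", local, "->", remote)
```

Any entry printed here is a socket that only one side believes is still
open, which would match the picture of osd.0 thinking it is connected
while osd.1 and osd.2 have already dropped it.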