From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gregory Farnum
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
Date: Thu, 10 Jan 2013 10:32:25 -0800
Message-ID:
References: <50B80600.1000304@inktank.com>
 <1354237947.86472.YahooMailNeo@web121901.mail.ne1.yahoo.com>
 <1354238575.90788.YahooMailNeo@web121904.mail.ne1.yahoo.com>
 <1357591756.80653.YahooMailNeo@web121903.mail.ne1.yahoo.com>
 <1357592421.16299.YahooMailNeo@web121902.mail.ne1.yahoo.com>
 <1357680673.72602.YahooMailNeo@web121904.mail.ne1.yahoo.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
Return-path:
Received: from mail-qa0-f44.google.com ([209.85.216.44]:45877 "EHLO
 mail-qa0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1755193Ab3AJSc0 convert rfc822-to-8bit (ORCPT );
 Thu, 10 Jan 2013 13:32:26 -0500
Received: by mail-qa0-f44.google.com with SMTP id z4so1938243qan.3 for ;
 Thu, 10 Jan 2013 10:32:25 -0800 (PST)
In-Reply-To: <1357680673.72602.YahooMailNeo@web121904.mail.ne1.yahoo.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Isaac Otsiabah
Cc: "ceph-devel@vger.kernel.org"

On Tue, Jan 8, 2013 at 1:31 PM, Isaac Otsiabah wrote:
>
> Hi Greg, it appears to be a timing issue because with the flag (debug ms=1) turned on, the system ran slower and became harder to fail. I ran it several times and finally got it to fail on (osd.0) using default crush map. The attached tar file contains log files for all components on g8ct plus the ceph.conf.
>
> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
>
> id    weight  type name               up/down reweight
> -1    6       root default
> -3    6         rack unknownrack
> -2    3           host g8ct
> 0     1             osd.0             down    1
> 1     1             osd.1             up      1
> 2     1             osd.2             up      1
> -4    3           host g13ct
> 3     1             osd.3             up      1
> 4     1             osd.4             up      1
> 5     1             osd.5             up      1
>
> The error messages are in ceph.log and ceph-osd.0.log:
>
> ceph.log:2013-01-08 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had wrong cluster addr (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)

Thanks. I had a brief look through these logs on Tuesday and want to
spend more time with them because they have some odd stuff in them. It
*looks* like the OSD is starting out using a single IP for both the
public and cluster networks and then switching over at some point,
which is...odd.
Knowing more details about how your network is actually set up would
be very helpful.
-Greg
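
For anyone scanning similar logs, here is a minimal sketch of pulling the two addresses out of the "wrong cluster addr" line quoted above and checking whether they fall in the same subnet. The regular expression is inferred only from that one quoted log line, not from the Ceph source, and the /24 prefix length is an assumption since the actual netmasks are not given in this thread. The function names are made up for illustration.

```python
import re
from ipaddress import ip_address, ip_network

# Pattern inferred from the single log line quoted in this thread;
# it is an assumption, not taken from the Ceph source.
PATTERN = re.compile(
    r"map e(?P<epoch>\d+) had wrong cluster addr "
    r"\((?P<map_addr>[\d.]+):\d+/\d+ != my (?P<my_addr>[\d.]+):\d+/\d+\)"
)


def parse_wrong_cluster_addr(line):
    """Return the map epoch and both IPs, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "epoch": int(m.group("epoch")),
        "map_addr": ip_address(m.group("map_addr")),  # address recorded in the OSD map
        "my_addr": ip_address(m.group("my_addr")),    # address the OSD reports as its own
    }


def same_subnet(a, b, prefix_len=24):
    """Check whether b lies in a's network; the /24 default is an assumption."""
    net = ip_network(f"{a}/{prefix_len}", strict=False)
    return b in net


line = (
    "2013-01-08 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] "
    "map e15 had wrong cluster addr "
    "(192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)"
)
info = parse_wrong_cluster_addr(line)
print(info["epoch"], info["map_addr"], info["my_addr"])
print(same_subnet(info["map_addr"], info["my_addr"]))
```

With the quoted line, the map holds 192.168.0.124 while the OSD reports 192.168.1.124, so under the assumed /24 the two addresses sit in different subnets. That only restates what the log says; it does not explain why the OSD changed address.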