* parsing in the ceph osd subsystem
@ 2012-11-29  7:45 Andrey Korolyov
  2012-11-29  7:53 ` Gregory Farnum
  2012-11-29 16:34 ` Sage Weil
  0 siblings, 2 replies; 24+ messages in thread
From: Andrey Korolyov @ 2012-11-29  7:45 UTC (permalink / raw)
  To: ceph-devel

$ ceph osd down -
osd.0 is already down
$ ceph osd down ---
osd.0 is already down

the same for ``+'', ``/'', ``%'' and so on - I think that for the osd
subsystem the ceph cli should explicitly accept only positive integers
and zero, refusing all other input.


* Re: parsing in the ceph osd subsystem
  2012-11-29  7:45 parsing in the ceph osd subsystem Andrey Korolyov
@ 2012-11-29  7:53 ` Gregory Farnum
  2012-11-29 16:34 ` Sage Weil
  1 sibling, 0 replies; 24+ messages in thread
From: Gregory Farnum @ 2012-11-29  7:53 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

On Wednesday, November 28, 2012 at 11:45 PM, Andrey Korolyov wrote:
> $ ceph osd down -
> osd.0 is already down
> $ ceph osd down ---
> osd.0 is already down
> 
> the same for ``+'', ``/'', ``%'' and so - I think that for osd subsys
> ceph cli should explicitly work only with positive integers plus zero,
> refusing all other input.

 
Yes indeed! This is already fixed in bobtail (and v0.54, maybe?), and I believe Joao went through and audited all our input parsing pretty carefully. Certainly things are much better than they were.
-Greg




* Re: parsing in the ceph osd subsystem
  2012-11-29  7:45 parsing in the ceph osd subsystem Andrey Korolyov
  2012-11-29  7:53 ` Gregory Farnum
@ 2012-11-29 16:34 ` Sage Weil
  2012-11-29 16:49   ` Joao Eduardo Luis
  2012-11-29 19:01   ` Andrey Korolyov
  1 sibling, 2 replies; 24+ messages in thread
From: Sage Weil @ 2012-11-29 16:34 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel

On Thu, 29 Nov 2012, Andrey Korolyov wrote:
> $ ceph osd down -
> osd.0 is already down
> $ ceph osd down ---
> osd.0 is already down
> 
> the same for ``+'', ``/'', ``%'' and so - I think that for osd subsys
> ceph cli should explicitly work only with positive integers plus zero,
> refusing all other input.

which branch is this?  this parsing is cleaned up in the latest 
next/master.



> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: parsing in the ceph osd subsystem
  2012-11-29 16:34 ` Sage Weil
@ 2012-11-29 16:49   ` Joao Eduardo Luis
  2012-11-29 19:01   ` Andrey Korolyov
  1 sibling, 0 replies; 24+ messages in thread
From: Joao Eduardo Luis @ 2012-11-29 16:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Andrey Korolyov, ceph-devel

On 11/29/2012 04:34 PM, Sage Weil wrote:
> On Thu, 29 Nov 2012, Andrey Korolyov wrote:
>> $ ceph osd down -
>> osd.0 is already down
>> $ ceph osd down ---
>> osd.0 is already down
>>
>> the same for ``+'', ``/'', ``%'' and so - I think that for osd subsys
>> ceph cli should explicitly work only with positive integers plus zero,
>> refusing all other input.
> 
> which branch is this?  this parsing is cleaned up in the latest 
> next/master.

I confirm this is fixed, but while checking I noticed that 'ceph osd
create <garbage>' will create an osd nonetheless. I just added a fix for
this to wip-mon-osd-create-fix, which makes sure that if the uuid
argument is specified, -EINVAL is returned when the uuid is invalid.
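
In rough terms the check amounts to this (a simplified sketch, not the
actual branch code; I'm using libuuid's uuid_parse() here purely for
illustration):

  #include <uuid/uuid.h>
  #include <cerrno>
  #include <string>

  // If the optional uuid argument is present but malformed, refuse the
  // command with -EINVAL instead of silently creating a new osd.
  int check_osd_create_args(const std::string &uuid_arg, std::string &err)
  {
    if (!uuid_arg.empty()) {
      uuid_t uu;
      if (uuid_parse(uuid_arg.c_str(), uu) != 0) {  // libuuid: 0 on success
        err = "invalid uuid '" + uuid_arg + "'";
        return -EINVAL;
      }
    }
    return 0;  // no uuid given, or a well-formed one: go ahead and create
  }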

  -Joao



* Re: parsing in the ceph osd subsystem
  2012-11-29 16:34 ` Sage Weil
  2012-11-29 16:49   ` Joao Eduardo Luis
@ 2012-11-29 19:01   ` Andrey Korolyov
  2012-11-29 19:49     ` Joao Eduardo Luis
  2012-11-30  1:04     ` Joao Eduardo Luis
  1 sibling, 2 replies; 24+ messages in thread
From: Andrey Korolyov @ 2012-11-29 19:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil <sage@inktank.com> wrote:
> On Thu, 29 Nov 2012, Andrey Korolyov wrote:
>> $ ceph osd down -
>> osd.0 is already down
>> $ ceph osd down ---
>> osd.0 is already down
>>
>> the same for ``+'', ``/'', ``%'' and so - I think that for osd subsys
>> ceph cli should explicitly work only with positive integers plus zero,
>> refusing all other input.
>
> which branch is this?  this parsing is cleaned up in the latest
> next/master.
>
>

It was produced by the 0.54 tag. I built
dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem is
gone (except that ``-0'' is still parsed as 0, and 00000/0000001 as 0
and 1 respectively).

>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


* Re: parsing in the ceph osd subsystem
  2012-11-29 19:01   ` Andrey Korolyov
@ 2012-11-29 19:49     ` Joao Eduardo Luis
  2012-11-30  1:04     ` Joao Eduardo Luis
  1 sibling, 0 replies; 24+ messages in thread
From: Joao Eduardo Luis @ 2012-11-29 19:49 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: Sage Weil, ceph-devel

On 11/29/2012 07:01 PM, Andrey Korolyov wrote:
> On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil <sage@inktank.com> wrote:
>> On Thu, 29 Nov 2012, Andrey Korolyov wrote:
>>> $ ceph osd down -
>>> osd.0 is already down
>>> $ ceph osd down ---
>>> osd.0 is already down
>>>
>>> the same for ``+'', ``/'', ``%'' and so - I think that for osd subsys
>>> ceph cli should explicitly work only with positive integers plus zero,
>>> refusing all other input.
>>
>> which branch is this?  this parsing is cleaned up in the latest
>> next/master.
>>
>>
> 
> It was produced by the 0.54 tag. I built
> dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem is
> gone (except that ``-0'' is still parsed as 0, and 00000/0000001 as 0
> and 1 respectively).

We use strtol() to parse numeric values, and '-0', '00000' or '00001'
are all valid numeric values. I suppose we could require the argument to
be strictly numeric, hence getting rid of '-0', and enforce stricter
checks on the parameters to rule out valid numeric values that look
funny; for the '0*\d' cases that should be fairly simple.
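
Roughly what I have in mind (an illustrative sketch only, not the code
that is actually in the tree):

  // Stricter osd id parser: plain digits only, no sign, no leading or
  // trailing garbage.  Bare strtol() happily accepts "-0", "+1", "  1"
  // or "1abc", which is exactly what we want to rule out.
  #include <cctype>
  #include <cerrno>
  #include <cstdlib>

  // Returns the osd id, or -1 if the string is not a plain non-negative integer.
  long parse_osd_id(const char *s)
  {
    if (*s == '\0' || !isdigit((unsigned char)*s))
      return -1;                        // rejects "", "-0", "+1", "  1"
    char *end = NULL;
    errno = 0;
    long id = strtol(s, &end, 10);
    if (errno == ERANGE || *end != '\0')
      return -1;                        // rejects overflow and "1abc"
    return id;                          // note: still accepts "01" as 1
  }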

  -Joao


* Re: parsing in the ceph osd subsystem
  2012-11-29 19:01   ` Andrey Korolyov
  2012-11-29 19:49     ` Joao Eduardo Luis
@ 2012-11-30  1:04     ` Joao Eduardo Luis
       [not found]       ` <1354237947.86472.YahooMailNeo@web121901.mail.ne1.yahoo.com>
  1 sibling, 1 reply; 24+ messages in thread
From: Joao Eduardo Luis @ 2012-11-30  1:04 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: Sage Weil, ceph-devel

On 11/29/2012 07:01 PM, Andrey Korolyov wrote:
> On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil <sage@inktank.com> wrote:
>> On Thu, 29 Nov 2012, Andrey Korolyov wrote:
>>> $ ceph osd down -
>>> osd.0 is already down
>>> $ ceph osd down ---
>>> osd.0 is already down
>>>
>>> the same for ``+'', ``/'', ``%'' and so - I think that for osd subsys
>>> ceph cli should explicitly work only with positive integers plus zero,
>>> refusing all other input.
>>
>> which branch is this?  this parsing is cleaned up in the latest
>> next/master.
>>
>>
> 
> It was produced by the 0.54 tag. I built
> dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem is
> gone (except that ``-0'' is still parsed as 0, and 00000/0000001 as 0
> and 1 respectively).

A fix for the signed parameter has been pushed to next. However, after
some consideration, the '0+\d' kind of input was deemed valid; Greg put
it best on IRC, and I quote:

<gregaf> joao: not sure we want to prevent "01" from parsing as 1, I
suspect some people with large clusters will find that useful so they
can conflate the name and ID while keeping everything three digits

Hope this makes sense to you.

  -Joao


* What is the new command to add osd to the crushmap to enable it to receive data
       [not found]       ` <1354237947.86472.YahooMailNeo@web121901.mail.ne1.yahoo.com>
@ 2012-11-30  1:22         ` Isaac Otsiabah
  2012-11-30  1:54           ` Joao Eduardo Luis
       [not found]           ` <1357591756.80653.YahooMailNeo@web121903.mail.ne1.yahoo.com>
  0 siblings, 2 replies; 24+ messages in thread
From: Isaac Otsiabah @ 2012-11-30  1:22 UTC (permalink / raw)
  To: ceph-devel

The command below, which adds a new osd to the crushmap to enable it to receive data, has changed and does not work anymore.


ceph osd crush set {id} {name}


Please, what is the new command to add a new osd to the crushmap to enable it to receive data?


Isaac 


* Re: What is the new command to add osd to the crushmap to enable it to receive data
  2012-11-30  1:22         ` What is the new command to add osd to the crushmap to enable it to receive data Isaac Otsiabah
@ 2012-11-30  1:54           ` Joao Eduardo Luis
       [not found]           ` <1357591756.80653.YahooMailNeo@web121903.mail.ne1.yahoo.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Joao Eduardo Luis @ 2012-11-30  1:54 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: ceph-devel

On 11/30/2012 01:22 AM, Isaac Otsiabah wrote:
> The command below, which adds a new osd to the crushmap to enable it to receive data, has changed and does not work anymore.
> 
> 
> ceph osd crush set {id} {name}
> 
> 
> Please, what is the new command to add a new osd to the crushmap to enable it to receive data?

You must specify a weight and a location. For instance,

  ceph osd crush set 0 osd.0 1.0 root=default

Also, you can check the docs for more info.

From the docs at [1]:

Add the OSD to the CRUSH map so that it can begin receiving data. You
may also decompile the CRUSH map, add the OSD to the device list, add
the host as a bucket (if it’s not already in the CRUSH map), add the
device as an item in the host, assign it a weight, recompile it and set
it. See Add/Move an OSD for details.

ceph osd crush set {id} {name} {weight} pool={pool-name}
[{bucket-type}={bucket-name} ...]


[1] http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
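
If you prefer the manual route the docs describe (decompile, edit,
recompile, set), it goes roughly like this -- the paths are just an
example:

  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
  # edit /tmp/crushmap.txt: add the osd to the devices list and to its host bucket
  crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
  ceph osd setcrushmap -i /tmp/crushmap.new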


  -Joao
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* osd down (for 2 about 2 minutes) error after adding a new host  to my cluster
       [not found]           ` <1357591756.80653.YahooMailNeo@web121903.mail.ne1.yahoo.com>
@ 2013-01-07 21:00             ` Isaac Otsiabah
  2013-01-07 21:27               ` Gregory Farnum
  0 siblings, 1 reply; 24+ messages in thread
From: Isaac Otsiabah @ 2013-01-07 21:00 UTC (permalink / raw)
  To: ceph-devel



When I add a new host (with osds) to my existing cluster, 1 or 2 of the existing osds go down for about 2 minutes and then they come back up.


[root@h1ct ~]# ceph osd tree

# id    weight  type name       up/down reweight
-1      3       root default
-3      3               rack unknownrack
-2      3                       host h1
0       1                               osd.0   up      1
1       1                               osd.1   up      1
2       1                               osd.2   up      1


For example, after adding host h2 (with 3 new osd) to the above cluster and running the "ceph osd tree" command, i see this: 


[root@h1 ~]# ceph osd tree

# id    weight  type name       up/down reweight
-1      6       root default
-3      6               rack unknownrack
-2      3                       host h1
0       1                               osd.0   up      1
1       1                               osd.1   down    1
2       1                               osd.2   up      1
-4      3                       host h2
3       1                               osd.3   up      1
4       1                               osd.4   up      1
5       1                               osd.5   up      1


The down osds always come back up after 2 minutes or less, and I see the following error message in the respective osd log file:
2013-01-07 04:40:17.613028 7fec7f092760  1 journal _open /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-07 04:40:17.613122 7fec7f092760  1 journal _open /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-07 04:42:10.006533 7fec746f7710  0 -- 192.168.0.124:6808/19449 >> 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state connecting
2013-01-07 04:45:29.834341 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 l=0).fault, initiating reconnect
2013-01-07 04:45:29.835748 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 l=0).fault, initiating reconnect
2013-01-07 04:45:30.835219 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating reconnect
2013-01-07 04:45:30.837318 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating reconnect
2013-01-07 04:45:30.851984 7fec637fe710  0 log [ERR] : map e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 192.168.1.124:6808/19449)

Also, this only happens when the cluster ip address and the public ip address are different, for example:
....
....
....
[osd.0]
        host = g8ct
        public address = 192.168.0.124
        cluster address = 192.168.1.124
        btrfs devs = /dev/sdb

....
....

but does not happen when they are the same.  Any idea what may be the issue?

Isaac
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-01-07 21:00             ` osd down (for 2 about 2 minutes) error after adding a new host to my cluster Isaac Otsiabah
@ 2013-01-07 21:27               ` Gregory Farnum
       [not found]                 ` <1357680673.72602.YahooMailNeo@web121904.mail.ne1.yahoo.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Farnum @ 2013-01-07 21:27 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: ceph-devel

On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
> 
> 
> When i add a new host (with osd's) to my existing cluster, 1 or 2 previous osd(s) goes down for about 2 minutes and then they come back up. 
> 
> 
> [root@h1ct ~]# ceph osd tree
> 
> # id weight type name up/down reweight
> -1 
> 3 root default
> -3 3 rack unknownrack
> -2 3 host h1
> 0 1 osd.0 up 1
> 1 1 osd.1 up 1
> 2 
> 1 osd.2 up 1
> 
> 
> For example, after adding host h2 (with 3 new osd) to the above cluster and running the "ceph osd tree" command, i see this: 
> 
> 
> [root@h1 ~]# ceph osd tree
> 
> # id weight type name up/down reweight
> -1 6 root default
> -3 
> 6 rack unknownrack
> -2 3 host h1
> 0 1 osd.0 up 1
> 1 1 osd.1 down 1
> 2 
> 1 osd.2 up 1
> -4 3 host h2
> 3 1 osd.3 up 1
> 4 1 osd.4 up 
> 1
> 5 1 osd.5 up 1
> 
> 
> The down osd always come back up after 2 minutes or less andi see the following error message in the respective osd log file: 
> 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open 
> /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 
> 4096 bytes, directio = 1, aio = 0
> 2013-01-07 04:40:17.613122 
> 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 
> 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> 2013-01-07
> 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >> 
> 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0 
> l=0).accept connect_seq 0 vs existing 0 state connecting
> 2013-01-07 
> 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 
> l=0).fault, initiating reconnect
> 2013-01-07 04:45:29.835748 
> 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 
> l=0).fault, initiating reconnect
> 2013-01-07 04:45:30.835219 7fec743f4710 0 -- 
> 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating 
> reconnect
> 2013-01-07 04:45:30.837318 7fec743f4710 0 -- 
> 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating 
> reconnect
> 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map 
> e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 
> 192.168.1.124:6808/19449)
> 
> Also, this only happens only when the cluster ip address and the public ip address are different for example
> ....
> ....
> ....
> [osd.0]
> host = g8ct
> public address = 192.168.0.124
> cluster address = 192.168.1.124
> btrfs devs = /dev/sdb
> 
> ....
> ....
> 
> but does not happen when they are the same. Any idea what may be the issue?
> 
This isn't familiar to me at first glance. What version of Ceph are you using?

If this is easy to reproduce, can you pastebin your ceph.conf and then add "debug ms = 1" to your global config and gather up the logs from each daemon?
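
In ceph.conf terms that just means adding a line like this to the global
section:

  [global]
          debug ms = 1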
-Greg



* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
       [not found]                 ` <1357680673.72602.YahooMailNeo@web121904.mail.ne1.yahoo.com>
@ 2013-01-10 18:32                   ` Gregory Farnum
  2013-01-10 18:45                     ` What is the acceptable attachment file size on the mail server? Isaac Otsiabah
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Farnum @ 2013-01-10 18:32 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: ceph-devel

On Tue, Jan 8, 2013 at 1:31 PM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
>
> Hi Greg, it appears to be a timing issue: with the flag (debug ms=1) turned on, the system ran slower and the failure became harder to reproduce. I ran it several times and finally got it to fail on osd.0 using the default crush map. The attached tar file contains log files for all components on g8ct plus the ceph.conf.
>
> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2)  and then added host g13ct (osd.3, osd.4, osd.5)
>
>
>
>  id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g8ct
> 0       1                               osd.0   down    1
> 1       1                               osd.1   up      1
> 2       1                               osd.2   up      1
> -4      3                       host g13ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
> The error messages are in ceph.log and ceph-osd.0.log:
>
> ceph.log:2013-01-08 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had wrong cluster addr (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710  0 log [ERR] : map e15 had wrong cluster addr (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)

Thanks. I had a brief look through these logs on Tuesday and want to
spend more time with them because they have some odd stuff in them. It
*looks* like the OSD is starting out using a single IP for both the
public and cluster networks and then switching over at some point,
which is...odd.
Knowing more details about how your network is actually set up would
be very helpful.
-Greg


* What is the acceptable attachment file size on the mail server?
  2013-01-10 18:32                   ` Gregory Farnum
@ 2013-01-10 18:45                     ` Isaac Otsiabah
  2013-01-10 18:57                       ` Gregory Farnum
  2013-01-12  1:41                       ` Yan, Zheng 
  0 siblings, 2 replies; 24+ messages in thread
From: Isaac Otsiabah @ 2013-01-10 18:45 UTC (permalink / raw)
  To: ceph-devel


What is the acceptable attachment file size? I have been trying to post a problem with an attachment greater than 1.5 MB and it seems to get lost.

Isaac
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: What is the acceptable attachment file size on the mail server?
  2013-01-10 18:45                     ` What is the acceptable attachment file size on the mail server? Isaac Otsiabah
@ 2013-01-10 18:57                       ` Gregory Farnum
  2013-01-12  1:41                       ` Yan, Zheng 
  1 sibling, 0 replies; 24+ messages in thread
From: Gregory Farnum @ 2013-01-10 18:57 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: ceph-devel

On Thu, Jan 10, 2013 at 10:45 AM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
> What is the acceptable attachment file size? I have been trying to post a problem with an attachment greater than 1.5 MB and it seems to get lost.

That wouldn't be surprising; Sage suggests the FAQ
(http://www.tux.org/lkml/) has an answer somewhere although I couldn't
find it in a quick check.
-Greg


* Re: What is the acceptable attachment file size on the mail server?
  2013-01-10 18:45                     ` What is the acceptable attachment file size on the mail server? Isaac Otsiabah
  2013-01-10 18:57                       ` Gregory Farnum
@ 2013-01-12  1:41                       ` Yan, Zheng 
  1 sibling, 0 replies; 24+ messages in thread
From: Yan, Zheng  @ 2013-01-12  1:41 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: ceph-devel

On Fri, Jan 11, 2013 at 2:45 AM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
> What is the acceptable attachment file size? I have been trying to post a problem with an attachment greater than 1.5 MB and it seems to get lost.

I remember that some time ago LKML did not accept mail larger than
100k, but I don't know if that limit has changed.

Yan, Zheng


* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-02-15 17:20           ` Sam Lang
@ 2013-02-16  2:00             ` Isaac Otsiabah
  0 siblings, 0 replies; 24+ messages in thread
From: Isaac Otsiabah @ 2013-02-16  2:00 UTC (permalink / raw)
  To: Sam Lang; +Cc: Gregory Farnum, ceph-devel



Hello Sam and Gregory, I got machines today and tested it with the monitor process running on a separate system with no osd daemons, and I did not see the problem. On Monday I will do a few tests to confirm.

Isaac



----- Original Message -----
From: Sam Lang <sam.lang@inktank.com>
To: Isaac Otsiabah <zmoo76b@yahoo.com>
Cc: Gregory Farnum <greg@inktank.com>; "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Sent: Friday, February 15, 2013 9:20 AM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

On Mon, Feb 11, 2013 at 7:39 PM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
>
> Yes, there were osd daemons running on the same node that the monitor was
> running on.  If that is the case then i will run a test case with the
> monitor running on a different node where no osd is running and see what happens. Thank you.

Hi Isaac,

Any luck?  Does the problem reproduce with the mon running on a separate host?
-sam

>
> Isaac
>
> ________________________________
> From: Gregory Farnum <greg@inktank.com>
> To: Isaac Otsiabah <zmoo76b@yahoo.com>
> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> Sent: Monday, February 11, 2013 12:29 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>
> Isaac,
> I'm sorry I haven't been able to wrangle any time to look into this
> more yet, but Sage pointed out in a related thread that there might be
> some buggy handling of things like this if the OSD and the monitor are
> located on the same host. Am I correct in assuming that with your
> small cluster, all your OSDs are co-located with a monitor daemon?
> -Greg
>
> On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>>
>>
>> Gregory, I recreated the osd down problem again this morning on two nodes (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute and a half after osd 3, 4, 5 were added. I have included the routing table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log files are attached. The crush map was default. Also, it could be a timing issue because it does not always fail when using the default crush map; it takes several trials before you see it. Thank you.
>>
>>
>> [root@g13ct ~]# netstat -r
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth2
>> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
>> link-local      *               255.255.0.0     U         0 0          0 eth3
>> link-local      *               255.255.0.0     U         0 0          0 eth0
>> link-local      *               255.255.0.0     U         0 0          0 eth2
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
>> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
>> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
>> [root@g13ct ~]# ceph osd tree
>>
>> # id    weight  type name       up/down reweight
>> -1      6       root default
>> -3      6               rack unknownrack
>> -2      3                       host g13ct
>> 0       1                               osd.0   up      1
>> 1       1                               osd.1   down    1
>> 2       1                               osd.2   up      1
>> -4      3                       host g14ct
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>>
>>
>>
>> [root@g14ct ~]# ceph osd tree
>>
>> # id    weight  type name       up/down reweight
>> -1      6       root default
>> -3      6               rack unknownrack
>> -2      3                       host g13ct
>> 0       1                               osd.0   up      1
>> 1       1                               osd.1   down    1
>> 2       1                               osd.2   up      1
>> -4      3                       host g14ct
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>>
>> [root@g14ct ~]# netstat -r
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth0
>> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
>> link-local      *               255.255.0.0     U         0 0          0 eth3
>> link-local      *               255.255.0.0     U         0 0          0 eth5
>> link-local      *               255.255.0.0     U         0 0          0 eth0
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
>> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
>> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth5
>> [root@g14ct ~]# ceph osd tree
>>
>> # id    weight  type name       up/down reweight
>> -1      6       root default
>> -3      6               rack unknownrack
>> -2      3                       host g13ct
>> 0       1                               osd.0   up      1
>> 1       1                               osd.1   down    1
>> 2       1                               osd.2   up      1
>> -4      3                       host g14ct
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>>
>>
>>
>>
>>
>> Isaac
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Isaac Otsiabah <zmoo76b@yahoo.com>
>> To: Gregory Farnum <greg@inktank.com>
>> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
>> Sent: Friday, January 25, 2013 9:51 AM
>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>
>>
>>
>> Gregory, the network physical layout is simple, the two networks are
>> separate. the 192.168.0 and the 192.168.1 are not subnets within a
>> network.
>>
>> Isaac
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Gregory Farnum <greg@inktank.com>
>> To: Isaac Otsiabah <zmoo76b@yahoo.com>
>> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
>> Sent: Thursday, January 24, 2013 1:28 PM
>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>
>> What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :)
>> -Greg
>>
>>
>> On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:
>>
>>>
>>>
>>> Gregory, I tried to send the attached debug output several times and
>>> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You will see the
>>> reconnection failures by the error message line below. The ceph version
>>> is 0.56.
>>>
>>>
>>> it appears to be a timing issue because with the flag (debug ms=1) turned on, the system ran slower and became harder to fail.
>>> I
>>> ran it several times and finally got it to fail on (osd.0) using
>>> default crush map. The attached tar file contains log files for all
>>> components on g8ct plus the ceph.conf. By the way, the log file contain only the last 1384 lines where the error occurs.
>>>
>>>
>>> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
>>>
>>>
>>> id weight type name up/down reweight
>>> -1 6 root default
>>> -3 6 rack unknownrack
>>> -2 3 host g8ct
>>> 0 1 osd.0 down 1
>>> 1 1 osd.1 up 1
>>> 2 1 osd.2 up 1
>>> -4 3 host g13ct
>>> 3 1 osd.3 up 1
>>> 4 1 osd.4 up 1
>>> 5 1 osd.5 up 1
>>>
>>>
>>>
>>> The error messages are in ceph.log and ceph-osd.0.log:
>>>
>>> ceph.log:2013-01-08
>>> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had
>>> wrong cluster addr (192.168.0.124:6802/25571 != my
>>> 192.168.1.124:6802/25571)
>>> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr
>>> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
>>>
>>>
>>>
>>> [root@g8ct ceph]# ceph -v
>>> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>>
>>>
>>> Isaac
>>>
>>>
>>> ----- Original Message -----
>>> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
>>> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
>>> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
>>> Sent: Monday, January 7, 2013 1:27 PM
>>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>>
>>> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
>>>
>>>
>>> When i add a new host (with osd's) to my existing cluster, 1 or 2
>>> previous osd(s) goes down for about 2 minutes and then they come back
>>> up.
>>> >
>>> >
>>> > [root@h1ct ~]# ceph osd tree
>>> >
>>> > # id weight type name up/down reweight
>>> > -1
>>> > 3 root default
>>> > -3 3 rack unknownrack
>>> > -2 3 host h1
>>> > 0 1 osd.0 up 1
>>> > 1 1 osd.1 up 1
>>> > 2
>>> > 1 osd.2 up 1
>>>
>>>
>>> For example, after adding host h2 (with 3 new osd) to the above cluster
>>> and running the "ceph osd tree" command, i see this:
>>> >
>>> >
>>> > [root@h1 ~]# ceph osd tree
>>> >
>>> > # id weight type name up/down reweight
>>> > -1 6 root default
>>> > -3
>>> > 6 rack unknownrack
>>> > -2 3 host h1
>>> > 0 1 osd.0 up 1
>>> > 1 1 osd.1 down 1
>>> > 2
>>> > 1 osd.2 up 1
>>> > -4 3 host h2
>>> > 3 1 osd.3 up 1
>>> > 4 1 osd.4 up
>>> > 1
>>> > 5 1 osd.5 up 1
>>>
>>>
>>> The down osd always come back up after 2 minutes or less andi see the
>>> following error message in the respective osd log file:
>>> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open
>>> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size
>>> > 4096 bytes, directio = 1, aio = 0
>>> > 2013-01-07 04:40:17.613122
>>> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26:
>>> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
>>> > 2013-01-07
>>> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >>
>>> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0
>>> > l=0).accept connect_seq 0 vs existing 0 state connecting
>>> > 2013-01-07
>>> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1
>>> > l=0).fault, initiating reconnect
>>> > 2013-01-07 04:45:29.835748
>>> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3
>>> > l=0).fault, initiating reconnect
>>> > 2013-01-07 04:45:30.835219 7fec743f4710 0 --
>>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>>> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating
>>> > reconnect
>>> > 2013-01-07 04:45:30.837318 7fec743f4710 0 --
>>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>>> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating
>>> > reconnect
>>> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map
>>> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my
>>> > 192.168.1.124:6808/19449)
>>> >
>>> > Also, this only happens only when the cluster ip address and the public ip address are different for example
>>> > ....
>>> > ....
>>> > ....
>>> > [osd.0]
>>> > host = g8ct
>>> > public address = 192.168.0.124
>>> > cluster address = 192.168.1.124
>>> > btrfs devs = /dev/sdb
>>> >
>>> > ....
>>> > ....
>>> >
>>> > but does not happen when they are the same. Any idea what may be the issue?
>>> This isn't familiar to me at first glance. What version of Ceph are you using?
>>>
>>> If
>>> this is easy to reproduce, can you pastebin your ceph.conf and then add
>>> "debug ms = 1" to your global config and gather up the logs from each
>>> daemon?
>>> -Greg
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>>> More majordomo info at http://vger.kernel.org/majordomo
>>>
>>>
>>> Attachments:
>>> - ceph-osd.0.log.tar.gz
>>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-02-12  1:39         ` Isaac Otsiabah
@ 2013-02-15 17:20           ` Sam Lang
  2013-02-16  2:00             ` Isaac Otsiabah
  0 siblings, 1 reply; 24+ messages in thread
From: Sam Lang @ 2013-02-15 17:20 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: Gregory Farnum, ceph-devel

On Mon, Feb 11, 2013 at 7:39 PM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
>
> Yes, there were osd daemons running on the same node that the monitor was
> running on.  If that is the case then i will run a test case with the
> monitor running on a different node where no osd is running and see what happens. Thank you.

Hi Isaac,

Any luck?  Does the problem reproduce with the mon running on a separate host?
-sam

>
> Isaac
>
> ________________________________
> From: Gregory Farnum <greg@inktank.com>
> To: Isaac Otsiabah <zmoo76b@yahoo.com>
> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> Sent: Monday, February 11, 2013 12:29 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>
> Isaac,
> I'm sorry I haven't been able to wrangle any time to look into this
> more yet, but Sage pointed out in a related thread that there might be
> some buggy handling of things like this if the OSD and the monitor are
> located on the same host. Am I correct in assuming that with your
> small cluster, all your OSDs are co-located with a monitor daemon?
> -Greg
>
> On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>>
>>
>> Gregory, I recreated the osd down problem again this morning on two nodes (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute and a half after osd 3, 4, 5 were added. I have included the routing table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log files are attached. The crush map was default. Also, it could be a timing issue because it does not always fail when using the default crush map; it takes several trials before you see it. Thank you.
>>
>>
>> [root@g13ct ~]# netstat -r
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth2
>> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
>> link-local      *               255.255.0.0     U         0 0          0 eth3
>> link-local      *               255.255.0.0     U         0 0          0 eth0
>> link-local      *               255.255.0.0     U         0 0          0 eth2
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
>> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
>> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
>> [root@g13ct ~]# ceph osd tree
>>
>> # id    weight  type name       up/down reweight
>> -1      6       root default
>> -3      6               rack unknownrack
>> -2      3                       host g13ct
>> 0       1                               osd.0   up      1
>> 1       1                               osd.1   down    1
>> 2       1                               osd.2   up      1
>> -4      3                       host g14ct
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>>
>>
>>
>> [root@g14ct ~]# ceph osd tree
>>
>> # id    weight  type name       up/down reweight
>> -1      6       root default
>> -3      6               rack unknownrack
>> -2      3                       host g13ct
>> 0       1                               osd.0   up      1
>> 1       1                               osd.1   down    1
>> 2       1                               osd.2   up      1
>> -4      3                       host g14ct
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>>
>> [root@g14ct ~]# netstat -r
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
>> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth0
>> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
>> link-local      *               255.255.0.0     U         0 0          0 eth3
>> link-local      *               255.255.0.0     U         0 0          0 eth5
>> link-local      *               255.255.0.0     U         0 0          0 eth0
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
>> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
>> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
>> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth5
>> [root@g14ct ~]# ceph osd tree
>>
>> # id    weight  type name       up/down reweight
>> -1      6       root default
>> -3      6               rack unknownrack
>> -2      3                       host g13ct
>> 0       1                               osd.0   up      1
>> 1       1                               osd.1   down    1
>> 2       1                               osd.2   up      1
>> -4      3                       host g14ct
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>>
>>
>>
>>
>>
>> Isaac
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Isaac Otsiabah <zmoo76b@yahoo.com>
>> To: Gregory Farnum <greg@inktank.com>
>> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
>> Sent: Friday, January 25, 2013 9:51 AM
>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>
>>
>>
>> Gregory, the network physical layout is simple, the two networks are
>> separate. the 192.168.0 and the 192.168.1 are not subnets within a
>> network.
>>
>> Isaac
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Gregory Farnum <greg@inktank.com>
>> To: Isaac Otsiabah <zmoo76b@yahoo.com>
>> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
>> Sent: Thursday, January 24, 2013 1:28 PM
>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>
>> What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :)
>> -Greg
>>
>>
>> On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:
>>
>>>
>>>
>>> Gregory, I tried to send the attached debug output several times and
>>> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You will see the
>>> reconnection failures by the error message line below. The ceph version
>>> is 0.56.
>>>
>>>
>>> it appears to be a timing issue because with the flag (debug ms=1) turned on, the system ran slower and became harder to fail.
>>> I
>>> ran it several times and finally got it to fail on (osd.0) using
>>> default crush map. The attached tar file contains log files for all
>>> components on g8ct plus the ceph.conf. By the way, the log file contain only the last 1384 lines where the error occurs.
>>>
>>>
>>> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
>>>
>>>
>>> id weight type name up/down reweight
>>> -1 6 root default
>>> -3 6 rack unknownrack
>>> -2 3 host g8ct
>>> 0 1 osd.0 down 1
>>> 1 1 osd.1 up 1
>>> 2 1 osd.2 up 1
>>> -4 3 host g13ct
>>> 3 1 osd.3 up 1
>>> 4 1 osd.4 up 1
>>> 5 1 osd.5 up 1
>>>
>>>
>>>
>>> The error messages are in ceph.log and ceph-osd.0.log:
>>>
>>> ceph.log:2013-01-08
>>> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had
>>> wrong cluster addr (192.168.0.124:6802/25571 != my
>>> 192.168.1.124:6802/25571)
>>> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr
>>> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
>>>
>>>
>>>
>>> [root@g8ct ceph]# ceph -v
>>> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>>
>>>
>>> Isaac
>>>
>>>
>>> ----- Original Message -----
>>> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
>>> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
>>> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
>>> Sent: Monday, January 7, 2013 1:27 PM
>>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>>
>>> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
>>>
>>>
>>> When I add a new host (with osds) to my existing cluster, 1 or 2
>>> previous osds go down for about 2 minutes and then they come back
>>> up.
>>> >
>>> >
>>> > [root@h1ct ~]# ceph osd tree
>>> >
>>> > # id    weight  type name       up/down reweight
>>> > -1      3       root default
>>> > -3      3               rack unknownrack
>>> > -2      3                       host h1
>>> > 0       1                               osd.0   up      1
>>> > 1       1                               osd.1   up      1
>>> > 2       1                               osd.2   up      1
>>>
>>>
>>> For example, after adding host h2 (with 3 new osds) to the above cluster
>>> and running the "ceph osd tree" command, I see this:
>>> >
>>> >
>>> > [root@h1 ~]# ceph osd tree
>>> >
>>> > # id    weight  type name       up/down reweight
>>> > -1      6       root default
>>> > -3      6               rack unknownrack
>>> > -2      3                       host h1
>>> > 0       1                               osd.0   up      1
>>> > 1       1                               osd.1   down    1
>>> > 2       1                               osd.2   up      1
>>> > -4      3                       host h2
>>> > 3       1                               osd.3   up      1
>>> > 4       1                               osd.4   up      1
>>> > 5       1                               osd.5   up      1
>>>
>>>
>>> The down osds always come back up after 2 minutes or less, and I see the
>>> following error message in the respective osd log file:
>>> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open
>>> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size
>>> > 4096 bytes, directio = 1, aio = 0
>>> > 2013-01-07 04:40:17.613122
>>> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26:
>>> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
>>> > 2013-01-07
>>> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >>
>>> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0
>>> > l=0).accept connect_seq 0 vs existing 0 state connecting
>>> > 2013-01-07
>>> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1
>>> > l=0).fault, initiating reconnect
>>> > 2013-01-07 04:45:29.835748
>>> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3
>>> > l=0).fault, initiating reconnect
>>> > 2013-01-07 04:45:30.835219 7fec743f4710 0 --
>>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>>> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating
>>> > reconnect
>>> > 2013-01-07 04:45:30.837318 7fec743f4710 0 --
>>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>>> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating
>>> > reconnect
>>> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map
>>> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my
>>> > 192.168.1.124:6808/19449)
>>> >
>>> > Also, this only happens when the cluster ip address and the public ip address are different, for example
>>> > ....
>>> > ....
>>> > ....
>>> > [osd.0]
>>> > host = g8ct
>>> > public address = 192.168.0.124
>>> > cluster address = 192.168.1.124
>>> > btrfs devs = /dev/sdb
>>> >
>>> > ....
>>> > ....
>>> >
>>> > but does not happen when they are the same. Any idea what may be the issue?
>>> This isn't familiar to me at first glance. What version of Ceph are you using?
>>>
>>> If
>>> this is easy to reproduce, can you pastebin your ceph.conf and then add
>>> "debug ms = 1" to your global config and gather up the logs from each
>>> daemon?
>>> -Greg
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>>> More majordomo info at http://vger.kernel.org/majordomo
>>>
>>>
>>> Attachments:
>>> - ceph-osd.0.log.tar.gz
>>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-02-11 20:29       ` Gregory Farnum
@ 2013-02-12  1:39         ` Isaac Otsiabah
  2013-02-15 17:20           ` Sam Lang
  0 siblings, 1 reply; 24+ messages in thread
From: Isaac Otsiabah @ 2013-02-12  1:39 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel



Yes, there were osd daemons running on the same node that the monitor was
running on. If that is the case, then I will run a test with the monitor
running on a different node where no osd is running and see what happens
(a rough sketch of such a layout is below). Thank you.
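
For reference, a minimal sketch of how such a test layout could be written in
ceph.conf, with the monitor on a host that runs no osd and the two networks
declared once in [global] instead of per-osd public/cluster address lines. The
hostname monhost, the section name mon.a and the monitor address are made-up
placeholders; the /24 masks just mirror the routing tables earlier in the
thread:

[global]
        ; declare the two networks once; each daemon then binds to an
        ; address inside the matching network on its host
        public network  = 192.168.0.0/24
        cluster network = 192.168.1.0/24
        debug ms = 1            ; keep the verbose messenger log for the failing run

[mon.a]
        host = monhost          ; hypothetical node that runs no osd daemon
        mon addr = 192.168.0.200:6789

[osd.0]
        host = g13ct
        ; no per-osd public/cluster address lines are needed when the
        ; networks are declared in [global]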

Isaac

________________________________
From: Gregory Farnum <greg@inktank.com>
To: Isaac Otsiabah <zmoo76b@yahoo.com> 
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org> 
Sent: Monday, February 11, 2013 12:29 PM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

Isaac,
I'm sorry I haven't been able to wrangle any time to look into this
more yet, but Sage pointed out in a related thread that there might be
some buggy handling of things like this if the OSD and the monitor are
located on the same host. Am I correct in assuming that with your
small cluster, all your OSDs are co-located with a monitor daemon?
-Greg

On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
>
> Gregory, I recreated the osd down problem again this morning on two nodes (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute and a half after osd.3, 4, 5 were added. I have included the routing table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log files are attached. The crush map was default. Also, it could be a timing issue because it does not always fail with the default crush map; it takes several trials before you see it. Thank you.
>
>
> [root@g13ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth2
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth2
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
> [root@g13ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
> [root@g14ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth0
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth5
> link-local      *               255.255.0.0     U         0 0          0 eth0
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth5
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
>
>
> Isaac
>
>
>
>
>
>
>
>
>
>
> ----- Original Message -----
> From: Isaac Otsiabah <zmoo76b@yahoo.com>
> To: Gregory Farnum <greg@inktank.com>
> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> Sent: Friday, January 25, 2013 9:51 AM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>
>
>
> Gregory, the network physical layout is simple; the two networks are
> separate. The 192.168.0 and the 192.168.1 networks are not subnets within a
> single network.
>
> Isaac
>
>
>
>
> ----- Original Message -----
> From: Gregory Farnum <greg@inktank.com>
> To: Isaac Otsiabah <zmoo76b@yahoo.com>
> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> Sent: Thursday, January 24, 2013 1:28 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>
> What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :)
> -Greg
>
>
> On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:
>
>>
>>
>> Gregory, I tried to send the attached debug output several times and
>> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You will see the
>> reconnection failures by the error message line below. The ceph version
>> is 0.56.
>>
>>
>> It appears to be a timing issue: with the flag (debug ms=1) turned on, the system ran slower and became harder to fail.
>> I ran it several times and finally got it to fail (on osd.0) using the
>> default crush map. The attached tar file contains log files for all
>> components on g8ct plus the ceph.conf. By the way, the log file contains
>> only the last 1384 lines, where the error occurs.
>>
>>
>> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
>>
>>
>> id      weight  type name       up/down reweight
>> -1      6       root default
>> -3      6               rack unknownrack
>> -2      3                       host g8ct
>> 0       1                               osd.0   down    1
>> 1       1                               osd.1   up      1
>> 2       1                               osd.2   up      1
>> -4      3                       host g13ct
>> 3       1                               osd.3   up      1
>> 4       1                               osd.4   up      1
>> 5       1                               osd.5   up      1
>>
>>
>>
>> The error messages are in ceph.log and ceph-osd.0.log:
>>
>> ceph.log:2013-01-08
>> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had
>> wrong cluster addr (192.168.0.124:6802/25571 != my
>> 192.168.1.124:6802/25571)
>> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr
>> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
>>
>>
>>
>> [root@g8ct ceph]# ceph -v
>> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>
>>
>> Isaac
>>
>>
>> ----- Original Message -----
>> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
>> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
>> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
>> Sent: Monday, January 7, 2013 1:27 PM
>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>
>> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
>>
>>
>> When I add a new host (with osds) to my existing cluster, 1 or 2
>> previous osds go down for about 2 minutes and then they come back
>> up.
>> >
>> >
>> > [root@h1ct ~]# ceph osd tree
>> >
>> > # id    weight  type name       up/down reweight
>> > -1      3       root default
>> > -3      3               rack unknownrack
>> > -2      3                       host h1
>> > 0       1                               osd.0   up      1
>> > 1       1                               osd.1   up      1
>> > 2       1                               osd.2   up      1
>>
>>
>> For example, after adding host h2 (with 3 new osds) to the above cluster
>> and running the "ceph osd tree" command, I see this:
>> >
>> >
>> > [root@h1 ~]# ceph osd tree
>> >
>> > # id    weight  type name       up/down reweight
>> > -1      6       root default
>> > -3      6               rack unknownrack
>> > -2      3                       host h1
>> > 0       1                               osd.0   up      1
>> > 1       1                               osd.1   down    1
>> > 2       1                               osd.2   up      1
>> > -4      3                       host h2
>> > 3       1                               osd.3   up      1
>> > 4       1                               osd.4   up      1
>> > 5       1                               osd.5   up      1
>>
>>
>> The down osds always come back up after 2 minutes or less, and I see the
>> following error message in the respective osd log file:
>> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open
>> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size
>> > 4096 bytes, directio = 1, aio = 0
>> > 2013-01-07 04:40:17.613122
>> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26:
>> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
>> > 2013-01-07
>> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >>
>> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0
>> > l=0).accept connect_seq 0 vs existing 0 state connecting
>> > 2013-01-07
>> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1
>> > l=0).fault, initiating reconnect
>> > 2013-01-07 04:45:29.835748
>> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3
>> > l=0).fault, initiating reconnect
>> > 2013-01-07 04:45:30.835219 7fec743f4710 0 --
>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating
>> > reconnect
>> > 2013-01-07 04:45:30.837318 7fec743f4710 0 --
>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating
>> > reconnect
>> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map
>> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my
>> > 192.168.1.124:6808/19449)
>> >
>> > Also, this only happens when the cluster ip address and the public ip address are different, for example
>> > ....
>> > ....
>> > ....
>> > [osd.0]
>> > host = g8ct
>> > public address = 192.168.0.124
>> > cluster address = 192.168.1.124
>> > btrfs devs = /dev/sdb
>> >
>> > ....
>> > ....
>> >
>> > but does not happen when they are the same. Any idea what may be the issue?
>> This isn't familiar to me at first glance. What version of Ceph are you using?
>>
>> If
>> this is easy to reproduce, can you pastebin your ceph.conf and then add
>> "debug ms = 1" to your global config and gather up the logs from each
>> daemon?
>> -Greg
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>> More majordomo info at http://vger.kernel.org/majordomo
>>
>>
>> Attachments:
>> - ceph-osd.0.log.tar.gz
>>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-01-28 20:17     ` Isaac Otsiabah
@ 2013-02-11 20:29       ` Gregory Farnum
  2013-02-12  1:39         ` Isaac Otsiabah
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Farnum @ 2013-02-11 20:29 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: ceph-devel

Isaac,
I'm sorry I haven't been able to wrangle any time to look into this
more yet, but Sage pointed out in a related thread that there might be
some buggy handling of things like this if the OSD and the monitor are
located on the same host. Am I correct in assuming that with your
small cluster, all your OSDs are co-located with a monitor daemon?
-Greg
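
For anyone reproducing this, a quick generic way to confirm whether a monitor
and osds are co-located is just to list the running ceph daemons on each node;
the command below is only a sketch, nothing specific to this cluster:

    # run on every node; prints any ceph-mon / ceph-osd (or ceph-mds) processes,
    # and the bracket in the pattern keeps grep from matching its own command line
    $ ps ax | grep 'ceph-[mo]'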

On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
>
> Gregory, I recreated the osd down problem again this morning on two nodes (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute and a half after osd.3, 4, 5 were added. I have included the routing table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log files are attached. The crush map was default. Also, it could be a timing issue because it does not always fail with the default crush map; it takes several trials before you see it. Thank you.
>
>
> [root@g13ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth2
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth2
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
> [root@g13ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
> [root@g14ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth0
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth5
> link-local      *               255.255.0.0     U         0 0          0 eth0
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth5
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
>
>
> Isaac
>
>
>
>
>
>
>
>
>
>
> ----- Original Message -----
> From: Isaac Otsiabah <zmoo76b@yahoo.com>
> To: Gregory Farnum <greg@inktank.com>
> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> Sent: Friday, January 25, 2013 9:51 AM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>
>
>
> Gregory, the network physical layout is simple; the two networks are
> separate. The 192.168.0 and the 192.168.1 networks are not subnets within a
> single network.
>
> Isaac
>
>
>
>
> ----- Original Message -----
> From: Gregory Farnum <greg@inktank.com>
> To: Isaac Otsiabah <zmoo76b@yahoo.com>
> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> Sent: Thursday, January 24, 2013 1:28 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>
> What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :)
> -Greg
>
>
> On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:
>
>>
>>
>> Gregory, I tried to send the attached debug output several times and
>> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You will see the
>> reconnection failures by the error message line below. The ceph version
>> is 0.56.
>>
>>
>> It appears to be a timing issue: with the flag (debug ms=1) turned on, the system ran slower and became harder to fail.
>> I ran it several times and finally got it to fail (on osd.0) using the
>> default crush map. The attached tar file contains log files for all
>> components on g8ct plus the ceph.conf. By the way, the log file contains
>> only the last 1384 lines, where the error occurs.
>>
>>
>> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
>>
>>
>> id weight type name up/down reweight
>> -1 6 root default
>> -3 6 rack unknownrack
>> -2 3 host g8ct
>> 0 1 osd.0 down 1
>> 1 1 osd.1 up 1
>> 2 1 osd.2 up 1
>> -4 3 host g13ct
>> 3 1 osd.3 up 1
>> 4 1 osd.4 up 1
>> 5 1 osd.5 up 1
>>
>>
>>
>> The error messages are in ceph.log and ceph-osd.0.log:
>>
>> ceph.log:2013-01-08
>> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had
>> wrong cluster addr (192.168.0.124:6802/25571 != my
>> 192.168.1.124:6802/25571)
>> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr
>> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
>>
>>
>>
>> [root@g8ct ceph]# ceph -v
>> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>
>>
>> Isaac
>>
>>
>> ----- Original Message -----
>> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
>> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
>> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
>> Sent: Monday, January 7, 2013 1:27 PM
>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>
>> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
>>
>>
>> When I add a new host (with osds) to my existing cluster, 1 or 2 of the
>> existing osds go down for about 2 minutes and then they come back
>> up.
>> >
>> >
>> > [root@h1ct ~]# ceph osd tree
>> >
>> > # id weight type name up/down reweight
>> > -1
>> > 3 root default
>> > -3 3 rack unknownrack
>> > -2 3 host h1
>> > 0 1 osd.0 up 1
>> > 1 1 osd.1 up 1
>> > 2
>> > 1 osd.2 up 1
>>
>>
>> For example, after adding host h2 (with 3 new osds) to the above cluster
>> and running the "ceph osd tree" command, I see this:
>> >
>> >
>> > [root@h1 ~]# ceph osd tree
>> >
>> > # id weight type name up/down reweight
>> > -1 6 root default
>> > -3
>> > 6 rack unknownrack
>> > -2 3 host h1
>> > 0 1 osd.0 up 1
>> > 1 1 osd.1 down 1
>> > 2
>> > 1 osd.2 up 1
>> > -4 3 host h2
>> > 3 1 osd.3 up 1
>> > 4 1 osd.4 up
>> > 1
>> > 5 1 osd.5 up 1
>>
>>
>> The down osd always comes back up after 2 minutes or less, and I see the
>> following error message in the respective osd log file:
>> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open
>> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size
>> > 4096 bytes, directio = 1, aio = 0
>> > 2013-01-07 04:40:17.613122
>> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26:
>> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
>> > 2013-01-07
>> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >>
>> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0
>> > l=0).accept connect_seq 0 vs existing 0 state connecting
>> > 2013-01-07
>> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1
>> > l=0).fault, initiating reconnect
>> > 2013-01-07 04:45:29.835748
>> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3
>> > l=0).fault, initiating reconnect
>> > 2013-01-07 04:45:30.835219 7fec743f4710 0 --
>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating
>> > reconnect
>> > 2013-01-07 04:45:30.837318 7fec743f4710 0 --
>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating
>> > reconnect
>> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map
>> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my
>> > 192.168.1.124:6808/19449)
>> >
>> > Also, this only happens when the cluster ip address and the public ip address are different, for example:
>> > ....
>> > ....
>> > ....
>> > [osd.0]
>> > host = g8ct
>> > public address = 192.168.0.124
>> > cluster address = 192.168.1.124
>> > btrfs devs = /dev/sdb
>> >
>> > ....
>> > ....
>> >
>> > but does not happen when they are the same. Any idea what may be the issue?
>> This isn't familiar to me at first glance. What version of Ceph are you using?
>>
>> If
>> this is easy to reproduce, can you pastebin your ceph.conf and then add
>> "debug ms = 1" to your global config and gather up the logs from each
>> daemon?
>> -Greg
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>> More majordomo info at http://vger.kernel.org/majordomo
>>
>>
>> Attachments:
>> - ceph-osd.0.log.tar.gz
>>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-01-25 17:51   ` Isaac Otsiabah
  2013-01-25 23:46     ` Sam Lang
@ 2013-01-28 20:17     ` Isaac Otsiabah
  2013-02-11 20:29       ` Gregory Farnum
  1 sibling, 1 reply; 24+ messages in thread
From: Isaac Otsiabah @ 2013-01-28 20:17 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 10982 bytes --]



Gregory, I recreated the osd down problem again this morning on two nodes (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2) and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute and a half after osd.3, 4 and 5 were added. I have included the routing table of each node at the time osd.1 went down; the ceph.conf and ceph-osd.1.log files are attached. The crush map was the default. It could also be a timing issue, because it does not always fail when using the default crush map; it takes several trials before you see it. Thank you.
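
A simple way to catch the moment an osd drops while the new host is being added is to poll "ceph osd tree" in a loop. This is only a sketch, not what was actually run for this report; the one-second interval is arbitrary and the awk field test assumes the tree layout shown below:

while sleep 1; do
    # print a timestamp plus any tree line whose status column reads "down"
    ceph osd tree 2>/dev/null | awk -v t="$(date +%T)" '$4 == "down" { print t, $0 }'
done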


[root@g13ct ~]# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
default         133.164.98.250 0.0.0.0         UG        0 0          0 eth2
133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
link-local      *               255.255.0.0     U         0 0          0 eth3
link-local      *               255.255.0.0     U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth2
192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
[root@g13ct ~]# ceph osd tree

# id    weight  type name       up/down reweight
-1      6       root default
-3      6               rack unknownrack
-2      3                       host g13ct
0       1                               osd.0   up      1
1       1                               osd.1   down    1
2       1                               osd.2   up      1
-4      3                       host g14ct
3       1                               osd.3   up      1
4       1                               osd.4   up      1
5       1                               osd.5   up      1



[root@g14ct ~]# ceph osd tree

# id    weight  type name       up/down reweight
-1      6       root default
-3      6               rack unknownrack
-2      3                       host g13ct
0       1                               osd.0   up      1
1       1                               osd.1   down    1
2       1                               osd.2   up      1
-4      3                       host g14ct
3       1                               osd.3   up      1
4       1                               osd.4   up      1
5       1                               osd.5   up      1

[root@g14ct ~]# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
default         133.164.98.250 0.0.0.0         UG        0 0          0 eth0
133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth3
link-local      *               255.255.0.0     U         0 0          0 eth5
link-local      *               255.255.0.0     U         0 0          0 eth0
192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
192.168.1.0     *               255.255.255.0   U         0 0          0 eth5
[root@g14ct ~]# ceph osd tree

# id    weight  type name       up/down reweight
-1      6       root default
-3      6               rack unknownrack
-2      3                       host g13ct
0       1                               osd.0   up      1
1       1                               osd.1   down    1
2       1                               osd.2   up      1
-4      3                       host g14ct
3       1                               osd.3   up      1
4       1                               osd.4   up      1
5       1                               osd.5   up      1
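
Given two routing tables like the ones above, a quick sanity check is to ask the kernel which interface and source address it would pick for a peer's public and cluster addresses. A sketch only; the peer addresses are borrowed from the log excerpts quoted later in this thread and may not match this exact run:

# run on g14ct: which device/source address is chosen for each network?
ip route get 192.168.0.124
ip route get 192.168.1.124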





Isaac










----- Original Message -----
From: Isaac Otsiabah <zmoo76b@yahoo.com>
To: Gregory Farnum <greg@inktank.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Sent: Friday, January 25, 2013 9:51 AM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster



Gregory, the network physical layout is simple; the two networks are
separate. The 192.168.0 and 192.168.1 networks are not subnets within a
single network.

Isaac  




----- Original Message -----
From: Gregory Farnum <greg@inktank.com>
To: Isaac Otsiabah <zmoo76b@yahoo.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Sent: Thursday, January 24, 2013 1:28 PM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :) 
-Greg


On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:

> 
> 
> Gregory, I tried to send the attached debug output several times and
> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You can see the
> reconnection failures from the error message line below. The ceph version
> is 0.56.
> 
> 
> It appears to be a timing issue because with the flag (debug ms = 1) turned on, the system ran slower and the failure became harder to reproduce.
> I ran it several times and finally got it to fail on osd.0 using the
> default crush map. The attached tar file contains log files for all
> components on g8ct plus the ceph.conf. By the way, the log file contains only the last 1384 lines, where the error occurs.
> 
> 
> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
> 
> 
> id weight type name up/down reweight
> -1 6 root default
> -3 6 rack unknownrack
> -2 3 host g8ct
> 0 1 osd.0 down 1
> 1 1 osd.1 up 1
> 2 1 osd.2 up 1
> -4 3 host g13ct
> 3 1 osd.3 up 1
> 4 1 osd.4 up 1
> 5 1 osd.5 up 1
> 
> 
> 
> The error messages are in ceph.log and ceph-osd.0.log:
> 
> ceph.log:2013-01-08
> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had 
> wrong cluster addr (192.168.0.124:6802/25571 != my 
> 192.168.1.124:6802/25571)
> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr 
> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
> 
> 
> 
> [root@g8ct ceph]# ceph -v
> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
> 
> 
> Isaac
> 
> 
> ----- Original Message -----
> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
> Sent: Monday, January 7, 2013 1:27 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
> 
> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
> 
> 
> When I add a new host (with osds) to my existing cluster, 1 or 2 of the
> existing osds go down for about 2 minutes and then they come back
> up.
> > 
> > 
> > [root@h1ct ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 
> > 3 root default
> > -3 3 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 up 1
> > 2 
> > 1 osd.2 up 1
> 
> 
> For example, after adding host h2 (with 3 new osds) to the above cluster
> and running the "ceph osd tree" command, I see this:
> > 
> > 
> > [root@h1 ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 6 root default
> > -3 
> > 6 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 down 1
> > 2 
> > 1 osd.2 up 1
> > -4 3 host h2
> > 3 1 osd.3 up 1
> > 4 1 osd.4 up 
> > 1
> > 5 1 osd.5 up 1
> 
> 
> The down osd always comes back up after 2 minutes or less, and I see the
> following error message in the respective osd log file:
> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open 
> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 
> > 4096 bytes, directio = 1, aio = 0
> > 2013-01-07 04:40:17.613122 
> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 
> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> > 2013-01-07
> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >> 
> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0 
> > l=0).accept connect_seq 0 vs existing 0 state connecting
> > 2013-01-07 
> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:29.835748 
> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:30.835219 7fec743f4710 0 -- 
> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating 
> > reconnect
> > 2013-01-07 04:45:30.837318 7fec743f4710 0 -- 
> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating 
> > reconnect
> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map 
> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 
> > 192.168.1.124:6808/19449)
> > 
> > Also, this only happens when the cluster ip address and the public ip address are different, for example:
> > ....
> > ....
> > ....
> > [osd.0]
> > host = g8ct
> > public address = 192.168.0.124
> > cluster address = 192.168.1.124
> > btrfs devs = /dev/sdb
> > 
> > ....
> > ....
> > 
> > but does not happen when they are the same. Any idea what may be the issue?
> This isn't familiar to me at first glance. What version of Ceph are you using?
> 
> If
> this is easy to reproduce, can you pastebin your ceph.conf and then add
> "debug ms = 1" to your global config and gather up the logs from each 
> daemon?
> -Greg
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> More majordomo info at http://vger.kernel.org/majordomo
> 
> 
> Attachments: 
> - ceph-osd.0.log.tar.gz
> 



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: osd.1.tar.gz --]
[-- Type: application/x-gzip, Size: 11360 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-01-25 17:51   ` Isaac Otsiabah
@ 2013-01-25 23:46     ` Sam Lang
  2013-01-28 20:17     ` Isaac Otsiabah
  1 sibling, 0 replies; 24+ messages in thread
From: Sam Lang @ 2013-01-25 23:46 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: Gregory Farnum, ceph-devel

On Fri, Jan 25, 2013 at 11:51 AM, Isaac Otsiabah <zmoo76b@yahoo.com> wrote:
>
>
> Gregory, the network physical layout is simple; the two networks are
> separate. The 192.168.0 and 192.168.1 networks are not subnets within a
> single network.

Hi Isaac,

Could you send us your routing tables on the osds (route -n).  That's
one more bit of information that might be useful for tracking this
down.
Thanks,
-sam
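
A quick way to collect that from every osd host in one go might look like the following sketch; the host names are the ones mentioned in this thread and passwordless ssh is assumed:

for h in g13ct g14ct; do
    echo "== $h =="                        # label each host's output
    ssh "$h" 'route -n; ip -4 addr show'   # routing table plus the addresses bound to each interface
done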

>
> Isaac
>
>
>
>
> ----- Original Message -----
> From: Gregory Farnum <greg@inktank.com>
> To: Isaac Otsiabah <zmoo76b@yahoo.com>
> Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> Sent: Thursday, January 24, 2013 1:28 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>
> What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :)
> -Greg
>
>
> On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:
>
>>
>>
>> Gregory, I tried to send the attached debug output several times and
>> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You can see the
>> reconnection failures from the error message line below. The ceph version
>> is 0.56.
>>
>>
>> It appears to be a timing issue because with the flag (debug ms = 1) turned on, the system ran slower and the failure became harder to reproduce.
>> I ran it several times and finally got it to fail on osd.0 using the
>> default crush map. The attached tar file contains log files for all
>> components on g8ct plus the ceph.conf. By the way, the log file contains only the last 1384 lines, where the error occurs.
>>
>>
>> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
>>
>>
>> id weight type name up/down reweight
>> -1 6 root default
>> -3 6 rack unknownrack
>> -2 3 host g8ct
>> 0 1 osd.0 down 1
>> 1 1 osd.1 up 1
>> 2 1 osd.2 up 1
>> -4 3 host g13ct
>> 3 1 osd.3 up 1
>> 4 1 osd.4 up 1
>> 5 1 osd.5 up 1
>>
>>
>>
>> The error messages are in ceph.log and ceph-osd.0.log:
>>
>> ceph.log:2013-01-08
>> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had
>> wrong cluster addr (192.168.0.124:6802/25571 != my
>> 192.168.1.124:6802/25571)
>> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr
>> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
>>
>>
>>
>> [root@g8ct ceph]# ceph -v
>> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>
>>
>> Isaac
>>
>>
>> ----- Original Message -----
>> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
>> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
>> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
>> Sent: Monday, January 7, 2013 1:27 PM
>> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
>>
>> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
>>
>>
>> When I add a new host (with osds) to my existing cluster, 1 or 2 of the
>> existing osds go down for about 2 minutes and then they come back
>> up.
>> >
>> >
>> > [root@h1ct ~]# ceph osd tree
>> >
>> > # id weight type name up/down reweight
>> > -1
>> > 3 root default
>> > -3 3 rack unknownrack
>> > -2 3 host h1
>> > 0 1 osd.0 up 1
>> > 1 1 osd.1 up 1
>> > 2
>> > 1 osd.2 up 1
>>
>>
>> For example, after adding host h2 (with 3 new osds) to the above cluster
>> and running the "ceph osd tree" command, I see this:
>> >
>> >
>> > [root@h1 ~]# ceph osd tree
>> >
>> > # id weight type name up/down reweight
>> > -1 6 root default
>> > -3
>> > 6 rack unknownrack
>> > -2 3 host h1
>> > 0 1 osd.0 up 1
>> > 1 1 osd.1 down 1
>> > 2
>> > 1 osd.2 up 1
>> > -4 3 host h2
>> > 3 1 osd.3 up 1
>> > 4 1 osd.4 up
>> > 1
>> > 5 1 osd.5 up 1
>>
>>
>> The down osd always comes back up after 2 minutes or less, and I see the
>> following error message in the respective osd log file:
>> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open
>> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size
>> > 4096 bytes, directio = 1, aio = 0
>> > 2013-01-07 04:40:17.613122
>> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26:
>> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
>> > 2013-01-07
>> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >>
>> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0
>> > l=0).accept connect_seq 0 vs existing 0 state connecting
>> > 2013-01-07
>> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1
>> > l=0).fault, initiating reconnect
>> > 2013-01-07 04:45:29.835748
>> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >>
>> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3
>> > l=0).fault, initiating reconnect
>> > 2013-01-07 04:45:30.835219 7fec743f4710 0 --
>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating
>> > reconnect
>> > 2013-01-07 04:45:30.837318 7fec743f4710 0 --
>> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072
>> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating
>> > reconnect
>> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map
>> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my
>> > 192.168.1.124:6808/19449)
>> >
>> > Also, this only happens when the cluster ip address and the public ip address are different, for example:
>> > ....
>> > ....
>> > ....
>> > [osd.0]
>> > host = g8ct
>> > public address = 192.168.0.124
>> > cluster address = 192.168.1.124
>> > btrfs devs = /dev/sdb
>> >
>> > ....
>> > ....
>> >
>> > but does not happen when they are the same. Any idea what may be the issue?
>> This isn't familiar to me at first glance. What version of Ceph are you using?
>>
>> If
>> this is easy to reproduce, can you pastebin your ceph.conf and then add
>> "debug ms = 1" to your global config and gather up the logs from each
>> daemon?
>> -Greg
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
>> More majordomo info at http://vger.kernel.org/majordomo
>>
>>
>> Attachments:
>> - ceph-osd.0.log.tar.gz
>>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-01-24 21:28 ` Gregory Farnum
@ 2013-01-25 17:51   ` Isaac Otsiabah
  2013-01-25 23:46     ` Sam Lang
  2013-01-28 20:17     ` Isaac Otsiabah
  0 siblings, 2 replies; 24+ messages in thread
From: Isaac Otsiabah @ 2013-01-25 17:51 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel



Gregory, the network physical layout is simple; the two networks are
separate. The 192.168.0 and 192.168.1 networks are not subnets within a
single network.
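
One way to double-check that the daemons really ended up on the intended networks is to compare what they bound to locally with what the monitors recorded. A sketch under the assumption of default log/socket setups and a root shell:

netstat -tlnp | grep ceph-osd   # sockets each ceph-osd process is actually listening on
ceph osd dump | grep '^osd\.'   # public/cluster addresses the cluster map holds for each osd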

Isaac  




----- Original Message -----
From: Gregory Farnum <greg@inktank.com>
To: Isaac Otsiabah <zmoo76b@yahoo.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Sent: Thursday, January 24, 2013 1:28 PM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :) 
-Greg


On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:

> 
> 
> Gregory, I tried to send the attached debug output several times and
> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You can see the
> reconnection failures from the error message line below. The ceph version
> is 0.56.
> 
> 
> It appears to be a timing issue because with the flag (debug ms = 1) turned on, the system ran slower and the failure became harder to reproduce.
> I ran it several times and finally got it to fail on osd.0 using the
> default crush map. The attached tar file contains log files for all
> components on g8ct plus the ceph.conf. By the way, the log file contains only the last 1384 lines, where the error occurs.
> 
> 
> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
> 
> 
> id weight type name up/down reweight
> -1 6 root default
> -3 6 rack unknownrack
> -2 3 host g8ct
> 0 1 osd.0 down 1
> 1 1 osd.1 up 1
> 2 1 osd.2 up 1
> -4 3 host g13ct
> 3 1 osd.3 up 1
> 4 1 osd.4 up 1
> 5 1 osd.5 up 1
> 
> 
> 
> The error messages are in ceph.log and ceph-osd.0.log:
> 
> ceph.log:2013-01-08
> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had 
> wrong cluster addr (192.168.0.124:6802/25571 != my 
> 192.168.1.124:6802/25571)
> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr 
> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
> 
> 
> 
> [root@g8ct ceph]# ceph -v
> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
> 
> 
> Isaac
> 
> 
> ----- Original Message -----
> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
> Sent: Monday, January 7, 2013 1:27 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
> 
> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
> 
> 
> When I add a new host (with osds) to my existing cluster, 1 or 2 of the
> existing osds go down for about 2 minutes and then they come back
> up.
> > 
> > 
> > [root@h1ct ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 
> > 3 root default
> > -3 3 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 up 1
> > 2 
> > 1 osd.2 up 1
> 
> 
> For example, after adding host h2 (with 3 new osds) to the above cluster
> and running the "ceph osd tree" command, I see this:
> > 
> > 
> > [root@h1 ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 6 root default
> > -3 
> > 6 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 down 1
> > 2 
> > 1 osd.2 up 1
> > -4 3 host h2
> > 3 1 osd.3 up 1
> > 4 1 osd.4 up 
> > 1
> > 5 1 osd.5 up 1
> 
> 
> The down osd always comes back up after 2 minutes or less, and I see the
> following error message in the respective osd log file:
> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open 
> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 
> > 4096 bytes, directio = 1, aio = 0
> > 2013-01-07 04:40:17.613122 
> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 
> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> > 2013-01-07
> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >> 
> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0 
> > l=0).accept connect_seq 0 vs existing 0 state connecting
> > 2013-01-07 
> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:29.835748 
> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:30.835219 7fec743f4710 0 -- 
> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating 
> > reconnect
> > 2013-01-07 04:45:30.837318 7fec743f4710 0 -- 
> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating 
> > reconnect
> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map 
> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 
> > 192.168.1.124:6808/19449)
> > 
> > Also, this only happens when the cluster ip address and the public ip address are different, for example:
> > ....
> > ....
> > ....
> > [osd.0]
> > host = g8ct
> > public address = 192.168.0.124
> > cluster address = 192.168.1.124
> > btrfs devs = /dev/sdb
> > 
> > ....
> > ....
> > 
> > but does not happen when they are the same. Any idea what may be the issue?
> This isn't familiar to me at first glance. What version of Ceph are you using?
> 
> If
> this is easy to reproduce, can you pastebin your ceph.conf and then add
> "debug ms = 1" to your global config and gather up the logs from each 
> daemon?
> -Greg
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> More majordomo info at http://vger.kernel.org/majordomo
> 
> 
> Attachments: 
> - ceph-osd.0.log.tar.gz
> 



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
  2013-01-24 17:24 osd down (for 2 about 2 minutes) error after adding a new host to my cluster Isaac Otsiabah
@ 2013-01-24 21:28 ` Gregory Farnum
  2013-01-25 17:51   ` Isaac Otsiabah
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Farnum @ 2013-01-24 21:28 UTC (permalink / raw)
  To: Isaac Otsiabah; +Cc: ceph-devel

What's the physical layout of your networking? This additional log may prove helpful as well, but I really need a bit more context in evaluating the messages I see from the first one. :) 
-Greg


On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:

> 
> 
> Gregory, I tried to send the attached debug output several times and
> the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You can see the
> reconnection failures from the error message line below. The ceph version
> is 0.56.
> 
> 
> It appears to be a timing issue because with the flag (debug ms = 1) turned on, the system ran slower and the failure became harder to reproduce.
> I ran it several times and finally got it to fail on osd.0 using the
> default crush map. The attached tar file contains log files for all
> components on g8ct plus the ceph.conf. By the way, the log file contains only the last 1384 lines, where the error occurs.
> 
> 
> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then added host g13ct (osd.3, osd.4, osd.5)
> 
> 
> id weight type name up/down reweight
> -1 6 root default
> -3 6 rack unknownrack
> -2 3 host g8ct
> 0 1 osd.0 down 1
> 1 1 osd.1 up 1
> 2 1 osd.2 up 1
> -4 3 host g13ct
> 3 1 osd.3 up 1
> 4 1 osd.4 up 1
> 5 1 osd.5 up 1
> 
> 
> 
> The error messages are in ceph.log and ceph-osd.0.log:
> 
> ceph.log:2013-01-08
> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had 
> wrong cluster addr (192.168.0.124:6802/25571 != my 
> 192.168.1.124:6802/25571)
> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 had wrong cluster addr 
> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
> 
> 
> 
> [root@g8ct ceph]# ceph -v
> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
> 
> 
> Isaac
> 
> 
> ----- Original Message -----
> From: Gregory Farnum <greg@inktank.com (mailto:greg@inktank.com)>
> To: Isaac Otsiabah <zmoo76b@yahoo.com (mailto:zmoo76b@yahoo.com)>
> Cc: "ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)" <ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org)>
> Sent: Monday, January 7, 2013 1:27 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
> 
> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
> 
> 
> When I add a new host (with osds) to my existing cluster, 1 or 2 of the
> existing osds go down for about 2 minutes and then they come back
> up.
> > 
> > 
> > [root@h1ct ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 
> > 3 root default
> > -3 3 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 up 1
> > 2 
> > 1 osd.2 up 1
> 
> 
> For example, after adding host h2 (with 3 new osds) to the above cluster
> and running the "ceph osd tree" command, I see this:
> > 
> > 
> > [root@h1 ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 6 root default
> > -3 
> > 6 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 down 1
> > 2 
> > 1 osd.2 up 1
> > -4 3 host h2
> > 3 1 osd.3 up 1
> > 4 1 osd.4 up 
> > 1
> > 5 1 osd.5 up 1
> 
> 
> The down osd always comes back up after 2 minutes or less, and I see the
> following error message in the respective osd log file:
> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open 
> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 
> > 4096 bytes, directio = 1, aio = 0
> > 2013-01-07 04:40:17.613122 
> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 
> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> > 2013-01-07
> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >> 
> > 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0 
> > l=0).accept connect_seq 0 vs existing 0 state connecting
> > 2013-01-07 
> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:29.835748 
> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:30.835219 7fec743f4710 0 -- 
> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating 
> > reconnect
> > 2013-01-07 04:45:30.837318 7fec743f4710 0 -- 
> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> > pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating 
> > reconnect
> > 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map 
> > e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 
> > 192.168.1.124:6808/19449)
> > 
> > Also, this only happens when the cluster ip address and the public ip address are different, for example:
> > ....
> > ....
> > ....
> > [osd.0]
> > host = g8ct
> > public address = 192.168.0.124
> > cluster address = 192.168.1.124
> > btrfs devs = /dev/sdb
> > 
> > ....
> > ....
> > 
> > but does not happen when they are the same. Any idea what may be the issue?
> This isn't familiar to me at first glance. What version of Ceph are you using?
> 
> If
> this is easy to reproduce, can you pastebin your ceph.conf and then add
> "debug ms = 1" to your global config and gather up the logs from each 
> daemon?
> -Greg
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)
> More majordomo info at http://vger.kernel.org/majordomo
> 
> 
> Attachments: 
> - ceph-osd.0.log.tar.gz
> 




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
@ 2013-01-24 17:24 Isaac Otsiabah
  2013-01-24 21:28 ` Gregory Farnum
  0 siblings, 1 reply; 24+ messages in thread
From: Isaac Otsiabah @ 2013-01-24 17:24 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5466 bytes --]



Gregory, I tried to send the attached debug output several times and
the mail server rejected them all, probably because of the file size, so I cut the log file size down and it is attached. You can see the
reconnection failures from the error message line below. The ceph version
is 0.56.


It appears to be a timing issue because with the flag (debug ms = 1) turned on, the system ran slower and the failure became harder to reproduce.
I ran it several times and finally got it to fail on osd.0 using the
default crush map. The attached tar file contains log files for all
components on g8ct plus the ceph.conf. By the way, the log file contains only the last 1384 lines, where the error occurs.
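
For what it's worth, a trimmed archive like the one attached can be produced with nothing more than tail and tar; the paths and file names below are illustrative defaults, not the exact files used here:

tail -n 1384 /var/log/ceph/ceph-osd.0.log > ceph-osd.0.log   # keep only the last 1384 lines, around the error
tar czf ceph-osd.0.log.tar.gz ceph-osd.0.log ceph.conf       # bundle the trimmed log with the config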


I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2)  and then added host g13ct (osd.3, osd.4, osd.5)


 id    weight  type name       up/down reweight
-1      6       root default
-3      6               rack unknownrack
-2      3                       host g8ct
0       1                               osd.0   down    1
1       1                               osd.1   up      1
2       1                               osd.2   up      1
-4      3                       host g13ct
3       1                               osd.3   up      1
4       1                               osd.4   up      1
5       1                               osd.5   up      1



The error messages are in ceph.log and ceph-osd.0.log:

ceph.log:2013-01-08
05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had 
wrong cluster addr (192.168.0.124:6802/25571 != my 
192.168.1.124:6802/25571)
ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710  0 log [ERR] : map e15 had wrong cluster addr 
(192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
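
To pull every occurrence of that mismatch out of the osd logs on a host, a grep along these lines is enough (the default log directory is assumed):

grep -n "wrong cluster addr" /var/log/ceph/ceph-osd.*.log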



[root@g8ct ceph]# ceph -v
ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)


Isaac


----- Original Message -----
From: Gregory Farnum <greg@inktank.com>
To: Isaac Otsiabah <zmoo76b@yahoo.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Sent: Monday, January 7, 2013 1:27 PM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
> 
> 
>
When I add a new host (with osds) to my existing cluster, 1 or 2 of the
existing osds go down for about 2 minutes and then they come back
up.
> 
> 
> [root@h1ct ~]# ceph osd tree
> 
> # id weight type name up/down reweight
> -1 3 root default
> -3 3 rack unknownrack
> -2 3 host h1
> 0 1 osd.0 up 1
> 1 1 osd.1 up 1
> 2 1 osd.2 up 1
> 
> 
>
For example, after adding host h2 (with 3 new osds) to the above cluster
and running the "ceph osd tree" command, I see this:
> 
> 
> [root@h1 ~]# ceph osd tree
> 
> # id weight type name up/down reweight
> -1 6 root default
> -3 6 rack unknownrack
> -2 3 host h1
> 0 1 osd.0 up 1
> 1 1 osd.1 down 1
> 2 1 osd.2 up 1
> -4 3 host h2
> 3 1 osd.3 up 1
> 4 1 osd.4 up 1
> 5 1 osd.5 up 1
> 
> 
>
The down osd always comes back up after 2 minutes or less, and I see the
following error message in the respective osd log file:
> 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open 
> /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 
> 4096 bytes, directio = 1, aio = 0
> 2013-01-07 04:40:17.613122 
> 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 
> 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> 2013-01-07
> 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >> 
> 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0 
> l=0).accept connect_seq 0 vs existing 0 state connecting
> 2013-01-07 
> 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 
> l=0).fault, initiating reconnect
> 2013-01-07 04:45:29.835748 
> 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 
> l=0).fault, initiating reconnect
> 2013-01-07 04:45:30.835219 7fec743f4710 0 -- 
> 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating 
> reconnect
> 2013-01-07 04:45:30.837318 7fec743f4710 0 -- 
> 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating 
> reconnect
> 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map 
> e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 
> 192.168.1.124:6808/19449)
> 
> Also, this only happens when the cluster ip address and the public ip address are different, for example:
> ....
> ....
> ....
> [osd.0]
> host = g8ct
> public address = 192.168.0.124
> cluster address = 192.168.1.124
> btrfs devs = /dev/sdb
> 
> ....
> ....
> 
> but does not happen when they are the same. Any idea what may be the issue?
> 
This isn't familiar to me at first glance. What version of Ceph are you using?

If
this is easy to reproduce, can you pastebin your ceph.conf and then add
"debug ms = 1" to your global config and gather up the logs from each 
daemon?
-Greg
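
For reference, the per-osd public/cluster addresses quoted above can also be expressed once for the whole cluster. A sketch of a [global] section combining that with the requested messenger debugging; the /24 subnets are an assumption based on the addresses in this thread, not a confirmed layout:

[global]
        ; client-facing traffic
        public network = 192.168.0.0/24
        ; osd-to-osd replication traffic
        cluster network = 192.168.1.0/24
        ; verbose messenger logging, as requested above
        debug ms = 1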

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo

[-- Attachment #2: ceph-osd.0.log.tar.gz --]
[-- Type: application/x-gzip, Size: 30099 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-02-16  2:00 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-29  7:45 parsing in the ceph osd subsystem Andrey Korolyov
2012-11-29  7:53 ` Gregory Farnum
2012-11-29 16:34 ` Sage Weil
2012-11-29 16:49   ` Joao Eduardo Luis
2012-11-29 19:01   ` Andrey Korolyov
2012-11-29 19:49     ` Joao Eduardo Luis
2012-11-30  1:04     ` Joao Eduardo Luis
     [not found]       ` <1354237947.86472.YahooMailNeo@web121901.mail.ne1.yahoo.com>
2012-11-30  1:22         ` What is the new command to add osd to the crushmap to enable it to receive data Isaac Otsiabah
2012-11-30  1:54           ` Joao Eduardo Luis
     [not found]           ` <1357591756.80653.YahooMailNeo@web121903.mail.ne1.yahoo.com>
2013-01-07 21:00             ` osd down (for 2 about 2 minutes) error after adding a new host to my cluster Isaac Otsiabah
2013-01-07 21:27               ` Gregory Farnum
     [not found]                 ` <1357680673.72602.YahooMailNeo@web121904.mail.ne1.yahoo.com>
2013-01-10 18:32                   ` Gregory Farnum
2013-01-10 18:45                     ` What is the acceptable attachment file size on the mail server? Isaac Otsiabah
2013-01-10 18:57                       ` Gregory Farnum
2013-01-12  1:41                       ` Yan, Zheng 
2013-01-24 17:24 osd down (for 2 about 2 minutes) error after adding a new host to my cluster Isaac Otsiabah
2013-01-24 21:28 ` Gregory Farnum
2013-01-25 17:51   ` Isaac Otsiabah
2013-01-25 23:46     ` Sam Lang
2013-01-28 20:17     ` Isaac Otsiabah
2013-02-11 20:29       ` Gregory Farnum
2013-02-12  1:39         ` Isaac Otsiabah
2013-02-15 17:20           ` Sam Lang
2013-02-16  2:00             ` Isaac Otsiabah
