* Mismatching nonce for 'ceph osd.0 tell'
@ 2016-09-12 22:59 Willem Jan Withagen
  2016-09-13  2:29 ` Haomai Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-12 22:59 UTC (permalink / raw)
  To: Ceph Development

Hi

When running cephtool-test-mon.sh, part of it executes:
  ceph tell osd.0 version
I see reports on the command line; I guess this is the OSD
complaining that things are wrong:

2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
127.0.0.1:6800/26384 - wrong node!

This keeps repeating until the command is shot down... after 3600 secs.

The nonce is incremented by 1000000 on every rebind.
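For reference, this is roughly what I understand that rebind to do -- a
simplified sketch of my own with approximate names, not the actual Ceph
source:
====
// Sketch only: keep (usually) the same port, but bump the nonce so peers
// can tell this is a fresh incarnation of the messenger.
void rebind_sketch(entity_addr_t &addr, int port) {
  addr.set_port(port);        // normally the same port is reused
  addr.nonce += 1000000;      // the bump I see in the logs
}
====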

But what I do not understand is how this mismatch occurred.
I would expect port 6800 to be the port the OSD is listening on, so the
connecting party (ceph in this case) thinks the nonce is 1026384. Did
the MON have this information? And where did the MON then get it
from....

Somewhere one of the parties did not receive the new nonce, or did not
increment it as well?

Any suggestions on where to look are welcome.

Thanx,
--WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-12 22:59 Mismatching nonce for 'ceph osd.0 tell' Willem Jan Withagen
@ 2016-09-13  2:29 ` Haomai Wang
  2016-09-13  9:00   ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Haomai Wang @ 2016-09-13  2:29 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Ceph Development

On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> Hi
>
> When running  cephtool-test-mon.sh, part of it executes:
>   ceph tell osd.0 version
> I see reports on the commandline, I guess that this is the OSD
> complaining that things are wrong:
>
> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
> 127.0.0.1:6800/26384 - wrong node!
>
> Which it will run until it is shot down.... after 3600 secs.
>
> the nonce is incremented with 1000000 on every rebind.
>
> But what I do not understand is how this mismatch has occurred.
> I would expect port 6800 to be the port on which the OSD is connected
> too, so the connecting party (ceph in this case) thinks the nonce to be
> 1026384. Did the MON have this information? And where did the MON then
> get it from....
>
> Somewhere one of the parts did not receive the new nonce, or did not
> also increment it?

The nonce is part of ceph_entity_addr, so the OSDMap will carry it.
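Roughly, simplified and with field names from memory (see ceph_msgr.h for
the real definition):
====
// Sketch of the wire-level entity address.  The OSDMap stores one of
// these per OSD endpoint, so whatever nonce the OSD reported at boot is
// what clients later expect when they connect.
struct ceph_entity_addr_sketch {
  uint32_t type;                     // address type
  uint32_t nonce;                    // pid-like unique id, bumped on rebind
  struct sockaddr_storage in_addr;   // the actual ip:port
};
====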

>
> Any suggestions welcomed on directions where to look,
>
> Thanx,
> --WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13  2:29 ` Haomai Wang
@ 2016-09-13  9:00   ` Willem Jan Withagen
  2016-09-13 19:52     ` Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-13  9:00 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Ceph Development

On 13-9-2016 04:29, Haomai Wang wrote:
> On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> Hi
>>
>> When running  cephtool-test-mon.sh, part of it executes:
>>   ceph tell osd.0 version
>> I see reports on the commandline, I guess that this is the OSD
>> complaining that things are wrong:
>>
>> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
>> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
>> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
>> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
>> 127.0.0.1:6800/26384 - wrong node!
>>
>> Which it will run until it is shot down.... after 3600 secs.
>>
>> the nonce is incremented with 1000000 on every rebind.
>>
>> But what I do not understand is how this mismatch has occurred.
>> I would expect port 6800 to be the port on which the OSD is connected
>> too, so the connecting party (ceph in this case) thinks the nonce to be
>> 1026384. Did the MON have this information? And where did the MON then
>> get it from....
>>
>> Somewhere one of the parts did not receive the new nonce, or did not
>> also increment it?
> 
> nonce is a part of ceph_entity_addr, so OSDMap will take this

Right, but then the following is also suspicious???

====
# ceph osd dump
epoch 188
fsid 2e02472d-ecbb-43ac-a687-bbf2523233d9
created 2016-09-13 10:28:07.970254
modified 2016-09-13 10:34:57.318988
flags sortbitwise,require_jewel_osds,require_kraken_osds
pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
max_osd 10
osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
e0e44b9c-9869-49d8-8afb-bdb71c04ea27
osd.1 up   in  weight 1 up_from 10 up_thru 184 down_at 0
last_clean_interval [0,0) 127.0.0.1:6804/36579 127.0.0.1:6805/36579
127.0.0.1:6806/36579 127.0.0.1:6807/36579 exists,up
b554849c-2cf1-4cf7-a5fd-3529d33345ff
osd.2 up   in  weight 1 up_from 12 up_thru 185 down_at 0
last_clean_interval [0,0) 127.0.0.1:6808/36593 127.0.0.1:6809/36593
127.0.0.1:6810/36593 127.0.0.1:6811/36593 exists,up
2d6648ba-72e1-4c53-ae10-929a9d13a3dd
====

osd.0 has:
	127.0.0.1:6800/36565
	127.0.0.1:6800/1036565
	127.0.0.1:6804/1036565
	127.0.0.1:6805/1036565

So I guess that one nonce did not get updated, because I would expect
all ports to be rebound, each incrementing its nonce?

The other bad thing is that ports 6804 and 6805 now appear in both osd.0
and osd.1, which I would guess is also going to create some trouble.

So this is what the osdmap distributes, and what the MON then reports
to clients?

How would I retrieve the dump from osd.0 itself?
Trying:
# ceph -c ceph.conf daemon osd.0 dump
Can't get admin socket path: [Errno 2] No such file or directory

Using the admin socket directly does work.
But there is not really a command to get something equivalent to:
====
osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
e0e44b9c-9869-49d8-8afb-bdb71c04ea27
====
So that one can see what the OSD itself thinks it is....
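Something along these lines is what I have in mind -- just a rough,
untested sketch with the signatures approximated from memory, not a claim
about the current tree:
====
// Hypothetical admin-socket hook so that
//   ceph daemon osd.0 dump_my_addrs
// would show the addresses the OSD itself believes it has.
class DumpAddrsHook : public AdminSocketHook {
  OSD *osd;
public:
  explicit DumpAddrsHook(OSD *o) : osd(o) {}
  bool call(std::string command, cmdmap_t &cmdmap, std::string format,
            bufferlist &out) override {
    std::ostringstream ss;
    ss << "client  " << osd->client_messenger->get_myaddr() << "\n"
       << "cluster " << osd->cluster_messenger->get_myaddr() << "\n";
    out.append(ss.str());
    return true;
  }
};
// registered somewhere in OSD::init(), roughly:
//   cct->get_admin_socket()->register_command(
//       "dump_my_addrs", "dump_my_addrs", hook, "dump messenger addresses");
====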

--WjW



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13  9:00   ` Willem Jan Withagen
@ 2016-09-13 19:52     ` Gregory Farnum
  2016-09-13 20:21       ` Willem Jan Withagen
  2016-09-13 20:29       ` Willem Jan Withagen
  0 siblings, 2 replies; 18+ messages in thread
From: Gregory Farnum @ 2016-09-13 19:52 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Haomai Wang, Ceph Development

On Tue, Sep 13, 2016 at 2:00 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 13-9-2016 04:29, Haomai Wang wrote:
>> On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> Hi
>>>
>>> When running  cephtool-test-mon.sh, part of it executes:
>>>   ceph tell osd.0 version
>>> I see reports on the commandline, I guess that this is the OSD
>>> complaining that things are wrong:
>>>
>>> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
>>> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
>>> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
>>> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
>>> 127.0.0.1:6800/26384 - wrong node!
>>>
>>> Which it will run until it is shot down.... after 3600 secs.
>>>
>>> the nonce is incremented with 1000000 on every rebind.
>>>
>>> But what I do not understand is how this mismatch has occurred.
>>> I would expect port 6800 to be the port on which the OSD is connected
>>> too, so the connecting party (ceph in this case) thinks the nonce to be
>>> 1026384. Did the MON have this information? And where did the MON then
>>> get it from....
>>>
>>> Somewhere one of the parts did not receive the new nonce, or did not
>>> also increment it?
>>
>> nonce is a part of ceph_entity_addr, so OSDMap will take this
>
> Right, but then the following is also suspicious???
>
> ====
> # ceph osd dump
> epoch 188
> fsid 2e02472d-ecbb-43ac-a687-bbf2523233d9
> created 2016-09-13 10:28:07.970254
> modified 2016-09-13 10:34:57.318988
> flags sortbitwise,require_jewel_osds,require_kraken_osds
> pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
> max_osd 10
> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
> osd.1 up   in  weight 1 up_from 10 up_thru 184 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6804/36579 127.0.0.1:6805/36579
> 127.0.0.1:6806/36579 127.0.0.1:6807/36579 exists,up
> b554849c-2cf1-4cf7-a5fd-3529d33345ff
> osd.2 up   in  weight 1 up_from 12 up_thru 185 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6808/36593 127.0.0.1:6809/36593
> 127.0.0.1:6810/36593 127.0.0.1:6811/36593 exists,up
> 2d6648ba-72e1-4c53-ae10-929a9d13a3dd
> ====
>
> osd.0 has:
>         127.0.0.1:6800/36565
>         127.0.0.1:6800/1036565
>         127.0.0.1:6804/1036565
>         127.0.0.1:6805/1036565
>
> So I guess that one nonce did not get updated, because I would expect
> all ports te be rebound, and incr the nonce?
>
> The other bad thing is that ports 6804 and 6805 are now both in osd.0
> and osd.1, which is going to create some trouble also I would guess.
>
> So this is what the osdmap distributes?
> And then MON reports to clients?
>
> How would I retrieve the dump from osd.0 itself?
> Trying:
> # ceph -c ceph.conf daemon osd.0 dump
> Can't get admin socket path: [Errno 2] No such file or directory
>
> Using the admin-socket directly does work.
> But there is not really a command to get something equal to:
> ====
> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
> ====
> So that one can see what the OSD itself thinks it is....

Is osd.0 actually running? If so it *should* have a socket, unless
you've disabled them somehow. Check the logs and see if there are
failures when it gets set up, I guess?

Anyway, something has indeed gone terribly wrong here. I know at one
point you had some messenger patches you were using to try and get
stuff going on BSD; if you still have some there I think you need to
consider them suspect. Otherwise, uh...the network stack is behaving
very differently than Linux's?
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 19:52     ` Gregory Farnum
@ 2016-09-13 20:21       ` Willem Jan Withagen
  2016-09-13 20:29       ` Willem Jan Withagen
  1 sibling, 0 replies; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-13 20:21 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 13-9-2016 21:52, Gregory Farnum wrote:
> On Tue, Sep 13, 2016 at 2:00 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 13-9-2016 04:29, Haomai Wang wrote:
>>> On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>> Hi
>>>>
>>>> When running  cephtool-test-mon.sh, part of it executes:
>>>>   ceph tell osd.0 version
>>>> I see reports on the commandline, I guess that this is the OSD
>>>> complaining that things are wrong:
>>>>
>>>> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
>>>> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
>>>> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
>>>> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
>>>> 127.0.0.1:6800/26384 - wrong node!
>>>>
>>>> Which it will run until it is shot down.... after 3600 secs.
>>>>
>>>> the nonce is incremented with 1000000 on every rebind.
>>>>
>>>> But what I do not understand is how this mismatch has occurred.
>>>> I would expect port 6800 to be the port on which the OSD is connected
>>>> too, so the connecting party (ceph in this case) thinks the nonce to be
>>>> 1026384. Did the MON have this information? And where did the MON then
>>>> get it from....
>>>>
>>>> Somewhere one of the parts did not receive the new nonce, or did not
>>>> also increment it?
>>>
>>> nonce is a part of ceph_entity_addr, so OSDMap will take this
>>
>> Right, but then the following is also suspicious???
>>
>> ====
>> # ceph osd dump
>> epoch 188
>> fsid 2e02472d-ecbb-43ac-a687-bbf2523233d9
>> created 2016-09-13 10:28:07.970254
>> modified 2016-09-13 10:34:57.318988
>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>> pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
>> rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
>> max_osd 10
>> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
>> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
>> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
>> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
>> osd.1 up   in  weight 1 up_from 10 up_thru 184 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6804/36579 127.0.0.1:6805/36579
>> 127.0.0.1:6806/36579 127.0.0.1:6807/36579 exists,up
>> b554849c-2cf1-4cf7-a5fd-3529d33345ff
>> osd.2 up   in  weight 1 up_from 12 up_thru 185 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6808/36593 127.0.0.1:6809/36593
>> 127.0.0.1:6810/36593 127.0.0.1:6811/36593 exists,up
>> 2d6648ba-72e1-4c53-ae10-929a9d13a3dd
>> ====
>>
>> osd.0 has:
>>         127.0.0.1:6800/36565
>>         127.0.0.1:6800/1036565
>>         127.0.0.1:6804/1036565
>>         127.0.0.1:6805/1036565
>>
>> So I guess that one nonce did not get updated, because I would expect
>> all ports te be rebound, and incr the nonce?
>>
>> The other bad thing is that ports 6804 and 6805 are now both in osd.0
>> and osd.1, which is going to create some trouble also I would guess.
>>
>> So this is what the osdmap distributes?
>> And then MON reports to clients?
>>
>> How would I retrieve the dump from osd.0 itself?
>> Trying:
>> # ceph -c ceph.conf daemon osd.0 dump
>> Can't get admin socket path: [Errno 2] No such file or directory
>>
>> Using the admin-socket directly does work.
>> But there is not really a command to get something equal to:
>> ====
>> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
>> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
>> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
>> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
>> ====
>> So that one can see what the OSD itself thinks it is....
> 
> Is osd.0 actually running? If so it *should* have a socket, unless
> you've disabled them somehow. Check the logs and see if there are
> failures when it gets set up, I guess?

Yup, the OSDs are up. I needed to work on the EventKqueue() stuff
because events could not be submitted once threads had been spawned.

> Anyway, something has indeed gone terribly wrong here. I know at one
> point you had some messenger patches you were using to try and get
> stuff going on BSD; if you still have some there I think you need to
> consider them suspect. Otherwise, uh...the network stack is behaving
> very differently than Linux's?

Well, a large part of the challenge is that Linux uses EventEpoll(),
while on FreeBSD EventKqueue() is the choice (if only because FreeBSD
does not have epoll()).

One of the problems is that the descriptor for reading/writing the events
does not survive a fork, and thus any setup and/or events done before
threads are forked will be lost.
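A minimal stand-alone example of what I mean (behaviour per kqueue(2); the
exact error in the child may differ, so treat this as a sketch):
====
/* kqfork.cc -- register a kevent, fork, then try to use the queue in the
 * child.  kqueue(2) documents that the queue is not inherited across
 * fork(), so the child's kevent() call is expected to fail. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/wait.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
  int kq = kqueue();
  struct kevent ev;
  EV_SET(&ev, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
  kevent(kq, &ev, 1, NULL, 0, NULL);          /* register before forking */

  pid_t pid = fork();
  if (pid == 0) {
    struct timespec ts = { 0, 0 };
    int n = kevent(kq, NULL, 0, &ev, 1, &ts); /* expected to fail here */
    printf("child: kevent returned %d\n", n);
    _exit(0);
  }
  waitpid(pid, NULL, 0);
  return 0;
}
====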

So yes, my patches are incomplete, that much I know. But I'm gathering
knowledge to see how to better diagnose the problem, and to have the tools
to verify that the fix really works.
Hence my question whether there is a ceph command that delivers more or
less the same OSD dump output, directly from the OSD itself.

I'm discussing this also with Haomai Wang, who did the first version of
EventKqueue...

--WjW


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 19:52     ` Gregory Farnum
  2016-09-13 20:21       ` Willem Jan Withagen
@ 2016-09-13 20:29       ` Willem Jan Withagen
  2016-09-13 20:35         ` Gregory Farnum
  1 sibling, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-13 20:29 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 13-9-2016 21:52, Gregory Farnum wrote:
> Is osd.0 actually running? If so it *should* have a socket, unless
> you've disabled them somehow. Check the logs and see if there are
> failures when it gets set up, I guess?
> 
> Anyway, something has indeed gone terribly wrong here. I know at one
> point you had some messenger patches you were using to try and get
> stuff going on BSD; if you still have some there I think you need to
> consider them suspect. Otherwise, uh...the network stack is behaving
> very differently than Linux's?

So what is the expected result of an osd down/up?

Before, it is connected to ports like:
	127.0.0.1:{6800,6801,6802}/{nonce=pid-like}
After the osd has gone down/up, would the sockets be:
	127.0.0.1:{6800,6801,6802}/{(nonce=pid-like)+1000000}

or are the ports also incremented?

--WjW



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 20:29       ` Willem Jan Withagen
@ 2016-09-13 20:35         ` Gregory Farnum
  2016-09-28 12:01           ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-09-13 20:35 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Haomai Wang, Ceph Development

On Tue, Sep 13, 2016 at 1:29 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 13-9-2016 21:52, Gregory Farnum wrote:
>> Is osd.0 actually running? If so it *should* have a socket, unless
>> you've disabled them somehow. Check the logs and see if there are
>> failures when it gets set up, I guess?
>>
>> Anyway, something has indeed gone terribly wrong here. I know at one
>> point you had some messenger patches you were using to try and get
>> stuff going on BSD; if you still have some there I think you need to
>> consider them suspect. Otherwise, uh...the network stack is behaving
>> very differently than Linux's?
>
> So what is the expected result of an osd down/up?
>
> Before it is connected to ports like:
>         127.0.0.1:{6800,,6801,6802}/{nonce=pid-like}
> after the osd has gone down/up the sockets work be:
>         127.0.0.1:{6800,,6801,6802}/{(nonce=pid-like)+1000000}
>
> or are the ports also incremented?

IIRC, it should usually be the same ports and different nonce. But the
port is *allowed* to change; that happens sometimes if there was an
unclean shutdown and the port is still considered in-use by the OS for
instance.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 20:35         ` Gregory Farnum
@ 2016-09-28 12:01           ` Willem Jan Withagen
  2016-10-03 17:50             ` Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-28 12:01 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 13-9-2016 22:35, Gregory Farnum wrote:
> On Tue, Sep 13, 2016 at 1:29 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 13-9-2016 21:52, Gregory Farnum wrote:
>>> Is osd.0 actually running? If so it *should* have a socket, unless
>>> you've disabled them somehow. Check the logs and see if there are
>>> failures when it gets set up, I guess?
>>>
>>> Anyway, something has indeed gone terribly wrong here. I know at one
>>> point you had some messenger patches you were using to try and get
>>> stuff going on BSD; if you still have some there I think you need to
>>> consider them suspect. Otherwise, uh...the network stack is behaving
>>> very differently than Linux's?
>>
>> So what is the expected result of an osd down/up?
>>
>> Before it is connected to ports like:
>>         127.0.0.1:{6800,,6801,6802}/{nonce=pid-like}
>> after the osd has gone down/up the sockets work be:
>>         127.0.0.1:{6800,,6801,6802}/{(nonce=pid-like)+1000000}
>>
>> or are the ports also incremented?
> 
> IIRC, it should usually be the same ports and different nonce. But the
> port is *allowed* to change; that happens sometimes if there was an
> unclean shutdown and the port is still considered in-use by the OS for
> instance.
> 
ATM, I have the feeling that I'm even more off track.

In FreeBSD I have:
119: starting osd.0 at :/0 osd_data testdir/osd-bench/0
testdir/osd-bench/0/journal
119: create-or-move updating item name 'osd.0' weight 1 at location
{host=localhost,root=default} to crush map
119: 0
119: epoch 5
119: fsid 6e5ab220-d761-4636-b4b0-27eacadb41e3
119: created 2016-09-28 02:32:44.704630
119: modified 2016-09-28 02:32:49.971511
119: flags sortbitwise,require_jewel_osds,require_kraken_osds
119: pool 1 'rbd' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 3 flags hashpspool
stripe_width 0
119: max_osd 1
119: osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0
last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists,new
1dddf568-c6dd-47d1-b4b4-584867a8b48d

Whereas in Linux I have, at the same point in the script:
130: starting osd.0 at :/0 osd_data testdir/osd-bench/0
testdir/osd-bench/0/journal
130: create-or-move updating item name 'osd.0' weight 1 at location
{host=localhost,root=default} to crush map
130: 0
130: osd.0 up   in  weight 1 up_from 6 up_thru 6 down_at 0
last_clean_interval [0,0) 127.0.0.1:6800/19815 127.0.0.1:6801/19815
127.0.0.1:6802/19815 127.0.0.1:6803/19815 exists,up
084e3410-9eb9-4fc3-b395-e46e9587d351

Question here is:
  If I ask 'ceph osd dump', I'm actually asking ceph-mon.
  And ceph-mon has learned this from (crush?)maps being sent to it by
  ceph-osd.

Is there an easy way to debug/monitor the content of what ceph-osd sends
and ceph-mon receives in the maps?
Just to make sure that it is clear where the problem occurs.

thanx,
--WjW


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-28 12:01           ` Willem Jan Withagen
@ 2016-10-03 17:50             ` Gregory Farnum
  2016-10-04 11:57               ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-10-03 17:50 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Haomai Wang, Ceph Development

On Wed, Sep 28, 2016 at 5:01 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 13-9-2016 22:35, Gregory Farnum wrote:
>> On Tue, Sep 13, 2016 at 1:29 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> On 13-9-2016 21:52, Gregory Farnum wrote:
>>>> Is osd.0 actually running? If so it *should* have a socket, unless
>>>> you've disabled them somehow. Check the logs and see if there are
>>>> failures when it gets set up, I guess?
>>>>
>>>> Anyway, something has indeed gone terribly wrong here. I know at one
>>>> point you had some messenger patches you were using to try and get
>>>> stuff going on BSD; if you still have some there I think you need to
>>>> consider them suspect. Otherwise, uh...the network stack is behaving
>>>> very differently than Linux's?
>>>
>>> So what is the expected result of an osd down/up?
>>>
>>> Before it is connected to ports like:
>>>         127.0.0.1:{6800,,6801,6802}/{nonce=pid-like}
>>> after the osd has gone down/up the sockets work be:
>>>         127.0.0.1:{6800,,6801,6802}/{(nonce=pid-like)+1000000}
>>>
>>> or are the ports also incremented?
>>
>> IIRC, it should usually be the same ports and different nonce. But the
>> port is *allowed* to change; that happens sometimes if there was an
>> unclean shutdown and the port is still considered in-use by the OS for
>> instance.
>>
> ATM, I have the feeling that I'm even more off track.
>
> In FreeBSD I have:
> 119: starting osd.0 at :/0 osd_data testdir/osd-bench/0
> testdir/osd-bench/0/journal
> 119: create-or-move updating item name 'osd.0' weight 1 at location
> {host=localhost,root=default} to crush map
> 119: 0
> 119: epoch 5
> 119: fsid 6e5ab220-d761-4636-b4b0-27eacadb41e3
> 119: created 2016-09-28 02:32:44.704630
> 119: modified 2016-09-28 02:32:49.971511
> 119: flags sortbitwise,require_jewel_osds,require_kraken_osds
> 119: pool 1 'rbd' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 4 pgp_num 4 last_change 3 flags hashpspool
> stripe_width 0
> 119: max_osd 1
> 119: osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0
> last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists,new
> 1dddf568-c6dd-47d1-b4b4-584867a8b48d
>
> Where as in Linux I have, at the same point in the script:
> 130: starting osd.0 at :/0 osd_data testdir/osd-bench/0
> testdir/osd-bench/0/journal
> 130: create-or-move updating item name 'osd.0' weight 1 at location
> {host=localhost,root=default} to crush map
> 130: 0
> 130: osd.0 up   in  weight 1 up_from 6 up_thru 6 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6800/19815 127.0.0.1:6801/19815
> 127.0.0.1:6802/19815 127.0.0.1:6803/19815 exists,up
> 084e3410-9eb9-4fc3-b395-e46e9587d351
>
> Question here is:
>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>   And cehp-mon has learned this from (crush?)maps being sent to it by
>   ceph-osd.

The monitor has learned about specific IP addresses/nonces/etc via
MOSDBoot messages from the OSDs. The crush locations are set via
monitor command messages, generally invoked as part of the init
scripts. Maps are generated entirely on the monitor. :)

> Is there an easy way to debug/monitor the content of what ceph-osd sends
> and ceph-mon receives in the maps?
> Just to make sure that it is clear where the problem occurs.

You should be able to see the info going in and out by bumping the
debug levels up — every message's "print" function is invoked when
it's sent/received as long as you have "debug ms = 1". It looks like
the MOSDBoot message doesn't natively dump its addresses but you can
add them easily if you need to.
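For instance, something roughly like this in src/messages/MOSDBoot.h (a
sketch, field names from memory) would make the addresses show up in the
"debug ms = 1" output:
====
// Sketch: extend MOSDBoot's print() so the debug-ms log line for the boot
// message includes the addresses the OSD reported.
void print(ostream& out) const {
  out << "osd_boot(osd." << sb.whoami
      << " booted " << boot_epoch
      << " cluster_addr " << cluster_addr
      << " hb_back_addr " << hb_back_addr
      << " hb_front_addr " << hb_front_addr
      << ")";
}
====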
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-10-03 17:50             ` Gregory Farnum
@ 2016-10-04 11:57               ` Willem Jan Withagen
  2016-12-08 10:03                 ` kefu chai
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-10-04 11:57 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 3-10-2016 19:50, Gregory Farnum wrote:
>> Question here is:
>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>   ceph-osd.
> 
> The monitor has learned about specific IP addresses/nonces/etc via
> MOSDBoot messages from the OSDs. The crush locations are set via
> monitor command messages, generally invoked as part of the init
> scripts. Maps are generated entirely on the monitor. :)
> 
>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>> and ceph-mon receives in the maps?
>> Just to make sure that it is clear where the problem occurs.
> 
> You should be able to see the info going in and out by bumping the
> debug levels up — every message's "print" function is invoked when
> it's sent/received as long as you have "debug ms = 1". It looks like
> the MOSDBoot message doesn't natively dump its addresses but you can
> add them easily if you need to.

Hi Greg,

Thanx for the answer....

I've got debug_ms already pumped up all the way to 20.
So I do get to see what addresses are selected during bind. But still
they do not end up at the MON, and 'ceph osd dump' reports:
	:/0
as bind address.

I'm going to add some more debugs to actually see what MOSDBoot is doing....

It looks like I've tackled most of the EventKqueue forking trouble by
keeping full bookkeeping of the registered state.
I have got to run some more FreeBSD tests to see if I'm actually solving
what I think is the problem. :)

--WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-10-04 11:57               ` Willem Jan Withagen
@ 2016-12-08 10:03                 ` kefu chai
  2016-12-08 12:30                   ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: kefu chai @ 2016-12-08 10:03 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 3-10-2016 19:50, Gregory Farnum wrote:
>>> Question here is:
>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>   ceph-osd.
>>
>> The monitor has learned about specific IP addresses/nonces/etc via
>> MOSDBoot messages from the OSDs. The crush locations are set via
>> monitor command messages, generally invoked as part of the init
>> scripts. Maps are generated entirely on the monitor. :)
>>
>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>> and ceph-mon receives in the maps?
>>> Just to make sure that it is clear where the problem occurs.
>>
>> You should be able to see the info going in and out by bumping the
>> debug levels up — every message's "print" function is invoked when
>> it's sent/received as long as you have "debug ms = 1". It looks like
>> the MOSDBoot message doesn't natively dump its addresses but you can
>> add them easily if you need to.
>
> Hi Greg,
>
> Thanx for the answer....
>
> I've got debug_ms already pumped up all the way to 20.
> So I do get to see what addresses are selected during bind. But still
> they do not end up at the MON, and 'ceph osd dump' reports:
>         :/0
> as bind address.
>
> I'm going to add some more debugs to actually see what MOSDBoot is doing....

There are multiple messengers used by ceph-osd; the one a rados client
connects to is the external/public messenger. It is also used by the OSD
to talk to the monitor.

The nonce of the external address of an OSD does not change after it's
up: it is always the pid of the ceph-osd process. And the (peer) address
of the booting OSD collected by the monitor comes from the connection's
peer_addr field, which is set when the monitor accepts the connection
from the OSD; see the STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
AsyncConnection::_process_connection().

But there is a chance that an OSD is restarted and fails to bind its
external messenger to the specified port. In that case, ceph-osd will
try another port, but keep the nonce the same. When it comes to the
other messengers used by ceph-osd, their nonces increase by 1000000
every time they rebind. That's why "ceph osd thrash" can change the
nonces of the cluster_addr, heartbeat_back_addr and heartbeat_front_addr.
The PR https://github.com/ceph/ceph/pull/11706 actually changes the
behavior of these three messengers, and it has nothing to do with the
external messenger to which the ceph CLI client is connecting.
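In pseudo-code terms, roughly (a sketch with placeholder names such as
cluster_msgr/hb_back_msgr/used_ports, not the actual source):
====
// Only the non-public messengers are rebound when an OSD is e.g. wrongly
// marked down; the public/external address keeps nonce == pid.
void on_marked_down_sketch() {
  // public/external messenger: NOT rebound, nonce stays the pid, so the
  // address the ceph CLI and the monitor use does not change.

  // cluster + heartbeat messengers: rebound, each rebind bumps that
  // messenger's nonce by 1000000 -- this is what changes cluster_addr,
  // hb_back_addr and hb_front_addr after "ceph osd thrash".
  cluster_msgr->rebind(used_ports);   // nonce += 1000000 inside rebind()
  hb_back_msgr->rebind(used_ports);
  hb_front_msgr->rebind(used_ports);
}
====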

So you might want to check:
1) how/why the nonce of the messenger in MonClient is 1000000 + $pid,
2) while the nonce of the same messenger is $pid when the ceph CLI
connects to it.

My PR https://github.com/ceph/ceph/pull/11804 is more of a cleanup;
it avoids setting the nonce before the rebind finishes. And I tried
with your reproducer on my Linux box, no luck =(

>
> It looks like I've tackled most of the EventKqueue forking trouble, by
> keeping a full administration.
> Got to make some more FreeBSD tests to see if I'm actually solving what
> I think is the problem. :)
>
> --WjW



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-08 10:03                 ` kefu chai
@ 2016-12-08 12:30                   ` Willem Jan Withagen
  2016-12-09  8:59                     ` kefu chai
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-08 12:30 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 8-12-2016 11:03, kefu chai wrote:
> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>> Question here is:
>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>   ceph-osd.
>>>
>>> The monitor has learned about specific IP addresses/nonces/etc via
>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>> monitor command messages, generally invoked as part of the init
>>> scripts. Maps are generated entirely on the monitor. :)
>>>
>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>> and ceph-mon receives in the maps?
>>>> Just to make sure that it is clear where the problem occurs.
>>>
>>> You should be able to see the info going in and out by bumping the
>>> debug levels up — every message's "print" function is invoked when
>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>> add them easily if you need to.
>>
>> Hi Greg,
>>
>> Thanx for the answer....
>>
>> I've got debug_ms already pumped up all the way to 20.
>> So I do get to see what addresses are selected during bind. But still
>> they do not end up at the MON, and 'ceph osd dump' reports:
>>         :/0
>> as bind address.
>>
>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
> 
> there are multiple messengers used by ceph-osd, the one connected by
> rados client is the external/public messenger. it is also used by osd
> to talk with the monitor.
> 
> the nonce of the external address of an OSD does not change after it's
> up: it's always the pid of ceph-osd process. and the (peer) address of
> the booting OSD collected by monitor comes from the connection's
> peer_addr field, which is set when the monitor accepts the connection
> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
> AsyncConnection::_process_connection().
> 
> but there are chances that an OSD is restarted and fail to bind its
> external messenger to the specified the port. in that case, ceph-osd
> will try with another port, but keep the nonce the same. but when it
> comes to other messengers used by ceph-osd, their nonces increase by
> 1000000 every time they rebind. that's why "ceph osd thrash" can
> change the nonces of the cluster_addr, heartbeat_back_addr and
> heartbeat_front_addr. the PR of
> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
> of the messengers of these three messengers. and it has nothing to do
> with the external messenger to which the ceph cli client is
> connecting.
> 
> so you might want to check
> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
> 2) while the nonce of the same messenger is $pid when the ceph cli
> connects to it.
> 
> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
> it avoids setting the nonce before the rebind finishes. and i tried
> with your producer on my linux box, no luck =(

Right,

You gave me a lot of things to think about, and to start figuring out.

And you are right that something really bad needs to happen to an OSD to
get into this state. But that is what the tests actually do: they just
down/up or kill OSDs and restart them.

And from previous discussions I "learned" that if the process doesn't
die but needs to rebind on the port, the OSD stays at the same port but
increments the nonce to indicate that it is a fresh connection. And log
printing actually shows that the code is going through a rebind.

Now the bad thing is that the Linux and FreeBSD logs do comparable things
with my (small) change to the setting of addr. And the nonce is indeed
incremented, and that increment is actually picked up by all Ceph components.

But if I keep the old code, the nonces are running out of sync.
Your patch doesn't hurt, but it also doesn't help: I still get
mismatched nonces.

But as I said at the start: lots of things to consider again.

--WjW



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-08 12:30                   ` Willem Jan Withagen
@ 2016-12-09  8:59                     ` kefu chai
  2016-12-09  9:22                       ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: kefu chai @ 2016-12-09  8:59 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 8-12-2016 11:03, kefu chai wrote:
>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>> Question here is:
>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>   ceph-osd.
>>>>
>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>> monitor command messages, generally invoked as part of the init
>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>
>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>> and ceph-mon receives in the maps?
>>>>> Just to make sure that it is clear where the problem occurs.
>>>>
>>>> You should be able to see the info going in and out by bumping the
>>>> debug levels up — every message's "print" function is invoked when
>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>> add them easily if you need to.
>>>
>>> Hi Greg,
>>>
>>> Thanx for the answer....
>>>
>>> I've got debug_ms already pumped up all the way to 20.
>>> So I do get to see what addresses are selected during bind. But still
>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>         :/0
>>> as bind address.
>>>
>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>
>> there are multiple messengers used by ceph-osd, the one connected by
>> rados client is the external/public messenger. it is also used by osd
>> to talk with the monitor.
>>
>> the nonce of the external address of an OSD does not change after it's
>> up: it's always the pid of ceph-osd process. and the (peer) address of
>> the booting OSD collected by monitor comes from the connection's
>> peer_addr field, which is set when the monitor accepts the connection
>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>> AsyncConnection::_process_connection().
>>
>> but there are chances that an OSD is restarted and fail to bind its
>> external messenger to the specified the port. in that case, ceph-osd
>> will try with another port, but keep the nonce the same. but when it
>> comes to other messengers used by ceph-osd, their nonces increase by
>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>> change the nonces of the cluster_addr, heartbeat_back_addr and
>> heartbeat_front_addr. the PR of
>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>> of the messengers of these three messengers. and it has nothing to do
>> with the external messenger to which the ceph cli client is
>> connecting.
>>
>> so you might want to check
>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>> 2) while the nonce of the same messenger is $pid when the ceph cli
>> connects to it.
>>
>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>> it avoids setting the nonce before the rebind finishes. and i tried
>> with your producer on my linux box, no luck =(
>
> Right,
>
> You gave me a lot of things to think about, and to start figuring out.
>
> And you are right that something really bad needs to happen to an OSD to
> get in this state. But that is what the tests actually do: They just
> down/up or kill OSDs and restart.
>
> And from previous discussions I "learned" that if the process doesn't
> die but needs to rebind on the port, the OSD stays at the same port but
> increments the nonce to indicate that it is a fresh connection. And log

the external messenger should *not* increment its nonce.

> printing actually shows that the code is going thru a rebind.

and it should *not* go through rebind().

>
> Now the bad thing is that the Linux and FreeBSD log do comparable things
> with my (small) change to the setting of addr. And the nonce is indeed
> incremented, which increment is actually picked up by all ceph components.
>
> But if I keep the old code, the nonces are running out of sync.
> Your patch doesn't hurt, but it also doesn't help: I still get
> mismatched nonces.

yeah, that's expected.

>
> But like I started: lots of things again to consider.
>
> --WjW
>
>



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-09  8:59                     ` kefu chai
@ 2016-12-09  9:22                       ` Willem Jan Withagen
  2016-12-09 12:02                         ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-09  9:22 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 9-12-2016 09:59, kefu chai wrote:
> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 8-12-2016 11:03, kefu chai wrote:
>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>> Question here is:
>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>   ceph-osd.
>>>>>
>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>> monitor command messages, generally invoked as part of the init
>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>
>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>> and ceph-mon receives in the maps?
>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>
>>>>> You should be able to see the info going in and out by bumping the
>>>>> debug levels up — every message's "print" function is invoked when
>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>> add them easily if you need to.
>>>>
>>>> Hi Greg,
>>>>
>>>> Thanx for the answer....
>>>>
>>>> I've got debug_ms already pumped up all the way to 20.
>>>> So I do get to see what addresses are selected during bind. But still
>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>         :/0
>>>> as bind address.
>>>>
>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>
>>> there are multiple messengers used by ceph-osd, the one connected by
>>> rados client is the external/public messenger. it is also used by osd
>>> to talk with the monitor.
>>>
>>> the nonce of the external address of an OSD does not change after it's
>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>> the booting OSD collected by monitor comes from the connection's
>>> peer_addr field, which is set when the monitor accepts the connection
>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>> AsyncConnection::_process_connection().
>>>
>>> but there are chances that an OSD is restarted and fail to bind its
>>> external messenger to the specified the port. in that case, ceph-osd
>>> will try with another port, but keep the nonce the same. but when it
>>> comes to other messengers used by ceph-osd, their nonces increase by
>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>> heartbeat_front_addr. the PR of
>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>> of the messengers of these three messengers. and it has nothing to do
>>> with the external messenger to which the ceph cli client is
>>> connecting.
>>>
>>> so you might want to check
>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>> connects to it.
>>>
>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>> it avoids setting the nonce before the rebind finishes. and i tried
>>> with your producer on my linux box, no luck =(
>>
>> Right,
>>
>> You gave me a lot of things to think about, and to start figuring out.
>>
>> And you are right that something really bad needs to happen to an OSD to
>> get in this state. But that is what the tests actually do: They just
>> down/up or kill OSDs and restart.
>>
>> And from previous discussions I "learned" that if the process doesn't
>> die but needs to rebind on the port, the OSD stays at the same port but
>> increments the nonce to indicate that it is a fresh connection. And log
> 
> the external messenger should *not* increment its nonce.
> 
>> printing actually shows that the code is going thru a rebind.
> 
> and it should *not* go through rebind().

I have to dig through the test script, but as far as I can tell just
about all of the daemons get restarted in this test.

So when would I get a rebind?

I thought it was because I had an OSD incorrectly marked down:
./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
I found this in the logs, and then I got a rebind.

Wido suggested looking for this message, in reply to my question why my
OSDs were not coming back up after a good shake-up of all OSDs and MONs.

And that is one of the tests in cephtool-test-mon.sh.
Right before the 'ceph tell osd.0 version' there are tests like:
  ceph osd set noup
  ceph osd down 0
  ceph osd dump | grep 'osd.0 down'
  ceph osd unset noup
and
  ceph osd reweight osd.0 .5
  ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
  ceph osd out 0
  ceph osd in 0
  ceph osd dump | grep ^osd.0 | grep 'weight 0.5'


>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>> with my (small) change to the setting of addr. And the nonce is indeed
>> incremented, which increment is actually picked up by all ceph components.

So now I have 2 challenges??

1) Find out why I get a rebind, where you think I should not.
   For that I'll have to collect all maltreatment that is done in
   cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
   to see what is up.
2) If we get a rebind...
   Why doesn't the FreeBSD version end up with consistent nonces?

"Good thing" about the previous code was that I could tweak it, and at
least get it to Work for FreeBSD. Have not had the time to see if I
could again with this code....

--WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-09  9:22                       ` Willem Jan Withagen
@ 2016-12-09 12:02                         ` Willem Jan Withagen
  2016-12-10  0:39                           ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-09 12:02 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 9-12-2016 10:22, Willem Jan Withagen wrote:
> On 9-12-2016 09:59, kefu chai wrote:
>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> On 8-12-2016 11:03, kefu chai wrote:
>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>> Question here is:
>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>   ceph-osd.
>>>>>>
>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>> monitor command messages, generally invoked as part of the init
>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>
>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>> and ceph-mon receives in the maps?
>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>
>>>>>> You should be able to see the info going in and out by bumping the
>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>> add them easily if you need to.
>>>>>
>>>>> Hi Greg,
>>>>>
>>>>> Thanx for the answer....
>>>>>
>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>> So I do get to see what addresses are selected during bind. But still
>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>         :/0
>>>>> as bind address.
>>>>>
>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>
>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>> rados client is the external/public messenger. it is also used by osd
>>>> to talk with the monitor.
>>>>
>>>> the nonce of the external address of an OSD does not change after it's
>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>> the booting OSD collected by monitor comes from the connection's
>>>> peer_addr field, which is set when the monitor accepts the connection
>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>> AsyncConnection::_process_connection().
>>>>
>>>> but there are chances that an OSD is restarted and fail to bind its
>>>> external messenger to the specified the port. in that case, ceph-osd
>>>> will try with another port, but keep the nonce the same. but when it
>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>> heartbeat_front_addr. the PR of
>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>> of the messengers of these three messengers. and it has nothing to do
>>>> with the external messenger to which the ceph cli client is
>>>> connecting.
>>>>
>>>> so you might want to check
>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>> connects to it.
>>>>
>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>> with your producer on my linux box, no luck =(
>>>
>>> Right,
>>>
>>> You gave me a lot of things to think about, and to start figuring out.
>>>
>>> And you are right that something really bad needs to happen to an OSD to
>>> get in this state. But that is what the tests actually do: They just
>>> down/up or kill OSDs and restart.
>>>
>>> And from previous discussions I "learned" that if the process doesn't
>>> die but needs to rebind on the port, the OSD stays at the same port but
>>> increments the nonce to indicate that it is a fresh connection. And log
>>
>> the external messenger should *not* increment its nonce.
>>
>>> printing actually shows that the code is going thru a rebind.
>>
>> and it should *not* go through rebind().
> 
> I have to dig thru the testscript but as far as I can tell just about
> all of the daemons are getting reboots in this test.
> 
> So when would I get a rebind?
> 
> I thought it was because I had an OSD incorrectly marked down:
> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
> This I found in the logs, and then I got a rebind.
> 
> Wido suggested looking for this message, on my question why my OSDs were
> not getting UP after a good hustle with all OSDs and MONs.
> 
> And that is one of the tests in cephtool-test-mon.sh.
> right before the 'ceph tell osd.0 version' there are tests like:
>   ceph osd set noup
>   ceph osd down 0
>   ceph osd dump | grep 'osd.0 down'
>   ceph osd unset noup
> and
>   ceph osd reweight osd.0 .5
>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>   ceph osd out 0
>   ceph osd in 0
>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
> 
> 
>>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>>> with my (small) change to the setting of addr. And the nonce is indeed
>>> incremented, which increment is actually picked up by all ceph components.
> 
> So now I have 2 challenges??
> 
> 1) Find out why I get a rebind, where you think I should not.
>    For that I'll have to collect all maltreatment that is done in
>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>    to see what is up.
> 2) If we get a rebind...
>    Why doesn't the FreeBSD version end up with consistent noncees.
> 
> "Good thing" about the previous code was that I could tweak it, and at
> least get it to Work for FreeBSD. Have not had the time to see if I
> could again with this code....

So the smallest sequence I can find that demonstrates the problem:
function test_mon_rebind()
{
  ceph osd set noup
  ceph osd down 0
  ceph osd dump | grep 'osd.0 down'
  ceph osd unset noup
  max_run=1000
  for ((i=0; i < $max_run; i++)); do
    if ! ceph osd dump | grep 'osd.0 up'; then
      echo "waiting for osd.0 to come back up ($i/$max_run)"
      sleep 1
    else
      break
    fi
  done
  ceph osd dump | grep 'osd.0 up'

  for id in `ceph osd ls` ; do
    retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
  done
}

Which matches what I thought I knew:
  OSD down => up => rebind
which follows from the log, where the OSD complains about being marked
down incorrectly.
Search for
   log_channel(cluster) log [WRN] : map e8 wrongly marked me down
in the osd.0.log.

--WjW


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-09 12:02                         ` Willem Jan Withagen
@ 2016-12-10  0:39                           ` Willem Jan Withagen
  2016-12-13  1:02                             ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-10  0:39 UTC (permalink / raw)
  To: Willem Jan Withagen, kefu chai
  Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 9-12-2016 13:02, Willem Jan Withagen wrote:
> On 9-12-2016 10:22, Willem Jan Withagen wrote:
>> On 9-12-2016 09:59, kefu chai wrote:
>>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>> On 8-12-2016 11:03, kefu chai wrote:
>>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>>> Question here is:
>>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>>   ceph-osd.
>>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>>> monitor command messages, generally invoked as part of the init
>>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>>
>>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>>> and ceph-mon receives in the maps?
>>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>> You should be able to see the info going in and out by bumping the
>>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>>> add them easily if you need to.
>>>>>> Hi Greg,
>>>>>>
>>>>>> Thanx for the answer....
>>>>>>
>>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>>> So I do get to see what addresses are selected during bind. But still
>>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>>         :/0
>>>>>> as bind address.
>>>>>>
>>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>>> rados client is the external/public messenger. it is also used by osd
>>>>> to talk with the monitor.
>>>>>
>>>>> the nonce of the external address of an OSD does not change after it's
>>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>>> the booting OSD collected by monitor comes from the connection's
>>>>> peer_addr field, which is set when the monitor accepts the connection
>>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>>> AsyncConnection::_process_connection().
>>>>>
>>>>> but there are chances that an OSD is restarted and fails to bind its
>>>>> external messenger to the specified port. in that case, ceph-osd
>>>>> will try another port, but keep the nonce the same. but when it
>>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>>> heartbeat_front_addr. the PR of
>>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>>> of the messengers of these three messengers. and it has nothing to do
>>>>> with the external messenger to which the ceph cli client is
>>>>> connecting.
>>>>>
>>>>> so you might want to check
>>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>>> connects to it.
>>>>>
>>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>>> with your reproducer on my linux box, no luck =(
>>>> Right,
>>>>
>>>> You gave me a lot of things to think about, and to start figuring out.
>>>>
>>>> And you are right that something really bad needs to happen to an OSD to
>>>> get in this state. But that is what the tests actually do: They just
>>>> down/up or kill OSDs and restart.
>>>>
>>>> And from previous discussions I "learned" that if the process doesn't
>>>> die but needs to rebind on the port, the OSD stays at the same port but
>>>> increments the nonce to indicate that it is a fresh connection. And log
>>> the external messenger should *not* increment its nonce.
>>>
>>>> printing actually shows that the code is going through a rebind.
>>> and it should *not* go through rebind().
>> I have to dig through the test script but as far as I can tell just about
>> all of the daemons get restarted in this test.
>>
>> So when would I get a rebind?
>>
>> I thought it was because I had an OSD incorrectly marked down:
>> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
>> This I found in the logs, and then I got a rebind.
>>
>> Wido suggested looking for this message when I asked why my OSDs were
>> not coming back UP after a good hustle with all OSDs and MONs.
>>
>> And that is one of the tests in cephtool-test-mon.sh.
>> right before the 'ceph tell osd.0 version' there are tests like:
>>   ceph osd set noup
>>   ceph osd down 0
>>   ceph osd dump | grep 'osd.0 down'
>>   ceph osd unset noup
>> and
>>   ceph osd reweight osd.0 .5
>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>   ceph osd out 0
>>   ceph osd in 0
>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>
>>
>>>> Now the bad thing is that the Linux and FreeBSD logs do comparable things
>>>> with my (small) change to the setting of addr. And the nonce is indeed
>>>> incremented, which increment is actually picked up by all ceph components.
>> So now I have 2 challenges??
>>
>> 1) Find out why I get a rebind, where you think I should not.
>>    For that I'll have to collect all maltreatment that is done in
>>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>>    to see what is up.
>> 2) If we get a rebind...
>>    Why doesn't the FreeBSD version end up with consistent nonces?
>>
>> "Good thing" about the previous code was that I could tweak it, and at
>> least get it to Work for FreeBSD. Have not had the time to see if I
>> could again with this code....
> So the smallest sequence I can find that demonstrates the problem:
> function test_mon_rebind()
> {
>   ceph osd set noup
>   ceph osd down 0
>   ceph osd dump | grep 'osd.0 down'
>   ceph osd unset noup
>   max_run=1000
>   for ((i=0; i < $max_run; i++)); do
>     if ! ceph osd dump | grep 'osd.0 up'; then
>       echo "waiting for osd.0 to come back up ($i/$max_run)"
>       sleep 1
>     else
>       break
>     fi
>   done
>   ceph osd dump | grep 'osd.0 up'
>
>   for id in `ceph osd ls` ; do
>     retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
>   done
> }
>
> Which matches what I thought I knew:
>   OSD down => up => rebind
> This follows from the log, where the OSD complains about being marked
> down incorrectly. Search for
>    log_channel(cluster) log [WRN] : map e8 wrongly marked me down
> in the osd.0.log.
>

I ran the code snippet, and it DOES generate a rebind even on Linux
CentOS 7.
So I expect that the rebind is indeed called for?

Also note that at least one nonce is incremented, AND new ports are
selected.
Ports 6801, 6802 and 6803 are avoided, and it looks like already bound
ports are avoided as well.
(obvious, since binding on them would not work)
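
To make the nonce bookkeeping concrete, here is a tiny sketch of what a
rebind does to the advertised address. The struct and helper below are
my own illustration, not the AsyncMessenger code; only the "+ 1000000"
step and the example numbers are taken from the Linux log below.

#include <cstdint>
#include <iostream>
#include <string>

struct EntityAddr {
  std::string ip;
  int port;
  uint64_t nonce;
  std::string str() const {
    return ip + ":" + std::to_string(port) + "/" + std::to_string(nonce);
  }
};

// On a rebind the messenger keeps its IP, moves to a port outside the
// avoid set, and bumps the nonce so peers can tell the fresh instance
// from the old one.
EntityAddr rebound(const EntityAddr& old_addr, int new_port) {
  return EntityAddr{old_addr.ip, new_port, old_addr.nonce + 1000000};
}

int main() {
  EntityAddr before{"127.0.0.1", 6801, 22932};
  EntityAddr after = rebound(before, 6812);
  std::cout << before.str() << " -> " << after.str() << "\n";
  // prints 127.0.0.1:6801/22932 -> 127.0.0.1:6812/1022932,
  // matching the Linux log below
  return 0;
}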

Next is the FreeBSD log, and that one is rather seriously wrong since
the nonces are 0, which must be incorrect.

Last is the log of the current master code with the patch I made that
uses msgr->my_inst.addr instead of addr.
IMHO a change that should be semantically equivalent?
  return bind(msgr->my_inst.addr, new_avoid);
  // return bind(addr, new_avoid);
One thing that is not working there is the port avoidance. That I have
to look into, since there is no reason why it should not work under
FreeBSD.

Linux log:
2016-12-10 00:24:26.971102 7fdf0674e700  0 log_channel(cluster) log
[WRN] : map e20 wrongly marked me down
2016-12-10 00:24:26.971153 7fdf0674e700  1 -- 127.0.0.1:6801/22932
rebind rebind avoid 6801,6802,6803
2016-12-10 00:24:26.971440 7fdf0674e700  1 -- 127.0.0.1:6801/22932
shutdown_connections
2016-12-10 00:24:26.971600 7fdf0674e700  1  Processor -- rebind rebind
avoid 6801,6802,6803
2016-12-10 00:24:26.972454 7fdf0674e700  1  Processor -- bind bind
my_inst.addr is 127.0.0.1:6812/1022932
2016-12-10 00:24:26.972488 7fdf0674e700  1  Processor -- start
2016-12-10 00:24:26.972520 7fdf0674e700  1 -- 127.0.0.1:6802/22932
rebind rebind avoid 6801,6802,6803
2016-12-10 00:24:26.972564 7fdf0674e700  1 -- 127.0.0.1:6802/22932
shutdown_connections
2016-12-10 00:24:26.972599 7fdf0674e700  1  Processor -- rebind rebind
avoid 6801,6802,6803
2016-12-10 00:24:26.973286 7fdf0674e700  1 -- 127.0.0.1:0/22932
learned_addr learned my addr 127.0.0.1:0/22932
2016-12-10 00:24:26.973311 7fdf0674e700  1  Processor -- bind bind
my_inst.addr is 127.0.0.1:6813/1022932
2016-12-10 00:24:26.973318 7fdf0674e700  1  Processor -- start
2016-12-10 00:24:26.973345 7fdf0674e700  1 -- 127.0.0.1:6803/22932
rebind rebind avoid 6801,6802,6803
2016-12-10 00:24:26.973385 7fdf0674e700  1 -- 127.0.0.1:6803/22932
shutdown_connections
2016-12-10 00:24:26.973408 7fdf0674e700  1  Processor -- rebind rebind
avoid 6801,6802,6803
2016-12-10 00:24:26.974099 7fdf0674e700  1 -- 127.0.0.1:0/22932
learned_addr learned my addr 127.0.0.1:0/22932
2016-12-10 00:24:26.974136 7fdf0674e700  1  Processor -- bind bind
my_inst.addr is 127.0.0.1:6814/1022932
2016-12-10 00:24:26.974144 7fdf0674e700  1  Processor -- start
2016-12-10 00:24:26.974172 7fdf0674e700  1 -- 127.0.0.1:0/22932
shutdown_connections
2016-12-10 00:24:26.974249 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6806/23094 conn(0x7fdf1f21a000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down
2016-12-10 00:24:26.974266 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6807/23094 conn(0x7fdf1f302000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down
2016-12-10 00:24:26.974274 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6810/23257 conn(0x7fdf1f303800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down
2016-12-10 00:24:26.974281 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6811/23257 conn(0x7fdf1f3aa000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down

The FreeBSD variant in current code looks like:
2016-12-10 00:58:45.612080 bb96d80  0 log_channel(cluster) log [WRN] :
map e19 wrongly marked me down
2016-12-10 00:58:45.612128 bb96d80  1 -- 127.0.0.1:6801/0 rebind Async
rebind avoid 6801,6802,6803
2016-12-10 00:58:45.612175 bb96d80  1 -- 127.0.0.1:6801/0
shutdown_connections
2016-12-10 00:58:45.612331 bb96d80  1 -- 127.0.0.1:6800/0 _finish_bind
bind my_inst.addr is 127.0.0.1:6800/0
2016-12-10 00:58:45.612339 bb96d80  1  Processor -- start
2016-12-10 00:58:45.612366 bb96d80  1 -- 127.0.0.1:6802/0 rebind Async
rebind avoid 6801,6802,6803
2016-12-10 00:58:45.612399 bb96d80  1 -- 127.0.0.1:6802/0
shutdown_connections
2016-12-10 00:58:45.612476 bb96d80  1 -- 127.0.0.1:0/0 learned_addr
learned my addr 127.0.0.1:0/0
2016-12-10 00:58:45.612487 bb96d80  1 -- 127.0.0.1:6804/0 _finish_bind
bind my_inst.addr is 127.0.0.1:6804/0
2016-12-10 00:58:45.612491 bb96d80  1  Processor -- start
2016-12-10 00:58:45.612520 bb96d80  1 -- 127.0.0.1:6803/0 rebind Async
rebind avoid 6801,6802,6803
2016-12-10 00:58:45.612553 bb96d80  1 -- 127.0.0.1:6803/0
shutdown_connections
2016-12-10 00:58:45.612659 bb96d80  1 -- 127.0.0.1:0/0 learned_addr
learned my addr 127.0.0.1:0/0
2016-12-10 00:58:45.612670 bb96d80  1 -- 127.0.0.1:6805/0 _finish_bind
bind my_inst.addr is 127.0.0.1:6805/0
2016-12-10 00:58:45.612674 bb96d80  1  Processor -- start
2016-12-10 00:58:45.612696 bb96d80  1 -- 127.0.0.1:0/40923
shutdown_connections
2016-12-10 00:58:45.612742 bb99600  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6811/0 conn(0xbd01800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 00:58:45.612926 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6806/0 conn(0xbcc4000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 00:58:45.612967 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6807/0 conn(0xbcc5800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 00:58:45.612983 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6810/0 conn(0xbd04800 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).mark_down:2365
2016-12-10 00:58:45.613011 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6811/0 conn(0xbcda000 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0
l=1).mark_down:2365

And then the current code:
2016-12-10 01:14:01.951312 bb96d80  0 log_channel(cluster) log [WRN] :
map e17 wrongly marked me down
2016-12-10 01:14:01.951431 bb96d80  1 -- 127.0.0.1:6801/43637 rebind
Async rebind avoid 6801,6802,6803
2016-12-10 01:14:01.951477 bb96d80  1 -- 127.0.0.1:6801/43637
shutdown_connections
2016-12-10 01:14:01.951554 bb96d80  1  Processor -- rebind Proc rebind
avoid 6801,6802,6803
2016-12-10 01:14:01.951629 bb96d80  1  Processor -- bind:156 end bind
my_inst.addr is 127.0.0.1:6801/1043637
2016-12-10 01:14:01.951637 bb96d80  1  Processor -- start
2016-12-10 01:14:01.951660 bb96d80  1 -- 127.0.0.1:6802/43637 rebind
Async rebind avoid 6801,6802,6803
2016-12-10 01:14:01.951700 bb96d80  1 -- 127.0.0.1:6802/43637
shutdown_connections
2016-12-10 01:14:01.951727 bb96d80  1  Processor -- rebind Proc rebind
avoid 6801,6802,6803
2016-12-10 01:14:01.951762 bb96d80  1 -- 127.0.0.1:6802/1043637
learned_addr learned my addr 127.0.0.1:6802/1043637
2016-12-10 01:14:01.951773 bb96d80  1  Processor -- bind:156 end bind
my_inst.addr is 127.0.0.1:6802/1043637
2016-12-10 01:14:01.951777 bb96d80  1  Processor -- start
2016-12-10 01:14:01.951797 bb96d80  1 -- 127.0.0.1:6803/43637 rebind
Async rebind avoid 6801,6802,6803
2016-12-10 01:14:01.951827 bb96d80  1 -- 127.0.0.1:6803/43637
shutdown_connections
2016-12-10 01:14:01.951854 bb96d80  1  Processor -- rebind Proc rebind
avoid 6801,6802,6803
2016-12-10 01:14:01.951889 bb96d80  1 -- 127.0.0.1:6803/1043637
learned_addr learned my addr 127.0.0.1:6803/1043637
2016-12-10 01:14:01.951900 bb96d80  1  Processor -- bind:156 end bind
my_inst.addr is 127.0.0.1:6803/1043637
2016-12-10 01:14:01.951904 bb96d80  1  Processor -- start
2016-12-10 01:14:01.951925 bb96d80  1 -- 127.0.0.1:0/43637
shutdown_connections
2016-12-10 01:14:01.952008 bb99600  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6811/43663 conn(0xbd62000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 01:14:01.952220 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6806/43650 conn(0xbc6e000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 01:14:01.952246 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6807/43650 conn(0xbc6f800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 01:14:01.952260 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6810/43663 conn(0xbc72800 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0  l=1).mark_down:2365
2016-12-10 01:14:01.952289 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6811/43663 conn(0xbd65000 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).mark_down:2365

--WjW





* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-10  0:39                           ` Willem Jan Withagen
@ 2016-12-13  1:02                             ` Willem Jan Withagen
  2016-12-13  1:34                               ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-13  1:02 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 10-12-2016 01:39, Willem Jan Withagen wrote:
> 
> I ran the code snippet, and it DOES generate a rebind even on Linux
> CentOS 7.
> So I expect that the rebind is indeed called for?
>
> Also note that at least one nonce is incremented, AND new ports are
> selected.
> Ports 6801, 6802 and 6803 are avoided, and it looks like already bound
> ports are avoided as well.
> (obvious, since binding on them would not work)
>
> Next is the FreeBSD log, and that one is rather seriously wrong since
> the nonces are 0, which must be incorrect.
>
> Last is the log of the current master code with the patch I made that
> uses msgr->my_inst.addr instead of addr.
> IMHO a change that should be semantically equivalent?
>   return bind(msgr->my_inst.addr, new_avoid);
>   // return bind(addr, new_avoid);
> One thing that is not working there is the port avoidance. That I have
> to look into, since there is no reason why it should not work under
> FreeBSD.

I have tested with the new code Kefu submitted to master this afternoon:
The Linux variant gives:
7f36b1dd1700  1 -- 127.0.0.1:6800/13799 <== mon.2 127.0.0.1:8204/0 38
==== osd_map(18..21 src has 1..21) v3 ==== 1012+0+0 (424389225 0 0)
0x7f36cdcc7cc0 con 0x7f36cdb24000
7f36b5ac6700  0 log_channel(cluster) log [WRN] : map e21 wrongly marked
me down
7f36b5ac6700  1 -- 127.0.0.1:6801/13799 rebind rebind avoid 6801,6802,6803
7f36b5ac6700  1 -- 127.0.0.1:6801/13799 shutdown_connections
7f36b5ac6700  1 -- 127.0.0.1:6812/1013799 _finish_bind bind my_inst.addr
is 127.0.0.1:6812/1013799
7f36b5ac6700  1  Processor -- start

Note that the newly allocated channel is on port 6812 with an incremented
nonce, and the channel on port 6800 is not touched, probably because it
is still open and is not the channel being rebound.

Now the FreeBSD version:
bb99f80  1 -- 127.0.0.1:6800/6449 <== mon.1 127.0.0.1:7203/0 158 ====
osd_map(172..175 src has 1..175) v3 ==== 1045+0+0 (1598708051 0 0)
0xc08eb40 con 0xbbca000
bb98d80  0 log_channel(cluster) log [WRN] : map e175 wrongly marked me down
bb98d80  1 -- 127.0.0.1:6801/6449 rebind rebind avoid 6801,6802,6803
bb98d80  1 -- 127.0.0.1:6801/6449 shutdown_connections
bb98d80  1 -- 127.0.0.1:6800/1006449 _finish_bind bind my_inst.addr is
127.0.0.1:6800/1006449
bb98d80  1  Processor -- start

And here, in the FreeBSD case, the new channel is allocated on port
6800, as if that port were available. Should it not also be marked in
avoid_ports?
That port is still in use, as the first log line above shows, yet
FreeBSD does not seem to have a problem binding to it again.
So a channel ends up listening on 6800, but with a new nonce, and that
could be the reason that ceph and rados complain that they cannot talk
to osd.0.

So why is 6800 not in the list of ports to avoid?
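
One way to take Ceph out of the equation is a standalone bind test. The
snippet below is my own experiment, not Ceph code; the use of
SO_REUSEADDR is an assumption for the experiment, not a statement about
what the messenger actually sets. It checks whether a second socket in
the same process can bind to a port that already has an active
listener:

#include <arpa/inet.h>
#include <cstdint>
#include <cstdio>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_listener(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int on = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
  sockaddr_in sa{};
  sa.sin_family = AF_INET;
  sa.sin_port = htons(port);
  inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
  if (bind(fd, (sockaddr*)&sa, sizeof(sa)) < 0) {
    perror("bind");
    close(fd);
    return -1;
  }
  listen(fd, 5);
  return fd;
}

int main() {
  int first  = make_listener(6800);  // plays the still-open public messenger
  int second = make_listener(6800);  // plays the rebinding messenger
  printf("first=%d second=%d\n", first, second);
  if (first >= 0) close(first);
  if (second >= 0) close(second);
  return 0;
}

On Linux I would expect the second bind to fail with EADDRINUSE; if it
succeeds on FreeBSD, that would explain how the rebind can end up on
6800 while the public messenger is still listening there. If it fails
on FreeBSD as well, the collision must come from somewhere else, for
instance the old socket already being closed by the time the rebind
runs.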

--WjW




* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-13  1:02                             ` Willem Jan Withagen
@ 2016-12-13  1:34                               ` Willem Jan Withagen
  0 siblings, 0 replies; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-13  1:34 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 13-12-2016 02:02, Willem Jan Withagen wrote:
> On 10-12-2016 01:39, Willem Jan Withagen wrote:
>> On 9-12-2016 13:02, Willem Jan Withagen wrote:
>>> On 9-12-2016 10:22, Willem Jan Withagen wrote:
>>>> On 9-12-2016 09:59, kefu chai wrote:
>>>>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>>> On 8-12-2016 11:03, kefu chai wrote:
>>>>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>>>>> Question here is:
>>>>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>>>>   ceph-osd.
>>>>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>>>>> monitor command messages, generally invoked as part of the init
>>>>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>>>>
>>>>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>>>>> and ceph-mon receives in the maps?
>>>>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>>>> You should be able to see the info going in and out by bumping the
>>>>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>>>>> add them easily if you need to.
>>>>>>>> Hi Greg,
>>>>>>>>
>>>>>>>> Thanx for the answer....
>>>>>>>>
>>>>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>>>>> So I do get to see what addresses are selected during bind. But still
>>>>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>>>>         :/0
>>>>>>>> as bind address.
>>>>>>>>
>>>>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>>>>> rados client is the external/public messenger. it is also used by osd
>>>>>>> to talk with the monitor.
>>>>>>>
>>>>>>> the nonce of the external address of an OSD does not change after it's
>>>>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>>>>> the booting OSD collected by monitor comes from the connection's
>>>>>>> peer_addr field, which is set when the monitor accepts the connection
>>>>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>>>>> AsyncConnection::_process_connection().
>>>>>>>
>>>>>>> but there are chances that an OSD is restarted and fail to bind its
>>>>>>> external messenger to the specified the port. in that case, ceph-osd
>>>>>>> will try with another port, but keep the nonce the same. but when it
>>>>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>>>>> heartbeat_front_addr. the PR of
>>>>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>>>>> of the messengers of these three messengers. and it has nothing to do
>>>>>>> with the external messenger to which the ceph cli client is
>>>>>>> connecting.
>>>>>>>
>>>>>>> so you might want to check
>>>>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>>>>> connects to it.
>>>>>>>
>>>>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>>>>> with your producer on my linux box, no luck =(
>>>>>> Right,
>>>>>>
>>>>>> You gave me a lot of things to think about, and to start figuring out.
>>>>>>
>>>>>> And you are right that something really bad needs to happen to an OSD to
>>>>>> get in this state. But that is what the tests actually do: They just
>>>>>> down/up or kill OSDs and restart.
>>>>>>
>>>>>> And from previous discussions I "learned" that if the process doesn't
>>>>>> die but needs to rebind on the port, the OSD stays at the same port but
>>>>>> increments the nonce to indicate that it is a fresh connection. And log
>>>>> the external messenger should *not* increment its nonce.
>>>>>
>>>>>> printing actually shows that the code is going thru a rebind.
>>>>> and it should *not* go through rebind().
>>>> I have to dig thru the testscript but as far as I can tell just about
>>>> all of the daemons are getting reboots in this test.
>>>>
>>>> So when would I get a rebind?
>>>>
>>>> I thought it was because I had an OSD incorrectly marked down:
>>>> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
>>>> This I found in the logs, and then I got a rebind.
>>>>
>>>> Wido suggested looking for this message, on my question why my OSDs were
>>>> not getting UP after a good hustle with all OSDs and MONs.
>>>>
>>>> And that is one of the tests in cephtool-test-mon.sh.
>>>> right before the 'ceph tell osd.0 version' there are tests like:
>>>>   ceph osd set noup
>>>>   ceph osd down 0
>>>>   ceph osd dump | grep 'osd.0 down'
>>>>   ceph osd unset noup
>>>> and
>>>>   ceph osd reweight osd.0 .5
>>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>>   ceph osd out 0
>>>>   ceph osd in 0
>>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>>
>>>>
>>>>>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>>>>>> with my (small) change to the setting of addr. And the nonce is indeed
>>>>>> incremented, which increment is actually picked up by all ceph components.
>>>> So now I have 2 challenges??
>>>>
>>>> 1) Find out why I get a rebind, where you think I should not.
>>>>    For that I'll have to collect all maltreatment that is done in
>>>>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>>>>    to see what is up.
>>>> 2) If we get a rebind...
>>>>    Why doesn't the FreeBSD version end up with consistent noncees.
>>>>
>>>> "Good thing" about the previous code was that I could tweak it, and at
>>>> least get it to Work for FreeBSD. Have not had the time to see if I
>>>> could again with this code....
>>> So the smallest sequence I can find that demonstrates the problem:
>>> function test_mon_rebind()
>>> {
>>>   ceph osd set noup
>>>   ceph osd down 0
>>>   ceph osd dump | grep 'osd.0 down'
>>>   ceph osd unset noup
>>>   max_run=1000
>>>   for ((i=0; i < $max_run; i++)); do
>>>     if ! ceph osd dump | grep 'osd.0 up'; then
>>>       echo "waiting for osd.0 to come back up ($i/$max_run)"
>>>       sleep 1
>>>     else
>>>       break
>>>     fi
>>>   done
>>>   ceph osd dump | grep 'osd.0 up'
>>>
>>>   for id in `ceph osd ls` ; do
>>>     retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
>>>   done
>>> }
>>>
>>> Which matches with what I thought I knew:
>>>   OSD down => up => rebind
>>> which follows from the log where the osd complains about being marked
>>> down incorrectly.
>>> search for
>>>    log_channel(cluster) log [WRN] : map e8 wrongly marked me down
>>> in the osd.0.log
>>>
>>
>> I ran the code snippet, and it DOES generate a rebind even on Linux
>> Centos 7.
>> So I expect that the rebind action is called for?
>>
>> Also note that at least one nonce is incremented, AND new ports are
>> selected.
>> 6801,6802,6803 are avoided, and it looks like already connected ports
>> are avoided as well.
>> (obvious, since binding on them will not work)
>>
>> Next is the FreeBSD log, and that one is relatively seriously wrong
>> since nonce are 0 .
>> And that must be incorrect.
>>
>> Last is the log of the current master code with the patch I made of
>> msg->my_addr, versus addr.
>> IMHO a change that should be semantical equal?
>>   return bind(msgr->my_inst.addr, new_avoid);
>>   // return bind(addr, new_avoid);
>> One thing that is not working there is the port avoidance. That I have
>> to look into since there is no reason
>> why this should not work under FreeBSD.
> 
> I have tested with the new code Kefu submitted to master this afternoon:
> The Linux variant gives:
> 7f36b1dd1700  1 -- 127.0.0.1:6800/13799 <== mon.2 127.0.0.1:8204/0 38
> ==== osd_map(18..21 src has 1..21) v3 ==== 1012+0+0 (424389225 0 0)
> 0x7f36cdcc7cc0 con 0x7f36cdb24000
> 7f36b5ac6700  0 log_channel(cluster) log [WRN] : map e21 wrongly marked
> me down
> 7f36b5ac6700  1 -- 127.0.0.1:6801/13799 rebind rebind avoid 6801,6802,6803
> 7f36b5ac6700  1 -- 127.0.0.1:6801/13799 shutdown_connections
> 7f36b5ac6700  1 -- 127.0.0.1:6812/1013799 _finish_bind bind my_inst.addr
> is 127.0.0.1:6812/1013799
> 7f36b5ac6700  1  Processor -- start
> 
> Note that the new allocated channel is on port 6812 with an incremented
> nonce, and the channel on port 6800 is not touched. Probably because it
> is still open, and it is not the channel being rebound.
> 
> Now the FreeBSD version:
> bb99f80  1 -- 127.0.0.1:6800/6449 <== mon.1 127.0.0.1:7203/0 158 ====
> osd_map(172..175 src has 1..175) v3 ==== 1045+0+0 (1598708051 0 0)
> 0xc08eb40 con 0xbbca000
> bb98d80  0 log_channel(cluster) log [WRN] : map e175 wrongly marked me down
> bb98d80  1 -- 127.0.0.1:6801/6449 rebind rebind avoid 6801,6802,6803
> bb98d80  1 -- 127.0.0.1:6801/6449 shutdown_connections
> bb98d80  1 -- 127.0.0.1:6800/1006449 _finish_bind bind my_inst.addr is
> 127.0.0.1:6800/1006449
> bb98d80  1  Processor -- start
> 
> And here, in the FreeBSD case, the new channel is allocated on port
> 6800, as if that port were available. Should it not also be marked in
> avoid_ports?
> That port is still in use, as the first log line above shows, yet
> FreeBSD does not seem to have a problem binding to it again.
> So a channel ends up listening on 6800, but with a new nonce, and that
> could be the reason that ceph and rados complain that they cannot talk
> to osd.0.
>
> So why is 6800 not in the list of ports to avoid?

I patched osd/OSD.cc:
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -7126,6 +7126,7 @@ void OSD::_committed_osd_maps(epoch_t first, epoch_t last, MOSDMap *m)
        start_waiting_for_healthy();

        set<int> avoid_ports;
+        avoid_ports.insert(client_messenger->get_myaddr().get_port());
        avoid_ports.insert(cluster_messenger->get_myaddr().get_port());
        avoid_ports.insert(hb_back_server_messenger->get_myaddr().get_port());
        avoid_ports.insert(hb_front_server_messenger->get_myaddr().get_port());

And just adding this extra port to the avoid set gets things working.

I guess there are plenty of remarks to be made on this patch?
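
To make the effect of that one-liner concrete, here is a toy version of
the port selection during a rebind. The scan-from-6800 behaviour is an
assumption based on the ports seen in this thread, not a copy of the
real Processor::bind() code, and it pretends that bind() itself always
succeeds, which is the FreeBSD symptom above.

#include <iostream>
#include <set>

// Return the first port >= 6800 that is not in the avoid set.
static int pick_port(const std::set<int>& avoid) {
  for (int port = 6800; port <= 7300; ++port)
    if (!avoid.count(port))
      return port;
  return -1;
}

int main() {
  std::set<int> without_patch = {6801, 6802, 6803};
  std::set<int> with_patch    = {6800, 6801, 6802, 6803};  // client port added
  std::cout << "without patch: " << pick_port(without_patch) << "\n";  // 6800
  std::cout << "with patch:    " << pick_port(with_patch)    << "\n";  // 6804
  return 0;
}

Without the client messenger's port in the avoid set, the scan lands on
6800 straight away; with it, the rebind can no longer collide with the
still-listening public messenger, whatever the platform's bind
semantics turn out to be.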

--WjW

