Re: Mismatching nonce for 'ceph osd.0 tell'

From: Willem Jan Withagen <wjw@digiware.nl>
To: kefu chai <tchaikov@gmail.com>
Cc: Gregory Farnum <gfarnum@redhat.com>,
	Haomai Wang <haomai@xsky.com>,
	Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: Mismatching nonce for 'ceph osd.0 tell'
Date: Fri, 9 Dec 2016 13:02:28 +0100
Message-ID: <822ea89a-d556-9ff1-5302-115ef05890cc@digiware.nl>
In-Reply-To: <390636b2-3bcf-a3c0-1d50-9e62e53bbda8@digiware.nl>

On 9-12-2016 10:22, Willem Jan Withagen wrote:
> On 9-12-2016 09:59, kefu chai wrote:
>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> On 8-12-2016 11:03, kefu chai wrote:
>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>> Question here is:
>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>   And ceph-mon has learned this from (crush?)maps being sent to it by
>>>>>>>   ceph-osd.
>>>>>>
>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>> monitor command messages, generally invoked as part of the init
>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>
>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>> and ceph-mon receives in the maps?
>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>
>>>>>> You should be able to see the info going in and out by bumping the
>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>> add them easily if you need to.
>>>>>
>>>>> Hi Greg,
>>>>>
>>>>> Thanx for the answer....
>>>>>
>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>> So I do get to see what addresses are selected during bind. But still
>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>         :/0
>>>>> as bind address.
>>>>>
>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>
>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>> rados client is the external/public messenger. it is also used by osd
>>>> to talk with the monitor.
>>>>
>>>> the nonce of the external address of an OSD does not change after it's
>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>> the booting OSD collected by monitor comes from the connection's
>>>> peer_addr field, which is set when the monitor accepts the connection
>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>> AsyncConnection::_process_connection().
>>>>
>>>> but there are chances that an OSD is restarted and fails to bind its
>>>> external messenger to the specified port. in that case, ceph-osd
>>>> will try with another port, but keep the nonce the same. but when it
>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>> heartbeat_front_addr. the PR of
>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>> of these three messengers. and it has nothing to do
>>>> with the external messenger to which the ceph cli client is
>>>> connecting.
>>>>
>>>> so you might want to check
>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>> connects to it.
>>>>
>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>> with your producer on my linux box, no luck =(
>>>
>>> Right,
>>>
>>> You gave me a lot of things to think about, and to start figuring out.
>>>
>>> And you are right that something really bad needs to happen to an OSD to
>>> get in this state. But that is what the tests actually do: They just
>>> down/up or kill OSDs and restart.
>>>
>>> And from previous discussions I "learned" that if the process doesn't
>>> die but needs to rebind on the port, the OSD stays at the same port but
>>> increments the nonce to indicate that it is a fresh connection. And log
>>
>> the external messenger should *not* increment its nonce.
>>
>>> printing actually shows that the code is going thru a rebind.
>>
>> and it should *not* go through rebind().
> 
> I have to dig thru the testscript but as far as I can tell just about
> all of the daemons are getting reboots in this test.
> 
> So when would I get a rebind?
> 
> I thought it was because I had an OSD incorrectly marked down:
> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
> This I found in the logs, and then I got a rebind.
> 
> Wido suggested looking for this message, on my question why my OSDs were
> not getting UP after a good hustle with all OSDs and MONs.
> 
> And that is one of the tests in cephtool-test-mon.sh.
> right before the 'ceph tell osd.0 version' there are tests like:
>   ceph osd set noup
>   ceph osd down 0
>   ceph osd dump | grep 'osd.0 down'
>   ceph osd unset noup
> and
>   ceph osd reweight osd.0 .5
>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>   ceph osd out 0
>   ceph osd in 0
>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
> 
> 
>>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>>> with my (small) change to the setting of addr. And the nonce is indeed
>>> incremented, which increment is actually picked up by all ceph components.
> 
> So now I have 2 challenges??
> 
> 1) Find out why I get a rebind, where you think I should not.
>    For that I'll have to collect all maltreatment that is done in
>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>    to see what is up.
> 2) If we get a rebind...
>    Why doesn't the FreeBSD version end up with consistent nonces?
> 
> "Good thing" about the previous code was that I could tweak it, and at
> least get it to work for FreeBSD. Have not had the time to see if I
> could again with this code....

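So the smallest sequence I can find that demonstrates the problem is:
function test_mon_rebind()
{
  # Mark osd.0 down while 'noup' keeps it from coming straight back up.
  ceph osd set noup
  ceph osd down 0
  ceph osd dump | grep 'osd.0 down'
  ceph osd unset noup

  # Wait (up to max_run seconds) for osd.0 to be reported up again.
  max_run=1000
  for ((i=0; i < max_run; i++)); do
    if ! ceph osd dump | grep 'osd.0 up'; then
      echo "waiting for osd.0 to come back up ($i/$max_run)"
      sleep 1
    else
      break
    fi
  done
  # Fails here if osd.0 never came back up.
  ceph osd dump | grep 'osd.0 up'

  # Talk to every OSD directly; this is the step that hits the
  # mismatching nonce.
  for id in $(ceph osd ls) ; do
    retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
  done
}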

This matches what I thought I knew:
  OSD down => up => rebind
which follows from the log, where the OSD complains about being marked
down incorrectly.
Search for
   log_channel(cluster) log [WRN] : map e8 wrongly marked me down
in osd.0.log.
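
For anyone who wants to repeat that log search, here is a minimal sketch of a helper that lists the matching lines with their line numbers. The function name find_wrongly_marked_down is hypothetical, and the log path has to be passed in because it depends on how the test cluster was started. It only matches the fixed text "wrongly marked me down", so it does not care which map epoch (e8 above) appears in the message.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the Ceph test scripts.
# Prints every line of the given OSD log that contains the
# "wrongly marked me down" warning, prefixed with its line number.
# $1: path to the OSD log (required; the location is setup-specific).
# Exit status is grep's: 0 if at least one line matched, 1 if none.
find_wrongly_marked_down() {
  local log="${1:?usage: find_wrongly_marked_down <osd-log-file>}"
  grep -n 'wrongly marked me down' -- "$log"
}
```

It would be called as 'find_wrongly_marked_down /path/to/osd.0.log'. A non-zero exit status means the warning never showed up, so the rebind would need another explanation.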

--WjW