* Mismatching nonce for 'ceph osd.0 tell'
@ 2016-09-12 22:59 Willem Jan Withagen
  2016-09-13  2:29 ` Haomai Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-12 22:59 UTC (permalink / raw)
  To: Ceph Development

Hi

When running cephtool-test-mon.sh, part of it executes:
  ceph tell osd.0 version
I see reports on the command line; I guess this is the OSD
complaining that things are wrong:

2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
127.0.0.1:6800/26384 - wrong node!

This keeps repeating until the command is shot down... after 3600 secs.

The nonce is incremented by 1000000 on every rebind.
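For reference, this is roughly what I understand that rebind to do -- a
simplified sketch of my own with approximate names, not the actual Ceph
source:
====
// Sketch only: keep (usually) the same port, but bump the nonce so peers
// can tell this is a fresh incarnation of the messenger.
void rebind_sketch(entity_addr_t &addr, int port) {
  addr.set_port(port);        // normally the same port is reused
  addr.nonce += 1000000;      // the bump I see in the logs
}
====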

But what I do not understand is how this mismatch occurred.
I would expect port 6800 to be the port the OSD is listening on, so the
connecting party (ceph in this case) thinks the nonce is 1026384. Did
the MON have this information? And where did the MON then get it
from....

Somewhere one of the parties did not receive the new nonce, or did not
increment it as well?

Any suggestions on where to look are welcome.

Thanx,
--WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-12 22:59 Mismatching nonce for 'ceph osd.0 tell' Willem Jan Withagen
@ 2016-09-13  2:29 ` Haomai Wang
  2016-09-13  9:00   ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Haomai Wang @ 2016-09-13  2:29 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Ceph Development

On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> Hi
>
> When running  cephtool-test-mon.sh, part of it executes:
>   ceph tell osd.0 version
> I see reports on the commandline, I guess that this is the OSD
> complaining that things are wrong:
>
> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
> 127.0.0.1:6800/26384 - wrong node!
>
> Which it will run until it is shot down.... after 3600 secs.
>
> the nonce is incremented with 1000000 on every rebind.
>
> But what I do not understand is how this mismatch has occurred.
> I would expect port 6800 to be the port on which the OSD is connected
> too, so the connecting party (ceph in this case) thinks the nonce to be
> 1026384. Did the MON have this information? And where did the MON then
> get it from....
>
> Somewhere one of the parts did not receive the new nonce, or did not
> also increment it?

The nonce is part of ceph_entity_addr, so the OSDMap will carry it.
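Roughly, simplified and with field names from memory (see ceph_msgr.h for
the real definition):
====
// Sketch of the wire-level entity address.  The OSDMap stores one of
// these per OSD endpoint, so whatever nonce the OSD reported at boot is
// what clients later expect when they connect.
struct ceph_entity_addr_sketch {
  uint32_t type;                     // address type
  uint32_t nonce;                    // pid-like unique id, bumped on rebind
  struct sockaddr_storage in_addr;   // the actual ip:port
};
====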

>
> Any suggestions welcomed on directions where to look,
>
> Thanx,
> --WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13  2:29 ` Haomai Wang
@ 2016-09-13  9:00   ` Willem Jan Withagen
  2016-09-13 19:52     ` Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-13  9:00 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Ceph Development

On 13-9-2016 04:29, Haomai Wang wrote:
> On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> Hi
>>
>> When running  cephtool-test-mon.sh, part of it executes:
>>   ceph tell osd.0 version
>> I see reports on the commandline, I guess that this is the OSD
>> complaining that things are wrong:
>>
>> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
>> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
>> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
>> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
>> 127.0.0.1:6800/26384 - wrong node!
>>
>> Which it will run until it is shot down.... after 3600 secs.
>>
>> the nonce is incremented with 1000000 on every rebind.
>>
>> But what I do not understand is how this mismatch has occurred.
>> I would expect port 6800 to be the port on which the OSD is connected
>> too, so the connecting party (ceph in this case) thinks the nonce to be
>> 1026384. Did the MON have this information? And where did the MON then
>> get it from....
>>
>> Somewhere one of the parts did not receive the new nonce, or did not
>> also increment it?
> 
> nonce is a part of ceph_entity_addr, so OSDMap will take this

Right, but then the following is also suspicious???

====
# ceph osd dump
epoch 188
fsid 2e02472d-ecbb-43ac-a687-bbf2523233d9
created 2016-09-13 10:28:07.970254
modified 2016-09-13 10:34:57.318988
flags sortbitwise,require_jewel_osds,require_kraken_osds
pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
max_osd 10
osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
e0e44b9c-9869-49d8-8afb-bdb71c04ea27
osd.1 up   in  weight 1 up_from 10 up_thru 184 down_at 0
last_clean_interval [0,0) 127.0.0.1:6804/36579 127.0.0.1:6805/36579
127.0.0.1:6806/36579 127.0.0.1:6807/36579 exists,up
b554849c-2cf1-4cf7-a5fd-3529d33345ff
osd.2 up   in  weight 1 up_from 12 up_thru 185 down_at 0
last_clean_interval [0,0) 127.0.0.1:6808/36593 127.0.0.1:6809/36593
127.0.0.1:6810/36593 127.0.0.1:6811/36593 exists,up
2d6648ba-72e1-4c53-ae10-929a9d13a3dd
====

osd.0 has:
	127.0.0.1:6800/36565
	127.0.0.1:6800/1036565
	127.0.0.1:6804/1036565
	127.0.0.1:6805/1036565

So I guess that one nonce did not get updated, because I would expect
all ports to be rebound, each incrementing its nonce?

The other bad thing is that ports 6804 and 6805 now appear in both osd.0
and osd.1, which I would guess is also going to create some trouble.

So this is what the osdmap distributes, and what the MON then reports
to clients?

How would I retrieve the dump from osd.0 itself?
Trying:
# ceph -c ceph.conf daemon osd.0 dump
Can't get admin socket path: [Errno 2] No such file or directory

Using the admin socket directly does work.
But there is not really a command to get something equivalent to:
====
osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
e0e44b9c-9869-49d8-8afb-bdb71c04ea27
====
So that one can see what the OSD itself thinks it is....
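Something along these lines is what I have in mind -- just a rough,
untested sketch with the signatures approximated from memory, not a claim
about the current tree:
====
// Hypothetical admin-socket hook so that
//   ceph daemon osd.0 dump_my_addrs
// would show the addresses the OSD itself believes it has.
class DumpAddrsHook : public AdminSocketHook {
  OSD *osd;
public:
  explicit DumpAddrsHook(OSD *o) : osd(o) {}
  bool call(std::string command, cmdmap_t &cmdmap, std::string format,
            bufferlist &out) override {
    std::ostringstream ss;
    ss << "client  " << osd->client_messenger->get_myaddr() << "\n"
       << "cluster " << osd->cluster_messenger->get_myaddr() << "\n";
    out.append(ss.str());
    return true;
  }
};
// registered somewhere in OSD::init(), roughly:
//   cct->get_admin_socket()->register_command(
//       "dump_my_addrs", "dump_my_addrs", hook, "dump messenger addresses");
====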

--WjW



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13  9:00   ` Willem Jan Withagen
@ 2016-09-13 19:52     ` Gregory Farnum
  2016-09-13 20:21       ` Willem Jan Withagen
  2016-09-13 20:29       ` Willem Jan Withagen
  0 siblings, 2 replies; 18+ messages in thread
From: Gregory Farnum @ 2016-09-13 19:52 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Haomai Wang, Ceph Development

On Tue, Sep 13, 2016 at 2:00 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 13-9-2016 04:29, Haomai Wang wrote:
>> On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> Hi
>>>
>>> When running  cephtool-test-mon.sh, part of it executes:
>>>   ceph tell osd.0 version
>>> I see reports on the commandline, I guess that this is the OSD
>>> complaining that things are wrong:
>>>
>>> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
>>> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
>>> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
>>> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
>>> 127.0.0.1:6800/26384 - wrong node!
>>>
>>> Which it will run until it is shot down.... after 3600 secs.
>>>
>>> the nonce is incremented with 1000000 on every rebind.
>>>
>>> But what I do not understand is how this mismatch has occurred.
>>> I would expect port 6800 to be the port on which the OSD is connected
>>> too, so the connecting party (ceph in this case) thinks the nonce to be
>>> 1026384. Did the MON have this information? And where did the MON then
>>> get it from....
>>>
>>> Somewhere one of the parts did not receive the new nonce, or did not
>>> also increment it?
>>
>> nonce is a part of ceph_entity_addr, so OSDMap will take this
>
> Right, but then the following is also suspicious???
>
> ====
> # ceph osd dump
> epoch 188
> fsid 2e02472d-ecbb-43ac-a687-bbf2523233d9
> created 2016-09-13 10:28:07.970254
> modified 2016-09-13 10:34:57.318988
> flags sortbitwise,require_jewel_osds,require_kraken_osds
> pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
> max_osd 10
> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
> osd.1 up   in  weight 1 up_from 10 up_thru 184 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6804/36579 127.0.0.1:6805/36579
> 127.0.0.1:6806/36579 127.0.0.1:6807/36579 exists,up
> b554849c-2cf1-4cf7-a5fd-3529d33345ff
> osd.2 up   in  weight 1 up_from 12 up_thru 185 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6808/36593 127.0.0.1:6809/36593
> 127.0.0.1:6810/36593 127.0.0.1:6811/36593 exists,up
> 2d6648ba-72e1-4c53-ae10-929a9d13a3dd
> ====
>
> osd.0 has:
>         127.0.0.1:6800/36565
>         127.0.0.1:6800/1036565
>         127.0.0.1:6804/1036565
>         127.0.0.1:6805/1036565
>
> So I guess that one nonce did not get updated, because I would expect
> all ports te be rebound, and incr the nonce?
>
> The other bad thing is that ports 6804 and 6805 are now both in osd.0
> and osd.1, which is going to create some trouble also I would guess.
>
> So this is what the osdmap distributes?
> And then MON reports to clients?
>
> How would I retrieve the dump from osd.0 itself?
> Trying:
> # ceph -c ceph.conf daemon osd.0 dump
> Can't get admin socket path: [Errno 2] No such file or directory
>
> Using the admin-socket directly does work.
> But there is not really a command to get something equal to:
> ====
> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
> ====
> So that one can see what the OSD itself thinks it is....

Is osd.0 actually running? If so it *should* have a socket, unless
you've disabled them somehow. Check the logs and see if there are
failures when it gets set up, I guess?

Anyway, something has indeed gone terribly wrong here. I know at one
point you had some messenger patches you were using to try and get
stuff going on BSD; if you still have some there I think you need to
consider them suspect. Otherwise, uh...the network stack is behaving
very differently than Linux's?
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 19:52     ` Gregory Farnum
@ 2016-09-13 20:21       ` Willem Jan Withagen
  2016-09-13 20:29       ` Willem Jan Withagen
  1 sibling, 0 replies; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-13 20:21 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 13-9-2016 21:52, Gregory Farnum wrote:
> On Tue, Sep 13, 2016 at 2:00 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 13-9-2016 04:29, Haomai Wang wrote:
>>> On Tue, Sep 13, 2016 at 6:59 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>> Hi
>>>>
>>>> When running  cephtool-test-mon.sh, part of it executes:
>>>>   ceph tell osd.0 version
>>>> I see reports on the commandline, I guess that this is the OSD
>>>> complaining that things are wrong:
>>>>
>>>> 2016-09-12 23:50:39.239037 814e50e00  0 -- 127.0.0.1:0/1925715881 >>
>>>> 127.0.0.1:6800/26384 conn(0x814fde800 sd=18 :-1
>>>> s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0
>>>> l=1)._process_connection connect claims to be 127.0.0.1:6800/1026384 not
>>>> 127.0.0.1:6800/26384 - wrong node!
>>>>
>>>> Which it will run until it is shot down.... after 3600 secs.
>>>>
>>>> the nonce is incremented with 1000000 on every rebind.
>>>>
>>>> But what I do not understand is how this mismatch has occurred.
>>>> I would expect port 6800 to be the port on which the OSD is connected
>>>> too, so the connecting party (ceph in this case) thinks the nonce to be
>>>> 1026384. Did the MON have this information? And where did the MON then
>>>> get it from....
>>>>
>>>> Somewhere one of the parts did not receive the new nonce, or did not
>>>> also increment it?
>>>
>>> nonce is a part of ceph_entity_addr, so OSDMap will take this
>>
>> Right, but then the following is also suspicious???
>>
>> ====
>> # ceph osd dump
>> epoch 188
>> fsid 2e02472d-ecbb-43ac-a687-bbf2523233d9
>> created 2016-09-13 10:28:07.970254
>> modified 2016-09-13 10:34:57.318988
>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>> pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
>> rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
>> max_osd 10
>> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
>> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
>> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
>> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
>> osd.1 up   in  weight 1 up_from 10 up_thru 184 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6804/36579 127.0.0.1:6805/36579
>> 127.0.0.1:6806/36579 127.0.0.1:6807/36579 exists,up
>> b554849c-2cf1-4cf7-a5fd-3529d33345ff
>> osd.2 up   in  weight 1 up_from 12 up_thru 185 down_at 0
>> last_clean_interval [0,0) 127.0.0.1:6808/36593 127.0.0.1:6809/36593
>> 127.0.0.1:6810/36593 127.0.0.1:6811/36593 exists,up
>> 2d6648ba-72e1-4c53-ae10-929a9d13a3dd
>> ====
>>
>> osd.0 has:
>>         127.0.0.1:6800/36565
>>         127.0.0.1:6800/1036565
>>         127.0.0.1:6804/1036565
>>         127.0.0.1:6805/1036565
>>
>> So I guess that one nonce did not get updated, because I would expect
>> all ports te be rebound, and incr the nonce?
>>
>> The other bad thing is that ports 6804 and 6805 are now both in osd.0
>> and osd.1, which is going to create some trouble also I would guess.
>>
>> So this is what the osdmap distributes?
>> And then MON reports to clients?
>>
>> How would I retrieve the dump from osd.0 itself?
>> Trying:
>> # ceph -c ceph.conf daemon osd.0 dump
>> Can't get admin socket path: [Errno 2] No such file or directory
>>
>> Using the admin-socket directly does work.
>> But there is not really a command to get something equal to:
>> ====
>> osd.0 up   in  weight 1 up_from 175 up_thru 185 down_at 172
>> last_clean_interval [8,174) 127.0.0.1:6800/36565 127.0.0.1:6800/1036565
>> 127.0.0.1:6804/1036565 127.0.0.1:6805/1036565 exists,up
>> e0e44b9c-9869-49d8-8afb-bdb71c04ea27
>> ====
>> So that one can see what the OSD itself thinks it is....
> 
> Is osd.0 actually running? If so it *should* have a socket, unless
> you've disabled them somehow. Check the logs and see if there are
> failures when it gets set up, I guess?

Yup, the OSDs are up. I needed to work on the EventKqueue() stuff
because events could not be submitted once threads had been spawned.

> Anyway, something has indeed gone terribly wrong here. I know at one
> point you had some messenger patches you were using to try and get
> stuff going on BSD; if you still have some there I think you need to
> consider them suspect. Otherwise, uh...the network stack is behaving
> very differently than Linux's?

Well, a large part of the challenge is that Linux uses EventEpoll(),
while on FreeBSD EventKqueue() is the choice (if only because FreeBSD
does not have epoll()).

One of the problems is that the descriptor for reading/writing the events
does not survive a fork, and thus any setup and/or events done before
threads are forked will be lost.
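A minimal stand-alone example of what I mean (behaviour per kqueue(2); the
exact error in the child may differ, so treat this as a sketch):
====
/* kqfork.cc -- register a kevent, fork, then try to use the queue in the
 * child.  kqueue(2) documents that the queue is not inherited across
 * fork(), so the child's kevent() call is expected to fail. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/wait.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
  int kq = kqueue();
  struct kevent ev;
  EV_SET(&ev, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
  kevent(kq, &ev, 1, NULL, 0, NULL);          /* register before forking */

  pid_t pid = fork();
  if (pid == 0) {
    struct timespec ts = { 0, 0 };
    int n = kevent(kq, NULL, 0, &ev, 1, &ts); /* expected to fail here */
    printf("child: kevent returned %d\n", n);
    _exit(0);
  }
  waitpid(pid, NULL, 0);
  return 0;
}
====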

So yes, my patches are incomplete, that much I know. But I'm gathering
knowledge to see how to better diagnose the problem, and to have the tools
to verify that the fix really works.
Hence my question whether there is a ceph command that delivers more or
less the same OSD dump output, directly from the OSD itself.

I'm discussing this also with Haomai Wang, who did the first version of
EventKqueue...

--WjW


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 19:52     ` Gregory Farnum
  2016-09-13 20:21       ` Willem Jan Withagen
@ 2016-09-13 20:29       ` Willem Jan Withagen
  2016-09-13 20:35         ` Gregory Farnum
  1 sibling, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-13 20:29 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 13-9-2016 21:52, Gregory Farnum wrote:
> Is osd.0 actually running? If so it *should* have a socket, unless
> you've disabled them somehow. Check the logs and see if there are
> failures when it gets set up, I guess?
> 
> Anyway, something has indeed gone terribly wrong here. I know at one
> point you had some messenger patches you were using to try and get
> stuff going on BSD; if you still have some there I think you need to
> consider them suspect. Otherwise, uh...the network stack is behaving
> very differently than Linux's?

So what is the expected result of an osd down/up?

Before, it is connected to ports like:
	127.0.0.1:{6800,6801,6802}/{nonce=pid-like}
After the osd has gone down/up, would the sockets be:
	127.0.0.1:{6800,6801,6802}/{(nonce=pid-like)+1000000}

or are the ports also incremented?

--WjW



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 20:29       ` Willem Jan Withagen
@ 2016-09-13 20:35         ` Gregory Farnum
  2016-09-28 12:01           ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-09-13 20:35 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Haomai Wang, Ceph Development

On Tue, Sep 13, 2016 at 1:29 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 13-9-2016 21:52, Gregory Farnum wrote:
>> Is osd.0 actually running? If so it *should* have a socket, unless
>> you've disabled them somehow. Check the logs and see if there are
>> failures when it gets set up, I guess?
>>
>> Anyway, something has indeed gone terribly wrong here. I know at one
>> point you had some messenger patches you were using to try and get
>> stuff going on BSD; if you still have some there I think you need to
>> consider them suspect. Otherwise, uh...the network stack is behaving
>> very differently than Linux's?
>
> So what is the expected result of an osd down/up?
>
> Before it is connected to ports like:
>         127.0.0.1:{6800,,6801,6802}/{nonce=pid-like}
> after the osd has gone down/up the sockets work be:
>         127.0.0.1:{6800,,6801,6802}/{(nonce=pid-like)+1000000}
>
> or are the ports also incremented?

IIRC, it should usually be the same ports and different nonce. But the
port is *allowed* to change; that happens sometimes if there was an
unclean shutdown and the port is still considered in-use by the OS for
instance.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-13 20:35         ` Gregory Farnum
@ 2016-09-28 12:01           ` Willem Jan Withagen
  2016-10-03 17:50             ` Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-09-28 12:01 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 13-9-2016 22:35, Gregory Farnum wrote:
> On Tue, Sep 13, 2016 at 1:29 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 13-9-2016 21:52, Gregory Farnum wrote:
>>> Is osd.0 actually running? If so it *should* have a socket, unless
>>> you've disabled them somehow. Check the logs and see if there are
>>> failures when it gets set up, I guess?
>>>
>>> Anyway, something has indeed gone terribly wrong here. I know at one
>>> point you had some messenger patches you were using to try and get
>>> stuff going on BSD; if you still have some there I think you need to
>>> consider them suspect. Otherwise, uh...the network stack is behaving
>>> very differently than Linux's?
>>
>> So what is the expected result of an osd down/up?
>>
>> Before it is connected to ports like:
>>         127.0.0.1:{6800,,6801,6802}/{nonce=pid-like}
>> after the osd has gone down/up the sockets work be:
>>         127.0.0.1:{6800,,6801,6802}/{(nonce=pid-like)+1000000}
>>
>> or are the ports also incremented?
> 
> IIRC, it should usually be the same ports and different nonce. But the
> port is *allowed* to change; that happens sometimes if there was an
> unclean shutdown and the port is still considered in-use by the OS for
> instance.
> 
ATM, I have the feeling that I'm even more off track.

In FreeBSD I have:
119: starting osd.0 at :/0 osd_data testdir/osd-bench/0
testdir/osd-bench/0/journal
119: create-or-move updating item name 'osd.0' weight 1 at location
{host=localhost,root=default} to crush map
119: 0
119: epoch 5
119: fsid 6e5ab220-d761-4636-b4b0-27eacadb41e3
119: created 2016-09-28 02:32:44.704630
119: modified 2016-09-28 02:32:49.971511
119: flags sortbitwise,require_jewel_osds,require_kraken_osds
119: pool 1 'rbd' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 3 flags hashpspool
stripe_width 0
119: max_osd 1
119: osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0
last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists,new
1dddf568-c6dd-47d1-b4b4-584867a8b48d

Whereas in Linux I have, at the same point in the script:
130: starting osd.0 at :/0 osd_data testdir/osd-bench/0
testdir/osd-bench/0/journal
130: create-or-move updating item name 'osd.0' weight 1 at location
{host=localhost,root=default} to crush map
130: 0
130: osd.0 up   in  weight 1 up_from 6 up_thru 6 down_at 0
last_clean_interval [0,0) 127.0.0.1:6800/19815 127.0.0.1:6801/19815
127.0.0.1:6802/19815 127.0.0.1:6803/19815 exists,up
084e3410-9eb9-4fc3-b395-e46e9587d351

Question here is:
  If I ask 'ceph osd dump', I'm actually asking ceph-mon.
  And ceph-mon has learned this from (crush?)maps being sent to it by
  ceph-osd.

Is there an easy way to debug/monitor the content of what ceph-osd sends
and ceph-mon receives in the maps?
Just to make sure that it is clear where the problem occurs.

thanx,
--WjW


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-09-28 12:01           ` Willem Jan Withagen
@ 2016-10-03 17:50             ` Gregory Farnum
  2016-10-04 11:57               ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2016-10-03 17:50 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Haomai Wang, Ceph Development

On Wed, Sep 28, 2016 at 5:01 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 13-9-2016 22:35, Gregory Farnum wrote:
>> On Tue, Sep 13, 2016 at 1:29 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> On 13-9-2016 21:52, Gregory Farnum wrote:
>>>> Is osd.0 actually running? If so it *should* have a socket, unless
>>>> you've disabled them somehow. Check the logs and see if there are
>>>> failures when it gets set up, I guess?
>>>>
>>>> Anyway, something has indeed gone terribly wrong here. I know at one
>>>> point you had some messenger patches you were using to try and get
>>>> stuff going on BSD; if you still have some there I think you need to
>>>> consider them suspect. Otherwise, uh...the network stack is behaving
>>>> very differently than Linux's?
>>>
>>> So what is the expected result of an osd down/up?
>>>
>>> Before it is connected to ports like:
>>>         127.0.0.1:{6800,,6801,6802}/{nonce=pid-like}
>>> after the osd has gone down/up the sockets work be:
>>>         127.0.0.1:{6800,,6801,6802}/{(nonce=pid-like)+1000000}
>>>
>>> or are the ports also incremented?
>>
>> IIRC, it should usually be the same ports and different nonce. But the
>> port is *allowed* to change; that happens sometimes if there was an
>> unclean shutdown and the port is still considered in-use by the OS for
>> instance.
>>
> ATM, I have the feeling that I'm even more off track.
>
> In FreeBSD I have:
> 119: starting osd.0 at :/0 osd_data testdir/osd-bench/0
> testdir/osd-bench/0/journal
> 119: create-or-move updating item name 'osd.0' weight 1 at location
> {host=localhost,root=default} to crush map
> 119: 0
> 119: epoch 5
> 119: fsid 6e5ab220-d761-4636-b4b0-27eacadb41e3
> 119: created 2016-09-28 02:32:44.704630
> 119: modified 2016-09-28 02:32:49.971511
> 119: flags sortbitwise,require_jewel_osds,require_kraken_osds
> 119: pool 1 'rbd' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 4 pgp_num 4 last_change 3 flags hashpspool
> stripe_width 0
> 119: max_osd 1
> 119: osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0
> last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists,new
> 1dddf568-c6dd-47d1-b4b4-584867a8b48d
>
> Where as in Linux I have, at the same point in the script:
> 130: starting osd.0 at :/0 osd_data testdir/osd-bench/0
> testdir/osd-bench/0/journal
> 130: create-or-move updating item name 'osd.0' weight 1 at location
> {host=localhost,root=default} to crush map
> 130: 0
> 130: osd.0 up   in  weight 1 up_from 6 up_thru 6 down_at 0
> last_clean_interval [0,0) 127.0.0.1:6800/19815 127.0.0.1:6801/19815
> 127.0.0.1:6802/19815 127.0.0.1:6803/19815 exists,up
> 084e3410-9eb9-4fc3-b395-e46e9587d351
>
> Question here is:
>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>   And cehp-mon has learned this from (crush?)maps being sent to it by
>   ceph-osd.

The monitor has learned about specific IP addresses/nonces/etc via
MOSDBoot messages from the OSDs. The crush locations are set via
monitor command messages, generally invoked as part of the init
scripts. Maps are generated entirely on the monitor. :)

> Is there an easy way to debug/monitor the content of what ceph-osd sends
> and ceph-mon receives in the maps?
> Just to make sure that it is clear where the problem occurs.

You should be able to see the info going in and out by bumping the
debug levels up — every message's "print" function is invoked when
it's sent/received as long as you have "debug ms = 1". It looks like
the MOSDBoot message doesn't natively dump its addresses but you can
add them easily if you need to.
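For instance, something roughly like this in src/messages/MOSDBoot.h (a
sketch, field names from memory) would make the addresses show up in the
"debug ms = 1" output:
====
// Sketch: extend MOSDBoot's print() so the debug-ms log line for the boot
// message includes the addresses the OSD reported.
void print(ostream& out) const {
  out << "osd_boot(osd." << sb.whoami
      << " booted " << boot_epoch
      << " cluster_addr " << cluster_addr
      << " hb_back_addr " << hb_back_addr
      << " hb_front_addr " << hb_front_addr
      << ")";
}
====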
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-10-03 17:50             ` Gregory Farnum
@ 2016-10-04 11:57               ` Willem Jan Withagen
  2016-12-08 10:03                 ` kefu chai
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-10-04 11:57 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Ceph Development

On 3-10-2016 19:50, Gregory Farnum wrote:
>> Question here is:
>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>   ceph-osd.
> 
> The monitor has learned about specific IP addresses/nonces/etc via
> MOSDBoot messages from the OSDs. The crush locations are set via
> monitor command messages, generally invoked as part of the init
> scripts. Maps are generated entirely on the monitor. :)
> 
>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>> and ceph-mon receives in the maps?
>> Just to make sure that it is clear where the problem occurs.
> 
> You should be able to see the info going in and out by bumping the
> debug levels up — every message's "print" function is invoked when
> it's sent/received as long as you have "debug ms = 1". It looks like
> the MOSDBoot message doesn't natively dump its addresses but you can
> add them easily if you need to.

Hi Greg,

Thanx for the answer....

I've got debug_ms already pumped up all the way to 20.
So I do get to see what addresses are selected during bind. But still
they do not end up at the MON, and 'ceph osd dump' reports:
	:/0
as bind address.

I'm going to add some more debugs to actually see what MOSDBoot is doing....

It looks like I've tackled most of the EventKqueue forking trouble by
keeping full bookkeeping of the registered state.
I have got to run some more FreeBSD tests to see if I'm actually solving
what I think is the problem. :)

--WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-10-04 11:57               ` Willem Jan Withagen
@ 2016-12-08 10:03                 ` kefu chai
  2016-12-08 12:30                   ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: kefu chai @ 2016-12-08 10:03 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 3-10-2016 19:50, Gregory Farnum wrote:
>>> Question here is:
>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>   ceph-osd.
>>
>> The monitor has learned about specific IP addresses/nonces/etc via
>> MOSDBoot messages from the OSDs. The crush locations are set via
>> monitor command messages, generally invoked as part of the init
>> scripts. Maps are generated entirely on the monitor. :)
>>
>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>> and ceph-mon receives in the maps?
>>> Just to make sure that it is clear where the problem occurs.
>>
>> You should be able to see the info going in and out by bumping the
>> debug levels up — every message's "print" function is invoked when
>> it's sent/received as long as you have "debug ms = 1". It looks like
>> the MOSDBoot message doesn't natively dump its addresses but you can
>> add them easily if you need to.
>
> Hi Greg,
>
> Thanx for the answer....
>
> I've got debug_ms already pumped up all the way to 20.
> So I do get to see what addresses are selected during bind. But still
> they do not end up at the MON, and 'ceph osd dump' reports:
>         :/0
> as bind address.
>
> I'm going to add some more debugs to actually see what MOSDBoot is doing....

There are multiple messengers used by ceph-osd; the one a rados client
connects to is the external/public messenger. It is also used by the OSD
to talk to the monitor.

The nonce of the external address of an OSD does not change after it's
up: it is always the pid of the ceph-osd process. And the (peer) address
of the booting OSD collected by the monitor comes from the connection's
peer_addr field, which is set when the monitor accepts the connection
from the OSD; see the STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
AsyncConnection::_process_connection().

But there is a chance that an OSD is restarted and fails to bind its
external messenger to the specified port. In that case, ceph-osd will
try another port, but keep the nonce the same. When it comes to the
other messengers used by ceph-osd, their nonces increase by 1000000
every time they rebind. That's why "ceph osd thrash" can change the
nonces of the cluster_addr, heartbeat_back_addr and heartbeat_front_addr.
The PR https://github.com/ceph/ceph/pull/11706 actually changes the
behavior of these three messengers, and it has nothing to do with the
external messenger to which the ceph CLI client is connecting.
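In pseudo-code terms, roughly (a sketch with placeholder names such as
cluster_msgr/hb_back_msgr/used_ports, not the actual source):
====
// Only the non-public messengers are rebound when an OSD is e.g. wrongly
// marked down; the public/external address keeps nonce == pid.
void on_marked_down_sketch() {
  // public/external messenger: NOT rebound, nonce stays the pid, so the
  // address the ceph CLI and the monitor use does not change.

  // cluster + heartbeat messengers: rebound, each rebind bumps that
  // messenger's nonce by 1000000 -- this is what changes cluster_addr,
  // hb_back_addr and hb_front_addr after "ceph osd thrash".
  cluster_msgr->rebind(used_ports);   // nonce += 1000000 inside rebind()
  hb_back_msgr->rebind(used_ports);
  hb_front_msgr->rebind(used_ports);
}
====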

So you might want to check:
1) how/why the nonce of the messenger in MonClient is 1000000 + $pid,
2) while the nonce of the same messenger is $pid when the ceph CLI
connects to it.

My PR https://github.com/ceph/ceph/pull/11804 is more of a cleanup;
it avoids setting the nonce before the rebind finishes. And I tried
with your reproducer on my Linux box, no luck =(

>
> It looks like I've tackled most of the EventKqueue forking trouble, by
> keeping a full administration.
> Got to make some more FreeBSD tests to see if I'm actually solving what
> I think is the problem. :)
>
> --WjW



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-08 10:03                 ` kefu chai
@ 2016-12-08 12:30                   ` Willem Jan Withagen
  2016-12-09  8:59                     ` kefu chai
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-08 12:30 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 8-12-2016 11:03, kefu chai wrote:
> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>> Question here is:
>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>   ceph-osd.
>>>
>>> The monitor has learned about specific IP addresses/nonces/etc via
>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>> monitor command messages, generally invoked as part of the init
>>> scripts. Maps are generated entirely on the monitor. :)
>>>
>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>> and ceph-mon receives in the maps?
>>>> Just to make sure that it is clear where the problem occurs.
>>>
>>> You should be able to see the info going in and out by bumping the
>>> debug levels up — every message's "print" function is invoked when
>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>> add them easily if you need to.
>>
>> Hi Greg,
>>
>> Thanx for the answer....
>>
>> I've got debug_ms already pumped up all the way to 20.
>> So I do get to see what addresses are selected during bind. But still
>> they do not end up at the MON, and 'ceph osd dump' reports:
>>         :/0
>> as bind address.
>>
>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
> 
> there are multiple messengers used by ceph-osd, the one connected by
> rados client is the external/public messenger. it is also used by osd
> to talk with the monitor.
> 
> the nonce of the external address of an OSD does not change after it's
> up: it's always the pid of ceph-osd process. and the (peer) address of
> the booting OSD collected by monitor comes from the connection's
> peer_addr field, which is set when the monitor accepts the connection
> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
> AsyncConnection::_process_connection().
> 
> but there are chances that an OSD is restarted and fail to bind its
> external messenger to the specified the port. in that case, ceph-osd
> will try with another port, but keep the nonce the same. but when it
> comes to other messengers used by ceph-osd, their nonces increase by
> 1000000 every time they rebind. that's why "ceph osd thrash" can
> change the nonces of the cluster_addr, heartbeat_back_addr and
> heartbeat_front_addr. the PR of
> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
> of the messengers of these three messengers. and it has nothing to do
> with the external messenger to which the ceph cli client is
> connecting.
> 
> so you might want to check
> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
> 2) while the nonce of the same messenger is $pid when the ceph cli
> connects to it.
> 
> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
> it avoids setting the nonce before the rebind finishes. and i tried
> with your producer on my linux box, no luck =(

Right,

You gave me a lot of things to think about, and to start figuring out.

And you are right that something really bad needs to happen to an OSD to
get into this state. But that is what the tests actually do: they just
down/up or kill OSDs and restart them.

And from previous discussions I "learned" that if the process doesn't
die but needs to rebind on the port, the OSD stays at the same port but
increments the nonce to indicate that it is a fresh connection. And log
printing actually shows that the code is going through a rebind.

Now the bad thing is that the Linux and FreeBSD logs do comparable things
with my (small) change to the setting of addr. And the nonce is indeed
incremented, and that increment is actually picked up by all Ceph components.

But if I keep the old code, the nonces are running out of sync.
Your patch doesn't hurt, but it also doesn't help: I still get
mismatched nonces.

But as I said at the start: lots of things to consider again.

--WjW



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-08 12:30                   ` Willem Jan Withagen
@ 2016-12-09  8:59                     ` kefu chai
  2016-12-09  9:22                       ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: kefu chai @ 2016-12-09  8:59 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 8-12-2016 11:03, kefu chai wrote:
>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>> Question here is:
>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>   ceph-osd.
>>>>
>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>> monitor command messages, generally invoked as part of the init
>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>
>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>> and ceph-mon receives in the maps?
>>>>> Just to make sure that it is clear where the problem occurs.
>>>>
>>>> You should be able to see the info going in and out by bumping the
>>>> debug levels up — every message's "print" function is invoked when
>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>> add them easily if you need to.
>>>
>>> Hi Greg,
>>>
>>> Thanx for the answer....
>>>
>>> I've got debug_ms already pumped up all the way to 20.
>>> So I do get to see what addresses are selected during bind. But still
>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>         :/0
>>> as bind address.
>>>
>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>
>> there are multiple messengers used by ceph-osd, the one connected by
>> rados client is the external/public messenger. it is also used by osd
>> to talk with the monitor.
>>
>> the nonce of the external address of an OSD does not change after it's
>> up: it's always the pid of ceph-osd process. and the (peer) address of
>> the booting OSD collected by monitor comes from the connection's
>> peer_addr field, which is set when the monitor accepts the connection
>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>> AsyncConnection::_process_connection().
>>
>> but there are chances that an OSD is restarted and fail to bind its
>> external messenger to the specified the port. in that case, ceph-osd
>> will try with another port, but keep the nonce the same. but when it
>> comes to other messengers used by ceph-osd, their nonces increase by
>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>> change the nonces of the cluster_addr, heartbeat_back_addr and
>> heartbeat_front_addr. the PR of
>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>> of the messengers of these three messengers. and it has nothing to do
>> with the external messenger to which the ceph cli client is
>> connecting.
>>
>> so you might want to check
>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>> 2) while the nonce of the same messenger is $pid when the ceph cli
>> connects to it.
>>
>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>> it avoids setting the nonce before the rebind finishes. and i tried
>> with your producer on my linux box, no luck =(
>
> Right,
>
> You gave me a lot of things to think about, and to start figuring out.
>
> And you are right that something really bad needs to happen to an OSD to
> get in this state. But that is what the tests actually do: They just
> down/up or kill OSDs and restart.
>
> And from previous discussions I "learned" that if the process doesn't
> die but needs to rebind on the port, the OSD stays at the same port but
> increments the nonce to indicate that it is a fresh connection. And log

the external messenger should *not* increment its nonce.

> printing actually shows that the code is going thru a rebind.

and it should *not* go through rebind().

>
> Now the bad thing is that the Linux and FreeBSD log do comparable things
> with my (small) change to the setting of addr. And the nonce is indeed
> incremented, which increment is actually picked up by all ceph components.
>
> But if I keep the old code, the nonces are running out of sync.
> Your patch doesn't hurt, but it also doesn't help: I still get
> mismatched nonces.

yeah, that's expected.

>
> But like I started: lots of things again to consider.
>
> --WjW
>
>



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-09  8:59                     ` kefu chai
@ 2016-12-09  9:22                       ` Willem Jan Withagen
  2016-12-09 12:02                         ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-09  9:22 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 9-12-2016 09:59, kefu chai wrote:
> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>> On 8-12-2016 11:03, kefu chai wrote:
>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>> Question here is:
>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>   ceph-osd.
>>>>>
>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>> monitor command messages, generally invoked as part of the init
>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>
>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>> and ceph-mon receives in the maps?
>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>
>>>>> You should be able to see the info going in and out by bumping the
>>>>> debug levels up — every message's "print" function is invoked when
>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>> add them easily if you need to.
>>>>
>>>> Hi Greg,
>>>>
>>>> Thanx for the answer....
>>>>
>>>> I've got debug_ms already pumped up all the way to 20.
>>>> So I do get to see what addresses are selected during bind. But still
>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>         :/0
>>>> as bind address.
>>>>
>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>
>>> there are multiple messengers used by ceph-osd, the one connected by
>>> rados client is the external/public messenger. it is also used by osd
>>> to talk with the monitor.
>>>
>>> the nonce of the external address of an OSD does not change after it's
>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>> the booting OSD collected by monitor comes from the connection's
>>> peer_addr field, which is set when the monitor accepts the connection
>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>> AsyncConnection::_process_connection().
>>>
>>> but there are chances that an OSD is restarted and fail to bind its
>>> external messenger to the specified the port. in that case, ceph-osd
>>> will try with another port, but keep the nonce the same. but when it
>>> comes to other messengers used by ceph-osd, their nonces increase by
>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>> heartbeat_front_addr. the PR of
>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>> of the messengers of these three messengers. and it has nothing to do
>>> with the external messenger to which the ceph cli client is
>>> connecting.
>>>
>>> so you might want to check
>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>> connects to it.
>>>
>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>> it avoids setting the nonce before the rebind finishes. and i tried
>>> with your producer on my linux box, no luck =(
>>
>> Right,
>>
>> You gave me a lot of things to think about, and to start figuring out.
>>
>> And you are right that something really bad needs to happen to an OSD to
>> get in this state. But that is what the tests actually do: They just
>> down/up or kill OSDs and restart.
>>
>> And from previous discussions I "learned" that if the process doesn't
>> die but needs to rebind on the port, the OSD stays at the same port but
>> increments the nonce to indicate that it is a fresh connection. And log
> 
> the external messenger should *not* increment its nonce.
> 
>> printing actually shows that the code is going thru a rebind.
> 
> and it should *not* go through rebind().

I have to dig through the test script, but as far as I can tell just
about all of the daemons get restarted in this test.

So when would I get a rebind?

I thought it was because I had an OSD incorrectly marked down:
./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
I found this in the logs, and then I got a rebind.

Wido suggested looking for this message, in reply to my question why my
OSDs were not coming back up after a good shake-up of all OSDs and MONs.

And that is one of the tests in cephtool-test-mon.sh.
Right before the 'ceph tell osd.0 version' there are tests like:
  ceph osd set noup
  ceph osd down 0
  ceph osd dump | grep 'osd.0 down'
  ceph osd unset noup
and
  ceph osd reweight osd.0 .5
  ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
  ceph osd out 0
  ceph osd in 0
  ceph osd dump | grep ^osd.0 | grep 'weight 0.5'


>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>> with my (small) change to the setting of addr. And the nonce is indeed
>> incremented, which increment is actually picked up by all ceph components.

So now I have 2 challenges??

1) Find out why I get a rebind, where you think I should not.
   For that I'll have to collect all maltreatment that is done in
   cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
   to see what is up.
2) If we get a rebind...
   Why doesn't the FreeBSD version end up with consistent nonces?

"Good thing" about the previous code was that I could tweak it, and at
least get it to Work for FreeBSD. Have not had the time to see if I
could again with this code....

--WjW

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-09  9:22                       ` Willem Jan Withagen
@ 2016-12-09 12:02                         ` Willem Jan Withagen
  2016-12-10  0:39                           ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-09 12:02 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 9-12-2016 10:22, Willem Jan Withagen wrote:
> On 9-12-2016 09:59, kefu chai wrote:
>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>> On 8-12-2016 11:03, kefu chai wrote:
>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>> Question here is:
>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>   ceph-osd.
>>>>>>
>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>> monitor command messages, generally invoked as part of the init
>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>
>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>> and ceph-mon receives in the maps?
>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>
>>>>>> You should be able to see the info going in and out by bumping the
>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>> add them easily if you need to.
>>>>>
>>>>> Hi Greg,
>>>>>
>>>>> Thanx for the answer....
>>>>>
>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>> So I do get to see what addresses are selected during bind. But still
>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>         :/0
>>>>> as bind address.
>>>>>
>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>
>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>> rados client is the external/public messenger. it is also used by osd
>>>> to talk with the monitor.
>>>>
>>>> the nonce of the external address of an OSD does not change after it's
>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>> the booting OSD collected by monitor comes from the connection's
>>>> peer_addr field, which is set when the monitor accepts the connection
>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>> AsyncConnection::_process_connection().
>>>>
>>>> but there are chances that an OSD is restarted and fail to bind its
>>>> external messenger to the specified the port. in that case, ceph-osd
>>>> will try with another port, but keep the nonce the same. but when it
>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>> heartbeat_front_addr. the PR of
>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>> of the messengers of these three messengers. and it has nothing to do
>>>> with the external messenger to which the ceph cli client is
>>>> connecting.
>>>>
>>>> so you might want to check
>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>> connects to it.
>>>>
>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>> with your producer on my linux box, no luck =(
>>>
>>> Right,
>>>
>>> You gave me a lot of things to think about, and to start figuring out.
>>>
>>> And you are right that something really bad needs to happen to an OSD to
>>> get in this state. But that is what the tests actually do: They just
>>> down/up or kill OSDs and restart.
>>>
>>> And from previous discussions I "learned" that if the process doesn't
>>> die but needs to rebind on the port, the OSD stays at the same port but
>>> increments the nonce to indicate that it is a fresh connection. And log
>>
>> the external messenger should *not* increment its nonce.
>>
>>> printing actually shows that the code is going thru a rebind.
>>
>> and it should *not* go through rebind().
> 
> I have to dig thru the testscript but as far as I can tell just about
> all of the daemons are getting reboots in this test.
> 
> So when would I get a rebind?
> 
> I thought it was because I had an OSD incorrectly marked down:
> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
> This I found in the logs, and then I got a rebind.
> 
> Wido suggested looking for this message, on my question why my OSDs were
> not getting UP after a good hustle with all OSDs and MONs.
> 
> And that is one of the tests in cephtool-test-mon.sh.
> right before the 'ceph tell osd.0 version' there are tests like:
>   ceph osd set noup
>   ceph osd down 0
>   ceph osd dump | grep 'osd.0 down'
>   ceph osd unset noup
> and
>   ceph osd reweight osd.0 .5
>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>   ceph osd out 0
>   ceph osd in 0
>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
> 
> 
>>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>>> with my (small) change to the setting of addr. And the nonce is indeed
>>> incremented, which increment is actually picked up by all ceph components.
> 
> So now I have 2 challenges??
> 
> 1) Find out why I get a rebind, where you think I should not.
>    For that I'll have to collect all maltreatment that is done in
>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>    to see what is up.
> 2) If we get a rebind...
>    Why doesn't the FreeBSD version end up with consistent noncees.
> 
> "Good thing" about the previous code was that I could tweak it, and at
> least get it to Work for FreeBSD. Have not had the time to see if I
> could again with this code....

So the smallest sequence I can find that demonstrates the problem:
function test_mon_rebind()
{
  ceph osd set noup
  ceph osd down 0
  ceph osd dump | grep 'osd.0 down'
  ceph osd unset noup
  max_run=1000
  for ((i=0; i < $max_run; i++)); do
    if ! ceph osd dump | grep 'osd.0 up'; then
      echo "waiting for osd.0 to come back up ($i/$max_run)"
      sleep 1
    else
      break
    fi
  done
  ceph osd dump | grep 'osd.0 up'

  for id in `ceph osd ls` ; do
    retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
  done
}

Which matches what I thought I knew:
  OSD down => up => rebind
which follows from the log, where the OSD complains about being marked
down incorrectly.
Search for
   log_channel(cluster) log [WRN] : map e8 wrongly marked me down
in the osd.0.log.

--WjW


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-09 12:02                         ` Willem Jan Withagen
@ 2016-12-10  0:39                           ` Willem Jan Withagen
  2016-12-13  1:02                             ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-10  0:39 UTC (permalink / raw)
  To: Willem Jan Withagen, kefu chai
  Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 9-12-2016 13:02, Willem Jan Withagen wrote:
> On 9-12-2016 10:22, Willem Jan Withagen wrote:
>> On 9-12-2016 09:59, kefu chai wrote:
>>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>> On 8-12-2016 11:03, kefu chai wrote:
>>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>>> Question here is:
>>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>>   ceph-osd.
>>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>>> monitor command messages, generally invoked as part of the init
>>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>>
>>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>>> and ceph-mon receives in the maps?
>>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>> You should be able to see the info going in and out by bumping the
>>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>>> add them easily if you need to.
>>>>>> Hi Greg,
>>>>>>
>>>>>> Thanx for the answer....
>>>>>>
>>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>>> So I do get to see what addresses are selected during bind. But still
>>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>>         :/0
>>>>>> as bind address.
>>>>>>
>>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>>> rados client is the external/public messenger. it is also used by osd
>>>>> to talk with the monitor.
>>>>>
>>>>> the nonce of the external address of an OSD does not change after it's
>>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>>> the booting OSD collected by monitor comes from the connection's
>>>>> peer_addr field, which is set when the monitor accepts the connection
>>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>>> AsyncConnection::_process_connection().
>>>>>
>>>>> but there are chances that an OSD is restarted and fails to bind its
>>>>> external messenger to the specified port. in that case, ceph-osd
>>>>> will try another port, but keep the nonce the same. but when it
>>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>>> heartbeat_front_addr. the PR of
>>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>>> of the messengers of these three messengers. and it has nothing to do
>>>>> with the external messenger to which the ceph cli client is
>>>>> connecting.
>>>>>
>>>>> so you might want to check
>>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>>> connects to it.
>>>>>
>>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>>> with your reproducer on my linux box, no luck =(
>>>> Right,
>>>>
>>>> You gave me a lot of things to think about, and to start figuring out.
>>>>
>>>> And you are right that something really bad needs to happen to an OSD to
>>>> get in this state. But that is what the tests actually do: They just
>>>> down/up or kill OSDs and restart.
>>>>
>>>> And from previous discussions I "learned" that if the process doesn't
>>>> die but needs to rebind on the port, the OSD stays at the same port but
>>>> increments the nonce to indicate that it is a fresh connection. And log
>>> the external messenger should *not* increment its nonce.
>>>
>>>> printing actually shows that the code is going through a rebind.
>>> and it should *not* go through rebind().
>> I have to dig through the test script but as far as I can tell just about
>> all of the daemons get restarted in this test.
>>
>> So when would I get a rebind?
>>
>> I thought it was because I had an OSD incorrectly marked down:
>> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
>> This I found in the logs, and then I got a rebind.
>>
>> Wido suggested looking for this message when I asked why my OSDs were
>> not coming back UP after a good hustle with all OSDs and MONs.
>>
>> And that is one of the tests in cephtool-test-mon.sh.
>> right before the 'ceph tell osd.0 version' there are tests like:
>>   ceph osd set noup
>>   ceph osd down 0
>>   ceph osd dump | grep 'osd.0 down'
>>   ceph osd unset noup
>> and
>>   ceph osd reweight osd.0 .5
>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>   ceph osd out 0
>>   ceph osd in 0
>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>
>>
>>>> Now the bad thing is that the Linux and FreeBSD logs do comparable things
>>>> with my (small) change to the setting of addr. And the nonce is indeed
>>>> incremented, which increment is actually picked up by all ceph components.
>> So now I have 2 challenges??
>>
>> 1) Find out why I get a rebind, where you think I should not.
>>    For that I'll have to collect all maltreatment that is done in
>>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>>    to see what is up.
>> 2) If we get a rebind...
>>    Why doesn't the FreeBSD version end up with consistent nonces?
>>
>> "Good thing" about the previous code was that I could tweak it, and at
>> least get it to Work for FreeBSD. Have not had the time to see if I
>> could again with this code....
> So the smallest sequence I can find that demonstrates the problem:
> function test_mon_rebind()
> {
>   ceph osd set noup
>   ceph osd down 0
>   ceph osd dump | grep 'osd.0 down'
>   ceph osd unset noup
>   max_run=1000
>   for ((i=0; i < $max_run; i++)); do
>     if ! ceph osd dump | grep 'osd.0 up'; then
>       echo "waiting for osd.0 to come back up ($i/$max_run)"
>       sleep 1
>     else
>       break
>     fi
>   done
>   ceph osd dump | grep 'osd.0 up'
>
>   for id in `ceph osd ls` ; do
>     retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
>   done
> }
>
> Which matches what I thought I knew:
>   OSD down => up => rebind
> This follows from the log, where the OSD complains about being marked
> down incorrectly. Search for
>    log_channel(cluster) log [WRN] : map e8 wrongly marked me down
> in the osd.0.log.
>

I ran the code snippet, and it DOES generate a rebind even on Linux
CentOS 7.
So I expect that the rebind is indeed called for?

Also note that at least one nonce is incremented, AND new ports are
selected.
Ports 6801, 6802 and 6803 are avoided, and it looks like already bound
ports are avoided as well.
(obvious, since binding on them would not work)
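
To make the nonce bookkeeping concrete, here is a tiny sketch of what a
rebind does to the advertised address. The struct and helper below are
my own illustration, not the AsyncMessenger code; only the "+ 1000000"
step and the example numbers are taken from the Linux log below.

#include <cstdint>
#include <iostream>
#include <string>

struct EntityAddr {
  std::string ip;
  int port;
  uint64_t nonce;
  std::string str() const {
    return ip + ":" + std::to_string(port) + "/" + std::to_string(nonce);
  }
};

// On a rebind the messenger keeps its IP, moves to a port outside the
// avoid set, and bumps the nonce so peers can tell the fresh instance
// from the old one.
EntityAddr rebound(const EntityAddr& old_addr, int new_port) {
  return EntityAddr{old_addr.ip, new_port, old_addr.nonce + 1000000};
}

int main() {
  EntityAddr before{"127.0.0.1", 6801, 22932};
  EntityAddr after = rebound(before, 6812);
  std::cout << before.str() << " -> " << after.str() << "\n";
  // prints 127.0.0.1:6801/22932 -> 127.0.0.1:6812/1022932,
  // matching the Linux log below
  return 0;
}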

Next is the FreeBSD log, and that one is rather seriously wrong since
the nonces are 0, which must be incorrect.

Last is the log of the current master code with the patch I made that
uses msgr->my_inst.addr instead of addr.
IMHO a change that should be semantically equivalent?
  return bind(msgr->my_inst.addr, new_avoid);
  // return bind(addr, new_avoid);
One thing that is not working there is the port avoidance. That I have
to look into, since there is no reason why it should not work under
FreeBSD.

Linux log:
2016-12-10 00:24:26.971102 7fdf0674e700  0 log_channel(cluster) log
[WRN] : map e20 wrongly marked me down
2016-12-10 00:24:26.971153 7fdf0674e700  1 -- 127.0.0.1:6801/22932
rebind rebind avoid 6801,6802,6803
2016-12-10 00:24:26.971440 7fdf0674e700  1 -- 127.0.0.1:6801/22932
shutdown_connections
2016-12-10 00:24:26.971600 7fdf0674e700  1  Processor -- rebind rebind
avoid 6801,6802,6803
2016-12-10 00:24:26.972454 7fdf0674e700  1  Processor -- bind bind
my_inst.addr is 127.0.0.1:6812/1022932
2016-12-10 00:24:26.972488 7fdf0674e700  1  Processor -- start
2016-12-10 00:24:26.972520 7fdf0674e700  1 -- 127.0.0.1:6802/22932
rebind rebind avoid 6801,6802,6803
2016-12-10 00:24:26.972564 7fdf0674e700  1 -- 127.0.0.1:6802/22932
shutdown_connections
2016-12-10 00:24:26.972599 7fdf0674e700  1  Processor -- rebind rebind
avoid 6801,6802,6803
2016-12-10 00:24:26.973286 7fdf0674e700  1 -- 127.0.0.1:0/22932
learned_addr learned my addr 127.0.0.1:0/22932
2016-12-10 00:24:26.973311 7fdf0674e700  1  Processor -- bind bind
my_inst.addr is 127.0.0.1:6813/1022932
2016-12-10 00:24:26.973318 7fdf0674e700  1  Processor -- start
2016-12-10 00:24:26.973345 7fdf0674e700  1 -- 127.0.0.1:6803/22932
rebind rebind avoid 6801,6802,6803
2016-12-10 00:24:26.973385 7fdf0674e700  1 -- 127.0.0.1:6803/22932
shutdown_connections
2016-12-10 00:24:26.973408 7fdf0674e700  1  Processor -- rebind rebind
avoid 6801,6802,6803
2016-12-10 00:24:26.974099 7fdf0674e700  1 -- 127.0.0.1:0/22932
learned_addr learned my addr 127.0.0.1:0/22932
2016-12-10 00:24:26.974136 7fdf0674e700  1  Processor -- bind bind
my_inst.addr is 127.0.0.1:6814/1022932
2016-12-10 00:24:26.974144 7fdf0674e700  1  Processor -- start
2016-12-10 00:24:26.974172 7fdf0674e700  1 -- 127.0.0.1:0/22932
shutdown_connections
2016-12-10 00:24:26.974249 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6806/23094 conn(0x7fdf1f21a000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down
2016-12-10 00:24:26.974266 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6807/23094 conn(0x7fdf1f302000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down
2016-12-10 00:24:26.974274 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6810/23257 conn(0x7fdf1f303800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down
2016-12-10 00:24:26.974281 7fdf0674e700  1 -- 127.0.0.1:0/22932 >>
127.0.0.1:6811/23257 conn(0x7fdf1f3aa000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down

The FreeBSD variant in current code looks like:
2016-12-10 00:58:45.612080 bb96d80  0 log_channel(cluster) log [WRN] :
map e19 wrongly marked me down
2016-12-10 00:58:45.612128 bb96d80  1 -- 127.0.0.1:6801/0 rebind Async
rebind avoid 6801,6802,6803
2016-12-10 00:58:45.612175 bb96d80  1 -- 127.0.0.1:6801/0
shutdown_connections
2016-12-10 00:58:45.612331 bb96d80  1 -- 127.0.0.1:6800/0 _finish_bind
bind my_inst.addr is 127.0.0.1:6800/0
2016-12-10 00:58:45.612339 bb96d80  1  Processor -- start
2016-12-10 00:58:45.612366 bb96d80  1 -- 127.0.0.1:6802/0 rebind Async
rebind avoid 6801,6802,6803
2016-12-10 00:58:45.612399 bb96d80  1 -- 127.0.0.1:6802/0
shutdown_connections
2016-12-10 00:58:45.612476 bb96d80  1 -- 127.0.0.1:0/0 learned_addr
learned my addr 127.0.0.1:0/0
2016-12-10 00:58:45.612487 bb96d80  1 -- 127.0.0.1:6804/0 _finish_bind
bind my_inst.addr is 127.0.0.1:6804/0
2016-12-10 00:58:45.612491 bb96d80  1  Processor -- start
2016-12-10 00:58:45.612520 bb96d80  1 -- 127.0.0.1:6803/0 rebind Async
rebind avoid 6801,6802,6803
2016-12-10 00:58:45.612553 bb96d80  1 -- 127.0.0.1:6803/0
shutdown_connections
2016-12-10 00:58:45.612659 bb96d80  1 -- 127.0.0.1:0/0 learned_addr
learned my addr 127.0.0.1:0/0
2016-12-10 00:58:45.612670 bb96d80  1 -- 127.0.0.1:6805/0 _finish_bind
bind my_inst.addr is 127.0.0.1:6805/0
2016-12-10 00:58:45.612674 bb96d80  1  Processor -- start
2016-12-10 00:58:45.612696 bb96d80  1 -- 127.0.0.1:0/40923
shutdown_connections
2016-12-10 00:58:45.612742 bb99600  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6811/0 conn(0xbd01800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 00:58:45.612926 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6806/0 conn(0xbcc4000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 00:58:45.612967 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6807/0 conn(0xbcc5800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 00:58:45.612983 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6810/0 conn(0xbd04800 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).mark_down:2365
2016-12-10 00:58:45.613011 bb96d80  1 -- 127.0.0.1:0/40923 >>
127.0.0.1:6811/0 conn(0xbcda000 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0
l=1).mark_down:2365

And then the current code:
2016-12-10 01:14:01.951312 bb96d80  0 log_channel(cluster) log [WRN] :
map e17 wrongly marked me down
2016-12-10 01:14:01.951431 bb96d80  1 -- 127.0.0.1:6801/43637 rebind
Async rebind avoid 6801,6802,6803
2016-12-10 01:14:01.951477 bb96d80  1 -- 127.0.0.1:6801/43637
shutdown_connections
2016-12-10 01:14:01.951554 bb96d80  1  Processor -- rebind Proc rebind
avoid 6801,6802,6803
2016-12-10 01:14:01.951629 bb96d80  1  Processor -- bind:156 end bind
my_inst.addr is 127.0.0.1:6801/1043637
2016-12-10 01:14:01.951637 bb96d80  1  Processor -- start
2016-12-10 01:14:01.951660 bb96d80  1 -- 127.0.0.1:6802/43637 rebind
Async rebind avoid 6801,6802,6803
2016-12-10 01:14:01.951700 bb96d80  1 -- 127.0.0.1:6802/43637
shutdown_connections
2016-12-10 01:14:01.951727 bb96d80  1  Processor -- rebind Proc rebind
avoid 6801,6802,6803
2016-12-10 01:14:01.951762 bb96d80  1 -- 127.0.0.1:6802/1043637
learned_addr learned my addr 127.0.0.1:6802/1043637
2016-12-10 01:14:01.951773 bb96d80  1  Processor -- bind:156 end bind
my_inst.addr is 127.0.0.1:6802/1043637
2016-12-10 01:14:01.951777 bb96d80  1  Processor -- start
2016-12-10 01:14:01.951797 bb96d80  1 -- 127.0.0.1:6803/43637 rebind
Async rebind avoid 6801,6802,6803
2016-12-10 01:14:01.951827 bb96d80  1 -- 127.0.0.1:6803/43637
shutdown_connections
2016-12-10 01:14:01.951854 bb96d80  1  Processor -- rebind Proc rebind
avoid 6801,6802,6803
2016-12-10 01:14:01.951889 bb96d80  1 -- 127.0.0.1:6803/1043637
learned_addr learned my addr 127.0.0.1:6803/1043637
2016-12-10 01:14:01.951900 bb96d80  1  Processor -- bind:156 end bind
my_inst.addr is 127.0.0.1:6803/1043637
2016-12-10 01:14:01.951904 bb96d80  1  Processor -- start
2016-12-10 01:14:01.951925 bb96d80  1 -- 127.0.0.1:0/43637
shutdown_connections
2016-12-10 01:14:01.952008 bb99600  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6811/43663 conn(0xbd62000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 01:14:01.952220 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6806/43650 conn(0xbc6e000 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 01:14:01.952246 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6807/43650 conn(0xbc6f800 :-1 s=STATE_CLOSED pgs=1 cs=1
l=1).mark_down:2365
2016-12-10 01:14:01.952260 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6810/43663 conn(0xbc72800 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0  l=1).mark_down:2365
2016-12-10 01:14:01.952289 bb96d80  1 -- 127.0.0.1:0/43637 >>
127.0.0.1:6811/43663 conn(0xbd65000 :-1
s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).mark_down:2365

--WjW





* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-10  0:39                           ` Willem Jan Withagen
@ 2016-12-13  1:02                             ` Willem Jan Withagen
  2016-12-13  1:34                               ` Willem Jan Withagen
  0 siblings, 1 reply; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-13  1:02 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 10-12-2016 01:39, Willem Jan Withagen wrote:
> 
> I ran the code snippet, and it DOES generate a rebind even on Linux
> CentOS 7.
> So I expect that the rebind is indeed called for?
>
> Also note that at least one nonce is incremented, AND new ports are
> selected.
> Ports 6801, 6802 and 6803 are avoided, and it looks like already bound
> ports are avoided as well.
> (obvious, since binding on them would not work)
>
> Next is the FreeBSD log, and that one is rather seriously wrong since
> the nonces are 0, which must be incorrect.
>
> Last is the log of the current master code with the patch I made that
> uses msgr->my_inst.addr instead of addr.
> IMHO a change that should be semantically equivalent?
>   return bind(msgr->my_inst.addr, new_avoid);
>   // return bind(addr, new_avoid);
> One thing that is not working there is the port avoidance. That I have
> to look into, since there is no reason why it should not work under
> FreeBSD.

I have tested with the new code Kefu submitted to master this afternoon:
The Linux variant gives:
7f36b1dd1700  1 -- 127.0.0.1:6800/13799 <== mon.2 127.0.0.1:8204/0 38
==== osd_map(18..21 src has 1..21) v3 ==== 1012+0+0 (424389225 0 0)
0x7f36cdcc7cc0 con 0x7f36cdb24000
7f36b5ac6700  0 log_channel(cluster) log [WRN] : map e21 wrongly marked
me down
7f36b5ac6700  1 -- 127.0.0.1:6801/13799 rebind rebind avoid 6801,6802,6803
7f36b5ac6700  1 -- 127.0.0.1:6801/13799 shutdown_connections
7f36b5ac6700  1 -- 127.0.0.1:6812/1013799 _finish_bind bind my_inst.addr
is 127.0.0.1:6812/1013799
7f36b5ac6700  1  Processor -- start

Note that the newly allocated channel is on port 6812 with an incremented
nonce, and the channel on port 6800 is not touched, probably because it
is still open and is not the channel being rebound.

Now the FreeBSD version:
bb99f80  1 -- 127.0.0.1:6800/6449 <== mon.1 127.0.0.1:7203/0 158 ====
osd_map(172..175 src has 1..175) v3 ==== 1045+0+0 (1598708051 0 0)
0xc08eb40 con 0xbbca000
bb98d80  0 log_channel(cluster) log [WRN] : map e175 wrongly marked me down
bb98d80  1 -- 127.0.0.1:6801/6449 rebind rebind avoid 6801,6802,6803
bb98d80  1 -- 127.0.0.1:6801/6449 shutdown_connections
bb98d80  1 -- 127.0.0.1:6800/1006449 _finish_bind bind my_inst.addr is
127.0.0.1:6800/1006449
bb98d80  1  Processor -- start

And here, in the FreeBSD case, the new channel is allocated on port
6800, as if that port were available. Should it not also be marked in
avoid_ports?
That port is still in use, as the first log line above shows, yet
FreeBSD does not seem to have a problem binding to it again.
So a channel ends up listening on 6800, but with a new nonce, and that
could be the reason that ceph and rados complain that they cannot talk
to osd.0.

So why is 6800 not in the list of ports to avoid?
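
One way to take Ceph out of the equation is a standalone bind test. The
snippet below is my own experiment, not Ceph code; the use of
SO_REUSEADDR is an assumption for the experiment, not a statement about
what the messenger actually sets. It checks whether a second socket in
the same process can bind to a port that already has an active
listener:

#include <arpa/inet.h>
#include <cstdint>
#include <cstdio>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_listener(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int on = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
  sockaddr_in sa{};
  sa.sin_family = AF_INET;
  sa.sin_port = htons(port);
  inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
  if (bind(fd, (sockaddr*)&sa, sizeof(sa)) < 0) {
    perror("bind");
    close(fd);
    return -1;
  }
  listen(fd, 5);
  return fd;
}

int main() {
  int first  = make_listener(6800);  // plays the still-open public messenger
  int second = make_listener(6800);  // plays the rebinding messenger
  printf("first=%d second=%d\n", first, second);
  if (first >= 0) close(first);
  if (second >= 0) close(second);
  return 0;
}

On Linux I would expect the second bind to fail with EADDRINUSE; if it
succeeds on FreeBSD, that would explain how the rebind can end up on
6800 while the public messenger is still listening there. If it fails
on FreeBSD as well, the collision must come from somewhere else, for
instance the old socket already being closed by the time the rebind
runs.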

--WjW




* Re: Mismatching nonce for 'ceph osd.0 tell'
  2016-12-13  1:02                             ` Willem Jan Withagen
@ 2016-12-13  1:34                               ` Willem Jan Withagen
  0 siblings, 0 replies; 18+ messages in thread
From: Willem Jan Withagen @ 2016-12-13  1:34 UTC (permalink / raw)
  To: kefu chai; +Cc: Gregory Farnum, Haomai Wang, Ceph Development

On 13-12-2016 02:02, Willem Jan Withagen wrote:
> On 10-12-2016 01:39, Willem Jan Withagen wrote:
>> On 9-12-2016 13:02, Willem Jan Withagen wrote:
>>> On 9-12-2016 10:22, Willem Jan Withagen wrote:
>>>> On 9-12-2016 09:59, kefu chai wrote:
>>>>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>>> On 8-12-2016 11:03, kefu chai wrote:
>>>>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>>>>>>> On 3-10-2016 19:50, Gregory Farnum wrote:
>>>>>>>>>> Question here is:
>>>>>>>>>>   If I ask 'ceph osd dump', I'm actually asking ceph-mon.
>>>>>>>>>>   And cehp-mon has learned this from (crush?)maps being sent to it by
>>>>>>>>>>   ceph-osd.
>>>>>>>>> The monitor has learned about specific IP addresses/nonces/etc via
>>>>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via
>>>>>>>>> monitor command messages, generally invoked as part of the init
>>>>>>>>> scripts. Maps are generated entirely on the monitor. :)
>>>>>>>>>
>>>>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends
>>>>>>>>>> and ceph-mon receives in the maps?
>>>>>>>>>> Just to make sure that it is clear where the problem occurs.
>>>>>>>>> You should be able to see the info going in and out by bumping the
>>>>>>>>> debug levels up — every message's "print" function is invoked when
>>>>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like
>>>>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can
>>>>>>>>> add them easily if you need to.
>>>>>>>> Hi Greg,
>>>>>>>>
>>>>>>>> Thanx for the answer....
>>>>>>>>
>>>>>>>> I've got debug_ms already pumped up all the way to 20.
>>>>>>>> So I do get to see what addresses are selected during bind. But still
>>>>>>>> they do not end up at the MON, and 'ceph osd dump' reports:
>>>>>>>>         :/0
>>>>>>>> as bind address.
>>>>>>>>
>>>>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing....
>>>>>>> there are multiple messengers used by ceph-osd, the one connected by
>>>>>>> rados client is the external/public messenger. it is also used by osd
>>>>>>> to talk with the monitor.
>>>>>>>
>>>>>>> the nonce of the external address of an OSD does not change after it's
>>>>>>> up: it's always the pid of ceph-osd process. and the (peer) address of
>>>>>>> the booting OSD collected by monitor comes from the connection's
>>>>>>> peer_addr field, which is set when the monitor accepts the connection
>>>>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in
>>>>>>> AsyncConnection::_process_connection().
>>>>>>>
>>>>>>> but there are chances that an OSD is restarted and fail to bind its
>>>>>>> external messenger to the specified the port. in that case, ceph-osd
>>>>>>> will try with another port, but keep the nonce the same. but when it
>>>>>>> comes to other messengers used by ceph-osd, their nonces increase by
>>>>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can
>>>>>>> change the nonces of the cluster_addr, heartbeat_back_addr and
>>>>>>> heartbeat_front_addr. the PR of
>>>>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior
>>>>>>> of the messengers of these three messengers. and it has nothing to do
>>>>>>> with the external messenger to which the ceph cli client is
>>>>>>> connecting.
>>>>>>>
>>>>>>> so you might want to check
>>>>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid
>>>>>>> 2) while the nonce of the same messenger is $pid when the ceph cli
>>>>>>> connects to it.
>>>>>>>
>>>>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup.
>>>>>>> it avoids setting the nonce before the rebind finishes. and i tried
>>>>>>> with your producer on my linux box, no luck =(
>>>>>> Right,
>>>>>>
>>>>>> You gave me a lot of things to think about, and to start figuring out.
>>>>>>
>>>>>> And you are right that something really bad needs to happen to an OSD to
>>>>>> get in this state. But that is what the tests actually do: They just
>>>>>> down/up or kill OSDs and restart.
>>>>>>
>>>>>> And from previous discussions I "learned" that if the process doesn't
>>>>>> die but needs to rebind on the port, the OSD stays at the same port but
>>>>>> increments the nonce to indicate that it is a fresh connection. And log
>>>>> the external messenger should *not* increment its nonce.
>>>>>
>>>>>> printing actually shows that the code is going thru a rebind.
>>>>> and it should *not* go through rebind().
>>>> I have to dig thru the testscript but as far as I can tell just about
>>>> all of the daemons are getting reboots in this test.
>>>>
>>>> So when would I get a rebind?
>>>>
>>>> I thought it was because I had an OSD incorrectly marked down:
>>>> ./src/osd/OSD.cc:7074:                 << " wrongly marked me down";
>>>> This I found in the logs, and then I got a rebind.
>>>>
>>>> Wido suggested looking for this message, on my question why my OSDs were
>>>> not getting UP after a good hustle with all OSDs and MONs.
>>>>
>>>> And that is one of the tests in cephtool-test-mon.sh.
>>>> right before the 'ceph tell osd.0 version' there are tests like:
>>>>   ceph osd set noup
>>>>   ceph osd down 0
>>>>   ceph osd dump | grep 'osd.0 down'
>>>>   ceph osd unset noup
>>>> and
>>>>   ceph osd reweight osd.0 .5
>>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>>   ceph osd out 0
>>>>   ceph osd in 0
>>>>   ceph osd dump | grep ^osd.0 | grep 'weight 0.5'
>>>>
>>>>
>>>>>> Now the bad thing is that the Linux and FreeBSD log do comparable things
>>>>>> with my (small) change to the setting of addr. And the nonce is indeed
>>>>>> incremented, which increment is actually picked up by all ceph components.
>>>> So now I have 2 challenges??
>>>>
>>>> 1) Find out why I get a rebind, where you think I should not.
>>>>    For that I'll have to collect all maltreatment that is done in
>>>>    cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs
>>>>    to see what is up.
>>>> 2) If we get a rebind...
>>>>    Why doesn't the FreeBSD version end up with consistent noncees.
>>>>
>>>> "Good thing" about the previous code was that I could tweak it, and at
>>>> least get it to Work for FreeBSD. Have not had the time to see if I
>>>> could again with this code....
>>> So the smallest sequence I can find that demonstrates the problem:
>>> function test_mon_rebind()
>>> {
>>>   ceph osd set noup
>>>   ceph osd down 0
>>>   ceph osd dump | grep 'osd.0 down'
>>>   ceph osd unset noup
>>>   max_run=1000
>>>   for ((i=0; i < $max_run; i++)); do
>>>     if ! ceph osd dump | grep 'osd.0 up'; then
>>>       echo "waiting for osd.0 to come back up ($i/$max_run)"
>>>       sleep 1
>>>     else
>>>       break
>>>     fi
>>>   done
>>>   ceph osd dump | grep 'osd.0 up'
>>>
>>>   for id in `ceph osd ls` ; do
>>>     retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version
>>>   done
>>> }
>>>
>>> Which matches with what I thought I knew:
>>>   OSD down => up => rebind
>>> which follows from the log where the osd complains about being marked
>>> down incorrectly.
>>> search for
>>>    log_channel(cluster) log [WRN] : map e8 wrongly marked me down
>>> in the osd.0.log
>>>
>>
>> I ran the code snippet, and it DOES generate a rebind even on Linux
>> Centos 7.
>> So I expect that the rebind action is called for?
>>
>> Also note that at least one nonce is incremented, AND new ports are
>> selected.
>> 6801,6802,6803 are avoided, and it looks like already connected ports
>> are avoided as well.
>> (obvious, since binding on them will not work)
>>
>> Next is the FreeBSD log, and that one is relatively seriously wrong
>> since nonce are 0 .
>> And that must be incorrect.
>>
>> Last is the log of the current master code with the patch I made of
>> msg->my_addr, versus addr.
>> IMHO a change that should be semantical equal?
>>   return bind(msgr->my_inst.addr, new_avoid);
>>   // return bind(addr, new_avoid);
>> One thing that is not working there is the port avoidance. That I have
>> to look into since there is no reason
>> why this should not work under FreeBSD.
> 
> I have tested with the new code Kefu submitted to master this afternoon:
> The Linux variant gives:
> 7f36b1dd1700  1 -- 127.0.0.1:6800/13799 <== mon.2 127.0.0.1:8204/0 38
> ==== osd_map(18..21 src has 1..21) v3 ==== 1012+0+0 (424389225 0 0)
> 0x7f36cdcc7cc0 con 0x7f36cdb24000
> 7f36b5ac6700  0 log_channel(cluster) log [WRN] : map e21 wrongly marked
> me down
> 7f36b5ac6700  1 -- 127.0.0.1:6801/13799 rebind rebind avoid 6801,6802,6803
> 7f36b5ac6700  1 -- 127.0.0.1:6801/13799 shutdown_connections
> 7f36b5ac6700  1 -- 127.0.0.1:6812/1013799 _finish_bind bind my_inst.addr
> is 127.0.0.1:6812/1013799
> 7f36b5ac6700  1  Processor -- start
> 
> Note that the new allocated channel is on port 6812 with an incremented
> nonce, and the channel on port 6800 is not touched. Probably because it
> is still open, and it is not the channel being rebound.
> 
> Now the FreeBSD version:
> bb99f80  1 -- 127.0.0.1:6800/6449 <== mon.1 127.0.0.1:7203/0 158 ====
> osd_map(172..175 src has 1..175) v3 ==== 1045+0+0 (1598708051 0 0)
> 0xc08eb40 con 0xbbca000
> bb98d80  0 log_channel(cluster) log [WRN] : map e175 wrongly marked me down
> bb98d80  1 -- 127.0.0.1:6801/6449 rebind rebind avoid 6801,6802,6803
> bb98d80  1 -- 127.0.0.1:6801/6449 shutdown_connections
> bb98d80  1 -- 127.0.0.1:6800/1006449 _finish_bind bind my_inst.addr is
> 127.0.0.1:6800/1006449
> bb98d80  1  Processor -- start
> 
> And here, in the FreeBSD case, the new channel is allocated on port
> 6800, as if that port were available. Should it not also be marked in
> avoid_ports?
> That port is still in use, as the first log line above shows, yet
> FreeBSD does not seem to have a problem binding to it again.
> So a channel ends up listening on 6800, but with a new nonce, and that
> could be the reason that ceph and rados complain that they cannot talk
> to osd.0.
>
> So why is 6800 not in the list of ports to avoid?

I patched osd/OSD.cc:
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -7126,6 +7126,7 @@ void OSD::_committed_osd_maps(epoch_t first, epoch_t last, MOSDMap *m)
        start_waiting_for_healthy();

        set<int> avoid_ports;
+        avoid_ports.insert(client_messenger->get_myaddr().get_port());
        avoid_ports.insert(cluster_messenger->get_myaddr().get_port());
        avoid_ports.insert(hb_back_server_messenger->get_myaddr().get_port());
        avoid_ports.insert(hb_front_server_messenger->get_myaddr().get_port());

And just adding this extra port to the avoid set gets things working.

I guess there are plenty of remarks to be made on this patch?
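
To make the effect of that one-liner concrete, here is a toy version of
the port selection during a rebind. The scan-from-6800 behaviour is an
assumption based on the ports seen in this thread, not a copy of the
real Processor::bind() code, and it pretends that bind() itself always
succeeds, which is the FreeBSD symptom above.

#include <iostream>
#include <set>

// Return the first port >= 6800 that is not in the avoid set.
static int pick_port(const std::set<int>& avoid) {
  for (int port = 6800; port <= 7300; ++port)
    if (!avoid.count(port))
      return port;
  return -1;
}

int main() {
  std::set<int> without_patch = {6801, 6802, 6803};
  std::set<int> with_patch    = {6800, 6801, 6802, 6803};  // client port added
  std::cout << "without patch: " << pick_port(without_patch) << "\n";  // 6800
  std::cout << "with patch:    " << pick_port(with_patch)    << "\n";  // 6804
  return 0;
}

Without the client messenger's port in the avoid set, the scan lands on
6800 straight away; with it, the rebind can no longer collide with the
still-listening public messenger, whatever the platform's bind
semantics turn out to be.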

--WjW

