From: Gregory Farnum <gfarnum@redhat.com>
To: Sage Weil <sweil@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>, Haomai Wang <haomai@xsky.com>,
	Marcus Watts <mwatts@redhat.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: msgr2 protocol
Date: Tue, 13 Sep 2016 13:07:41 -0700
Message-ID: <CAJ4mKGZeGMGSMiLqmbakrXizW2=FFq4Tt1VzyxZ=+ebsuqTvyw@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.11.1609131506250.19761@piezo.us.to>

On Tue, Sep 13, 2016 at 8:10 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 13 Sep 2016, Jeff Layton wrote:
>> On Tue, 2016-09-13 at 13:31 +0000, Sage Weil wrote:
>> > On Tue, 13 Sep 2016, Jeff Layton wrote:
>> > > On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote:
>> > > > On Sat, 10 Sep 2016, Haomai Wang wrote:
>> > > > > About v1/v2 compatibility, I rethought the details:
>> > > > >
>> > > > > 0. we need to define a new banner which must be longer than the old one ("ceph v027")
>> > > > > 1. assume the msgr v2 banner is "ceph v2 %64llx %64llx\n"
>> > > > > 2. in both the simple and async code, the server side must issue its banner first
>> > > > > 3. if the server side supports v2 and the client only supports v1, the client will
>> > > > > receive 9 bytes and do a memcmp, then reject the connection by closing the
>> > > > > socket. So the server side could retry with the older version
>> > > > > 4. if the server side only supports v1 and the client supports v2, the client
>> > > > > replies with the banner corresponding to the one it received
>> > > > >
>> > > > > This tricky design is based on the implementation facts that "the accept side
>> > > > > issues its banner first" and "the new banner is longer than the old banner",
>> > > > > and this way doesn't need to involve other dependencies like mon port
>> > > > > changes.
>> > > > >
>> > > > > Does this approach have a problem?
>> > > >
>> > > > I was thinking we avoid this problem and any hacky initial handshakes by
>> > > > speaking v2 on the new port and v1 on the old port.  Then the monmap has
>> > > > an entity_addrvec_t with both a v1 and v2 address (encoding with just the
>> > > > v1 address for old clients). Same for the OSDs.
>> > > >
>> > > > The v1 handshake just isn't extensible (how do you tell a v2 client
>> > > > connecting that you speak both v1 and v2?).
>> > > >
>> > >
>> > > Depending on port assignments for the protocol is pretty icky though.
>> > > There may be valid reasons to use different ports in some environments
>> > > and then that heuristic goes right out the window.
>> > >
>> > > One thing that is really strange about both the old and new protocols
>> > > is that they have the client and server sending the initial exchange
>> > > concurrently, or have the server send it first.  While it may speed up
>> > > the initial negotiation slightly, it makes it really hard to handle
>> > > fallback to earlier protocol versions (as Haomai pointed out), as the
>> > > client is responsible for handling reconnects.
>> > >
>> > > Consider the case where we have a client that supports only v1 but a
>> > > server that supports v1 and v2. Client connects and then server sends a
>> > > v2 message. Client doesn't understand it and closes the connection and
>> > > reconnects, only to end up in the same situation on the second attempt.
>> > >
>> > > There's no way for the server to preserve the state from the initial
>> > > connection attempt and handle the new connection with v1. Would it not
>> > > make more sense to have the client connect and send its initial banner,
>> > > and then let the server decide what sort of banner to send based on
>> > > what the client sent?
>> >
>> > This is why the v2 banner has the features values (%lx with supported and
>> > required bits).  Clients and servers (connecters and accepters, really,
>> > since servers talk to each other too) can concurrently announce what they
>> > support and require and then go from there.  It doesn't help with the v1
>> > transition, but the addrvec changes (entity_addr_t now has a type
>> > indicating which protocol is spoken, and multiple addrs can be listed for
>> > any server) along with a mon port change (which we have to do anyway to
>> > switch to our IANA assigned port) handle the v1 transition.
>> >
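To make the supported/required exchange concrete, here is a minimal sketch of how both sides might evaluate each other's banner. The exact banner layout, the function names, and the rule "every bit one side requires must be supported by the other" are assumptions for illustration; the thread only says that both ends announce supported and required bits and go from there.

```python
import re

# Hypothetical layout: "ceph v2 <supported hex> <required hex>\n".
# The real field widths were still under discussion in this thread.
BANNER_RE = re.compile(rb"ceph v2 ([0-9a-f]+) ([0-9a-f]+)\n")


def make_banner(supported: int, required: int) -> bytes:
    """Encode the local feature masks into a v2-style banner."""
    return b"ceph v2 %x %x\n" % (supported, required)


def parse_banner(banner: bytes) -> tuple[int, int]:
    """Return (supported, required) or raise ValueError for a non-v2 banner."""
    match = BANNER_RE.fullmatch(banner)
    if match is None:
        raise ValueError("not a v2 banner")
    return int(match.group(1), 16), int(match.group(2), 16)


def negotiate(local_supported: int, local_required: int, peer_banner: bytes):
    """Return the common feature mask, or None if the two ends are incompatible.

    Assumed rule: each side's required bits must all be supported by the other.
    """
    peer_supported, peer_required = parse_banner(peer_banner)
    missing_on_peer = local_required & ~peer_supported
    missing_locally = peer_required & ~local_supported
    if missing_on_peer or missing_locally:
        return None
    return local_supported & peer_supported
```

Because both ends can run the same check on the other's banner without waiting for a reply, the banners can be sent concurrently, which is the property described above.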
>>
>> Ahh ok, I didn't realize ceph was squatting on a port! Ok, then if
>> you're planning to switch to a new well-known port anyway, then a clean
>> break like this makes more sense.
>>
>> I'll confess though that I don't quite understand the whole point of
>> the entity_addr_t's. What purpose does it serve to exchange network
>> addresses here?
>
> The main thing is that entity_addr_t contains a nonce to distinguish
> between different incarnations of the same server on the same port.  When
> an OSD is marked down and comes back up, the nonce will be different, and
> its peers can tell they're talking to the new/current instance without any
> stale state (or whatever).  Currently we guard this at the messenger
> layer, so that if we're trying to connect to a particular instance
> of osd.12 we will simply fail to connect if that port is occupied by
> someone else (e.g., a newer instance of osd.12 that we don't know about
> yet) so that we don't confuse them or ourselves.
>
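A rough sketch of the nonce check described above, for readers who have not looked at entity_addr_t. The class and function names here are hypothetical stand-ins, not the actual Ceph types, and the real structure carries more than these three fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityAddr:
    """Simplified stand-in for entity_addr_t: endpoint plus incarnation nonce."""
    ip: str
    port: int
    nonce: int


def same_endpoint(a: EntityAddr, b: EntityAddr) -> bool:
    """True if both addresses name the same IP and port, ignoring the nonce."""
    return (a.ip, a.port) == (b.ip, b.port)


def should_connect(expected: EntityAddr, advertised: EntityAddr) -> bool:
    """Proceed only if the peer is the exact incarnation we were looking for."""
    return expected == advertised


# The instance of osd.12 we know about, and a restarted one on the same port.
known = EntityAddr("10.0.0.12", 6800, nonce=1001)
restarted = EntityAddr("10.0.0.12", 6800, nonce=2002)
```

With these values, `same_endpoint(known, restarted)` is true but `should_connect(known, restarted)` is false, so a peer holding the stale address fails to connect instead of mixing old state with the new instance.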
>> Is it simply to inform the peer of other ways that it
>> can be reached?
>
> With the addrvec changes anybody connecting to (this version of) you
> should already have a list of all your addresses...
>
>> What happens if I pick up my laptop that's acting as a
>> ceph client and wander onto a new network. Does anything break?
>
> I'm sure something will break currently, but eventually I think we can
> shake these issues out... for clients, at least.  The servers all talk to
> each other so we assume there is no NAT gumming up the works.

Well, base RADOS will work okay since it doesn't worry about caching,
but neither RBD nor CephFS handles this at all right now. If you're
putting the machine to sleep, presumably you're breaking the timeouts
on CephFS caps and the RBD locks. You've witnessed the horror of
trying to work through the data safety requirements there, Jeff. ;)

Beyond that, though, I think all our data structures are ultimately
IP-based in terms of remembering sessions and such. We could maybe
recover using the messenger session cookies, but it wouldn't be a
trivial thing.
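
As a toy illustration of the difference, assuming a hypothetical session table (none of these names come from the Ceph code): a table keyed only by address cannot find a client that reappears from a new IP, while a secondary index on the session cookie can. This deliberately ignores the hard part, which is the cap and lock timeouts mentioned above.

```python
class SessionTable:
    """Hypothetical server-side session store, indexed by address and by cookie."""

    def __init__(self):
        self.by_addr = {}    # (ip, port) -> session dict
        self.by_cookie = {}  # cookie -> session dict

    def open(self, addr, cookie, state):
        """Register a new session under both keys."""
        session = {"addr": addr, "cookie": cookie, "state": state}
        self.by_addr[addr] = session
        self.by_cookie[cookie] = session
        return session

    def lookup_by_addr(self, addr):
        """Address-only lookup; returns None for a client that changed networks."""
        return self.by_addr.get(addr)

    def resume(self, cookie, new_addr):
        """Find the session by cookie and re-key it under the client's new address."""
        session = self.by_cookie.get(cookie)
        if session is None:
            return None
        del self.by_addr[session["addr"]]
        session["addr"] = new_addr
        self.by_addr[new_addr] = session
        return session


table = SessionTable()
table.open(("192.168.1.20", 40000), cookie=0xABC, state="caps held")
moved_to = ("10.8.0.7", 51234)  # the laptop after joining a new network
```

Here `table.lookup_by_addr(moved_to)` finds nothing, while `table.resume(0xABC, moved_to)` returns the original session and re-keys it under the new address.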
-Greg
