* msgr2 protocol
From: Sage Weil @ 2016-05-26 18:17 UTC (permalink / raw)
  To: ceph-devel

I wrote up a basic proposal for the new msgr2 protocol:

	http://pad.ceph.com/p/msgr2

It is pretty similar to the current protocol, with a few key changes:

1. The initial banner has a version number for protocol features supported 
and required.  This will allow optional behavior later.  The current 
protocol doesn't allow this (the banner string is fixed and has to match 
verbatim).

2. The auth handshake is a low-level msgr exchange now.  This more or less 
matches the MAuth and MAuthReply exchange with the mon.  Also, the 
authenticator/ticket presentation for established clients can be sent here 
as part of this exchange, instead of as part of the msg_connect and 
msg_connect_reply exchange.

3. The identification of peers during connect is moved to the TAG_IDENT 
stage.  This way it could happen after authentication and/or encryption, 
if we like.  (Not sure it matters.)

4. Signatures are a separate message now that follows the previous 
message.  If a message doesn't have a signature that follows, it is 
dropped.  Once authenticated we can sign all the other handshake exchanges 
(TAG_IDENT, etc.) as well as the messages themselves.

5. The reconnect behavior for stateful connections is a separate 
exchange. This keeps the stateless connections free of clutter.

6. A few changes in the auth_none and cephx integration will be needed.  
For example, all the current stubs assume that authentication happens over 
MAuth messages and authorization happens in an authorizer blob in 
ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to 
multiplex the cephx message blobs. Also, because the IDENT exchange 
happens later, we may need to pass additional info in the auth handshake 
messages (like the peer type, or whatever else is needed).

7. Lots of messages can go either way, and I tried to avoid a strict 
request/response model so that things could be pipelined, and we'd spend a 
minimal amount of time waiting for a response from the other end.  For 
example,

C:
 initiates connection
S:
 accepts connection
 -> banner
 -> TAG_AUTH_METHODS
C:
 -> banner
 -> TAG_AUTH_SET_METHOD
 -> TAG_AUTH_AUTH_REQUEST
S:
 -> TAG_AUTH_REPLY
C:
 -> TAG_ENCRYPT_BEGIN
 -> TAG_IDENT
 -> TAG_SIGNATURE
S:
 -> TAG_ENCRYPT_BEGIN
 -> TAG_IDENT
 -> TAG_SIGNATURE
C:
 -> TAG_START
 -> TAG_SIGNATURE
 -> TAG_MSG
 -> TAG_SIGNATURE
    ...
S:
 -> TAG_MSG
 -> TAG_SIGNATURE
    ...
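
To make the signature rule from item 4 concrete against the trace above, here is a minimal sketch of a receiver that accepts a signed-type frame only when TAG_SIGNATURE follows it. The tag names come from the trace; the choice of which tags need a signature (SIGNED_TAGS) and the function name accept_frames are assumptions for illustration, and the sketch only checks that a signature frame is present, not that it verifies.

```python
# Tag names are taken from the trace; which of them must be followed by a
# signature is an assumption made for this sketch, not part of the proposal.
SIGNED_TAGS = {"TAG_IDENT", "TAG_START", "TAG_MSG"}


def accept_frames(tags):
    """Return the frames a receiver would accept from an in-order tag stream.

    A frame whose tag is in SIGNED_TAGS is held back until the next frame
    arrives. It is accepted only if that next frame is TAG_SIGNATURE;
    otherwise it is dropped. All other frames are accepted immediately.
    """
    accepted = []
    pending = None  # signed-type frame still waiting for its signature
    for tag in tags:
        if tag == "TAG_SIGNATURE":
            if pending is not None:
                accepted.append(pending)
                pending = None
            # A stray signature with nothing pending is ignored.
            continue
        # Any other frame means a pending frame went unsigned: drop it.
        pending = None
        if tag in SIGNED_TAGS:
            pending = tag
        else:
            accepted.append(tag)
    return accepted


client_stream = [
    "TAG_ENCRYPT_BEGIN", "TAG_IDENT", "TAG_SIGNATURE",
    "TAG_START", "TAG_SIGNATURE",
    "TAG_MSG",                    # no signature follows: dropped
    "TAG_MSG", "TAG_SIGNATURE",
]
result = accept_frames(client_stream)
# result == ["TAG_ENCRYPT_BEGIN", "TAG_IDENT", "TAG_START", "TAG_MSG"]
```

Because the check is driven by the tag stream rather than by whose turn it is, the same function works for either direction of a pipelined exchange.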

Comments, please!  The exchange is a bit less structured as far as who 
sends what message, with the idea that we could pipeline a lot of it, but 
it may end up being too ambiguous.  Let me know what you think...
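
As an illustration of the banner change in item 1, the supported/required split could be checked as below. This is only a sketch: it assumes the features are carried as integer bitmasks, and the function name banner_compatible and the two feature bits are hypothetical, not taken from the proposal.

```python
def banner_compatible(local_supported, local_required,
                      peer_supported, peer_required):
    """Check two banners' feature bitmasks against each other.

    The connection can proceed only if each side supports every feature
    that the other side requires.
    """
    missing_locally = peer_required & ~local_supported
    missing_on_peer = local_required & ~peer_supported
    return missing_locally == 0 and missing_on_peer == 0


# Hypothetical feature bits, for illustration only.
FEATURE_ENCRYPTION = 1 << 0
FEATURE_COMPRESSION = 1 << 1

# We require encryption and the peer supports it; the peer requires nothing.
ok = banner_compatible(
    local_supported=FEATURE_ENCRYPTION | FEATURE_COMPRESSION,
    local_required=FEATURE_ENCRYPTION,
    peer_supported=FEATURE_ENCRYPTION,
    peer_required=0,
)

# The peer requires compression, which we do not support.
rejected = banner_compatible(
    local_supported=FEATURE_ENCRYPTION,
    local_required=0,
    peer_supported=FEATURE_ENCRYPTION | FEATURE_COMPRESSION,
    peer_required=FEATURE_COMPRESSION,
)
```

Unlike a fixed banner string that has to match verbatim, this lets a peer advertise optional features without breaking older peers that do not require them.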

sage

* RE: msgr2 protocol
From: Avner Ben Hanoch @ 2016-06-29 11:59 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub, Sage Weil; +Cc: Ceph Development


On Sat, 28 May 2016 11:19 AM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote: 
>On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>
>> If you do a connect and immediately write a few bytes to the TCP 
>> stream, does that actually translate to fewer packets?  I was guessing 
>> that the server writing the first bytes of the exchange would be fine 
>> but if it speeds things up for the client to optimistically start the 
>> exchange too we may as well...
>>
>
>While I haven't really looked at it recently, I don't think it'd be
>possible to embed data with the SYN packet using the plain vanilla tcp
>implementation. However, I believe that doing connect() and sending data
>immediately following it should improve things, specifically if doing
>async connect (as with the async messenger), but this still needs to be
>proven.
>
>Yehuda

I watch TCP traffic with network sniffers such as Wireshark, and this is what I always see on Linux: *sending data soon after connect saves a packet by combining the ACK from the last step of the TCP 3-way handshake with the first data packet*.
This holds even when I do some "short" activity between connect and send.

The sniffer will show you 3 packets on the stream:
1.	Client sends a SYN packet
2.	Server replies with a SYN-ACK packet
3.	Client sends a *data packet* that has the ACK flag set in it (this ACK completes the TCP 3-way handshake and makes 'accept' return on the server side)

Whether the socket is synchronous or asynchronous isn't relevant here, because 'connect' returns successfully upon receiving the SYN-ACK from the server, regardless of when the client actually sends the ACK that completes the TCP 3-way handshake (i.e., the client application doesn't need this ACK to rely on the socket as connected; only the server side needs it).

In my experience, even disabling Nagle (TCP_NODELAY) doesn't affect this behavior (probably because TCP_NODELAY only affects how quickly *data* is sent and does not change the TCP handshake behavior).

If you need a test application, I can provide one.
Avner
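
A minimal sketch of the kind of experiment described above, assuming a loopback listener and Python's standard socket module; the function name run_once and the payload are hypothetical. The script only performs connect() followed immediately by a send, with TCP_NODELAY set. It does not count packets itself, so whether the handshake ACK is combined with the first data packet still has to be observed with a sniffer (Wireshark, or tcpdump on the loopback interface) while it runs.

```python
import socket
import threading


def run_once(payload=b"banner"):
    """Connect over loopback, send immediately, return what the server read."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    received = []

    def serve():
        conn, _ = listener.accept()
        with conn:
            data = b""
            # Read until the whole payload has arrived or the peer closes.
            while len(data) < len(payload):
                chunk = conn.recv(len(payload) - len(data))
                if not chunk:
                    break
                data += chunk
            received.append(data)

    server = threading.Thread(target=serve)
    server.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle, as in the TCP_NODELAY case mentioned above.
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    client.connect(("127.0.0.1", port))
    client.sendall(payload)  # no other work between connect() and the send
    server.join(timeout=5)
    client.close()
    listener.close()
    return received[0] if received else None


echoed = run_once(b"banner")
```

To compare against the "short activity" case, a small delay or some computation could be added between connect() and sendall() and the capture repeated.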


