All of lore.kernel.org
 help / color / mirror / Atom feed
* msgr2 protocol
@ 2016-05-26 18:17 Sage Weil
  2016-05-27  4:41 ` Haomai Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Sage Weil @ 2016-05-26 18:17 UTC (permalink / raw)
  To: ceph-devel

I wrote up a basic proposal for the new msgr2 protocol:

	http://pad.ceph.com/p/msgr2

It is pretty similar to the current protocol, with a few key changes:

1. The initial banner has a version number for protocl features supported 
and required.  This will allow optional behavior later.  The current 
protocol doesn't allow this (the banner string is fixed and has to match 
verbatim).

2. The auth handshake is a low-level msgr exchange now.  This more or less 
matches the MAuth and MAuthReply exchange with the mon.  Also, the 
authenticator/ticket presentation for established clients can be sent here 
as part of this exchange, instead of as part of the msg_connect and 
msg_connect_reply exchnage.

3. The identification of peers during connect is moved to the TAG_IDENT 
stage.  This way it could happen after authentication and/or encryption, 
if we like.  (Not sure it matters.)

4. Signatures are a separate message now that follows the previous 
message.  If a message doesn't have a signature that follows, it is 
dropped.  Once authenticated we can sign all the other handshake exchanges 
(TAG_IDENT, etc.) as well as the messages themselves.

5. The reconnect behavior for stateful connections is a separate 
exchange. This keeps the stateless connections free of clutter.

6. A few changes in the auth_none and cephx integratoin will be needed.  
For example, all the current stubs assume that authentication happens over 
MAuth message and authorization happens in an authorizer blob in 
ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to 
multiplex the cephx message blobs. Also, because the IDENT exchanges 
happens later, we may need to pass additional info in the auth handshake 
messages (like the peer type, or whatever else is needed).

7. Lots of messages can go either way, and I tried ot avoid a strict 
request/response model so that things could be pipelined, and we'd spend a 
minimal amount of time waiting for a response from the other end.  For 
example,

C:
 initiates connection
S:
 accepts connection
 -> banner
 -> TAG_AUTH_METHODS
C:
 -> banner
 -> TAG_AUTH_SET_METHOD
 -> TAG_AUTH_AUTH_REQUEST
S:
 -> TAG_AUTH_REPLY
C:
 -> TAG_ENCRYPT_BEGIN
 -> TAG_IDENT
 -> TAG_SIGNATURE
S:
 -> TAG_ENCRYPT_BEGIN
 -> TAG_IDENT
 -> TAG_SIGNATURE
C:
 -> TAG_START
 -> TAG_SIGNATURE
 -> TAG_MSG
 -> TAG_SIGNATURE
    ...
S:
 -> TAG_MSG
 -> TAG_SIGNATURE
    ...

Comments, please!  The exhange is a bit less structured as far as who 
sends what message, with the idea that we could pipeline a lot of it, but 
it may end up being too ambiguous.  Let me know what you think...

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-26 18:17 msgr2 protocol Sage Weil
@ 2016-05-27  4:41 ` Haomai Wang
  2016-05-27  4:45   ` Haomai Wang
                     ` (2 more replies)
  2016-05-27  9:44 ` Yehuda Sadeh-Weinraub
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 49+ messages in thread
From: Haomai Wang @ 2016-05-27  4:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, May 27, 2016 at 2:17 AM, Sage Weil <sweil@redhat.com> wrote:
> I wrote up a basic proposal for the new msgr2 protocol:
>
>         http://pad.ceph.com/p/msgr2
>
> It is pretty similar to the current protocol, with a few key changes:
>
> 1. The initial banner has a version number for protocl features supported
> and required.  This will allow optional behavior later.  The current
> protocol doesn't allow this (the banner string is fixed and has to match
> verbatim).

Does msgrv2 need to talk with v1peer? Or we just reject this handshake?

If we reject v1, is it possible give our a chance to reset message version?

>
> 2. The auth handshake is a low-level msgr exchange now.  This more or less
> matches the MAuth and MAuthReply exchange with the mon.  Also, the
> authenticator/ticket presentation for established clients can be sent here
> as part of this exchange, instead of as part of the msg_connect and
> msg_connect_reply exchnage.

S: TAG_AUTH_METHODS          # list methods
    __le32 num_methods;
    __le32 methods[num_methods];   // CEPH_AUTH_{NONE, CEPHX}

From my view, it looks we need to force a method instead of letting
peer side select? What's use case that we allow client side to decide
method?

>
> 3. The identification of peers during connect is moved to the TAG_IDENT
> stage.  This way it could happen after authentication and/or encryption,
> if we like.  (Not sure it matters.)

C or S: TAG_ENCRYPT_BEGIN    # signal that all subsequent traffic will
be encrypted

__le32 len

<method specific payload>

do we also need encrypt info handshake? like key/algorithm?

>
> 4. Signatures are a separate message now that follows the previous
> message.  If a message doesn't have a signature that follows, it is
> dropped.  Once authenticated we can sign all the other handshake exchanges
> (TAG_IDENT, etc.) as well as the messages themselves.
>
> 5. The reconnect behavior for stateful connections is a separate
> exchange. This keeps the stateless connections free of clutter.

It will be a big task ......

>
> 6. A few changes in the auth_none and cephx integratoin will be needed.
> For example, all the current stubs assume that authentication happens over
> MAuth message and authorization happens in an authorizer blob in
> ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> multiplex the cephx message blobs. Also, because the IDENT exchanges
> happens later, we may need to pass additional info in the auth handshake
> messages (like the peer type, or whatever else is needed).

Hmm, only need peer type? if address is needed, IDENT stage must
happen before auth

>
> 7. Lots of messages can go either way, and I tried ot avoid a strict
> request/response model so that things could be pipelined, and we'd spend a
> minimal amount of time waiting for a response from the other end.  For
> example,
>
> C:
>  initiates connection
> S:
>  accepts connection
>  -> banner
>  -> TAG_AUTH_METHODS
> C:
>  -> banner
>  -> TAG_AUTH_SET_METHOD
>  -> TAG_AUTH_AUTH_REQUEST
> S:
>  -> TAG_AUTH_REPLY
> C:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE
> S:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE
> C:
>  -> TAG_START
>  -> TAG_SIGNATURE
>  -> TAG_MSG
>  -> TAG_SIGNATURE
>     ...
> S:
>  -> TAG_MSG
>  -> TAG_SIGNATURE
>     ...
>
> Comments, please!  The exhange is a bit less structured as far as who
> sends what message, with the idea that we could pipeline a lot of it, but
> it may end up being too ambiguous.  Let me know what you think...

we may also change ceph_msg_header/ceph_msg_footer :

struct ceph_msg_header {
__le64 seq;       /* message seq# for this session */
__le64 tid;       /* transaction id */
__le16 type;      /* message type */
__le16 priority;  /* priority.  higher value == higher priority */
__le16 version;   /* version of message encoding */

__le32 front_len; /* bytes in main payload */
__le32 middle_len;/* bytes in middle payload */
__le32 data_len;  /* bytes of data payload */
__le16 data_off;  /* sender: include full offset;
    receiver: mask against ~PAGE_MASK */

struct ceph_entity_name src;

/* oldest code we think can decode this.  unknown if zero. */
__le16 compat_version;
__le16 reserved;
__le32 crc;       /* header crc32c */
} __attribute__ ((packed));

we may drop middle_len, src thing.

And could we drop footer and move crc to header? Because for each
message, we always add a system call for footer since it can't be
prefetched in userspace memory. Most of rpc impl only add a header to
actual data.

>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-27  4:41 ` Haomai Wang
@ 2016-05-27  4:45   ` Haomai Wang
  2016-05-27  8:28   ` Marcus Watts
  2016-05-27 17:28   ` Sage Weil
  2 siblings, 0 replies; 49+ messages in thread
From: Haomai Wang @ 2016-05-27  4:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, May 27, 2016 at 12:41 PM, Haomai Wang <haomai@xsky.com> wrote:
> On Fri, May 27, 2016 at 2:17 AM, Sage Weil <sweil@redhat.com> wrote:
>> I wrote up a basic proposal for the new msgr2 protocol:
>>
>>         http://pad.ceph.com/p/msgr2
>>
>> It is pretty similar to the current protocol, with a few key changes:
>>
>> 1. The initial banner has a version number for protocl features supported
>> and required.  This will allow optional behavior later.  The current
>> protocol doesn't allow this (the banner string is fixed and has to match
>> verbatim).
>
> Does msgrv2 need to talk with v1peer? Or we just reject this handshake?
>
> If we reject v1, is it possible give our a chance to reset message version?
>
>>
>> 2. The auth handshake is a low-level msgr exchange now.  This more or less
>> matches the MAuth and MAuthReply exchange with the mon.  Also, the
>> authenticator/ticket presentation for established clients can be sent here
>> as part of this exchange, instead of as part of the msg_connect and
>> msg_connect_reply exchnage.
>
> S: TAG_AUTH_METHODS          # list methods
>     __le32 num_methods;
>     __le32 methods[num_methods];   // CEPH_AUTH_{NONE, CEPHX}
>
> From my view, it looks we need to force a method instead of letting
> peer side select? What's use case that we allow client side to decide
> method?
>
>>
>> 3. The identification of peers during connect is moved to the TAG_IDENT
>> stage.  This way it could happen after authentication and/or encryption,
>> if we like.  (Not sure it matters.)
>
> C or S: TAG_ENCRYPT_BEGIN    # signal that all subsequent traffic will
> be encrypted
>
> __le32 len
>
> <method specific payload>
>
> do we also need encrypt info handshake? like key/algorithm?
>
>>
>> 4. Signatures are a separate message now that follows the previous
>> message.  If a message doesn't have a signature that follows, it is
>> dropped.  Once authenticated we can sign all the other handshake exchanges
>> (TAG_IDENT, etc.) as well as the messages themselves.
>>
>> 5. The reconnect behavior for stateful connections is a separate
>> exchange. This keeps the stateless connections free of clutter.
>
> It will be a big task ......
>
>>
>> 6. A few changes in the auth_none and cephx integratoin will be needed.
>> For example, all the current stubs assume that authentication happens over
>> MAuth message and authorization happens in an authorizer blob in
>> ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
>> multiplex the cephx message blobs. Also, because the IDENT exchanges
>> happens later, we may need to pass additional info in the auth handshake
>> messages (like the peer type, or whatever else is needed).
>
> Hmm, only need peer type? if address is needed, IDENT stage must
> happen before auth
>
>>
>> 7. Lots of messages can go either way, and I tried ot avoid a strict
>> request/response model so that things could be pipelined, and we'd spend a
>> minimal amount of time waiting for a response from the other end.  For
>> example,
>>
>> C:
>>  initiates connection
>> S:
>>  accepts connection
>>  -> banner
>>  -> TAG_AUTH_METHODS
>> C:
>>  -> banner
>>  -> TAG_AUTH_SET_METHOD
>>  -> TAG_AUTH_AUTH_REQUEST
>> S:
>>  -> TAG_AUTH_REPLY
>> C:
>>  -> TAG_ENCRYPT_BEGIN
>>  -> TAG_IDENT
>>  -> TAG_SIGNATURE
>> S:
>>  -> TAG_ENCRYPT_BEGIN
>>  -> TAG_IDENT
>>  -> TAG_SIGNATURE
>> C:
>>  -> TAG_START
>>  -> TAG_SIGNATURE
>>  -> TAG_MSG
>>  -> TAG_SIGNATURE
>>     ...
>> S:
>>  -> TAG_MSG
>>  -> TAG_SIGNATURE
>>     ...
>>
>> Comments, please!  The exhange is a bit less structured as far as who
>> sends what message, with the idea that we could pipeline a lot of it, but
>> it may end up being too ambiguous.  Let me know what you think...

we also could add ack_seq to ceph_msg_header to avoid extra ack
tag(1+8). For heavy client io or repop, ack aggregation could help to
reduce a lot kernel cpu util.

>
> we may also change ceph_msg_header/ceph_msg_footer :
>
> struct ceph_msg_header {
> __le64 seq;       /* message seq# for this session */
> __le64 tid;       /* transaction id */
> __le16 type;      /* message type */
> __le16 priority;  /* priority.  higher value == higher priority */
> __le16 version;   /* version of message encoding */
>
> __le32 front_len; /* bytes in main payload */
> __le32 middle_len;/* bytes in middle payload */
> __le32 data_len;  /* bytes of data payload */
> __le16 data_off;  /* sender: include full offset;
>     receiver: mask against ~PAGE_MASK */
>
> struct ceph_entity_name src;
>
> /* oldest code we think can decode this.  unknown if zero. */
> __le16 compat_version;
> __le16 reserved;
> __le32 crc;       /* header crc32c */
> } __attribute__ ((packed));
>
> we may drop middle_len, src thing.
>
> And could we drop footer and move crc to header? Because for each
> message, we always add a system call for footer since it can't be
> prefetched in userspace memory. Most of rpc impl only add a header to
> actual data.
>
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-27  4:41 ` Haomai Wang
  2016-05-27  4:45   ` Haomai Wang
@ 2016-05-27  8:28   ` Marcus Watts
  2016-05-27 17:33     ` Sage Weil
  2016-05-27 17:28   ` Sage Weil
  2 siblings, 1 reply; 49+ messages in thread
From: Marcus Watts @ 2016-05-27  8:28 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel

Various wrote:...
On Fri, May 27, 2016 at 12:41:35PM +0800, Haomai Wang wrote:
> On Fri, May 27, 2016 at 2:17 AM, Sage Weil <sweil@redhat.com> wrote:
...
> S: TAG_AUTH_METHODS          # list methods
>     __le32 num_methods;
>     __le32 methods[num_methods];   // CEPH_AUTH_{NONE, CEPHX}
> 
> From my view, it looks we need to force a method instead of letting
> peer side select? What's use case that we allow client side to decide
> method?
...

Some cases to consider is: v2 server supports "v2.1" and "v2.2" (where .2 includes
some new method or other thing).  v2 client that only supports "v2.1" needs
to decide differently than another v2 client that also supports "v2.2".

Or, 2 clients, one sits on the "super fast secure" server room network,
another sits on the "slow insecure" link.  On the super fast connection
encryption overhead might result in unacceptable performance, yet on
the insecure link it might be absolutely essential.  The client is usually
in a much better position to know this than the server.

In the flow that you quoted, looks like when server advertises its methods
it doesn't yet know what the client supports, so it can't force any method.
The client is the one that knows what it supports, and learns what the server
supports, and therefore is the first one that can decide what it should do.

Of course you still need to somehow protect against a "downgrade" MITM attack.

					-Marcus Watts

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-26 18:17 msgr2 protocol Sage Weil
  2016-05-27  4:41 ` Haomai Wang
@ 2016-05-27  9:44 ` Yehuda Sadeh-Weinraub
  2016-05-27 17:37   ` Sage Weil
  2016-06-02 18:11 ` Gregory Farnum
  2016-06-02 18:16 ` Gregory Farnum
  3 siblings, 1 reply; 49+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2016-05-27  9:44 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> I wrote up a basic proposal for the new msgr2 protocol:
>
>         http://pad.ceph.com/p/msgr2
>
> It is pretty similar to the current protocol, with a few key changes:
>
> 1. The initial banner has a version number for protocl features supported
> and required.  This will allow optional behavior later.  The current
> protocol doesn't allow this (the banner string is fixed and has to match
> verbatim).
>
> 2. The auth handshake is a low-level msgr exchange now.  This more or less
> matches the MAuth and MAuthReply exchange with the mon.  Also, the
> authenticator/ticket presentation for established clients can be sent here
> as part of this exchange, instead of as part of the msg_connect and
> msg_connect_reply exchnage.
>
> 3. The identification of peers during connect is moved to the TAG_IDENT
> stage.  This way it could happen after authentication and/or encryption,
> if we like.  (Not sure it matters.)
>
> 4. Signatures are a separate message now that follows the previous
> message.  If a message doesn't have a signature that follows, it is
> dropped.  Once authenticated we can sign all the other handshake exchanges
> (TAG_IDENT, etc.) as well as the messages themselves.
>

Is there a reason why the signature needs to be a separate message? It
would add extra overhead, and it seems to me that it would complicate
implementation (in terms of message state and such).

> 5. The reconnect behavior for stateful connections is a separate
> exchange. This keeps the stateless connections free of clutter.
>
> 6. A few changes in the auth_none and cephx integratoin will be needed.
> For example, all the current stubs assume that authentication happens over
> MAuth message and authorization happens in an authorizer blob in
> ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> multiplex the cephx message blobs. Also, because the IDENT exchanges
> happens later, we may need to pass additional info in the auth handshake
> messages (like the peer type, or whatever else is needed).
>
> 7. Lots of messages can go either way, and I tried ot avoid a strict
> request/response model so that things could be pipelined, and we'd spend a
> minimal amount of time waiting for a response from the other end.  For
> example,
>
> C:
>  initiates connection
> S:
>  accepts connection
>  -> banner
>  -> TAG_AUTH_METHODS
> C:
>  -> banner
>  -> TAG_AUTH_SET_METHOD
>  -> TAG_AUTH_AUTH_REQUEST
> S:
>  -> TAG_AUTH_REPLY
> C:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE

Can we have the client start authenticating with some predetermined
auth params, and resort to having the server responding with
AUTH_METHODS only if it doesn't support the method selected by the
client. Even if not having it preconfigured, the auth method usually
doesn't change across connection instances, so we can have the client
cache that info per server. That would then be something like this:

a first connection:

C:
 initiates connection
 -> banner
 -> TAG_AUTH_GET_METHODS <-- be explicit
 -> TAG_AUTH_SET_METHOD  <-- opportunistically trying a specific
method type anyway
 -> TAG_AUTH_AUTH_REQUEST

S:
 accepts connection
 -> banner
 -> TAG_AUTH_REPLY


a followup connection:


C:
 initiates connection
 -> banner
 -> TAG_AUTH_SET_METHOD
 -> TAG_AUTH_AUTH_REQUEST

S:
 accepts connection
 -> banner
 -> TAG_AUTH_REPLY



> S:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE
> C:
>  -> TAG_START
>  -> TAG_SIGNATURE
>  -> TAG_MSG
>  -> TAG_SIGNATURE
>     ...
> S:
>  -> TAG_MSG
>  -> TAG_SIGNATURE
>     ...
>
> Comments, please!  The exhange is a bit less structured as far as who
> sends what message, with the idea that we could pipeline a lot of it, but
> it may end up being too ambiguous.  Let me know what you think...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-27  4:41 ` Haomai Wang
  2016-05-27  4:45   ` Haomai Wang
  2016-05-27  8:28   ` Marcus Watts
@ 2016-05-27 17:28   ` Sage Weil
  2 siblings, 0 replies; 49+ messages in thread
From: Sage Weil @ 2016-05-27 17:28 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On Fri, 27 May 2016, Haomai Wang wrote:
> On Fri, May 27, 2016 at 2:17 AM, Sage Weil <sweil@redhat.com> wrote:
> > I wrote up a basic proposal for the new msgr2 protocol:
> >
> >         http://pad.ceph.com/p/msgr2
> >
> > It is pretty similar to the current protocol, with a few key changes:
> >
> > 1. The initial banner has a version number for protocl features supported
> > and required.  This will allow optional behavior later.  The current
> > protocol doesn't allow this (the banner string is fixed and has to match
> > verbatim).
> 
> Does msgrv2 need to talk with v1peer? Or we just reject this handshake?

They won't be compatible.  This is partly why the wip-addr stuff is 
important, and we'll make this switch coincide with the new monitor port 
switch.
 
> If we reject v1, is it possible give our a chance to reset message version?

Yep!  Everything is on the table...

> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
> > authenticator/ticket presentation for established clients can be sent here
> > as part of this exchange, instead of as part of the msg_connect and
> > msg_connect_reply exchnage.
> 
> S: TAG_AUTH_METHODS          # list methods
>     __le32 num_methods;
>     __le32 methods[num_methods];   // CEPH_AUTH_{NONE, CEPHX}
> 
> From my view, it looks we need to force a method instead of letting
> peer side select? What's use case that we allow client side to decide
> method?

The idea is the server would advertise, say, cephx and kerberos auth 
methods (although 99% of users for now will be just cephx).  The client 
would choose.  If the server only wants to support one thing, it can 
advertise just that one thing.

> > 3. The identification of peers during connect is moved to the TAG_IDENT
> > stage.  This way it could happen after authentication and/or encryption,
> > if we like.  (Not sure it matters.)
> 
> C or S: TAG_ENCRYPT_BEGIN    # signal that all subsequent traffic will
> be encrypted
> 
> __le32 len
> 
> <method specific payload>
> 
> do we also need encrypt info handshake? like key/algorithm?

My thought was that anything after this (including other handshaking) 
would be encrypted.

Marcus's comment about MITM downgrade attacks is the main thing I think we 
need to worry about here, though.  I'm not sure how this is normally 
handled to prevent a MITM from just dropping this part of the exchange.  
Maybe the TAG_START should have an auth payload that can allow the auth 
framework to have positive signed statement about what has already been 
negotiated?

> > 4. Signatures are a separate message now that follows the previous
> > message.  If a message doesn't have a signature that follows, it is
> > dropped.  Once authenticated we can sign all the other handshake exchanges
> > (TAG_IDENT, etc.) as well as the messages themselves.
> >
> > 5. The reconnect behavior for stateful connections is a separate
> > exchange. This keeps the stateless connections free of clutter.
> 
> It will be a big task ......

It's the TAG_RECONNECT_* messages in the doc... same basic behavior as 
before, just separating out the seq checks from the feature bits.

> > 6. A few changes in the auth_none and cephx integratoin will be needed.
> > For example, all the current stubs assume that authentication happens over
> > MAuth message and authorization happens in an authorizer blob in
> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> > multiplex the cephx message blobs. Also, because the IDENT exchanges
> > happens later, we may need to pass additional info in the auth handshake
> > messages (like the peer type, or whatever else is needed).
> 
> Hmm, only need peer type? if address is needed, IDENT stage must
> happen before auth

The auth plugin will have access to whatever it needs for the auth 
handshake, and can make sure that that part of the exchange is secure 
(signed or wahtever).  Relying on this informtion is problematic since it 
could be modified by a MITM.

> > 7. Lots of messages can go either way, and I tried ot avoid a strict
> > request/response model so that things could be pipelined, and we'd spend a
> > minimal amount of time waiting for a response from the other end.  For
> > example,
> >
> > C:
> >  initiates connection
> > S:
> >  accepts connection
> >  -> banner
> >  -> TAG_AUTH_METHODS
> > C:
> >  -> banner
> >  -> TAG_AUTH_SET_METHOD
> >  -> TAG_AUTH_AUTH_REQUEST
> > S:
> >  -> TAG_AUTH_REPLY
> > C:
> >  -> TAG_ENCRYPT_BEGIN
> >  -> TAG_IDENT
> >  -> TAG_SIGNATURE
> > S:
> >  -> TAG_ENCRYPT_BEGIN
> >  -> TAG_IDENT
> >  -> TAG_SIGNATURE
> > C:
> >  -> TAG_START
> >  -> TAG_SIGNATURE
> >  -> TAG_MSG
> >  -> TAG_SIGNATURE
> >     ...
> > S:
> >  -> TAG_MSG
> >  -> TAG_SIGNATURE
> >     ...
> >
> > Comments, please!  The exhange is a bit less structured as far as who
> > sends what message, with the idea that we could pipeline a lot of it, but
> > it may end up being too ambiguous.  Let me know what you think...
> 
> we may also change ceph_msg_header/ceph_msg_footer :
> 
> struct ceph_msg_header {
> __le64 seq;       /* message seq# for this session */
> __le64 tid;       /* transaction id */
> __le16 type;      /* message type */
> __le16 priority;  /* priority.  higher value == higher priority */
> __le16 version;   /* version of message encoding */
> 
> __le32 front_len; /* bytes in main payload */
> __le32 middle_len;/* bytes in middle payload */
> __le32 data_len;  /* bytes of data payload */
> __le16 data_off;  /* sender: include full offset;
>     receiver: mask against ~PAGE_MASK */
> 
> struct ceph_entity_name src;
> 
> /* oldest code we think can decode this.  unknown if zero. */
> __le16 compat_version;
> __le16 reserved;
> __le32 crc;       /* header crc32c */
> } __attribute__ ((packed));
> 
> we may drop middle_len, src thing.
> 
> And could we drop footer and move crc to header? Because for each
> message, we always add a system call for footer since it can't be
> prefetched in userspace memory. Most of rpc impl only add a header to
> actual data.

Yeah, good idea.  That and including the ack seq in the message header 
would help out.

Want to put the new ceph_msg_header in the pad?

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-27  8:28   ` Marcus Watts
@ 2016-05-27 17:33     ` Sage Weil
  0 siblings, 0 replies; 49+ messages in thread
From: Sage Weil @ 2016-05-27 17:33 UTC (permalink / raw)
  To: Marcus Watts; +Cc: Haomai Wang, ceph-devel

On Fri, 27 May 2016, Marcus Watts wrote:
> Of course you still need to somehow protect against a "downgrade" MITM 
> attack.

How is this generally done?  I'm thinking about, for example, a MITM who 
silently drops the TAG_ENCRYPT message in the protocol I described.  It 
seems like we need a secure TAG_AUTH_something that makes a statement 
about what has been negotiated before we start the message exchange or 
else some part of the initial exchange could have been dropped.

?

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-27  9:44 ` Yehuda Sadeh-Weinraub
@ 2016-05-27 17:37   ` Sage Weil
  2016-05-28 18:19     ` Yehuda Sadeh-Weinraub
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-05-27 17:37 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: ceph-devel

On Fri, 27 May 2016, Yehuda Sadeh-Weinraub wrote:
> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> > I wrote up a basic proposal for the new msgr2 protocol:
> >
> >         http://pad.ceph.com/p/msgr2
> >
> > It is pretty similar to the current protocol, with a few key changes:
> >
> > 1. The initial banner has a version number for protocl features supported
> > and required.  This will allow optional behavior later.  The current
> > protocol doesn't allow this (the banner string is fixed and has to match
> > verbatim).
> >
> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
> > authenticator/ticket presentation for established clients can be sent here
> > as part of this exchange, instead of as part of the msg_connect and
> > msg_connect_reply exchnage.
> >
> > 3. The identification of peers during connect is moved to the TAG_IDENT
> > stage.  This way it could happen after authentication and/or encryption,
> > if we like.  (Not sure it matters.)
> >
> > 4. Signatures are a separate message now that follows the previous
> > message.  If a message doesn't have a signature that follows, it is
> > dropped.  Once authenticated we can sign all the other handshake exchanges
> > (TAG_IDENT, etc.) as well as the messages themselves.
> >
> 
> Is there a reason why the signature needs to be a separate message? It
> would add extra overhead, and it seems to me that it would complicate
> implementation (in terms of message state and such).

It doesn't have to be--I was just wanting to keep things simple.  We could 
similarly make it part of the underlying format, e.g.,

 tag byte
 8 byte signature
 payload

or whatever.  That's basically the same thing, except we save 1 byte.

> > 5. The reconnect behavior for stateful connections is a separate
> > exchange. This keeps the stateless connections free of clutter.
> >
> > 6. A few changes in the auth_none and cephx integratoin will be needed.
> > For example, all the current stubs assume that authentication happens over
> > MAuth message and authorization happens in an authorizer blob in
> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> > multiplex the cephx message blobs. Also, because the IDENT exchanges
> > happens later, we may need to pass additional info in the auth handshake
> > messages (like the peer type, or whatever else is needed).
> >
> > 7. Lots of messages can go either way, and I tried ot avoid a strict
> > request/response model so that things could be pipelined, and we'd spend a
> > minimal amount of time waiting for a response from the other end.  For
> > example,
> >
> > C:
> >  initiates connection
> > S:
> >  accepts connection
> >  -> banner
> >  -> TAG_AUTH_METHODS
> > C:
> >  -> banner
> >  -> TAG_AUTH_SET_METHOD
> >  -> TAG_AUTH_AUTH_REQUEST
> > S:
> >  -> TAG_AUTH_REPLY
> > C:
> >  -> TAG_ENCRYPT_BEGIN
> >  -> TAG_IDENT
> >  -> TAG_SIGNATURE
> 
> Can we have the client start authenticating with some predetermined
> auth params, and resort to having the server responding with
> AUTH_METHODS only if it doesn't support the method selected by the
> client. Even if not having it preconfigured, the auth method usually
> doesn't change across connection instances, so we can have the client
> cache that info per server. That would then be something like this:
> 
> a first connection:
> 
> C:
>  initiates connection
>  -> banner
>  -> TAG_AUTH_GET_METHODS <-- be explicit
>  -> TAG_AUTH_SET_METHOD  <-- opportunistically trying a specific
> method type anyway
>  -> TAG_AUTH_AUTH_REQUEST
> 
> S:
>  accepts connection
>  -> banner
>  -> TAG_AUTH_REPLY
> 
> 
> a followup connection:
> 
> 
> C:
>  initiates connection
>  -> banner
>  -> TAG_AUTH_SET_METHOD
>  -> TAG_AUTH_AUTH_REQUEST
> 
> S:
>  accepts connection
>  -> banner
>  -> TAG_AUTH_REPLY

Yeah.. of even just make the initial connection try it's preferred method 
and only do the GET_METHODS if it is rejected.

If you do a connect and immediately write a few bytes to teh TCP stream, 
does that actaully translate to fewer packets?  I was guessing that the 
server writing the first bytes of the exchange would be fine but if it 
speeds things up for the client to optimistically start the exchange too 
we may as well...

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-27 17:37   ` Sage Weil
@ 2016-05-28 18:19     ` Yehuda Sadeh-Weinraub
  2016-06-02 15:43       ` Sage Weil
  0 siblings, 1 reply; 49+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2016-05-28 18:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 27 May 2016, Yehuda Sadeh-Weinraub wrote:
>> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
>> > I wrote up a basic proposal for the new msgr2 protocol:
>> >
>> >         http://pad.ceph.com/p/msgr2
>> >
>> > It is pretty similar to the current protocol, with a few key changes:
>> >
>> > 1. The initial banner has a version number for protocl features supported
>> > and required.  This will allow optional behavior later.  The current
>> > protocol doesn't allow this (the banner string is fixed and has to match
>> > verbatim).
>> >
>> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
>> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
>> > authenticator/ticket presentation for established clients can be sent here
>> > as part of this exchange, instead of as part of the msg_connect and
>> > msg_connect_reply exchnage.
>> >
>> > 3. The identification of peers during connect is moved to the TAG_IDENT
>> > stage.  This way it could happen after authentication and/or encryption,
>> > if we like.  (Not sure it matters.)
>> >
>> > 4. Signatures are a separate message now that follows the previous
>> > message.  If a message doesn't have a signature that follows, it is
>> > dropped.  Once authenticated we can sign all the other handshake exchanges
>> > (TAG_IDENT, etc.) as well as the messages themselves.
>> >
>>
>> Is there a reason why the signature needs to be a separate message? It
>> would add extra overhead, and it seems to me that it would complicate
>> implementation (in terms of message state and such).
>
> It doesn't have to be--I was just wanting to keep things simple.  We could
> similarly make it part of the underlying format, e.g.,
>
>  tag byte
>  8 byte signature
>  payload

signature should come after payload, but yeah. Might need to define
extended envelope to allow future extensions.

>
> or whatever.  That's basically the same thing, except we save 1 byte.
>
>> > 5. The reconnect behavior for stateful connections is a separate
>> > exchange. This keeps the stateless connections free of clutter.
>> >
>> > 6. A few changes in the auth_none and cephx integratoin will be needed.
>> > For example, all the current stubs assume that authentication happens over
>> > MAuth message and authorization happens in an authorizer blob in
>> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
>> > multiplex the cephx message blobs. Also, because the IDENT exchanges
>> > happens later, we may need to pass additional info in the auth handshake
>> > messages (like the peer type, or whatever else is needed).
>> >
>> > 7. Lots of messages can go either way, and I tried ot avoid a strict
>> > request/response model so that things could be pipelined, and we'd spend a
>> > minimal amount of time waiting for a response from the other end.  For
>> > example,
>> >
>> > C:
>> >  initiates connection
>> > S:
>> >  accepts connection
>> >  -> banner
>> >  -> TAG_AUTH_METHODS
>> > C:
>> >  -> banner
>> >  -> TAG_AUTH_SET_METHOD
>> >  -> TAG_AUTH_AUTH_REQUEST
>> > S:
>> >  -> TAG_AUTH_REPLY
>> > C:
>> >  -> TAG_ENCRYPT_BEGIN
>> >  -> TAG_IDENT
>> >  -> TAG_SIGNATURE
>>
>> Can we have the client start authenticating with some predetermined
>> auth params, and resort to having the server responding with
>> AUTH_METHODS only if it doesn't support the method selected by the
>> client. Even if not having it preconfigured, the auth method usually
>> doesn't change across connection instances, so we can have the client
>> cache that info per server. That would then be something like this:
>>
>> a first connection:
>>
>> C:
>>  initiates connection
>>  -> banner
>>  -> TAG_AUTH_GET_METHODS <-- be explicit
>>  -> TAG_AUTH_SET_METHOD  <-- opportunistically trying a specific
>> method type anyway
>>  -> TAG_AUTH_AUTH_REQUEST
>>
>> S:
>>  accepts connection
>>  -> banner
>>  -> TAG_AUTH_REPLY
>>
>>
>> a followup connection:
>>
>>
>> C:
>>  initiates connection
>>  -> banner
>>  -> TAG_AUTH_SET_METHOD
>>  -> TAG_AUTH_AUTH_REQUEST
>>
>> S:
>>  accepts connection
>>  -> banner
>>  -> TAG_AUTH_REPLY
>
> Yeah.. of even just make the initial connection try it's preferred method
> and only do the GET_METHODS if it is rejected.
>

Right. In any case, the protocol should enable this flexibility.


> If you do a connect and immediately write a few bytes to teh TCP stream,
> does that actaully translate to fewer packets?  I was guessing that the
> server writing the first bytes of the exchange would be fine but if it
> speeds things up for the client to optimistically start the exchange too
> we may as well...
>

While haven't really looked at it recently, I don't think it'd be
possible to embed data with the SYN packet using the plain vanilla tcp
implementation. However, I believe that doing connect() and sending
data immediately following it should improve things, specifically if
doing async connect (as with the async messenger), but this still
needs to be proven.

Yehuda

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-28 18:19     ` Yehuda Sadeh-Weinraub
@ 2016-06-02 15:43       ` Sage Weil
  2016-06-02 15:59         ` Haomai Wang
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-06-02 15:43 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: ceph-devel

Based on the discussion during CDM yesterday I wrote up a nicer-looking 
spec of the protocol in rst:

	https://github.com/ceph/ceph/pull/9461

Please let me know if this looks right.  I have two questions:

1. Is TAG_START is really necessary?  I guess it doesn't hurt, and makes 
it easy to add flags later.

2. We don't explicitly have anything here that indicates a session is 
stateless or stateful.  Currently this is determined by the Policy stuff 
on either end and the peers just happen to agree.  Setting/asserting 
it explicitly has part of the handshake seems like a good idea.  Maybe a 
flags field in the TAG_IDENT message, with a flags for lossy/lossess, 
whether we initiate connections (true for client or p2p servers)?

sage


On Sat, 28 May 2016, Yehuda Sadeh-Weinraub wrote:

> On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Fri, 27 May 2016, Yehuda Sadeh-Weinraub wrote:
> >> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> >> > I wrote up a basic proposal for the new msgr2 protocol:
> >> >
> >> >         http://pad.ceph.com/p/msgr2
> >> >
> >> > It is pretty similar to the current protocol, with a few key changes:
> >> >
> >> > 1. The initial banner has a version number for protocl features supported
> >> > and required.  This will allow optional behavior later.  The current
> >> > protocol doesn't allow this (the banner string is fixed and has to match
> >> > verbatim).
> >> >
> >> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
> >> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
> >> > authenticator/ticket presentation for established clients can be sent here
> >> > as part of this exchange, instead of as part of the msg_connect and
> >> > msg_connect_reply exchnage.
> >> >
> >> > 3. The identification of peers during connect is moved to the TAG_IDENT
> >> > stage.  This way it could happen after authentication and/or encryption,
> >> > if we like.  (Not sure it matters.)
> >> >
> >> > 4. Signatures are a separate message now that follows the previous
> >> > message.  If a message doesn't have a signature that follows, it is
> >> > dropped.  Once authenticated we can sign all the other handshake exchanges
> >> > (TAG_IDENT, etc.) as well as the messages themselves.
> >> >
> >>
> >> Is there a reason why the signature needs to be a separate message? It
> >> would add extra overhead, and it seems to me that it would complicate
> >> implementation (in terms of message state and such).
> >
> > It doesn't have to be--I was just wanting to keep things simple.  We could
> > similarly make it part of the underlying format, e.g.,
> >
> >  tag byte
> >  8 byte signature
> >  payload
> 
> signature should come after payload, but yeah. Might need to define
> extended envelope to allow future extensions.
> 
> >
> > or whatever.  That's basically the same thing, except we save 1 byte.
> >
> >> > 5. The reconnect behavior for stateful connections is a separate
> >> > exchange. This keeps the stateless connections free of clutter.
> >> >
> >> > 6. A few changes in the auth_none and cephx integratoin will be needed.
> >> > For example, all the current stubs assume that authentication happens over
> >> > MAuth message and authorization happens in an authorizer blob in
> >> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> >> > multiplex the cephx message blobs. Also, because the IDENT exchanges
> >> > happens later, we may need to pass additional info in the auth handshake
> >> > messages (like the peer type, or whatever else is needed).
> >> >
> >> > 7. Lots of messages can go either way, and I tried ot avoid a strict
> >> > request/response model so that things could be pipelined, and we'd spend a
> >> > minimal amount of time waiting for a response from the other end.  For
> >> > example,
> >> >
> >> > C:
> >> >  initiates connection
> >> > S:
> >> >  accepts connection
> >> >  -> banner
> >> >  -> TAG_AUTH_METHODS
> >> > C:
> >> >  -> banner
> >> >  -> TAG_AUTH_SET_METHOD
> >> >  -> TAG_AUTH_AUTH_REQUEST
> >> > S:
> >> >  -> TAG_AUTH_REPLY
> >> > C:
> >> >  -> TAG_ENCRYPT_BEGIN
> >> >  -> TAG_IDENT
> >> >  -> TAG_SIGNATURE
> >>
> >> Can we have the client start authenticating with some predetermined
> >> auth params, and resort to having the server responding with
> >> AUTH_METHODS only if it doesn't support the method selected by the
> >> client. Even if not having it preconfigured, the auth method usually
> >> doesn't change across connection instances, so we can have the client
> >> cache that info per server. That would then be something like this:
> >>
> >> a first connection:
> >>
> >> C:
> >>  initiates connection
> >>  -> banner
> >>  -> TAG_AUTH_GET_METHODS <-- be explicit
> >>  -> TAG_AUTH_SET_METHOD  <-- opportunistically trying a specific
> >> method type anyway
> >>  -> TAG_AUTH_AUTH_REQUEST
> >>
> >> S:
> >>  accepts connection
> >>  -> banner
> >>  -> TAG_AUTH_REPLY
> >>
> >>
> >> a followup connection:
> >>
> >>
> >> C:
> >>  initiates connection
> >>  -> banner
> >>  -> TAG_AUTH_SET_METHOD
> >>  -> TAG_AUTH_AUTH_REQUEST
> >>
> >> S:
> >>  accepts connection
> >>  -> banner
> >>  -> TAG_AUTH_REPLY
> >
> > Yeah.. of even just make the initial connection try it's preferred method
> > and only do the GET_METHODS if it is rejected.
> >
> 
> Right. In any case, the protocol should enable this flexibility.
> 
> 
> > If you do a connect and immediately write a few bytes to teh TCP stream,
> > does that actaully translate to fewer packets?  I was guessing that the
> > server writing the first bytes of the exchange would be fine but if it
> > speeds things up for the client to optimistically start the exchange too
> > we may as well...
> >
> 
> While haven't really looked at it recently, I don't think it'd be
> possible to embed data with the SYN packet using the plain vanilla tcp
> implementation. However, I believe that doing connect() and sending
> data immediately following it should improve things, specifically if
> doing async connect (as with the async messenger), but this still
> needs to be proven.
> 
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-02 15:43       ` Sage Weil
@ 2016-06-02 15:59         ` Haomai Wang
  2016-06-02 16:35           ` Sage Weil
  0 siblings, 1 reply; 49+ messages in thread
From: Haomai Wang @ 2016-06-02 15:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yehuda Sadeh-Weinraub, ceph-devel

On Thu, Jun 2, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
> Based on the discussion during CDM yesterday I wrote up a nicer-looking
> spec of the protocol in rst:
>
>         https://github.com/ceph/ceph/pull/9461
>
> Please let me know if this looks right.  I have two questions:
>
> 1. Is TAG_START is really necessary?  I guess it doesn't hurt, and makes
> it easy to add flags later.
>
> 2. We don't explicitly have anything here that indicates a session is
> stateless or stateful.  Currently this is determined by the Policy stuff
> on either end and the peers just happen to agree.  Setting/asserting
> it explicitly has part of the handshake seems like a good idea.  Maybe a
> flags field in the TAG_IDENT message, with a flags for lossy/lossess,
> whether we initiate connections (true for client or p2p servers)?

we already have CEPH_MSG_CONNECT_LOSSY flag when handshake.

>
> sage
>
>
> On Sat, 28 May 2016, Yehuda Sadeh-Weinraub wrote:
>
>> On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@redhat.com> wrote:
>> > On Fri, 27 May 2016, Yehuda Sadeh-Weinraub wrote:
>> >> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
>> >> > I wrote up a basic proposal for the new msgr2 protocol:
>> >> >
>> >> >         http://pad.ceph.com/p/msgr2
>> >> >
>> >> > It is pretty similar to the current protocol, with a few key changes:
>> >> >
>> >> > 1. The initial banner has a version number for protocl features supported
>> >> > and required.  This will allow optional behavior later.  The current
>> >> > protocol doesn't allow this (the banner string is fixed and has to match
>> >> > verbatim).
>> >> >
>> >> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
>> >> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
>> >> > authenticator/ticket presentation for established clients can be sent here
>> >> > as part of this exchange, instead of as part of the msg_connect and
>> >> > msg_connect_reply exchnage.
>> >> >
>> >> > 3. The identification of peers during connect is moved to the TAG_IDENT
>> >> > stage.  This way it could happen after authentication and/or encryption,
>> >> > if we like.  (Not sure it matters.)
>> >> >
>> >> > 4. Signatures are a separate message now that follows the previous
>> >> > message.  If a message doesn't have a signature that follows, it is
>> >> > dropped.  Once authenticated we can sign all the other handshake exchanges
>> >> > (TAG_IDENT, etc.) as well as the messages themselves.
>> >> >
>> >>
>> >> Is there a reason why the signature needs to be a separate message? It
>> >> would add extra overhead, and it seems to me that it would complicate
>> >> implementation (in terms of message state and such).
>> >
>> > It doesn't have to be--I was just wanting to keep things simple.  We could
>> > similarly make it part of the underlying format, e.g.,
>> >
>> >  tag byte
>> >  8 byte signature
>> >  payload
>>
>> signature should come after payload, but yeah. Might need to define
>> extended envelope to allow future extensions.
>>
>> >
>> > or whatever.  That's basically the same thing, except we save 1 byte.
>> >
>> >> > 5. The reconnect behavior for stateful connections is a separate
>> >> > exchange. This keeps the stateless connections free of clutter.
>> >> >
>> >> > 6. A few changes in the auth_none and cephx integratoin will be needed.
>> >> > For example, all the current stubs assume that authentication happens over
>> >> > MAuth message and authorization happens in an authorizer blob in
>> >> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
>> >> > multiplex the cephx message blobs. Also, because the IDENT exchanges
>> >> > happens later, we may need to pass additional info in the auth handshake
>> >> > messages (like the peer type, or whatever else is needed).
>> >> >
>> >> > 7. Lots of messages can go either way, and I tried ot avoid a strict
>> >> > request/response model so that things could be pipelined, and we'd spend a
>> >> > minimal amount of time waiting for a response from the other end.  For
>> >> > example,
>> >> >
>> >> > C:
>> >> >  initiates connection
>> >> > S:
>> >> >  accepts connection
>> >> >  -> banner
>> >> >  -> TAG_AUTH_METHODS
>> >> > C:
>> >> >  -> banner
>> >> >  -> TAG_AUTH_SET_METHOD
>> >> >  -> TAG_AUTH_AUTH_REQUEST
>> >> > S:
>> >> >  -> TAG_AUTH_REPLY
>> >> > C:
>> >> >  -> TAG_ENCRYPT_BEGIN
>> >> >  -> TAG_IDENT
>> >> >  -> TAG_SIGNATURE
>> >>
>> >> Can we have the client start authenticating with some predetermined
>> >> auth params, and resort to having the server responding with
>> >> AUTH_METHODS only if it doesn't support the method selected by the
>> >> client. Even if not having it preconfigured, the auth method usually
>> >> doesn't change across connection instances, so we can have the client
>> >> cache that info per server. That would then be something like this:
>> >>
>> >> a first connection:
>> >>
>> >> C:
>> >>  initiates connection
>> >>  -> banner
>> >>  -> TAG_AUTH_GET_METHODS <-- be explicit
>> >>  -> TAG_AUTH_SET_METHOD  <-- opportunistically trying a specific
>> >> method type anyway
>> >>  -> TAG_AUTH_AUTH_REQUEST
>> >>
>> >> S:
>> >>  accepts connection
>> >>  -> banner
>> >>  -> TAG_AUTH_REPLY
>> >>
>> >>
>> >> a followup connection:
>> >>
>> >>
>> >> C:
>> >>  initiates connection
>> >>  -> banner
>> >>  -> TAG_AUTH_SET_METHOD
>> >>  -> TAG_AUTH_AUTH_REQUEST
>> >>
>> >> S:
>> >>  accepts connection
>> >>  -> banner
>> >>  -> TAG_AUTH_REPLY
>> >
>> > Yeah.. of even just make the initial connection try it's preferred method
>> > and only do the GET_METHODS if it is rejected.
>> >
>>
>> Right. In any case, the protocol should enable this flexibility.
>>
>>
>> > If you do a connect and immediately write a few bytes to teh TCP stream,
>> > does that actaully translate to fewer packets?  I was guessing that the
>> > server writing the first bytes of the exchange would be fine but if it
>> > speeds things up for the client to optimistically start the exchange too
>> > we may as well...
>> >
>>
>> While haven't really looked at it recently, I don't think it'd be
>> possible to embed data with the SYN packet using the plain vanilla tcp
>> implementation. However, I believe that doing connect() and sending
>> data immediately following it should improve things, specifically if
>> doing async connect (as with the async messenger), but this still
>> needs to be proven.
>>
>> Yehuda
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-02 15:59         ` Haomai Wang
@ 2016-06-02 16:35           ` Sage Weil
  0 siblings, 0 replies; 49+ messages in thread
From: Sage Weil @ 2016-06-02 16:35 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Yehuda Sadeh-Weinraub, ceph-devel

On Thu, 2 Jun 2016, Haomai Wang wrote:
> On Thu, Jun 2, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
> > Based on the discussion during CDM yesterday I wrote up a nicer-looking
> > spec of the protocol in rst:
> >
> >         https://github.com/ceph/ceph/pull/9461
> >
> > Please let me know if this looks right.  I have two questions:
> >
> > 1. Is TAG_START is really necessary?  I guess it doesn't hurt, and makes
> > it easy to add flags later.
> >
> > 2. We don't explicitly have anything here that indicates a session is
> > stateless or stateful.  Currently this is determined by the Policy stuff
> > on either end and the peers just happen to agree.  Setting/asserting
> > it explicitly has part of the handshake seems like a good idea.  Maybe a
> > flags field in the TAG_IDENT message, with a flags for lossy/lossess,
> > whether we initiate connections (true for client or p2p servers)?
> 
> we already have CEPH_MSG_CONNECT_LOSSY flag when handshake.

Oh yeah!  I added a flags field to TAG_IDENT.

sage


> 
> >
> > sage
> >
> >
> > On Sat, 28 May 2016, Yehuda Sadeh-Weinraub wrote:
> >
> >> On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@redhat.com> wrote:
> >> > On Fri, 27 May 2016, Yehuda Sadeh-Weinraub wrote:
> >> >> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> >> >> > I wrote up a basic proposal for the new msgr2 protocol:
> >> >> >
> >> >> >         http://pad.ceph.com/p/msgr2
> >> >> >
> >> >> > It is pretty similar to the current protocol, with a few key changes:
> >> >> >
> >> >> > 1. The initial banner has a version number for protocl features supported
> >> >> > and required.  This will allow optional behavior later.  The current
> >> >> > protocol doesn't allow this (the banner string is fixed and has to match
> >> >> > verbatim).
> >> >> >
> >> >> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
> >> >> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
> >> >> > authenticator/ticket presentation for established clients can be sent here
> >> >> > as part of this exchange, instead of as part of the msg_connect and
> >> >> > msg_connect_reply exchnage.
> >> >> >
> >> >> > 3. The identification of peers during connect is moved to the TAG_IDENT
> >> >> > stage.  This way it could happen after authentication and/or encryption,
> >> >> > if we like.  (Not sure it matters.)
> >> >> >
> >> >> > 4. Signatures are a separate message now that follows the previous
> >> >> > message.  If a message doesn't have a signature that follows, it is
> >> >> > dropped.  Once authenticated we can sign all the other handshake exchanges
> >> >> > (TAG_IDENT, etc.) as well as the messages themselves.
> >> >> >
> >> >>
> >> >> Is there a reason why the signature needs to be a separate message? It
> >> >> would add extra overhead, and it seems to me that it would complicate
> >> >> implementation (in terms of message state and such).
> >> >
> >> > It doesn't have to be--I was just wanting to keep things simple.  We could
> >> > similarly make it part of the underlying format, e.g.,
> >> >
> >> >  tag byte
> >> >  8 byte signature
> >> >  payload
> >>
> >> signature should come after payload, but yeah. Might need to define
> >> extended envelope to allow future extensions.
> >>
> >> >
> >> > or whatever.  That's basically the same thing, except we save 1 byte.
> >> >
> >> >> > 5. The reconnect behavior for stateful connections is a separate
> >> >> > exchange. This keeps the stateless connections free of clutter.
> >> >> >
> >> >> > 6. A few changes in the auth_none and cephx integratoin will be needed.
> >> >> > For example, all the current stubs assume that authentication happens over
> >> >> > MAuth message and authorization happens in an authorizer blob in
> >> >> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> >> >> > multiplex the cephx message blobs. Also, because the IDENT exchanges
> >> >> > happens later, we may need to pass additional info in the auth handshake
> >> >> > messages (like the peer type, or whatever else is needed).
> >> >> >
> >> >> > 7. Lots of messages can go either way, and I tried ot avoid a strict
> >> >> > request/response model so that things could be pipelined, and we'd spend a
> >> >> > minimal amount of time waiting for a response from the other end.  For
> >> >> > example,
> >> >> >
> >> >> > C:
> >> >> >  initiates connection
> >> >> > S:
> >> >> >  accepts connection
> >> >> >  -> banner
> >> >> >  -> TAG_AUTH_METHODS
> >> >> > C:
> >> >> >  -> banner
> >> >> >  -> TAG_AUTH_SET_METHOD
> >> >> >  -> TAG_AUTH_AUTH_REQUEST
> >> >> > S:
> >> >> >  -> TAG_AUTH_REPLY
> >> >> > C:
> >> >> >  -> TAG_ENCRYPT_BEGIN
> >> >> >  -> TAG_IDENT
> >> >> >  -> TAG_SIGNATURE
> >> >>
> >> >> Can we have the client start authenticating with some predetermined
> >> >> auth params, and resort to having the server responding with
> >> >> AUTH_METHODS only if it doesn't support the method selected by the
> >> >> client. Even if not having it preconfigured, the auth method usually
> >> >> doesn't change across connection instances, so we can have the client
> >> >> cache that info per server. That would then be something like this:
> >> >>
> >> >> a first connection:
> >> >>
> >> >> C:
> >> >>  initiates connection
> >> >>  -> banner
> >> >>  -> TAG_AUTH_GET_METHODS <-- be explicit
> >> >>  -> TAG_AUTH_SET_METHOD  <-- opportunistically trying a specific
> >> >> method type anyway
> >> >>  -> TAG_AUTH_AUTH_REQUEST
> >> >>
> >> >> S:
> >> >>  accepts connection
> >> >>  -> banner
> >> >>  -> TAG_AUTH_REPLY
> >> >>
> >> >>
> >> >> a followup connection:
> >> >>
> >> >>
> >> >> C:
> >> >>  initiates connection
> >> >>  -> banner
> >> >>  -> TAG_AUTH_SET_METHOD
> >> >>  -> TAG_AUTH_AUTH_REQUEST
> >> >>
> >> >> S:
> >> >>  accepts connection
> >> >>  -> banner
> >> >>  -> TAG_AUTH_REPLY
> >> >
> >> > Yeah.. of even just make the initial connection try it's preferred method
> >> > and only do the GET_METHODS if it is rejected.
> >> >
> >>
> >> Right. In any case, the protocol should enable this flexibility.
> >>
> >>
> >> > If you do a connect and immediately write a few bytes to teh TCP stream,
> >> > does that actaully translate to fewer packets?  I was guessing that the
> >> > server writing the first bytes of the exchange would be fine but if it
> >> > speeds things up for the client to optimistically start the exchange too
> >> > we may as well...
> >> >
> >>
> >> While haven't really looked at it recently, I don't think it'd be
> >> possible to embed data with the SYN packet using the plain vanilla tcp
> >> implementation. However, I believe that doing connect() and sending
> >> data immediately following it should improve things, specifically if
> >> doing async connect (as with the async messenger), but this still
> >> needs to be proven.
> >>
> >> Yehuda
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-26 18:17 msgr2 protocol Sage Weil
  2016-05-27  4:41 ` Haomai Wang
  2016-05-27  9:44 ` Yehuda Sadeh-Weinraub
@ 2016-06-02 18:11 ` Gregory Farnum
  2016-06-02 18:24   ` Sage Weil
  2016-06-02 18:16 ` Gregory Farnum
  3 siblings, 1 reply; 49+ messages in thread
From: Gregory Farnum @ 2016-06-02 18:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> I wrote up a basic proposal for the new msgr2 protocol:
>
>         http://pad.ceph.com/p/msgr2
>
> It is pretty similar to the current protocol, with a few key changes:
>
> 1. The initial banner has a version number for protocl features supported
> and required.  This will allow optional behavior later.  The current
> protocol doesn't allow this (the banner string is fixed and has to match
> verbatim).
>
> 2. The auth handshake is a low-level msgr exchange now.  This more or less
> matches the MAuth and MAuthReply exchange with the mon.  Also, the
> authenticator/ticket presentation for established clients can be sent here
> as part of this exchange, instead of as part of the msg_connect and
> msg_connect_reply exchnage.
>
> 3. The identification of peers during connect is moved to the TAG_IDENT
> stage.  This way it could happen after authentication and/or encryption,
> if we like.  (Not sure it matters.)

Hmm, reading this through I'm actually confused about how we do
authentication before we identify ourselves.

Going back to the fast reconnects again (in which we allow a client to
submit all the reconnect data at once and submit a message without
waiting for a response from the server), we'd need to be able to
re-use the previous session key during the authentication phase but
for that to make any sense it would need to have supplied the
identifying cookie.

Were you thinking that with cephx, it would use the cluster key to
generate a "blind" session key, saying it's allowed to talk, and then
use that session key to do identification and share the cephx bundle?
-Greg


>
> 4. Signatures are a separate message now that follows the previous
> message.  If a message doesn't have a signature that follows, it is
> dropped.  Once authenticated we can sign all the other handshake exchanges
> (TAG_IDENT, etc.) as well as the messages themselves.
>
> 5. The reconnect behavior for stateful connections is a separate
> exchange. This keeps the stateless connections free of clutter.
>
> 6. A few changes in the auth_none and cephx integratoin will be needed.
> For example, all the current stubs assume that authentication happens over
> MAuth message and authorization happens in an authorizer blob in
> ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> multiplex the cephx message blobs. Also, because the IDENT exchanges
> happens later, we may need to pass additional info in the auth handshake
> messages (like the peer type, or whatever else is needed).
>
> 7. Lots of messages can go either way, and I tried ot avoid a strict
> request/response model so that things could be pipelined, and we'd spend a
> minimal amount of time waiting for a response from the other end.  For
> example,
>
> C:
>  initiates connection
> S:
>  accepts connection
>  -> banner
>  -> TAG_AUTH_METHODS
> C:
>  -> banner
>  -> TAG_AUTH_SET_METHOD
>  -> TAG_AUTH_AUTH_REQUEST
> S:
>  -> TAG_AUTH_REPLY
> C:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE
> S:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE
> C:
>  -> TAG_START
>  -> TAG_SIGNATURE
>  -> TAG_MSG
>  -> TAG_SIGNATURE
>     ...
> S:
>  -> TAG_MSG
>  -> TAG_SIGNATURE
>     ...
>
> Comments, please!  The exhange is a bit less structured as far as who
> sends what message, with the idea that we could pipeline a lot of it, but
> it may end up being too ambiguous.  Let me know what you think...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-05-26 18:17 msgr2 protocol Sage Weil
                   ` (2 preceding siblings ...)
  2016-06-02 18:11 ` Gregory Farnum
@ 2016-06-02 18:16 ` Gregory Farnum
  3 siblings, 0 replies; 49+ messages in thread
From: Gregory Farnum @ 2016-06-02 18:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> I wrote up a basic proposal for the new msgr2 protocol:
>
>         http://pad.ceph.com/p/msgr2
>
> It is pretty similar to the current protocol, with a few key changes:
>
> 1. The initial banner has a version number for protocl features supported
> and required.  This will allow optional behavior later.  The current
> protocol doesn't allow this (the banner string is fixed and has to match
> verbatim).
>
> 2. The auth handshake is a low-level msgr exchange now.  This more or less
> matches the MAuth and MAuthReply exchange with the mon.  Also, the
> authenticator/ticket presentation for established clients can be sent here
> as part of this exchange, instead of as part of the msg_connect and
> msg_connect_reply exchnage.
>
> 3. The identification of peers during connect is moved to the TAG_IDENT
> stage.  This way it could happen after authentication and/or encryption,
> if we like.  (Not sure it matters.)
>
> 4. Signatures are a separate message now that follows the previous
> message.  If a message doesn't have a signature that follows, it is
> dropped.  Once authenticated we can sign all the other handshake exchanges
> (TAG_IDENT, etc.) as well as the messages themselves.
>
> 5. The reconnect behavior for stateful connections is a separate
> exchange. This keeps the stateless connections free of clutter.
>
> 6. A few changes in the auth_none and cephx integratoin will be needed.
> For example, all the current stubs assume that authentication happens over
> MAuth message and authorization happens in an authorizer blob in
> ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> multiplex the cephx message blobs. Also, because the IDENT exchanges
> happens later, we may need to pass additional info in the auth handshake
> messages (like the peer type, or whatever else is needed).
>
> 7. Lots of messages can go either way, and I tried ot avoid a strict
> request/response model so that things could be pipelined, and we'd spend a
> minimal amount of time waiting for a response from the other end.  For
> example,
>
> C:
>  initiates connection
> S:
>  accepts connection
>  -> banner
>  -> TAG_AUTH_METHODS
> C:
>  -> banner
>  -> TAG_AUTH_SET_METHOD
>  -> TAG_AUTH_AUTH_REQUEST
> S:
>  -> TAG_AUTH_REPLY
> C:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE
> S:
>  -> TAG_ENCRYPT_BEGIN
>  -> TAG_IDENT
>  -> TAG_SIGNATURE
> C:
>  -> TAG_START
>  -> TAG_SIGNATURE
>  -> TAG_MSG
>  -> TAG_SIGNATURE

And this bit isn't actually true any more based on that doc; the
signature is embedded in the "frame" concept. (Which I think is good
since it avoids predictable data bits an adversary can attack on!)

>     ...
> S:
>  -> TAG_MSG
>  -> TAG_SIGNATURE
>     ...
>
> Comments, please!  The exhange is a bit less structured as far as who
> sends what message, with the idea that we could pipeline a lot of it, but
> it may end up being too ambiguous.  Let me know what you think...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-02 18:11 ` Gregory Farnum
@ 2016-06-02 18:24   ` Sage Weil
  2016-06-02 18:34     ` Gregory Farnum
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-06-02 18:24 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On Thu, 2 Jun 2016, Gregory Farnum wrote:
> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> > I wrote up a basic proposal for the new msgr2 protocol:
> >
> >         http://pad.ceph.com/p/msgr2
> >
> > It is pretty similar to the current protocol, with a few key changes:
> >
> > 1. The initial banner has a version number for protocl features supported
> > and required.  This will allow optional behavior later.  The current
> > protocol doesn't allow this (the banner string is fixed and has to match
> > verbatim).
> >
> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
> > authenticator/ticket presentation for established clients can be sent here
> > as part of this exchange, instead of as part of the msg_connect and
> > msg_connect_reply exchnage.
> >
> > 3. The identification of peers during connect is moved to the TAG_IDENT
> > stage.  This way it could happen after authentication and/or encryption,
> > if we like.  (Not sure it matters.)
> 
> Hmm, reading this through I'm actually confused about how we do
> authentication before we identify ourselves.

Keep in mind that this TAG_IDENT is the entity type and features--not our 
cephx auth EntityName (client.foo, osd.123, etc.)--that identity is 
established (securely) as part of the auth handshake.

> Going back to the fast reconnects again (in which we allow a client to
> submit all the reconnect data at once and submit a message without
> waiting for a response from the server), we'd need to be able to
> re-use the previous session key during the authentication phase but
> for that to make any sense it would need to have supplied the
> identifying cookie.

I think the fast reconnect would only be possible if the first connection 
got far enough to discover the server cookie from it's TAG_IDENT.  So the 
2 pieces of info we need are the session key established during auth 
handshake *and* the server cookie from the ident.  If, after that point, 
we disconnect, we can fast reconnect using that info + our last seq etc.

I'm not totally certain this will actually be a win, though.  For example, 
say we send

 msg5 + msg6 + msg7 + msg8 + msg9 + msg10

and have seen an ack through msg6.  That means on reconnect we either have 
to wait for a round trip to get the last_ack and find out whether the 
server got 7-10, or blindly resend 7-10 even though they might be dups.  
Whether it's a win will depend on the message sizes vs connection latency.

My inclination is still to leave the door open for fast reconnect, but 
ignore it in the initial implementation for simplicity...

sage


> Were you thinking that with cephx, it would use the cluster key to
> generate a "blind" session key, saying it's allowed to talk, and then
> use that session key to do identification and share the cephx bundle?
> -Greg
> 
> 
> >
> > 4. Signatures are a separate message now that follows the previous
> > message.  If a message doesn't have a signature that follows, it is
> > dropped.  Once authenticated we can sign all the other handshake exchanges
> > (TAG_IDENT, etc.) as well as the messages themselves.
> >
> > 5. The reconnect behavior for stateful connections is a separate
> > exchange. This keeps the stateless connections free of clutter.
> >
> > 6. A few changes in the auth_none and cephx integratoin will be needed.
> > For example, all the current stubs assume that authentication happens over
> > MAuth message and authorization happens in an authorizer blob in
> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> > multiplex the cephx message blobs. Also, because the IDENT exchanges
> > happens later, we may need to pass additional info in the auth handshake
> > messages (like the peer type, or whatever else is needed).
> >
> > 7. Lots of messages can go either way, and I tried ot avoid a strict
> > request/response model so that things could be pipelined, and we'd spend a
> > minimal amount of time waiting for a response from the other end.  For
> > example,
> >
> > C:
> >  initiates connection
> > S:
> >  accepts connection
> >  -> banner
> >  -> TAG_AUTH_METHODS
> > C:
> >  -> banner
> >  -> TAG_AUTH_SET_METHOD
> >  -> TAG_AUTH_AUTH_REQUEST
> > S:
> >  -> TAG_AUTH_REPLY
> > C:
> >  -> TAG_ENCRYPT_BEGIN
> >  -> TAG_IDENT
> >  -> TAG_SIGNATURE
> > S:
> >  -> TAG_ENCRYPT_BEGIN
> >  -> TAG_IDENT
> >  -> TAG_SIGNATURE
> > C:
> >  -> TAG_START
> >  -> TAG_SIGNATURE
> >  -> TAG_MSG
> >  -> TAG_SIGNATURE
> >     ...
> > S:
> >  -> TAG_MSG
> >  -> TAG_SIGNATURE
> >     ...
> >
> > Comments, please!  The exhange is a bit less structured as far as who
> > sends what message, with the idea that we could pipeline a lot of it, but
> > it may end up being too ambiguous.  Let me know what you think...
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-02 18:24   ` Sage Weil
@ 2016-06-02 18:34     ` Gregory Farnum
  2016-06-03 13:11       ` Sage Weil
  2016-06-03 13:24       ` Sage Weil
  0 siblings, 2 replies; 49+ messages in thread
From: Gregory Farnum @ 2016-06-02 18:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, Jun 2, 2016 at 11:24 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 2 Jun 2016, Gregory Farnum wrote:
>> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
>> > I wrote up a basic proposal for the new msgr2 protocol:
>> >
>> >         http://pad.ceph.com/p/msgr2
>> >
>> > It is pretty similar to the current protocol, with a few key changes:
>> >
>> > 1. The initial banner has a version number for protocl features supported
>> > and required.  This will allow optional behavior later.  The current
>> > protocol doesn't allow this (the banner string is fixed and has to match
>> > verbatim).
>> >
>> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
>> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
>> > authenticator/ticket presentation for established clients can be sent here
>> > as part of this exchange, instead of as part of the msg_connect and
>> > msg_connect_reply exchnage.
>> >
>> > 3. The identification of peers during connect is moved to the TAG_IDENT
>> > stage.  This way it could happen after authentication and/or encryption,
>> > if we like.  (Not sure it matters.)
>>
>> Hmm, reading this through I'm actually confused about how we do
>> authentication before we identify ourselves.
>
> Keep in mind that this TAG_IDENT is the entity type and features--not our
> cephx auth EntityName (client.foo, osd.123, etc.)--that identity is
> established (securely) as part of the auth handshake.
>
>> Going back to the fast reconnects again (in which we allow a client to
>> submit all the reconnect data at once and submit a message without
>> waiting for a response from the server), we'd need to be able to
>> re-use the previous session key during the authentication phase but
>> for that to make any sense it would need to have supplied the
>> identifying cookie.
>
> I think the fast reconnect would only be possible if the first connection
> got far enough to discover the server cookie from it's TAG_IDENT.  So the
> 2 pieces of info we need are the session key established during auth
> handshake *and* the server cookie from the ident.  If, after that point,
> we disconnect, we can fast reconnect using that info + our last seq etc.

Yes, I agree. But that means the server needs to be able to identify
the shared secret key being used to sign stuff. Do we not switch over
from the cluster key to the session key until after auth is done and
we move to TAG_IDENT?
That might mean we want to re-sign all the stuff used for
decision-making in the AUTH phase with our session key as well, hrm.
Or maybe that doesn't add anything since if somebody has access to the
session key they can necessarily have seen our session key, so never
mind.

>
> I'm not totally certain this will actually be a win, though.  For example,
> say we send
>
>  msg5 + msg6 + msg7 + msg8 + msg9 + msg10
>
> and have seen an ack through msg6.  That means on reconnect we either have
> to wait for a round trip to get the last_ack and find out whether the
> server got 7-10, or blindly resend 7-10 even though they might be dups.
> Whether it's a win will depend on the message sizes vs connection latency.
>
> My inclination is still to leave the door open for fast reconnect, but
> ignore it in the initial implementation for simplicity...

Yeah, if we actually get interrupted without acks it's not so helpful.
I'm thinking more the case where the OSD is needing to politely tear
down tcp sessions/sockets earlier than it would like to than that the
network is frequently failing on us.
-Greg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-02 18:34     ` Gregory Farnum
@ 2016-06-03 13:11       ` Sage Weil
  2016-06-03 13:24       ` Sage Weil
  1 sibling, 0 replies; 49+ messages in thread
From: Sage Weil @ 2016-06-03 13:11 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On Thu, 2 Jun 2016, Gregory Farnum wrote:
> On Thu, Jun 2, 2016 at 11:24 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Thu, 2 Jun 2016, Gregory Farnum wrote:
> >> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@redhat.com> wrote:
> >> > I wrote up a basic proposal for the new msgr2 protocol:
> >> >
> >> >         http://pad.ceph.com/p/msgr2
> >> >
> >> > It is pretty similar to the current protocol, with a few key changes:
> >> >
> >> > 1. The initial banner has a version number for protocl features supported
> >> > and required.  This will allow optional behavior later.  The current
> >> > protocol doesn't allow this (the banner string is fixed and has to match
> >> > verbatim).
> >> >
> >> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
> >> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
> >> > authenticator/ticket presentation for established clients can be sent here
> >> > as part of this exchange, instead of as part of the msg_connect and
> >> > msg_connect_reply exchnage.
> >> >
> >> > 3. The identification of peers during connect is moved to the TAG_IDENT
> >> > stage.  This way it could happen after authentication and/or encryption,
> >> > if we like.  (Not sure it matters.)
> >>
> >> Hmm, reading this through I'm actually confused about how we do
> >> authentication before we identify ourselves.
> >
> > Keep in mind that this TAG_IDENT is the entity type and features--not our
> > cephx auth EntityName (client.foo, osd.123, etc.)--that identity is
> > established (securely) as part of the auth handshake.
> >
> >> Going back to the fast reconnects again (in which we allow a client to
> >> submit all the reconnect data at once and submit a message without
> >> waiting for a response from the server), we'd need to be able to
> >> re-use the previous session key during the authentication phase but
> >> for that to make any sense it would need to have supplied the
> >> identifying cookie.
> >
> > I think the fast reconnect would only be possible if the first connection
> > got far enough to discover the server cookie from it's TAG_IDENT.  So the
> > 2 pieces of info we need are the session key established during auth
> > handshake *and* the server cookie from the ident.  If, after that point,
> > we disconnect, we can fast reconnect using that info + our last seq etc.
> 
> Yes, I agree. But that means the server needs to be able to identify
> the shared secret key being used to sign stuff. Do we not switch over
> from the cluster key to the session key until after auth is done and
> we move to TAG_IDENT?
> That might mean we want to re-sign all the stuff used for
> decision-making in the AUTH phase with our session key as well, hrm.
> Or maybe that doesn't add anything since if somebody has access to the
> session key they can necessarily have seen our session key, so never
> mind.

I would expect the fast reconnect to do something like

 cookie, {cookie, last seq i got, next seq i will send, nonce}^previous_session_key

and the server reply to do something like

 {nonce+1, last seq i got}^previous_session_key

The server would look up the previous cookie, use that session key to 
decrypt the block, verify it looks okay, and use the rest of the info to 
initialize the session.  Probably with some confounder or something.  As 
long as the cookie is plaintext, and the rest can be validated against the 
previous session key, I think we would have enough?

> > I'm not totally certain this will actually be a win, though.  For example,
> > say we send
> >
> >  msg5 + msg6 + msg7 + msg8 + msg9 + msg10
> >
> > and have seen an ack through msg6.  That means on reconnect we either have
> > to wait for a round trip to get the last_ack and find out whether the
> > server got 7-10, or blindly resend 7-10 even though they might be dups.
> > Whether it's a win will depend on the message sizes vs connection latency.
> >
> > My inclination is still to leave the door open for fast reconnect, but
> > ignore it in the initial implementation for simplicity...
> 
> Yeah, if we actually get interrupted without acks it's not so helpful.
> I'm thinking more the case where the OSD is needing to politely tear
> down tcp sessions/sockets earlier than it would like to than that the
> network is frequently failing on us.

Oh yeah, that's a good point.  It's probably worth it then!

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-02 18:34     ` Gregory Farnum
  2016-06-03 13:11       ` Sage Weil
@ 2016-06-03 13:24       ` Sage Weil
  2016-06-03 16:47         ` Haomai Wang
  1 sibling, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-06-03 13:24 UTC (permalink / raw)
  To: mwatts; +Cc: Gregory Farnum, ceph-devel

I updated the PR with an additional commit that simplifies the 
confounder.  It seems like we only need the confoudner at teh beginning of 
the session, not for every message, since the signature and encryption 
state is chained from the previous frame.  Is that right?

	https://github.com/ceph/ceph/pull/9461/commits/45766fed1864733c5216a7b50f11cce256338170

Full PR:

	https://github.com/ceph/ceph/pull/9461

--

Also, I just realized that now might be a good time to address the ability 
to multiplex different endpoints (sessions) into the same connection.  We 
could add it later with the msgr feature bits, but it'll probably be 
simpler not to have to support two variants of the protocol.  On the other 
hand, it's a lot more complicated.  :(

Initial thoughts:

 - make a clearer distinction between connection (as in tcp) and a 
session (as in an exchange between two entities, one that may or may not 
have a lifecycle independent of the connection.
 - the auth exchange is no longer always at the beginning of the 
connection.  new entities might appear in the same endpoint.

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-03 13:24       ` Sage Weil
@ 2016-06-03 16:47         ` Haomai Wang
  2016-06-03 17:33           ` Sage Weil
  0 siblings, 1 reply; 49+ messages in thread
From: Haomai Wang @ 2016-06-03 16:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Fri, Jun 3, 2016 at 9:24 PM, Sage Weil <sweil@redhat.com> wrote:
> I updated the PR with an additional commit that simplifies the
> confounder.  It seems like we only need the confoudner at teh beginning of
> the session, not for every message, since the signature and encryption
> state is chained from the previous frame.  Is that right?
>
>         https://github.com/ceph/ceph/pull/9461/commits/45766fed1864733c5216a7b50f11cce256338170
>
> Full PR:
>
>         https://github.com/ceph/ceph/pull/9461
>
> --
>
> Also, I just realized that now might be a good time to address the ability
> to multiplex different endpoints (sessions) into the same connection.  We
> could add it later with the msgr feature bits, but it'll probably be
> simpler not to have to support two variants of the protocol.  On the other
> hand, it's a lot more complicated.  :(

What's the use case? different session shares the same connection? Do
you mean different messenger instance will share the same connection
pool?

>
> Initial thoughts:
>
>  - make a clearer distinction between connection (as in tcp) and a
> session (as in an exchange between two entities, one that may or may not
> have a lifecycle independent of the connection.
>  - the auth exchange is no longer always at the beginning of the
> connection.  new entities might appear in the same endpoint。
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-03 16:47         ` Haomai Wang
@ 2016-06-03 17:33           ` Sage Weil
  2016-06-03 17:35             ` Haomai Wang
  2016-06-06 20:16             ` Gregory Farnum
  0 siblings, 2 replies; 49+ messages in thread
From: Sage Weil @ 2016-06-03 17:33 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Sat, 4 Jun 2016, Haomai Wang wrote:
> On Fri, Jun 3, 2016 at 9:24 PM, Sage Weil <sweil@redhat.com> wrote:
> > I updated the PR with an additional commit that simplifies the
> > confounder.  It seems like we only need the confoudner at teh beginning of
> > the session, not for every message, since the signature and encryption
> > state is chained from the previous frame.  Is that right?
> >
> >         https://github.com/ceph/ceph/pull/9461/commits/45766fed1864733c5216a7b50f11cce256338170
> >
> > Full PR:
> >
> >         https://github.com/ceph/ceph/pull/9461
> >
> > --
> >
> > Also, I just realized that now might be a good time to address the ability
> > to multiplex different endpoints (sessions) into the same connection.  We
> > could add it later with the msgr feature bits, but it'll probably be
> > simpler not to have to support two variants of the protocol.  On the other
> > hand, it's a lot more complicated.  :(
> 
> What's the use case? different session shares the same connection? Do
> you mean different messenger instance will share the same connection
> pool?

Many OSD's in the same unix process, sharing the same messenger connection 
pool, eventually using SPDK/DPDK.

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-03 17:33           ` Sage Weil
@ 2016-06-03 17:35             ` Haomai Wang
  2016-06-06  8:23               ` Junwang Zhao
  2016-06-06 20:16             ` Gregory Farnum
  1 sibling, 1 reply; 49+ messages in thread
From: Haomai Wang @ 2016-06-03 17:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Sat, Jun 4, 2016 at 1:33 AM, Sage Weil <sweil@redhat.com> wrote:
> On Sat, 4 Jun 2016, Haomai Wang wrote:
>> On Fri, Jun 3, 2016 at 9:24 PM, Sage Weil <sweil@redhat.com> wrote:
>> > I updated the PR with an additional commit that simplifies the
>> > confounder.  It seems like we only need the confoudner at teh beginning of
>> > the session, not for every message, since the signature and encryption
>> > state is chained from the previous frame.  Is that right?
>> >
>> >         https://github.com/ceph/ceph/pull/9461/commits/45766fed1864733c5216a7b50f11cce256338170
>> >
>> > Full PR:
>> >
>> >         https://github.com/ceph/ceph/pull/9461
>> >
>> > --
>> >
>> > Also, I just realized that now might be a good time to address the ability
>> > to multiplex different endpoints (sessions) into the same connection.  We
>> > could add it later with the msgr feature bits, but it'll probably be
>> > simpler not to have to support two variants of the protocol.  On the other
>> > hand, it's a lot more complicated.  :(
>>
>> What's the use case? different session shares the same connection? Do
>> you mean different messenger instance will share the same connection
>> pool?
>
> Many OSD's in the same unix process, sharing the same messenger connection
> pool, eventually using SPDK/DPDK.

OH, great to push this!

>
> sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-03 17:35             ` Haomai Wang
@ 2016-06-06  8:23               ` Junwang Zhao
  2016-06-10  8:31                 ` Marcus Watts
  0 siblings, 1 reply; 49+ messages in thread
From: Junwang Zhao @ 2016-06-06  8:23 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, Marcus Watts, Gregory Farnum, ceph-devel

Hi,

My understanding is that before the protocol msgr2 begins, each party
already has it's own secrete key, and it's public key has been known by
it's peer, and they have alreay shared the session key that will be used
by encryption/decryption.

The secrete key will be used by signature, and the public key will be
used to identify.

I am not sure about the whole key exchange part, is that done by cephx?


On Sat, Jun 4, 2016 at 1:35 AM, Haomai Wang <haomai@xsky.com> wrote:
> On Sat, Jun 4, 2016 at 1:33 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Sat, 4 Jun 2016, Haomai Wang wrote:
>>> On Fri, Jun 3, 2016 at 9:24 PM, Sage Weil <sweil@redhat.com> wrote:
>>> > I updated the PR with an additional commit that simplifies the
>>> > confounder.  It seems like we only need the confoudner at teh beginning of
>>> > the session, not for every message, since the signature and encryption
>>> > state is chained from the previous frame.  Is that right?
>>> >
>>> >         https://github.com/ceph/ceph/pull/9461/commits/45766fed1864733c5216a7b50f11cce256338170
>>> >
>>> > Full PR:
>>> >
>>> >         https://github.com/ceph/ceph/pull/9461
>>> >
>>> > --
>>> >
>>> > Also, I just realized that now might be a good time to address the ability
>>> > to multiplex different endpoints (sessions) into the same connection.  We
>>> > could add it later with the msgr feature bits, but it'll probably be
>>> > simpler not to have to support two variants of the protocol.  On the other
>>> > hand, it's a lot more complicated.  :(
>>>
>>> What's the use case? different session shares the same connection? Do
>>> you mean different messenger instance will share the same connection
>>> pool?
>>
>> Many OSD's in the same unix process, sharing the same messenger connection
>> pool, eventually using SPDK/DPDK.
>
> OH, great to push this!
>
>>
>> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-03 17:33           ` Sage Weil
  2016-06-03 17:35             ` Haomai Wang
@ 2016-06-06 20:16             ` Gregory Farnum
  2016-06-10 11:04               ` Sage Weil
  1 sibling, 1 reply; 49+ messages in thread
From: Gregory Farnum @ 2016-06-06 20:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: Haomai Wang, Marcus Watts, ceph-devel

On Fri, Jun 3, 2016 at 10:33 AM, Sage Weil <sweil@redhat.com> wrote:
> On Sat, 4 Jun 2016, Haomai Wang wrote:
>> On Fri, Jun 3, 2016 at 9:24 PM, Sage Weil <sweil@redhat.com> wrote:
>> > I updated the PR with an additional commit that simplifies the
>> > confounder.  It seems like we only need the confoudner at teh beginning of
>> > the session, not for every message, since the signature and encryption
>> > state is chained from the previous frame.  Is that right?
>> >
>> >         https://github.com/ceph/ceph/pull/9461/commits/45766fed1864733c5216a7b50f11cce256338170
>> >
>> > Full PR:
>> >
>> >         https://github.com/ceph/ceph/pull/9461
>> >
>> > --
>> >
>> > Also, I just realized that now might be a good time to address the ability
>> > to multiplex different endpoints (sessions) into the same connection.  We
>> > could add it later with the msgr feature bits, but it'll probably be
>> > simpler not to have to support two variants of the protocol.  On the other
>> > hand, it's a lot more complicated.  :(
>>
>> What's the use case? different session shares the same connection? Do
>> you mean different messenger instance will share the same connection
>> pool?
>
> Many OSD's in the same unix process, sharing the same messenger connection
> pool, eventually using SPDK/DPDK.

Oof, yes, if we want to make that happen it should be in this design phase.

So, what features do we want/need to include here? Is it just the
ability to send messages which represent different entities on a tag
basis? Full streams with their own encryption settings?
Do we want to do interleaving on some block size, or do
messages/streams on a full frame basis?

I've thought about these some and am mixed-to-clueless on these, but
we need some targets.
-Greg

>
> sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-06  8:23               ` Junwang Zhao
@ 2016-06-10  8:31                 ` Marcus Watts
  2016-06-10 10:11                   ` Sage Weil
  2016-06-10 10:48                   ` Sage Weil
  0 siblings, 2 replies; 49+ messages in thread
From: Marcus Watts @ 2016-06-10  8:31 UTC (permalink / raw)
  To: Junwang Zhao; +Cc: Haomai Wang, Sage Weil, Gregory Farnum, ceph-devel

Sorry for the delay reviewing this; got a bit snowed with some
rush hammer things.

So my comments here are of msgr2
as of 20160610.

1. 
banner phase.
bitmask- should allow for >64 bits (even though we should avoid that)

2.
session handshake & message exchange
w/ confounder >= block_size, cbc-cts -- there's no need to send pad bytes.
	[ cts, ciphertext stealing, is a simple modification of cbc to
	not send some bytes which turn out to to be unnecessary; does
	not hurt security, see
		https://en.wikipedia.org/wiki/Ciphertext_stealing	]

3.
authentication, "tag_auth_reply".
method specific payload need not *resend* feature bits, needs
merely to sign them.

4.
authentication
S/c/ auth_methods, set_method, etc.
really hope these don't each need to be separate roundtrips.
perhaps something like,
sequence:
	C->S, S->C:
		num_methods,{type,tag,paylen,payload}[num_methods]
			where
				tag: byte: enum{ continue, accept, info, reject}
		at most one method can say "accept", and that message
		concludes the auth negotiation for that side (if the other
		end doesn't agree, it can reject the connection.)
		if a method needs internal state/sequence data, it
		must be part of the payload for that method.
	each side must send one of these with "accept" before
	it can send anything else; all messages sent after "accept"
	must be encrypted &etc. (so in that sense equiv to
	your TAG_ENCRYPT_BEGIN).
	an "info" method can be sent w/ any exchange, can
	communicate add'l info that might be useful in that phase or later.
	"info" needs to be signed somehow by any successful auth sequence (how?).
	reject should include error message, code(?).  is not signed.
	positively terminates that method, but not the auth sequence
	(another method might succeed).  Neither side should use
	that method thereafter, & attempts to do so should be treated
	as a fatal connection error by the other side.
	If no method indicates "continue" or "accept", that is a
	connection fatal error.

	one method type can be the "fast reconnect".

5.
message flow handshake.
I think this can be folded into the authentication
flow I outlined previously.  Definitly should not be
N back & forth's.

6.
so the stuff winds up secured by the authentication sequence needs to include:
	auth sequence
	initial features
	final sequence number
	that side's identity.

7.
> From: Junwang Zhao <zhjwpku@gmail.com>
> My understanding is that before the protocol msgr2 begins, each party
> already has it's own secrete key, and it's public key has been known by
> it's peer, and they have alreay shared the session key that will be used
> by encryption/decryption.
> 
> The secrete key will be used by signature, and the public key will be
> used to identify.

2 main forms of encryption:
	symmetric:	single secret key known to both ends: used for
			both encryption & decrytion.
	public key:	2 keys: public & private.  public key can be shared,
			private key is knonw to only one end.
			a variety of algorithms including public key encryption,
			signing with the private key, or generating shared
			secrets.  In many cases the public key/private keys
			only used to derive or protect a secret key that is
			then used with ordinary symmetric encryption.

Ceph currently onhly does symmetric encryption.  So there are no public keys.
(cephx.  See,
	http://docs.ceph.com/docs/jewel/dev/cephx_protocol/	)

We'd like to do interesting things with public keys in the future.
So we want a sufficiently flexible algorithm that we can plug
such use in without redoing everything.

					-Marcus Watts

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-10  8:31                 ` Marcus Watts
@ 2016-06-10 10:11                   ` Sage Weil
  2016-06-10 10:48                   ` Sage Weil
  1 sibling, 0 replies; 49+ messages in thread
From: Sage Weil @ 2016-06-10 10:11 UTC (permalink / raw)
  To: Marcus Watts; +Cc: Junwang Zhao, Haomai Wang, Gregory Farnum, ceph-devel

On Fri, 10 Jun 2016, Marcus Watts wrote:
> Sorry for the delay reviewing this; got a bit snowed with some
> rush hammer things.
> 
> So my comments here are of msgr2
> as of 20160610.
> 
> 1. 
> banner phase.
> bitmask- should allow for >64 bits (even though we should avoid that)

The protocol specifies hex digits, so there is no explicit size 
restriction.

> 2.
> session handshake & message exchange
> w/ confounder >= block_size, cbc-cts -- there's no need to send pad bytes.
> 	[ cts, ciphertext stealing, is a simple modification of cbc to
> 	not send some bytes which turn out to to be unnecessary; does
> 	not hurt security, see
> 		https://en.wikipedia.org/wiki/Ciphertext_stealing	]
>
> 3.
> authentication, "tag_auth_reply".
> method specific payload need not *resend* feature bits, needs
> merely to sign them.

True.  The modular auth implementation needs to have a well-defined 
representation of the thing it is signing.  Since it should be independent 
of the wire protocol, we just need to make sure the thing it is 
either signing or resending and signing is well defined.

> 4.
> authentication
> S/c/ auth_methods, set_method, etc.
> really hope these don't each need to be separate roundtrips.
> perhaps something like,
> sequence:
> 	C->S, S->C:
> 		num_methods,{type,tag,paylen,payload}[num_methods]
> 			where
> 				tag: byte: enum{ continue, accept, info, reject}
> 		at most one method can say "accept", and that message
> 		concludes the auth negotiation for that side (if the other
> 		end doesn't agree, it can reject the connection.)
> 		if a method needs internal state/sequence data, it
> 		must be part of the payload for that method.
> 	each side must send one of these with "accept" before
> 	it can send anything else; all messages sent after "accept"
> 	must be encrypted &etc. (so in that sense equiv to
> 	your TAG_ENCRYPT_BEGIN).
> 	an "info" method can be sent w/ any exchange, can
> 	communicate add'l info that might be useful in that phase or later.
> 	"info" needs to be signed somehow by any successful auth sequence (how?).
> 	reject should include error message, code(?).  is not signed.
> 	positively terminates that method, but not the auth sequence
> 	(another method might succeed).  Neither side should use
> 	that method thereafter, & attempts to do so should be treated
> 	as a fatal connection error by the other side.
> 	If no method indicates "continue" or "accept", that is a
> 	connection fatal error.
> 
> 	one method type can be the "fast reconnect".

If I'm understanding correctly, you're worried about the client 
speculatively trying a few different auth modes in parallel?  I'm not 
worried about optimizing that situation--the site config may accept 
multiple methods server-side but the clients themselves should just be 
configured to prefer the method the site admin prefers.

With the current spec, you can do something like:

C: connect
C: banner
C: set method
C: auth + present existing ticket
C: <wait>

S: accept
S: banner
S: <accept client's method, see ticket>
S: auth + reply
S: <wait>

C: <see server replies>
C: <start sending messages>

Earlier I thought the client had to wait for the server auth reply before 
it could send messages, but it turns out it doesn't with cephx: the 
session key is fixed for a given ticket.  So the confounder(s) are 
critical.  I suppose the client could optimistically continue with the 
exchange without hearing anything from the server, but the fallback path 
if the client's choice of auth method isn't accepted would be more 
complicated. Perhaps we only do that if the client only has 1 configured 
option that it is allowed to use.

> 5.
> message flow handshake.
> I think this can be folded into the authentication
> flow I outlined previously.  Definitly should not be
> N back & forth's.

With the single configured auth method on the client, we'd have

C: connect
C: banner
C: set method
C: auth + present existing ticket
C: tag_ident
C: <wait>

S: accept
S: banner
S: <accept client's method, see ticket>
S: auth + reply
S: <see client ident>
S: ident reply

C: start
C: <messages ...>

That's only one round trip/wait.  I don't think we can get around the 
ident exchange and feature exchange because the message encoding depends 
on it.

> 6.
> so the stuff winds up secured by the authentication sequence needs to include:
> 	auth sequence

The auth and auth reply payloads are just passed to the auth plugin; they 
need to be self-securing.

> 	initial features

This needs to be added :(.  I'm thinking we should just call it cephy 
instead of cephx since we'll be changing the internal auth payloads to 
include this.

> 	final sequence number

What's this?

> 	that side's identity.

Already there.


My big question in the above PR is whether it is sufficient to have a 
confounder in the initial auth exchange and not on each frame/message.
We had per message after our bluejeans discussion but I don't see why 
having just one confounder during the auth exchange to randomize the 
initial state of the exchange isn't sufficient.

Thanks!
sage



> 
> 7.
> > From: Junwang Zhao <zhjwpku@gmail.com>
> > My understanding is that before the protocol msgr2 begins, each party
> > already has it's own secrete key, and it's public key has been known by
> > it's peer, and they have alreay shared the session key that will be used
> > by encryption/decryption.
> > 
> > The secrete key will be used by signature, and the public key will be
> > used to identify.
> 
> 2 main forms of encryption:
> 	symmetric:	single secret key known to both ends: used for
> 			both encryption & decrytion.
> 	public key:	2 keys: public & private.  public key can be shared,
> 			private key is knonw to only one end.
> 			a variety of algorithms including public key encryption,
> 			signing with the private key, or generating shared
> 			secrets.  In many cases the public key/private keys
> 			only used to derive or protect a secret key that is
> 			then used with ordinary symmetric encryption.
> 
> Ceph currently onhly does symmetric encryption.  So there are no public keys.
> (cephx.  See,
> 	http://docs.ceph.com/docs/jewel/dev/cephx_protocol/	)
> 
> We'd like to do interesting things with public keys in the future.
> So we want a sufficiently flexible algorithm that we can plug
> such use in without redoing everything.
> 
> 					-Marcus Watts
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-10  8:31                 ` Marcus Watts
  2016-06-10 10:11                   ` Sage Weil
@ 2016-06-10 10:48                   ` Sage Weil
  1 sibling, 0 replies; 49+ messages in thread
From: Sage Weil @ 2016-06-10 10:48 UTC (permalink / raw)
  To: Marcus Watts; +Cc: Junwang Zhao, Haomai Wang, Gregory Farnum, ceph-devel

I forgot to address this in the last message:

On Fri, 10 Jun 2016, Marcus Watts wrote:
> 2.
> session handshake & message exchange
> w/ confounder >= block_size, cbc-cts -- there's no need to send pad bytes.
> 	[ cts, ciphertext stealing, is a simple modification of cbc to
> 	not send some bytes which turn out to to be unnecessary; does
> 	not hurt security, see
> 		https://en.wikipedia.org/wiki/Ciphertext_stealing	]

If we knew that we would always know when the end of a payload was 
coming before we got to that last encryption block, we could skip the 
padding, but it would be complex and fragile.  The key requirement is 
that we want to be able to flush data over the wire and have the 
other end process it at any message boundary.  And the receiver would 
need to know that it should do it's ciphertext stealing thing when it has 
only read a fraction of that last block off the wire.

IMO just specifying the block_size as an auth method attributes simplifies 
everything: we don't have to write code to do the ciphertext stealing, and 
we can more or less blindly read data off the wire in chunks that we know 
can be decrypted.  It costs a few extra bytes, per frame, but I think the 
performance and simplicity win is more than worth it.

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-06 20:16             ` Gregory Farnum
@ 2016-06-10 11:04               ` Sage Weil
  2016-06-10 19:05                 ` Marcus Watts
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-06-10 11:04 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Marcus Watts, ceph-devel

On Mon, 6 Jun 2016, Gregory Farnum wrote:
> On Fri, Jun 3, 2016 at 10:33 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Sat, 4 Jun 2016, Haomai Wang wrote:
> >> On Fri, Jun 3, 2016 at 9:24 PM, Sage Weil <sweil@redhat.com> wrote:
> >> > I updated the PR with an additional commit that simplifies the
> >> > confounder.  It seems like we only need the confoudner at teh beginning of
> >> > the session, not for every message, since the signature and encryption
> >> > state is chained from the previous frame.  Is that right?
> >> >
> >> >         https://github.com/ceph/ceph/pull/9461/commits/45766fed1864733c5216a7b50f11cce256338170
> >> >
> >> > Full PR:
> >> >
> >> >         https://github.com/ceph/ceph/pull/9461
> >> >
> >> > --
> >> >
> >> > Also, I just realized that now might be a good time to address the ability
> >> > to multiplex different endpoints (sessions) into the same connection.  We
> >> > could add it later with the msgr feature bits, but it'll probably be
> >> > simpler not to have to support two variants of the protocol.  On the other
> >> > hand, it's a lot more complicated.  :(
> >>
> >> What's the use case? different session shares the same connection? Do
> >> you mean different messenger instance will share the same connection
> >> pool?
> >
> > Many OSD's in the same unix process, sharing the same messenger connection
> > pool, eventually using SPDK/DPDK.
> 
> Oof, yes, if we want to make that happen it should be in this design phase.
> 
> So, what features do we want/need to include here? Is it just the
> ability to send messages which represent different entities on a tag
> basis? Full streams with their own encryption settings?
> Do we want to do interleaving on some block size, or do
> messages/streams on a full frame basis?
> 
> I've thought about these some and am mixed-to-clueless on these, but
> we need some targets.

I tried to spec this out (see latest wip-msgr2 commit,

	https://github.com/liewegas/ceph/commit/wip-msgr2

Basically a single connection just multiplexes multiple streams, and 
everything we talked about before (except for the banner) is per-stream.  
Each frame has a stream id, frame length, and then the rest is the 
same.

It mostly works, but there is one thing to point out: if we do 
this, the encryption happens within a stream, not the outer framing 
protocol.  That means that the size of frames is no longer secret.  This 
could probably already be inferred by packet boundaries and timing in most 
circumstances, so I don't think it's a problem, but it's worth 
pointing out.  You can also tell which frames belong to which stream, but 
that would also be true if they were over separate connections, so I don't 
think that matters either.

The end goal of this is that:

 - a single process could most many entities (OSDs is the target)
 - entities could come and go, independent of the connection lifecycle
 - things like mark_down would terminate streams.  I think this would be 
just be a TAG_CLOSE message inside a given stream.

How the implementation's Messenger objects would be instantiated and 
handle the sharing of connections is not important at this layer, as far 
as I can tell.  I can imagine a couple different options that seem fine.

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-10 11:04               ` Sage Weil
@ 2016-06-10 19:05                 ` Marcus Watts
  2016-06-10 21:15                   ` Sage Weil
  0 siblings, 1 reply; 49+ messages in thread
From: Marcus Watts @ 2016-06-10 19:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Haomai Wang, ceph-devel

I'm going to try to respond to several messages with this one message...

1. bitmask
2. authentication.
3. final sequence
4. confounder
5. encryption per message or per connection?

____ 1. bitmask
banner phase - bitmask.

hex digits will work fine for the protocol - but
can't just use %llx %llx for parsing - when parsing for required, should
reject for any non-zero hex digits above word size; for supported should
ignore hex digits above word size.

____ 2. authentication.
We're already talking about 2 methods:
"initial connect"
"reconnect".
the client doesn't know when it tries to reconnect if the
server will accept a reconnect.  being able to say :
	c->s
		continue: reconnect
		continue: new_connect: phase 1
then either
	s->c
		accept: reconnect
or
	s->c
		reject: reconnect: no data
		continue: new connect phase 2

means 1 fewer round trips.

if the ident & feature exchange can be done at the same time (being
careful to make sure they're part of the signing process) I think that's
a further win.

Optimizing round trips is particularly important for high latency
connections.

I don't think the client has any business sending
a "set method" message before it sees the server's
"banner" - so there's already another round trip there.

____ 3. final sequence
"final sequence"
sorry, bad wording.
for a reconnect: last sequence number from previous connection.
(for a new connect after a reconnect: probably the client should be
setting this to its remembered last value even if the server doesn't
remember that value.)

____ 4. confounder
(but see next point too)
The confounder is really a lower-level feature of the encryption.
So it has to be per-message.  It needs to be part of the message hash,
but should not be visible past that.  With cbc mode, encryption looks like;
	given plaintext p[0..n]
	to encrypt,
		c[0] = e(p[0])
		c[n] = e(c[n-1] ^ p[n])
	to decrypt,
		p[0] = d(c[0])
		p[n] = d(c[n]) ^ c[n^1]
by making p[0] be a "random" value - that ensures for each
message all c[n] are likely unique, including the hash.
If p[0] were a set value, then c[0]
becomes a fixed value and further values of c[x] become
more deterministic as well.  That allows a bad guy to build up
a dictionary of p[x] -> c[x] even without knowing the key,
and we really don't want to give bad guys that capability.

____ 5. encryption per message or per connection?

You're thinking of doing encryption over the entire byte stream
not having framing outside.  Ah!.  Interesting...
Well, you certainly don't need extra confounders past the start
of your encryption.  That can be (more or less) a one time thing.
You will definitely need some sort of block flushing/padding
thing between messages.  I imagine you'll want something like:
	size message hash(size+message) padding-to-block
And you will want to have some sort of notion
that such connections are size limited - "do not encrypt more than N
bytes on a connection".  You probably should have some sort of
"new key every N bytes" notion over smaller byte ranges in addition.
(wouldn't hurt to do a new confounder for each new key too.)

					-Marcus Watts

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-10 19:05                 ` Marcus Watts
@ 2016-06-10 21:15                   ` Sage Weil
  2016-06-10 21:22                     ` Gregory Farnum
  2016-06-11 23:05                     ` Marcus Watts
  0 siblings, 2 replies; 49+ messages in thread
From: Sage Weil @ 2016-06-10 21:15 UTC (permalink / raw)
  To: Marcus Watts; +Cc: Gregory Farnum, Haomai Wang, ceph-devel

On Fri, 10 Jun 2016, Marcus Watts wrote:
> I'm going to try to respond to several messages with this one message...
> 
> 1. bitmask
> 2. authentication.
> 3. final sequence
> 4. confounder
> 5. encryption per message or per connection?
> 
> ____ 1. bitmask
> banner phase - bitmask.
> 
> hex digits will work fine for the protocol - but
> can't just use %llx %llx for parsing - when parsing for required, should
> reject for any non-zero hex digits above word size; for supported should
> ignore hex digits above word size.

+1

> ____ 2. authentication.
> We're already talking about 2 methods:
> "initial connect"
> "reconnect".
> the client doesn't know when it tries to reconnect if the
> server will accept a reconnect.  being able to say :
> 	c->s
> 		continue: reconnect
> 		continue: new_connect: phase 1
> then either
> 	s->c
> 		accept: reconnect
> or
> 	s->c
> 		reject: reconnect: no data
> 		continue: new connect phase 2
> 
> means 1 fewer round trips.

If the reconnect isn't accepted we fail hard (call a reset method and 
close the connection) so we can just plow forward with one method or the 
other, I think.

> if the ident & feature exchange can be done at the same time (being
> careful to make sure they're part of the signing process) I think that's
> a further win.
> 
> Optimizing round trips is particularly important for high latency
> connections.
> 
> I don't think the client has any business sending
> a "set method" message before it sees the server's
> "banner" - so there's already another round trip there.

The set method should be part of the base feature set, so the only risk is 
that we connect to something that isn't ceph.  I'm not sure we care if 
we're rude to another service on our port or not.  If we have to be 
polite, though, I guess there isn't much way around it.

Anyway, I think we're okay here.

> ____ 3. final sequence
> "final sequence"
> sorry, bad wording.
> for a reconnect: last sequence number from previous connection.
> (for a new connect after a reconnect: probably the client should be
> setting this to its remembered last value even if the server doesn't
> remember that value.)

Gotcha.

> ____ 4. confounder
> (but see next point too)
> The confounder is really a lower-level feature of the encryption.
> So it has to be per-message.  It needs to be part of the message hash,
> but should not be visible past that.  With cbc mode, encryption looks like;
> 	given plaintext p[0..n]
> 	to encrypt,
> 		c[0] = e(p[0])
> 		c[n] = e(c[n-1] ^ p[n])
> 	to decrypt,
> 		p[0] = d(c[0])
> 		p[n] = d(c[n]) ^ c[n^1]
> by making p[0] be a "random" value - that ensures for each
> message all c[n] are likely unique, including the hash.
> If p[0] were a set value, then c[0]
> becomes a fixed value and further values of c[x] become
> more deterministic as well.  That allows a bad guy to build up
> a dictionary of p[x] -> c[x] even without knowing the key,
> and we really don't want to give bad guys that capability.
> 
> ____ 5. encryption per message or per connection?
> 
> You're thinking of doing encryption over the entire byte stream
> not having framing outside.  Ah!.  Interesting...

Actually, over the entire stream, but with framing surrounding pieces.  So 
the c[n-1] will either be from c[0] (confounder during auth_done) or the 
perhaps the previous frame/message.  The framing has to be clear because 
different streams (converstaions between osds/clients/whatever) may be 
governed by different authentication and crypto keys/options/etc.

> Well, you certainly don't need extra confounders past the start
> of your encryption.  That can be (more or less) a one time thing.
> You will definitely need some sort of block flushing/padding
> thing between messages.  I imagine you'll want something like:
> 	size message hash(size+message) padding-to-block

Can you look at the padding described in the doc to see if it's 
sufficient?  Basically there is a block_size and sig_size.  Payloads are 
padded out to block_size for simplicity in the cipher, and sig_size is the 
size of the signature (if present--i.e., not encrypted) after that.

> And you will want to have some sort of notion
> that such connections are size limited - "do not encrypt more than N
> bytes on a connection".  You probably should have some sort of
> "new key every N bytes" notion over smaller byte ranges in addition.
> (wouldn't hurt to do a new confounder for each new key too.)

This isn't something cephx does.  The session keys do rotate every so 
often already (30m-1h?), but established connections aren't currently 
forced to reauthenticate.  If that's a short enough timespan it's probably 
an easyish change. If that's still too long, I think we can layer on 
session rekeying later when we get around to improving the auth methods 
themselves (e.g., maybe probably replacing cephx with something based on 
krb5).


Greg, Haomai, I think wip-msgr2 is in decent shape.  Want to take a final 
look before I merged?

	https://github.com/ceph/ceph/pull/9461/files

Thanks!
sage


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-10 21:15                   ` Sage Weil
@ 2016-06-10 21:22                     ` Gregory Farnum
  2016-06-11 23:05                     ` Marcus Watts
  1 sibling, 0 replies; 49+ messages in thread
From: Gregory Farnum @ 2016-06-10 21:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: Marcus Watts, Haomai Wang, ceph-devel

On Fri, Jun 10, 2016 at 2:15 PM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 10 Jun 2016, Marcus Watts wrote:
>> I'm going to try to respond to several messages with this one message...
>>
>> 1. bitmask
>> 2. authentication.
>> 3. final sequence
>> 4. confounder
>> 5. encryption per message or per connection?
>>
>> ____ 1. bitmask
>> banner phase - bitmask.
>>
>> hex digits will work fine for the protocol - but
>> can't just use %llx %llx for parsing - when parsing for required, should
>> reject for any non-zero hex digits above word size; for supported should
>> ignore hex digits above word size.
>
> +1
>
>> ____ 2. authentication.
>> We're already talking about 2 methods:
>> "initial connect"
>> "reconnect".
>> the client doesn't know when it tries to reconnect if the
>> server will accept a reconnect.  being able to say :
>>       c->s
>>               continue: reconnect
>>               continue: new_connect: phase 1
>> then either
>>       s->c
>>               accept: reconnect
>> or
>>       s->c
>>               reject: reconnect: no data
>>               continue: new connect phase 2
>>
>> means 1 fewer round trips.
>
> If the reconnect isn't accepted we fail hard (call a reset method and
> close the connection) so we can just plow forward with one method or the
> other, I think.
>
>> if the ident & feature exchange can be done at the same time (being
>> careful to make sure they're part of the signing process) I think that's
>> a further win.
>>
>> Optimizing round trips is particularly important for high latency
>> connections.
>>
>> I don't think the client has any business sending
>> a "set method" message before it sees the server's
>> "banner" - so there's already another round trip there.
>
> The set method should be part of the base feature set, so the only risk is
> that we connect to something that isn't ceph.  I'm not sure we care if
> we're rude to another service on our port or not.  If we have to be
> polite, though, I guess there isn't much way around it.
>
> Anyway, I think we're okay here.
>
>> ____ 3. final sequence
>> "final sequence"
>> sorry, bad wording.
>> for a reconnect: last sequence number from previous connection.
>> (for a new connect after a reconnect: probably the client should be
>> setting this to its remembered last value even if the server doesn't
>> remember that value.)
>
> Gotcha.
>
>> ____ 4. confounder
>> (but see next point too)
>> The confounder is really a lower-level feature of the encryption.
>> So it has to be per-message.  It needs to be part of the message hash,
>> but should not be visible past that.  With cbc mode, encryption looks like;
>>       given plaintext p[0..n]
>>       to encrypt,
>>               c[0] = e(p[0])
>>               c[n] = e(c[n-1] ^ p[n])
>>       to decrypt,
>>               p[0] = d(c[0])
>>               p[n] = d(c[n]) ^ c[n^1]
>> by making p[0] be a "random" value - that ensures for each
>> message all c[n] are likely unique, including the hash.
>> If p[0] were a set value, then c[0]
>> becomes a fixed value and further values of c[x] become
>> more deterministic as well.  That allows a bad guy to build up
>> a dictionary of p[x] -> c[x] even without knowing the key,
>> and we really don't want to give bad guys that capability.
>>
>> ____ 5. encryption per message or per connection?
>>
>> You're thinking of doing encryption over the entire byte stream
>> not having framing outside.  Ah!.  Interesting...
>
> Actually, over the entire stream, but with framing surrounding pieces.  So
> the c[n-1] will either be from c[0] (confounder during auth_done) or the
> perhaps the previous frame/message.  The framing has to be clear because
> different streams (converstaions between osds/clients/whatever) may be
> governed by different authentication and crypto keys/options/etc.
>
>> Well, you certainly don't need extra confounders past the start
>> of your encryption.  That can be (more or less) a one time thing.
>> You will definitely need some sort of block flushing/padding
>> thing between messages.  I imagine you'll want something like:
>>       size message hash(size+message) padding-to-block
>
> Can you look at the padding described in the doc to see if it's
> sufficient?  Basically there is a block_size and sig_size.  Payloads are
> padded out to block_size for simplicity in the cipher, and sig_size is the
> size of the signature (if present--i.e., not encrypted) after that.
>
>> And you will want to have some sort of notion
>> that such connections are size limited - "do not encrypt more than N
>> bytes on a connection".  You probably should have some sort of
>> "new key every N bytes" notion over smaller byte ranges in addition.
>> (wouldn't hurt to do a new confounder for each new key too.)
>
> This isn't something cephx does.  The session keys do rotate every so
> often already (30m-1h?), but established connections aren't currently
> forced to reauthenticate.  If that's a short enough timespan it's probably
> an easyish change. If that's still too long, I think we can layer on
> session rekeying later when we get around to improving the auth methods
> themselves (e.g., maybe probably replacing cephx with something based on
> krb5).
>
>
> Greg, Haomai, I think wip-msgr2 is in decent shape.  Want to take a final
> look before I merged?
>
>         https://github.com/ceph/ceph/pull/9461/files

Can you wait until after the 10.2.2 release? I'm busy with other
things until then.
-Greg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-10 21:15                   ` Sage Weil
  2016-06-10 21:22                     ` Gregory Farnum
@ 2016-06-11 23:05                     ` Marcus Watts
  2016-06-12 23:59                       ` Sage Weil
  1 sibling, 1 reply; 49+ messages in thread
From: Marcus Watts @ 2016-06-11 23:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Haomai Wang, ceph-devel

If the client doesn't look at "features" before it sends stuff, it
will not be able to be very smart about taking advantage of some
future better method.  In fact, there isn't much advantage
to the server sending anything early - it could just as easily
wait until after it's seen the clients request.

Failing hard & retrying on a failed reconnect is going to be slower.
On the bright side, at least it shouldn't happen often.

If you're sending encryption (w/ different auth or keys) from several
different streams, how are you planning to indicate which bits
go with which scheme?, and which bits are you planning to encrypt
and which not?  

Byte count limits.  Basically, you don't want collisions because
of duplicated keys or data.  This depends on your crypto system,
so, for instance, you should not encrypt with one key more than
	aes, cbc	about 2^68 bytes
	aes, ctr	exactly 2^128 bytes
more generally, this depends on mode, blocksize, ...
This applies across *all* uses of the key - and so you would
generally want to use the session key directly as little as possible.
(in particular, using the session key for ctr directly would be very very bad.)

If you've got multiple streams going already, you should be able
to include a fairly simple rekey method with little effort.
For instance, as part of the method, you could,
	up front as part of the method
		send a per-stream key encrypted under the shared secret.
	prepend to the first data sent in a payload
		byte limit, stream key #0 (encrypted under the per-stream key)
		then encrypt the next N bytes with stream key #0
	when the byte limit is reached, prepend to the
		next data sent in a payload
		byte limit, stream key #1 (encrypted under the per-stream key)
		then encrypt the next N bytes with stream key #1
	&etc.

					-Marcus Watts

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-11 23:05                     ` Marcus Watts
@ 2016-06-12 23:59                       ` Sage Weil
       [not found]                         ` <CACJqLyax_SXEZp3vA2_wR+CdwKOo2Re=SsK2xfXqmXjz9d8iNw@mail.gmail.com>
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-06-12 23:59 UTC (permalink / raw)
  To: Marcus Watts; +Cc: Gregory Farnum, Haomai Wang, ceph-devel

On Sat, 11 Jun 2016, Marcus Watts wrote:
> If the client doesn't look at "features" before it sends stuff, it
> will not be able to be very smart about taking advantage of some
> future better method.  In fact, there isn't much advantage
> to the server sending anything early - it could just as easily
> wait until after it's seen the clients request.
> 
> Failing hard & retrying on a failed reconnect is going to be slower.
> On the bright side, at least it shouldn't happen often.

Yep.  Well, I think it is the client's (limited choice).  If it needs to 
know the server features, it needs to either wait for them, or make some 
optimistic choice and be prepared to pay the cost of a mistake.  We should 
give the client choice, though, if we can.

> If you're sending encryption (w/ different auth or keys) from several
> different streams, how are you planning to indicate which bits
> go with which scheme?, and which bits are you planning to encrypt
> and which not?  

This is what he stream ids are for, and why the outer portion of the frame 
is unencrypted.  See

	https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330c44b436R68

 +  stream_id (le32)
 +  frame_len (le32)
 +  tag (TAG_* byte)
 +  payload
 +  [payload padding -- only present after stream auth phase]
 +  [signature -- only present after stream auth phase]

The tag and payload (and padding) would be encrypted or signed, but not 
the stream id and frame_len.

> Byte count limits.  Basically, you don't want collisions because
> of duplicated keys or data.  This depends on your crypto system,
> so, for instance, you should not encrypt with one key more than
> 	aes, cbc	about 2^68 bytes
> 	aes, ctr	exactly 2^128 bytes
> more generally, this depends on mode, blocksize, ...
> This applies across *all* uses of the key - and so you would
> generally want to use the session key directly as little as possible.
> (in particular, using the session key for ctr directly would be very very bad.)
> 
> If you've got multiple streams going already, you should be able
> to include a fairly simple rekey method with little effort.
> For instance, as part of the method, you could,
> 	up front as part of the method
> 		send a per-stream key encrypted under the shared secret.
> 	prepend to the first data sent in a payload
> 		byte limit, stream key #0 (encrypted under the per-stream key)
> 		then encrypt the next N bytes with stream key #0
> 	when the byte limit is reached, prepend to the
> 		next data sent in a payload
> 		byte limit, stream key #1 (encrypted under the per-stream key)
> 		then encrypt the next N bytes with stream key #1
> 	&etc.

Good idea.  If I understand correctly, it means that the session_key is 
only used to send the new/next random encryption key, and if we make the 
byte limit part of the initial protocol we get the rotation we need.  It 
might be simpler to do it as a frame limit instead of byte limit, and 
assume max-length frames (2^32 bytes).  We could still be super 
conservative and rotate the encryption key every 2^16 messages or 
something...?  And rotating the key on frame boundaries should be much 
simpler to implement.

Anyway, that part can be defined a bit later, I think.

Thanks!
sage


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
       [not found]                         ` <CACJqLyax_SXEZp3vA2_wR+CdwKOo2Re=SsK2xfXqmXjz9d8iNw@mail.gmail.com>
@ 2016-09-09 21:14                           ` Sage Weil
       [not found]                             ` <CACJqLyYwKZ5_1OHR_5=+mr=1ED2Nt34x4TB29j5dE1D+NjzFpg@mail.gmail.com>
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-09-09 21:14 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5963 bytes --]

On Sat, 10 Sep 2016, Haomai Wang wrote:
> @sage in current impl, when logic fault like state mismatch, data format
> mismatch or anything else, connection will abort session via closing socket.
> And the peer side would do something according to policy too.
> In msgr v2 when introducing multi streams in the same connection, we can't
> simply abort socket to indicate something wrong now. I think we need to
> introduce TAG_ABORT with error message.
> 
> But the peer side may stuck into a state like reading enough data as
> "length" indicate. It may miss the TAG_ABORT notify or other reconnect tag.
> A tricky thing is we use tcp OOB bit to send which exactly trigger "urgent"
> signal when receiving, but it only occur 1 byte in tcp proto which can be
> used here to indicate the stream id(32bit designed now). 
> 
> What's more, multi stream mixed within one socket may make trouble to
> message receiving when potential tcp packet silent error. So it looks we
> can't use the same socket the multi stream to meet our demands.
> 
> Any idea?

I think we intorduce a TAG_ABORT to interrupt the stream.  And then we 
have to assume that the low-level msgr2 implementation that reads and 
writes frames (which have their own frame_len) is not buggy.  In practice, 
the aborts tend to happen because we get a message we don't understand 
(version mismatch, encoding compatibility bug, etc.), and that'll happen 
at a higher level after frames have been read... so a TAG_ABORT will be 
sufficient.

Also, we can have an option to make aborts close the socket.  That'll be 
fine for now anyway, although later it's probably to disruptive when 
multiple streams are sharing a socket...

sage


 > 
> 
> 
> On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
>       On Sat, 11 Jun 2016, Marcus Watts wrote:
>       > If the client doesn't look at "features" before it sends
>       stuff, it
>       > will not be able to be very smart about taking advantage of
>       some
>       > future better method.  In fact, there isn't much advantage
>       > to the server sending anything early - it could just as easily
>       > wait until after it's seen the clients request.
>       >
>       > Failing hard & retrying on a failed reconnect is going to be
>       slower.
>       > On the bright side, at least it shouldn't happen often.
> 
>       Yep.  Well, I think it is the client's (limited choice).  If it
>       needs to
>       know the server features, it needs to either wait for them, or
>       make some
>       optimistic choice and be prepared to pay the cost of a mistake. 
>       We should
>       give the client choice, though, if we can.
> 
>       > If you're sending encryption (w/ different auth or keys) from
>       several
>       > different streams, how are you planning to indicate which bits
>       > go with which scheme?, and which bits are you planning to
>       encrypt
>       > and which not?
> 
>       This is what he stream ids are for, and why the outer portion of
>       the frame
>       is unencrypted.  See
> 
>              https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
>       c44b436R68
> 
>        +  stream_id (le32)
>        +  frame_len (le32)
>        +  tag (TAG_* byte)
>        +  payload
>        +  [payload padding -- only present after stream auth phase]
>        +  [signature -- only present after stream auth phase]
> 
>       The tag and payload (and padding) would be encrypted or signed,
>       but not
>       the stream id and frame_len.
> 
>       > Byte count limits.  Basically, you don't want collisions
>       because
>       > of duplicated keys or data.  This depends on your crypto
>       system,
>       > so, for instance, you should not encrypt with one key more
>       than
>       >       aes, cbc        about 2^68 bytes
>       >       aes, ctr        exactly 2^128 bytes
>       > more generally, this depends on mode, blocksize, ...
>       > This applies across *all* uses of the key - and so you would
>       > generally want to use the session key directly as little as
>       possible.
>       > (in particular, using the session key for ctr directly would
>       be very very bad.)
>       >
>       > If you've got multiple streams going already, you should be
>       able
>       > to include a fairly simple rekey method with little effort.
>       > For instance, as part of the method, you could,
>       >       up front as part of the method
>       >               send a per-stream key encrypted under the shared
>       secret.
>       >       prepend to the first data sent in a payload
>       >               byte limit, stream key #0 (encrypted under the
>       per-stream key)
>       >               then encrypt the next N bytes with stream key #0
>       >       when the byte limit is reached, prepend to the
>       >               next data sent in a payload
>       >               byte limit, stream key #1 (encrypted under the
>       per-stream key)
>       >               then encrypt the next N bytes with stream key #1
>       >       &etc.
> 
>       Good idea.  If I understand correctly, it means that the
>       session_key is
>       only used to send the new/next random encryption key, and if we
>       make the
>       byte limit part of the initial protocol we get the rotation we
>       need.  It
>       might be simpler to do it as a frame limit instead of byte
>       limit, and
>       assume max-length frames (2^32 bytes).  We could still be super
>       conservative and rotate the encryption key every 2^16 messages
>       or
>       something...?  And rotating the key on frame boundaries should
>       be much
>       simpler to implement.
> 
>       Anyway, that part can be defined a bit later, I think.
> 
>       Thanks!
>       sage
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
       [not found]                             ` <CACJqLyYwKZ5_1OHR_5=+mr=1ED2Nt34x4TB29j5dE1D+NjzFpg@mail.gmail.com>
@ 2016-09-10 14:43                               ` Haomai Wang
  2016-09-11 17:05                                 ` Sage Weil
  0 siblings, 1 reply; 49+ messages in thread
From: Haomai Wang @ 2016-09-10 14:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

About thing is v1/v2 compatible. I rethink the details:

0. we need to define the new banner which must longer than before("ceph v027")
1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
2. both in simle/async codes, server side must issue banner firstly
3. if server side supports v2 and client only supports v1, client will
receive 9 bytes and do memcmp, then reject this connection via closing
socket. So server side could retry the older version
4. if server side only supports v1 and client supports v2, client
according banner to reply corresponding banner

This tricky design is based on the implementation fact "accept side
issue the banner firstly" and "new banner is longer than old banner",
and this way doesn't need to involve other dependences like mon port
changes.

Does this way has problem?


On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
>
>
> On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
>>
>> On Sat, 10 Sep 2016, Haomai Wang wrote:
>> > @sage in current impl, when logic fault like state mismatch, data format
>> > mismatch or anything else, connection will abort session via closing
>> > socket.
>> > And the peer side would do something according to policy too.
>> > In msgr v2 when introducing multi streams in the same connection, we
>> > can't
>> > simply abort socket to indicate something wrong now. I think we need to
>> > introduce TAG_ABORT with error message.
>> >
>> > But the peer side may stuck into a state like reading enough data as
>> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
>> > tag.
>> > A tricky thing is we use tcp OOB bit to send which exactly trigger
>> > "urgent"
>> > signal when receiving, but it only occur 1 byte in tcp proto which can
>> > be
>> > used here to indicate the stream id(32bit designed now).
>> >
>> > What's more, multi stream mixed within one socket may make trouble to
>> > message receiving when potential tcp packet silent error. So it looks we
>> > can't use the same socket the multi stream to meet our demands.
>> >
>> > Any idea?
>>
>> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
>> have to assume that the low-level msgr2 implementation that reads and
>> writes frames (which have their own frame_len) is not buggy.  In practice,
>> the aborts tend to happen because we get a message we don't understand
>> (version mismatch, encoding compatibility bug, etc.), and that'll happen
>> at a higher level after frames have been read... so a TAG_ABORT will be
>> sufficient.
>
>
> yes, if frame is ok. It should be ok.... Let's go through this firstly...
> The worse case is the frame length is not expected as data transferred.
>
>>
>>
>> Also, we can have an option to make aborts close the socket.  That'll be
>> fine for now anyway, although later it's probably to disruptive when
>> multiple streams are sharing a socket...
>>
>> sage
>>
>>
>>  >
>> >
>> >
>> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
>> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
>> >       > If the client doesn't look at "features" before it sends
>> >       stuff, it
>> >       > will not be able to be very smart about taking advantage of
>> >       some
>> >       > future better method.  In fact, there isn't much advantage
>> >       > to the server sending anything early - it could just as easily
>> >       > wait until after it's seen the clients request.
>> >       >
>> >       > Failing hard & retrying on a failed reconnect is going to be
>> >       slower.
>> >       > On the bright side, at least it shouldn't happen often.
>> >
>> >       Yep.  Well, I think it is the client's (limited choice).  If it
>> >       needs to
>> >       know the server features, it needs to either wait for them, or
>> >       make some
>> >       optimistic choice and be prepared to pay the cost of a mistake.
>> >       We should
>> >       give the client choice, though, if we can.
>> >
>> >       > If you're sending encryption (w/ different auth or keys) from
>> >       several
>> >       > different streams, how are you planning to indicate which bits
>> >       > go with which scheme?, and which bits are you planning to
>> >       encrypt
>> >       > and which not?
>> >
>> >       This is what he stream ids are for, and why the outer portion of
>> >       the frame
>> >       is unencrypted.  See
>> >
>> >
>> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
>> >       c44b436R68
>> >
>> >        +  stream_id (le32)
>> >        +  frame_len (le32)
>> >        +  tag (TAG_* byte)
>> >        +  payload
>> >        +  [payload padding -- only present after stream auth phase]
>> >        +  [signature -- only present after stream auth phase]
>> >
>> >       The tag and payload (and padding) would be encrypted or signed,
>> >       but not
>> >       the stream id and frame_len.
>> >
>> >       > Byte count limits.  Basically, you don't want collisions
>> >       because
>> >       > of duplicated keys or data.  This depends on your crypto
>> >       system,
>> >       > so, for instance, you should not encrypt with one key more
>> >       than
>> >       >       aes, cbc        about 2^68 bytes
>> >       >       aes, ctr        exactly 2^128 bytes
>> >       > more generally, this depends on mode, blocksize, ...
>> >       > This applies across *all* uses of the key - and so you would
>> >       > generally want to use the session key directly as little as
>> >       possible.
>> >       > (in particular, using the session key for ctr directly would
>> >       be very very bad.)
>> >       >
>> >       > If you've got multiple streams going already, you should be
>> >       able
>> >       > to include a fairly simple rekey method with little effort.
>> >       > For instance, as part of the method, you could,
>> >       >       up front as part of the method
>> >       >               send a per-stream key encrypted under the shared
>> >       secret.
>> >       >       prepend to the first data sent in a payload
>> >       >               byte limit, stream key #0 (encrypted under the
>> >       per-stream key)
>> >       >               then encrypt the next N bytes with stream key #0
>> >       >       when the byte limit is reached, prepend to the
>> >       >               next data sent in a payload
>> >       >               byte limit, stream key #1 (encrypted under the
>> >       per-stream key)
>> >       >               then encrypt the next N bytes with stream key #1
>> >       >       &etc.
>> >
>> >       Good idea.  If I understand correctly, it means that the
>> >       session_key is
>> >       only used to send the new/next random encryption key, and if we
>> >       make the
>> >       byte limit part of the initial protocol we get the rotation we
>> >       need.  It
>> >       might be simpler to do it as a frame limit instead of byte
>> >       limit, and
>> >       assume max-length frames (2^32 bytes).  We could still be super
>> >       conservative and rotate the encryption key every 2^16 messages
>> >       or
>> >       something...?  And rotating the key on frame boundaries should
>> >       be much
>> >       simpler to implement.
>> >
>> >       Anyway, that part can be defined a bit later, I think.
>> >
>> >       Thanks!
>> >       sage
>> >
>> >
>> >
>> >
>
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-10 14:43                               ` Haomai Wang
@ 2016-09-11 17:05                                 ` Sage Weil
  2016-09-12  2:29                                   ` Haomai Wang
  2016-09-13 11:18                                   ` Jeff Layton
  0 siblings, 2 replies; 49+ messages in thread
From: Sage Weil @ 2016-09-11 17:05 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Sat, 10 Sep 2016, Haomai Wang wrote:
> About thing is v1/v2 compatible. I rethink the details:
> 
> 0. we need to define the new banner which must longer than before("ceph v027")
> 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> 2. both in simle/async codes, server side must issue banner firstly
> 3. if server side supports v2 and client only supports v1, client will
> receive 9 bytes and do memcmp, then reject this connection via closing
> socket. So server side could retry the older version
> 4. if server side only supports v1 and client supports v2, client
> according banner to reply corresponding banner
> 
> This tricky design is based on the implementation fact "accept side
> issue the banner firstly" and "new banner is longer than old banner",
> and this way doesn't need to involve other dependences like mon port
> changes.
> 
> Does this way has problem?

I was thinking we avoid this problem and any hacky initial handshakes by 
speaking v2 on the new port and v1 on the old port.  Then the monmap has 
an entity_addrvec_t with both a v1 and v2 address (encoding with just the 
v1 address for old clients). Same for the OSDs.

The v1 handshake just isn't extensible (how do you tell a v2 client 
connecting that you speak both v1 and v2?).

sage


> 
> 
> On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
> >
> >
> > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
> >>
> >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> >> > @sage in current impl, when logic fault like state mismatch, data format
> >> > mismatch or anything else, connection will abort session via closing
> >> > socket.
> >> > And the peer side would do something according to policy too.
> >> > In msgr v2 when introducing multi streams in the same connection, we
> >> > can't
> >> > simply abort socket to indicate something wrong now. I think we need to
> >> > introduce TAG_ABORT with error message.
> >> >
> >> > But the peer side may stuck into a state like reading enough data as
> >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
> >> > tag.
> >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
> >> > "urgent"
> >> > signal when receiving, but it only occur 1 byte in tcp proto which can
> >> > be
> >> > used here to indicate the stream id(32bit designed now).
> >> >
> >> > What's more, multi stream mixed within one socket may make trouble to
> >> > message receiving when potential tcp packet silent error. So it looks we
> >> > can't use the same socket the multi stream to meet our demands.
> >> >
> >> > Any idea?
> >>
> >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
> >> have to assume that the low-level msgr2 implementation that reads and
> >> writes frames (which have their own frame_len) is not buggy.  In practice,
> >> the aborts tend to happen because we get a message we don't understand
> >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> >> at a higher level after frames have been read... so a TAG_ABORT will be
> >> sufficient.
> >
> >
> > yes, if frame is ok. It should be ok.... Let's go through this firstly...
> > The worse case is the frame length is not expected as data transferred.
> >
> >>
> >>
> >> Also, we can have an option to make aborts close the socket.  That'll be
> >> fine for now anyway, although later it's probably to disruptive when
> >> multiple streams are sharing a socket...
> >>
> >> sage
> >>
> >>
> >>  >
> >> >
> >> >
> >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
> >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> >> >       > If the client doesn't look at "features" before it sends
> >> >       stuff, it
> >> >       > will not be able to be very smart about taking advantage of
> >> >       some
> >> >       > future better method.  In fact, there isn't much advantage
> >> >       > to the server sending anything early - it could just as easily
> >> >       > wait until after it's seen the clients request.
> >> >       >
> >> >       > Failing hard & retrying on a failed reconnect is going to be
> >> >       slower.
> >> >       > On the bright side, at least it shouldn't happen often.
> >> >
> >> >       Yep.  Well, I think it is the client's (limited choice).  If it
> >> >       needs to
> >> >       know the server features, it needs to either wait for them, or
> >> >       make some
> >> >       optimistic choice and be prepared to pay the cost of a mistake.
> >> >       We should
> >> >       give the client choice, though, if we can.
> >> >
> >> >       > If you're sending encryption (w/ different auth or keys) from
> >> >       several
> >> >       > different streams, how are you planning to indicate which bits
> >> >       > go with which scheme?, and which bits are you planning to
> >> >       encrypt
> >> >       > and which not?
> >> >
> >> >       This is what he stream ids are for, and why the outer portion of
> >> >       the frame
> >> >       is unencrypted.  See
> >> >
> >> >
> >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
> >> >       c44b436R68
> >> >
> >> >        +  stream_id (le32)
> >> >        +  frame_len (le32)
> >> >        +  tag (TAG_* byte)
> >> >        +  payload
> >> >        +  [payload padding -- only present after stream auth phase]
> >> >        +  [signature -- only present after stream auth phase]
> >> >
> >> >       The tag and payload (and padding) would be encrypted or signed,
> >> >       but not
> >> >       the stream id and frame_len.
> >> >
> >> >       > Byte count limits.  Basically, you don't want collisions
> >> >       because
> >> >       > of duplicated keys or data.  This depends on your crypto
> >> >       system,
> >> >       > so, for instance, you should not encrypt with one key more
> >> >       than
> >> >       >       aes, cbc        about 2^68 bytes
> >> >       >       aes, ctr        exactly 2^128 bytes
> >> >       > more generally, this depends on mode, blocksize, ...
> >> >       > This applies across *all* uses of the key - and so you would
> >> >       > generally want to use the session key directly as little as
> >> >       possible.
> >> >       > (in particular, using the session key for ctr directly would
> >> >       be very very bad.)
> >> >       >
> >> >       > If you've got multiple streams going already, you should be
> >> >       able
> >> >       > to include a fairly simple rekey method with little effort.
> >> >       > For instance, as part of the method, you could,
> >> >       >       up front as part of the method
> >> >       >               send a per-stream key encrypted under the shared
> >> >       secret.
> >> >       >       prepend to the first data sent in a payload
> >> >       >               byte limit, stream key #0 (encrypted under the
> >> >       per-stream key)
> >> >       >               then encrypt the next N bytes with stream key #0
> >> >       >       when the byte limit is reached, prepend to the
> >> >       >               next data sent in a payload
> >> >       >               byte limit, stream key #1 (encrypted under the
> >> >       per-stream key)
> >> >       >               then encrypt the next N bytes with stream key #1
> >> >       >       &etc.
> >> >
> >> >       Good idea.  If I understand correctly, it means that the
> >> >       session_key is
> >> >       only used to send the new/next random encryption key, and if we
> >> >       make the
> >> >       byte limit part of the initial protocol we get the rotation we
> >> >       need.  It
> >> >       might be simpler to do it as a frame limit instead of byte
> >> >       limit, and
> >> >       assume max-length frames (2^32 bytes).  We could still be super
> >> >       conservative and rotate the encryption key every 2^16 messages
> >> >       or
> >> >       something...?  And rotating the key on frame boundaries should
> >> >       be much
> >> >       simpler to implement.
> >> >
> >> >       Anyway, that part can be defined a bit later, I think.
> >> >
> >> >       Thanks!
> >> >       sage
> >> >
> >> >
> >> >
> >> >
> >
> >
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-11 17:05                                 ` Sage Weil
@ 2016-09-12  2:29                                   ` Haomai Wang
  2016-09-12 13:21                                     ` Sage Weil
  2016-09-13 11:18                                   ` Jeff Layton
  1 sibling, 1 reply; 49+ messages in thread
From: Haomai Wang @ 2016-09-12  2:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Mon, Sep 12, 2016 at 1:05 AM, Sage Weil <sweil@redhat.com> wrote:
> On Sat, 10 Sep 2016, Haomai Wang wrote:
>> About thing is v1/v2 compatible. I rethink the details:
>>
>> 0. we need to define the new banner which must longer than before("ceph v027")
>> 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
>> 2. both in simle/async codes, server side must issue banner firstly
>> 3. if server side supports v2 and client only supports v1, client will
>> receive 9 bytes and do memcmp, then reject this connection via closing
>> socket. So server side could retry the older version
>> 4. if server side only supports v1 and client supports v2, client
>> according banner to reply corresponding banner
>>
>> This tricky design is based on the implementation fact "accept side
>> issue the banner firstly" and "new banner is longer than old banner",
>> and this way doesn't need to involve other dependences like mon port
>> changes.
>>
>> Does this way has problem?
>
> I was thinking we avoid this problem and any hacky initial handshakes by
> speaking v2 on the new port and v1 on the old port.  Then the monmap has
> an entity_addrvec_t with both a v1 and v2 address (encoding with just the
> v1 address for old clients). Same for the OSDs.

This way is ok to me. So another change is double messenger
instances(to v1 and v2) or let each messenger support multi binding
addresses(this may need to refactor messenger interface).

>
> The v1 handshake just isn't extensible (how do you tell a v2 client
> connecting that you speak both v1 and v2?).
>
> sage
>
>
>>
>>
>> On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
>> >
>> >
>> > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
>> >>
>> >> On Sat, 10 Sep 2016, Haomai Wang wrote:
>> >> > @sage in current impl, when logic fault like state mismatch, data format
>> >> > mismatch or anything else, connection will abort session via closing
>> >> > socket.
>> >> > And the peer side would do something according to policy too.
>> >> > In msgr v2 when introducing multi streams in the same connection, we
>> >> > can't
>> >> > simply abort socket to indicate something wrong now. I think we need to
>> >> > introduce TAG_ABORT with error message.
>> >> >
>> >> > But the peer side may stuck into a state like reading enough data as
>> >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
>> >> > tag.
>> >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
>> >> > "urgent"
>> >> > signal when receiving, but it only occur 1 byte in tcp proto which can
>> >> > be
>> >> > used here to indicate the stream id(32bit designed now).
>> >> >
>> >> > What's more, multi stream mixed within one socket may make trouble to
>> >> > message receiving when potential tcp packet silent error. So it looks we
>> >> > can't use the same socket the multi stream to meet our demands.
>> >> >
>> >> > Any idea?
>> >>
>> >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
>> >> have to assume that the low-level msgr2 implementation that reads and
>> >> writes frames (which have their own frame_len) is not buggy.  In practice,
>> >> the aborts tend to happen because we get a message we don't understand
>> >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
>> >> at a higher level after frames have been read... so a TAG_ABORT will be
>> >> sufficient.
>> >
>> >
>> > yes, if frame is ok. It should be ok.... Let's go through this firstly...
>> > The worse case is the frame length is not expected as data transferred.
>> >
>> >>
>> >>
>> >> Also, we can have an option to make aborts close the socket.  That'll be
>> >> fine for now anyway, although later it's probably to disruptive when
>> >> multiple streams are sharing a socket...
>> >>
>> >> sage
>> >>
>> >>
>> >>  >
>> >> >
>> >> >
>> >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
>> >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
>> >> >       > If the client doesn't look at "features" before it sends
>> >> >       stuff, it
>> >> >       > will not be able to be very smart about taking advantage of
>> >> >       some
>> >> >       > future better method.  In fact, there isn't much advantage
>> >> >       > to the server sending anything early - it could just as easily
>> >> >       > wait until after it's seen the clients request.
>> >> >       >
>> >> >       > Failing hard & retrying on a failed reconnect is going to be
>> >> >       slower.
>> >> >       > On the bright side, at least it shouldn't happen often.
>> >> >
>> >> >       Yep.  Well, I think it is the client's (limited choice).  If it
>> >> >       needs to
>> >> >       know the server features, it needs to either wait for them, or
>> >> >       make some
>> >> >       optimistic choice and be prepared to pay the cost of a mistake.
>> >> >       We should
>> >> >       give the client choice, though, if we can.
>> >> >
>> >> >       > If you're sending encryption (w/ different auth or keys) from
>> >> >       several
>> >> >       > different streams, how are you planning to indicate which bits
>> >> >       > go with which scheme?, and which bits are you planning to
>> >> >       encrypt
>> >> >       > and which not?
>> >> >
>> >> >       This is what he stream ids are for, and why the outer portion of
>> >> >       the frame
>> >> >       is unencrypted.  See
>> >> >
>> >> >
>> >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
>> >> >       c44b436R68
>> >> >
>> >> >        +  stream_id (le32)
>> >> >        +  frame_len (le32)
>> >> >        +  tag (TAG_* byte)
>> >> >        +  payload
>> >> >        +  [payload padding -- only present after stream auth phase]
>> >> >        +  [signature -- only present after stream auth phase]
>> >> >
>> >> >       The tag and payload (and padding) would be encrypted or signed,
>> >> >       but not
>> >> >       the stream id and frame_len.
>> >> >
>> >> >       > Byte count limits.  Basically, you don't want collisions
>> >> >       because
>> >> >       > of duplicated keys or data.  This depends on your crypto
>> >> >       system,
>> >> >       > so, for instance, you should not encrypt with one key more
>> >> >       than
>> >> >       >       aes, cbc        about 2^68 bytes
>> >> >       >       aes, ctr        exactly 2^128 bytes
>> >> >       > more generally, this depends on mode, blocksize, ...
>> >> >       > This applies across *all* uses of the key - and so you would
>> >> >       > generally want to use the session key directly as little as
>> >> >       possible.
>> >> >       > (in particular, using the session key for ctr directly would
>> >> >       be very very bad.)
>> >> >       >
>> >> >       > If you've got multiple streams going already, you should be
>> >> >       able
>> >> >       > to include a fairly simple rekey method with little effort.
>> >> >       > For instance, as part of the method, you could,
>> >> >       >       up front as part of the method
>> >> >       >               send a per-stream key encrypted under the shared
>> >> >       secret.
>> >> >       >       prepend to the first data sent in a payload
>> >> >       >               byte limit, stream key #0 (encrypted under the
>> >> >       per-stream key)
>> >> >       >               then encrypt the next N bytes with stream key #0
>> >> >       >       when the byte limit is reached, prepend to the
>> >> >       >               next data sent in a payload
>> >> >       >               byte limit, stream key #1 (encrypted under the
>> >> >       per-stream key)
>> >> >       >               then encrypt the next N bytes with stream key #1
>> >> >       >       &etc.
>> >> >
>> >> >       Good idea.  If I understand correctly, it means that the
>> >> >       session_key is
>> >> >       only used to send the new/next random encryption key, and if we
>> >> >       make the
>> >> >       byte limit part of the initial protocol we get the rotation we
>> >> >       need.  It
>> >> >       might be simpler to do it as a frame limit instead of byte
>> >> >       limit, and
>> >> >       assume max-length frames (2^32 bytes).  We could still be super
>> >> >       conservative and rotate the encryption key every 2^16 messages
>> >> >       or
>> >> >       something...?  And rotating the key on frame boundaries should
>> >> >       be much
>> >> >       simpler to implement.
>> >> >
>> >> >       Anyway, that part can be defined a bit later, I think.
>> >> >
>> >> >       Thanks!
>> >> >       sage
>> >> >
>> >> >
>> >> >
>> >> >
>> >
>> >
>>
>>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-12  2:29                                   ` Haomai Wang
@ 2016-09-12 13:21                                     ` Sage Weil
  2016-09-13  0:03                                       ` Gregory Farnum
  2016-09-13 11:50                                       ` Jeff Layton
  0 siblings, 2 replies; 49+ messages in thread
From: Sage Weil @ 2016-09-12 13:21 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Mon, 12 Sep 2016, Haomai Wang wrote:
> On Mon, Sep 12, 2016 at 1:05 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Sat, 10 Sep 2016, Haomai Wang wrote:
> >> About thing is v1/v2 compatible. I rethink the details:
> >>
> >> 0. we need to define the new banner which must longer than before("ceph v027")
> >> 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> >> 2. both in simle/async codes, server side must issue banner firstly
> >> 3. if server side supports v2 and client only supports v1, client will
> >> receive 9 bytes and do memcmp, then reject this connection via closing
> >> socket. So server side could retry the older version
> >> 4. if server side only supports v1 and client supports v2, client
> >> according banner to reply corresponding banner
> >>
> >> This tricky design is based on the implementation fact "accept side
> >> issue the banner firstly" and "new banner is longer than old banner",
> >> and this way doesn't need to involve other dependences like mon port
> >> changes.
> >>
> >> Does this way has problem?
> >
> > I was thinking we avoid this problem and any hacky initial handshakes by
> > speaking v2 on the new port and v1 on the old port.  Then the monmap has
> > an entity_addrvec_t with both a v1 and v2 address (encoding with just the
> > v1 address for old clients). Same for the OSDs.
> 
> This way is ok to me. So another change is double messenger
> instances(to v1 and v2) or let each messenger support multi binding
> addresses(this may need to refactor messenger interface).

Yeah.  I'm guessing we'll want to have an entity_addrvec_t with address 
types mapped to different Messenger implementations (e.g., xio), so we'll 
wan to allow multiple instance eventually.  But we'll also just want to
allow multiple binding (v1 + v2, or ipv4 + ipv6).  :/

sage

> 
> >
> > The v1 handshake just isn't extensible (how do you tell a v2 client
> > connecting that you speak both v1 and v2?).
> >
> > sage
> >
> >
> >>
> >>
> >> On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
> >> >
> >> >
> >> > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
> >> >>
> >> >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> >> >> > @sage in current impl, when logic fault like state mismatch, data format
> >> >> > mismatch or anything else, connection will abort session via closing
> >> >> > socket.
> >> >> > And the peer side would do something according to policy too.
> >> >> > In msgr v2 when introducing multi streams in the same connection, we
> >> >> > can't
> >> >> > simply abort socket to indicate something wrong now. I think we need to
> >> >> > introduce TAG_ABORT with error message.
> >> >> >
> >> >> > But the peer side may stuck into a state like reading enough data as
> >> >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
> >> >> > tag.
> >> >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
> >> >> > "urgent"
> >> >> > signal when receiving, but it only occur 1 byte in tcp proto which can
> >> >> > be
> >> >> > used here to indicate the stream id(32bit designed now).
> >> >> >
> >> >> > What's more, multi stream mixed within one socket may make trouble to
> >> >> > message receiving when potential tcp packet silent error. So it looks we
> >> >> > can't use the same socket the multi stream to meet our demands.
> >> >> >
> >> >> > Any idea?
> >> >>
> >> >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
> >> >> have to assume that the low-level msgr2 implementation that reads and
> >> >> writes frames (which have their own frame_len) is not buggy.  In practice,
> >> >> the aborts tend to happen because we get a message we don't understand
> >> >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> >> >> at a higher level after frames have been read... so a TAG_ABORT will be
> >> >> sufficient.
> >> >
> >> >
> >> > yes, if frame is ok. It should be ok.... Let's go through this firstly...
> >> > The worse case is the frame length is not expected as data transferred.
> >> >
> >> >>
> >> >>
> >> >> Also, we can have an option to make aborts close the socket.  That'll be
> >> >> fine for now anyway, although later it's probably to disruptive when
> >> >> multiple streams are sharing a socket...
> >> >>
> >> >> sage
> >> >>
> >> >>
> >> >>  >
> >> >> >
> >> >> >
> >> >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
> >> >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> >> >> >       > If the client doesn't look at "features" before it sends
> >> >> >       stuff, it
> >> >> >       > will not be able to be very smart about taking advantage of
> >> >> >       some
> >> >> >       > future better method.  In fact, there isn't much advantage
> >> >> >       > to the server sending anything early - it could just as easily
> >> >> >       > wait until after it's seen the clients request.
> >> >> >       >
> >> >> >       > Failing hard & retrying on a failed reconnect is going to be
> >> >> >       slower.
> >> >> >       > On the bright side, at least it shouldn't happen often.
> >> >> >
> >> >> >       Yep.  Well, I think it is the client's (limited choice).  If it
> >> >> >       needs to
> >> >> >       know the server features, it needs to either wait for them, or
> >> >> >       make some
> >> >> >       optimistic choice and be prepared to pay the cost of a mistake.
> >> >> >       We should
> >> >> >       give the client choice, though, if we can.
> >> >> >
> >> >> >       > If you're sending encryption (w/ different auth or keys) from
> >> >> >       several
> >> >> >       > different streams, how are you planning to indicate which bits
> >> >> >       > go with which scheme?, and which bits are you planning to
> >> >> >       encrypt
> >> >> >       > and which not?
> >> >> >
> >> >> >       This is what he stream ids are for, and why the outer portion of
> >> >> >       the frame
> >> >> >       is unencrypted.  See
> >> >> >
> >> >> >
> >> >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
> >> >> >       c44b436R68
> >> >> >
> >> >> >        +  stream_id (le32)
> >> >> >        +  frame_len (le32)
> >> >> >        +  tag (TAG_* byte)
> >> >> >        +  payload
> >> >> >        +  [payload padding -- only present after stream auth phase]
> >> >> >        +  [signature -- only present after stream auth phase]
> >> >> >
> >> >> >       The tag and payload (and padding) would be encrypted or signed,
> >> >> >       but not
> >> >> >       the stream id and frame_len.
> >> >> >
> >> >> >       > Byte count limits.  Basically, you don't want collisions
> >> >> >       because
> >> >> >       > of duplicated keys or data.  This depends on your crypto
> >> >> >       system,
> >> >> >       > so, for instance, you should not encrypt with one key more
> >> >> >       than
> >> >> >       >       aes, cbc        about 2^68 bytes
> >> >> >       >       aes, ctr        exactly 2^128 bytes
> >> >> >       > more generally, this depends on mode, blocksize, ...
> >> >> >       > This applies across *all* uses of the key - and so you would
> >> >> >       > generally want to use the session key directly as little as
> >> >> >       possible.
> >> >> >       > (in particular, using the session key for ctr directly would
> >> >> >       be very very bad.)
> >> >> >       >
> >> >> >       > If you've got multiple streams going already, you should be
> >> >> >       able
> >> >> >       > to include a fairly simple rekey method with little effort.
> >> >> >       > For instance, as part of the method, you could,
> >> >> >       >       up front as part of the method
> >> >> >       >               send a per-stream key encrypted under the shared
> >> >> >       secret.
> >> >> >       >       prepend to the first data sent in a payload
> >> >> >       >               byte limit, stream key #0 (encrypted under the
> >> >> >       per-stream key)
> >> >> >       >               then encrypt the next N bytes with stream key #0
> >> >> >       >       when the byte limit is reached, prepend to the
> >> >> >       >               next data sent in a payload
> >> >> >       >               byte limit, stream key #1 (encrypted under the
> >> >> >       per-stream key)
> >> >> >       >               then encrypt the next N bytes with stream key #1
> >> >> >       >       &etc.
> >> >> >
> >> >> >       Good idea.  If I understand correctly, it means that the
> >> >> >       session_key is
> >> >> >       only used to send the new/next random encryption key, and if we
> >> >> >       make the
> >> >> >       byte limit part of the initial protocol we get the rotation we
> >> >> >       need.  It
> >> >> >       might be simpler to do it as a frame limit instead of byte
> >> >> >       limit, and
> >> >> >       assume max-length frames (2^32 bytes).  We could still be super
> >> >> >       conservative and rotate the encryption key every 2^16 messages
> >> >> >       or
> >> >> >       something...?  And rotating the key on frame boundaries should
> >> >> >       be much
> >> >> >       simpler to implement.
> >> >> >
> >> >> >       Anyway, that part can be defined a bit later, I think.
> >> >> >
> >> >> >       Thanks!
> >> >> >       sage
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-12 13:21                                     ` Sage Weil
@ 2016-09-13  0:03                                       ` Gregory Farnum
  2016-09-13  1:35                                         ` Haomai Wang
  2016-09-13 11:50                                       ` Jeff Layton
  1 sibling, 1 reply; 49+ messages in thread
From: Gregory Farnum @ 2016-09-13  0:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: Haomai Wang, Marcus Watts, ceph-devel

On Mon, Sep 12, 2016 at 6:21 AM, Sage Weil <sweil@redhat.com> wrote:
> On Mon, 12 Sep 2016, Haomai Wang wrote:
>> This way is ok to me. So another change is double messenger
>> instances(to v1 and v2) or let each messenger support multi binding
>> addresses(this may need to refactor messenger interface).
>
> Yeah.  I'm guessing we'll want to have an entity_addrvec_t with address
> types mapped to different Messenger implementations (e.g., xio), so we'll
> wan to allow multiple instance eventually.  But we'll also just want to
> allow multiple binding (v1 + v2, or ipv4 + ipv6).  :/

Hmm, is that really necessary? It seems a fair bit more complicated
and I'm not sure there's much payoff given the connection types.
Long-term the only doubled connection I can see being needed is the
client one; OSD cluster messengers will only be required to bind twice
during the initial upgrade period.

Put another way, what's the advantage of supporting two different
protocols within one messenger? That just sounds like a disaster
waiting to happen, and not one worth risking for slightly reducing the
thread count on AsyncMessenger (especially with users coming from the
SimpleMessenger).
-Greg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-13  0:03                                       ` Gregory Farnum
@ 2016-09-13  1:35                                         ` Haomai Wang
  2016-09-13 13:21                                           ` Sage Weil
  0 siblings, 1 reply; 49+ messages in thread
From: Haomai Wang @ 2016-09-13  1:35 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Marcus Watts, ceph-devel

On Tue, Sep 13, 2016 at 8:03 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Mon, Sep 12, 2016 at 6:21 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Mon, 12 Sep 2016, Haomai Wang wrote:
>>> This way is ok to me. So another change is double messenger
>>> instances(to v1 and v2) or let each messenger support multi binding
>>> addresses(this may need to refactor messenger interface).
>>
>> Yeah.  I'm guessing we'll want to have an entity_addrvec_t with address
>> types mapped to different Messenger implementations (e.g., xio), so we'll
>> wan to allow multiple instance eventually.  But we'll also just want to
>> allow multiple binding (v1 + v2, or ipv4 + ipv6).  :/
>
> Hmm, is that really necessary? It seems a fair bit more complicated
> and I'm not sure there's much payoff given the connection types.
> Long-term the only doubled connection I can see being needed is the
> client one; OSD cluster messengers will only be required to bind twice
> during the initial upgrade period.
>
> Put another way, what's the advantage of supporting two different
> protocols within one messenger? That just sounds like a disaster

I don't think this version upgrade will happen right now. I guess
there will be a version to switch to async msgr, then we consider to
upgrade msgr version really.

> waiting to happen, and not one worth risking for slightly reducing the
> thread count on AsyncMessenger (especially with users coming from the
> SimpleMessenger).

Maybe you like aonther msgr like asyncmessengerv2? instead of the same
msgr type. From my view, msgrv2 will reduce packet per message and
prepare for the other new features.

> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-11 17:05                                 ` Sage Weil
  2016-09-12  2:29                                   ` Haomai Wang
@ 2016-09-13 11:18                                   ` Jeff Layton
  2016-09-13 13:31                                     ` Sage Weil
  1 sibling, 1 reply; 49+ messages in thread
From: Jeff Layton @ 2016-09-13 11:18 UTC (permalink / raw)
  To: Sage Weil, Haomai Wang; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote:
> On Sat, 10 Sep 2016, Haomai Wang wrote:
> > About thing is v1/v2 compatible. I rethink the details:
> > 
> > 0. we need to define the new banner which must longer than before("ceph v027")
> > 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> > 2. both in simle/async codes, server side must issue banner firstly
> > 3. if server side supports v2 and client only supports v1, client will
> > receive 9 bytes and do memcmp, then reject this connection via closing
> > socket. So server side could retry the older version
> > 4. if server side only supports v1 and client supports v2, client
> > according banner to reply corresponding banner
> > 
> > This tricky design is based on the implementation fact "accept side
> > issue the banner firstly" and "new banner is longer than old banner",
> > and this way doesn't need to involve other dependences like mon port
> > changes.
> > 
> > Does this way has problem?
> 
> I was thinking we avoid this problem and any hacky initial handshakes by 
> speaking v2 on the new port and v1 on the old port.  Then the monmap has 
> an entity_addrvec_t with both a v1 and v2 address (encoding with just the 
> v1 address for old clients). Same for the OSDs.
> 
> The v1 handshake just isn't extensible (how do you tell a v2 client 
> connecting that you speak both v1 and v2?).
> 

Depending on port assignments for the protocol is pretty icky though.
There may be valid reasons to use different ports in some environments
and then that heuristic goes right out the window.

One thing that is really strange about both the old and new protocols
is that they have the client and server sending the initial exchange
concurrently, or have the server send it first.  While it may speed up
the initial negotiation slightly, it makes it really hard to handle
fallback to earlier protocol versions (as Haomai pointed out), as the
client is responsible for handing reconnects.

Consider the case where we have a client that supports only v1 but a
server that supports v1 and v2. Client connects and then server sends a
v2 message. Client doesn't understand it and closes the connection and
reconnects, only to end up in the same situation on the second attempt.

There's no way for the server to preserve the state from the initial
connection attempt and handle the new connection with v1. Would it not
make more sense to have the client connect and send its initial banner,
and then let the server decide what sort of banner to send based on
what the client sent?

> > 
> > 
> > > On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
> > >
> > >
> > > > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
> > >>
> > >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> > >> > @sage in current impl, when logic fault like state mismatch, data format
> > >> > mismatch or anything else, connection will abort session via closing
> > >> > socket.
> > >> > And the peer side would do something according to policy too.
> > >> > In msgr v2 when introducing multi streams in the same connection, we
> > >> > can't
> > >> > simply abort socket to indicate something wrong now. I think we need to
> > >> > introduce TAG_ABORT with error message.
> > >> >
> > >> > But the peer side may stuck into a state like reading enough data as
> > >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
> > >> > tag.
> > >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
> > >> > "urgent"
> > >> > signal when receiving, but it only occur 1 byte in tcp proto which can
> > >> > be
> > >> > used here to indicate the stream id(32bit designed now).
> > >> >
> > >> > What's more, multi stream mixed within one socket may make trouble to
> > >> > message receiving when potential tcp packet silent error. So it looks we
> > >> > can't use the same socket the multi stream to meet our demands.
> > >> >
> > >> > Any idea?
> > >>
> > >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
> > >> have to assume that the low-level msgr2 implementation that reads and
> > >> writes frames (which have their own frame_len) is not buggy.  In practice,
> > >> the aborts tend to happen because we get a message we don't understand
> > >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> > >> at a higher level after frames have been read... so a TAG_ABORT will be
> > >> sufficient.
> > >
> > >
> > > yes, if frame is ok. It should be ok.... Let's go through this firstly...
> > > The worse case is the frame length is not expected as data transferred.
> > >
> > >>
> > >>
> > >> Also, we can have an option to make aborts close the socket.  That'll be
> > >> fine for now anyway, although later it's probably to disruptive when
> > >> multiple streams are sharing a socket...
> > >>
> > >> sage
> > >>
> > >>
> > >>  >
> > >> >
> > >> >
> > > >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
> > >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> > >> >       > If the client doesn't look at "features" before it sends
> > >> >       stuff, it
> > >> >       > will not be able to be very smart about taking advantage of
> > >> >       some
> > >> >       > future better method.  In fact, there isn't much advantage
> > >> >       > to the server sending anything early - it could just as easily
> > >> >       > wait until after it's seen the clients request.
> > >> >       >
> > >> >       > Failing hard & retrying on a failed reconnect is going to be
> > >> >       slower.
> > >> >       > On the bright side, at least it shouldn't happen often.
> > >> >
> > >> >       Yep.  Well, I think it is the client's (limited choice).  If it
> > >> >       needs to
> > >> >       know the server features, it needs to either wait for them, or
> > >> >       make some
> > >> >       optimistic choice and be prepared to pay the cost of a mistake.
> > >> >       We should
> > >> >       give the client choice, though, if we can.
> > >> >
> > >> >       > If you're sending encryption (w/ different auth or keys) from
> > >> >       several
> > >> >       > different streams, how are you planning to indicate which bits
> > >> >       > go with which scheme?, and which bits are you planning to
> > >> >       encrypt
> > >> >       > and which not?
> > >> >
> > >> >       This is what he stream ids are for, and why the outer portion of
> > >> >       the frame
> > >> >       is unencrypted.  See
> > >> >
> > >> >
> > >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
> > >> >       c44b436R68
> > >> >
> > >> >        +  stream_id (le32)
> > >> >        +  frame_len (le32)
> > >> >        +  tag (TAG_* byte)
> > >> >        +  payload
> > >> >        +  [payload padding -- only present after stream auth phase]
> > >> >        +  [signature -- only present after stream auth phase]
> > >> >
> > >> >       The tag and payload (and padding) would be encrypted or signed,
> > >> >       but not
> > >> >       the stream id and frame_len.
> > >> >
> > >> >       > Byte count limits.  Basically, you don't want collisions
> > >> >       because
> > >> >       > of duplicated keys or data.  This depends on your crypto
> > >> >       system,
> > >> >       > so, for instance, you should not encrypt with one key more
> > >> >       than
> > >> >       >       aes, cbc        about 2^68 bytes
> > >> >       >       aes, ctr        exactly 2^128 bytes
> > >> >       > more generally, this depends on mode, blocksize, ...
> > >> >       > This applies across *all* uses of the key - and so you would
> > >> >       > generally want to use the session key directly as little as
> > >> >       possible.
> > >> >       > (in particular, using the session key for ctr directly would
> > >> >       be very very bad.)
> > >> >       >
> > >> >       > If you've got multiple streams going already, you should be
> > >> >       able
> > >> >       > to include a fairly simple rekey method with little effort.
> > >> >       > For instance, as part of the method, you could,
> > >> >       >       up front as part of the method
> > >> >       >               send a per-stream key encrypted under the shared
> > >> >       secret.
> > >> >       >       prepend to the first data sent in a payload
> > >> >       >               byte limit, stream key #0 (encrypted under the
> > >> >       per-stream key)
> > >> >       >               then encrypt the next N bytes with stream key #0
> > >> >       >       when the byte limit is reached, prepend to the
> > >> >       >               next data sent in a payload
> > >> >       >               byte limit, stream key #1 (encrypted under the
> > >> >       per-stream key)
> > >> >       >               then encrypt the next N bytes with stream key #1
> > >> >       >       &etc.
> > >> >
> > >> >       Good idea.  If I understand correctly, it means that the
> > >> >       session_key is
> > >> >       only used to send the new/next random encryption key, and if we
> > >> >       make the
> > >> >       byte limit part of the initial protocol we get the rotation we
> > >> >       need.  It
> > >> >       might be simpler to do it as a frame limit instead of byte
> > >> >       limit, and
> > >> >       assume max-length frames (2^32 bytes).  We could still be super
> > >> >       conservative and rotate the encryption key every 2^16 messages
> > >> >       or
> > >> >       something...?  And rotating the key on frame boundaries should
> > >> >       be much
> > >> >       simpler to implement.
> > >> >
> > >> >       Anyway, that part can be defined a bit later, I think.
> > >> >
> > >> >       Thanks!
> > >> >       sage
> > >> >
> > >> >
> > >> >
> > >> >
> > >
> > >
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jeff Layton <jlayton@poochiereds.net>
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-12 13:21                                     ` Sage Weil
  2016-09-13  0:03                                       ` Gregory Farnum
@ 2016-09-13 11:50                                       ` Jeff Layton
  1 sibling, 0 replies; 49+ messages in thread
From: Jeff Layton @ 2016-09-13 11:50 UTC (permalink / raw)
  To: Sage Weil, Haomai Wang; +Cc: Marcus Watts, Gregory Farnum, ceph-devel

On Mon, 2016-09-12 at 13:21 +0000, Sage Weil wrote:
> On Mon, 12 Sep 2016, Haomai Wang wrote:
> > > On Mon, Sep 12, 2016 at 1:05 AM, Sage Weil <sweil@redhat.com> wrote:
> > > On Sat, 10 Sep 2016, Haomai Wang wrote:
> > >> About thing is v1/v2 compatible. I rethink the details:
> > >>
> > >> 0. we need to define the new banner which must longer than before("ceph v027")
> > >> 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> > >> 2. both in simle/async codes, server side must issue banner firstly
> > >> 3. if server side supports v2 and client only supports v1, client will
> > >> receive 9 bytes and do memcmp, then reject this connection via closing
> > >> socket. So server side could retry the older version
> > >> 4. if server side only supports v1 and client supports v2, client
> > >> according banner to reply corresponding banner
> > >>
> > >> This tricky design is based on the implementation fact "accept side
> > >> issue the banner firstly" and "new banner is longer than old banner",
> > >> and this way doesn't need to involve other dependences like mon port
> > >> changes.
> > >>
> > >> Does this way has problem?
> > >
> > > I was thinking we avoid this problem and any hacky initial handshakes by
> > > speaking v2 on the new port and v1 on the old port.  Then the monmap has
> > > an entity_addrvec_t with both a v1 and v2 address (encoding with just the
> > > v1 address for old clients). Same for the OSDs.
> > 
> > This way is ok to me. So another change is double messenger
> > instances(to v1 and v2) or let each messenger support multi binding
> > addresses(this may need to refactor messenger interface).
> 
> Yeah.  I'm guessing we'll want to have an entity_addrvec_t with address 
> types mapped to different Messenger implementations (e.g., xio), so we'll 
> wan to allow multiple instance eventually.  But we'll also just want to
> allow multiple binding (v1 + v2, or ipv4 + ipv6).  :/
> 
> 

I'm a little slow on the uptake here and not that familiar with the MDS
code, but given that the protocol is largely based around a socket
connections (or something like them), what exactly is the point of
passing around the entity_addr_t objects? Is either peer expected ever
to do something with these addresses?

It seems like this is something that may end up hamstringing you later
for things like roaming clients. Suppose you have a laptop with a dhcp
address that is communicating with several servers. Someone closes the
lid and it goes to sleep, but wakes up in a couple of minutes on a
different network.

It reconnects but now its banner will be different than before. Would
it be able to reestablish its session?
 

> > 
> > >
> > > The v1 handshake just isn't extensible (how do you tell a v2 client
> > > connecting that you speak both v1 and v2?).
> > >
> > > sage
> > >
> > >
> > >>
> > >>
> > > >> On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
> > >> >
> > >> >
> > > >> > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
> > >> >>
> > >> >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> > >> >> > @sage in current impl, when logic fault like state mismatch, data format
> > >> >> > mismatch or anything else, connection will abort session via closing
> > >> >> > socket.
> > >> >> > And the peer side would do something according to policy too.
> > >> >> > In msgr v2 when introducing multi streams in the same connection, we
> > >> >> > can't
> > >> >> > simply abort socket to indicate something wrong now. I think we need to
> > >> >> > introduce TAG_ABORT with error message.
> > >> >> >
> > >> >> > But the peer side may stuck into a state like reading enough data as
> > >> >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
> > >> >> > tag.
> > >> >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
> > >> >> > "urgent"
> > >> >> > signal when receiving, but it only occur 1 byte in tcp proto which can
> > >> >> > be
> > >> >> > used here to indicate the stream id(32bit designed now).
> > >> >> >
> > >> >> > What's more, multi stream mixed within one socket may make trouble to
> > >> >> > message receiving when potential tcp packet silent error. So it looks we
> > >> >> > can't use the same socket the multi stream to meet our demands.
> > >> >> >
> > >> >> > Any idea?
> > >> >>
> > >> >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
> > >> >> have to assume that the low-level msgr2 implementation that reads and
> > >> >> writes frames (which have their own frame_len) is not buggy.  In practice,
> > >> >> the aborts tend to happen because we get a message we don't understand
> > >> >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> > >> >> at a higher level after frames have been read... so a TAG_ABORT will be
> > >> >> sufficient.
> > >> >
> > >> >
> > >> > yes, if frame is ok. It should be ok.... Let's go through this firstly...
> > >> > The worse case is the frame length is not expected as data transferred.
> > >> >
> > >> >>
> > >> >>
> > >> >> Also, we can have an option to make aborts close the socket.  That'll be
> > >> >> fine for now anyway, although later it's probably to disruptive when
> > >> >> multiple streams are sharing a socket...
> > >> >>
> > >> >> sage
> > >> >>
> > >> >>
> > >> >>  >
> > >> >> >
> > >> >> >
> > >> >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
> > >> >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> > >> >> >       > If the client doesn't look at "features" before it sends
> > >> >> >       stuff, it
> > >> >> >       > will not be able to be very smart about taking advantage of
> > >> >> >       some
> > >> >> >       > future better method.  In fact, there isn't much advantage
> > >> >> >       > to the server sending anything early - it could just as easily
> > >> >> >       > wait until after it's seen the clients request.
> > >> >> >       >
> > >> >> >       > Failing hard & retrying on a failed reconnect is going to be
> > >> >> >       slower.
> > >> >> >       > On the bright side, at least it shouldn't happen often.
> > >> >> >
> > >> >> >       Yep.  Well, I think it is the client's (limited choice).  If it
> > >> >> >       needs to
> > >> >> >       know the server features, it needs to either wait for them, or
> > >> >> >       make some
> > >> >> >       optimistic choice and be prepared to pay the cost of a mistake.
> > >> >> >       We should
> > >> >> >       give the client choice, though, if we can.
> > >> >> >
> > >> >> >       > If you're sending encryption (w/ different auth or keys) from
> > >> >> >       several
> > >> >> >       > different streams, how are you planning to indicate which bits
> > >> >> >       > go with which scheme?, and which bits are you planning to
> > >> >> >       encrypt
> > >> >> >       > and which not?
> > >> >> >
> > >> >> >       This is what he stream ids are for, and why the outer portion of
> > >> >> >       the frame
> > >> >> >       is unencrypted.  See
> > >> >> >
> > >> >> >
> > >> >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
> > >> >> >       c44b436R68
> > >> >> >
> > >> >> >        +  stream_id (le32)
> > >> >> >        +  frame_len (le32)
> > >> >> >        +  tag (TAG_* byte)
> > >> >> >        +  payload
> > >> >> >        +  [payload padding -- only present after stream auth phase]
> > >> >> >        +  [signature -- only present after stream auth phase]
> > >> >> >
> > >> >> >       The tag and payload (and padding) would be encrypted or signed,
> > >> >> >       but not
> > >> >> >       the stream id and frame_len.
> > >> >> >
> > >> >> >       > Byte count limits.  Basically, you don't want collisions
> > >> >> >       because
> > >> >> >       > of duplicated keys or data.  This depends on your crypto
> > >> >> >       system,
> > >> >> >       > so, for instance, you should not encrypt with one key more
> > >> >> >       than
> > >> >> >       >       aes, cbc        about 2^68 bytes
> > >> >> >       >       aes, ctr        exactly 2^128 bytes
> > >> >> >       > more generally, this depends on mode, blocksize, ...
> > >> >> >       > This applies across *all* uses of the key - and so you would
> > >> >> >       > generally want to use the session key directly as little as
> > >> >> >       possible.
> > >> >> >       > (in particular, using the session key for ctr directly would
> > >> >> >       be very very bad.)
> > >> >> >       >
> > >> >> >       > If you've got multiple streams going already, you should be
> > >> >> >       able
> > >> >> >       > to include a fairly simple rekey method with little effort.
> > >> >> >       > For instance, as part of the method, you could,
> > >> >> >       >       up front as part of the method
> > >> >> >       >               send a per-stream key encrypted under the shared
> > >> >> >       secret.
> > >> >> >       >       prepend to the first data sent in a payload
> > >> >> >       >               byte limit, stream key #0 (encrypted under the
> > >> >> >       per-stream key)
> > >> >> >       >               then encrypt the next N bytes with stream key #0
> > >> >> >       >       when the byte limit is reached, prepend to the
> > >> >> >       >               next data sent in a payload
> > >> >> >       >               byte limit, stream key #1 (encrypted under the
> > >> >> >       per-stream key)
> > >> >> >       >               then encrypt the next N bytes with stream key #1
> > >> >> >       >       &etc.
> > >> >> >
> > >> >> >       Good idea.  If I understand correctly, it means that the
> > >> >> >       session_key is
> > >> >> >       only used to send the new/next random encryption key, and if we
> > >> >> >       make the
> > >> >> >       byte limit part of the initial protocol we get the rotation we
> > >> >> >       need.  It
> > >> >> >       might be simpler to do it as a frame limit instead of byte
> > >> >> >       limit, and
> > >> >> >       assume max-length frames (2^32 bytes).  We could still be super
> > >> >> >       conservative and rotate the encryption key every 2^16 messages
> > >> >> >       or
> > >> >> >       something...?  And rotating the key on frame boundaries should
> > >> >> >       be much
> > >> >> >       simpler to implement.
> > >> >> >
> > >> >> >       Anyway, that part can be defined a bit later, I think.
> > >> >> >
> > >> >> >       Thanks!
> > >> >> >       sage
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >
> > >> >
> > >>
> > >>
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-13  1:35                                         ` Haomai Wang
@ 2016-09-13 13:21                                           ` Sage Weil
  0 siblings, 0 replies; 49+ messages in thread
From: Sage Weil @ 2016-09-13 13:21 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Gregory Farnum, Marcus Watts, ceph-devel

On Tue, 13 Sep 2016, Haomai Wang wrote:
> On Tue, Sep 13, 2016 at 8:03 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> > On Mon, Sep 12, 2016 at 6:21 AM, Sage Weil <sweil@redhat.com> wrote:
> >> On Mon, 12 Sep 2016, Haomai Wang wrote:
> >>> This way is ok to me. So another change is double messenger
> >>> instances(to v1 and v2) or let each messenger support multi binding
> >>> addresses(this may need to refactor messenger interface).
> >>
> >> Yeah.  I'm guessing we'll want to have an entity_addrvec_t with address
> >> types mapped to different Messenger implementations (e.g., xio), so we'll
> >> wan to allow multiple instance eventually.  But we'll also just want to
> >> allow multiple binding (v1 + v2, or ipv4 + ipv6).  :/
> >
> > Hmm, is that really necessary? It seems a fair bit more complicated
> > and I'm not sure there's much payoff given the connection types.
> > Long-term the only doubled connection I can see being needed is the
> > client one; OSD cluster messengers will only be required to bind twice
> > during the initial upgrade period.
> >
> > Put another way, what's the advantage of supporting two different
> > protocols within one messenger? That just sounds like a disaster
> 
> I don't think this version upgrade will happen right now. I guess
> there will be a version to switch to async msgr, then we consider to
> upgrade msgr version really.
> 
> > waiting to happen, and not one worth risking for slightly reducing the
> > thread count on AsyncMessenger (especially with users coming from the
> > SimpleMessenger).
> 
> Maybe you like aonther msgr like asyncmessengerv2? instead of the same
> msgr type. From my view, msgrv2 will reduce packet per message and
> prepare for the other new features.

There are two sides to the problem: the server needs to multi-home at 
mulitiple addresses (v1 + v2, ipv4 + ipv6 + xio, etc).  Using separate 
Messenger instances doesn't seem like a big deal there.

On the connecting client side of things, we will regularly be presented 
with an entity_addrvec_t for a peer and want to connect.  Something needs 
to pick a preferred address and connect.  If it's multiple Messengers, 
then there needs to be some multiplexing layer that opens a Connection.  
(I think from there on out almost everything is con-handle based, so that 
shouldn't be a problem.)

For v1/v2 a single instance would do, but once we start talking about xio 
then clearly we need the multiplexing (on both ends), so that's probably 
the way to go.

Whether it's a fork of msg/async or a constructor argument that determines 
which version of the protocol to speak doesn't matter much to me.  We 
could even just make async pure v2 and leave simplemessenger for legacy 
clients... that's the least long term code maintenance...

sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-13 11:18                                   ` Jeff Layton
@ 2016-09-13 13:31                                     ` Sage Weil
  2016-09-13 14:48                                       ` Jeff Layton
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-09-13 13:31 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Haomai Wang, Marcus Watts, Gregory Farnum, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 12101 bytes --]

On Tue, 13 Sep 2016, Jeff Layton wrote:
> On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote:
> > On Sat, 10 Sep 2016, Haomai Wang wrote:
> > > About thing is v1/v2 compatible. I rethink the details:
> > > 
> > > 0. we need to define the new banner which must longer than before("ceph v027")
> > > 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> > > 2. both in simle/async codes, server side must issue banner firstly
> > > 3. if server side supports v2 and client only supports v1, client will
> > > receive 9 bytes and do memcmp, then reject this connection via closing
> > > socket. So server side could retry the older version
> > > 4. if server side only supports v1 and client supports v2, client
> > > according banner to reply corresponding banner
> > > 
> > > This tricky design is based on the implementation fact "accept side
> > > issue the banner firstly" and "new banner is longer than old banner",
> > > and this way doesn't need to involve other dependences like mon port
> > > changes.
> > > 
> > > Does this way has problem?
> > 
> > I was thinking we avoid this problem and any hacky initial handshakes by 
> > speaking v2 on the new port and v1 on the old port.  Then the monmap has 
> > an entity_addrvec_t with both a v1 and v2 address (encoding with just the 
> > v1 address for old clients). Same for the OSDs.
> > 
> > The v1 handshake just isn't extensible (how do you tell a v2 client 
> > connecting that you speak both v1 and v2?).
> > 
> 
> Depending on port assignments for the protocol is pretty icky though.
> There may be valid reasons to use different ports in some environments
> and then that heuristic goes right out the window.
> 
> One thing that is really strange about both the old and new protocols
> is that they have the client and server sending the initial exchange
> concurrently, or have the server send it first.  While it may speed up
> the initial negotiation slightly, it makes it really hard to handle
> fallback to earlier protocol versions (as Haomai pointed out), as the
> client is responsible for handing reconnects.
> 
> Consider the case where we have a client that supports only v1 but a
> server that supports v1 and v2. Client connects and then server sends a
> v2 message. Client doesn't understand it and closes the connection and
> reconnects, only to end up in the same situation on the second attempt.
> 
> There's no way for the server to preserve the state from the initial
> connection attempt and handle the new connection with v1. Would it not
> make more sense to have the client connect and send its initial banner,
> and then let the server decide what sort of banner to send based on
> what the client sent?

This is why the v2 banner has the features values (%lx with supported and 
required bits).  Clients and servers (connecter and accepters, really, 
since servers talk to each other too) can concurrently announce what they 
support and require and then go from there.  It doesn't help with the v1 
transition, but the addrvec changes (entity_addr_t now has a type 
indicating which protocol is spoken, and multiple addrs can be listed for 
any server) along with a mon port change (which we have to do anyway to 
switch to our IANA assigned port) handle the v1 transition.

Are there other reasons to do the client banner first?  

sage

> 
> > > 
> > > 
> > > > On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
> > > >
> > > >
> > > > > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
> > > >>
> > > >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> > > >> > @sage in current impl, when logic fault like state mismatch, data format
> > > >> > mismatch or anything else, connection will abort session via closing
> > > >> > socket.
> > > >> > And the peer side would do something according to policy too.
> > > >> > In msgr v2 when introducing multi streams in the same connection, we
> > > >> > can't
> > > >> > simply abort socket to indicate something wrong now. I think we need to
> > > >> > introduce TAG_ABORT with error message.
> > > >> >
> > > >> > But the peer side may stuck into a state like reading enough data as
> > > >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
> > > >> > tag.
> > > >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
> > > >> > "urgent"
> > > >> > signal when receiving, but it only occur 1 byte in tcp proto which can
> > > >> > be
> > > >> > used here to indicate the stream id(32bit designed now).
> > > >> >
> > > >> > What's more, multi stream mixed within one socket may make trouble to
> > > >> > message receiving when potential tcp packet silent error. So it looks we
> > > >> > can't use the same socket the multi stream to meet our demands.
> > > >> >
> > > >> > Any idea?
> > > >>
> > > >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
> > > >> have to assume that the low-level msgr2 implementation that reads and
> > > >> writes frames (which have their own frame_len) is not buggy.  In practice,
> > > >> the aborts tend to happen because we get a message we don't understand
> > > >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> > > >> at a higher level after frames have been read... so a TAG_ABORT will be
> > > >> sufficient.
> > > >
> > > >
> > > > yes, if frame is ok. It should be ok.... Let's go through this firstly...
> > > > The worse case is the frame length is not expected as data transferred.
> > > >
> > > >>
> > > >>
> > > >> Also, we can have an option to make aborts close the socket.  That'll be
> > > >> fine for now anyway, although later it's probably to disruptive when
> > > >> multiple streams are sharing a socket...
> > > >>
> > > >> sage
> > > >>
> > > >>
> > > >>  >
> > > >> >
> > > >> >
> > > > >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
> > > >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> > > >> >       > If the client doesn't look at "features" before it sends
> > > >> >       stuff, it
> > > >> >       > will not be able to be very smart about taking advantage of
> > > >> >       some
> > > >> >       > future better method.  In fact, there isn't much advantage
> > > >> >       > to the server sending anything early - it could just as easily
> > > >> >       > wait until after it's seen the clients request.
> > > >> >       >
> > > >> >       > Failing hard & retrying on a failed reconnect is going to be
> > > >> >       slower.
> > > >> >       > On the bright side, at least it shouldn't happen often.
> > > >> >
> > > >> >       Yep.  Well, I think it is the client's (limited choice).  If it
> > > >> >       needs to
> > > >> >       know the server features, it needs to either wait for them, or
> > > >> >       make some
> > > >> >       optimistic choice and be prepared to pay the cost of a mistake.
> > > >> >       We should
> > > >> >       give the client choice, though, if we can.
> > > >> >
> > > >> >       > If you're sending encryption (w/ different auth or keys) from
> > > >> >       several
> > > >> >       > different streams, how are you planning to indicate which bits
> > > >> >       > go with which scheme?, and which bits are you planning to
> > > >> >       encrypt
> > > >> >       > and which not?
> > > >> >
> > > >> >       This is what he stream ids are for, and why the outer portion of
> > > >> >       the frame
> > > >> >       is unencrypted.  See
> > > >> >
> > > >> >
> > > >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
> > > >> >       c44b436R68
> > > >> >
> > > >> >        +  stream_id (le32)
> > > >> >        +  frame_len (le32)
> > > >> >        +  tag (TAG_* byte)
> > > >> >        +  payload
> > > >> >        +  [payload padding -- only present after stream auth phase]
> > > >> >        +  [signature -- only present after stream auth phase]
> > > >> >
> > > >> >       The tag and payload (and padding) would be encrypted or signed,
> > > >> >       but not
> > > >> >       the stream id and frame_len.
> > > >> >
> > > >> >       > Byte count limits.  Basically, you don't want collisions
> > > >> >       because
> > > >> >       > of duplicated keys or data.  This depends on your crypto
> > > >> >       system,
> > > >> >       > so, for instance, you should not encrypt with one key more
> > > >> >       than
> > > >> >       >       aes, cbc        about 2^68 bytes
> > > >> >       >       aes, ctr        exactly 2^128 bytes
> > > >> >       > more generally, this depends on mode, blocksize, ...
> > > >> >       > This applies across *all* uses of the key - and so you would
> > > >> >       > generally want to use the session key directly as little as
> > > >> >       possible.
> > > >> >       > (in particular, using the session key for ctr directly would
> > > >> >       be very very bad.)
> > > >> >       >
> > > >> >       > If you've got multiple streams going already, you should be
> > > >> >       able
> > > >> >       > to include a fairly simple rekey method with little effort.
> > > >> >       > For instance, as part of the method, you could,
> > > >> >       >       up front as part of the method
> > > >> >       >               send a per-stream key encrypted under the shared
> > > >> >       secret.
> > > >> >       >       prepend to the first data sent in a payload
> > > >> >       >               byte limit, stream key #0 (encrypted under the
> > > >> >       per-stream key)
> > > >> >       >               then encrypt the next N bytes with stream key #0
> > > >> >       >       when the byte limit is reached, prepend to the
> > > >> >       >               next data sent in a payload
> > > >> >       >               byte limit, stream key #1 (encrypted under the
> > > >> >       per-stream key)
> > > >> >       >               then encrypt the next N bytes with stream key #1
> > > >> >       >       &etc.
> > > >> >
> > > >> >       Good idea.  If I understand correctly, it means that the
> > > >> >       session_key is
> > > >> >       only used to send the new/next random encryption key, and if we
> > > >> >       make the
> > > >> >       byte limit part of the initial protocol we get the rotation we
> > > >> >       need.  It
> > > >> >       might be simpler to do it as a frame limit instead of byte
> > > >> >       limit, and
> > > >> >       assume max-length frames (2^32 bytes).  We could still be super
> > > >> >       conservative and rotate the encryption key every 2^16 messages
> > > >> >       or
> > > >> >       something...?  And rotating the key on frame boundaries should
> > > >> >       be much
> > > >> >       simpler to implement.
> > > >> >
> > > >> >       Anyway, that part can be defined a bit later, I think.
> > > >> >
> > > >> >       Thanks!
> > > >> >       sage
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >
> > > >
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -- 
> Jeff Layton <jlayton@poochiereds.net>
> -- 
> Jeff Layton <jlayton@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-13 13:31                                     ` Sage Weil
@ 2016-09-13 14:48                                       ` Jeff Layton
  2016-09-13 15:10                                         ` Sage Weil
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff Layton @ 2016-09-13 14:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Haomai Wang, Marcus Watts, Gregory Farnum, ceph-devel

On Tue, 2016-09-13 at 13:31 +0000, Sage Weil wrote:
> On Tue, 13 Sep 2016, Jeff Layton wrote:
> > On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote:
> > > On Sat, 10 Sep 2016, Haomai Wang wrote:
> > > > About thing is v1/v2 compatible. I rethink the details:
> > > > 
> > > > 0. we need to define the new banner which must longer than before("ceph v027")
> > > > 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> > > > 2. both in simle/async codes, server side must issue banner firstly
> > > > 3. if server side supports v2 and client only supports v1, client will
> > > > receive 9 bytes and do memcmp, then reject this connection via closing
> > > > socket. So server side could retry the older version
> > > > 4. if server side only supports v1 and client supports v2, client
> > > > according banner to reply corresponding banner
> > > > 
> > > > This tricky design is based on the implementation fact "accept side
> > > > issue the banner firstly" and "new banner is longer than old banner",
> > > > and this way doesn't need to involve other dependences like mon port
> > > > changes.
> > > > 
> > > > Does this way has problem?
> > > 
> > > I was thinking we avoid this problem and any hacky initial handshakes by 
> > > speaking v2 on the new port and v1 on the old port.  Then the monmap has 
> > > an entity_addrvec_t with both a v1 and v2 address (encoding with just the 
> > > v1 address for old clients). Same for the OSDs.
> > > 
> > > The v1 handshake just isn't extensible (how do you tell a v2 client 
> > > connecting that you speak both v1 and v2?).
> > > 
> > 
> > Depending on port assignments for the protocol is pretty icky though.
> > There may be valid reasons to use different ports in some environments
> > and then that heuristic goes right out the window.
> > 
> > One thing that is really strange about both the old and new protocols
> > is that they have the client and server sending the initial exchange
> > concurrently, or have the server send it first.  While it may speed up
> > the initial negotiation slightly, it makes it really hard to handle
> > fallback to earlier protocol versions (as Haomai pointed out), as the
> > client is responsible for handing reconnects.
> > 
> > Consider the case where we have a client that supports only v1 but a
> > server that supports v1 and v2. Client connects and then server sends a
> > v2 message. Client doesn't understand it and closes the connection and
> > reconnects, only to end up in the same situation on the second attempt.
> > 
> > There's no way for the server to preserve the state from the initial
> > connection attempt and handle the new connection with v1. Would it not
> > make more sense to have the client connect and send its initial banner,
> > and then let the server decide what sort of banner to send based on
> > what the client sent?
> 
> This is why the v2 banner has the features values (%lx with supported and 
> required bits).  Clients and servers (connecter and accepters, really, 
> since servers talk to each other too) can concurrently announce what they 
> support and require and then go from there.  It doesn't help with the v1 
> transition, but the addrvec changes (entity_addr_t now has a type 
> indicating which protocol is spoken, and multiple addrs can be listed for 
> any server) along with a mon port change (which we have to do anyway to 
> switch to our IANA assigned port) handle the v1 transition.
> 

Ahh ok, I didn't realize ceph was squatting on a port! Ok, then if
you're planning to switch to a new well-known port anyway, then a clean
break like this makes more sense.

I'll confess though that I don't quite understand the whole point of
the entity_addr_t's. What purpose does it serve to exchange network
addresses here? Is it simply to inform the peer of other ways that it
can be reached? What happens if I pick up my laptop that's acting as a
ceph client and wander onto a new network. Does anything break?

> Are there other reasons to do the client banner first?  
> 
> 

I think that's the main one.

The only other reason I can think of might be to guard against
information disclosure to port scanners. If you require the client to
send a banner first, then the server could drop the connection if it
doesn't look right without ever sending anything.

That said, given that we're going to be using a IANA designated port in
most cases, that's not going to be terribly useful. The port scanners
would just send a bogus but legit-looking banner to that port.


> > 
> > > > 
> > > > 
> > > > > On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xsky.com> wrote:
> > > > >
> > > > >
> > > > > > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@redhat.com> wrote:
> > > > >>
> > > > >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> > > > >> > @sage in current impl, when logic fault like state mismatch, data format
> > > > >> > mismatch or anything else, connection will abort session via closing
> > > > >> > socket.
> > > > >> > And the peer side would do something according to policy too.
> > > > >> > In msgr v2 when introducing multi streams in the same connection, we
> > > > >> > can't
> > > > >> > simply abort socket to indicate something wrong now. I think we need to
> > > > >> > introduce TAG_ABORT with error message.
> > > > >> >
> > > > >> > But the peer side may stuck into a state like reading enough data as
> > > > >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
> > > > >> > tag.
> > > > >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
> > > > >> > "urgent"
> > > > >> > signal when receiving, but it only occur 1 byte in tcp proto which can
> > > > >> > be
> > > > >> > used here to indicate the stream id(32bit designed now).
> > > > >> >
> > > > >> > What's more, multi stream mixed within one socket may make trouble to
> > > > >> > message receiving when potential tcp packet silent error. So it looks we
> > > > >> > can't use the same socket the multi stream to meet our demands.
> > > > >> >
> > > > >> > Any idea?
> > > > >>
> > > > >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
> > > > >> have to assume that the low-level msgr2 implementation that reads and
> > > > >> writes frames (which have their own frame_len) is not buggy.  In practice,
> > > > >> the aborts tend to happen because we get a message we don't understand
> > > > >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> > > > >> at a higher level after frames have been read... so a TAG_ABORT will be
> > > > >> sufficient.
> > > > >
> > > > >
> > > > > yes, if frame is ok. It should be ok.... Let's go through this firstly...
> > > > > The worse case is the frame length is not expected as data transferred.
> > > > >
> > > > >>
> > > > >>
> > > > >> Also, we can have an option to make aborts close the socket.  That'll be
> > > > >> fine for now anyway, although later it's probably to disruptive when
> > > > >> multiple streams are sharing a socket...
> > > > >>
> > > > >> sage
> > > > >>
> > > > >>
> > > > >>  >
> > > > >> >
> > > > >> >
> > > > > >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@redhat.com> wrote:
> > > > >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> > > > >> >       > If the client doesn't look at "features" before it sends
> > > > >> >       stuff, it
> > > > >> >       > will not be able to be very smart about taking advantage of
> > > > >> >       some
> > > > >> >       > future better method.  In fact, there isn't much advantage
> > > > >> >       > to the server sending anything early - it could just as easily
> > > > >> >       > wait until after it's seen the clients request.
> > > > >> >       >
> > > > >> >       > Failing hard & retrying on a failed reconnect is going to be
> > > > >> >       slower.
> > > > >> >       > On the bright side, at least it shouldn't happen often.
> > > > >> >
> > > > >> >       Yep.  Well, I think it is the client's (limited choice).  If it
> > > > >> >       needs to
> > > > >> >       know the server features, it needs to either wait for them, or
> > > > >> >       make some
> > > > >> >       optimistic choice and be prepared to pay the cost of a mistake.
> > > > >> >       We should
> > > > >> >       give the client choice, though, if we can.
> > > > >> >
> > > > >> >       > If you're sending encryption (w/ different auth or keys) from
> > > > >> >       several
> > > > >> >       > different streams, how are you planning to indicate which bits
> > > > >> >       > go with which scheme?, and which bits are you planning to
> > > > >> >       encrypt
> > > > >> >       > and which not?
> > > > >> >
> > > > >> >       This is what he stream ids are for, and why the outer portion of
> > > > >> >       the frame
> > > > >> >       is unencrypted.  See
> > > > >> >
> > > > >> >
> > > > >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
> > > > >> >       c44b436R68
> > > > >> >
> > > > >> >        +  stream_id (le32)
> > > > >> >        +  frame_len (le32)
> > > > >> >        +  tag (TAG_* byte)
> > > > >> >        +  payload
> > > > >> >        +  [payload padding -- only present after stream auth phase]
> > > > >> >        +  [signature -- only present after stream auth phase]
> > > > >> >
> > > > >> >       The tag and payload (and padding) would be encrypted or signed,
> > > > >> >       but not
> > > > >> >       the stream id and frame_len.
> > > > >> >
> > > > >> >       > Byte count limits.  Basically, you don't want collisions
> > > > >> >       because
> > > > >> >       > of duplicated keys or data.  This depends on your crypto
> > > > >> >       system,
> > > > >> >       > so, for instance, you should not encrypt with one key more
> > > > >> >       than
> > > > >> >       >       aes, cbc        about 2^68 bytes
> > > > >> >       >       aes, ctr        exactly 2^128 bytes
> > > > >> >       > more generally, this depends on mode, blocksize, ...
> > > > >> >       > This applies across *all* uses of the key - and so you would
> > > > >> >       > generally want to use the session key directly as little as
> > > > >> >       possible.
> > > > >> >       > (in particular, using the session key for ctr directly would
> > > > >> >       be very very bad.)
> > > > >> >       >
> > > > >> >       > If you've got multiple streams going already, you should be
> > > > >> >       able
> > > > >> >       > to include a fairly simple rekey method with little effort.
> > > > >> >       > For instance, as part of the method, you could,
> > > > >> >       >       up front as part of the method
> > > > >> >       >               send a per-stream key encrypted under the shared
> > > > >> >       secret.
> > > > >> >       >       prepend to the first data sent in a payload
> > > > >> >       >               byte limit, stream key #0 (encrypted under the
> > > > >> >       per-stream key)
> > > > >> >       >               then encrypt the next N bytes with stream key #0
> > > > >> >       >       when the byte limit is reached, prepend to the
> > > > >> >       >               next data sent in a payload
> > > > >> >       >               byte limit, stream key #1 (encrypted under the
> > > > >> >       per-stream key)
> > > > >> >       >               then encrypt the next N bytes with stream key #1
> > > > >> >       >       &etc.
> > > > >> >
> > > > >> >       Good idea.  If I understand correctly, it means that the
> > > > >> >       session_key is
> > > > >> >       only used to send the new/next random encryption key, and if we
> > > > >> >       make the
> > > > >> >       byte limit part of the initial protocol we get the rotation we
> > > > >> >       need.  It
> > > > >> >       might be simpler to do it as a frame limit instead of byte
> > > > >> >       limit, and
> > > > >> >       assume max-length frames (2^32 bytes).  We could still be super
> > > > >> >       conservative and rotate the encryption key every 2^16 messages
> > > > >> >       or
> > > > >> >       something...?  And rotating the key on frame boundaries should
> > > > >> >       be much
> > > > >> >       simpler to implement.
> > > > >> >
> > > > >> >       Anyway, that part can be defined a bit later, I think.
> > > > >> >
> > > > >> >       Thanks!
> > > > >> >       sage
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >
> > > > >
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > -- 
> > > Jeff Layton <jlayton@poochiereds.net>
> > -- 
> > > Jeff Layton <jlayton@redhat.com>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-13 14:48                                       ` Jeff Layton
@ 2016-09-13 15:10                                         ` Sage Weil
  2016-09-13 20:07                                           ` Gregory Farnum
  0 siblings, 1 reply; 49+ messages in thread
From: Sage Weil @ 2016-09-13 15:10 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Haomai Wang, Marcus Watts, Gregory Farnum, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5735 bytes --]

On Tue, 13 Sep 2016, Jeff Layton wrote:
> On Tue, 2016-09-13 at 13:31 +0000, Sage Weil wrote:
> > On Tue, 13 Sep 2016, Jeff Layton wrote:
> > > On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote:
> > > > On Sat, 10 Sep 2016, Haomai Wang wrote:
> > > > > About thing is v1/v2 compatible. I rethink the details:
> > > > > 
> > > > > 0. we need to define the new banner which must longer than before("ceph v027")
> > > > > 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> > > > > 2. both in simle/async codes, server side must issue banner firstly
> > > > > 3. if server side supports v2 and client only supports v1, client will
> > > > > receive 9 bytes and do memcmp, then reject this connection via closing
> > > > > socket. So server side could retry the older version
> > > > > 4. if server side only supports v1 and client supports v2, client
> > > > > according banner to reply corresponding banner
> > > > > 
> > > > > This tricky design is based on the implementation fact "accept side
> > > > > issue the banner firstly" and "new banner is longer than old banner",
> > > > > and this way doesn't need to involve other dependences like mon port
> > > > > changes.
> > > > > 
> > > > > Does this way has problem?
> > > > 
> > > > I was thinking we avoid this problem and any hacky initial handshakes by 
> > > > speaking v2 on the new port and v1 on the old port.  Then the monmap has 
> > > > an entity_addrvec_t with both a v1 and v2 address (encoding with just the 
> > > > v1 address for old clients). Same for the OSDs.
> > > > 
> > > > The v1 handshake just isn't extensible (how do you tell a v2 client 
> > > > connecting that you speak both v1 and v2?).
> > > > 
> > > 
> > > Depending on port assignments for the protocol is pretty icky though.
> > > There may be valid reasons to use different ports in some environments
> > > and then that heuristic goes right out the window.
> > > 
> > > One thing that is really strange about both the old and new protocols
> > > is that they have the client and server sending the initial exchange
> > > concurrently, or have the server send it first.  While it may speed up
> > > the initial negotiation slightly, it makes it really hard to handle
> > > fallback to earlier protocol versions (as Haomai pointed out), as the
> > > client is responsible for handing reconnects.
> > > 
> > > Consider the case where we have a client that supports only v1 but a
> > > server that supports v1 and v2. Client connects and then server sends a
> > > v2 message. Client doesn't understand it and closes the connection and
> > > reconnects, only to end up in the same situation on the second attempt.
> > > 
> > > There's no way for the server to preserve the state from the initial
> > > connection attempt and handle the new connection with v1. Would it not
> > > make more sense to have the client connect and send its initial banner,
> > > and then let the server decide what sort of banner to send based on
> > > what the client sent?
> > 
> > This is why the v2 banner has the features values (%lx with supported and 
> > required bits).  Clients and servers (connecter and accepters, really, 
> > since servers talk to each other too) can concurrently announce what they 
> > support and require and then go from there.  It doesn't help with the v1 
> > transition, but the addrvec changes (entity_addr_t now has a type 
> > indicating which protocol is spoken, and multiple addrs can be listed for 
> > any server) along with a mon port change (which we have to do anyway to 
> > switch to our IANA assigned port) handle the v1 transition.
> > 
> 
> Ahh ok, I didn't realize ceph was squatting on a port! Ok, then if
> you're planning to switch to a new well-known port anyway, then a clean
> break like this makes more sense.
> 
> I'll confess though that I don't quite understand the whole point of
> the entity_addr_t's. What purpose does it serve to exchange network
> addresses here?

The main thing is that entity_addr_t contains a nonce to distinguish 
between difference incarnations of the same server on the same port.  When 
an OSD is marked down and comes back up, the nonce will be different, and 
its peers can tell they're talking to the new/current instance without any 
stale state (or whatever).  Currently we guard this at the messenger 
layer, so that if we're trying to connect to a particularly instance 
of osd.12 we will simply fail to connect if that port is occupied by 
someone else (e.g., a newer instance of osd.12 that we don't know about 
yet) so that we don't confuse them or ourselves.

> Is it simply to inform the peer of other ways that it
> can be reached?

With the addrvec changes anybody connecting to (this version of) you 
should already have a list of all your addresses...

> What happens if I pick up my laptop that's acting as a
> ceph client and wander onto a new network. Does anything break?

I'm sure something will break currently, but eventually I think we can 
shake these issues out... for clients, at least.  The servers all talk to 
each other so we assume there is no NAT gumming up the works.

> > Are there other reasons to do the client banner first?  
> 
> I think that's the main one.
> 
> The only other reason I can think of might be to guard against
> information disclosure to port scanners. If you require the client to
> send a banner first, then the server could drop the connection if it
> doesn't look right without ever sending anything.
> 
> That said, given that we're going to be using a IANA designated port in
> most cases, that's not going to be terribly useful. The port scanners
> would just send a bogus but legit-looking banner to that port.

Sounds good to me!
sage

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-09-13 15:10                                         ` Sage Weil
@ 2016-09-13 20:07                                           ` Gregory Farnum
  0 siblings, 0 replies; 49+ messages in thread
From: Gregory Farnum @ 2016-09-13 20:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Jeff Layton, Haomai Wang, Marcus Watts, ceph-devel

On Tue, Sep 13, 2016 at 8:10 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 13 Sep 2016, Jeff Layton wrote:
>> On Tue, 2016-09-13 at 13:31 +0000, Sage Weil wrote:
>> > On Tue, 13 Sep 2016, Jeff Layton wrote:
>> > > On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote:
>> > > > On Sat, 10 Sep 2016, Haomai Wang wrote:
>> > > > > About thing is v1/v2 compatible. I rethink the details:
>> > > > >
>> > > > > 0. we need to define the new banner which must longer than before("ceph v027")
>> > > > > 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
>> > > > > 2. both in simle/async codes, server side must issue banner firstly
>> > > > > 3. if server side supports v2 and client only supports v1, client will
>> > > > > receive 9 bytes and do memcmp, then reject this connection via closing
>> > > > > socket. So server side could retry the older version
>> > > > > 4. if server side only supports v1 and client supports v2, client
>> > > > > according banner to reply corresponding banner
>> > > > >
>> > > > > This tricky design is based on the implementation fact "accept side
>> > > > > issue the banner firstly" and "new banner is longer than old banner",
>> > > > > and this way doesn't need to involve other dependences like mon port
>> > > > > changes.
>> > > > >
>> > > > > Does this way has problem?
>> > > >
>> > > > I was thinking we avoid this problem and any hacky initial handshakes by
>> > > > speaking v2 on the new port and v1 on the old port.  Then the monmap has
>> > > > an entity_addrvec_t with both a v1 and v2 address (encoding with just the
>> > > > v1 address for old clients). Same for the OSDs.
>> > > >
>> > > > The v1 handshake just isn't extensible (how do you tell a v2 client
>> > > > connecting that you speak both v1 and v2?).
>> > > >
>> > >
>> > > Depending on port assignments for the protocol is pretty icky though.
>> > > There may be valid reasons to use different ports in some environments
>> > > and then that heuristic goes right out the window.
>> > >
>> > > One thing that is really strange about both the old and new protocols
>> > > is that they have the client and server sending the initial exchange
>> > > concurrently, or have the server send it first.  While it may speed up
>> > > the initial negotiation slightly, it makes it really hard to handle
>> > > fallback to earlier protocol versions (as Haomai pointed out), as the
>> > > client is responsible for handing reconnects.
>> > >
>> > > Consider the case where we have a client that supports only v1 but a
>> > > server that supports v1 and v2. Client connects and then server sends a
>> > > v2 message. Client doesn't understand it and closes the connection and
>> > > reconnects, only to end up in the same situation on the second attempt.
>> > >
>> > > There's no way for the server to preserve the state from the initial
>> > > connection attempt and handle the new connection with v1. Would it not
>> > > make more sense to have the client connect and send its initial banner,
>> > > and then let the server decide what sort of banner to send based on
>> > > what the client sent?
>> >
>> > This is why the v2 banner has the features values (%lx with supported and
>> > required bits).  Clients and servers (connecter and accepters, really,
>> > since servers talk to each other too) can concurrently announce what they
>> > support and require and then go from there.  It doesn't help with the v1
>> > transition, but the addrvec changes (entity_addr_t now has a type
>> > indicating which protocol is spoken, and multiple addrs can be listed for
>> > any server) along with a mon port change (which we have to do anyway to
>> > switch to our IANA assigned port) handle the v1 transition.
>> >
>>
>> Ahh ok, I didn't realize ceph was squatting on a port! Ok, then if
>> you're planning to switch to a new well-known port anyway, then a clean
>> break like this makes more sense.
>>
>> I'll confess though that I don't quite understand the whole point of
>> the entity_addr_t's. What purpose does it serve to exchange network
>> addresses here?
>
> The main thing is that entity_addr_t contains a nonce to distinguish
> between difference incarnations of the same server on the same port.  When
> an OSD is marked down and comes back up, the nonce will be different, and
> its peers can tell they're talking to the new/current instance without any
> stale state (or whatever).  Currently we guard this at the messenger
> layer, so that if we're trying to connect to a particularly instance
> of osd.12 we will simply fail to connect if that port is occupied by
> someone else (e.g., a newer instance of osd.12 that we don't know about
> yet) so that we don't confuse them or ourselves.
>
>> Is it simply to inform the peer of other ways that it
>> can be reached?
>
> With the addrvec changes anybody connecting to (this version of) you
> should already have a list of all your addresses...
>
>> What happens if I pick up my laptop that's acting as a
>> ceph client and wander onto a new network. Does anything break?
>
> I'm sure something will break currently, but eventually I think we can
> shake these issues out... for clients, at least.  The servers all talk to
> each other so we assume there is no NAT gumming up the works.

Well, base RADOS will work okay since it doesn't worry about caching,
but neither RBD nor CephFS handle this at all right now. If you're
putting the machine to sleep, presumably you're breaking the timeouts
on CephFS caps and the RBD locks. You've witnessed the horror of
trying to work through the data safety requirements there, Jeff. ;)

Beyond that, though, I think all our data structures are ultimately IP
based in terms of remembering sessions and such. We could maybe
recover using the messenger session cookies, but it wouldn't be a
trivial thing.
-Greg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: msgr2 protocol
  2016-06-29 16:52 ` Yehuda Sadeh-Weinraub
@ 2016-06-30 11:59   ` Avner Ben Hanoch
  0 siblings, 0 replies; 49+ messages in thread
From: Avner Ben Hanoch @ 2016-06-30 11:59 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: Sage Weil, Ceph Development



> -----Original Message-----
> From: Yehuda Sadeh-Weinraub 
> Sent: Wednesday, June 29, 2016 19:53
> 
> On Wed, Jun 29, 2016 at 4:59 AM, Avner Ben Hanoch
> <avnerb@mellanox.com> wrote:
> >
> > On Sat, 28 May 2016 11:19 AM, Yehuda Sadeh-Weinraub
> <yehuda@xxxxxxxxxx> wrote:
> >>On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>
> >>> If you do a connect and immediately write a few bytes to teh TCP
> >>> stream, does that actaully translate to fewer packets?  I was
> >>> guessing that the server writing the first bytes of the exchange
> >>> would be fine but if it speeds things up for the client to
> >>> optimistically start the exchange too we may as well...
> >>>
> >>
> >>While haven't really looked at it recently, I don't think it'd be possible to
> embed data with the SYN packet using the plain >vanilla tcp implementation.
> However, I believe that doing connect() and sending data immediately
> following it should improve >things, specifically if doing async connect (as
> with the async messenger), but this still needs to be proven.
> >>
> >>Yehuda
> >
> > I am using TCP with network sniffers like Wireshark and this is always the
> case that I see in Linux  - *sending data soon after connect will always save
> packet by combining the ACK from the last step of TCP 3-way handshake with
> the 1st data packet* .
> > This is the case even when I did "short" activity between connect and send.
> >
> > Sniffer will show you 3 packets on the stream:
> > 1.      Client sends SYN packet
> > 2.      Server replies with SYN-ACK packet
> > 3.      Client send *data packet* that have the ACK flag set in it (this ACK
> completes the TCP 3-way handshake and makes 'accept' return on the server
> side)
> >
> > synchronous or asynchronous socket isn't relevant here because 'connect'
> returns with success upon receiving SYN-ACK from the server regardless of the
> actual client send of the TCP 3-way completing ACK (i.e., the client application
> doesn't need this ACK for relying on the socket as connected - only the server
> side need it).
> >
> > From my experience, even disabling nagle (TCP_NODELAY) doesn't affect
> > this behavior (probably because TCP_NODELAY only affect sending *data*
> > faster but does not change TCP handshake behavior)
> 
> Right. However, I was aiming at sending the data with the SYN, not with the
> ACK that follows it (trying to avoid the first roundtrip). I assume it's not really
> possible with vanilla tcp.
> 
> Yehuda
> 
I think that Sage's idea/question was to have fewer packets and to speed things up by having the client starting the exchange instead of having the server starting the exchange.
Hence, I am saying - YES.  This idea is correct.  It will save one packet and it will save time of half round trip (i.e., one leg) because the client is allowed to send data after 2 legs of TCP 3-way handshake; while the server must wait for completion of all 3 legs of TCP 3-way handshake before sending any data.
And I am saying that this is the behavior of vanilla TCP without any special settings.

If you want to send traffic with the 1st SYN packet than it sounds a bit strange even for non-vanilla TCP.  Because the client doesn't even know that the server is alive at this phase, and the server doesn't know that the client is legitimate at this phase.  This will also probably expose your server to security issue that is called SYN attack, because servers are afraid of allocating connection resources just for SYN packets (because the src ip/port can be fictitious at this phase before the client used the server's reply)

Avner

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: msgr2 protocol
  2016-06-29 11:59 Avner Ben Hanoch
@ 2016-06-29 16:52 ` Yehuda Sadeh-Weinraub
  2016-06-30 11:59   ` Avner Ben Hanoch
  0 siblings, 1 reply; 49+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2016-06-29 16:52 UTC (permalink / raw)
  To: Avner Ben Hanoch; +Cc: Sage Weil, Ceph Development

On Wed, Jun 29, 2016 at 4:59 AM, Avner Ben Hanoch <avnerb@mellanox.com> wrote:
> bbbb
>
> On Sat, 28 May 2016 11:19 AM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>>On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>
>>> If you do a connect and immediately write a few bytes to teh TCP
>>> stream, does that actaully translate to fewer packets?  I was guessing
>>> that the server writing the first bytes of the exchange would be fine
>>> but if it speeds things up for the client to optimistically start the
>>> exchange too we may as well...
>>>
>>
>>While haven't really looked at it recently, I don't think it'd be possible to embed data with the SYN packet using the plain >vanilla tcp implementation. However, I believe that doing connect() and sending data immediately following it should improve >things, specifically if doing async connect (as with the async messenger), but this still needs to be proven.
>>
>>Yehuda
>
> I am using TCP with network sniffers like Wireshark and this is always the case that I see in Linux  - *sending data soon after connect will always save packet by combining the ACK from the last step of TCP 3-way handshake with the 1st data packet* .
> This is the case even when I did "short" activity between connect and send.
>
> Sniffer will show you 3 packets on the stream:
> 1.      Client sends SYN packet
> 2.      Server replies with SYN-ACK packet
> 3.      Client send *data packet* that have the ACK flag set in it (this ACK completes the TCP 3-way handshake and makes 'accept' return on the server side)
>
> synchronous or asynchronous socket isn't relevant here because 'connect' returns with success upon receiving SYN-ACK from the server regardless of the actual client send of the TCP 3-way completing ACK (i.e., the client application doesn't need this ACK for relying on the socket as connected - only the server side need it).
>
> From my experience, even disabling nagle (TCP_NODELAY) doesn't affect this behavior (probably because TCP_NODELAY only affect sending *data* faster but does not change TCP handshake behavior)

Right. However, I was aiming at sending the data with the SYN, not
with the ACK that follows it (trying to avoid the first roundtrip). I
assume it's not really possible with vanilla tcp.

Yehuda

>
> If you need a test application, I can provide you
> Avner

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: msgr2 protocol
@ 2016-06-29 11:59 Avner Ben Hanoch
  2016-06-29 16:52 ` Yehuda Sadeh-Weinraub
  0 siblings, 1 reply; 49+ messages in thread
From: Avner Ben Hanoch @ 2016-06-29 11:59 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub, Sage Weil; +Cc: Ceph Development

bbbb

On Sat, 28 May 2016 11:19 AM, Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote: 
>On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>
>> If you do a connect and immediately write a few bytes to teh TCP 
>> stream, does that actaully translate to fewer packets?  I was guessing 
>> that the server writing the first bytes of the exchange would be fine 
>> but if it speeds things up for the client to optimistically start the 
>> exchange too we may as well...
>>
>
>While haven't really looked at it recently, I don't think it'd be possible to embed data with the SYN packet using the plain >vanilla tcp implementation. However, I believe that doing connect() and sending data immediately following it should improve >things, specifically if doing async connect (as with the async messenger), but this still needs to be proven.
>
>Yehuda

I am using TCP with network sniffers like Wireshark and this is always the case that I see in Linux  - *sending data soon after connect will always save packet by combining the ACK from the last step of TCP 3-way handshake with the 1st data packet* .  
This is the case even when I did "short" activity between connect and send.

Sniffer will show you 3 packets on the stream:
1.	Client sends SYN packet
2.	Server replies with SYN-ACK packet
3.	Client send *data packet* that have the ACK flag set in it (this ACK completes the TCP 3-way handshake and makes 'accept' return on the server side)

synchronous or asynchronous socket isn't relevant here because 'connect' returns with success upon receiving SYN-ACK from the server regardless of the actual client send of the TCP 3-way completing ACK (i.e., the client application doesn't need this ACK for relying on the socket as connected - only the server side need it).

From my experience, even disabling nagle (TCP_NODELAY) doesn't affect this behavior (probably because TCP_NODELAY only affect sending *data* faster but does not change TCP handshake behavior)

If you need a test application, I can provide you
Avner

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2016-09-13 20:07 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-26 18:17 msgr2 protocol Sage Weil
2016-05-27  4:41 ` Haomai Wang
2016-05-27  4:45   ` Haomai Wang
2016-05-27  8:28   ` Marcus Watts
2016-05-27 17:33     ` Sage Weil
2016-05-27 17:28   ` Sage Weil
2016-05-27  9:44 ` Yehuda Sadeh-Weinraub
2016-05-27 17:37   ` Sage Weil
2016-05-28 18:19     ` Yehuda Sadeh-Weinraub
2016-06-02 15:43       ` Sage Weil
2016-06-02 15:59         ` Haomai Wang
2016-06-02 16:35           ` Sage Weil
2016-06-02 18:11 ` Gregory Farnum
2016-06-02 18:24   ` Sage Weil
2016-06-02 18:34     ` Gregory Farnum
2016-06-03 13:11       ` Sage Weil
2016-06-03 13:24       ` Sage Weil
2016-06-03 16:47         ` Haomai Wang
2016-06-03 17:33           ` Sage Weil
2016-06-03 17:35             ` Haomai Wang
2016-06-06  8:23               ` Junwang Zhao
2016-06-10  8:31                 ` Marcus Watts
2016-06-10 10:11                   ` Sage Weil
2016-06-10 10:48                   ` Sage Weil
2016-06-06 20:16             ` Gregory Farnum
2016-06-10 11:04               ` Sage Weil
2016-06-10 19:05                 ` Marcus Watts
2016-06-10 21:15                   ` Sage Weil
2016-06-10 21:22                     ` Gregory Farnum
2016-06-11 23:05                     ` Marcus Watts
2016-06-12 23:59                       ` Sage Weil
     [not found]                         ` <CACJqLyax_SXEZp3vA2_wR+CdwKOo2Re=SsK2xfXqmXjz9d8iNw@mail.gmail.com>
2016-09-09 21:14                           ` Sage Weil
     [not found]                             ` <CACJqLyYwKZ5_1OHR_5=+mr=1ED2Nt34x4TB29j5dE1D+NjzFpg@mail.gmail.com>
2016-09-10 14:43                               ` Haomai Wang
2016-09-11 17:05                                 ` Sage Weil
2016-09-12  2:29                                   ` Haomai Wang
2016-09-12 13:21                                     ` Sage Weil
2016-09-13  0:03                                       ` Gregory Farnum
2016-09-13  1:35                                         ` Haomai Wang
2016-09-13 13:21                                           ` Sage Weil
2016-09-13 11:50                                       ` Jeff Layton
2016-09-13 11:18                                   ` Jeff Layton
2016-09-13 13:31                                     ` Sage Weil
2016-09-13 14:48                                       ` Jeff Layton
2016-09-13 15:10                                         ` Sage Weil
2016-09-13 20:07                                           ` Gregory Farnum
2016-06-02 18:16 ` Gregory Farnum
2016-06-29 11:59 Avner Ben Hanoch
2016-06-29 16:52 ` Yehuda Sadeh-Weinraub
2016-06-30 11:59   ` Avner Ben Hanoch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.