* OOB message roll into Messenger interface
@ 2016-09-03 16:01 Haomai Wang
  2016-09-06 13:17 ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2016-09-03 16:01 UTC (permalink / raw)
  To: ceph-devel

Background:
Each OSD maintains two dedicated heartbeat messenger instances to monitor
the availability of the front and back networks. In a scale-out cluster
this adds a lot of connection and message overhead. We could instead fold
these heartbeat exchanges into the public/cluster messengers and save a
large number of connections (and the resources behind them).

The heartbeat message would then be OOB (out-of-band) but share the same
thread and socket as the normal message channel, so that it accurately
reflects the health of the real IO path. Otherwise the heartbeat channel's
status can't indicate the status of the real IO channel: because each
socket has its own send/recv buffers, the OOB channel may look healthy
even while real IO messages are blocked.

Besides the OSD heartbeats, we have logical PING/PONG traffic in the
Objecter ping, WatchNotify ping, etc. They could share the same heartbeat
message for the same purpose.

In a real rbd deployment, combining these ping/pong messages would avoid
thousands of messages, which means a large resource saving.

And as we reduce the heartbeat overhead, we can shorten the heartbeat
interval and increase its frequency, which helps a lot with the accuracy
of cluster failure detection!

Design:

As discussed in Raleigh, we could define these interfaces:

int Connection::register_oob_message(identify_op, callback, interval);

Users such as the Objecter linger ping could register a "callback" that
generates the bufferlist to be carried by the heartbeat message.
"interval" indicates how often the user's OOB payload should be sent.

"identify_op" indicates who handles the OOB info on the peer side, e.g.
"Ping", "OSDPing" or "LingerPing", matching the current message
definitions.

void Dispatcher::ms_dispatch_oob(Message*)

This handles the OOB message, parsing each OOB part.

This way, a lot of timer bookkeeping on the user's side can be avoided via
the callback generators. When sending, an OOB message could be inserted at
the front of the send queue; we can't get any help from the kernel's TCP
OOB (urgent) flag, since it's really useless for this purpose.
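
To make the shape of the interface concrete, here is a rough standalone
sketch (simplified types: a real version would hand back a
ceph::bufferlist, and the OobEntry bookkeeping is only illustrative, not
actual Messenger internals):

  // Sketch only: simplified stand-in for the proposed registration API.
  #include <cstdint>
  #include <functional>
  #include <map>
  #include <utility>
  #include <vector>

  using Payload = std::vector<uint8_t>;      // stand-in for bufferlist
  using OobCallback = std::function<Payload()>;

  class Connection {
  public:
    // Piggyback a payload on this connection's heartbeat: 'cb' regenerates
    // the payload every 'interval' seconds, tagged with 'identify_op' so
    // the peer-side dispatcher knows who handles it.
    int register_oob_message(int identify_op, OobCallback cb,
                             double interval) {
      oob_handlers[identify_op] = {std::move(cb), interval};
      return 0;
    }
    void unregister_oob_message(int identify_op) {
      oob_handlers.erase(identify_op);
    }

  private:
    struct OobEntry {
      OobCallback cb;
      double interval;
    };
    // Walked by the messenger's heartbeat timer; every payload due in the
    // same tick is folded into one OOB message on the wire.
    std::map<int, OobEntry> oob_handlers;
  };

The Objecter linger ping would register a callback encoding its linger ID,
OSDPing would encode fsid and osdmap epoch, and the messenger would own
all the timing.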

Any suggestions are welcome!


* Re: OOB message roll into Messenger interface
  2016-09-03 16:01 OOB message roll into Messenger interface Haomai Wang
@ 2016-09-06 13:17 ` Sage Weil
  2016-09-06 13:33   ` Haomai Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2016-09-06 13:17 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

Hi Haomai!

On Sun, 4 Sep 2016, Haomai Wang wrote:
> Background:
> Each OSD maintains two dedicated heartbeat messenger instances to monitor
> the availability of the front and back networks. In a scale-out cluster
> this adds a lot of connection and message overhead. We could instead fold
> these heartbeat exchanges into the public/cluster messengers and save a
> large number of connections (and the resources behind them).
>
> The heartbeat message would then be OOB (out-of-band) but share the same
> thread and socket as the normal message channel, so that it accurately
> reflects the health of the real IO path. Otherwise the heartbeat channel's
> status can't indicate the status of the real IO channel: because each
> socket has its own send/recv buffers, the OOB channel may look healthy
> even while real IO messages are blocked.
>
> Besides the OSD heartbeats, we have logical PING/PONG traffic in the
> Objecter ping, WatchNotify ping, etc. They could share the same heartbeat
> message for the same purpose.
>
> In a real rbd deployment, combining these ping/pong messages would avoid
> thousands of messages, which means a large resource saving.
>
> And as we reduce the heartbeat overhead, we can shorten the heartbeat
> interval and increase its frequency, which helps a lot with the accuracy
> of cluster failure detection!

I'm very excited to see this move forward!
 
> Design:
> 
> As discussed in Raleigh, we could define these interfaces:
> 
> int Connection::register_oob_message(identify_op, callback, interval);
> 
> Users such as the Objecter linger ping could register a "callback" that
> generates the bufferlist to be carried by the heartbeat message.
> "interval" indicates how often the user's OOB payload should be sent.
> 
> "identify_op" indicates who handles the OOB info on the peer side, e.g.
> "Ping", "OSDPing" or "LingerPing", matching the current message
> definitions.

This looks convenient for the simpler callers, but I worry it won't work 
as well for OSDPing. There's a bunch of odd locking around the heartbeat 
info and the code already exists to do the heartbeat sends.  I'm not 
sure it will simplify to a simple interval.

An easier first step would be to just define a 
Connection::send_message_oob(Message*).  That would require almost no 
changes to the calling code, and avoid having to create the timing 
infrastructure inside AsyncMessenger...
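
A minimal sketch of that first step, assuming send_message_oob() simply
mirrors the send_message() signature (hypothetical, for illustration):

  class Message;

  class Connection {
  public:
    virtual ~Connection() = default;
    virtual int send_message(Message *m) = 0;
    virtual int send_message_oob(Message *m) = 0;  // proposed addition
  };

  // The OSD heartbeat loop would then change only in which method it
  // calls, e.g. i->second.con_back->send_message_oob(new MOSDPing(...)),
  // leaving all timing and peer bookkeeping in OSD::heartbeat() as-is.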

sage

> void Dispatcher::ms_dispatch_oob(Message*)
> 
> This handles the OOB message, parsing each OOB part.
> 
> This way, a lot of timer bookkeeping on the user's side can be avoided via
> the callback generators. When sending, an OOB message could be inserted at
> the front of the send queue; we can't get any help from the kernel's TCP
> OOB (urgent) flag, since it's really useless for this purpose.
> 
> Any suggestions are welcome!


* Re: OOB message roll into Messenger interface
  2016-09-06 13:17 ` Sage Weil
@ 2016-09-06 13:33   ` Haomai Wang
  2016-09-06 14:06     ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2016-09-06 13:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Tue, Sep 6, 2016 at 9:17 PM, Sage Weil <sage@newdream.net> wrote:
> On Sun, 4 Sep 2016, Haomai Wang wrote:
>> int Connection::register_oob_message(identify_op, callback, interval);
>> [...]
>
> This looks convenient for the simpler callers, but I worry it won't work
> as well for OSDPing. There's a bunch of odd locking around the heartbeat
> info and the code already exists to do the heartbeat sends.  I'm not
> sure it will simplify to a simple interval.

Hmm, I'm not sure what the odd locking refers to. We can register the
callback when adding a new peer and unregister it when removing the peer
from "heartbeat_peers".

The callback that constructs the message payload would be extracted from
this loop:

  for (map<int,HeartbeatInfo>::iterator i = heartbeat_peers.begin();
       i != heartbeat_peers.end();
       ++i) {
    int peer = i->first;
    i->second.last_tx = now;
    if (i->second.first_tx == utime_t())
      i->second.first_tx = now;
    dout(30) << "heartbeat sending ping to osd." << peer << dendl;
    // ping the back-side connection, and the front one if it exists
    i->second.con_back->send_message(new MOSDPing(monc->get_fsid(),
                                                  service.get_osdmap()->get_epoch(),
                                                  MOSDPing::PING,
                                                  now));
    if (i->second.con_front)
      i->second.con_front->send_message(new MOSDPing(monc->get_fsid(),
                                                     service.get_osdmap()->get_epoch(),
                                                     MOSDPing::PING,
                                                     now));
  }

Only "fsid", "osdmap epoch" are required, I don't think it will block.
Then I think lots of locking/odding things exists on heartbeat
dispatch/handle process. sending process is clear I guess.

The advantage to register callback is we can combine multi layers oob
messages to one.
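
For illustration, the extracted registration might look roughly like this
(hypothetical sketch: the OOB_OSD_PING tag, the helper name, and the exact
encode calls are my assumptions layered on the proposed API):

  // Register the OSDPing payload generator when a heartbeat peer is
  // added; the messenger's timer then drives the sends, and the explicit
  // for-loop above disappears from OSD::heartbeat().
  void OSD::add_heartbeat_peer_oob(HeartbeatInfo& hi) {
    auto make_ping_payload = [this] {
      bufferlist bl;
      ::encode(monc->get_fsid(), bl);                   // only these two
      ::encode(service.get_osdmap()->get_epoch(), bl);  // fields needed
      return bl;
    };
    hi.con_back->register_oob_message(OOB_OSD_PING, make_ping_payload,
                                      cct->_conf->osd_heartbeat_interval);
    if (hi.con_front)
      hi.con_front->register_oob_message(OOB_OSD_PING, make_ping_payload,
                                         cct->_conf->osd_heartbeat_interval);
  }
  // On peer removal, a matching unregister_oob_message(OOB_OSD_PING)
  // call replaces the manual bookkeeping.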


* Re: OOB message roll into Messenger interface
  2016-09-06 13:33   ` Haomai Wang
@ 2016-09-06 14:06     ` Sage Weil
  2016-09-06 14:15       ` Haomai Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2016-09-06 14:06 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On Tue, 6 Sep 2016, Haomai Wang wrote:
> Hmm, I'm not sure what the odd locking refers to. We can register the
> callback when adding a new peer and unregister it when removing the peer
> from "heartbeat_peers".
> [...]
> Only "fsid" and the osdmap epoch are required, so I don't think it will
> block. Most of the locking oddities live in the heartbeat dispatch/handle
> path; the sending path is clean, I think.

Yeah, I guess that's fine.  I was worried about some dependency between 
who we ping and the osdmap epoch in the message (and races adding/removing 
heartbeat peers), but I think it doesn't matter.

Even so, I think it would be good to expose the send_message_oob() 
interface and do this in two stages so the two changes are decoupled, 
unless there is some implementation reason why the OOB message scheduling 
needs to be done inside the messenger?

sage


* Re: OOB message roll into Messenger interface
  2016-09-06 14:06     ` Sage Weil
@ 2016-09-06 14:15       ` Haomai Wang
  2016-09-06 17:42         ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2016-09-06 14:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Tue, Sep 6, 2016 at 10:06 PM, Sage Weil <sage@newdream.net> wrote:
> Yeah, I guess that's fine.  I was worried about some dependency between
> who we ping and the osdmap epoch in the message (and races adding/removing
> heartbeat peers), but I think it doesn't matter.
>
> Even so, I think it would be good to expose the send_message_oob()
> interface and do this in two stages so the two changes are decoupled,
> unless there is some implementation reason why the OOB message scheduling
> needs to be done inside the messenger?

Agreed! We could remove the heartbeat messenger first!


* Re: OOB message roll into Messenger interface
  2016-09-06 14:15       ` Haomai Wang
@ 2016-09-06 17:42         ` Gregory Farnum
  2016-09-06 18:06           ` Sage Weil
  2016-09-07  2:43           ` Haomai Wang
  0 siblings, 2 replies; 10+ messages in thread
From: Gregory Farnum @ 2016-09-06 17:42 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel

On Tue, Sep 6, 2016 at 7:15 AM, Haomai Wang <haomai@xsky.com> wrote:
> Agreed! We could remove the heartbeat messenger first!

Let's keep in mind the challenges of out-of-band messaging over TCP/IP.

Namely, when we discussed this we couldn't figure out any way
(including the TCP priority stuff, which doesn't work with the
required semantics — even when it does function) to get traffic to
actually go out-of-band. IB messaging systems actually have a
"channels" concept that lets you do genuine OOB transmission that
skips over queues and other data; TCP doesn't. In fact the best we
came up with for doing this with Simple/AsyncMessenger was giving the
Messenger duplicate sockets/queues/etc, which is hardly ideal.

So, maybe we can remove the heartbeat messenger by giving each
Connection two sockets and queues. That might even work better for the
AsyncMessenger than it does for SimpleMessenger?
But any implementation that orders OSD heartbeat messages behind
ordinary data traffic in kernel or router buffers is probably going to
fail us. :(
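
For concreteness, the two-sockets-per-Connection variant might be shaped
like this (a sketch with stand-in types, not an implementation proposal):

  #include <deque>

  class Message {};
  class Socket {};  // stand-in for a connected TCP socket

  class Connection {
    Socket data_sock;  // ordinary messages, ordered behind each other
    Socket oob_sock;   // heartbeats/pings only
    std::deque<Message*> data_queue;
    std::deque<Message*> oob_queue;  // drained independently

  public:
    void send_message(Message* m)     { data_queue.push_back(m); }
    void send_message_oob(Message* m) { oob_queue.push_back(m); }
    // The event loop writes each queue to its own socket, so a full data
    // socket buffer can no longer stall a pending heartbeat.
  };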
-Greg


* Re: OOB message roll into Messenger interface
  2016-09-06 17:42         ` Gregory Farnum
@ 2016-09-06 18:06           ` Sage Weil
  2016-09-07  2:46             ` Haomai Wang
  2016-09-07  2:43           ` Haomai Wang
  1 sibling, 1 reply; 10+ messages in thread
From: Sage Weil @ 2016-09-06 18:06 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, ceph-devel


On Tue, 6 Sep 2016, Gregory Farnum wrote:
> Let's keep in mind the challenges of out-of-band messaging over TCP/IP.
>
> Namely, when we discussed this we couldn't figure out any way
> (including the TCP priority stuff, which doesn't work with the
> required semantics — even when it does function) to get traffic to
> actually go out-of-band.
> [...]
> So, maybe we can remove the heartbeat messenger by giving each
> Connection two sockets and queues. That might even work better for the
> AsyncMessenger than it does for SimpleMessenger?
> But any implementation that orders OSD heartbeat messages behind
> ordinary data traffic in kernel or router buffers is probably going to
> fail us. :(

Oh, good point.  I didn't read that paragraph carefully.  I think we 
should use a second socket connected to the same address for OOB messages.  
Or possibly push them over UDP... but we'd need to define retry semantics 
in that case.
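
To sketch what those retry semantics might mean in practice (the names,
intervals, and failure rule below are all assumptions, not something the
thread has settled on):

  #include <chrono>
  #include <cstdint>
  #include <map>

  using Clock = std::chrono::steady_clock;

  // UDP pings are fire-and-forget; "retry" just means we keep sending on
  // the interval and judge a peer by how long its replies have been
  // missing, rather than tracking acks per packet.
  struct UdpHeartbeat {
    std::chrono::milliseconds interval{1000};  // resend cadence
    int max_missed = 3;                        // misses before "suspect"

    struct Peer {
      Clock::time_point last_rx;
      uint64_t last_seq_sent = 0;
    };
    std::map<int, Peer> peers;  // keyed by peer osd id

    // Called on each send-timer tick; lost datagrams are simply
    // superseded by the next sequence number.
    uint64_t next_seq(int osd) { return ++peers[osd].last_seq_sent; }

    void on_reply(int osd) { peers[osd].last_rx = Clock::now(); }

    bool suspect(int osd) const {
      auto it = peers.find(osd);
      return it != peers.end() &&
             Clock::now() - it->second.last_rx > interval * max_missed;
    }
  };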

sage


* Re: OOB message roll into Messenger interface
  2016-09-06 17:42         ` Gregory Farnum
  2016-09-06 18:06           ` Sage Weil
@ 2016-09-07  2:43           ` Haomai Wang
  1 sibling, 0 replies; 10+ messages in thread
From: Haomai Wang @ 2016-09-07  2:43 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel

On Wed, Sep 7, 2016 at 1:42 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> Let's keep in mind the challenges of out-of-band messaging over TCP/IP.
>
> Namely, when we discussed this we couldn't figure out any way
> (including the TCP priority stuff, which doesn't work with the
> required semantics — even when it does function) to get traffic to
> actually go out-of-band. IB messaging systems actually have a
> "channels" concept that lets you do genuine OOB transmission that
> skips over queues and other data; TCP doesn't. In fact the best we
> came up with for doing this with Simple/AsyncMessenger was giving the
> Messenger duplicate sockets/queues/etc, which is hardly ideal.

Hmm, I also worry that an OOB message may not be delivered when the socket
is busy. Should we consider adding an OOB UDP socket associated with each
Connection? (Heartbeat messages should be lossy, so we don't need to
handle errors.) Then the upper layer doesn't have to maintain explicit
heartbeat connections, and I can see we'd achieve the same goals mentioned
above.

Then, when we target msgr v2, we'd still want to move the heartbeat logic
into the messenger layer so we can aggregate heartbeat messages in the
multi-entity case.

>
> So, maybe we can remove the heartbeat messenger by giving each
> Connection two sockets and queues. That might even work better for the
> AsyncMessenger than it does for SimpleMessenger?

Actually, AsyncMessenger already keeps connections lightweight compared to
SimpleMessenger, so we may not need to do this.

> But any implementation that orders OSD heartbeat messages behind
> ordinary data traffic in kernel or router buffers is probably going to
> fail us. :(
> -Greg


* Re: OOB message roll into Messenger interface
  2016-09-06 18:06           ` Sage Weil
@ 2016-09-07  2:46             ` Haomai Wang
  2016-09-07  2:58               ` Matt Benjamin
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2016-09-07  2:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On Wed, Sep 7, 2016 at 2:06 AM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 6 Sep 2016, Gregory Farnum wrote:
>> But any implementation that orders OSD heartbeat messages behind
>> ordinary data traffic in kernel or router buffers is probably going to
>> fail us. :(
>
> Oh, good point.  I didn't read that paragraph carefully.  I think we
> should use a second socket connected to the same address for OOB messages.
> Or possibly push them over UDP... but we'd need to define retry semantics
> in that case.

If UDP, I think the UDP heartbeat interval should be shorter, and the
caller should delegate the send logic to the connection...

>
> sage


* Re: OOB message roll into Messenger interface
  2016-09-07  2:46             ` Haomai Wang
@ 2016-09-07  2:58               ` Matt Benjamin
  0 siblings, 0 replies; 10+ messages in thread
From: Matt Benjamin @ 2016-09-07  2:58 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, Gregory Farnum, ceph-devel

AFAICT, sending over UDP means lossy delivery.

Will that be acceptable in general?

Matt

----- Original Message -----
> From: "Haomai Wang" <haomai@xsky.com>
> To: "Sage Weil" <sage@newdream.net>
> Cc: "Gregory Farnum" <gfarnum@redhat.com>, ceph-devel@vger.kernel.org
> Sent: Tuesday, September 6, 2016 10:46:11 PM
> Subject: Re: OOB message roll into Messenger interface
>
> > Oh, good point.  I didn't read that paragraph carefully.  I think we
> > should use a second socket connected to the same address for OOB messages.
> > Or possibly push them over UDP... but we'd need to define retry semantics
> > in that case.
>
> If UDP, I think the UDP heartbeat interval should be shorter, and the
> caller should delegate the send logic to the connection...

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

