* Re: seastar and 'tame reactor'
       [not found] <d0f50268-72bb-1196-7ce9-0b9e21808ffb@redhat.com>
@ 2018-01-30 22:32 ` Josh Durgin
  2018-02-07 16:01   ` kefu chai
  0 siblings, 1 reply; 15+ messages in thread
From: Josh Durgin @ 2018-01-30 22:32 UTC (permalink / raw)
  To: Casey Bodley; +Cc: Adam Emerson, Gregory Farnum, kefu chai, ceph-devel

[adding ceph-devel]

On 01/30/2018 01:56 PM, Casey Bodley wrote:
> Hey Josh,
> 
> I heard you mention in the call yesterday that you're looking into this 
> part of seastar integration. I was just reading through the relevant 
> code over the weekend, and wanted to compare notes:
> 
> 
> in seastar, all cross-core communication goes through lockfree spsc 
> queues, which are encapsulated by 'class smp_message_queue' in 
> core/reactor.hh. all of these queues (smp::_qs) are allocated on startup 
> in smp::configure(). early in reactor::run() (which is effectively each 
> seastar thread's entrypoint), it registers a smp_poller to poll all of 
> the queues directed at that cpu
> 
> what we need is a way to inject messages into each seastar reactor from 
> arbitrary/external threads. our requirements are very similar to 
> smp_message_queue's, with a few exceptions:
> 
> -each seastar reactor should be the single consumer of a multi-producer 
> queue, and poll on that as well

Yes, this is what I was thinking too, maybe a boost::lockfree::queue
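
(For illustration only -- boost::lockfree::queue requires an element type
with a trivial assignment operator and a trivial destructor, so it can't
hold a std::function directly; the usual shape is a queue of pointers to
heap-allocated work items, roughly:)

#include <boost/lockfree/queue.hpp>
#include <functional>

struct work_item {
  std::function<void()> fn;   // the closure to run on the target reactor
};

// multi-producer from external threads, single consumer (the reactor's
// poller); 128 is just an initial node count, the queue can grow on demand
boost::lockfree::queue<work_item*> incoming{128};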

> -the submit() function would return void instead of a future (which 
> removes the need for a lot of other stuff, like the _completions queue, 
> async_work_item::_promise, etc)
> 
> figuring out how to factor this stuff out of smp_message_queue cleanly 
> is the hard part i guess

I was thinking it could start off as a separate implementation, but
hadn't looked too closely at sharing pieces of it.

> in terms of startup, it could be allocated as a static array similar to 
> smp::_qs (except it would be dimensioned by [smp::count] instead of 
> [smp::count][smp::count]). then a new function could be added alongside 
> smp::submit_to() that submits to the given cpu's external queue (and 
> also returns void). this stuff should probably be disabled by default, 
> and only turned on if enabled in configuration

++
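
(To make the shape of that concrete -- this is only a sketch of the proposed
extension; external_submit_to, the per-cpu external queues, and the poller
hook are names we'd be adding, not existing seastar symbols:)

#include <boost/lockfree/queue.hpp>
#include <functional>
#include <memory>
#include <vector>

struct external_work_item {
  std::function<void()> fn;
};

// one multi-producer queue per reactor, dimensioned by [smp::count]
static std::vector<std::unique_ptr<boost::lockfree::queue<external_work_item*>>>
    external_qs;

// called once at startup (e.g. from smp::configure()) before any submission
void init_external_queues(unsigned ncpus) {
  external_qs.reserve(ncpus);
  for (unsigned i = 0; i < ncpus; ++i) {
    external_qs.push_back(
        std::make_unique<boost::lockfree::queue<external_work_item*>>(128));
  }
}

// callable from any non-seastar thread; returns void, so no _completions
// queue or async_work_item::_promise machinery is needed
void external_submit_to(unsigned cpu, std::function<void()> fn) {
  auto* item = new external_work_item{std::move(fn)};
  while (!external_qs[cpu]->push(item)) {
    // push can fail if node allocation fails; a real version would back off
  }
}

// run from the owning reactor's smp_poller, alongside the existing spsc queues
size_t poll_external_queue(unsigned cpu) {
  size_t processed = 0;
  external_work_item* item = nullptr;
  while (external_qs[cpu]->pop(item)) {
    item->fn();
    delete item;
    ++processed;
  }
  return processed;
}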

> for a super simple unit test, you could spawn an external thread that 
> does something like this:
> 
> std::mutex mutex;
> std::condition_variable cond;
> std::atomic<int> completions = 0;
> // submit a message to each reactor
> for (int i = 0; i < smp::count; i++) {
>    smp::external_submit_to(i, [&] { ++completions; cond.notify_one(); });
> }
> // wait for all completions
> std::unique_lock lock(mutex);
> cond.wait(lock, [&] { return completions == smp::count; });

Yeah, this looks like a nice example.

> Sorry that I've been slow to help with this - keep me posted?

No worries, I've been slow about this too - I've asked Kefu to look at
it this morning, so I'm sure he'll have some more thoughts soon.

Josh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-01-30 22:32 ` seastar and 'tame reactor' Josh Durgin
@ 2018-02-07 16:01   ` kefu chai
  2018-02-07 17:11     ` Casey Bodley
  0 siblings, 1 reply; 15+ messages in thread
From: kefu chai @ 2018-02-07 16:01 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Casey Bodley, Adam Emerson, Gregory Farnum, ceph-devel

On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com> wrote:
> [adding ceph-devel]
>
> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>
>> Hey Josh,
>>
>> I heard you mention in the call yesterday that you're looking into this
>> part of seastar integration. I was just reading through the relevant code
>> over the weekend, and wanted to compare notes:
>>
>>
>> in seastar, all cross-core communication goes through lockfree spsc
>> queues, which are encapsulated by 'class smp_message_queue' in
>> core/reactor.hh. all of these queues (smp::_qs) are allocated on startup in
>> smp::configure(). early in reactor::run() (which is effectively each seastar
>> thread's entrypoint), it registers a smp_poller to poll all of the queues
>> directed at that cpu
>>
>> what we need is a way to inject messages into each seastar reactor from
>> arbitrary/external threads. our requirements are very similar to

i think we will have a sharded<osd::PublicService> on each core. in
each instance of PublicService, we will be listening and serving
requests from external clients of the cluster. the same applies to
sharded<osd::ClusterService>, which will be responsible for serving
the requests from its peers in the cluster. the control flow of a
typical OSD read request from a public RADOS client will look like:

1. the TCP connection is accepted by one of the listening
sharded<osd::PublicService>.
2. decode the message
3. osd encapsulates the request in the message as a future, and submits
it to another core after hashing the involved pg # to the core #.
something like (in pseudo code):
  engine().submit_to(osdmap_shard, [] {
    return get_newer_osdmap(m->epoch);
    // need to figure out how to reference an "osdmap service" in seastar.
  }).then([] (auto osdmap) {
    return submit_to(pg_to_shard(m->ops.op.pg), [] {
      return pg.do_ops(m->ops);
    });
  });
4. the core serving the involved pg (i.e. the pg service) will dequeue
this request, and use a read_dma() call to delegate the aio request to
the core maintaining the io queue.
5. once the aio completes, the PublicService will continue on with the
then() block and send the response back to the client.

so the question is: why do we need an mpsc queue? the nr_core*nr_core
spsc queues are good enough for us, i think.
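
(to spell step 3 out a bit, here is a self-contained approximation using
seastar::smp::submit_to; Request, OsdMap, lookup_osdmap, do_pg_ops and
pg_to_shard are stand-ins for the real osd pieces, and the header/namespace
details depend on the seastar version in use:)

#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>   // smp::submit_to; older trees keep this in core/reactor.hh

struct OsdMap { unsigned epoch; };
struct Request { unsigned epoch; unsigned pg; };

static constexpr unsigned osdmap_shard = 0;   // wherever the osdmap service lives

OsdMap lookup_osdmap(unsigned epoch) { return OsdMap{epoch}; }   // placeholder
int do_pg_ops(const Request& req) { return 0; }                  // placeholder

unsigned pg_to_shard(unsigned pg) {
  return pg % seastar::smp::count;   // simplest placement: hash the pg onto a core
}

seastar::future<int> handle_read(Request req) {
  // hop to the osdmap shard first, then to the shard owning the pg
  return seastar::smp::submit_to(osdmap_shard, [epoch = req.epoch] {
    return lookup_osdmap(epoch);
  }).then([req] (OsdMap) {
    return seastar::smp::submit_to(pg_to_shard(req.pg), [req] {
      return do_pg_ops(req);
    });
  });
}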

>> smp_message_queue's, with a few exceptions:
>>
>> -each seastar reactor should be the single consumer of a multi-producer
>> queue, and poll on that as well
>
>
> Yes, this is what I was thinking too, maybe a boost::lockfree::queue
>
>> -the submit() function would return void instead of a future (which
>> removes the need for a lot of other stuff, like the _completions queue,
>> async_work_item::_promise, etc)
>>
>> figuring out how to factor this stuff out of smp_message_queue cleanly is
>> the hard part i guess
>
>
> I was thinking it could start off as a separate implementation, but
> hadn't looked too closely at sharing pieces of it.
>
>> in terms of startup, it could be allocated as a static array similar to
>> smp::_qs (except it would be dimensioned by [smp::count] instead of
>> [smp::count][smp::count]). then a new function could be added alongside
>> smp::submit_to() that submits to the given cpu's external queue (and also
>> returns void). this stuff should probably be disabled by default, and only
>> turned on if enabled in configuration
>
>
> ++
>
>> for a super simple unit test, you could spawn an external thread that does
>> something like this:
>>
>> std::mutex mutex;
>> std::condition_variable cond;
>> std::atomic<int> completions = 0;
>> // submit a message to each reactor
>> for (int i = 0; i < smp::count; i++) {
>>    smp::external_submit_to(i, [&] { ++completions; cond.notify_one(); });
>> }
>> // wait for all completions
>> std::unique_lock lock(mutex);
>> cond.wait(lock, [&] { return completions == smp::count; });
>
>
> Yeah, this looks like a nice example.
>
>> Sorry that I've been slow to help with this - keep me posted?
>
>
> No worries, I've been slow about this too - I've asked Kefu to look at
> it this morning, so I'm sure he'll have some more thoughts soon.
>
> Josh



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-07 16:01   ` kefu chai
@ 2018-02-07 17:11     ` Casey Bodley
  2018-02-07 19:22       ` Gregory Farnum
  2018-02-12 19:40       ` Allen Samuels
  0 siblings, 2 replies; 15+ messages in thread
From: Casey Bodley @ 2018-02-07 17:11 UTC (permalink / raw)
  To: kefu chai, Josh Durgin; +Cc: Adam Emerson, Gregory Farnum, ceph-devel


On 02/07/2018 11:01 AM, kefu chai wrote:
> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>> [adding ceph-devel]
>>
>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>> Hey Josh,
>>>
>>> I heard you mention in the call yesterday that you're looking into this
>>> part of seastar integration. I was just reading through the relevant code
>>> over the weekend, and wanted to compare notes:
>>>
>>>
>>> in seastar, all cross-core communication goes through lockfree spsc
>>> queues, which are encapsulated by 'class smp_message_queue' in
>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on startup in
>>> smp::configure(). early in reactor::run() (which is effectively each seastar
>>> thread's entrypoint), it registers a smp_poller to poll all of the queues
>>> directed at that cpu
>>>
>>> what we need is a way to inject messages into each seastar reactor from
>>> arbitrary/external threads. our requirements are very similar to
> i think we will have a sharded<osd::PublicService> on each core. in
> each instance of PublicService, we will be listening and serving
> requests from external clients of cluster. the same applies to
> sharded<osd::ClusterService>, which will be responsible for serving
> the requests from its peers in the cluster. the control flow of a
> typical OSD read request from a public RADOS client will look like:
>
> 1. the TCP connection is accepted by one of the listening
> sharded<osd::PublicService>.
> 2. decode the message
> 3. osd encapsulates the request in the message as a future, and submit
> it to another core after hashing the involved pg # to the core #.
> something like (in pseudo code):
>    engine().submit_to(osdmap_shard, [] {
>      return get_newer_osdmap(m->epoch);
>      // need to figure out how to reference a "osdmap service" in seastar.
>    }).then([] (auto osdmap) {
>      submit_to(pg_to_shard(m->ops.op.pg, [] {
>        return pg.do_ops(m->ops);
>      });
>    });
> 4. the core serving the involved pg (i.e. pg service) will dequeue
> this request, and use read_dma() call to delegate the aio request to
> the core maintaining the io queue.
> 5. once the aio completes, the PublicService will continue on, with
> the then() block. it will send the response back to client.
>
> so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
> is good enough for us, i think.
>

Hey Kefu,

That sounds entirely reasonable, but assumes that everything will be 
running inside of seastar from the start. We've been looking for an 
incremental approach that would allow us to start with some subset 
running inside of seastar, with a mechanism for communication between 
that and the osd's existing threads. One suggestion was to start with 
just the messenger inside of seastar, and gradually move that 
seastar-to-external-thread boundary further down the io path as code is 
refactored to support it. It sounds unlikely that we'll ever get rocksdb 
running inside of seastar, so the objectstore will need its own threads 
until there's a viable alternative.

So the mpsc queue and smp::external_submit_to() interface was a strategy 
for passing messages into seastar from arbitrary non-seastar threads. 
Communication in the other direction just needs to be non-blocking (my 
example just signaled a condition variable without holding its mutex).

What are your thoughts on the incremental approach?

Casey

ps. I'd love to see more thought put into the design of the finished 
product, and your outline is a good start! Avi Kivity @scylladb shared 
one suggestion that I really liked, which was to give each shard of the 
osd a separate network endpoint, and add enough information to the 
osdmap so that clients could send their messages directly to the shard 
that would process them. That piece can come in later, but could 
eliminate some of the extra latency from your step 3.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-07 17:11     ` Casey Bodley
@ 2018-02-07 19:22       ` Gregory Farnum
  2018-02-12 15:45         ` kefu chai
  2018-02-12 19:40       ` Allen Samuels
  1 sibling, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2018-02-07 19:22 UTC (permalink / raw)
  To: Casey Bodley; +Cc: kefu chai, Josh Durgin, Adam Emerson, ceph-devel

On Wed, Feb 7, 2018 at 9:11 AM, Casey Bodley <cbodley@redhat.com> wrote:
>
> On 02/07/2018 11:01 AM, kefu chai wrote:
>>
>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>>>
>>> [adding ceph-devel]
>>>
>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>
>>>> Hey Josh,
>>>>
>>>> I heard you mention in the call yesterday that you're looking into this
>>>> part of seastar integration. I was just reading through the relevant
>>>> code
>>>> over the weekend, and wanted to compare notes:
>>>>
>>>>
>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on startup
>>>> in
>>>> smp::configure(). early in reactor::run() (which is effectively each
>>>> seastar
>>>> thread's entrypoint), it registers a smp_poller to poll all of the
>>>> queues
>>>> directed at that cpu
>>>>
>>>> what we need is a way to inject messages into each seastar reactor from
>>>> arbitrary/external threads. our requirements are very similar to
>>
>> i think we will have a sharded<osd::PublicService> on each core. in
>> each instance of PublicService, we will be listening and serving
>> requests from external clients of cluster. the same applies to
>> sharded<osd::ClusterService>, which will be responsible for serving
>> the requests from its peers in the cluster. the control flow of a
>> typical OSD read request from a public RADOS client will look like:
>>
>> 1. the TCP connection is accepted by one of the listening
>> sharded<osd::PublicService>.
>> 2. decode the message
>> 3. osd encapsulates the request in the message as a future, and submit
>> it to another core after hashing the involved pg # to the core #.
>> something like (in pseudo code):
>>    engine().submit_to(osdmap_shard, [] {
>>      return get_newer_osdmap(m->epoch);
>>      // need to figure out how to reference a "osdmap service" in seastar.
>>    }).then([] (auto osdmap) {
>>      submit_to(pg_to_shard(m->ops.op.pg, [] {
>>        return pg.do_ops(m->ops);
>>      });
>>    });
>> 4. the core serving the involved pg (i.e. pg service) will dequeue
>> this request, and use read_dma() call to delegate the aio request to
>> the core maintaining the io queue.
>> 5. once the aio completes, the PublicService will continue on, with
>> the then() block. it will send the response back to client.
>>
>> so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
>> is good enough for us, i think.
>>
>
> Hey Kefu,
>
> That sounds entirely reasonable, but assumes that everything will be running
> inside of seastar from the start. We've been looking for an incremental
> approach that would allow us to start with some subset running inside of
> seastar, with a mechanism for communication between that and the osd's
> existing threads. One suggestion was to start with just the messenger inside
> of seastar, and gradually move that seastar-to-external-thread boundary
> further down the io path as code is refactored to support it. It sounds
> unlikely that we'll ever get rocksdb running inside of seastar, so the
> objectstore will need its own threads until there's a viable alternative.
>
> So the mpsc queue and smp::external_submit_to() interface was a strategy for
> passing messages into seastar from arbitrary non-seastar threads.
> Communication in the other direction just needs to be non-blocking (my
> example just signaled a condition variable without holding its mutex).
>
> What are your thoughts on the incremental approach?
>
> Casey
>
> ps. I'd love to see more thought put into the design of the finished
> product, and your outline is a good start! Avi Kivity @scylladb shared one
> suggestion that I really liked, which was to give each shard of the osd a
> separate network endpoint, and add enough information to the osdmap so that
> clients could send their messages directly to the shard that would process
> them. That piece can come in later, but could eliminate some of the extra
> latency from your step 3.

This is something we've discussed but will want to think about very
carefully once we have more performance available. Increasing the
number of (very stateful) connections the OSDs and clients need to
maintain like that is not something to undertake lightly right now,
and in fact is the opposite of the multiplexing connections work going
on for msgr v2. ;)
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-07 19:22       ` Gregory Farnum
@ 2018-02-12 15:45         ` kefu chai
  2018-02-12 15:55           ` Matt Benjamin
  0 siblings, 1 reply; 15+ messages in thread
From: kefu chai @ 2018-02-12 15:45 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Casey Bodley, Josh Durgin, Adam Emerson, ceph-devel

On Thu, Feb 8, 2018 at 3:22 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Wed, Feb 7, 2018 at 9:11 AM, Casey Bodley <cbodley@redhat.com> wrote:
>>
>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>
>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>>>>
>>>> [adding ceph-devel]
>>>>
>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>
>>>>> Hey Josh,
>>>>>
>>>>> I heard you mention in the call yesterday that you're looking into this
>>>>> part of seastar integration. I was just reading through the relevant
>>>>> code
>>>>> over the weekend, and wanted to compare notes:
>>>>>
>>>>>
>>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on startup
>>>>> in
>>>>> smp::configure(). early in reactor::run() (which is effectively each
>>>>> seastar
>>>>> thread's entrypoint), it registers a smp_poller to poll all of the
>>>>> queues
>>>>> directed at that cpu
>>>>>
>>>>> what we need is a way to inject messages into each seastar reactor from
>>>>> arbitrary/external threads. our requirements are very similar to
>>>
>>> i think we will have a sharded<osd::PublicService> on each core. in
>>> each instance of PublicService, we will be listening and serving
>>> requests from external clients of cluster. the same applies to
>>> sharded<osd::ClusterService>, which will be responsible for serving
>>> the requests from its peers in the cluster. the control flow of a
>>> typical OSD read request from a public RADOS client will look like:
>>>
>>> 1. the TCP connection is accepted by one of the listening
>>> sharded<osd::PublicService>.
>>> 2. decode the message
>>> 3. osd encapsulates the request in the message as a future, and submit
>>> it to another core after hashing the involved pg # to the core #.
>>> something like (in pseudo code):
>>>    engine().submit_to(osdmap_shard, [] {
>>>      return get_newer_osdmap(m->epoch);
>>>      // need to figure out how to reference a "osdmap service" in seastar.
>>>    }).then([] (auto osdmap) {
>>>      submit_to(pg_to_shard(m->ops.op.pg, [] {
>>>        return pg.do_ops(m->ops);
>>>      });
>>>    });
>>> 4. the core serving the involved pg (i.e. pg service) will dequeue
>>> this request, and use read_dma() call to delegate the aio request to
>>> the core maintaining the io queue.
>>> 5. once the aio completes, the PublicService will continue on, with
>>> the then() block. it will send the response back to client.
>>>
>>> so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
>>> is good enough for us, i think.
>>>
>>
>> Hey Kefu,
>>
>> That sounds entirely reasonable, but assumes that everything will be running
>> inside of seastar from the start. We've been looking for an incremental
>> approach that would allow us to start with some subset running inside of
>> seastar, with a mechanism for communication between that and the osd's
>> existing threads. One suggestion was to start with just the messenger inside
>> of seastar, and gradually move that seastar-to-external-thread boundary
>> further down the io path as code is refactored to support it. It sounds
>> unlikely that we'll ever get rocksdb running inside of seastar, so the
>> objectstore will need its own threads until there's a viable alternative.
>>
>> So the mpsc queue and smp::external_submit_to() interface was a strategy for
>> passing messages into seastar from arbitrary non-seastar threads.
>> Communication in the other direction just needs to be non-blocking (my
>> example just signaled a condition variable without holding its mutex).
>>
>> What are your thoughts on the incremental approach?

yes. if we need to send from a thread running on a random core, we do
need the mpsc queue and an smp::external_submit_to() interface, as we
don't have access to the TLS "local_engine". but this hybrid approach
makes me nervous, as i think seastar is an intrusive framework: we
either embrace it or go with our own work queue model. let me give it
a try and see if we can have a firewall between the seastar world and
the non-seastar world.

>>
>> Casey
>>
>> ps. I'd love to see more thought put into the design of the finished
>> product, and your outline is a good start! Avi Kivity @scylladb shared one
>> suggestion that I really liked, which was to give each shard of the osd a
>> separate network endpoint, and add enough information to the osdmap so that
>> clients could send their messages directly to the shard that would process
>> them. That piece can come in later, but could eliminate some of the extra
>> latency from your step 3.
>
> This is something we've discussed but will want to think about very
> carefully once we have more performance available. Increasing the
> number of (very stateful) connections the OSDs and clients need to
> maintain like that is not something to undertake lightly right now,
> and in fact is the opposite of the multiplexing connections work going
> on for msgr v2. ;)
> -Greg



-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-12 15:45         ` kefu chai
@ 2018-02-12 15:55           ` Matt Benjamin
  2018-02-12 15:57             ` Gregory Farnum
  2018-02-13 13:35             ` kefu chai
  0 siblings, 2 replies; 15+ messages in thread
From: Matt Benjamin @ 2018-02-12 15:55 UTC (permalink / raw)
  To: kefu chai
  Cc: Gregory Farnum, Casey Bodley, Josh Durgin, Adam Emerson, ceph-devel

How does tame reactor induce more OSD sessions (@greg)?  @kefu, isn't
the hybrid model another way of saying, tame reactor?  The intuition
I've had to this point is that the interfacing here is essentially
similar to making seastar interact with anything else, including
frameworks (disks, memory devices) that it absolutely wants to and
must.

Matt

On Mon, Feb 12, 2018 at 10:45 AM, kefu chai <tchaikov@gmail.com> wrote:
> On Thu, Feb 8, 2018 at 3:22 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
>> On Wed, Feb 7, 2018 at 9:11 AM, Casey Bodley <cbodley@redhat.com> wrote:
>>>
>>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>>
>>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>>>>>
>>>>> [adding ceph-devel]
>>>>>
>>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>>
>>>>>> Hey Josh,
>>>>>>
>>>>>> I heard you mention in the call yesterday that you're looking into this
>>>>>> part of seastar integration. I was just reading through the relevant
>>>>>> code
>>>>>> over the weekend, and wanted to compare notes:
>>>>>>
>>>>>>
>>>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on startup
>>>>>> in
>>>>>> smp::configure(). early in reactor::run() (which is effectively each
>>>>>> seastar
>>>>>> thread's entrypoint), it registers a smp_poller to poll all of the
>>>>>> queues
>>>>>> directed at that cpu
>>>>>>
>>>>>> what we need is a way to inject messages into each seastar reactor from
>>>>>> arbitrary/external threads. our requirements are very similar to
>>>>
>>>> i think we will have a sharded<osd::PublicService> on each core. in
>>>> each instance of PublicService, we will be listening and serving
>>>> requests from external clients of cluster. the same applies to
>>>> sharded<osd::ClusterService>, which will be responsible for serving
>>>> the requests from its peers in the cluster. the control flow of a
>>>> typical OSD read request from a public RADOS client will look like:
>>>>
>>>> 1. the TCP connection is accepted by one of the listening
>>>> sharded<osd::PublicService>.
>>>> 2. decode the message
>>>> 3. osd encapsulates the request in the message as a future, and submit
>>>> it to another core after hashing the involved pg # to the core #.
>>>> something like (in pseudo code):
>>>>    engine().submit_to(osdmap_shard, [] {
>>>>      return get_newer_osdmap(m->epoch);
>>>>      // need to figure out how to reference a "osdmap service" in seastar.
>>>>    }).then([] (auto osdmap) {
>>>>      submit_to(pg_to_shard(m->ops.op.pg, [] {
>>>>        return pg.do_ops(m->ops);
>>>>      });
>>>>    });
>>>> 4. the core serving the involved pg (i.e. pg service) will dequeue
>>>> this request, and use read_dma() call to delegate the aio request to
>>>> the core maintaining the io queue.
>>>> 5. once the aio completes, the PublicService will continue on, with
>>>> the then() block. it will send the response back to client.
>>>>
>>>> so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
>>>> is good enough for us, i think.
>>>>
>>>
>>> Hey Kefu,
>>>
>>> That sounds entirely reasonable, but assumes that everything will be running
>>> inside of seastar from the start. We've been looking for an incremental
>>> approach that would allow us to start with some subset running inside of
>>> seastar, with a mechanism for communication between that and the osd's
>>> existing threads. One suggestion was to start with just the messenger inside
>>> of seastar, and gradually move that seastar-to-external-thread boundary
>>> further down the io path as code is refactored to support it. It sounds
>>> unlikely that we'll ever get rocksdb running inside of seastar, so the
>>> objectstore will need its own threads until there's a viable alternative.
>>>
>>> So the mpsc queue and smp::external_submit_to() interface was a strategy for
>>> passing messages into seastar from arbitrary non-seastar threads.
>>> Communication in the other direction just needs to be non-blocking (my
>>> example just signaled a condition variable without holding its mutex).
>>>
>>> What are your thoughts on the incremental approach?
>
> yes. if we need send from a thread running a random core, we do need
> the mpsc queue, and an smp::external_submit_to() interface, as we
> don't have the access to the TLS "local_engine". but this hybrid
> approach makes me nervous. as i think seastar is an intrusive
> framework. we either embrace it or go with our own work queue model.
> let me give it a try to see if we can have a firewall between the
> seastar world and the non-seastar world.
>
>>>
>>> Casey
>>>
>>> ps. I'd love to see more thought put into the design of the finished
>>> product, and your outline is a good start! Avi Kivity @scylladb shared one
>>> suggestion that I really liked, which was to give each shard of the osd a
>>> separate network endpoint, and add enough information to the osdmap so that
>>> clients could send their messages directly to the shard that would process
>>> them. That piece can come in later, but could eliminate some of the extra
>>> latency from your step 3.
>>
>> This is something we've discussed but will want to think about very
>> carefully once we have more performance available. Increasing the
>> number of (very stateful) connections the OSDs and clients need to
>> maintain like that is not something to undertake lightly right now,
>> and in fact is the opposite of the multiplexing connections work going
>> on for msgr v2. ;)
>> -Greg
>
>
>
> --
> Regards
> Kefu Chai
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-12 15:55           ` Matt Benjamin
@ 2018-02-12 15:57             ` Gregory Farnum
  2018-02-13 13:35             ` kefu chai
  1 sibling, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2018-02-12 15:57 UTC (permalink / raw)
  To: Matt Benjamin
  Cc: kefu chai, Casey Bodley, Josh Durgin, Adam Emerson, ceph-devel

> On Mon, Feb 12, 2018 at 10:45 AM, kefu chai <tchaikov@gmail.com> wrote:
>> On Thu, Feb 8, 2018 at 3:22 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
>>> On Wed, Feb 7, 2018 at 9:11 AM, Casey Bodley <cbodley@redhat.com> wrote:
>>>>
>>>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>>>
>>>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>>>>>>
>>>>>> [adding ceph-devel]
>>>>>>
>>>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>>>
>>>>>>> Hey Josh,
>>>>>>>
>>>>>>> I heard you mention in the call yesterday that you're looking into this
>>>>>>> part of seastar integration. I was just reading through the relevant
>>>>>>> code
>>>>>>> over the weekend, and wanted to compare notes:
>>>>>>>
>>>>>>>
>>>>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on startup
>>>>>>> in
>>>>>>> smp::configure(). early in reactor::run() (which is effectively each
>>>>>>> seastar
>>>>>>> thread's entrypoint), it registers a smp_poller to poll all of the
>>>>>>> queues
>>>>>>> directed at that cpu
>>>>>>>
>>>>>>> what we need is a way to inject messages into each seastar reactor from
>>>>>>> arbitrary/external threads. our requirements are very similar to
>>>>>
>>>>> i think we will have a sharded<osd::PublicService> on each core. in
>>>>> each instance of PublicService, we will be listening and serving
>>>>> requests from external clients of cluster. the same applies to
>>>>> sharded<osd::ClusterService>, which will be responsible for serving
>>>>> the requests from its peers in the cluster. the control flow of a
>>>>> typical OSD read request from a public RADOS client will look like:
>>>>>
>>>>> 1. the TCP connection is accepted by one of the listening
>>>>> sharded<osd::PublicService>.
>>>>> 2. decode the message
>>>>> 3. osd encapsulates the request in the message as a future, and submit
>>>>> it to another core after hashing the involved pg # to the core #.
>>>>> something like (in pseudo code):
>>>>>    engine().submit_to(osdmap_shard, [] {
>>>>>      return get_newer_osdmap(m->epoch);
>>>>>      // need to figure out how to reference a "osdmap service" in seastar.
>>>>>    }).then([] (auto osdmap) {
>>>>>      submit_to(pg_to_shard(m->ops.op.pg, [] {
>>>>>        return pg.do_ops(m->ops);
>>>>>      });
>>>>>    });
>>>>> 4. the core serving the involved pg (i.e. pg service) will dequeue
>>>>> this request, and use read_dma() call to delegate the aio request to
>>>>> the core maintaining the io queue.
>>>>> 5. once the aio completes, the PublicService will continue on, with
>>>>> the then() block. it will send the response back to client.
>>>>>
>>>>> so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
>>>>> is good enough for us, i think.
>>>>>
>>>>
>>>> Hey Kefu,
>>>>
>>>> That sounds entirely reasonable, but assumes that everything will be running
>>>> inside of seastar from the start. We've been looking for an incremental
>>>> approach that would allow us to start with some subset running inside of
>>>> seastar, with a mechanism for communication between that and the osd's
>>>> existing threads. One suggestion was to start with just the messenger inside
>>>> of seastar, and gradually move that seastar-to-external-thread boundary
>>>> further down the io path as code is refactored to support it. It sounds
>>>> unlikely that we'll ever get rocksdb running inside of seastar, so the
>>>> objectstore will need its own threads until there's a viable alternative.
>>>>
>>>> So the mpsc queue and smp::external_submit_to() interface was a strategy for
>>>> passing messages into seastar from arbitrary non-seastar threads.
>>>> Communication in the other direction just needs to be non-blocking (my
>>>> example just signaled a condition variable without holding its mutex).
>>>>
>>>> What are your thoughts on the incremental approach?
>>
>> yes. if we need send from a thread running a random core, we do need
>> the mpsc queue, and an smp::external_submit_to() interface, as we
>> don't have the access to the TLS "local_engine". but this hybrid
>> approach makes me nervous. as i think seastar is an intrusive
>> framework. we either embrace it or go with our own work queue model.
>> let me give it a try to see if we can have a firewall between the
>> seastar world and the non-seastar world.

We've talked about this pretty extensively and a whole-code-base
transition is just not going to be feasible to do in one go, so we
need an interoperations layer. Hopefully we won't have to cross it
very often (although it will be at least once per op, given BlueStore,
as Casey mentioned).

We haven't thought through all the consequences of that, but it should
be doable since most of the data structures will not cross very often.
Those that might need to be operated on from both sides are probably
already covered by fine-grained locking, and I'm hopeful we can build
a pretty thin hybrid lock that consists of a mutex (used by
non-seastar, and for seastar to claim it from the old world) and a
seastar lock (used by seastar the rest of the time). Things like that
ought to go pretty far.
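
(A very rough sketch of what such a hybrid lock could look like -- the type
and its methods are hypothetical, nothing like this exists in Ceph or seastar
today, and a real version would need to claim the mutex from the old world at
a safe transition point rather than block the reactor:)

#include <mutex>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

class hybrid_lock {
  std::mutex _legacy;            // taken by non-seastar threads, and by the
                                 // owning shard when it claims the structure
  seastar::semaphore _fibers{1}; // serializes seastar fibers on the owning shard
public:
  // old-world threads: ordinary blocking lock/unlock
  void lock()   { _legacy.lock(); }
  void unlock() { _legacy.unlock(); }

  // the owning shard claims the lock away from the old world once
  void claim_for_seastar()    { _legacy.lock(); }
  void release_to_old_world() { _legacy.unlock(); }

  // while claimed, seastar code only touches the cheap per-shard semaphore
  seastar::future<> lock_shard() { return _fibers.wait(1); }
  void unlock_shard()            { _fibers.signal(1); }
};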



On Mon, Feb 12, 2018 at 7:55 AM, Matt Benjamin <mbenjami@redhat.com> wrote:
> How does tame reactor induce more OSD sessions (@greg);  @kefu, in't
> the hybrid model another way of saying, tame reactor?  The intuition
> I've had to this point is that the interfacing here is essentially
> similar to making seastar interact with anything else, including
> frameworks (disks, memory devices) that it absolutely wants to and
> must.

That was just if we try to make clients direct all IO to the correct
core immediately, instead of going through a crossbar.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: seastar and 'tame reactor'
  2018-02-07 17:11     ` Casey Bodley
  2018-02-07 19:22       ` Gregory Farnum
@ 2018-02-12 19:40       ` Allen Samuels
  2018-02-13 15:46         ` Casey Bodley
  1 sibling, 1 reply; 15+ messages in thread
From: Allen Samuels @ 2018-02-12 19:40 UTC (permalink / raw)
  To: Casey Bodley, kefu chai, Josh Durgin
  Cc: Adam Emerson, Gregory Farnum, ceph-devel

I would think that it ought to be reasonably straightforward to get RocksDB (or other thread-based foreign code) to run under the seastar framework, provided that you're able to locate all os-invoking primitives within the foreign code and convert those into calls into your compatibility layer. That layer would have to simulate context switching (relatively easy) as well as provide an implementation of those kernel calls. In the case of RocksDB, some of that work has already been done (generally, the file and I/O operations are done through a compatibility layer that's provided as a parameter; I'm not as sure about the synchronization primitives, but it ought to be relatively easy to extend to cover those).

Has this been discussed?
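
(For reference, a rough illustration of the env hook mentioned above --
rocksdb::EnvWrapper is the real extension point, but the seastar-backed part
here is purely hypothetical:)

#include <rocksdb/env.h>

#include <memory>
#include <string>

// forwards every operation to the wrapped Env except the ones we override;
// a real seastar port would reimplement the file and I/O entry points
class SeastarEnv : public rocksdb::EnvWrapper {
 public:
  explicit SeastarEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}

  rocksdb::Status NewWritableFile(const std::string& fname,
                                  std::unique_ptr<rocksdb::WritableFile>* result,
                                  const rocksdb::EnvOptions& options) override {
    // hypothetical: hand the open/write path off to seastar-owned I/O here;
    // for now this just falls through to the wrapped Env
    return target()->NewWritableFile(fname, result, options);
  }
};

The DB would then be opened with rocksdb::Options::env pointing at an
instance of this wrapper.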


Allen Samuels  
R&D Engineering Fellow 

Western Digital® 
Email:  allen.samuels@wdc.com 
Office:  +1-408-801-7030
Mobile: +1-408-780-6416 


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Casey Bodley
> Sent: Wednesday, February 07, 2018 9:11 AM
> To: kefu chai <tchaikov@gmail.com>; Josh Durgin <jdurgin@redhat.com>
> Cc: Adam Emerson <aemerson@redhat.com>; Gregory Farnum
> <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: seastar and 'tame reactor'
> 
> 
> On 02/07/2018 11:01 AM, kefu chai wrote:
> > On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com>
> wrote:
> >> [adding ceph-devel]
> >>
> >> On 01/30/2018 01:56 PM, Casey Bodley wrote:
> >>> Hey Josh,
> >>>
> >>> I heard you mention in the call yesterday that you're looking into
> >>> this part of seastar integration. I was just reading through the
> >>> relevant code over the weekend, and wanted to compare notes:
> >>>
> >>>
> >>> in seastar, all cross-core communication goes through lockfree spsc
> >>> queues, which are encapsulated by 'class smp_message_queue' in
> >>> core/reactor.hh. all of these queues (smp::_qs) are allocated on
> >>> startup in smp::configure(). early in reactor::run() (which is
> >>> effectively each seastar thread's entrypoint), it registers a
> >>> smp_poller to poll all of the queues directed at that cpu
> >>>
> >>> what we need is a way to inject messages into each seastar reactor
> >>> from arbitrary/external threads. our requirements are very similar
> >>> to
> > i think we will have a sharded<osd::PublicService> on each core. in
> > each instance of PublicService, we will be listening and serving
> > requests from external clients of cluster. the same applies to
> > sharded<osd::ClusterService>, which will be responsible for serving
> > the requests from its peers in the cluster. the control flow of a
> > typical OSD read request from a public RADOS client will look like:
> >
> > 1. the TCP connection is accepted by one of the listening
> > sharded<osd::PublicService>.
> > 2. decode the message
> > 3. osd encapsulates the request in the message as a future, and submit
> > it to another core after hashing the involved pg # to the core #.
> > something like (in pseudo code):
> >    engine().submit_to(osdmap_shard, [] {
> >      return get_newer_osdmap(m->epoch);
> >      // need to figure out how to reference a "osdmap service" in seastar.
> >    }).then([] (auto osdmap) {
> >      submit_to(pg_to_shard(m->ops.op.pg, [] {
> >        return pg.do_ops(m->ops);
> >      });
> >    });
> > 4. the core serving the involved pg (i.e. pg service) will dequeue
> > this request, and use read_dma() call to delegate the aio request to
> > the core maintaining the io queue.
> > 5. once the aio completes, the PublicService will continue on, with
> > the then() block. it will send the response back to client.
> >
> > so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
> > is good enough for us, i think.
> >
> 
> Hey Kefu,
> 
> That sounds entirely reasonable, but assumes that everything will be running
> inside of seastar from the start. We've been looking for an incremental
> approach that would allow us to start with some subset running inside of
> seastar, with a mechanism for communication between that and the osd's
> existing threads. One suggestion was to start with just the messenger inside
> of seastar, and gradually move that seastar-to-external-thread boundary
> further down the io path as code is refactored to support it. It sounds
> unlikely that we'll ever get rocksdb running inside of seastar, so the
> objectstore will need its own threads until there's a viable alternative.
> 
> So the mpsc queue and smp::external_submit_to() interface was a strategy
> for passing messages into seastar from arbitrary non-seastar threads.
> Communication in the other direction just needs to be non-blocking (my
> example just signaled a condition variable without holding its mutex).
> 
> What are your thoughts on the incremental approach?
> 
> Casey
> 
> ps. I'd love to see more thought put into the design of the finished product,
> and your outline is a good start! Avi Kivity @scylladb shared one suggestion
> that I really liked, which was to give each shard of the osd a separate network
> endpoint, and add enough information to the osdmap so that clients could
> send their messages directly to the shard that would process them. That
> piece can come in later, but could eliminate some of the extra latency from
> your step 3.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-12 15:55           ` Matt Benjamin
  2018-02-12 15:57             ` Gregory Farnum
@ 2018-02-13 13:35             ` kefu chai
  2018-02-13 15:58               ` Casey Bodley
  1 sibling, 1 reply; 15+ messages in thread
From: kefu chai @ 2018-02-13 13:35 UTC (permalink / raw)
  To: Matt Benjamin
  Cc: Gregory Farnum, Casey Bodley, Josh Durgin, Adam Emerson, ceph-devel

On Mon, Feb 12, 2018 at 11:55 PM, Matt Benjamin <mbenjami@redhat.com> wrote:
> How does tame reactor induce more OSD sessions (@greg);  @kefu, in't
> the hybrid model another way of saying, tame reactor?  The intuition

i just realized that it is =)

> I've had to this point is that the interfacing here is essentially
> similar to making seastar interact with anything else, including
> frameworks (disks, memory devices) that it absolutely wants to and
> must.
>


-- 
Regards
Kefu Chai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-12 19:40       ` Allen Samuels
@ 2018-02-13 15:46         ` Casey Bodley
  2018-02-13 16:17           ` liuchang0812
  0 siblings, 1 reply; 15+ messages in thread
From: Casey Bodley @ 2018-02-13 15:46 UTC (permalink / raw)
  To: Allen Samuels, kefu chai, Josh Durgin
  Cc: Adam Emerson, Gregory Farnum, ceph-devel



On 02/12/2018 02:40 PM, Allen Samuels wrote:
> I would think that it ought to be reasonably straightforward to get RocksDB (or other thread-based foreign code) to run under the seastar framework provided that you're able to locate all os-invoking primitives within the foreign code and convert those into calls into your compatibility layer. That layer would have to simulate context switching (relatively easy) as well as provide an implementation of that kernel call. In the case of RocksDB, some of that work has already been done (generally, the file and I/O operations are done through a compatibility layer that's provided as a parameter. I'm not as sure about the synchronization primitives, but it ought to be relatively easy to extend to cover those).
>
> Has this been discussed?

I don't think it has, no. I'm not familiar with these rocksdb env 
interfaces, but this sounds promising.

>
> Allen Samuels
> R&D Engineering Fellow
>
> Western Digital®
> Email:  allen.samuels@wdc.com
> Office:  +1-408-801-7030
> Mobile: +1-408-780-6416
>
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Casey Bodley
>> Sent: Wednesday, February 07, 2018 9:11 AM
>> To: kefu chai <tchaikov@gmail.com>; Josh Durgin <jdurgin@redhat.com>
>> Cc: Adam Emerson <aemerson@redhat.com>; Gregory Farnum
>> <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: seastar and 'tame reactor'
>>
>>
>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com>
>> wrote:
>>>> [adding ceph-devel]
>>>>
>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>> Hey Josh,
>>>>>
>>>>> I heard you mention in the call yesterday that you're looking into
>>>>> this part of seastar integration. I was just reading through the
>>>>> relevant code over the weekend, and wanted to compare notes:
>>>>>
>>>>>
>>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on
>>>>> startup in smp::configure(). early in reactor::run() (which is
>>>>> effectively each seastar thread's entrypoint), it registers a
>>>>> smp_poller to poll all of the queues directed at that cpu
>>>>>
>>>>> what we need is a way to inject messages into each seastar reactor
>>>>> from arbitrary/external threads. our requirements are very similar
>>>>> to
>>> i think we will have a sharded<osd::PublicService> on each core. in
>>> each instance of PublicService, we will be listening and serving
>>> requests from external clients of cluster. the same applies to
>>> sharded<osd::ClusterService>, which will be responsible for serving
>>> the requests from its peers in the cluster. the control flow of a
>>> typical OSD read request from a public RADOS client will look like:
>>>
>>> 1. the TCP connection is accepted by one of the listening
>>> sharded<osd::PublicService>.
>>> 2. decode the message
>>> 3. osd encapsulates the request in the message as a future, and submit
>>> it to another core after hashing the involved pg # to the core #.
>>> something like (in pseudo code):
>>>     engine().submit_to(osdmap_shard, [] {
>>>       return get_newer_osdmap(m->epoch);
>>>       // need to figure out how to reference a "osdmap service" in seastar.
>>>     }).then([] (auto osdmap) {
>>>       submit_to(pg_to_shard(m->ops.op.pg, [] {
>>>         return pg.do_ops(m->ops);
>>>       });
>>>     });
>>> 4. the core serving the involved pg (i.e. pg service) will dequeue
>>> this request, and use read_dma() call to delegate the aio request to
>>> the core maintaining the io queue.
>>> 5. once the aio completes, the PublicService will continue on, with
>>> the then() block. it will send the response back to client.
>>>
>>> so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
>>> is good enough for us, i think.
>>>
>> Hey Kefu,
>>
>> That sounds entirely reasonable, but assumes that everything will be running
>> inside of seastar from the start. We've been looking for an incremental
>> approach that would allow us to start with some subset running inside of
>> seastar, with a mechanism for communication between that and the osd's
>> existing threads. One suggestion was to start with just the messenger inside
>> of seastar, and gradually move that seastar-to-external-thread boundary
>> further down the io path as code is refactored to support it. It sounds
>> unlikely that we'll ever get rocksdb running inside of seastar, so the
>> objectstore will need its own threads until there's a viable alternative.
>>
>> So the mpsc queue and smp::external_submit_to() interface was a strategy
>> for passing messages into seastar from arbitrary non-seastar threads.
>> Communication in the other direction just needs to be non-blocking (my
>> example just signaled a condition variable without holding its mutex).
>>
>> What are your thoughts on the incremental approach?
>>
>> Casey
>>
>> ps. I'd love to see more thought put into the design of the finished product,
>> and your outline is a good start! Avi Kivity @scylladb shared one suggestion
>> that I really liked, which was to give each shard of the osd a separate network
>> endpoint, and add enough information to the osdmap so that clients could
>> send their messages directly to the shard that would process them. That
>> piece can come in later, but could eliminate some of the extra latency from
>> your step 3.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-13 13:35             ` kefu chai
@ 2018-02-13 15:58               ` Casey Bodley
  0 siblings, 0 replies; 15+ messages in thread
From: Casey Bodley @ 2018-02-13 15:58 UTC (permalink / raw)
  To: kefu chai, Matt Benjamin
  Cc: Gregory Farnum, Josh Durgin, Adam Emerson, ceph-devel



On 02/13/2018 08:35 AM, kefu chai wrote:
> On Mon, Feb 12, 2018 at 11:55 PM, Matt Benjamin <mbenjami@redhat.com> wrote:
>> How does tame reactor induce more OSD sessions (@greg);  @kefu, in't
>> the hybrid model another way of saying, tame reactor?  The intuition
> i just realized that it is =)
>
>> I've had to this point is that the interfacing here is essentially
>> similar to making seastar interact with anything else, including
>> frameworks (disks, memory devices) that it absolutely wants to and
>> must.
>>
>

Sorry, I'm probably confusing things here by reusing the 'tame reactor' 
term that Adam coined. The original idea was to allow external threads 
to construct a seastar reactor, and call some new function that does a 
single polling/event loop before returning control to that thread. The 
word 'tame' meant that it wouldn't steal control from your thread 
forever like reactor::run() does.

This 'external queue' idea is a different and less ambitious way to 
integrate with non-seastar threads that got some upstream buy-in from Avi.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-13 15:46         ` Casey Bodley
@ 2018-02-13 16:17           ` liuchang0812
  2018-02-14  3:16             ` Allen Samuels
  0 siblings, 1 reply; 15+ messages in thread
From: liuchang0812 @ 2018-02-13 16:17 UTC (permalink / raw)
  To: Casey Bodley
  Cc: Allen Samuels, kefu chai, Josh Durgin, Adam Emerson,
	Gregory Farnum, ceph-devel

rocksdb abstracts those synchronization primitives in
https://github.com/facebook/rocksdb/blob/master/port/port.h. and here
is an example port:
https://github.com/facebook/rocksdb/blob/master/port/port_example.h
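
(the interface those port headers expect is small; paraphrasing
port_example.h, a seastar-friendly port would have to supply something
shaped like this:)

// declarations only, paraphrased from rocksdb's port_example.h
namespace port {

class Mutex {
 public:
  Mutex();
  void Lock();
  void Unlock();
  void AssertHeld();   // may assert that the mutex is held, or be a no-op
};

class CondVar {
 public:
  explicit CondVar(Mutex* mu);
  void Wait();         // atomically release *mu and block, then reacquire
  void Signal();
  void SignalAll();
};

}  // namespace port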

2018-02-13 23:46 GMT+08:00, Casey Bodley <cbodley@redhat.com>:
>
>
> On 02/12/2018 02:40 PM, Allen Samuels wrote:
>> I would think that it ought to be reasonably straightforward to get
>> RocksDB (or other thread-based foreign code) to run under the seastar
>> framework provided that you're able to locate all os-invoking primitives
>> within the foreign code and convert those into calls into your
>> compatibility layer. That layer would have to simulate context switching
>> (relatively easy) as well as provide an implementation of that kernel
>> call. In the case of RocksDB, some of that work has already been done
>> (generally, the file and I/O operations are done through a compatibility
>> layer that's provided as a parameter. I'm not as sure about the
>> synchronization primitives, but it ought to be relatively easy to extend
>> to cover those).
>>
>> Has this been discussed?
>
> I don't think it has, no. I'm not familiar with these rocksdb env
> interfaces, but this sounds promising.
>
>>
>> Allen Samuels
>> R&D Engineering Fellow
>>
>> Western Digital®
>> Email:  allen.samuels@wdc.com
>> Office:  +1-408-801-7030
>> Mobile: +1-408-780-6416
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Casey Bodley
>>> Sent: Wednesday, February 07, 2018 9:11 AM
>>> To: kefu chai <tchaikov@gmail.com>; Josh Durgin <jdurgin@redhat.com>
>>> Cc: Adam Emerson <aemerson@redhat.com>; Gregory Farnum
>>> <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
>>> Subject: Re: seastar and 'tame reactor'
>>>
>>>
>>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com>
>>> wrote:
>>>>> [adding ceph-devel]
>>>>>
>>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>> Hey Josh,
>>>>>>
>>>>>> I heard you mention in the call yesterday that you're looking into
>>>>>> this part of seastar integration. I was just reading through the
>>>>>> relevant code over the weekend, and wanted to compare notes:
>>>>>>
>>>>>>
>>>>>> in seastar, all cross-core communication goes through lockfree spsc
>>>>>> queues, which are encapsulated by 'class smp_message_queue' in
>>>>>> core/reactor.hh. all of these queues (smp::_qs) are allocated on
>>>>>> startup in smp::configure(). early in reactor::run() (which is
>>>>>> effectively each seastar thread's entrypoint), it registers a
>>>>>> smp_poller to poll all of the queues directed at that cpu
>>>>>>
>>>>>> what we need is a way to inject messages into each seastar reactor
>>>>>> from arbitrary/external threads. our requirements are very similar
>>>>>> to
>>>> i think we will have a sharded<osd::PublicService> on each core. in
>>>> each instance of PublicService, we will be listening and serving
>>>> requests from external clients of cluster. the same applies to
>>>> sharded<osd::ClusterService>, which will be responsible for serving
>>>> the requests from its peers in the cluster. the control flow of a
>>>> typical OSD read request from a public RADOS client will look like:
>>>>
>>>> 1. the TCP connection is accepted by one of the listening
>>>> sharded<osd::PublicService>.
>>>> 2. decode the message
>>>> 3. osd encapsulates the request in the message as a future, and submit
>>>> it to another core after hashing the involved pg # to the core #.
>>>> something like (in pseudo code):
>>>>     engine().submit_to(osdmap_shard, [] {
>>>>       return get_newer_osdmap(m->epoch);
>>>>       // need to figure out how to reference a "osdmap service" in
>>>> seastar.
>>>>     }).then([] (auto osdmap) {
>>>>       submit_to(pg_to_shard(m->ops.op.pg, [] {
>>>>         return pg.do_ops(m->ops);
>>>>       });
>>>>     });
>>>> 4. the core serving the involved pg (i.e. pg service) will dequeue
>>>> this request, and use read_dma() call to delegate the aio request to
>>>> the core maintaining the io queue.
>>>> 5. once the aio completes, the PublicService will continue on, with
>>>> the then() block. it will send the response back to client.
>>>>
>>>> so question is: why do we need a mpsc queue? the nr_core*nr_core spsc
>>>> is good enough for us, i think.
>>>>
>>> Hey Kefu,
>>>
>>> That sounds entirely reasonable, but assumes that everything will be
>>> running
>>> inside of seastar from the start. We've been looking for an incremental
>>> approach that would allow us to start with some subset running inside of
>>> seastar, with a mechanism for communication between that and the osd's
>>> existing threads. One suggestion was to start with just the messenger
>>> inside
>>> of seastar, and gradually move that seastar-to-external-thread boundary
>>> further down the io path as code is refactored to support it. It sounds
>>> unlikely that we'll ever get rocksdb running inside of seastar, so the
>>> objectstore will need its own threads until there's a viable
>>> alternative.
>>>
>>> So the mpsc queue and smp::external_submit_to() interface was a strategy
>>> for passing messages into seastar from arbitrary non-seastar threads.
>>> Communication in the other direction just needs to be non-blocking (my
>>> example just signaled a condition variable without holding its mutex).
>>>
>>> What are your thoughts on the incremental approach?
>>>
>>> Casey
>>>
>>> ps. I'd love to see more thought put into the design of the finished
>>> product,
>>> and your outline is a good start! Avi Kivity @scylladb shared one
>>> suggestion
>>> that I really liked, which was to give each shard of the osd a separate
>>> network
>>> endpoint, and add enough information to the osdmap so that clients could
>>> send their messages directly to the shard that would process them. That
>>> piece can come in later, but could eliminate some of the extra latency
>>> from
>>> your step 3.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the
>>> body of a message to majordomo@vger.kernel.org More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: seastar and 'tame reactor'
  2018-02-13 16:17           ` liuchang0812
@ 2018-02-14  3:16             ` Allen Samuels
  2018-02-15 20:04               ` Josh Durgin
  0 siblings, 1 reply; 15+ messages in thread
From: Allen Samuels @ 2018-02-14  3:16 UTC (permalink / raw)
  To: liuchang0812, Casey Bodley
  Cc: kefu chai, Josh Durgin, Adam Emerson, Gregory Farnum, ceph-devel

I'm not a RocksDB expert, but I did peek at the code. There really seem to be two different isolation strategies at play here: one strategy is associated with the "env" structure, which seems to use a classic abstract-base-class with virtual functions to provide environment-dependent implementations at run-time (mostly of the file-oriented operations). The second strategy (embodied in the ".../port/port_xxx.h" headers and assorted files) is a compile-time capability.

We could have a long discussion on the desirability of one scheme over the other (they have different advantages/disadvantages) and the appropriate places to use one or the other, but for my purposes, I'm going to leave that for a later day. I'm simply going to assume that we have the ability to replace each of the objects and APIs that might cause context switches with our own implementation of same, and ignore all of the difficulty and negatives associated with having that capability (they are legion!); we can return to that discussion later if there is a belief in the merits of this proposal.

I'm also going to say that I have only a cursory understanding of seastar, so no doubt, there will be inaccuracies stemming from that too....

The essential problem confronting us is how to convert the "synchronous" RocksDB interface (i.e., subroutine calls with threads that block as required) into  the "asynchronous" seastar-style interface (promises, futures, etc.) without re-writing all of the code.

The problem with the Rocks interface is that when a client calls a Rocks API, that API runs on the caller's stack and might invoke an operation that would block -- thereby freezing the entire seastar machine. Without loss of generality, I'll model all blocking operations as a combination of three events: (1) transmission of a "request" message to a recipient, (2) suspension of the calling activity [blocking], and (3) resumption of the blocked activity by the recipient or his agent [unblocking]. The purpose of (1) is to inform the recipient of its responsibility for unblocking this requestor in the future. This easily models synchronous I/O operations, synchronization primitives, timers, and other implicitly blocking operations (like calling the kernel to allocate some pages).

The solution is simple user-space stack switching, which is supported by the standard C library routines makecontext, setcontext, swapcontext, and getcontext. If you're not familiar with those, go read up on 'em.

In the proposed solution we intercept each Rocks call BEFORE it goes into Rocks code (again, I'll assume the appropriate compatibility/interceptor layer exists, with detailed implementation discussion deferred), create a NEW stack (getcontext, makecontext, swapcontext/setcontext) that's different from the seastar thread stack, and invoke the actual Rocks API code using the new stack. If the Rocks code completes without blocking, all is well and good: you return back to seastar (setcontext/swapcontext) and exercise the fast-path case of satisfying your future/promise immediately. However, if the Rocks code needs to block, our new compatibility layer will perform operations (1) and (2) and then switch BACK to the calling seastar stack, indicating that the work is still in progress. Now the seastar machinery is fully operational (even though one call is suspended -- blocked). Eventually, (3) happens, at which point the recipient causes a switch to the suspended stack (swapcontext/setcontext), resuming the previously suspended Rocks code (yes, some magic is required, see below). If that API call now completes you switch back to seastar and satisfy the original invoking promise/future and all is good (yea, recover the stack, blah blah). Of course the API call could block again, which is fine; you just go back and do it again :).
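
To make the mechanics concrete, here is a minimal, self-contained sketch of just the stack switching -- plain ucontext, a fake Rocks call, and none of the real interceptor layer (every name here is made up for illustration):

#include <ucontext.h>
#include <cstdio>
#include <vector>

static ucontext_t seastar_ctx;     // the "reactor" side we switch back to
static ucontext_t rocks_ctx;       // the stack running the Rocks call
static bool call_finished = false;

// Stand-in for a Rocks API call that blocks once before completing.
static void fake_rocks_call() {
    std::printf("rocks: starting work\n");
    swapcontext(&rocks_ctx, &seastar_ctx);      // steps (1)+(2): park this stack
    std::printf("rocks: resumed, finishing\n"); // step (3) brought us back here
    call_finished = true;                       // returning follows uc_link
}

int main() {
    std::vector<char> stack(64 * 1024);
    getcontext(&rocks_ctx);
    rocks_ctx.uc_stack.ss_sp = stack.data();
    rocks_ctx.uc_stack.ss_size = stack.size();
    rocks_ctx.uc_link = &seastar_ctx;           // where a completed call returns to
    makecontext(&rocks_ctx, fake_rocks_call, 0);

    swapcontext(&seastar_ctx, &rocks_ctx);      // run the call on its own stack
    if (!call_finished)
        std::printf("reactor: call parked, reactor keeps running\n");

    swapcontext(&seastar_ctx, &rocks_ctx);      // the unblocking event arrived
    std::printf("reactor: finished=%d\n", (int)call_finished);
    return 0;
}

The real interceptor would of course stash the parked ucontext_t in a per-call structure and hand it to whatever agent is responsible for step (3), rather than using globals like this toy does.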

Internal Rocks threads aren't really much different: the thread-start proxy just treats them as an external call as described above. Once they're started, their associated stack never goes away (until the equivalent of join at shutdown).

Basically, we've simply built a small operating system except it's using non-preemptive scheduling.

Careful readers will notice that steps (2) and (3) really have two sub-cases. In one sub-case, the message recipient is another seastar promise/future (this happens with sync primitives) and is relatively easy to implement without any external locks (since it's all being done within the realm of a single seastar thread, no locking is required). The other sub-case is the more interesting case of when the message recipient is NOT within the seastar framework -- think I/O operation, etc. This is where my lack of detailed knowledge of seastar will show: it's relatively easy to do (2), since this ought not invoke anything worse than putting a message on a queue (which can be lockless) and then setting a condition variable to wake up the external entity that's going to do the actual processing (which shouldn't block). This might even be short-circuited in, say, the case of an SPDK I/O operation, where seastar could actually queue the request and simply assume that some other agent will eventually detect the I/O completion (in essence the NVMe queue becomes the recipient of the message). Doing (3) is the tricky part: seastar is going to have to poll some kind of message queue that contains unblocking messages from the external world. Again this could be lockless, but it will need to be polled with the appropriate frequency to make sure that nothing gets starved out (indeed the interceptor layer described above is likely required to perform this polling, as well as other places in seastar land).
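
As a sketch of that last part -- the unblock queue behind (3) -- something like the following could sit behind the poller. boost::lockfree is used just for illustration, and how this hooks into the reactor's own poller machinery is hand-waving on my part:

#include <boost/lockfree/queue.hpp>
#include <ucontext.h>

struct suspended_call {
    ucontext_t ctx;   // the Rocks stack parked in step (2)
};

// Completion threads push, the reactor side pops. boost::lockfree::queue is
// MPMC, which is more than we need, but fine for a sketch.
static boost::lockfree::queue<suspended_call*> unblock_queue(128);

// Called by the external agent (aio completion thread, timer thread, ...)
// to perform step (3).
void notify_unblocked(suspended_call* call) {
    while (!unblock_queue.push(call)) {
        // queue full: spin or back off
    }
}

// Called from the reactor loop at an appropriate frequency so nothing starves.
bool poll_unblocked(ucontext_t* reactor_ctx) {
    suspended_call* call = nullptr;
    bool did_work = false;
    while (unblock_queue.pop(call)) {
        did_work = true;
        // resume the parked stack; it swaps back to reactor_ctx when it
        // blocks again or completes
        swapcontext(reactor_ctx, &call->ctx);
    }
    return did_work;
}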

That's it in a nutshell. The mini-operating system isn't that difficult to write. Almost all of the basic Rocks API operations are easily handled with some simple macros and templated classes. The basic internal stack switching isn't very difficult either -- though it can be a bit of a bi**ch to debug if you're not used to having stacks switch out from underneath you :)

Allen Samuels  
R&D Engineering Fellow 

Western Digital® 
Email:  allen.samuels@wdc.com 
Office:  +1-408-801-7030
Mobile: +1-408-780-6416 

-----Original Message-----
From: liuchang0812 [mailto:liuchang0812@gmail.com] 
Sent: Tuesday, February 13, 2018 8:17 AM
To: Casey Bodley <cbodley@redhat.com>
Cc: Allen Samuels <Allen.Samuels@wdc.com>; kefu chai <tchaikov@gmail.com>; Josh Durgin <jdurgin@redhat.com>; Adam Emerson <aemerson@redhat.com>; Gregory Farnum <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: seastar and 'tame reactor'

rocksdb abstracts those synchronization primitives in https://github.com/facebook/rocksdb/blob/master/port/port.h. and here is a example port:
https://github.com/facebook/rocksdb/blob/master/port/port_example.h

2018-02-13 23:46 GMT+08:00, Casey Bodley <cbodley@redhat.com>:
>
>
> On 02/12/2018 02:40 PM, Allen Samuels wrote:
>> I would think that it ought to be reasonably straightforward to get 
>> RocksDB (or other thread-based foreign code) to run under the seastar 
>> framework provided that you're able to locate all os-invoking 
>> primitives within the foreign code and convert those into calls into 
>> your compatibility layer. That layer would have to simulate context 
>> switching (relatively easy) as well as provide an implementation of 
>> that kernel call. In the case of RocksDB, some of that work has 
>> already been done (generally, the file and I/O operations are done 
>> through a compatibility layer that's provided as a parameter. I'm not 
>> as sure about the synchronization primitives, but it ought to be 
>> relatively easy to extend to cover those).
>>
>> Has this been discussed?
>
> I don't think it has, no. I'm not familiar with these rocksdb env 
> interfaces, but this sounds promising.
>
>>
>> Allen Samuels
>> R&D Engineering Fellow
>>
>> Western Digital®
>> Email:  allen.samuels@wdc.com
>> Office:  +1-408-801-7030
>> Mobile: +1-408-780-6416
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>> owner@vger.kernel.org] On Behalf Of Casey Bodley
>>> Sent: Wednesday, February 07, 2018 9:11 AM
>>> To: kefu chai <tchaikov@gmail.com>; Josh Durgin <jdurgin@redhat.com>
>>> Cc: Adam Emerson <aemerson@redhat.com>; Gregory Farnum 
>>> <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
>>> Subject: Re: seastar and 'tame reactor'
>>>
>>>
>>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com>
>>> wrote:
>>>>> [adding ceph-devel]
>>>>>
>>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>> Hey Josh,
>>>>>>
>>>>>> I heard you mention in the call yesterday that you're looking 
>>>>>> into this part of seastar integration. I was just reading through 
>>>>>> the relevant code over the weekend, and wanted to compare notes:
>>>>>>
>>>>>>
>>>>>> in seastar, all cross-core communication goes through lockfree 
>>>>>> spsc queues, which are encapsulated by 'class smp_message_queue' 
>>>>>> in core/reactor.hh. all of these queues (smp::_qs) are allocated 
>>>>>> on startup in smp::configure(). early in reactor::run() (which is 
>>>>>> effectively each seastar thread's entrypoint), it registers a 
>>>>>> smp_poller to poll all of the queues directed at that cpu
>>>>>>
>>>>>> what we need is a way to inject messages into each seastar 
>>>>>> reactor from arbitrary/external threads. our requirements are 
>>>>>> very similar to
>>>> i think we will have a sharded<osd::PublicService> on each core. in 
>>>> each instance of PublicService, we will be listening and serving 
>>>> requests from external clients of cluster. the same applies to 
>>>> sharded<osd::ClusterService>, which will be responsible for serving 
>>>> the requests from its peers in the cluster. the control flow of a 
>>>> typical OSD read request from a public RADOS client will look like:
>>>>
>>>> 1. the TCP connection is accepted by one of the listening 
>>>> sharded<osd::PublicService>.
>>>> 2. decode the message
>>>> 3. osd encapsulates the request in the message as a future, and 
>>>> submit it to another core after hashing the involved pg # to the core #.
>>>> something like (in pseudo code):
>>>>     engine().submit_to(osdmap_shard, [] {
>>>>       return get_newer_osdmap(m->epoch);
>>>>       // need to figure out how to reference a "osdmap service" in 
>>>> seastar.
>>>>     }).then([] (auto osdmap) {
>>>>       submit_to(pg_to_shard(m->ops.op.pg, [] {
>>>>         return pg.do_ops(m->ops);
>>>>       });
>>>>     });
>>>> 4. the core serving the involved pg (i.e. pg service) will dequeue 
>>>> this request, and use read_dma() call to delegate the aio request 
>>>> to the core maintaining the io queue.
>>>> 5. once the aio completes, the PublicService will continue on, with 
>>>> the then() block. it will send the response back to client.
>>>>
>>>> so question is: why do we need a mpsc queue? the nr_core*nr_core 
>>>> spsc is good enough for us, i think.
>>>>
>>> Hey Kefu,
>>>
>>> That sounds entirely reasonable, but assumes that everything will be 
>>> running inside of seastar from the start. We've been looking for an 
>>> incremental approach that would allow us to start with some subset 
>>> running inside of seastar, with a mechanism for communication 
>>> between that and the osd's existing threads. One suggestion was to 
>>> start with just the messenger inside of seastar, and gradually move 
>>> that seastar-to-external-thread boundary further down the io path as 
>>> code is refactored to support it. It sounds unlikely that we'll ever 
>>> get rocksdb running inside of seastar, so the objectstore will need 
>>> its own threads until there's a viable alternative.
>>>
>>> So the mpsc queue and smp::external_submit_to() interface was a 
>>> strategy for passing messages into seastar from arbitrary non-seastar threads.
>>> Communication in the other direction just needs to be non-blocking 
>>> (my example just signaled a condition variable without holding its mutex).
>>>
>>> What are your thoughts on the incremental approach?
>>>
>>> Casey
>>>
>>> ps. I'd love to see more thought put into the design of the finished 
>>> product, and your outline is a good start! Avi Kivity @scylladb 
>>> shared one suggestion that I really liked, which was to give each 
>>> shard of the osd a separate network endpoint, and add enough 
>>> information to the osdmap so that clients could send their messages 
>>> directly to the shard that would process them. That piece can come 
>>> in later, but could eliminate some of the extra latency from your 
>>> step 3.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe 
>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org 
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: seastar and 'tame reactor'
  2018-02-14  3:16             ` Allen Samuels
@ 2018-02-15 20:04               ` Josh Durgin
  2018-02-16 16:23                 ` Allen Samuels
  0 siblings, 1 reply; 15+ messages in thread
From: Josh Durgin @ 2018-02-15 20:04 UTC (permalink / raw)
  To: Allen Samuels
  Cc: liuchang0812, Casey Bodley, kefu chai, Adam Emerson,
	Gregory Farnum, ceph-devel

That's a good description of adapting traditional threaded code to
cooperative threading.  I agree it wouldn't be that much more work to
make a seastar <-> cooperative threading interface - in fact, seastar
already implements one via its own thread primitive, using
setcontext/longjmp etc. under the hood.
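
For reference, a rough sketch of what that looks like (names from memory -- the real interface is in seastar's core/thread.hh, so details may differ by version):

#include "core/thread.hh"
#include "core/future.hh"

seastar::future<int> wrapped_blocking_style() {
    // seastar::async() runs the lambda on a seastar::thread, whose stack the
    // reactor can park and resume, so waiting on a future with get0() only
    // suspends this stack rather than stalling the reactor.
    return seastar::async([] {
        auto f = seastar::make_ready_future<int>(42);  // stand-in for real work
        return f.get0();
    });
}

That covers the single-core case nicely; the cross-core part below is where it gets harder.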

The difficulty of this approach lies in making this work across
cores. For rocksdb in particular, making use of multiple cores is
necessary for compaction to keep up under heavy write and delete
workloads. This implies creating an M:N scheduler, which is quite a
bit of extra complexity.

The purpose of the tame reactor approach is to keep the existing
threaded code running while parts of it have been converted to seastar
native interfaces, so we can adapt the osd incrementally. Keeping the
threaded portion on kernel threads, scheduled by the kernel, avoids
significant complexity here.

Keeping the interface between seastar + threaded code explicit is an
advantage as well, since it makes it clear just from reading the code
which context it is running in.

Josh


----- Original Message -----
> From: "Allen Samuels" <Allen.Samuels@wdc.com>
> To: "liuchang0812" <liuchang0812@gmail.com>, "Casey Bodley" <cbodley@redhat.com>
> Cc: "kefu chai" <tchaikov@gmail.com>, "Josh Durgin" <jdurgin@redhat.com>, "Adam Emerson" <aemerson@redhat.com>,
> "Gregory Farnum" <gfarnum@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, February 13, 2018 7:16:26 PM
> Subject: RE: seastar and 'tame reactor'
> 
> I'm not a RocksDB expert, but I did peak at the code. There really seems to
> be two different isolation strategies at play here, one strategy is
> associated with the "env" structure which seems to use a classic
> abstract-base-class and virtual functions to provide environment dependent
> implementations at run-time (mostly of the file-oriented operations). The
> second strategy (embodied in the ".../port/port_xxx.h" directory and
> assorted files) is a compile-time capability.
> 
> We could have a long discussion on the desirability of one scheme over the
> other (which have different advantages/disadvantages) and the appropriate
> places to use one or the other, but for my purposes, I'm going to leave that
> for a later day. I'm simply going to assume that we have the ability to
> replace each of the objects and APIs that might cause context switches with
> our own implementation of same and ignore all of the difficulty and
> negatives associated with having that capability (they are legion !), we can
> return to that discussion later if there is a belief in the merits of this
> proposal.
> 
> I'm also going to say that I have only a cursory understanding of seastar, so
> no doubt, there will be inaccuracies stemming from that too....
> 
> The essential problem confronting us is how to convert the "synchronous"
> RocksDB interface (i.e., subroutine calls with threads that block as
> required) into  the "asynchronous" seastar-style interface (promises,
> futures, etc.) without re-writing all of the code.
> 
> The problem with the Rocks interface is that when a client calls a Rocks API,
> that API runs on the caller's stack and might invoke an operation that would
> block -- thereby freezing the entire seastar machine. Without loss of
> generality, I'll model all blocking operations as a combination of three
> events: (1) transmission of a "request" message to a recipient, (2)
> suspension of the calling activity [blocking], and (3) resumption of the
> blocking activity by the recipient or his agent [unblocking]. The purpose of
> (1) is to inform the recipient of of the responsibility of unblocking this
> requestor in the future. This easily models synchronous I/O operations,
> synchronization primitives, timers and other implicitly blocking operations
> (like calling the kernel to allocate some pages).
> 
> The solution is simple user-space stack switching, which is supported by the
> standard C library routines makecontext, setcontext, swapcontext, and
> getcontext. If you're not familiar with those, go read up on 'em.
> 
> In the proposed solution we intercept each Rocks call BEFORE it goes into
> Rocks code (again, I'll assume the appropriate compatibility/interceptor
> layer to exist with detailed implementation discussion deferred), create a
> NEW stack (getcontext, makecontext, swapcontext/setcontext) that's different
> from the seastar thread stack and to invoke the actual Rocks API code using
> the new stack. If the Rocks code completes without blocking all is well and
> good, you return back to seastar (setcontext/swapcontext) and exercise the
> fast-path case of satisfying your future/promise immediately. However, if
> the Rocks code needs to block, our new compatibility layer will perform
> operations (1) and (2)  and then switch BACK to the calling seastar stack
> indicating that the work is still in progress. Now the seastar machiner is
> fully operational (even though one call is suspending -- blocked ).
> Eventually, (3) happens at which point the recipient cases a switch to the
> suspended stack (swapcontext/setcontext) resuming the previously suspended
> Rocks code (yes, some magic is required, see below). If that API call now
> completes you switch back to seastar and satisfy the original invoking
> promise/future and all is good (yea, recover the stack, blah blah). Of
> course the API call could block again, which is fine you just go back and do
> it again :).
> 
> Internal Rocks threads aren't really much different, the thread-start proxy
> just treats them as an external call as described above. Once they're
> started -- their associated stack never goes away (until the equivalent of
> join at shutdown).
> 
> Basically, we've simply built a small operating system except it's using
> non-preemptive scheduling.
> 
> Careful readers will notice that steps (2) and (3) really have two sub-cases.
> In one sub-case, the message recipient is another seastar promise/future
> (this happens with sync primitives) and is relatively easy to implement
> without any external locks (since it's all being done within the realm of a
> single seastar thread, no locking is required). The other sub-case is the
> more interesting case of when the message recipient is NOT within the
> seastar framework -- think I/O operation, etc. This is where my lack of
> detailed knowledge of seastar will show, it's relatively easy to do (2),
> since this ought not invoke anything worse than putting a message on a queue
> (which can be lockless) and then setting a condition variable to wake up the
> external entity that's going to do the actual processing (which shouldn't
> block). This might even be short-circuited in say the case of an SPDK I/O
> operation where seastar could actually queue the request and simply assume
> that some other agent will eventually detect the I/O completion (in essence
> the NVMe queue becomes of the recipient of the message). Doing (3) is the
> tricky part, seastar is going to have to poll some kind of message queue
> that contains unblocking messages from the external world, again this could
> be lockless, but it will need to be polled with the appropriate frequency to
> make sure that nothing gets starved out (indeed the interceptor layer
> described above is likely required to perform this polling as well as other
> places in seastar land).
> 
> That's it in a nutshell. The mini-operating system isn't that difficult to
> write. Almost all of the basic Rocks API operations are easily handled with
> some simple macros and templated classes. The basic internal stack switching
> isn't very difficult either -- though it can be a bit of bi**ch to debug if
> you're not used to have stacks switching out from underneath of you :)
> 
> Allen Samuels
> R&D Engineering Fellow
> 
> Western Digital®
> Email:  allen.samuels@wdc.com
> Office:  +1-408-801-7030
> Mobile: +1-408-780-6416
> 
> -----Original Message-----
> From: liuchang0812 [mailto:liuchang0812@gmail.com]
> Sent: Tuesday, February 13, 2018 8:17 AM
> To: Casey Bodley <cbodley@redhat.com>
> Cc: Allen Samuels <Allen.Samuels@wdc.com>; kefu chai <tchaikov@gmail.com>;
> Josh Durgin <jdurgin@redhat.com>; Adam Emerson <aemerson@redhat.com>;
> Gregory Farnum <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: seastar and 'tame reactor'
> 
> rocksdb abstracts those synchronization primitives in
> https://github.com/facebook/rocksdb/blob/master/port/port.h. and here is a
> example port:
> https://github.com/facebook/rocksdb/blob/master/port/port_example.h
> 
> 2018-02-13 23:46 GMT+08:00, Casey Bodley <cbodley@redhat.com>:
> >
> >
> > On 02/12/2018 02:40 PM, Allen Samuels wrote:
> >> I would think that it ought to be reasonably straightforward to get
> >> RocksDB (or other thread-based foreign code) to run under the seastar
> >> framework provided that you're able to locate all os-invoking
> >> primitives within the foreign code and convert those into calls into
> >> your compatibility layer. That layer would have to simulate context
> >> switching (relatively easy) as well as provide an implementation of
> >> that kernel call. In the case of RocksDB, some of that work has
> >> already been done (generally, the file and I/O operations are done
> >> through a compatibility layer that's provided as a parameter. I'm not
> >> as sure about the synchronization primitives, but it ought to be
> >> relatively easy to extend to cover those).
> >>
> >> Has this been discussed?
> >
> > I don't think it has, no. I'm not familiar with these rocksdb env
> > interfaces, but this sounds promising.
> >
> >>
> >> Allen Samuels
> >> R&D Engineering Fellow
> >>
> >> Western Digital®
> >> Email:  allen.samuels@wdc.com
> >> Office:  +1-408-801-7030
> >> Mobile: +1-408-780-6416
> >>
> >>
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>> owner@vger.kernel.org] On Behalf Of Casey Bodley
> >>> Sent: Wednesday, February 07, 2018 9:11 AM
> >>> To: kefu chai <tchaikov@gmail.com>; Josh Durgin <jdurgin@redhat.com>
> >>> Cc: Adam Emerson <aemerson@redhat.com>; Gregory Farnum
> >>> <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> >>> Subject: Re: seastar and 'tame reactor'
> >>>
> >>>
> >>> On 02/07/2018 11:01 AM, kefu chai wrote:
> >>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com>
> >>> wrote:
> >>>>> [adding ceph-devel]
> >>>>>
> >>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
> >>>>>> Hey Josh,
> >>>>>>
> >>>>>> I heard you mention in the call yesterday that you're looking
> >>>>>> into this part of seastar integration. I was just reading through
> >>>>>> the relevant code over the weekend, and wanted to compare notes:
> >>>>>>
> >>>>>>
> >>>>>> in seastar, all cross-core communication goes through lockfree
> >>>>>> spsc queues, which are encapsulated by 'class smp_message_queue'
> >>>>>> in core/reactor.hh. all of these queues (smp::_qs) are allocated
> >>>>>> on startup in smp::configure(). early in reactor::run() (which is
> >>>>>> effectively each seastar thread's entrypoint), it registers a
> >>>>>> smp_poller to poll all of the queues directed at that cpu
> >>>>>>
> >>>>>> what we need is a way to inject messages into each seastar
> >>>>>> reactor from arbitrary/external threads. our requirements are
> >>>>>> very similar to
> >>>> i think we will have a sharded<osd::PublicService> on each core. in
> >>>> each instance of PublicService, we will be listening and serving
> >>>> requests from external clients of cluster. the same applies to
> >>>> sharded<osd::ClusterService>, which will be responsible for serving
> >>>> the requests from its peers in the cluster. the control flow of a
> >>>> typical OSD read request from a public RADOS client will look like:
> >>>>
> >>>> 1. the TCP connection is accepted by one of the listening
> >>>> sharded<osd::PublicService>.
> >>>> 2. decode the message
> >>>> 3. osd encapsulates the request in the message as a future, and
> >>>> submit it to another core after hashing the involved pg # to the core #.
> >>>> something like (in pseudo code):
> >>>>     engine().submit_to(osdmap_shard, [] {
> >>>>       return get_newer_osdmap(m->epoch);
> >>>>       // need to figure out how to reference a "osdmap service" in
> >>>> seastar.
> >>>>     }).then([] (auto osdmap) {
> >>>>       submit_to(pg_to_shard(m->ops.op.pg, [] {
> >>>>         return pg.do_ops(m->ops);
> >>>>       });
> >>>>     });
> >>>> 4. the core serving the involved pg (i.e. pg service) will dequeue
> >>>> this request, and use read_dma() call to delegate the aio request
> >>>> to the core maintaining the io queue.
> >>>> 5. once the aio completes, the PublicService will continue on, with
> >>>> the then() block. it will send the response back to client.
> >>>>
> >>>> so question is: why do we need a mpsc queue? the nr_core*nr_core
> >>>> spsc is good enough for us, i think.
> >>>>
> >>> Hey Kefu,
> >>>
> >>> That sounds entirely reasonable, but assumes that everything will be
> >>> running inside of seastar from the start. We've been looking for an
> >>> incremental approach that would allow us to start with some subset
> >>> running inside of seastar, with a mechanism for communication
> >>> between that and the osd's existing threads. One suggestion was to
> >>> start with just the messenger inside of seastar, and gradually move
> >>> that seastar-to-external-thread boundary further down the io path as
> >>> code is refactored to support it. It sounds unlikely that we'll ever
> >>> get rocksdb running inside of seastar, so the objectstore will need
> >>> its own threads until there's a viable alternative.
> >>>
> >>> So the mpsc queue and smp::external_submit_to() interface was a
> >>> strategy for passing messages into seastar from arbitrary non-seastar
> >>> threads.
> >>> Communication in the other direction just needs to be non-blocking
> >>> (my example just signaled a condition variable without holding its
> >>> mutex).
> >>>
> >>> What are your thoughts on the incremental approach?
> >>>
> >>> Casey
> >>>
> >>> ps. I'd love to see more thought put into the design of the finished
> >>> product, and your outline is a good start! Avi Kivity @scylladb
> >>> shared one suggestion that I really liked, which was to give each
> >>> shard of the osd a separate network endpoint, and add enough
> >>> information to the osdmap so that clients could send their messages
> >>> directly to the shard that would process them. That piece can come
> >>> in later, but could eliminate some of the extra latency from your
> >>> step 3.
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe
> >>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: seastar and 'tame reactor'
  2018-02-15 20:04               ` Josh Durgin
@ 2018-02-16 16:23                 ` Allen Samuels
  0 siblings, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2018-02-16 16:23 UTC (permalink / raw)
  To: Josh Durgin
  Cc: liuchang0812, Casey Bodley, kefu chai, Adam Emerson,
	Gregory Farnum, ceph-devel

I understand the concern about cross-core issues, but I don't see what's different. In other words, doesn't the tame reactor scheme have essentially the same issues?

Seastar postulates a shared-nothing model with each CPU running independently. Cross-CPU messaging is assumed to be rare. More importantly, there's no built-in automatic cross-CPU load balancing capability (doesn't make sense in the shared-nothing model). Isn't that the real problem, i.e., that you're trying to retain the shared-memory model for a portion of the code?

As an aside, I would note that it could be simple to solve the Rocks problem simply by making a full Rocks instance for each CPU, i.e., use it in the shared-nothing model that matches seastar's design. (This assumes that we solve the multiple OSD in one process problem, which we probably need to do anyways, if we ever expect to use RDMA). But let's ignore that for our discussion.

So let's focus on the world where we have some code running as seastar threads and some code running on old-style threads; the task is how to map these onto our CPUs. Let's assume we have C CPUs, M seastar threads, and N non-seastar threads that we want to run (that ugly M:N scheduler you referenced).

One really simple scheme is to just assign one seastar (M) thread to each CPU core and then let the N non-seastar threads run on any core -- competing for the various CPUs using the standard Linux scheduler. This is essentially what we have now and will have all of the same ugliness (I'm assuming that the seastar reactor will block when there's nothing to do; however, if it polls, then this scheme might require changes there).

Another scheme is to make a fixed allocation between the new/old code, i.e., dedicate some number of CPUs to seastar threads and run all of the non-seastar code on the remaining CPUs. The downside here is the fixed allocation of resources. However, if we assume that over time the non-seastar code shrinks, then this might be an acceptable transitional scheme.
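
The plumbing for that second scheme is small -- roughly something like this, assuming seastar can be confined to its own cores (it has a cpuset-style startup option; name from memory) and the legacy threads get pinned to the remainder:

#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a legacy (non-seastar) thread to CPUs [first_cpu, last_cpu], i.e. the
// cores we did NOT hand to the seastar reactors.
void pin_legacy_thread(std::thread& t, int first_cpu, int last_cpu) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    for (int c = first_cpu; c <= last_cpu; ++c)
        CPU_SET(c, &cpus);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpus), &cpus);
}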

In any case, I believe you still have to solve the M:N scheduling problem.

Allen Samuels  
R&D Engineering Fellow 

Western Digital® 
Email:  allen.samuels@wdc.com 
Office:  +1-408-801-7030
Mobile: +1-408-780-6416 


> -----Original Message-----
> From: Josh Durgin [mailto:jdurgin@redhat.com]
> Sent: Thursday, February 15, 2018 12:04 PM
> To: Allen Samuels <Allen.Samuels@wdc.com>
> Cc: liuchang0812 <liuchang0812@gmail.com>; Casey Bodley
> <cbodley@redhat.com>; kefu chai <tchaikov@gmail.com>; Adam Emerson
> <aemerson@redhat.com>; Gregory Farnum <gfarnum@redhat.com>; ceph-
> devel <ceph-devel@vger.kernel.org>
> Subject: Re: seastar and 'tame reactor'
> 
> That's a good description of adapting traditional threaded code to
> cooperative threading.  I agree it wouldn't be that much more work to make
> a seastar <-> cooperative threading interface - in fact, seastar already
> implements one via its own thread primitive, using setcontext/longjmp etc.
> under the hood.
> 
> The difficulty of this approach lies in making this work across cores. For
> rocksdb in particular, making use of multiple cores is necessary for
> compaction to keep up under heavy write and delete workloads. This implies
> creating an M:N scheduler, which is quite a bit of extra complexity.
> 
> The purpose of the tame reactor approach is to keep the existing threaded
> code running while parts of it have been converted to seastar native
> interfaces, so we can adapt the osd incrementally. Keeping the threaded
> portion using kernel threads using the kernel scheduler avoids significant
> complexity here.
> 
> Keeping the interface between seastar + threaded code explicit is an
> advantage as well, since it makes it clear just from reading the code which
> context it is running in.
> 
> Josh
> 
> 
> ----- Original Message -----
> > From: "Allen Samuels" <Allen.Samuels@wdc.com>
> > To: "liuchang0812" <liuchang0812@gmail.com>, "Casey Bodley"
> > <cbodley@redhat.com>
> > Cc: "kefu chai" <tchaikov@gmail.com>, "Josh Durgin"
> > <jdurgin@redhat.com>, "Adam Emerson" <aemerson@redhat.com>,
> "Gregory
> > Farnum" <gfarnum@redhat.com>, "ceph-devel"
> > <ceph-devel@vger.kernel.org>
> > Sent: Tuesday, February 13, 2018 7:16:26 PM
> > Subject: RE: seastar and 'tame reactor'
> >
> > I'm not a RocksDB expert, but I did peak at the code. There really
> > seems to be two different isolation strategies at play here, one
> > strategy is associated with the "env" structure which seems to use a
> > classic abstract-base-class and virtual functions to provide
> > environment dependent implementations at run-time (mostly of the
> > file-oriented operations). The second strategy (embodied in the
> > ".../port/port_xxx.h" directory and assorted files) is a compile-time
> capability.
> >
> > We could have a long discussion on the desirability of one scheme over
> > the other (which have different advantages/disadvantages) and the
> > appropriate places to use one or the other, but for my purposes, I'm
> > going to leave that for a later day. I'm simply going to assume that
> > we have the ability to replace each of the objects and APIs that might
> > cause context switches with our own implementation of same and ignore
> > all of the difficulty and negatives associated with having that
> > capability (they are legion !), we can return to that discussion later
> > if there is a belief in the merits of this proposal.
> >
> > I'm also going to say that I have only a cursory understanding of
> > seastar, so no doubt, there will be inaccuracies stemming from that too....
> >
> > The essential problem confronting us is how to convert the "synchronous"
> > RocksDB interface (i.e., subroutine calls with threads that block as
> > required) into  the "asynchronous" seastar-style interface (promises,
> > futures, etc.) without re-writing all of the code.
> >
> > The problem with the Rocks interface is that when a client calls a
> > Rocks API, that API runs on the caller's stack and might invoke an
> > operation that would block -- thereby freezing the entire seastar
> > machine. Without loss of generality, I'll model all blocking
> > operations as a combination of three
> > events: (1) transmission of a "request" message to a recipient, (2)
> > suspension of the calling activity [blocking], and (3) resumption of
> > the blocking activity by the recipient or his agent [unblocking]. The
> > purpose of
> > (1) is to inform the recipient of of the responsibility of unblocking
> > this requestor in the future. This easily models synchronous I/O
> > operations, synchronization primitives, timers and other implicitly
> > blocking operations (like calling the kernel to allocate some pages).
> >
> > The solution is simple user-space stack switching, which is supported
> > by the standard C library routines makecontext, setcontext,
> > swapcontext, and getcontext. If you're not familiar with those, go read up
> on 'em.
> >
> > In the proposed solution we intercept each Rocks call BEFORE it goes
> > into Rocks code (again, I'll assume the appropriate
> > compatibility/interceptor layer to exist with detailed implementation
> > discussion deferred), create a NEW stack (getcontext, makecontext,
> > swapcontext/setcontext) that's different from the seastar thread stack
> > and to invoke the actual Rocks API code using the new stack. If the
> > Rocks code completes without blocking all is well and good, you return
> > back to seastar (setcontext/swapcontext) and exercise the fast-path
> > case of satisfying your future/promise immediately. However, if the
> > Rocks code needs to block, our new compatibility layer will perform
> > operations (1) and (2)  and then switch BACK to the calling seastar
> > stack indicating that the work is still in progress. Now the seastar machiner
> is fully operational (even though one call is suspending -- blocked ).
> > Eventually, (3) happens at which point the recipient cases a switch to
> > the suspended stack (swapcontext/setcontext) resuming the previously
> > suspended Rocks code (yes, some magic is required, see below). If that
> > API call now completes you switch back to seastar and satisfy the
> > original invoking promise/future and all is good (yea, recover the
> > stack, blah blah). Of course the API call could block again, which is
> > fine you just go back and do it again :).
> >
> > Internal Rocks threads aren't really much different, the thread-start
> > proxy just treats them as an external call as described above. Once
> > they're started -- their associated stack never goes away (until the
> > equivalent of join at shutdown).
> >
> > Basically, we've simply built a small operating system except it's
> > using non-preemptive scheduling.
> >
> > Careful readers will notice that steps (2) and (3) really have two sub-cases.
> > In one sub-case, the message recipient is another seastar
> > promise/future (this happens with sync primitives) and is relatively
> > easy to implement without any external locks (since it's all being
> > done within the realm of a single seastar thread, no locking is
> > required). The other sub-case is the more interesting case of when the
> > message recipient is NOT within the seastar framework -- think I/O
> > operation, etc. This is where my lack of detailed knowledge of seastar
> > will show, it's relatively easy to do (2), since this ought not invoke
> > anything worse than putting a message on a queue (which can be
> > lockless) and then setting a condition variable to wake up the
> > external entity that's going to do the actual processing (which
> > shouldn't block). This might even be short-circuited in say the case
> > of an SPDK I/O operation where seastar could actually queue the
> > request and simply assume that some other agent will eventually detect
> > the I/O completion (in essence the NVMe queue becomes of the recipient
> > of the message). Doing (3) is the tricky part, seastar is going to
> > have to poll some kind of message queue that contains unblocking
> > messages from the external world, again this could be lockless, but it
> > will need to be polled with the appropriate frequency to make sure
> > that nothing gets starved out (indeed the interceptor layer described
> above is likely required to perform this polling as well as other places in
> seastar land).
> >
> > That's it in a nutshell. The mini-operating system isn't that
> > difficult to write. Almost all of the basic Rocks API operations are
> > easily handled with some simple macros and templated classes. The
> > basic internal stack switching isn't very difficult either -- though
> > it can be a bit of bi**ch to debug if you're not used to have stacks
> > switching out from underneath of you :)
> >
> > Allen Samuels
> > R&D Engineering Fellow
> >
> > Western Digital®
> > Email:  allen.samuels@wdc.com
> > Office:  +1-408-801-7030
> > Mobile: +1-408-780-6416
> >
> > -----Original Message-----
> > From: liuchang0812 [mailto:liuchang0812@gmail.com]
> > Sent: Tuesday, February 13, 2018 8:17 AM
> > To: Casey Bodley <cbodley@redhat.com>
> > Cc: Allen Samuels <Allen.Samuels@wdc.com>; kefu chai
> > <tchaikov@gmail.com>; Josh Durgin <jdurgin@redhat.com>; Adam
> Emerson
> > <aemerson@redhat.com>; Gregory Farnum <gfarnum@redhat.com>;
> ceph-devel
> > <ceph-devel@vger.kernel.org>
> > Subject: Re: seastar and 'tame reactor'
> >
> > rocksdb abstracts those synchronization primitives in
> > https://github.com/facebook/rocksdb/blob/master/port/port.h. and here
> > is a example port:
> > https://github.com/facebook/rocksdb/blob/master/port/port_example.h
> >
> > 2018-02-13 23:46 GMT+08:00, Casey Bodley <cbodley@redhat.com>:
> > >
> > >
> > > On 02/12/2018 02:40 PM, Allen Samuels wrote:
> > >> I would think that it ought to be reasonably straightforward to get
> > >> RocksDB (or other thread-based foreign code) to run under the
> > >> seastar framework provided that you're able to locate all
> > >> os-invoking primitives within the foreign code and convert those
> > >> into calls into your compatibility layer. That layer would have to
> > >> simulate context switching (relatively easy) as well as provide an
> > >> implementation of that kernel call. In the case of RocksDB, some of
> > >> that work has already been done (generally, the file and I/O
> > >> operations are done through a compatibility layer that's provided
> > >> as a parameter. I'm not as sure about the synchronization
> > >> primitives, but it ought to be relatively easy to extend to cover those).
> > >>
> > >> Has this been discussed?
> > >
> > > I don't think it has, no. I'm not familiar with these rocksdb env
> > > interfaces, but this sounds promising.
> > >
> > >>
> > >> Allen Samuels
> > >> R&D Engineering Fellow
> > >>
> > >> Western Digital®
> > >> Email:  allen.samuels@wdc.com
> > >> Office:  +1-408-801-7030
> > >> Mobile: +1-408-780-6416
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>> owner@vger.kernel.org] On Behalf Of Casey Bodley
> > >>> Sent: Wednesday, February 07, 2018 9:11 AM
> > >>> To: kefu chai <tchaikov@gmail.com>; Josh Durgin
> > >>> <jdurgin@redhat.com>
> > >>> Cc: Adam Emerson <aemerson@redhat.com>; Gregory Farnum
> > >>> <gfarnum@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > >>> Subject: Re: seastar and 'tame reactor'
> > >>>
> > >>>
> > >>> On 02/07/2018 11:01 AM, kefu chai wrote:
> > >>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin <jdurgin@redhat.com>
> > >>> wrote:
> > >>>>> [adding ceph-devel]
> > >>>>>
> > >>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
> > >>>>>> Hey Josh,
> > >>>>>>
> > >>>>>> I heard you mention in the call yesterday that you're looking
> > >>>>>> into this part of seastar integration. I was just reading
> > >>>>>> through the relevant code over the weekend, and wanted to
> compare notes:
> > >>>>>>
> > >>>>>>
> > >>>>>> in seastar, all cross-core communication goes through lockfree
> > >>>>>> spsc queues, which are encapsulated by 'class
> smp_message_queue'
> > >>>>>> in core/reactor.hh. all of these queues (smp::_qs) are
> > >>>>>> allocated on startup in smp::configure(). early in
> > >>>>>> reactor::run() (which is effectively each seastar thread's
> > >>>>>> entrypoint), it registers a smp_poller to poll all of the
> > >>>>>> queues directed at that cpu
> > >>>>>>
> > >>>>>> what we need is a way to inject messages into each seastar
> > >>>>>> reactor from arbitrary/external threads. our requirements are
> > >>>>>> very similar to
> > >>>> i think we will have a sharded<osd::PublicService> on each core.
> > >>>> in each instance of PublicService, we will be listening and
> > >>>> serving requests from external clients of cluster. the same
> > >>>> applies to sharded<osd::ClusterService>, which will be
> > >>>> responsible for serving the requests from its peers in the
> > >>>> cluster. the control flow of a typical OSD read request from a public
> RADOS client will look like:
> > >>>>
> > >>>> 1. the TCP connection is accepted by one of the listening
> > >>>> sharded<osd::PublicService>.
> > >>>> 2. decode the message
> > >>>> 3. osd encapsulates the request in the message as a future, and
> > >>>> submit it to another core after hashing the involved pg # to the core
> #.
> > >>>> something like (in pseudo code):
> > >>>>     engine().submit_to(osdmap_shard, [] {
> > >>>>       return get_newer_osdmap(m->epoch);
> > >>>>       // need to figure out how to reference a "osdmap service"
> > >>>> in seastar.
> > >>>>     }).then([] (auto osdmap) {
> > >>>>       submit_to(pg_to_shard(m->ops.op.pg, [] {
> > >>>>         return pg.do_ops(m->ops);
> > >>>>       });
> > >>>>     });
> > >>>> 4. the core serving the involved pg (i.e. pg service) will
> > >>>> dequeue this request, and use read_dma() call to delegate the aio
> > >>>> request to the core maintaining the io queue.
> > >>>> 5. once the aio completes, the PublicService will continue on,
> > >>>> with the then() block. it will send the response back to client.
> > >>>>
> > >>>> so question is: why do we need a mpsc queue? the nr_core*nr_core
> > >>>> spsc is good enough for us, i think.
> > >>>>
> > >>> Hey Kefu,
> > >>>
> > >>> That sounds entirely reasonable, but assumes that everything will
> > >>> be running inside of seastar from the start. We've been looking
> > >>> for an incremental approach that would allow us to start with some
> > >>> subset running inside of seastar, with a mechanism for
> > >>> communication between that and the osd's existing threads. One
> > >>> suggestion was to start with just the messenger inside of seastar,
> > >>> and gradually move that seastar-to-external-thread boundary
> > >>> further down the io path as code is refactored to support it. It
> > >>> sounds unlikely that we'll ever get rocksdb running inside of
> > >>> seastar, so the objectstore will need its own threads until there's a
> viable alternative.
> > >>>
> > >>> So the mpsc queue and smp::external_submit_to() interface was a
> > >>> strategy for passing messages into seastar from arbitrary
> > >>> non-seastar threads.
> > >>> Communication in the other direction just needs to be non-blocking
> > >>> (my example just signaled a condition variable without holding its
> > >>> mutex).
> > >>>
> > >>> What are your thoughts on the incremental approach?
> > >>>
> > >>> Casey
> > >>>
> > >>> ps. I'd love to see more thought put into the design of the
> > >>> finished product, and your outline is a good start! Avi Kivity
> > >>> @scylladb shared one suggestion that I really liked, which was to
> > >>> give each shard of the osd a separate network endpoint, and add
> > >>> enough information to the osdmap so that clients could send their
> > >>> messages directly to the shard that would process them. That piece
> > >>> can come in later, but could eliminate some of the extra latency
> > >>> from your step 3.
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe
> > >>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> > >

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-02-16 16:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <d0f50268-72bb-1196-7ce9-0b9e21808ffb@redhat.com>
2018-01-30 22:32 ` seastar and 'tame reactor' Josh Durgin
2018-02-07 16:01   ` kefu chai
2018-02-07 17:11     ` Casey Bodley
2018-02-07 19:22       ` Gregory Farnum
2018-02-12 15:45         ` kefu chai
2018-02-12 15:55           ` Matt Benjamin
2018-02-12 15:57             ` Gregory Farnum
2018-02-13 13:35             ` kefu chai
2018-02-13 15:58               ` Casey Bodley
2018-02-12 19:40       ` Allen Samuels
2018-02-13 15:46         ` Casey Bodley
2018-02-13 16:17           ` liuchang0812
2018-02-14  3:16             ` Allen Samuels
2018-02-15 20:04               ` Josh Durgin
2018-02-16 16:23                 ` Allen Samuels

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.