From: liuchang0812
Subject: Re: seastar and 'tame reactor'
Date: Wed, 14 Feb 2018 00:17:18 +0800
To: Casey Bodley
Cc: Allen Samuels, kefu chai, Josh Durgin, Adam Emerson, Gregory Farnum, ceph-devel

rocksdb abstracts those synchronization primitives in
https://github.com/facebook/rocksdb/blob/master/port/port.h, and here is
an example port:
https://github.com/facebook/rocksdb/blob/master/port/port_example.h

2018-02-13 23:46 GMT+08:00, Casey Bodley:
>
>
> On 02/12/2018 02:40 PM, Allen Samuels wrote:
>> I would think that it ought to be reasonably straightforward to get
>> RocksDB (or other thread-based foreign code) to run under the seastar
>> framework, provided that you're able to locate all os-invoking
>> primitives within the foreign code and convert those into calls into
>> your compatibility layer. That layer would have to simulate context
>> switching (relatively easy) as well as provide an implementation of
>> each such kernel call. In the case of RocksDB, some of that work has
>> already been done (generally, the file and I/O operations are done
>> through a compatibility layer that's provided as a parameter.
>> I'm not as sure about the synchronization primitives, but it ought to
>> be relatively easy to extend it to cover those).
>>
>> Has this been discussed?
>
> I don't think it has, no. I'm not familiar with these rocksdb env
> interfaces, but this sounds promising.
>
>>
>> Allen Samuels
>> R&D Engineering Fellow
>>
>> Western Digital®
>> Email: allen.samuels@wdc.com
>> Office: +1-408-801-7030
>> Mobile: +1-408-780-6416
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Casey Bodley
>>> Sent: Wednesday, February 07, 2018 9:11 AM
>>> To: kefu chai; Josh Durgin
>>> Cc: Adam Emerson; Gregory Farnum; ceph-devel
>>> Subject: Re: seastar and 'tame reactor'
>>>
>>>
>>> On 02/07/2018 11:01 AM, kefu chai wrote:
>>>> On Wed, Jan 31, 2018 at 6:32 AM, Josh Durgin wrote:
>>>>> [adding ceph-devel]
>>>>>
>>>>> On 01/30/2018 01:56 PM, Casey Bodley wrote:
>>>>>> Hey Josh,
>>>>>>
>>>>>> I heard you mention in the call yesterday that you're looking into
>>>>>> this part of seastar integration. I was just reading through the
>>>>>> relevant code over the weekend, and wanted to compare notes:
>>>>>>
>>>>>>
>>>>>> in seastar, all cross-core communication goes through lockfree
>>>>>> spsc queues, which are encapsulated by 'class smp_message_queue'
>>>>>> in core/reactor.hh. all of these queues (smp::_qs) are allocated
>>>>>> on startup in smp::configure(). early in reactor::run() (which is
>>>>>> effectively each seastar thread's entrypoint), it registers a
>>>>>> smp_poller to poll all of the queues directed at that cpu.
>>>>>>
>>>>>> what we need is a way to inject messages into each seastar
>>>>>> reactor from arbitrary/external threads. our requirements are
>>>>>> very similar to
>>>> i think we will have a sharded<PublicService> on each core. in each
>>>> instance of PublicService, we will be listening for and serving
>>>> requests from external clients of the cluster.
>>>> the same applies to the sharded service which will be responsible
>>>> for serving the requests from its peers in the cluster. the control
>>>> flow of a typical OSD read request from a public RADOS client will
>>>> look like:
>>>>
>>>> 1. the TCP connection is accepted by one of the listening
>>>> sharded<PublicService> instances.
>>>> 2. decode the message.
>>>> 3. the osd encapsulates the request in the message as a future, and
>>>> submits it to another core after hashing the involved pg # to the
>>>> core #. something like (in pseudo code):
>>>>
>>>> engine().submit_to(osdmap_shard, [] {
>>>>   return get_newer_osdmap(m->epoch);
>>>>   // need to figure out how to reference an "osdmap service" in
>>>>   // seastar.
>>>> }).then([] (auto osdmap) {
>>>>   submit_to(pg_to_shard(m->ops.op.pg), [] {
>>>>     return pg.do_ops(m->ops);
>>>>   });
>>>> });
>>>>
>>>> 4. the core serving the involved pg (i.e. the pg service) will
>>>> dequeue this request, and use a read_dma() call to delegate the aio
>>>> request to the core maintaining the io queue.
>>>> 5. once the aio completes, the PublicService will continue with the
>>>> then() block, sending the response back to the client.
>>>>
>>>> so the question is: why do we need an mpsc queue? the
>>>> nr_core*nr_core spsc queues are good enough for us, i think.
>>>>
>>> Hey Kefu,
>>>
>>> That sounds entirely reasonable, but assumes that everything will be
>>> running inside of seastar from the start. We've been looking for an
>>> incremental approach that would allow us to start with some subset
>>> running inside of seastar, with a mechanism for communication between
>>> that and the osd's existing threads. One suggestion was to start with
>>> just the messenger inside of seastar, and gradually move that
>>> seastar-to-external-thread boundary further down the io path as code
>>> is refactored to support it. It sounds unlikely that we'll ever get
>>> rocksdb running inside of seastar, so the objectstore will need its
>>> own threads until there's a viable alternative.
>>>
>>> So the mpsc queue and smp::external_submit_to() interface was a
>>> strategy for passing messages into seastar from arbitrary non-seastar
>>> threads. Communication in the other direction just needs to be
>>> non-blocking (my example just signaled a condition variable without
>>> holding its mutex).
>>>
>>> What are your thoughts on the incremental approach?
>>>
>>> Casey
>>>
>>> ps. I'd love to see more thought put into the design of the finished
>>> product, and your outline is a good start! Avi Kivity @scylladb
>>> shared one suggestion that I really liked, which was to give each
>>> shard of the osd a separate network endpoint, and add enough
>>> information to the osdmap so that clients could send their messages
>>> directly to the shard that would process them. That piece can come in
>>> later, but could eliminate some of the extra latency from your
>>> step 3.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html