From: Danil Kipnis <firstname.lastname@example.org>
To: Sagi Grimberg <email@example.com>
Cc: Jack Wang <firstname.lastname@example.org>, email@example.com, firstname.lastname@example.org, email@example.com, Christoph Hellwig <firstname.lastname@example.org>, email@example.com, firstname.lastname@example.org, email@example.com, Roman Pen <firstname.lastname@example.org>, email@example.com
Subject: Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Date: Thu, 11 Jul 2019 10:54:02 +0200
Message-ID: <CAHg0HuxZvXH899=M4vC7BTH-bP2J35aTwsGhiGoC8AamD8gOyA@mail.gmail.com> (raw)
In-Reply-To: <firstname.lastname@example.org>

Hi Sagi,

thanks a lot for the detailed reply. Answers inline below:

On Tue, Jul 9, 2019 at 9:46 PM Sagi Grimberg <email@example.com> wrote:
>
> Hi Danil and Jack,
>
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community
>
> That is not exactly correct AFAIR,
>
> My main issues which were raised before are:
> - IMO there isn't any justification to this ibtrs layering separation
>   given that the only user of this is your ibnbd. Unless you are
>   trying to submit another consumer, you should avoid adding another
>   subsystem that is not really general purpose.

We designed ibtrs not only with IBNBD in mind, but also as the transport layer for a distributed SDS. We'd like to be able to do what Ceph is capable of (automatic up-/down-scaling of the storage cluster, automatic recovery), but using in-kernel rdma-based IO transport drivers, thin-provisioned volume managers, etc., to keep the highest possible performance.
Among other things, that modest plan of ours should cover the following: when using IBNBD/SRP/NVMEoF to export devices (say, thin-provisioned volumes) from a server to a client and building an (md-)raid on top of the imported devices on the client side to provide redundancy across different machines, one gets very decent throughput and low latency, since the IOs are sent in parallel to the storage machines. One downside of this setup is that the resync traffic has to flow over the client, where the md-raid sits. Ideally, the resync traffic should flow directly between the two "legs" (storage machines) of the raid. The server side of such a "distributed raid", capable of direct syncing between the array members, would necessarily need some logic on the server side and hence could also sit on top of ibtrs. (To continue the analogy, the "server" side of an md-raid built on top of, say, two NVMEoF devices is just two block devices, which cannot communicate with each other.)

All in all, ibtrs is a library to establish a "fat", multipath, auto-reconnecting connection between two hosts on top of rdma, optimized for the transport of IO traffic.

> - ibtrs in general is using almost no infrastructure from the existing
>   kernel subsystems. Examples are:
>   - tag allocation mechanism (which I'm not clear why its needed)

As you correctly noticed, our client manages the buffers allocated and registered by the server on connection establishment. Our tags are just a mechanism to take and release those buffers for incoming requests on the client side. Since the buffers allocated by the server are shared between all the devices mapped from that server and all their HW queues (each having num_cpus of them), the mechanism behind get_tag/put_tag also takes care of fairness.
> - rdma rw abstraction similar to what we have in the core

On the one hand, we have only a single IO-related function: ibtrs_clt_request(READ/WRITE, session, ...), which executes an rdma write with imm, or requests an rdma write with imm to be executed by the server. On the other hand, we provide an abstraction to establish and manage what we call a "session", which consists of multiple paths (to do failover and multipath with different policies), where each path consists of num_cpu rdma connections. Once you have established a session, you can add or remove paths from it on the fly. In case the connection to the server is lost, the client periodically attempts to reconnect automatically. On the server side you get just sg-lists with a direction, READ or WRITE, as requested by the client. We designed this interface not only as the minimum required to build a block device on top of rdma, but also with a distributed raid in mind.

> - list_next_or_null_rr_rcu ??

We use that for multipath. The macro (and, more importantly, the way we use it) has been reviewed by Linus and quite closely by Paul E. McKenney. AFAIR the conclusion was that Roman's implementation is correct, but too tricky to use correctly to be included in the kernel as a public interface. See https://lkml.org/lkml/2018/5/18/659

> - few other examples sprinkled around..

To the best of our knowledge we addressed everything we got comments on and will definitely do so in the future.

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?

At the very beginning of the project we did some measurements and saw that it was faster. I'm not sure whether this is still true, since the hardware, the drivers and the rdma subsystem have changed since then.
Using rdma writes with imm also seemed to make the code simpler.

> Also, given that the server pre-allocate a substantial amount of memory
> for each connection, is it documented the requirements from the server
> side? Usually kernel implementations (especially upstream ones) will
> avoid imposing such large longstanding memory requirements on the system
> by default. I don't have a firm stand on this, but wanted to highlight
> this as you are sending this for upstream inclusion.

We definitely need to stress that somewhere. We will include it in the README and add it to the cover letter next time. Our memory management is indeed basically absent in favor of performance: the server reserves queue_depth buffers of, say, 512K each. Each buffer is used by the client for a single IO only, no matter how big the request is. So if the client only issues 4K IOs, we waste 508K * queue_depth of memory. We were aiming for the lowest possible latency from the beginning. It is probably possible to implement a cleverer allocator on the server side that wouldn't affect performance much.

> > and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
>
> Overall, while I absolutely support your cause to lower your maintenance
> overhead by having this sit upstream, I don't see why this can be
> helpful to anyone else in the rdma community. If instead you can
> crystallize why/how ibnbd is faster than anything else, and perhaps
> contribute a common infrastructure piece (or enhance an existing one)
> such that other existing ulps can leverage, it will be a lot more
> compelling to include it upstream.
>
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA,
>
> But who needs it? Can other ulps use it or pieces of it? I keep failing
> to understand why is this a benefit if its specific to your ibnbd?

See above, and please ask if you have further questions on this.

Thank you,
Danil.