From: Danil Kipnis <email@example.com> To: Sagi Grimberg <firstname.lastname@example.org> Cc: Jack Wang <email@example.com>, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, Christoph Hellwig <email@example.com>, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, Roman Pen <email@example.com>, firstname.lastname@example.org Subject: Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Date: Fri, 12 Jul 2019 12:58:31 +0200 Message-ID: <CAHg0Huz93nUTYA8PcXOLaunBF0CT6yc-DcD8EhRJkw+GGL8c7A@mail.gmail.com> (raw) In-Reply-To: <email@example.com> On Fri, Jul 12, 2019 at 2:22 AM Sagi Grimberg <firstname.lastname@example.org> wrote: > > > >> My main issues which were raised before are: > >> - IMO there isn't any justification to this ibtrs layering separation > >> given that the only user of this is your ibnbd. Unless you are > >> trying to submit another consumer, you should avoid adding another > >> subsystem that is not really general purpose. > > We designed ibtrs not only with the IBNBD in mind but also as the > > transport layer for a distributed SDS. We'd like to be able to do what > > ceph is capable of (automatic up/down scaling of the storage cluster, > > automatic recovery) but using in-kernel rdma-based IO transport > > drivers, thin-provisioned volume managers, etc. to keep the highest > > possible performance. > > Sounds lovely, but still very much bound to your ibnbd. And that part > is not included in the patch set, so I still don't see why should this > be considered as a "generic" transport subsystem (it clearly isn't). Having IBTRS sit on a storage enables that storage to communicate with other storages (forward requests, request read from other storages i.e. for sync traffic). IBTRS is generic in the sense that it removes the strict separation into initiator (converting BIOs into some hardware specific protocol messages) and target (which forwards those messages to some local device supporting that protocol). It appears less generic to me to talk SCSI or NVME between storages if some storages have SCSI, other NVME disks or LVM volumes, or mixed setup. IBTRS allows to just send or request read of an sg-list between machines over rdma - the very minimum required to transport a BIO. It would in-deed support our case with the library if we would propose at least two users of it. We now only have a very early stage prototype capable of organizing storages in pools, multiplexing io between different storages, etc. sitting on top of ibtrs, it's not functional yet. On the other hand ibnbd with ibtrs alone already make over 10000 lines. > > All in all itbrs is a library to establish a "fat", multipath, > > autoreconnectable connection between two hosts on top of rdma, > > optimized for transport of IO traffic. > > That is also dictating a wire-protocol which makes it useless to pretty > much any other consumer. Personally, I don't see how this library > would ever be used outside of your ibnbd. Its true, IBTRS also imposes a protocol for connection establishment and IO path. I think at least the IO part we did reduce to a bare minimum: 350 * Write * 351 352 1. When processing a write request client selects one of the memory chunks 353 on the server side and rdma writes there the user data, user header and the 354 IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only 355 contains size of the user header. The client tells the server which chunk has 356 been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by 357 using the IMM field. 358 359 2. When confirming a write request server sends an "empty" rdma message with 360 an immediate field. The 32 bit field is used to specify the outstanding 361 inflight IO and for the error code. 362 363 CLT SRV 364 usr_data + usr_hdr + ibtrs_msg_rdma_write -----------------> [IBTRS_IO_REQ_IMM] 365 [IBTRS_IO_RSP_IMM] <----------------- (id + errno) 366 367 * Read * 368 369 1. When processing a read request client selects one of the memory chunks 370 on the server side and rdma writes there the user header and the 371 IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of 372 the user header, flags (specifying if memory invalidation is necessary) and the 373 list of addresses along with keys for the data to be read into. 374 375 2. When confirming a read request server transfers the requested data first, 376 attaches an invalidation message if requested and finally an "empty" rdma 377 message with an immediate field. The 32 bit field is used to specify the 378 outstanding inflight IO and the error code. 379 380 CLT SRV 381 usr_hdr + ibtrs_msg_rdma_read --------------> [IBTRS_IO_REQ_IMM] 382 [IBTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno) 383 or in case client requested invalidation: 384 [IBTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno) > >> - ibtrs in general is using almost no infrastructure from the existing > >> kernel subsystems. Examples are: > >> - tag allocation mechanism (which I'm not clear why its needed) > > As you correctly noticed our client manages the buffers allocated and > > registered by the server on the connection establishment. Our tags are > > just a mechanism to take and release those buffers for incoming > > requests on client side. Since the buffers allocated by the server are > > to be shared between all the devices mapped from that server and all > > their HW queues (each having num_cpus of them) the mechanism behind > > get_tag/put_tag also takes care of the fairness. > > We have infrastructure for this, sbitmaps. AFAIR Roman did try to use sbitmap but found no benefits in terms of readability or number of lines: " What is left unchanged on IBTRS side but was suggested to modify: - Bart suggested to use sbitmap instead of calling find_first_zero_bit() and friends. I found calling pure bit API is more explicit in comparison to sbitmap - there is no need in using sbitmap_queue and all the power of wait queues, no benefits in terms of LoC as well." https://lwn.net/Articles/756994/ If sbitmap is a must for our use case from the infrastructure point of view, we will reiterate on it. > > >> - rdma rw abstraction similar to what we have in the core > > On the one hand we have only single IO related function: > > ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write > > with imm, or requests an rdma write with imm to be executed by the > > server. > > For sure you can enhance the rw API to have imm support? I'm not familiar with the architectural intention behind rw.c. Extending the API with the support of imm field is (I guess) doable. > > On the other hand we provide an abstraction to establish and > > manage what we call "session", which consist of multiple paths (to do > > failover and multipath with different policies), where each path > > consists of num_cpu rdma connections. > > That's fine, but it doesn't mean that it also needs to re-write > infrastructure that we already have. Do you refer to rw.c? > > Once you established a session > > you can add or remove paths from it on the fly. In case the connection > > to server is lost, the client does periodic attempts to reconnect > > automatically. On the server side you get just sg-lists with a > > direction READ or WRITE as requested by the client. We designed this > > interface not only as the minimum required to build a block device on > > top of rdma but also with a distributed raid in mind. > > I suggest you take a look at the rw API and use that in your transport. We will look into rw.c. Do you suggest we move the multipath and the multiple QPs per path and connection establishment on *top* of it or *into* it? > >> Another question, from what I understand from the code, the client > >> always rdma_writes data on writes (with imm) from a remote pool of > >> server buffers dedicated to it. Essentially all writes are immediate (no > >> rdma reads ever). How is that different than using send wrs to a set of > >> pre-posted recv buffers (like all others are doing)? Is it faster? > > At the very beginning of the project we did some measurements and saw, > > that it is faster. I'm not sure if this is still true > > Its not significantly faster (can't imagine why it would be). > What could make a difference is probably the fact that you never > do rdma reads for I/O writes which might be better. Also perhaps the > fact that you normally don't wait for send completions before completing > I/O (which is broken), and the fact that you batch recv operations. > > I would be interested to understand what indeed makes ibnbd run faster > though. Yes, we would like to understand this too. I will try increasing the inline_data_size on nvme in our benchmarks as the next step to check if this influences the results. > >> Also, given that the server pre-allocate a substantial amount of memory > >> for each connection, is it documented the requirements from the server > >> side? Usually kernel implementations (especially upstream ones) will > >> avoid imposing such large longstanding memory requirements on the system > >> by default. I don't have a firm stand on this, but wanted to highlight > >> this as you are sending this for upstream inclusion. > > We definitely need to stress that somewhere. Will include into readme > > and add to the cover letter next time. Our memory management is indeed > > basically absent in favor of performance: The server reserves > > queue_depth of say 512K buffers. Each buffer is used by client for > > single IO only, no matter how big the request is. So if client only > > issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming > > for lowest possible latency from the beginning. It is probably > > possible to implement some clever allocator on the server side which > > wouldn't affect the performance a lot. > > Or you can fallback to rdma_read like the rest of the ulps. We currently have a single round trip for every write IO: write + ack. Wouldn't switching to rdma_read make 2 round trips out of it: command + rdma_read + ack?
next prev parent reply index Thread overview: 123+ messages / expand[flat|nested] mbox.gz Atom feed top [not found] <email@example.com> 2019-06-20 15:03 ` [PATCH v4 01/25] sysfs: export sysfs_remove_file_self() Jack Wang 2019-09-23 17:21 ` Bart Van Assche 2019-09-25 9:30 ` Danil Kipnis 2019-07-09 9:55 ` [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Danil Kipnis 2019-07-09 11:00 ` Leon Romanovsky 2019-07-09 11:17 ` Greg KH 2019-07-09 11:57 ` Jinpu Wang 2019-07-09 13:32 ` Leon Romanovsky 2019-07-09 15:39 ` Bart Van Assche 2019-07-09 11:37 ` Jinpu Wang 2019-07-09 12:06 ` Jason Gunthorpe 2019-07-09 13:15 ` Jinpu Wang 2019-07-09 13:19 ` Jason Gunthorpe 2019-07-09 14:17 ` Jinpu Wang 2019-07-09 21:27 ` Sagi Grimberg 2019-07-19 13:12 ` Danil Kipnis 2019-07-10 14:55 ` Danil Kipnis 2019-07-09 12:04 ` Jason Gunthorpe 2019-07-09 19:45 ` Sagi Grimberg 2019-07-10 13:55 ` Jason Gunthorpe 2019-07-10 16:25 ` Sagi Grimberg 2019-07-10 17:25 ` Jason Gunthorpe 2019-07-10 19:11 ` Sagi Grimberg 2019-07-11 7:27 ` Danil Kipnis 2019-07-11 8:54 ` Danil Kipnis 2019-07-12 0:22 ` Sagi Grimberg 2019-07-12 7:57 ` Jinpu Wang 2019-07-12 19:40 ` Sagi Grimberg 2019-07-15 11:21 ` Jinpu Wang 2019-07-12 10:58 ` Danil Kipnis [this message] [not found] ` <firstname.lastname@example.org> 2019-07-09 15:10 ` [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Leon Romanovsky 2019-07-09 15:18 ` Jinpu Wang 2019-07-09 15:51 ` Leon Romanovsky 2019-09-13 23:56 ` Bart Van Assche 2019-09-19 10:30 ` Jinpu Wang [not found] ` <email@example.com> 2019-09-13 22:10 ` [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers Bart Van Assche 2019-09-15 14:30 ` Jinpu Wang 2019-09-16 5:27 ` Leon Romanovsky 2019-09-16 13:45 ` Bart Van Assche 2019-09-17 15:41 ` Leon Romanovsky 2019-09-17 15:52 ` Jinpu Wang 2019-09-16 7:08 ` Danil Kipnis 2019-09-16 14:57 ` Jinpu Wang 2019-09-16 17:25 ` Bart Van Assche 2019-09-17 12:27 ` Jinpu Wang 2019-09-16 15:39 ` Jinpu Wang 2019-09-18 15:26 ` Bart Van Assche 2019-09-18 16:11 ` Jinpu Wang [not found] ` <firstname.lastname@example.org> 2019-09-13 22:25 ` [PATCH v4 16/25] ibnbd: client: private header with client structs and functions Bart Van Assche 2019-09-17 16:36 ` Jinpu Wang 2019-09-25 23:43 ` Danil Kipnis 2019-09-26 10:00 ` Jinpu Wang [not found] ` <email@example.com> 2019-09-13 23:46 ` [PATCH v4 17/25] ibnbd: client: main functionality Bart Van Assche 2019-09-16 14:17 ` Danil Kipnis 2019-09-16 16:46 ` Bart Van Assche 2019-09-17 11:39 ` Danil Kipnis 2019-09-18 7:14 ` Danil Kipnis 2019-09-18 15:47 ` Bart Van Assche 2019-09-20 8:29 ` Danil Kipnis 2019-09-25 22:26 ` Danil Kipnis 2019-09-26 9:55 ` Roman Penyaev 2019-09-26 15:01 ` Bart Van Assche 2019-09-27 8:52 ` Roman Penyaev 2019-09-27 9:32 ` Danil Kipnis 2019-09-27 12:18 ` Danil Kipnis 2019-09-27 16:37 ` Bart Van Assche 2019-09-27 16:50 ` Roman Penyaev 2019-09-27 17:16 ` Bart Van Assche 2019-09-17 13:09 ` Jinpu Wang 2019-09-17 16:46 ` Bart Van Assche 2019-09-18 12:02 ` Jinpu Wang 2019-09-18 16:05 ` Jinpu Wang 2019-09-14 0:00 ` Bart Van Assche [not found] ` <firstname.lastname@example.org> 2019-09-13 23:58 ` [PATCH v4 24/25] ibnbd: a bit of documentation Bart Van Assche 2019-09-18 12:22 ` Jinpu Wang [not found] ` <email@example.com> 2019-09-18 16:28 ` [PATCH v4 18/25] ibnbd: client: sysfs interface functions Bart Van Assche 2019-09-19 15:55 ` Jinpu Wang [not found] ` <firstname.lastname@example.org> 2019-09-18 17:41 ` [PATCH v4 20/25] ibnbd: server: main functionality Bart Van Assche 2019-09-20 7:36 ` Danil Kipnis 2019-09-20 15:42 ` Bart Van Assche 2019-09-23 15:19 ` Danil Kipnis [not found] ` <email@example.com> 2019-09-18 21:46 ` [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev Bart Van Assche 2019-09-26 14:04 ` Jinpu Wang 2019-09-26 15:11 ` Bart Van Assche 2019-09-26 15:25 ` Danil Kipnis 2019-09-26 15:29 ` Bart Van Assche 2019-09-26 15:38 ` Danil Kipnis 2019-09-26 15:42 ` Jinpu Wang [not found] ` <firstname.lastname@example.org> 2019-09-23 17:44 ` [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections Bart Van Assche 2019-09-25 10:20 ` Danil Kipnis 2019-09-25 15:38 ` Bart Van Assche [not found] ` <email@example.com> 2019-09-23 21:51 ` [PATCH v4 06/25] ibtrs: client: main functionality Bart Van Assche 2019-09-25 17:36 ` Danil Kipnis 2019-09-25 18:55 ` Bart Van Assche 2019-09-25 20:50 ` Danil Kipnis 2019-09-25 21:08 ` Bart Van Assche 2019-09-25 21:16 ` Bart Van Assche 2019-09-25 22:53 ` Danil Kipnis 2019-09-25 23:21 ` Bart Van Assche 2019-09-26 9:16 ` Danil Kipnis [not found] ` <firstname.lastname@example.org> 2019-09-23 22:50 ` [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers Bart Van Assche 2019-09-25 21:45 ` Danil Kipnis 2019-09-25 21:57 ` Bart Van Assche 2019-09-27 8:56 ` Jinpu Wang [not found] ` <email@example.com> 2019-09-23 23:03 ` [PATCH v4 04/25] ibtrs: core: lib functions shared between client and server modules Bart Van Assche 2019-09-27 10:13 ` Jinpu Wang [not found] ` <firstname.lastname@example.org> 2019-09-23 23:05 ` [PATCH v4 05/25] ibtrs: client: private header with client structs and functions Bart Van Assche 2019-09-27 10:18 ` Jinpu Wang [not found] ` <email@example.com> 2019-09-23 23:15 ` [PATCH v4 07/25] ibtrs: client: statistics functions Bart Van Assche 2019-09-27 12:00 ` Jinpu Wang [not found] ` <firstname.lastname@example.org> 2019-09-23 23:21 ` [PATCH v4 09/25] ibtrs: server: private header with server structs and functions Bart Van Assche 2019-09-27 12:04 ` Jinpu Wang [not found] ` <email@example.com> 2019-09-23 23:49 ` [PATCH v4 10/25] ibtrs: server: main functionality Bart Van Assche 2019-09-27 15:03 ` Jinpu Wang 2019-09-27 15:11 ` Bart Van Assche 2019-09-27 15:19 ` Jinpu Wang [not found] ` <firstname.lastname@example.org> 2019-09-23 23:56 ` [PATCH v4 11/25] ibtrs: server: statistics functions Bart Van Assche 2019-10-02 15:15 ` Jinpu Wang 2019-10-02 15:42 ` Leon Romanovsky 2019-10-02 15:45 ` Jinpu Wang 2019-10-02 16:00 ` Leon Romanovsky [not found] ` <email@example.com> 2019-09-24 0:00 ` [PATCH v4 12/25] ibtrs: server: sysfs interface functions Bart Van Assche 2019-10-02 15:11 ` Jinpu Wang
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=CAHg0Huz93nUTYA8PcXOLaunBF0CT6yc-DcD8EhRJkw+GGL8c7A@mail.gmail.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Linux-RDMA Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/linux-rdma/0 linux-rdma/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 linux-rdma linux-rdma/ https://lore.kernel.org/linux-rdma \ firstname.lastname@example.org public-inbox-index linux-rdma Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rdma AGPL code for this site: git clone https://public-inbox.org/public-inbox.git