Linux-RDMA Archive on lore.kernel.org
 help / color / Atom feed
From: Sagi Grimberg <sagi@grimberg.me>
To: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Cc: Jack Wang <jinpuwang@gmail.com>,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	axboe@kernel.dk, Christoph Hellwig <hch@infradead.org>,
	bvanassche@acm.org, jgg@mellanox.com, dledford@redhat.com,
	Roman Pen <r.peniaev@gmail.com>,
	gregkh@linuxfoundation.org
Subject: Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
Date: Thu, 11 Jul 2019 17:22:26 -0700
Message-ID: <aef765ed-4bb9-2211-05d0-b320cc3ac275@grimberg.me> (raw)
In-Reply-To: <CAHg0HuxZvXH899=M4vC7BTH-bP2J35aTwsGhiGoC8AamD8gOyA@mail.gmail.com>


>> My main issues which were raised before are:
>> - IMO there isn't any justification to this ibtrs layering separation
>>     given that the only user of this is your ibnbd. Unless you are
>>     trying to submit another consumer, you should avoid adding another
>>     subsystem that is not really general purpose.
> We designed ibtrs not only with the IBNBD in mind but also as the
> transport layer for a distributed SDS. We'd like to be able to do what
> ceph is capable of (automatic up/down scaling of the storage cluster,
> automatic recovery) but using in-kernel rdma-based IO transport
> drivers, thin-provisioned volume managers, etc. to keep the highest
> possible performance.

Sounds lovely, but still very much bound to your ibnbd. And that part
is not included in the patch set, so I still don't see why should this
be considered as a "generic" transport subsystem (it clearly isn't).

> All in all itbrs is a library to establish a "fat", multipath,
> autoreconnectable connection between two hosts on top of rdma,
> optimized for transport of IO traffic.

That is also dictating a wire-protocol which makes it useless to pretty
much any other consumer. Personally, I don't see how this library
would ever be used outside of your ibnbd.

>> - ibtrs in general is using almost no infrastructure from the existing
>>     kernel subsystems. Examples are:
>>     - tag allocation mechanism (which I'm not clear why its needed)
> As you correctly noticed our client manages the buffers allocated and
> registered by the server on the connection establishment. Our tags are
> just a mechanism to take and release those buffers for incoming
> requests on client side. Since the buffers allocated by the server are
> to be shared between all the devices mapped from that server and all
> their HW queues (each having num_cpus of them) the mechanism behind
> get_tag/put_tag also takes care of the fairness.

We have infrastructure for this, sbitmaps.

>>     - rdma rw abstraction similar to what we have in the core
> On the one hand we have only single IO related function:
> ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write
> with imm, or requests an rdma write with imm to be executed by the
> server.

For sure you can enhance the rw API to have imm support?

> On the other hand we provide an abstraction to establish and
> manage what we call "session", which consist of multiple paths (to do
> failover and multipath with different policies), where each path
> consists of num_cpu rdma connections.

That's fine, but it doesn't mean that it also needs to re-write
infrastructure that we already have.

> Once you established a session
> you can add or remove paths from it on the fly. In case the connection
> to server is lost, the client does periodic attempts to reconnect
> automatically. On the server side you get just sg-lists with a
> direction READ or WRITE as requested by the client. We designed this
> interface not only as the minimum required to build a block device on
> top of rdma but also with a distributed raid in mind.

I suggest you take a look at the rw API and use that in your transport.

>> Another question, from what I understand from the code, the client
>> always rdma_writes data on writes (with imm) from a remote pool of
>> server buffers dedicated to it. Essentially all writes are immediate (no
>> rdma reads ever). How is that different than using send wrs to a set of
>> pre-posted recv buffers (like all others are doing)? Is it faster?
> At the very beginning of the project we did some measurements and saw,
> that it is faster. I'm not sure if this is still true

Its not significantly faster (can't imagine why it would be).
What could make a difference is probably the fact that you never
do rdma reads for I/O writes which might be better. Also perhaps the
fact that you normally don't wait for send completions before completing
I/O (which is broken), and the fact that you batch recv operations.

I would be interested to understand what indeed makes ibnbd run faster
though.

>> Also, given that the server pre-allocate a substantial amount of memory
>> for each connection, is it documented the requirements from the server
>> side? Usually kernel implementations (especially upstream ones) will
>> avoid imposing such large longstanding memory requirements on the system
>> by default. I don't have a firm stand on this, but wanted to highlight
>> this as you are sending this for upstream inclusion.
> We definitely need to stress that somewhere. Will include into readme
> and add to the cover letter next time. Our memory management is indeed
> basically absent in favor of performance: The server reserves
> queue_depth of say 512K buffers. Each buffer is used by client for
> single IO only, no matter how big the request is. So if client only
> issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
> for lowest possible latency from the beginning. It is probably
> possible to implement some clever allocator on the server side which
> wouldn't affect the performance a lot.

Or you can fallback to rdma_read like the rest of the ulps.

  reply index

Thread overview: 123+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20190620150337.7847-1-jinpuwang@gmail.com>
2019-06-20 15:03 ` [PATCH v4 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
2019-09-23 17:21   ` Bart Van Assche
2019-09-25  9:30     ` Danil Kipnis
2019-07-09  9:55 ` [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Danil Kipnis
2019-07-09 11:00   ` Leon Romanovsky
2019-07-09 11:17     ` Greg KH
2019-07-09 11:57       ` Jinpu Wang
2019-07-09 13:32       ` Leon Romanovsky
2019-07-09 15:39       ` Bart Van Assche
2019-07-09 11:37     ` Jinpu Wang
2019-07-09 12:06       ` Jason Gunthorpe
2019-07-09 13:15         ` Jinpu Wang
2019-07-09 13:19           ` Jason Gunthorpe
2019-07-09 14:17             ` Jinpu Wang
2019-07-09 21:27             ` Sagi Grimberg
2019-07-19 13:12               ` Danil Kipnis
2019-07-10 14:55     ` Danil Kipnis
2019-07-09 12:04   ` Jason Gunthorpe
2019-07-09 19:45   ` Sagi Grimberg
2019-07-10 13:55     ` Jason Gunthorpe
2019-07-10 16:25       ` Sagi Grimberg
2019-07-10 17:25         ` Jason Gunthorpe
2019-07-10 19:11           ` Sagi Grimberg
2019-07-11  7:27             ` Danil Kipnis
2019-07-11  8:54     ` Danil Kipnis
2019-07-12  0:22       ` Sagi Grimberg [this message]
2019-07-12  7:57         ` Jinpu Wang
2019-07-12 19:40           ` Sagi Grimberg
2019-07-15 11:21             ` Jinpu Wang
2019-07-12 10:58         ` Danil Kipnis
     [not found] ` <20190620150337.7847-26-jinpuwang@gmail.com>
2019-07-09 15:10   ` [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Leon Romanovsky
2019-07-09 15:18     ` Jinpu Wang
2019-07-09 15:51       ` Leon Romanovsky
2019-09-13 23:56   ` Bart Van Assche
2019-09-19 10:30     ` Jinpu Wang
     [not found] ` <20190620150337.7847-16-jinpuwang@gmail.com>
2019-09-13 22:10   ` [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers Bart Van Assche
2019-09-15 14:30     ` Jinpu Wang
2019-09-16  5:27       ` Leon Romanovsky
2019-09-16 13:45         ` Bart Van Assche
2019-09-17 15:41           ` Leon Romanovsky
2019-09-17 15:52             ` Jinpu Wang
2019-09-16  7:08       ` Danil Kipnis
2019-09-16 14:57       ` Jinpu Wang
2019-09-16 17:25         ` Bart Van Assche
2019-09-17 12:27           ` Jinpu Wang
2019-09-16 15:39       ` Jinpu Wang
2019-09-18 15:26         ` Bart Van Assche
2019-09-18 16:11           ` Jinpu Wang
     [not found] ` <20190620150337.7847-17-jinpuwang@gmail.com>
2019-09-13 22:25   ` [PATCH v4 16/25] ibnbd: client: private header with client structs and functions Bart Van Assche
2019-09-17 16:36     ` Jinpu Wang
2019-09-25 23:43       ` Danil Kipnis
2019-09-26 10:00         ` Jinpu Wang
     [not found] ` <20190620150337.7847-18-jinpuwang@gmail.com>
2019-09-13 23:46   ` [PATCH v4 17/25] ibnbd: client: main functionality Bart Van Assche
2019-09-16 14:17     ` Danil Kipnis
2019-09-16 16:46       ` Bart Van Assche
2019-09-17 11:39         ` Danil Kipnis
2019-09-18  7:14           ` Danil Kipnis
2019-09-18 15:47             ` Bart Van Assche
2019-09-20  8:29               ` Danil Kipnis
2019-09-25 22:26               ` Danil Kipnis
2019-09-26  9:55                 ` Roman Penyaev
2019-09-26 15:01                   ` Bart Van Assche
2019-09-27  8:52                     ` Roman Penyaev
2019-09-27  9:32                       ` Danil Kipnis
2019-09-27 12:18                         ` Danil Kipnis
2019-09-27 16:37                       ` Bart Van Assche
2019-09-27 16:50                         ` Roman Penyaev
2019-09-27 17:16                           ` Bart Van Assche
2019-09-17 13:09     ` Jinpu Wang
2019-09-17 16:46       ` Bart Van Assche
2019-09-18 12:02         ` Jinpu Wang
2019-09-18 16:05     ` Jinpu Wang
2019-09-14  0:00   ` Bart Van Assche
     [not found] ` <20190620150337.7847-25-jinpuwang@gmail.com>
2019-09-13 23:58   ` [PATCH v4 24/25] ibnbd: a bit of documentation Bart Van Assche
2019-09-18 12:22     ` Jinpu Wang
     [not found] ` <20190620150337.7847-19-jinpuwang@gmail.com>
2019-09-18 16:28   ` [PATCH v4 18/25] ibnbd: client: sysfs interface functions Bart Van Assche
2019-09-19 15:55     ` Jinpu Wang
     [not found] ` <20190620150337.7847-21-jinpuwang@gmail.com>
2019-09-18 17:41   ` [PATCH v4 20/25] ibnbd: server: main functionality Bart Van Assche
2019-09-20  7:36     ` Danil Kipnis
2019-09-20 15:42       ` Bart Van Assche
2019-09-23 15:19         ` Danil Kipnis
     [not found] ` <20190620150337.7847-22-jinpuwang@gmail.com>
2019-09-18 21:46   ` [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev Bart Van Assche
2019-09-26 14:04     ` Jinpu Wang
2019-09-26 15:11       ` Bart Van Assche
2019-09-26 15:25         ` Danil Kipnis
2019-09-26 15:29           ` Bart Van Assche
2019-09-26 15:38             ` Danil Kipnis
2019-09-26 15:42               ` Jinpu Wang
     [not found] ` <20190620150337.7847-3-jinpuwang@gmail.com>
2019-09-23 17:44   ` [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections Bart Van Assche
2019-09-25 10:20     ` Danil Kipnis
2019-09-25 15:38       ` Bart Van Assche
     [not found] ` <20190620150337.7847-7-jinpuwang@gmail.com>
2019-09-23 21:51   ` [PATCH v4 06/25] ibtrs: client: main functionality Bart Van Assche
2019-09-25 17:36     ` Danil Kipnis
2019-09-25 18:55       ` Bart Van Assche
2019-09-25 20:50         ` Danil Kipnis
2019-09-25 21:08           ` Bart Van Assche
2019-09-25 21:16             ` Bart Van Assche
2019-09-25 22:53             ` Danil Kipnis
2019-09-25 23:21               ` Bart Van Assche
2019-09-26  9:16                 ` Danil Kipnis
     [not found] ` <20190620150337.7847-4-jinpuwang@gmail.com>
2019-09-23 22:50   ` [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers Bart Van Assche
2019-09-25 21:45     ` Danil Kipnis
2019-09-25 21:57       ` Bart Van Assche
2019-09-27  8:56     ` Jinpu Wang
     [not found] ` <20190620150337.7847-5-jinpuwang@gmail.com>
2019-09-23 23:03   ` [PATCH v4 04/25] ibtrs: core: lib functions shared between client and server modules Bart Van Assche
2019-09-27 10:13     ` Jinpu Wang
     [not found] ` <20190620150337.7847-6-jinpuwang@gmail.com>
2019-09-23 23:05   ` [PATCH v4 05/25] ibtrs: client: private header with client structs and functions Bart Van Assche
2019-09-27 10:18     ` Jinpu Wang
     [not found] ` <20190620150337.7847-8-jinpuwang@gmail.com>
2019-09-23 23:15   ` [PATCH v4 07/25] ibtrs: client: statistics functions Bart Van Assche
2019-09-27 12:00     ` Jinpu Wang
     [not found] ` <20190620150337.7847-10-jinpuwang@gmail.com>
2019-09-23 23:21   ` [PATCH v4 09/25] ibtrs: server: private header with server structs and functions Bart Van Assche
2019-09-27 12:04     ` Jinpu Wang
     [not found] ` <20190620150337.7847-11-jinpuwang@gmail.com>
2019-09-23 23:49   ` [PATCH v4 10/25] ibtrs: server: main functionality Bart Van Assche
2019-09-27 15:03     ` Jinpu Wang
2019-09-27 15:11       ` Bart Van Assche
2019-09-27 15:19         ` Jinpu Wang
     [not found] ` <20190620150337.7847-12-jinpuwang@gmail.com>
2019-09-23 23:56   ` [PATCH v4 11/25] ibtrs: server: statistics functions Bart Van Assche
2019-10-02 15:15     ` Jinpu Wang
2019-10-02 15:42       ` Leon Romanovsky
2019-10-02 15:45         ` Jinpu Wang
2019-10-02 16:00           ` Leon Romanovsky
     [not found] ` <20190620150337.7847-13-jinpuwang@gmail.com>
2019-09-24  0:00   ` [PATCH v4 12/25] ibtrs: server: sysfs interface functions Bart Van Assche
2019-10-02 15:11     ` Jinpu Wang

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aef765ed-4bb9-2211-05d0-b320cc3ac275@grimberg.me \
    --to=sagi@grimberg.me \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=danil.kipnis@cloud.ionos.com \
    --cc=dledford@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@infradead.org \
    --cc=jgg@mellanox.com \
    --cc=jinpuwang@gmail.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=r.peniaev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-RDMA Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rdma/0 linux-rdma/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rdma linux-rdma/ https://lore.kernel.org/linux-rdma \
		linux-rdma@vger.kernel.org linux-rdma@archiver.kernel.org
	public-inbox-index linux-rdma

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rdma


AGPL code for this site: git clone https://public-inbox.org/ public-inbox