From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=DuRR=VI=vger.kernel.org=linux-block-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-0.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID,
	DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 7362DC74A4B
	for <linux-block@archiver.kernel.org>; Thu, 11 Jul 2019 08:54:16 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 4C1CF20872
	for <linux-block@archiver.kernel.org>; Thu, 11 Jul 2019 08:54:16 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=cloud.ionos.com header.i=@cloud.ionos.com header.b="WkvGTnj8"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727991AbfGKIyP (ORCPT <rfc822;linux-block@archiver.kernel.org>);
        Thu, 11 Jul 2019 04:54:15 -0400
Received: from mail-io1-f65.google.com ([209.85.166.65]:33584 "EHLO
        mail-io1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1727974AbfGKIyP (ORCPT
        <rfc822;linux-block@vger.kernel.org>);
        Thu, 11 Jul 2019 04:54:15 -0400
Received: by mail-io1-f65.google.com with SMTP id z3so10939436iog.0
        for <linux-block@vger.kernel.org>; Thu, 11 Jul 2019 01:54:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=cloud.ionos.com; s=google;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=xB7gk9GV4LWD0anuaWOfarGLjo+bG9F97Mtskx4V/Hw=;
        b=WkvGTnj8CGB7fL8BikYeBlqTCQ057+a3pULdM5uKVpcvXnQdFbte3ikKyd7nb3+gjs
         2Ejwwk7QdEBWLQSC8W6u1mG6PA//3+GBQ9BPP+ShensmB0NO1j5leJByivlOExenf/Mk
         KMafUef//v4s8JTxuHUHsbhZkfZ2Pf95N8RWO4ndOwZP3jLSKlhChHDXsr82mOb+NEzp
         i4pX2wBYuoCm5K1eGgDnkssmIZVYb1NMerQ4xbCl74WXA/UJgogS9YSZ3f1dAYbR5dEL
         FAyFsUJQha71pAqbc05epNCeU0/6yRvK6Y/JIyBFshAenxuhWLc83V5mw15GD5iqfa3K
         E9pg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=xB7gk9GV4LWD0anuaWOfarGLjo+bG9F97Mtskx4V/Hw=;
        b=tE/O91KM1h5POOO5OB83Gq6HyPCCHDsCUPkABCOe8I2rUWq79vwoeP0ao1Pcd0cIZy
         /MEMExV7Um3H0sAFZ+4ICTlKGuDZCZH4WXh73Ai7MShk2EljTqUp/7hurqGkwYwoBcEx
         bUlCDy/6scQs4oRiK/CRCONP8+GJQBtA7B7fuXD4zEt7dpx5s+DtfYb3L6q0Z5HcxpUF
         EHVylX4xZ6NmqGELG42v78eRIdaiFydt8ws6lEb9E9x3j9J/C30FvqUWPU12huWI01ye
         desDTuWt4QKxbUZEiE4ja4yd3TQ/oEKzZpxDAgP0XrixdpG93ykOFaYHcRx7LWxcwqdc
         cQ2g==
X-Gm-Message-State: APjAAAUHkwnWyrnVqHOo1rgGuZpX3hDe2s60dkbFurV7LH5LLM9JzAF8
        H+0N02IhGiGBBeqFsbczqZilq6zZeITOlHblvNVH
X-Google-Smtp-Source: APXvYqyWgsLLfKCpK5w/OwXMN+gzsFDhaFJYaKpDGrf45/1NkZTG0xJ1JEyZs0Umaj0wr0weFx59ptW2BSP1BKO30CQ=
X-Received: by 2002:a6b:ce19:: with SMTP id p25mr3186764iob.201.1562835253969;
 Thu, 11 Jul 2019 01:54:13 -0700 (PDT)
MIME-Version: 1.0
References: <20190620150337.7847-1-jinpuwang@gmail.com> <CAHg0HuzUaKs-ACHah-VdNHbot0_usx4ErMesVAw8+DFR63FFqw@mail.gmail.com>
 <a8f2f1d2-b5d9-92fc-40c8-090af0487723@grimberg.me>
In-Reply-To: <a8f2f1d2-b5d9-92fc-40c8-090af0487723@grimberg.me>
From:   Danil Kipnis <danil.kipnis@cloud.ionos.com>
Date:   Thu, 11 Jul 2019 10:54:02 +0200
Message-ID: <CAHg0HuxZvXH899=M4vC7BTH-bP2J35aTwsGhiGoC8AamD8gOyA@mail.gmail.com>
Subject: Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block
 Device (IBNBD)
To:     Sagi Grimberg <sagi@grimberg.me>
Cc:     Jack Wang <jinpuwang@gmail.com>, linux-block@vger.kernel.org,
        linux-rdma@vger.kernel.org, axboe@kernel.dk,
        Christoph Hellwig <hch@infradead.org>, bvanassche@acm.org,
        jgg@mellanox.com, dledford@redhat.com,
        Roman Pen <r.peniaev@gmail.com>, gregkh@linuxfoundation.org
Content-Type: text/plain; charset="UTF-8"
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-block.vger.kernel.org>
X-Mailing-List: linux-block@vger.kernel.org

Hi Sagi,

thanks a lot for the detailed reply. Answers inline below:

On Tue, Jul 9, 2019 at 9:46 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hi Danil and Jack,
>
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community
>
> That is not exactly correct AFAIR,
>
> My main issues which were raised before are:
> - IMO there isn't any justification to this ibtrs layering separation
>    given that the only user of this is your ibnbd. Unless you are
>    trying to submit another consumer, you should avoid adding another
>    subsystem that is not really general purpose.
We designed ibtrs not only with the IBNBD in mind but also as the
transport layer for a distributed SDS. We'd like to be able to do what
ceph is capable of (automatic up/down scaling of the storage cluster,
automatic recovery) but using in-kernel rdma-based IO transport
drivers, thin-provisioned volume managers, etc. to keep the highest
possible performance. That modest plan of ours should among others
cover for the following:
When using IBNBD/SRP/NVMEoF to export devices (say, thin-provisioned
volumes) from server to client and building an (md-)raid on top of the
imported devices on client side in order to provide for redundancy
across different machines, one gets very decent throughput and low
latency, since the IOs are sent in parallel to the storage machines.
One downside of this setup, is that the resync traffic has to flow
over the client, where the md-raid is sitting. Ideally the resync
traffic should flow directly between the two "legs" (storage machines)
of the raid. The server side of such a "distributed raid" capable of
this direct syncing between the array members would necessarily
require to have some logic on server side and hence could also sit on
top of ibtrs. (To continue the analogy, the "server" side of an
md-raid build on top of say two NVMEoF devices are just two block
devices, which couldn't communicate with each other)
All in all itbrs is a library to establish a "fat", multipath,
autoreconnectable connection between two hosts on top of rdma,
optimized for transport of IO traffic.

> - ibtrs in general is using almost no infrastructure from the existing
>    kernel subsystems. Examples are:
>    - tag allocation mechanism (which I'm not clear why its needed)
As you correctly noticed our client manages the buffers allocated and
registered by the server on the connection establishment. Our tags are
just a mechanism to take and release those buffers for incoming
requests on client side. Since the buffers allocated by the server are
to be shared between all the devices mapped from that server and all
their HW queues (each having num_cpus of them) the mechanism behind
get_tag/put_tag also takes care of the fairness.

>    - rdma rw abstraction similar to what we have in the core
On the one hand we have only single IO related function:
ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write
with imm, or requests an rdma write with imm to be executed by the
server. On the other hand we provide an abstraction to establish and
manage what we call "session", which consist of multiple paths (to do
failover and multipath with different policies), where each path
consists of num_cpu rdma connections. Once you established a session
you can add or remove paths from it on the fly. In case the connection
to server is lost, the client does periodic attempts to reconnect
automatically. On the server side you get just sg-lists with a
direction READ or WRITE as requested by the client. We designed this
interface not only as the minimum required to build a block device on
top of rdma but also with a distributed raid in mind.

>    - list_next_or_null_rr_rcu ??
We use that for multipath. The macro (and more importantly the way we
use it) has been reviewed by Linus and quit closely by Paul E.
McKenney. AFAIR the conclusion was that Romans implementation is
correct, but too tricky to use correctly in order to be included into
kernel as a public interface. See https://lkml.org/lkml/2018/5/18/659

>    - few other examples sprinkled around..
To my best knowledge we addressed everything we got comments on and
will definitely do so in the future.

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?
At the very beginning of the project we did some measurements and saw,
that it is faster. I'm not sure if this is still true, since the
hardware and the drivers and rdma subsystem did change in that time.
Also it seemed to make the code simpler.

> Also, given that the server pre-allocate a substantial amount of memory
> for each connection, is it documented the requirements from the server
> side? Usually kernel implementations (especially upstream ones) will
> avoid imposing such large longstanding memory requirements on the system
> by default. I don't have a firm stand on this, but wanted to highlight
> this as you are sending this for upstream inclusion.
We definitely need to stress that somewhere. Will include into readme
and add to the cover letter next time. Our memory management is indeed
basically absent in favor of performance: The server reserves
queue_depth of say 512K buffers. Each buffer is used by client for
single IO only, no matter how big the request is. So if client only
issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
for lowest possible latency from the beginning. It is probably
possible to implement some clever allocator on the server side which
wouldn't affect the performance a lot.

>
>   and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
>
> Overall, while I absolutely support your cause to lower your maintenance
> overhead by having this sit upstream, I don't see why this can be
> helpful to anyone else in the rdma community. If instead you can
> crystallize why/how ibnbd is faster than anything else, and perhaps
> contribute a common infrastructure piece (or enhance an existing one)
> such that other existing ulps can leverage, it will be a lot more
> compelling to include it upstream.
>
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA,
>
> But who needs it? Can other ulps use it or pieces of it? I keep failing
> to understand why is this a benefit if its specific to your ibnbd?
See above and please ask if you have more questions to this.

Thank you,
Danil.