From: Danil Kipnis
Date: Wed, 10 Jul 2019 16:55:24 +0200
Subject: Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
To: Leon Romanovsky
Cc: Jack Wang, linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
 axboe@kernel.dk, Christoph Hellwig, Sagi Grimberg, bvanassche@acm.org,
 jgg@mellanox.com, dledford@redhat.com, Roman Pen, gregkh@linuxfoundation.org
In-Reply-To: <20190709110036.GQ7034@mtr-leonro.mtl.com>
X-Mailing-List: linux-block@vger.kernel.org

Hi Leon,

thanks for the feedback!

On Tue, Jul 9, 2019 at 1:00 PM Leon Romanovsky wrote:
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we have addressed all the requests provided by the community
> > and continue to keep our code up to date with the upstream kernel,
> > while maintaining an extra compatibility layer for older kernels in
> > our out-of-tree repository.
> > I understand that SRP and NVMEoF, which are in the kernel already, do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and, more
> > importantly, includes IBTRS - a general-purpose library to establish
> > connections and transport BIO-like read/write sg-lists over RDMA -
> > while SRP is targeting SCSI and NVMEoF is addressing NVME. While I
> > believe IBNBD does meet the kernel coding standards, it doesn't have
> > a lot of users, while SRP and NVMEoF are widely accepted. Do you
> > think it would make sense for us to rework our patchset and try
> > pushing it into the staging tree first, so that we can prove IBNBD is
> > well maintained and beneficial for the eco-system, and find a proper
> > location for it within the block/rdma subsystems? This would make it
> > easier for people to try it out and would also be a huge step for us
> > in terms of maintenance effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top
> > of RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in
> > the near future). Do you think it would make sense to rename the
> > driver to RNBD/RTRS?
>
> It is better to avoid the "staging" tree, because it will lack the
> attention of relevant people and your efforts will be lost once you try
> to move out of staging. We all remember Lustre and don't want to see
> that again.
>
> Back then, you were asked to provide support for the claim of
> performance superiority.

I have only theories of why IBNBD is showing better numbers than NVMEoF:

1. The way we utilize the MQ framework in IBNBD.
   We promise to have queue_depth (say 512) requests on each of the
   num_cpus hardware queues of each device, but in fact we only have
   queue_depth for the whole "session" toward a given server. The moment
   we have queue_depth inflights we need to stop the queue (on a device
   on a cpu) on which we get more requests, and start it again after
   some requests have completed. We maintain per-cpu lists of stopped HW
   queues and a bitmap showing which lists are not empty, etc., and wake
   the queues up in a round-robin fashion to avoid starvation of any
   device.

2. We only do RDMA writes with immediate data.
   A server reserves queue_depth buffers of max_io_size for a given
   client. The client manages those itself. The client uses the imm
   field to tell the server which buffer has been written (and where),
   and the server uses the imm field to send back the errno. If our
   max_io_size is 64K, queue_depth is 512 and the client only ever
   issues 4K IOs, then 60K*512 of memory is wasted. On the other hand,
   we do no buffer allocation/registration in the IO path on the server
   side. The server sends the rdma addresses and keys of those
   preregistered buffers on connection establishment and
   deallocates/unregisters them when a session is closed. That's for
   writes. For reads, the client registers the user buffers (using fast
   registration) and sends the addresses and keys to the server (with an
   rdma write with imm). The server rdma-writes into those buffers. The
   client does the unregistering/invalidation and completes the request.

> Can you please share any numbers with us?

Apart from github
(https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3),
the performance results for v5.2-rc3 on two different systems can be
accessed under dcd.ionos.com/ibnbd-performance-report. The page allows
one to filter out the test scenarios of interest for comparison.

> Thanks