Linux-RDMA Archive on lore.kernel.org
* [PATCH v4 01/25] sysfs: export sysfs_remove_file_self()
       [not found] <20190620150337.7847-1-jinpuwang@gmail.com>
@ 2019-06-20 15:03 ` Jack Wang
  2019-09-23 17:21   ` Bart Van Assche
  2019-07-09  9:55 ` [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Danil Kipnis
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 123+ messages in thread
From: Jack Wang @ 2019-06-20 15:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, jgg, dledford, danil.kipnis,
	rpenyaev, Roman Pen, linux-kernel

From: Roman Pen <roman.penyaev@profitbricks.com>

The function is going to be used by the RDMA transport module
in subsequent patches.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 fs/sysfs/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 130fc6fbcc03..1ff4672d7746 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -492,6 +492,7 @@ bool sysfs_remove_file_self(struct kobject *kobj, const struct attribute *attr)
 	kernfs_put(kn);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(sysfs_remove_file_self);
 
 void sysfs_remove_files(struct kobject *kobj, const struct attribute * const *ptr)
 {
-- 
2.17.1
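For context, exporting this symbol enables the usual sysfs "self-removal" pattern: a store() handler deletes its own attribute file before tearing down the object that owns it, so the teardown cannot deadlock waiting on the still-active sysfs access. A minimal sketch of how a module could use the export (the `my_dev` type and `my_dev_destroy()` helper are illustrative, not taken from the patchset):

```c
/* Sketch: a write-only "remove" attribute that destroys its own object.
 * sysfs_remove_file_self() detaches the calling attribute file and
 * returns true for exactly one concurrent caller, so the teardown
 * below runs once and cannot race with a second writer.
 */
static ssize_t remove_store(struct kobject *kobj, struct kobj_attribute *attr,
			    const char *buf, size_t count)
{
	struct my_dev *dev = container_of(kobj, struct my_dev, kobj);

	if (sysfs_remove_file_self(kobj, &attr->attr))
		my_dev_destroy(dev);	/* safe: our sysfs file is already gone */

	return count;
}
static struct kobj_attribute remove_attr = __ATTR_WO(remove);
```

Without the export, only built-in code could use this; modular RDMA transport code needs the EXPORT_SYMBOL_GPL added above.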

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
       [not found] <20190620150337.7847-1-jinpuwang@gmail.com>
  2019-06-20 15:03 ` [PATCH v4 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
@ 2019-07-09  9:55 ` Danil Kipnis
  2019-07-09 11:00   ` Leon Romanovsky
                     ` (2 more replies)
       [not found] ` <20190620150337.7847-26-jinpuwang@gmail.com>
                   ` (17 subsequent siblings)
  19 siblings, 3 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-07-09  9:55 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-block, linux-rdma, axboe, Christoph Hellwig, Sagi Grimberg,
	bvanassche, jgg, dledford, Roman Pen, gregkh

Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,

Could you please provide some feedback on the IBNBD driver and the
IBTRS library?
So far we have addressed all the requests raised by the community and
continue to keep our code up to date with the upstream kernel, while
maintaining an extra compatibility layer for older kernels in our
out-of-tree repository.
I understand that SRP and NVMEoF, which are already in the kernel,
provide equivalent functionality for the majority of use cases.
IBNBD, on the other hand, shows higher performance and, more
importantly, includes IBTRS, a general-purpose library to establish
connections and transport BIO-like read/write sg-lists over RDMA,
whereas SRP targets SCSI and NVMEoF addresses NVMe. While I believe
IBNBD does meet the kernel coding standards, it doesn't have a lot of
users, whereas SRP and NVMEoF are widely accepted. Do you think it
would make sense for us to rework our patchset and try pushing it to
the staging tree first, so that we can prove IBNBD is well maintained
and beneficial for the ecosystem, and find a proper location for it
within the block/rdma subsystems? This would make it easier for people
to try it out and would also be a huge step for us in terms of
maintenance effort.
The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
near future). Do you think it would make sense to rename the driver to
RNBD/RTRS?

Thank you,
Best Regards,
Danil

On Thu, Jun 20, 2019 at 5:03 PM Jack Wang <jinpuwang@gmail.com> wrote:
>
> Hi all,
>
> Here is v4 of the IBNBD/IBTRS patches, which contains minor changes.
>
>  Changelog
>  ---------
> v4:
>   o Protocol extended to transport IO priorities
>   o Support for Mellanox ConnectX-4/X-5
>   o Minor sysfs extensions (display access mode on server side)
>   o Bug fixes: cleaning up sysfs folders, race on deallocation of resources
>   o Style fixes
>
> v3:
>   o Sparse fixes:
>      - le32 -> le16 conversion
>      - pcpu and RCU wrong declaration
>      - sysfs: dynamically alloc array of sockaddr structures to reduce
>            size of a stack frame
>
>   o Rename sysfs folder on client and server sides to show source and
>     destination addresses of the connection, i.e.:
>            .../<session-name>/paths/<src@dst>/
>
>   o Remove external inclusions from Makefiles.
>   * https://lwn.net/Articles/756994/
>
> v2:
>   o IBNBD:
>      - No legacy request IO mode, only MQ is left.
>
>   o IBTRS:
>      - No FMR registration, only FR is left.
>
>   * https://lwn.net/Articles/755075/
>
> v1:
>   - IBTRS: load-balancing and IO fail-over using multipath features were added.
>
>   - Major parts of the code were rewritten, simplified and overall code
>     size was reduced by a quarter.
>
>   * https://lwn.net/Articles/746342/
>
> v0:
>   - Initial submission
>
>   * https://lwn.net/Articles/718181/
>
>
>  Introduction
>  -------------
>
> IBTRS (InfiniBand Transport) is a reliable high-speed transport library
> which allows for establishing a connection between client and server
> machines via RDMA. It is based on RDMA-CM, so it is also expected to
> support RoCE and iWARP, but we have mainly tested it in an IB
> environment. It is optimized to transfer (read/write) IO blocks in the
> sense that it follows BIO semantics: it provides the possibility to
> either write data from a scatter-gather list to the remote side or to
> request ("read") a data transfer from the remote side into a given set
> of buffers.
>
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality; in IBTRS terminology, an IBTRS path is a set of RDMA
> CMs, and a particular path is selected according to the load-balancing
> policy. IBTRS can be used by other components and is not bound to IBNBD.
>
>
> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access to a block device on
> the server over the IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally, IBNBD uses IBTRS as its RDMA transport library.
>
>
>    - IBNBD/IBTRS was developed in order to map thin-provisioned volumes,
>      thus the internal protocol is simple.
>    - IBTRS was developed as an independent RDMA transport library, which
>      supports fail-over and load-balancing policies using multipath, so
>      it can be used for other IO needs, not only for block devices.
>    - IBNBD/IBTRS is fast.
>      Old comparison results:
>      https://www.spinics.net/lists/linux-rdma/msg48799.html
>      New comparison results: see performance measurements section below.
>
> Key features of IBTRS transport library and IBNBD block device:
>
> o High throughput and low latency due to:
>    - Only two RDMA messages per IO.
>    - IMM InfiniBand messages on responses to reduce round trip latency.
>    - Simplified memory management: memory allocation happens once on
>      server side when IBTRS session is established.
>
> o IO fail-over and load-balancing by using multipath.  According to
>   our test loads, an additional path brings ~20% more bandwidth.
>
> o Simple configuration of IBNBD:
>    - Server side is completely passive: volumes do not need to be
>      explicitly exported.
>    - Only the IB port GID and the device path are needed on the client
>      side to map a block device.
>    - A device is remapped automatically, e.g. after a storage reboot.
>
> Commits for kernel can be found here:
>    https://github.com/ionos-enterprise/ibnbd/tree/linux-5.2-rc3--ibnbd-v4
> The out-of-tree modules are here:
>    https://github.com/ionos-enterprise/ibnbd
>
> Vault 2017 presentation:
>   https://events.static.linuxfound.org/sites/events/files/slides/IBNBD-Vault-2017.pdf
>
>  Performance measurements
>  ------------------------
>
> o IBNBD and NVMEoRDMA
>
>   Performance results for the v5.2-rc3 kernel
>   link: https://github.com/ionos-enterprise/ibnbd/tree/develop/performance/v4-v5.2-rc3
>
> Roman Pen (25):
>   sysfs: export sysfs_remove_file_self()
>   ibtrs: public interface header to establish RDMA connections
>   ibtrs: private headers with IBTRS protocol structs and helpers
>   ibtrs: core: lib functions shared between client and server modules
>   ibtrs: client: private header with client structs and functions
>   ibtrs: client: main functionality
>   ibtrs: client: statistics functions
>   ibtrs: client: sysfs interface functions
>   ibtrs: server: private header with server structs and functions
>   ibtrs: server: main functionality
>   ibtrs: server: statistics functions
>   ibtrs: server: sysfs interface functions
>   ibtrs: include client and server modules into kernel compilation
>   ibtrs: a bit of documentation
>   ibnbd: private headers with IBNBD protocol structs and helpers
>   ibnbd: client: private header with client structs and functions
>   ibnbd: client: main functionality
>   ibnbd: client: sysfs interface functions
>   ibnbd: server: private header with server structs and functions
>   ibnbd: server: main functionality
>   ibnbd: server: functionality for IO submission to file or block dev
>   ibnbd: server: sysfs interface functions
>   ibnbd: include client and server modules into kernel compilation
>   ibnbd: a bit of documentation
>   MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
>
>  MAINTAINERS                                   |   14 +
>  drivers/block/Kconfig                         |    2 +
>  drivers/block/Makefile                        |    1 +
>  drivers/block/ibnbd/Kconfig                   |   24 +
>  drivers/block/ibnbd/Makefile                  |   13 +
>  drivers/block/ibnbd/README                    |  315 ++
>  drivers/block/ibnbd/ibnbd-clt-sysfs.c         |  691 ++++
>  drivers/block/ibnbd/ibnbd-clt.c               | 1832 +++++++++++
>  drivers/block/ibnbd/ibnbd-clt.h               |  166 +
>  drivers/block/ibnbd/ibnbd-log.h               |   59 +
>  drivers/block/ibnbd/ibnbd-proto.h             |  378 +++
>  drivers/block/ibnbd/ibnbd-srv-dev.c           |  408 +++
>  drivers/block/ibnbd/ibnbd-srv-dev.h           |  143 +
>  drivers/block/ibnbd/ibnbd-srv-sysfs.c         |  270 ++
>  drivers/block/ibnbd/ibnbd-srv.c               |  945 ++++++
>  drivers/block/ibnbd/ibnbd-srv.h               |   94 +
>  drivers/infiniband/Kconfig                    |    1 +
>  drivers/infiniband/ulp/Makefile               |    1 +
>  drivers/infiniband/ulp/ibtrs/Kconfig          |   22 +
>  drivers/infiniband/ulp/ibtrs/Makefile         |   15 +
>  drivers/infiniband/ulp/ibtrs/README           |  385 +++
>  .../infiniband/ulp/ibtrs/ibtrs-clt-stats.c    |  447 +++
>  .../infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c    |  514 +++
>  drivers/infiniband/ulp/ibtrs/ibtrs-clt.c      | 2844 +++++++++++++++++
>  drivers/infiniband/ulp/ibtrs/ibtrs-clt.h      |  308 ++
>  drivers/infiniband/ulp/ibtrs/ibtrs-log.h      |   84 +
>  drivers/infiniband/ulp/ibtrs/ibtrs-pri.h      |  463 +++
>  .../infiniband/ulp/ibtrs/ibtrs-srv-stats.c    |  103 +
>  .../infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c    |  303 ++
>  drivers/infiniband/ulp/ibtrs/ibtrs-srv.c      | 1998 ++++++++++++
>  drivers/infiniband/ulp/ibtrs/ibtrs-srv.h      |  170 +
>  drivers/infiniband/ulp/ibtrs/ibtrs.c          |  610 ++++
>  drivers/infiniband/ulp/ibtrs/ibtrs.h          |  318 ++
>  fs/sysfs/file.c                               |    1 +
>  34 files changed, 13942 insertions(+)
>  create mode 100644 drivers/block/ibnbd/Kconfig
>  create mode 100644 drivers/block/ibnbd/Makefile
>  create mode 100644 drivers/block/ibnbd/README
>  create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-clt.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-clt.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-log.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-proto.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv.c
>  create mode 100644 drivers/block/ibnbd/ibnbd-srv.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>  create mode 100644 drivers/infiniband/ulp/ibtrs/README
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c
>  create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h
>
> --
> 2.17.1
>


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09  9:55 ` [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Danil Kipnis
@ 2019-07-09 11:00   ` Leon Romanovsky
  2019-07-09 11:17     ` Greg KH
                       ` (2 more replies)
  2019-07-09 12:04   ` Jason Gunthorpe
  2019-07-09 19:45   ` Sagi Grimberg
  2 siblings, 3 replies; 123+ messages in thread
From: Leon Romanovsky @ 2019-07-09 11:00 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	Sagi Grimberg, bvanassche, jgg, dledford, Roman Pen, gregkh

On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
>
> Could you please provide some feedback on the IBNBD driver and the
> IBTRS library?
> So far we have addressed all the requests raised by the community and
> continue to keep our code up to date with the upstream kernel, while
> maintaining an extra compatibility layer for older kernels in our
> out-of-tree repository.
> I understand that SRP and NVMEoF, which are already in the kernel,
> provide equivalent functionality for the majority of use cases.
> IBNBD, on the other hand, shows higher performance and, more
> importantly, includes IBTRS, a general-purpose library to establish
> connections and transport BIO-like read/write sg-lists over RDMA,
> whereas SRP targets SCSI and NVMEoF addresses NVMe. While I believe
> IBNBD does meet the kernel coding standards, it doesn't have a lot of
> users, whereas SRP and NVMEoF are widely accepted. Do you think it
> would make sense for us to rework our patchset and try pushing it to
> the staging tree first, so that we can prove IBNBD is well maintained
> and beneficial for the ecosystem, and find a proper location for it
> within the block/rdma subsystems? This would make it easier for people
> to try it out and would also be a huge step for us in terms of
> maintenance effort.
> The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
> near future). Do you think it would make sense to rename the driver to
> RNBD/RTRS?

It is better to avoid the "staging" tree, because it will lack the
attention of the relevant people, and your efforts will be lost once
you try to move out of staging. We all remember Lustre and don't want
to see that again.

Back then, you were asked to provide evidence of performance
superiority. Can you please share any numbers with us?

Thanks


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 11:00   ` Leon Romanovsky
@ 2019-07-09 11:17     ` Greg KH
  2019-07-09 11:57       ` Jinpu Wang
                         ` (2 more replies)
  2019-07-09 11:37     ` Jinpu Wang
  2019-07-10 14:55     ` Danil Kipnis
  2 siblings, 3 replies; 123+ messages in thread
From: Greg KH @ 2019-07-09 11:17 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, Sagi Grimberg, bvanassche, jgg, dledford,
	Roman Pen

On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback on the IBNBD driver and the
> > IBTRS library?
> > So far we have addressed all the requests raised by the community and
> > continue to keep our code up to date with the upstream kernel, while
> > maintaining an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF, which are already in the kernel,
> > provide equivalent functionality for the majority of use cases.
> > IBNBD, on the other hand, shows higher performance and, more
> > importantly, includes IBTRS, a general-purpose library to establish
> > connections and transport BIO-like read/write sg-lists over RDMA,
> > whereas SRP targets SCSI and NVMEoF addresses NVMe. While I believe
> > IBNBD does meet the kernel coding standards, it doesn't have a lot of
> > users, whereas SRP and NVMEoF are widely accepted. Do you think it
> > would make sense for us to rework our patchset and try pushing it to
> > the staging tree first, so that we can prove IBNBD is well maintained
> > and beneficial for the ecosystem, and find a proper location for it
> > within the block/rdma subsystems? This would make it easier for people
> > to try it out and would also be a huge step for us in terms of
> > maintenance effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
> 
> It is better to avoid the "staging" tree, because it will lack the
> attention of the relevant people, and your efforts will be lost once
> you try to move out of staging. We all remember Lustre and don't want
> to see that again.

That's up to the developers; that had nothing to do with the fact that
the code was in the staging tree.  If the Lustre developers had actually
done the requested work, it would have moved out of the staging tree.

So if these developers are willing to do the work to get something out
of staging, and into the "real" part of the kernel, I will gladly take
it.

But I will note that it is almost always easier to just do the work
ahead of time, and merge it in "correctly" than to go from staging into
the real part of the kernel.  But it's up to the developers what they
want to do.

thanks,

greg k-h


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 11:00   ` Leon Romanovsky
  2019-07-09 11:17     ` Greg KH
@ 2019-07-09 11:37     ` Jinpu Wang
  2019-07-09 12:06       ` Jason Gunthorpe
  2019-07-10 14:55     ` Danil Kipnis
  2 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-07-09 11:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Danil Kipnis, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, bvanassche, jgg, dledford,
	Roman Pen, Greg Kroah-Hartman, Jinpu Wang

Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback on the IBNBD driver and the
> > IBTRS library?
> > So far we have addressed all the requests raised by the community and
> > continue to keep our code up to date with the upstream kernel, while
> > maintaining an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF, which are already in the kernel,
> > provide equivalent functionality for the majority of use cases.
> > IBNBD, on the other hand, shows higher performance and, more
> > importantly, includes IBTRS, a general-purpose library to establish
> > connections and transport BIO-like read/write sg-lists over RDMA,
> > whereas SRP targets SCSI and NVMEoF addresses NVMe. While I believe
> > IBNBD does meet the kernel coding standards, it doesn't have a lot of
> > users, whereas SRP and NVMEoF are widely accepted. Do you think it
> > would make sense for us to rework our patchset and try pushing it to
> > the staging tree first, so that we can prove IBNBD is well maintained
> > and beneficial for the ecosystem, and find a proper location for it
> > within the block/rdma subsystems? This would make it easier for people
> > to try it out and would also be a huge step for us in terms of
> > maintenance effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
>
> It is better to avoid the "staging" tree, because it will lack the
> attention of the relevant people, and your efforts will be lost once
> you try to move out of staging. We all remember Lustre and don't want
> to see that again.
>
> Back then, you were asked to provide evidence of performance
> superiority. Can you please share any numbers with us?
Hi Leon,

Thanks for your feedback.

For performance numbers, Danil did an intensive benchmark and created
some PDFs with graphs here:
https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3

They include both single-path results and results for the different
multipath policies.

If you have any questions regarding the results, please let us know.

>
> Thanks

Thanks
Jack Wang


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 11:17     ` Greg KH
@ 2019-07-09 11:57       ` Jinpu Wang
  2019-07-09 13:32       ` Leon Romanovsky
  2019-07-09 15:39       ` Bart Van Assche
  2 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-07-09 11:57 UTC (permalink / raw)
  To: Greg KH
  Cc: Leon Romanovsky, Danil Kipnis, linux-block, linux-rdma,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, bvanassche, jgg,
	dledford, Roman Pen, Jinpu Wang

Greg KH <gregkh@linuxfoundation.org> wrote on Tue, Jul 9, 2019 at 1:17 PM:
>
> On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > >
> > > Could you please provide some feedback on the IBNBD driver and the
> > > IBTRS library?
> > > So far we have addressed all the requests raised by the community and
> > > continue to keep our code up to date with the upstream kernel, while
> > > maintaining an extra compatibility layer for older kernels in our
> > > out-of-tree repository.
> > > I understand that SRP and NVMEoF, which are already in the kernel,
> > > provide equivalent functionality for the majority of use cases.
> > > IBNBD, on the other hand, shows higher performance and, more
> > > importantly, includes IBTRS, a general-purpose library to establish
> > > connections and transport BIO-like read/write sg-lists over RDMA,
> > > whereas SRP targets SCSI and NVMEoF addresses NVMe. While I believe
> > > IBNBD does meet the kernel coding standards, it doesn't have a lot of
> > > users, whereas SRP and NVMEoF are widely accepted. Do you think it
> > > would make sense for us to rework our patchset and try pushing it to
> > > the staging tree first, so that we can prove IBNBD is well maintained
> > > and beneficial for the ecosystem, and find a proper location for it
> > > within the block/rdma subsystems? This would make it easier for people
> > > to try it out and would also be a huge step for us in terms of
> > > maintenance effort.
> > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
> > > near future). Do you think it would make sense to rename the driver to
> > > RNBD/RTRS?
> >
> > It is better to avoid the "staging" tree, because it will lack the
> > attention of the relevant people, and your efforts will be lost once
> > you try to move out of staging. We all remember Lustre and don't want
> > to see that again.
>
> That's up to the developers; that had nothing to do with the fact that
> the code was in the staging tree.  If the Lustre developers had actually
> done the requested work, it would have moved out of the staging tree.
>
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.
Thanks Greg,

This is encouraging; we ARE willing to do the work to get IBNBD/IBTRS
merged into the upstream kernel. We regularly contribute to the stable
kernel as well as upstream, backport patches, test stable rc releases,
etc. We believe in open source and the power of the community.

Sure, we will try to go the so-called "real kernel" route; this is
also what we are doing and have done in the past, but since v3 we have
not received any real feedback.

We will see how things go.

Thanks again!
Jack Wang @ 1 & 1 IONOS Cloud GmbH


>
> But I will note that it is almost always easier to just do the work
> ahead of time, and merge it in "correctly" than to go from staging into
> the real part of the kernel.  But it's up to the developers what they
> want to do.
>
> thanks,
>
> greg k-h


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09  9:55 ` [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Danil Kipnis
  2019-07-09 11:00   ` Leon Romanovsky
@ 2019-07-09 12:04   ` Jason Gunthorpe
  2019-07-09 19:45   ` Sagi Grimberg
  2 siblings, 0 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2019-07-09 12:04 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	Sagi Grimberg, bvanassche, dledford, Roman Pen, gregkh

On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> 
> Could you please provide some feedback to the IBNBD driver and the
> IBTRS library?

From my perspective you need to get people from the block community to
go over this.

It is the merge window right now, so nobody is really looking at
patches; you may need to resend it after rc1 to get attention.

Jason


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 11:37     ` Jinpu Wang
@ 2019-07-09 12:06       ` Jason Gunthorpe
  2019-07-09 13:15         ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2019-07-09 12:06 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Leon Romanovsky, Danil Kipnis, linux-block, linux-rdma,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, bvanassche,
	dledford, Roman Pen, Greg Kroah-Hartman, Jinpu Wang

On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
> >
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > >
> > > Could you please provide some feedback on the IBNBD driver and the
> > > IBTRS library?
> > > So far we have addressed all the requests raised by the community and
> > > continue to keep our code up to date with the upstream kernel, while
> > > maintaining an extra compatibility layer for older kernels in our
> > > out-of-tree repository.
> > > I understand that SRP and NVMEoF, which are already in the kernel,
> > > provide equivalent functionality for the majority of use cases.
> > > IBNBD, on the other hand, shows higher performance and, more
> > > importantly, includes IBTRS, a general-purpose library to establish
> > > connections and transport BIO-like read/write sg-lists over RDMA,
> > > whereas SRP targets SCSI and NVMEoF addresses NVMe. While I believe
> > > IBNBD does meet the kernel coding standards, it doesn't have a lot of
> > > users, whereas SRP and NVMEoF are widely accepted. Do you think it
> > > would make sense for us to rework our patchset and try pushing it to
> > > the staging tree first, so that we can prove IBNBD is well maintained
> > > and beneficial for the ecosystem, and find a proper location for it
> > > within the block/rdma subsystems? This would make it easier for people
> > > to try it out and would also be a huge step for us in terms of
> > > maintenance effort.
> > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
> > > near future). Do you think it would make sense to rename the driver to
> > > RNBD/RTRS?
> >
> > It is better to avoid the "staging" tree, because it will lack the
> > attention of the relevant people, and your efforts will be lost once
> > you try to move out of staging. We all remember Lustre and don't want
> > to see that again.
> >
> > Back then, you were asked to provide evidence of performance
> > superiority. Can you please share any numbers with us?
> Hi Leon,
> 
> Thanks for your feedback.
>
> For performance numbers, Danil did an intensive benchmark and created
> some PDFs with graphs here:
> https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
>
> They include both single-path results and results for the different
> multipath policies.
>
> If you have any questions regarding the results, please let us know.

I kind of recall that last time the perf numbers were skewed toward
IBNBD because the invalidation model for MR was wrong - did this get
fixed?

Jason


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 12:06       ` Jason Gunthorpe
@ 2019-07-09 13:15         ` Jinpu Wang
  2019-07-09 13:19           ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-07-09 13:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jinpu Wang, Leon Romanovsky, Danil Kipnis, linux-block,
	linux-rdma, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	bvanassche, dledford, Roman Pen, Greg Kroah-Hartman

On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > Leon Romanovsky <leon@kernel.org> wrote on Tue, Jul 9, 2019 at 1:00 PM:
> > >
> > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > > >
> > > > Could you please provide some feedback on the IBNBD driver and the
> > > > IBTRS library?
> > > > So far we have addressed all the requests raised by the community and
> > > > continue to keep our code up to date with the upstream kernel, while
> > > > maintaining an extra compatibility layer for older kernels in our
> > > > out-of-tree repository.
> > > > I understand that SRP and NVMEoF, which are already in the kernel,
> > > > provide equivalent functionality for the majority of use cases.
> > > > IBNBD, on the other hand, shows higher performance and, more
> > > > importantly, includes IBTRS, a general-purpose library to establish
> > > > connections and transport BIO-like read/write sg-lists over RDMA,
> > > > whereas SRP targets SCSI and NVMEoF addresses NVMe. While I believe
> > > > IBNBD does meet the kernel coding standards, it doesn't have a lot of
> > > > users, whereas SRP and NVMEoF are widely accepted. Do you think it
> > > > would make sense for us to rework our patchset and try pushing it to
> > > > the staging tree first, so that we can prove IBNBD is well maintained
> > > > and beneficial for the ecosystem, and find a proper location for it
> > > > within the block/rdma subsystems? This would make it easier for people
> > > > to try it out and would also be a huge step for us in terms of
> > > > maintenance effort.
> > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > > RDMA and is not bound to IB (we will evaluate IBTRS with RoCE in the
> > > > near future). Do you think it would make sense to rename the driver to
> > > > RNBD/RTRS?
> > >
> > > It is better to avoid "staging" tree, because it will lack attention of
> > > relevant people and your efforts will be lost once you will try to move
> > > out of staging. We are all remembering Lustre and don't want to see it
> > > again.
> > >
> > > Back then, you was asked to provide support for performance superiority.
> > > Can you please share any numbers with us?
> > Hi Leon,
> >
> > Thanks for you feedback.
> >
> > For performance numbers,  Danil did intensive benchmark, and create
> > some PDF with graphes here:
> > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> >
> > It includes both single path results also different multipath policy results.
> >
> > If you have any question regarding the results, please let us know.
>
> I kind of recall that last time the perf numbers were skewed toward
> IBNBD because the invalidation model for MR was wrong - did this get
> fixed?
>
> Jason

Thanks, Jason, for the feedback.
Can you be more specific about "the invalidation model for MR was wrong"?

I checked the history of the email thread and only found
"I think from the RDMA side, before we accept something like this, I'd
like to hear from Christoph, Chuck or Sagi that the dataplane
implementation of this is correct, eg it uses the MRs properly and
invalidates at the right time, sequences with dma_ops as required,
etc.
"
And no reply from any of you since then.

Thanks,
Jack


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 13:15         ` Jinpu Wang
@ 2019-07-09 13:19           ` Jason Gunthorpe
  2019-07-09 14:17             ` Jinpu Wang
  2019-07-09 21:27             ` Sagi Grimberg
  0 siblings, 2 replies; 123+ messages in thread
From: Jason Gunthorpe @ 2019-07-09 13:19 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jinpu Wang, Leon Romanovsky, Danil Kipnis, linux-block,
	linux-rdma, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	bvanassche, dledford, Roman Pen, Greg Kroah-Hartman

On Tue, Jul 09, 2019 at 03:15:46PM +0200, Jinpu Wang wrote:
> On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道:
> > > >
> > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > > > >
> > > > > Could you please provide some feedback to the IBNBD driver and the
> > > > > IBTRS library?
> > > > > So far we addressed all the requests provided by the community and
> > > > > continue to maintain our code up-to-date with the upstream kernel
> > > > > while having an extra compatibility layer for older kernels in our
> > > > > out-of-tree repository.
> > > > > I understand that SRP and NVMEoF which are in the kernel already do
> > > > > provide equivalent functionality for the majority of the use cases.
> > > > > IBNBD on the other hand is showing higher performance and more
> > > > > importantly includes the IBTRS - a general purpose library to
> > > > > establish connections and transport BIO-like read/write sg-lists over
> > > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > > > it would make sense for us to rework our patchset and try pushing it
> > > > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > > > beneficial for the eco-system, find a proper location for it within
> > > > > block/rdma subsystems? This would make it easier for people to try it
> > > > > out and would also be a huge step for us in terms of maintenance
> > > > > effort.
> > > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > > > near future). Do you think it would make sense to rename the driver to
> > > > > RNBD/RTRS?
> > > >
> > > > It is better to avoid "staging" tree, because it will lack attention of
> > > > relevant people and your efforts will be lost once you will try to move
> > > > out of staging. We are all remembering Lustre and don't want to see it
> > > > again.
> > > >
> > > > Back then, you was asked to provide support for performance superiority.
> > > > Can you please share any numbers with us?
> > > Hi Leon,
> > >
> > > Thanks for you feedback.
> > >
> > > For performance numbers,  Danil did intensive benchmark, and create
> > > some PDF with graphes here:
> > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> > >
> > > It includes both single path results also different multipath policy results.
> > >
> > > If you have any question regarding the results, please let us know.
> >
> > I kind of recall that last time the perf numbers were skewed toward
> > IBNBD because the invalidation model for MR was wrong - did this get
> > fixed?
> >
> > Jason
> 
> Thanks Jason for feedback.
> Can you be  more specific about  "the invalidation model for MR was wrong"

MRs must be invalidated before data is handed over to the block
layer. The driver can't leave MRs open for access and then touch the
memory the MR covers.

IMHO this is the most likely explanation for any performance difference
from nvme..

> I checked in the history of the email thread, only found
> "I think from the RDMA side, before we accept something like this, I'd
> like to hear from Christoph, Chuck or Sagi that the dataplane
> implementation of this is correct, eg it uses the MRs properly and
> invalidates at the right time, sequences with dma_ops as required,
> etc.
> "
> And no reply from any of you since then.

This task still needs to happen..

Jason


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 11:17     ` Greg KH
  2019-07-09 11:57       ` Jinpu Wang
@ 2019-07-09 13:32       ` Leon Romanovsky
  2019-07-09 15:39       ` Bart Van Assche
  2 siblings, 0 replies; 123+ messages in thread
From: Leon Romanovsky @ 2019-07-09 13:32 UTC (permalink / raw)
  To: Greg KH
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, Sagi Grimberg, bvanassche, jgg, dledford,
	Roman Pen

On Tue, Jul 09, 2019 at 01:17:37PM +0200, Greg KH wrote:
> On Tue, Jul 09, 2019 at 02:00:36PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > >
> > > Could you please provide some feedback to the IBNBD driver and the
> > > IBTRS library?
> > > So far we addressed all the requests provided by the community and
> > > continue to maintain our code up-to-date with the upstream kernel
> > > while having an extra compatibility layer for older kernels in our
> > > out-of-tree repository.
> > > I understand that SRP and NVMEoF which are in the kernel already do
> > > provide equivalent functionality for the majority of the use cases.
> > > IBNBD on the other hand is showing higher performance and more
> > > importantly includes the IBTRS - a general purpose library to
> > > establish connections and transport BIO-like read/write sg-lists over
> > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > it would make sense for us to rework our patchset and try pushing it
> > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > beneficial for the eco-system, find a proper location for it within
> > > block/rdma subsystems? This would make it easier for people to try it
> > > out and would also be a huge step for us in terms of maintenance
> > > effort.
> > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > near future). Do you think it would make sense to rename the driver to
> > > RNBD/RTRS?
> >
> > It is better to avoid "staging" tree, because it will lack attention of
> > relevant people and your efforts will be lost once you will try to move
> > out of staging. We are all remembering Lustre and don't want to see it
> > again.
>
> That's up to the developers, that had nothing to do with the fact that
> the code was in the staging tree.  If the Lustre developers had actually
> done the requested work, it would have moved out of the staging tree.
>
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.

Greg,

It is not a matter of how much *real* work the developers will do, but
a matter of guidance to do the *right* thing, which is hard to achieve
if the people mentioned at the beginning of this thread won't look at
the staging code.

>
> But I will note that it is almost always easier to just do the work
> ahead of time, and merge it in "correctly" than to go from staging into
> the real part of the kernel.  But it's up to the developers what they
> want to do.
>
> thanks,
>
> greg k-h


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 13:19           ` Jason Gunthorpe
@ 2019-07-09 14:17             ` Jinpu Wang
  2019-07-09 21:27             ` Sagi Grimberg
  1 sibling, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-07-09 14:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg, bvanassche,
	chuck.lever
  Cc: Jinpu Wang, Leon Romanovsky, Danil Kipnis, linux-block,
	linux-rdma, Jens Axboe, dledford, Roman Pen, Greg Kroah-Hartman

On Tue, Jul 9, 2019 at 3:19 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Jul 09, 2019 at 03:15:46PM +0200, Jinpu Wang wrote:
> > On Tue, Jul 9, 2019 at 2:06 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
> > >
> > > On Tue, Jul 09, 2019 at 01:37:39PM +0200, Jinpu Wang wrote:
> > > > Leon Romanovsky <leon@kernel.org> 于2019年7月9日周二 下午1:00写道:
> > > > >
> > > > > On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > > > > > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> > > > > >
> > > > > > Could you please provide some feedback to the IBNBD driver and the
> > > > > > IBTRS library?
> > > > > > So far we addressed all the requests provided by the community and
> > > > > > continue to maintain our code up-to-date with the upstream kernel
> > > > > > while having an extra compatibility layer for older kernels in our
> > > > > > out-of-tree repository.
> > > > > > I understand that SRP and NVMEoF which are in the kernel already do
> > > > > > provide equivalent functionality for the majority of the use cases.
> > > > > > IBNBD on the other hand is showing higher performance and more
> > > > > > importantly includes the IBTRS - a general purpose library to
> > > > > > establish connections and transport BIO-like read/write sg-lists over
> > > > > > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > > > > > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > > > > > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > > > > > it would make sense for us to rework our patchset and try pushing it
> > > > > > for staging tree first, so that we can proof IBNBD is well maintained,
> > > > > > beneficial for the eco-system, find a proper location for it within
> > > > > > block/rdma subsystems? This would make it easier for people to try it
> > > > > > out and would also be a huge step for us in terms of maintenance
> > > > > > effort.
> > > > > > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > > > > > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > > > > > near future). Do you think it would make sense to rename the driver to
> > > > > > RNBD/RTRS?
> > > > >
> > > > > It is better to avoid "staging" tree, because it will lack attention of
> > > > > relevant people and your efforts will be lost once you will try to move
> > > > > out of staging. We are all remembering Lustre and don't want to see it
> > > > > again.
> > > > >
> > > > > Back then, you was asked to provide support for performance superiority.
> > > > > Can you please share any numbers with us?
> > > > Hi Leon,
> > > >
> > > > Thanks for you feedback.
> > > >
> > > > For performance numbers,  Danil did intensive benchmark, and create
> > > > some PDF with graphes here:
> > > > https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3
> > > >
> > > > It includes both single path results also different multipath policy results.
> > > >
> > > > If you have any question regarding the results, please let us know.
> > >
> > > I kind of recall that last time the perf numbers were skewed toward
> > > IBNBD because the invalidation model for MR was wrong - did this get
> > > fixed?
> > >
> > > Jason
> >
> > Thanks Jason for feedback.
> > Can you be  more specific about  "the invalidation model for MR was wrong"
>
> MR's must be invalidated before data is handed over to the block
> layer. It can't leave MRs open for access and then touch the memory
> the MR covers.
>
> IMHO this is the most likely explanation for any performance difference
> from nvme..
>
> > I checked in the history of the email thread, only found
> > "I think from the RDMA side, before we accept something like this, I'd
> > like to hear from Christoph, Chuck or Sagi that the dataplane
> > implementation of this is correct, eg it uses the MRs properly and
> > invalidates at the right time, sequences with dma_ops as required,
> > etc.
> > "
> > And no reply from any of you since then.
>
> This task still needs to happen..
>
> Jason

We did extensive testing and cross-checked how iSER and NVMeoF do the
invalidation of MRs, and didn't find a problem.

+ Chuck
It would be appreciated if Christoph, Chuck, Sagi or Bart could take a
look, thank you in advance.

Thanks
Jack


* Re: [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
       [not found] ` <20190620150337.7847-26-jinpuwang@gmail.com>
@ 2019-07-09 15:10   ` Leon Romanovsky
  2019-07-09 15:18     ` Jinpu Wang
  2019-09-13 23:56   ` Bart Van Assche
  1 sibling, 1 reply; 123+ messages in thread
From: Leon Romanovsky @ 2019-07-09 15:10 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-block, linux-rdma, axboe, hch, sagi, bvanassche, jgg,
	dledford, danil.kipnis, rpenyaev, Roman Pen, Jack Wang

On Thu, Jun 20, 2019 at 05:03:37PM +0200, Jack Wang wrote:
> From: Roman Pen <roman.penyaev@profitbricks.com>
>
> Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
> Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
> ---
>  MAINTAINERS | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a6954776a37e..0b7fd93f738d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7590,6 +7590,20 @@ IBM ServeRAID RAID DRIVER
>  S:	Orphan
>  F:	drivers/scsi/ips.*
>
> +IBNBD BLOCK DRIVERS
> +M:	IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> +L:	linux-block@vger.kernel.org
> +S:	Maintained
> +T:	git git://github.com/profitbricks/ibnbd.git
> +F:	drivers/block/ibnbd/
> +
> +IBTRS TRANSPORT DRIVERS
> +M:	IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>

I don't know if it is a rule or not, but can you please add a real
person or persons to the maintainers list? Many times, those global
support lists are simply ignored.

> +L:	linux-rdma@vger.kernel.org
> +S:	Maintained
> +T:	git git://github.com/profitbricks/ibnbd.git

How do you imagine the patch flow for a ULP while your tree is
external to the RDMA tree?

> +F:	drivers/infiniband/ulp/ibtrs/
> +
>  ICH LPC AND GPIO DRIVER
>  M:	Peter Tyser <ptyser@xes-inc.com>
>  S:	Maintained
> --
> 2.17.1
>


* Re: [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
  2019-07-09 15:10   ` [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Leon Romanovsky
@ 2019-07-09 15:18     ` Jinpu Wang
  2019-07-09 15:51       ` Leon Romanovsky
  0 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-07-09 15:18 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Jason Gunthorpe, Doug Ledford, Danil Kipnis, rpenyaev, Roman Pen

On Tue, Jul 9, 2019 at 5:10 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Jun 20, 2019 at 05:03:37PM +0200, Jack Wang wrote:
> > From: Roman Pen <roman.penyaev@profitbricks.com>
> >
> > Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
> > Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
> > ---
> >  MAINTAINERS | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index a6954776a37e..0b7fd93f738d 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -7590,6 +7590,20 @@ IBM ServeRAID RAID DRIVER
> >  S:   Orphan
> >  F:   drivers/scsi/ips.*
> >
> > +IBNBD BLOCK DRIVERS
> > +M:   IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> > +L:   linux-block@vger.kernel.org
> > +S:   Maintained
> > +T:   git git://github.com/profitbricks/ibnbd.git
> > +F:   drivers/block/ibnbd/
> > +
> > +IBTRS TRANSPORT DRIVERS
> > +M:   IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
>
> I don't know if it rule or not, but can you please add real
> person/persons to Maintainers list? Many times, those global
> support lists are simply ignored.

Sure, we can use my and Danil's names in the next round.

>
> > +L:   linux-rdma@vger.kernel.org
> > +S:   Maintained
> > +T:   git git://github.com/profitbricks/ibnbd.git
>
> How did you imagine patch flow for ULP, while your tree is
> external to RDMA tree?

The plan was to gather the patches in the git tree and send them to
the list via git send-email. Do you accept pull requests from GitHub?
What is the preferred way?

Thanks Leon.
Jack
>
> > +F:   drivers/infiniband/ulp/ibtrs/
> > +
> >  ICH LPC AND GPIO DRIVER
> >  M:   Peter Tyser <ptyser@xes-inc.com>
> >  S:   Maintained
> > --
> > 2.17.1
> >


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 11:17     ` Greg KH
  2019-07-09 11:57       ` Jinpu Wang
  2019-07-09 13:32       ` Leon Romanovsky
@ 2019-07-09 15:39       ` Bart Van Assche
  2 siblings, 0 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-07-09 15:39 UTC (permalink / raw)
  To: Greg KH, Leon Romanovsky
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, Sagi Grimberg, jgg, dledford, Roman Pen

On 7/9/19 4:17 AM, Greg KH wrote:
> So if these developers are willing to do the work to get something out
> of staging, and into the "real" part of the kernel, I will gladly take
> it.

Linus once famously said "given enough eyeballs, all bugs are shallow".
There are already two block-over-RDMA driver pairs upstream (NVMeOF and
SRP). Accepting the IBTRS and IBNBD drivers upstream would reduce the
number of users of the upstream block-over-RDMA drivers and hence would
fragment the block-over-RDMA driver user base further. Additionally, I'm
not yet convinced that the interesting parts of IBNBD cannot be
integrated into the existing upstream drivers. So it's not clear to me
whether taking the IBTRS and IBNBD drivers upstream would help the Linux
user community.

Bart.



* Re: [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
  2019-07-09 15:18     ` Jinpu Wang
@ 2019-07-09 15:51       ` Leon Romanovsky
  0 siblings, 0 replies; 123+ messages in thread
From: Leon Romanovsky @ 2019-07-09 15:51 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Jason Gunthorpe, Doug Ledford, Danil Kipnis, rpenyaev, Roman Pen

On Tue, Jul 09, 2019 at 05:18:37PM +0200, Jinpu Wang wrote:
> On Tue, Jul 9, 2019 at 5:10 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Jun 20, 2019 at 05:03:37PM +0200, Jack Wang wrote:
> > > From: Roman Pen <roman.penyaev@profitbricks.com>
> > >
> > > Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
> > > Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
> > > ---
> > >  MAINTAINERS | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index a6954776a37e..0b7fd93f738d 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -7590,6 +7590,20 @@ IBM ServeRAID RAID DRIVER
> > >  S:   Orphan
> > >  F:   drivers/scsi/ips.*
> > >
> > > +IBNBD BLOCK DRIVERS
> > > +M:   IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> > > +L:   linux-block@vger.kernel.org
> > > +S:   Maintained
> > > +T:   git git://github.com/profitbricks/ibnbd.git
> > > +F:   drivers/block/ibnbd/
> > > +
> > > +IBTRS TRANSPORT DRIVERS
> > > +M:   IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> >
> > I don't know if it rule or not, but can you please add real
> > person/persons to Maintainers list? Many times, those global
> > support lists are simply ignored.
>
> Sure, we can use my and Danil 's name in next round.
>
> >
> > > +L:   linux-rdma@vger.kernel.org
> > > +S:   Maintained
> > > +T:   git git://github.com/profitbricks/ibnbd.git
> >
> > How did you imagine patch flow for ULP, while your tree is
> > external to RDMA tree?
>
> Plan was we gather the patch in the git tree, and
> send patches to the list via git send email, do we accept pull request
> from github?
> What the preferred way?

The preferred way is to start with sending patches directly
to the mailing list and allow the RDMA maintainers to collect and
apply them by themselves. It gives other people an easy way
to do cross-subsystem changes, and we are doing a lot of them.

Until you are asked to send PRs, the "T:" link should point to the RDMA subsystem.

Thanks

>
> Thanks Leon.
> Jack
> >
> > > +F:   drivers/infiniband/ulp/ibtrs/
> > > +
> > >  ICH LPC AND GPIO DRIVER
> > >  M:   Peter Tyser <ptyser@xes-inc.com>
> > >  S:   Maintained
> > > --
> > > 2.17.1
> > >


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09  9:55 ` [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Danil Kipnis
  2019-07-09 11:00   ` Leon Romanovsky
  2019-07-09 12:04   ` Jason Gunthorpe
@ 2019-07-09 19:45   ` Sagi Grimberg
  2019-07-10 13:55     ` Jason Gunthorpe
  2019-07-11  8:54     ` Danil Kipnis
  2 siblings, 2 replies; 123+ messages in thread
From: Sagi Grimberg @ 2019-07-09 19:45 UTC (permalink / raw)
  To: Danil Kipnis, Jack Wang
  Cc: linux-block, linux-rdma, axboe, Christoph Hellwig, bvanassche,
	jgg, dledford, Roman Pen, gregkh

Hi Danil and Jack,

> Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> 
> Could you please provide some feedback to the IBNBD driver and the
> IBTRS library?
> So far we addressed all the requests provided by the community

That is not exactly correct AFAIR,

My main issues which were raised before are:
- IMO there isn't any justification for this ibtrs layering separation
   given that the only user of this is your ibnbd. Unless you are
   trying to submit another consumer, you should avoid adding another
   subsystem that is not really general purpose.

- ibtrs in general is using almost no infrastructure from the existing
   kernel subsystems. Examples are:
   - tag allocation mechanism (which I'm not clear why it's needed)
   - rdma rw abstraction similar to what we have in the core
   - list_next_or_null_rr_rcu ??
   - few other examples sprinkled around..

Another question, from what I understand from the code, the client
always rdma_writes data on writes (with imm) from a remote pool of
server buffers dedicated to it. Essentially all writes are immediate (no
rdma reads ever). How is that different than using send wrs to a set of
pre-posted recv buffers (like all others are doing)? Is it faster?

Also, given that the server pre-allocates a substantial amount of memory
for each connection, are the requirements on the server side
documented? Usually kernel implementations (especially upstream ones) will
avoid imposing such large longstanding memory requirements on the system
by default. I don't have a firm stand on this, but wanted to highlight
this as you are sending this for upstream inclusion.

  and
> continue to maintain our code up-to-date with the upstream kernel
> while having an extra compatibility layer for older kernels in our
> out-of-tree repository.

Overall, while I absolutely support your cause to lower your maintenance
overhead by having this sit upstream, I don't see why this can be
helpful to anyone else in the rdma community. If instead you can
crystallize why/how ibnbd is faster than anything else, and perhaps
contribute a common infrastructure piece (or enhance an existing one)
such that other existing ulps can leverage, it will be a lot more
compelling to include it upstream.

> I understand that SRP and NVMEoF which are in the kernel already do
> provide equivalent functionality for the majority of the use cases.
> IBNBD on the other hand is showing higher performance and more
> importantly includes the IBTRS - a general purpose library to
> establish connections and transport BIO-like read/write sg-lists over
> RDMA,

But who needs it? Can other ULPs use it, or pieces of it? I keep failing
to understand why this is a benefit if it's specific to your ibnbd?


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 13:19           ` Jason Gunthorpe
  2019-07-09 14:17             ` Jinpu Wang
@ 2019-07-09 21:27             ` Sagi Grimberg
  2019-07-19 13:12               ` Danil Kipnis
  1 sibling, 1 reply; 123+ messages in thread
From: Sagi Grimberg @ 2019-07-09 21:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Jinpu Wang
  Cc: Jinpu Wang, Leon Romanovsky, Danil Kipnis, linux-block,
	linux-rdma, Jens Axboe, Christoph Hellwig, bvanassche, dledford,
	Roman Pen, Greg Kroah-Hartman


>> Thanks Jason for feedback.
>> Can you be  more specific about  "the invalidation model for MR was wrong"
> 
> MR's must be invalidated before data is handed over to the block
> layer. It can't leave MRs open for access and then touch the memory
> the MR covers.

Jason is referring to these fixes:
2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing 
a request")
4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work 
request has completed")
b4b591c87f2b ("nvme-rdma: don't suppress send completions")


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 19:45   ` Sagi Grimberg
@ 2019-07-10 13:55     ` Jason Gunthorpe
  2019-07-10 16:25       ` Sagi Grimberg
  2019-07-11  8:54     ` Danil Kipnis
  1 sibling, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2019-07-10 13:55 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh

On Tue, Jul 09, 2019 at 12:45:57PM -0700, Sagi Grimberg wrote:

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?

RDMA WRITE only is generally a bit faster, and if you use a buffer
pool in a smart way it is possible to get very good data packing. With
SEND the number of recvq entries dictates how big the rx buffer can
be, or you waste even more memory by using partial send buffers..

A scheme like this seems like a high performance idea, but on the
other hand, I have no idea how you could possibly manage invalidations
efficiently with a shared RX buffer pool...

The RXer has to push out an invalidation for the shared buffer pool
MR, but we don't have protocols for partial MR invalidation.

Which is back to my earlier thought that the main reason this performs
better is that it doesn't have synchronous MR invalidation.

Maybe this is fine, but it needs to be made very clear that it uses
this insecure operating model to get higher performance..

Jason


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 11:00   ` Leon Romanovsky
  2019-07-09 11:17     ` Greg KH
  2019-07-09 11:37     ` Jinpu Wang
@ 2019-07-10 14:55     ` Danil Kipnis
  2 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-07-10 14:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	Sagi Grimberg, bvanassche, jgg, dledford, Roman Pen, gregkh

Hi Leon,

thanks for the feedback!

On Tue, Jul 9, 2019 at 1:00 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Jul 09, 2019 at 11:55:03AM +0200, Danil Kipnis wrote:
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA, while SRP is targeting SCSI and NVMEoF is addressing NVME. While
> > I believe IBNBD does meet the kernel coding standards, it doesn't have
> > a lot of users, while SRP and NVMEoF are widely accepted. Do you think
> > it would make sense for us to rework our patchset and try pushing it
> > for staging tree first, so that we can proof IBNBD is well maintained,
> > beneficial for the eco-system, find a proper location for it within
> > block/rdma subsystems? This would make it easier for people to try it
> > out and would also be a huge step for us in terms of maintenance
> > effort.
> > The names IBNBD and IBTRS are in fact misleading. IBTRS sits on top of
> > RDMA and is not bound to IB (We will evaluate IBTRS with ROCE in the
> > near future). Do you think it would make sense to rename the driver to
> > RNBD/RTRS?
>
> It is better to avoid "staging" tree, because it will lack attention of
> relevant people and your efforts will be lost once you will try to move
> out of staging. We are all remembering Lustre and don't want to see it
> again.
>
> Back then, you were asked to provide support for performance superiority.

I only have theories as to why ibnbd is showing better numbers than nvmeof:
1. The way we utilize the MQ framework in IBNBD. We promise to have
queue_depth (say 512) requests on each of the num_cpus hardware queues
of each device, but in fact we have only queue_depth for the whole
"session" toward a given server. The moment we have queue_depth
inflights, we need to stop the queue (on a device on a CPU) we get more
requests on. We need to start them again after some requests are
completed. We maintain per-CPU lists of stopped HW queues, a bitmap
showing which lists are not empty, etc., to wake them up in a
round-robin fashion to avoid starvation of any devices.
2. We only do rdma writes with imm. A server reserves queue_depth of
max_io_size buffers for a given client. The client manages those
itself. The client uses the imm field to tell the server which buffer
has been written (and where), and the server uses the imm field to send
back the errno. If our max_io_size is 64K, queue_depth is 512 and the
client only issues 4K IOs all the time, then 60K*512 of memory is
wasted. On the other hand we do no buffer allocation/registration in
the IO path on the server side. The server sends rdma addresses and
keys to those preregistered buffers on connection establishment and
deallocates/unregisters them when a session is closed. That's for
writes. For reads, the client registers user buffers (after FR) and
sends the addresses and keys to the server (with an rdma write with
imm). The server rdma-writes into those buffers. The client does the
unregistering/invalidation and completes the request.

> Can you please share any numbers with us?
Apart from github
(https://github.com/ionos-enterprise/ibnbd/tree/master/performance/v4-v5.2-rc3),
the performance results for v5.2-rc3 on two different systems can be
accessed under dcd.ionos.com/ibnbd-performance-report. The page allows
one to filter the test scenarios interesting for comparison.

>
> Thanks

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-10 13:55     ` Jason Gunthorpe
@ 2019-07-10 16:25       ` Sagi Grimberg
  2019-07-10 17:25         ` Jason Gunthorpe
  0 siblings, 1 reply; 123+ messages in thread
From: Sagi Grimberg @ 2019-07-10 16:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh


>> Another question, from what I understand from the code, the client
>> always rdma_writes data on writes (with imm) from a remote pool of
>> server buffers dedicated to it. Essentially all writes are immediate (no
>> rdma reads ever). How is that different than using send wrs to a set of
>> pre-posted recv buffers (like all others are doing)? Is it faster?
> 
> RDMA WRITE only is generally a bit faster, and if you use a buffer
> pool in a smart way it is possible to get very good data packing.

There is no packing, it's used exactly as send/recv, but with a remote
buffer pool (a pool of 512K buffers); the client selects one and rdma
writes with imm to it.

> With
> SEND the number of recvq entries dictates how big the rx buffer can
> be, or you waste even more memory by using partial send buffers..

This is exactly how it is used here.

> A scheme like this seems like a high performance idea, but on the
> other side, I have no idea how you could possibly manage invalidations
> efficiently with a shared RX buffer pool...

There are no invalidations, this remote server pool is registered once
and long lived with the session.

> The RXer has to push out an invalidation for the shared buffer pool
> MR, but we don't have protocols for partial MR invalidation.
> 
> Which is back to my earlier thought that the main reason this perfoms
> better is because it doesn't have synchronous MR invalidation.

This issue only exists on the client side. The server never
invalidates any of its buffers.

> Maybe this is fine, but it needs to be made very clear that it uses
> this insecure operating model to get higher performance..

I still do not understand why this should give any noticeable
performance advantage.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-10 16:25       ` Sagi Grimberg
@ 2019-07-10 17:25         ` Jason Gunthorpe
  2019-07-10 19:11           ` Sagi Grimberg
  0 siblings, 1 reply; 123+ messages in thread
From: Jason Gunthorpe @ 2019-07-10 17:25 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh

On Wed, Jul 10, 2019 at 09:25:05AM -0700, Sagi Grimberg wrote:
> 
> > > Another question, from what I understand from the code, the client
> > > always rdma_writes data on writes (with imm) from a remote pool of
> > > server buffers dedicated to it. Essentially all writes are immediate (no
> > > rdma reads ever). How is that different than using send wrs to a set of
> > > pre-posted recv buffers (like all others are doing)? Is it faster?
> > 
> > RDMA WRITE only is generally a bit faster, and if you use a buffer
> > pool in a smart way it is possible to get very good data packing.
> 
> There is no packing, its used exactly as send/recv, but with a remote
> buffer pool (pool of 512K buffers) and the client selects one and rdma
> write with imm to it.

Well, that makes little sense then :)

> > Maybe this is fine, but it needs to be made very clear that it uses
> > this insecure operating model to get higher performance..
> 
> I still do not understand why this should give any noticeable
> performance advantage.

Usually omitting invalidations gives a healthy bump.

Also, RDMA WRITE is generally faster than READ at the HW level in
various ways.

Jason

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-10 17:25         ` Jason Gunthorpe
@ 2019-07-10 19:11           ` Sagi Grimberg
  2019-07-11  7:27             ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Sagi Grimberg @ 2019-07-10 19:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh


>> I still do not understand why this should give any noticeable
>> performance advantage.
> 
> Usually omitting invalidations gives a healthy bump.
> 
> Also, RDMA WRITE is generally faster than READ at the HW level in
> various ways.

Yes, but this should be essentially identical to running nvme-rdma
with 512KB of immediate-data (the nvme term is in-capsule data).

In the upstream nvme target we have inline_data_size port attribute
that is tunable for that (defaults to PAGE_SIZE).

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-10 19:11           ` Sagi Grimberg
@ 2019-07-11  7:27             ` Danil Kipnis
  0 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-07-11  7:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh

Hi Sagi,

thanks a lot for the analysis. I didn't know about the
inline_data_size parameter in nvmet. It is at PAGE_SIZE on our
systems.
Will rerun our benchmarks with
echo 2097152 > /sys/kernel/config/nvmet/ports/1/param_inline_data_size
echo 2097152 > /sys/kernel/config/nvmet/ports/2/param_inline_data_size
before enabling the port.
Best
Danil.

On Wed, Jul 10, 2019 at 9:11 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> I still do not understand why this should give any noticeable
> >> performance advantage.
> >
> > Usually omitting invalidations gives a healthy bump.
> >
> > Also, RDMA WRITE is generally faster than READ at the HW level in
> > various ways.
>
> Yes, but this should be essentially identical to running nvme-rdma
> with 512KB of immediate-data (the nvme term is in-capsule data).
>
> In the upstream nvme target we have inline_data_size port attribute
> that is tunable for that (defaults to PAGE_SIZE).



-- 
Danil Kipnis
Linux Kernel Developer

1&1 IONOS Cloud GmbH | Greifswalder Str. 207 | 10405 Berlin | Germany
E-mail: danil.kipnis@cloud.ionos.com | Web: www.ionos.de


Head Office: Berlin, Germany
District Court Berlin Charlottenburg, Registration number: HRB 125506 B
Executive Management: Christoph Steffens, Matthias Steinberg, Achim Weiss

Member of United Internet

This e-mail may contain confidential and/or privileged information. If
you are not the intended recipient of this e-mail, you are hereby
notified that saving, distribution or use of the content of this
e-mail in any way is prohibited. If you have received this e-mail in
error, please notify the sender and delete the e-mail.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 19:45   ` Sagi Grimberg
  2019-07-10 13:55     ` Jason Gunthorpe
@ 2019-07-11  8:54     ` Danil Kipnis
  2019-07-12  0:22       ` Sagi Grimberg
  1 sibling, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-07-11  8:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	bvanassche, jgg, dledford, Roman Pen, gregkh

Hi Sagi,

thanks a lot for the detailed reply. Answers inline below:

On Tue, Jul 9, 2019 at 9:46 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hi Danil and Jack,
>
> > Hallo Doug, Hallo Jason, Hallo Jens, Hallo Greg,
> >
> > Could you please provide some feedback to the IBNBD driver and the
> > IBTRS library?
> > So far we addressed all the requests provided by the community
>
> That is not exactly correct AFAIR,
>
> My main issues which were raised before are:
> - IMO there isn't any justification to this ibtrs layering separation
>    given that the only user of this is your ibnbd. Unless you are
>    trying to submit another consumer, you should avoid adding another
>    subsystem that is not really general purpose.
We designed ibtrs not only with IBNBD in mind but also as the
transport layer for a distributed SDS. We'd like to be able to do what
ceph is capable of (automatic up/down scaling of the storage cluster,
automatic recovery) but using in-kernel rdma-based IO transport
drivers, thin-provisioned volume managers, etc. to keep the highest
possible performance. That modest plan of ours should, among other
things, cover the following:
When using IBNBD/SRP/NVMEoF to export devices (say, thin-provisioned
volumes) from server to client and building an (md-)raid on top of the
imported devices on client side in order to provide for redundancy
across different machines, one gets very decent throughput and low
latency, since the IOs are sent in parallel to the storage machines.
One downside of this setup is that the resync traffic has to flow
over the client, where the md-raid is sitting. Ideally the resync
traffic should flow directly between the two "legs" (storage machines)
of the raid. The server side of such a "distributed raid" capable of
this direct syncing between the array members would necessarily
require some logic on the server side and hence could also sit on
top of ibtrs. (To continue the analogy, the "server" side of an
md-raid built on top of, say, two NVMEoF devices is just two block
devices, which couldn't communicate with each other.)
All in all ibtrs is a library to establish a "fat", multipath,
autoreconnectable connection between two hosts on top of rdma,
optimized for the transport of IO traffic.

> - ibtrs in general is using almost no infrastructure from the existing
>    kernel subsystems. Examples are:
>    - tag allocation mechanism (which I'm not clear why its needed)
As you correctly noticed, our client manages the buffers allocated and
registered by the server on connection establishment. Our tags are
just a mechanism to take and release those buffers for incoming
requests on the client side. Since the buffers allocated by the server
are shared between all the devices mapped from that server and all
their HW queues (each having num_cpus of them), the mechanism behind
get_tag/put_tag also takes care of fairness.

>    - rdma rw abstraction similar to what we have in the core
On the one hand we have only a single IO-related function:
ibtrs_clt_request(READ/WRITE, session,...), which executes an rdma
write with imm, or requests an rdma write with imm to be executed by
the server. On the other hand we provide an abstraction to establish
and manage what we call a "session", which consists of multiple paths
(to do failover and multipath with different policies), where each
path consists of num_cpu rdma connections. Once you have established a
session you can add or remove paths from it on the fly. In case the
connection to the server is lost, the client periodically attempts to
reconnect automatically. On the server side you get just sg-lists with
a direction, READ or WRITE, as requested by the client. We designed
this interface not only as the minimum required to build a block
device on top of rdma but also with a distributed raid in mind.

>    - list_next_or_null_rr_rcu ??
We use that for multipath. The macro (and more importantly the way we
use it) has been reviewed by Linus and quite closely by Paul E.
McKenney. AFAIR the conclusion was that Roman's implementation is
correct, but too tricky to use correctly to be included in the
kernel as a public interface. See https://lkml.org/lkml/2018/5/18/659

>    - few other examples sprinkled around..
To the best of my knowledge we addressed everything we got comments on
and will definitely do so in the future.

> Another question, from what I understand from the code, the client
> always rdma_writes data on writes (with imm) from a remote pool of
> server buffers dedicated to it. Essentially all writes are immediate (no
> rdma reads ever). How is that different than using send wrs to a set of
> pre-posted recv buffers (like all others are doing)? Is it faster?
At the very beginning of the project we did some measurements and saw
that it is faster. I'm not sure if this is still true, since the
hardware, the drivers and the rdma subsystem have changed since then.
It also seemed to make the code simpler.

> Also, given that the server pre-allocates a substantial amount of memory
> for each connection, are the requirements on the server side documented?
> Usually kernel implementations (especially upstream ones) will
> avoid imposing such large longstanding memory requirements on the system
> by default. I don't have a firm stand on this, but wanted to highlight
> this as you are sending this for upstream inclusion.
We definitely need to stress that somewhere. Will include it in the
readme and add it to the cover letter next time. Our memory management
is indeed basically absent in favor of performance: the server
reserves queue_depth of, say, 512K buffers. Each buffer is used by the
client for a single IO only, no matter how big the request is. So if
the client only issues 4K IOs, we waste 508K*queue_depth of memory. We
were aiming for the lowest possible latency from the beginning. It is
probably possible to implement some clever allocator on the server
side which wouldn't affect the performance a lot.

>
>   and
> > continue to maintain our code up-to-date with the upstream kernel
> > while having an extra compatibility layer for older kernels in our
> > out-of-tree repository.
>
> Overall, while I absolutely support your cause to lower your maintenance
> overhead by having this sit upstream, I don't see why this can be
> helpful to anyone else in the rdma community. If instead you can
> crystallize why/how ibnbd is faster than anything else, and perhaps
> contribute a common infrastructure piece (or enhance an existing one)
> such that other existing ulps can leverage, it will be a lot more
> compelling to include it upstream.
>
> > I understand that SRP and NVMEoF which are in the kernel already do
> > provide equivalent functionality for the majority of the use cases.
> > IBNBD on the other hand is showing higher performance and more
> > importantly includes the IBTRS - a general purpose library to
> > establish connections and transport BIO-like read/write sg-lists over
> > RDMA,
>
> But who needs it? Can other ulps use it or pieces of it? I keep failing
> to understand why this is a benefit if it's specific to your ibnbd?
See above and please ask if you have more questions to this.

Thank you,
Danil.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-11  8:54     ` Danil Kipnis
@ 2019-07-12  0:22       ` Sagi Grimberg
  2019-07-12  7:57         ` Jinpu Wang
  2019-07-12 10:58         ` Danil Kipnis
  0 siblings, 2 replies; 123+ messages in thread
From: Sagi Grimberg @ 2019-07-12  0:22 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	bvanassche, jgg, dledford, Roman Pen, gregkh


>> My main issues which were raised before are:
>> - IMO there isn't any justification to this ibtrs layering separation
>>     given that the only user of this is your ibnbd. Unless you are
>>     trying to submit another consumer, you should avoid adding another
>>     subsystem that is not really general purpose.
> We designed ibtrs not only with the IBNBD in mind but also as the
> transport layer for a distributed SDS. We'd like to be able to do what
> ceph is capable of (automatic up/down scaling of the storage cluster,
> automatic recovery) but using in-kernel rdma-based IO transport
> drivers, thin-provisioned volume managers, etc. to keep the highest
> possible performance.

Sounds lovely, but still very much bound to your ibnbd. And that part
is not included in the patch set, so I still don't see why this should
be considered a "generic" transport subsystem (it clearly isn't).

> All in all ibtrs is a library to establish a "fat", multipath,
> autoreconnectable connection between two hosts on top of rdma,
> optimized for transport of IO traffic.

That also dictates a wire protocol, which makes it useless to pretty
much any other consumer. Personally, I don't see how this library
would ever be used outside of your ibnbd.

>> - ibtrs in general is using almost no infrastructure from the existing
>>     kernel subsystems. Examples are:
>>     - tag allocation mechanism (which I'm not clear why its needed)
> As you correctly noticed our client manages the buffers allocated and
> registered by the server on the connection establishment. Our tags are
> just a mechanism to take and release those buffers for incoming
> requests on client side. Since the buffers allocated by the server are
> to be shared between all the devices mapped from that server and all
> their HW queues (each having num_cpus of them) the mechanism behind
> get_tag/put_tag also takes care of the fairness.

We have infrastructure for this, sbitmaps.

>>     - rdma rw abstraction similar to what we have in the core
> On the one hand we have only single IO related function:
> ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write
> with imm, or requests an rdma write with imm to be executed by the
> server.

For sure you can enhance the rw API to have imm support?

> On the other hand we provide an abstraction to establish and
> manage what we call "session", which consist of multiple paths (to do
> failover and multipath with different policies), where each path
> consists of num_cpu rdma connections.

That's fine, but it doesn't mean that it also needs to re-write
infrastructure that we already have.

> Once you established a session
> you can add or remove paths from it on the fly. In case the connection
> to server is lost, the client does periodic attempts to reconnect
> automatically. On the server side you get just sg-lists with a
> direction READ or WRITE as requested by the client. We designed this
> interface not only as the minimum required to build a block device on
> top of rdma but also with a distributed raid in mind.

I suggest you take a look at the rw API and use that in your transport.

>> Another question, from what I understand from the code, the client
>> always rdma_writes data on writes (with imm) from a remote pool of
>> server buffers dedicated to it. Essentially all writes are immediate (no
>> rdma reads ever). How is that different than using send wrs to a set of
>> pre-posted recv buffers (like all others are doing)? Is it faster?
> At the very beginning of the project we did some measurements and saw
> that it is faster. I'm not sure if this is still true

It's not significantly faster (can't imagine why it would be).
What could make a difference is probably the fact that you never
do rdma reads for I/O writes which might be better. Also perhaps the
fact that you normally don't wait for send completions before completing
I/O (which is broken), and the fact that you batch recv operations.

I would be interested to understand what indeed makes ibnbd run faster
though.

>> Also, given that the server pre-allocates a substantial amount of memory
>> for each connection, are the requirements on the server side documented?
>> Usually kernel implementations (especially upstream ones) will
>> avoid imposing such large longstanding memory requirements on the system
>> by default. I don't have a firm stand on this, but wanted to highlight
>> this as you are sending this for upstream inclusion.
> We definitely need to stress that somewhere. Will include into readme
> and add to the cover letter next time. Our memory management is indeed
> basically absent in favor of performance: The server reserves
> queue_depth of say 512K buffers. Each buffer is used by client for
> single IO only, no matter how big the request is. So if client only
> issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
> for lowest possible latency from the beginning. It is probably
> possible to implement some clever allocator on the server side which
> wouldn't affect the performance a lot.

Or you can fall back to rdma_read like the rest of the ulps.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-12  0:22       ` Sagi Grimberg
@ 2019-07-12  7:57         ` Jinpu Wang
  2019-07-12 19:40           ` Sagi Grimberg
  2019-07-12 10:58         ` Danil Kipnis
  1 sibling, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-07-12  7:57 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Danil Kipnis, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, bvanassche, jgg, dledford, Roman Pen,
	Greg Kroah-Hartman

Hi Sagi,

> >> Another question, from what I understand from the code, the client
> >> always rdma_writes data on writes (with imm) from a remote pool of
> >> server buffers dedicated to it. Essentially all writes are immediate (no
> >> rdma reads ever). How is that different than using send wrs to a set of
> >> pre-posted recv buffers (like all others are doing)? Is it faster?
> > At the very beginning of the project we did some measurements and saw
> > that it is faster. I'm not sure if this is still true
>
> It's not significantly faster (can't imagine why it would be).
> What could make a difference is probably the fact that you never
> do rdma reads for I/O writes which might be better. Also perhaps the
> fact that you normally don't wait for send completions before completing
> I/O (which is broken), and the fact that you batch recv operations.

I don't know how you came to the conclusion that we don't wait for send
completion before completing IO.

We do chain WRs on a successful read request from the server; see
function rdma_write_sg:

 318 static int rdma_write_sg(struct ibtrs_srv_op *id)
 319 {
 320         struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
 321         dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
 322         struct ibtrs_srv *srv = sess->srv;
 323         struct ib_send_wr inv_wr, imm_wr;
 324         struct ib_rdma_wr *wr = NULL;
snip
333         need_inval = le16_to_cpu(id->rd_msg->flags) &
IBTRS_MSG_NEED_INVAL_F;
snip
 357                 wr->wr.wr_cqe   = &io_comp_cqe;
 358                 wr->wr.sg_list  = list;
 359                 wr->wr.num_sge  = 1;
 360                 wr->remote_addr = le64_to_cpu(id->rd_msg->desc[i].addr);
 361                 wr->rkey        = le32_to_cpu(id->rd_msg->desc[i].key);
 snip
368                 if (i < (sg_cnt - 1))
 369                         wr->wr.next = &id->tx_wr[i + 1].wr;
 370                 else if (need_inval)
 371                         wr->wr.next = &inv_wr;
 372                 else
 373                         wr->wr.next = &imm_wr;
 374
 375                 wr->wr.opcode = IB_WR_RDMA_WRITE;
 376                 wr->wr.ex.imm_data = 0;
 377                 wr->wr.send_flags  = 0;
snip
 386         if (need_inval) {
 387                 inv_wr.next = &imm_wr;
 388                 inv_wr.wr_cqe = &io_comp_cqe;
 389                 inv_wr.sg_list = NULL;
 390                 inv_wr.num_sge = 0;
 391                 inv_wr.opcode = IB_WR_SEND_WITH_INV;
 392                 inv_wr.send_flags = 0;
 393                 inv_wr.ex.invalidate_rkey = rkey;
 394         }
 395         imm_wr.next = NULL;
 396         imm_wr.wr_cqe = &io_comp_cqe;
 397         imm_wr.sg_list = NULL;
 398         imm_wr.num_sge = 0;
 399         imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
 400         imm_wr.send_flags = flags;
 401         imm_wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id,
 402                                                              0, need_inval));
 403


When we need to invalidate remote memory, we chain the WRs together;
the last two are inv_wr and imm_wr. imm_wr is the last one, and this
is important: since RC QPs are ordered, we know that when we receive
IB_WC_RECV_RDMA_WITH_IMM and w_inval is true, the hardware has already
finished invalidating the MR.
If the server fails to invalidate, we will do local invalidation and
wait for completion.

On the client side:
284 static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
 285                               bool notify, bool can_wait)
 286 {
 287         struct ibtrs_clt_con *con = req->con;
 288         struct ibtrs_clt_sess *sess;
 289         struct ibtrs_clt *clt;
 290         int err;
 291
 292         if (WARN_ON(!req->in_use))
 293                 return;
 294         if (WARN_ON(!req->con))
 295                 return;
 296         sess = to_clt_sess(con->c.sess);
 297         clt = sess->clt;
 298
 299         if (req->sg_cnt) {
 300                 if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
 301                         /*
 302                          * We are here to invalidate RDMA read requests
 303                          * ourselves.  In normal scenario server should
 304                          * send INV for all requested RDMA reads, but
 305                          * we are here, thus two things could happen:
 306                          *
 307                          *    1.  this is failover, when errno != 0
 308                          *        and can_wait == 1,
 309                          *
 310                          *    2.  something totally bad happened and
 311                          *        server forgot to send INV, so we
 312                          *        should do that ourselves.
 313                          */
 314
 315                         if (likely(can_wait)) {
 316                                 req->need_inv_comp = true;
 317                         } else {
 318                                 /* This should be IO path, so always notify */
 319                                 WARN_ON(!notify);
 320                                 /* Save errno for INV callback */
 321                                 req->inv_errno = errno;
 322                         }
 323
 324                         err = ibtrs_inv_rkey(req);
 325                         if (unlikely(err)) {
 326                                 ibtrs_err(sess, "Send INV WR key=%#x: %d\n",
 327                                           req->mr->rkey, err);
 328                         } else if (likely(can_wait)) {
 329                                 wait_for_completion(&req->inv_comp);
 330                         } else {
 331                                 /*
 332                                  * Something went wrong, so request will be
 333                                  * completed from INV callback.
 334                                  */
 335                                 WARN_ON_ONCE(1);
 336
 337                                 return;
 338                         }
 339                 }
 340                 ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
 341                                 req->sg_cnt, req->dir);
 342         }
 343         if (sess->stats.enable_rdma_lat)
 344                 ibtrs_clt_update_rdma_lat(&sess->stats,
 345                                 req->dir == DMA_FROM_DEVICE,
 346                                 jiffies_to_msecs(jiffies - req->start_jiffies));
 347         ibtrs_clt_decrease_inflight(&sess->stats);
 348
 349         req->in_use = false;
 350         req->con = NULL;
 351
 352         if (notify)
 353                 req->conf(req->priv, errno);
 354 }

 356 static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id,
 357                            s16 errno, bool w_inval)
 358 {
 359         struct ibtrs_clt_io_req *req;
 360
 361         if (WARN_ON(msg_id >= sess->queue_depth))
 362                 return;
 363
 364         req = &sess->reqs[msg_id];
 366         /* Drop need_inv if server responded with invalidation */
 366         req->need_inv &= !w_inval;
 367         complete_rdma_req(req, errno, true, false);
 368 }

Hope this clears things up.

Regards,
Jack

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-12  0:22       ` Sagi Grimberg
  2019-07-12  7:57         ` Jinpu Wang
@ 2019-07-12 10:58         ` Danil Kipnis
  1 sibling, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-07-12 10:58 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jack Wang, linux-block, linux-rdma, axboe, Christoph Hellwig,
	bvanassche, jgg, dledford, Roman Pen, gregkh

On Fri, Jul 12, 2019 at 2:22 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> My main issues which were raised before are:
> >> - IMO there isn't any justification to this ibtrs layering separation
> >>     given that the only user of this is your ibnbd. Unless you are
> >>     trying to submit another consumer, you should avoid adding another
> >>     subsystem that is not really general purpose.
> > We designed ibtrs not only with the IBNBD in mind but also as the
> > transport layer for a distributed SDS. We'd like to be able to do what
> > ceph is capable of (automatic up/down scaling of the storage cluster,
> > automatic recovery) but using in-kernel rdma-based IO transport
> > drivers, thin-provisioned volume managers, etc. to keep the highest
> > possible performance.
>
> Sounds lovely, but still very much bound to your ibnbd. And that part
> is not included in the patch set, so I still don't see why should this
> be considered as a "generic" transport subsystem (it clearly isn't).
Having IBTRS sit on a storage enables that storage to communicate with
other storages (forward requests, request reads from other storages,
e.g. for sync traffic). IBTRS is generic in the sense that it removes
the strict separation into initiator (converting BIOs into some
hardware-specific protocol messages) and target (which forwards those
messages to some local device supporting that protocol).
It appears less generic to me to talk SCSI or NVME between storages if
some storages have SCSI disks, others NVME disks or LVM volumes, or a
mixed setup. IBTRS allows one to just send or request a read of an
sg-list between machines over rdma - the very minimum required to
transport a BIO.
It would indeed support our case for the library if we could propose
at least two users of it. We currently only have a very early-stage
prototype capable of organizing storages in pools, multiplexing io
between different storages, etc., sitting on top of ibtrs; it is not
functional yet. On the other hand, ibnbd together with ibtrs already
amounts to over 10000 lines.

> > All in all ibtrs is a library to establish a "fat", multipath,
> > autoreconnectable connection between two hosts on top of rdma,
> > optimized for transport of IO traffic.
>
> That is also dictating a wire-protocol which makes it useless to pretty
> much any other consumer. Personally, I don't see how this library
> would ever be used outside of your ibnbd.
It's true, IBTRS also imposes a protocol for connection establishment
and the IO path. I think at least the IO part we did reduce to a bare
minimum:
350 * Write *
351
352 1. When processing a write request client selects one of the memory chunks
353 on the server side and rdma writes there the user data, user header and the
354 IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
355 contains size of the user header. The client tells the server which chunk has
356 been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by
357 using the IMM field.
358
359 2. When confirming a write request server sends an "empty" rdma message with
360 an immediate field. The 32 bit field is used to specify the outstanding
361 inflight IO and for the error code.
362
363 CLT                                                          SRV
364 usr_data + usr_hdr + ibtrs_msg_rdma_write -----------------> [IBTRS_IO_REQ_IMM]
365 [IBTRS_IO_RSP_IMM]                        <----------------- (id + errno)
366
367 * Read *
368
369 1. When processing a read request client selects one of the memory chunks
370 on the server side and rdma writes there the user header and the
371 IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of
372 the user header, flags (specifying if memory invalidation is necessary) and the
373 list of addresses along with keys for the data to be read into.
374
375 2. When confirming a read request server transfers the requested data first,
376 attaches an invalidation message if requested and finally an "empty" rdma
377 message with an immediate field. The 32 bit field is used to specify the
378 outstanding inflight IO and the error code.
379
380 CLT                                           SRV
381 usr_hdr + ibtrs_msg_rdma_read --------------> [IBTRS_IO_REQ_IMM]
382 [IBTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
383 or in case client requested invalidation:
384 [IBTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
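For illustration, here is a userspace sketch of how a chunk id and the offset of the trailing message could be packed into the single 32-bit IMM field described above. The 16/16-bit split and the function names are assumptions for illustration only, not necessarily what ibtrs actually does:

```c
/*
 * Userspace sketch (not the actual ibtrs code): one way the chunk id
 * and the offset of the trailing IBTRS_MSG_RDMA_WRITE could be packed
 * into the 32-bit IMM field.  The 16/16 split is an assumption.
 */
#include <assert.h>
#include <stdint.h>

static uint32_t imm_encode(uint16_t chunk_id, uint16_t msg_off)
{
	/* high 16 bits: chunk id, low 16 bits: message offset */
	return ((uint32_t)chunk_id << 16) | msg_off;
}

static void imm_decode(uint32_t imm, uint16_t *chunk_id, uint16_t *msg_off)
{
	*chunk_id = imm >> 16;
	*msg_off = imm & 0xffff;
}
```

The server side would run imm_decode() on the received IMM value to locate the IBTRS_MSG_RDMA_WRITE inside the chunk.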

> >> - ibtrs in general is using almost no infrastructure from the existing
> >>     kernel subsystems. Examples are:
> >>     - tag allocation mechanism (which I'm not clear why its needed)
> > As you correctly noticed our client manages the buffers allocated and
> > registered by the server on the connection establishment. Our tags are
> > just a mechanism to take and release those buffers for incoming
> > requests on client side. Since the buffers allocated by the server are
> > to be shared between all the devices mapped from that server and all
> > their HW queues (each having num_cpus of them) the mechanism behind
> > get_tag/put_tag also takes care of the fairness.
>
> We have infrastructure for this, sbitmaps.
AFAIR Roman did try to use sbitmap but found no benefits in terms of
readability or number of lines:
" What is left unchanged on IBTRS side but was suggested to modify:
     - Bart suggested to use sbitmap instead of calling find_first_zero_bit()
  and friends.  I found calling pure bit API is more explicit in
  comparison to sbitmap - there is no need in using sbitmap_queue
  and all the power of wait queues, no benefits in terms of LoC
  as well." https://lwn.net/Articles/756994/

If sbitmap is a must for our use case from the infrastructure point of
view, we will reiterate on it.
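For reference, the plain-bit-API pattern the tag allocator boils down to looks roughly like the following userspace illustration (not the actual ibtrs code; in the kernel this would be find_first_zero_bit() plus test_and_set_bit() over a bitmap sized to queue_depth):

```c
/*
 * Userspace illustration of the plain-bit-API tag pattern: find a
 * clear bit, try to set it atomically, fall through on a race.
 */
#include <assert.h>
#include <limits.h>

#define QUEUE_DEPTH 64
#define BITS_PER_LONG_ (sizeof(unsigned long) * CHAR_BIT)

static unsigned long tag_map[(QUEUE_DEPTH + BITS_PER_LONG_ - 1) /
			     BITS_PER_LONG_];

static int get_tag(void)
{
	for (int bit = 0; bit < QUEUE_DEPTH; bit++) {
		unsigned long *w = &tag_map[bit / BITS_PER_LONG_];
		unsigned long mask = 1UL << (bit % BITS_PER_LONG_);

		/* kernel code would use test_and_set_bit() here */
		if (!(__atomic_fetch_or(w, mask, __ATOMIC_ACQ_REL) & mask))
			return bit;	/* this bit was free, now ours */
	}
	return -1;			/* all tags in use */
}

static void put_tag(int tag)
{
	/* kernel code would use clear_bit() here */
	__atomic_fetch_and(&tag_map[tag / BITS_PER_LONG_],
			   ~(1UL << (tag % BITS_PER_LONG_)),
			   __ATOMIC_ACQ_REL);
}
```

What sbitmap would add on top is mainly per-word cache-line spreading and the wait-queue machinery of sbitmap_queue; the fairness logic mentioned above sits outside this snippet either way.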

>
> >>     - rdma rw abstraction similar to what we have in the core
> > On the one hand we have only single IO related function:
> > ibtrs_clt_request(READ/WRITE, session,...), which executes rdma write
> > with imm, or requests an rdma write with imm to be executed by the
> > server.
>
> For sure you can enhance the rw API to have imm support?
I'm not familiar with the architectural intention behind rw.c.
Extending the API with the support of imm field is (I guess) doable.

> > On the other hand we provide an abstraction to establish and
> > manage what we call "session", which consist of multiple paths (to do
> > failover and multipath with different policies), where each path
> > consists of num_cpu rdma connections.
>
> That's fine, but it doesn't mean that it also needs to re-write
> infrastructure that we already have.
Do you refer to rw.c?

> > Once you established a session
> > you can add or remove paths from it on the fly. In case the connection
> > to server is lost, the client does periodic attempts to reconnect
> > automatically. On the server side you get just sg-lists with a
> > direction READ or WRITE as requested by the client. We designed this
> > interface not only as the minimum required to build a block device on
> > top of rdma but also with a distributed raid in mind.
>
> I suggest you take a look at the rw API and use that in your transport.
We will look into rw.c. Do you suggest we move the multipath and the
multiple QPs per path and connection establishment on *top* of it or
*into* it?

> >> Another question, from what I understand from the code, the client
> >> always rdma_writes data on writes (with imm) from a remote pool of
> >> server buffers dedicated to it. Essentially all writes are immediate (no
> >> rdma reads ever). How is that different than using send wrs to a set of
> >> pre-posted recv buffers (like all others are doing)? Is it faster?
> > At the very beginning of the project we did some measurements and saw,
> > that it is faster. I'm not sure if this is still true
>
> Its not significantly faster (can't imagine why it would be).
> What could make a difference is probably the fact that you never
> do rdma reads for I/O writes which might be better. Also perhaps the
> fact that you normally don't wait for send completions before completing
> I/O (which is broken), and the fact that you batch recv operations.
>
> I would be interested to understand what indeed makes ibnbd run faster
> though.
Yes, we would like to understand this too. I will try increasing the
inline_data_size on nvme in our benchmarks as the next step to check
if this influences the results.

> >> Also, given that the server pre-allocate a substantial amount of memory
> >> for each connection, is it documented the requirements from the server
> >> side? Usually kernel implementations (especially upstream ones) will
> >> avoid imposing such large longstanding memory requirements on the system
> >> by default. I don't have a firm stand on this, but wanted to highlight
> >> this as you are sending this for upstream inclusion.
> > We definitely need to stress that somewhere. Will include into readme
> > and add to the cover letter next time. Our memory management is indeed
> > basically absent in favor of performance: The server reserves
> > queue_depth of say 512K buffers. Each buffer is used by client for
> > single IO only, no matter how big the request is. So if client only
> > issues 4K IOs, we do waste 508*queue_depth K of memory. We were aiming
> > for lowest possible latency from the beginning. It is probably
> > possible to implement some clever allocator on the server side which
> > wouldn't affect the performance a lot.
>
> Or you can fallback to rdma_read like the rest of the ulps.
We currently have a single round trip for every write IO: write + ack.
Wouldn't switching to rdma_read make 2 round trips out of it: command
+ rdma_read + ack?


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-12  7:57         ` Jinpu Wang
@ 2019-07-12 19:40           ` Sagi Grimberg
  2019-07-15 11:21             ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Sagi Grimberg @ 2019-07-12 19:40 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Danil Kipnis, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, bvanassche, jgg, dledford, Roman Pen,
	Greg Kroah-Hartman


> Hi Sagi,
> 
>>>> Another question, from what I understand from the code, the client
>>>> always rdma_writes data on writes (with imm) from a remote pool of
>>>> server buffers dedicated to it. Essentially all writes are immediate (no
>>>> rdma reads ever). How is that different than using send wrs to a set of
>>>> pre-posted recv buffers (like all others are doing)? Is it faster?
>>> At the very beginning of the project we did some measurements and saw,
>>> that it is faster. I'm not sure if this is still true
>>
>> Its not significantly faster (can't imagine why it would be).
>> What could make a difference is probably the fact that you never
>> do rdma reads for I/O writes which might be better. Also perhaps the
>> fact that you normally don't wait for send completions before completing
>> I/O (which is broken), and the fact that you batch recv operations.
> 
> I don't know how you came to the conclusion we don't wait for send
> completion before completing IO.
> 
> We do chain wr on a successful read request from the server, see function
> rdma_write_sg,

I was referring to the client side


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-12 19:40           ` Sagi Grimberg
@ 2019-07-15 11:21             ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-07-15 11:21 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Danil Kipnis, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, bvanassche, jgg, dledford, Roman Pen,
	Greg Kroah-Hartman

Sagi Grimberg <sagi@grimberg.me> 于2019年7月12日周五 下午9:40写道:
>
>
> > Hi Sagi,
> >
> >>>> Another question, from what I understand from the code, the client
> >>>> always rdma_writes data on writes (with imm) from a remote pool of
> >>>> server buffers dedicated to it. Essentially all writes are immediate (no
> >>>> rdma reads ever). How is that different than using send wrs to a set of
> >>>> pre-posted recv buffers (like all others are doing)? Is it faster?
> >>> At the very beginning of the project we did some measurements and saw,
> >>> that it is faster. I'm not sure if this is still true
> >>
> >> Its not significantly faster (can't imagine why it would be).
> >> What could make a difference is probably the fact that you never
> >> do rdma reads for I/O writes which might be better. Also perhaps the
> >> fact that you normally don't wait for send completions before completing
> >> I/O (which is broken), and the fact that you batch recv operations.
> >
> > I don't know how you came to the conclusion we don't wait for send
> > completion before completing IO.
> >
> > We do chain wr on a successful read request from the server, see function
> > rdma_write_sg,
>
> I was referring to the client side
Hi Sagi,

I checked the three commits you mentioned in the earlier thread again,
and now I get your point.
You meant the behavior the following commits try to fix.

4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work
request has completed")
b4b591c87f2b ("nvme-rdma: don't suppress send completions")

In this sense, the ibtrs client side is not waiting for the completions
of RDMA WRITE WRs.
But we did it right for local invalidation.

I checked SRP/iser; they do not even wait for local invalidation, no
signal flag set.

If it's a problem, we should fix them too, maybe more.

My question is: do you see the behavior (HCA resending due to a dropped
ack) in the field, and is it possible to reproduce?

Thanks,
Jack


* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2019-07-09 21:27             ` Sagi Grimberg
@ 2019-07-19 13:12               ` Danil Kipnis
  0 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-07-19 13:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Jinpu Wang, Jinpu Wang, Leon Romanovsky,
	linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	bvanassche, dledford, Roman Pen, Greg Kroah-Hartman

Hi Sagi,

thanks a lot for the information. We are doing the right thing
regarding the invalidation (your 2f122e4f5107), but we do use
unsignalled sends and need to fix that. Please correct me if I'm
wrong: the patches (4af7f7ff92a4, b4b591c87f2b) fix the problem that
if the ack from the target is lost for some reason, the initiator's
HCA will resend the WR even after the request is completed.
But doesn't the same problem also persist the other way around: for
acks lost from the client? I mean, the target did a send for "read"
IOs; the client completed the request (after invalidation, refcount
dropped to 0, etc.), but the ack is not delivered to the HCA of the
target, so the target will also resend it. This seems unfixable, since
the client can't possibly know whether the server received its ack or not.
Doesn't the problem go away if rdma_conn_param.retry_count is just set to 0?

Thanks for your help,
Best,
Danil.

On Tue, Jul 9, 2019 at 11:27 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >> Thanks Jason for feedback.
> >> Can you be  more specific about  "the invalidation model for MR was wrong"
> >
> > MR's must be invalidated before data is handed over to the block
> > layer. It can't leave MRs open for access and then touch the memory
> > the MR covers.
>
> Jason is referring to these fixes:
> 2f122e4f5107 ("nvme-rdma: wait for local invalidation before completing
> a request")
> 4af7f7ff92a4 ("nvme-rdma: don't complete requests before a send work
> request has completed")
> b4b591c87f2b ("nvme-rdma: don't suppress send completions")


* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
       [not found] ` <20190620150337.7847-16-jinpuwang@gmail.com>
@ 2019-09-13 22:10   ` Bart Van Assche
  2019-09-15 14:30     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-13 22:10 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +#define ibnbd_log(fn, dev, fmt, ...) ({				\
> +	__builtin_choose_expr(						\
> +		__builtin_types_compatible_p(				\
> +			typeof(dev), struct ibnbd_clt_dev *),		\
> +		fn("<%s@%s> " fmt, (dev)->pathname,			\
> +		(dev)->sess->sessname,					\
> +		   ##__VA_ARGS__),					\
> +		__builtin_choose_expr(					\
> +			__builtin_types_compatible_p(typeof(dev),	\
> +					struct ibnbd_srv_sess_dev *),	\
> +			fn("<%s@%s>: " fmt, (dev)->pathname,		\
> +			   (dev)->sess->sessname, ##__VA_ARGS__),	\
> +			unknown_type()));				\
> +})

Please remove the __builtin_choose_expr() / 
__builtin_types_compatible_p() construct and split this macro into two 
macros or inline functions: one for struct ibnbd_clt_dev and another one 
for struct ibnbd_srv_sess_dev.
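An untested sketch of the split I have in mind: one logging macro per device type instead of __builtin_choose_expr(). The struct definitions below are reduced stand-ins so the sketch is self-contained; in the driver, fn would be pr_err()/pr_info() etc.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* reduced stand-ins for the real structs */
struct ibnbd_sess { char sessname[64]; };
struct ibnbd_clt_dev { char pathname[64]; struct ibnbd_sess *sess; };
struct ibnbd_srv_sess_dev { char pathname[64]; struct ibnbd_sess *sess; };

/* one macro per device type; no type dispatch needed */
#define ibnbd_clt_log(fn, dev, fmt, ...)				\
	fn("<%s@%s> " fmt, (dev)->pathname, (dev)->sess->sessname,	\
	   ##__VA_ARGS__)

#define ibnbd_srv_log(fn, dev, fmt, ...)				\
	fn("<%s@%s>: " fmt, (dev)->pathname, (dev)->sess->sessname,	\
	   ##__VA_ARGS__)

/* printf-like sink so the macros can be exercised in userspace */
static char log_buf[128];
static int log_to_buf(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(log_buf, sizeof(log_buf), fmt, ap);
	va_end(ap);
	return n;
}
```

The call sites then pick the right macro at compile time, which is both simpler and gives better error messages than unknown_type().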

> +#define IBNBD_PROTO_VER_MAJOR 2
> +#define IBNBD_PROTO_VER_MINOR 0
> +
> +#define IBNBD_PROTO_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
> +			       __stringify(IBNBD_PROTO_VER_MINOR)
> +
> +#ifndef IBNBD_VER_STRING
> +#define IBNBD_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
> +			 __stringify(IBNBD_PROTO_VER_MINOR)

Upstream code should not have a version number.

> +/* TODO: should be configurable */
> +#define IBTRS_PORT 1234

How about converting this macro into a kernel module parameter?
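E.g. something like this (untested sketch; the parameter name and description text are just suggestions):

```c
/* untested sketch: keep 1234 as the default, make it overridable */
static ushort ibtrs_port = 1234;
module_param_named(port, ibtrs_port, ushort, 0444);
MODULE_PARM_DESC(port, "Port number IBTRS listens on (default: 1234)");
```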

> +enum ibnbd_access_mode {
> +	IBNBD_ACCESS_RO,
> +	IBNBD_ACCESS_RW,
> +	IBNBD_ACCESS_MIGRATION,
> +};

Some more information about what IBNBD_ACCESS_MIGRATION represents would 
be welcome.

> +#define _IBNBD_FILEIO  0
> +#define _IBNBD_BLOCKIO 1
> +#define _IBNBD_AUTOIO  2
 >
> +enum ibnbd_io_mode {
> +	IBNBD_FILEIO = _IBNBD_FILEIO,
> +	IBNBD_BLOCKIO = _IBNBD_BLOCKIO,
> +	IBNBD_AUTOIO = _IBNBD_AUTOIO,
> +};

Since the IBNBD_* and _IBNBD_* constants have the same numerical value, 
are the former constants really necessary?

> +/**
> + * struct ibnbd_msg_sess_info - initial session info from client to server
> + * @hdr:		message header
> + * @ver:		IBNBD protocol version
> + */
> +struct ibnbd_msg_sess_info {
> +	struct ibnbd_msg_hdr hdr;
> +	u8		ver;
> +	u8		reserved[31];
> +};

Since the wire protocol is versioned, is it really necessary to add 31 
reserved bytes?

> +struct ibnbd_msg_sess_info_rsp {
> +	struct ibnbd_msg_hdr hdr;
> +	u8		ver;
> +	u8		reserved[31];
> +};

Same comment here.

> +/**
> + * struct ibnbd_msg_open_rsp - response message to IBNBD_MSG_OPEN
> + * @hdr:		message header
> + * @nsectors:		number of sectors

What is the size of a single sector?

> + * @device_id:		device_id on server side to identify the device

Please use the same order for the members in the kernel-doc header as in 
the structure.

> + * @queue_flags:	queue_flags of the device on server side

Where is the queue_flags member?

> + * @discard_granularity: size of the internal discard allocation unit
> + * @discard_alignment: offset from internal allocation assignment
> + * @physical_block_size: physical block size device supports
> + * @logical_block_size: logical block size device supports

What is the unit for these four members?

> + * @max_segments:	max segments hardware support in one transfer

Does 'hardware' refer to the RDMA adapter that transfers the IBNBD 
message or to the storage device? In the latter case, I assume that 
transfer refers to a DMA transaction?

> + * @io_mode:		io_mode device is opened.

Should a reference to enum ibnbd_io_mode be added?

> +	u8			__padding[10];

Why ten padding bytes? Does alignment really matter for a data structure 
like this one?

> +/**
> + * struct ibnbd_msg_io_old - message for I/O read/write for
> + * ver < IBNBD_PROTO_VER_MAJOR
> + * This structure is there only to know the size of the "old" message format
> + * @hdr:	message header
> + * @device_id:	device_id on server side to find the right device
> + * @sector:	bi_sector attribute from struct bio
> + * @rw:		bitmask, valid values are defined in enum ibnbd_io_flags
> + * @bi_size:    number of bytes for I/O read/write
> + * @prio:       priority
> + */
> +struct ibnbd_msg_io_old {
> +	struct ibnbd_msg_hdr hdr;
> +	__le32		device_id;
> +	__le64		sector;
> +	__le32		rw;
> +	__le32		bi_size;
> +};

Since this is the first version of IBNBD that is being sent upstream, I 
think that ibnbd_msg_io_old should be left out.

> +
> +/**
> + * struct ibnbd_msg_io - message for I/O read/write
> + * @hdr:	message header
> + * @device_id:	device_id on server side to find the right device
> + * @sector:	bi_sector attribute from struct bio
> + * @rw:		bitmask, valid values are defined in enum ibnbd_io_flags

enum ibnbd_io_flags doesn't look like a bitmask but rather like a bit 
field (https://en.wikipedia.org/wiki/Bit_field)?

> +static inline u32 ibnbd_to_bio_flags(u32 ibnbd_flags)
> +{
> +	u32 bio_flags;

The names ibnbd_flags and bio_flags are confusing since these two 
variables not only contain flags but also an operation. How about 
changing 'flags' into 'opf' or 'op_flags'?

> +static inline const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
> +{
> +	switch (mode) {
> +	case IBNBD_FILEIO:
> +		return "fileio";
> +	case IBNBD_BLOCKIO:
> +		return "blockio";
> +	case IBNBD_AUTOIO:
> +		return "autoio";
> +	default:
> +		return "unknown";
> +	}
> +}
> +
> +static inline const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
> +{
> +	switch (mode) {
> +	case IBNBD_ACCESS_RO:
> +		return "ro";
> +	case IBNBD_ACCESS_RW:
> +		return "rw";
> +	case IBNBD_ACCESS_MIGRATION:
> +		return "migration";
> +	default:
> +		return "unknown";
> +	}
> +}

These two functions are not in the hot path and hence should not be 
inline functions.

Note: I plan to review the entire patch series but it may take some time 
before I have finished reviewing the entire patch series.

Bart.


* Re: [PATCH v4 16/25] ibnbd: client: private header with client structs and functions
       [not found] ` <20190620150337.7847-17-jinpuwang@gmail.com>
@ 2019-09-13 22:25   ` Bart Van Assche
  2019-09-17 16:36     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-13 22:25 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +	char			pathname[NAME_MAX];
[ ... ]
 > +	char			blk_symlink_name[NAME_MAX];

Please allocate path names dynamically instead of hard-coding the upper 
length for a path.

Bart.


* Re: [PATCH v4 17/25] ibnbd: client: main functionality
       [not found] ` <20190620150337.7847-18-jinpuwang@gmail.com>
@ 2019-09-13 23:46   ` Bart Van Assche
  2019-09-16 14:17     ` Danil Kipnis
                       ` (2 more replies)
  2019-09-14  0:00   ` Bart Van Assche
  1 sibling, 3 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-13 23:46 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +MODULE_VERSION(IBNBD_VER_STRING);

No version numbers in upstream code please.

> +/*
> + * This is for closing devices when unloading the module:
> + * we might be closing a lot (>256) of devices in parallel
> + * and it is better not to use the system_wq.
> + */
> +static struct workqueue_struct *unload_wq;

I think that a better motivation is needed for the introduction of a new 
workqueue.

> +#define KERNEL_SECTOR_SIZE      512

Please use SECTOR_SIZE instead of redefining it.

> +static int ibnbd_clt_revalidate_disk(struct ibnbd_clt_dev *dev,
> +				     size_t new_nsectors)
> +{
> +	int err = 0;
> +
> +	ibnbd_info(dev, "Device size changed from %zu to %zu sectors\n",
> +		   dev->nsectors, new_nsectors);
> +	dev->nsectors = new_nsectors;
> +	set_capacity(dev->gd,
> +		     dev->nsectors * (dev->logical_block_size /
> +				      KERNEL_SECTOR_SIZE));
> +	err = revalidate_disk(dev->gd);
> +	if (err)
> +		ibnbd_err(dev, "Failed to change device size from"
> +			  " %zu to %zu, err: %d\n", dev->nsectors,
> +			  new_nsectors, err);
> +	return err;
> +}

Since this function changes the block device size, I think that the name 
ibnbd_clt_revalidate_disk() is confusing. Please rename this function.

> +/**
> + * ibnbd_get_cpu_qlist() - finds a list with HW queues to be requeued
> + *
> + * Description:
> + *     Each CPU has a list of HW queues, which needs to be requeed.  If a list
> + *     is not empty - it is marked with a bit.  This function finds first
> + *     set bit in a bitmap and returns corresponding CPU list.
> + */

What does it mean to requeue a queue? Queue elements can be requeued, but 
not a queue in its entirety. Please make this comment more clear.

> +/**
> + * ibnbd_requeue_if_needed() - requeue if CPU queue is marked as non empty
> + *
> + * Description:
> + *     Each CPU has it's own list of HW queues, which should be requeued.
> + *     Function finds such list with HW queues, takes a list lock, picks up
> + *     the first HW queue out of the list and requeues it.
> + *
> + * Return:
> + *     True if the queue was requeued, false otherwise.
> + *
> + * Context:
> + *     Does not matter.
> + */

Same comment here.

> +/**
> + * ibnbd_requeue_all_if_idle() - requeue all queues left in the list if
> + *     session is idling (there are no requests in-flight).
> + *
> + * Description:
> + *     This function tries to rerun all stopped queues if there are no
> + *     requests in-flight anymore.  This function tries to solve an obvious
> + *     problem, when number of tags < than number of queues (hctx), which
> + *     are stopped and put to sleep.  If last tag, which has been just put,
> + *     does not wake up all left queues (hctxs), IO requests hang forever.
> + *
> + *     That can happen when all number of tags, say N, have been exhausted
> + *     from one CPU, and we have many block devices per session, say M.
> + *     Each block device has it's own queue (hctx) for each CPU, so eventually
> + *     we can put that number of queues (hctxs) to sleep: M x nr_cpu_ids.
> + *     If number of tags N < M x nr_cpu_ids finally we will get an IO hang.
> + *
> + *     To avoid this hang last caller of ibnbd_put_tag() (last caller is the
> + *     one who observes sess->busy == 0) must wake up all remaining queues.
> + *
> + * Context:
> + *     Does not matter.
> + */

Same comment here.

A more general question is why ibnbd needs its own queue management 
while no other block driver needs this?

> +static void ibnbd_softirq_done_fn(struct request *rq)
> +{
> +	struct ibnbd_clt_dev *dev	= rq->rq_disk->private_data;
> +	struct ibnbd_clt_session *sess	= dev->sess;
> +	struct ibnbd_iu *iu;
> +
> +	iu = blk_mq_rq_to_pdu(rq);
> +	ibnbd_put_tag(sess, iu->tag);
> +	blk_mq_end_request(rq, iu->status);
> +}
> +
> +static void msg_io_conf(void *priv, int errno)
> +{
> +	struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
> +	struct ibnbd_clt_dev *dev = iu->dev;
> +	struct request *rq = iu->rq;
> +
> +	iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
> +
> +	if (softirq_enable) {
> +		blk_mq_complete_request(rq);
> +	} else {
> +		ibnbd_put_tag(dev->sess, iu->tag);
> +		blk_mq_end_request(rq, iu->status);
> +	}

Block drivers must call blk_mq_complete_request() instead of 
blk_mq_end_request() to complete a request after processing of the 
request has been started. Calling blk_mq_end_request() to complete a 
request is racy in case a timeout occurs while blk_mq_end_request() is 
in progress.
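I.e. something along these lines (untested sketch; it drops the softirq_enable branch and lets the softirq-done callback do the put_tag/end work in both cases):

```c
/* untested sketch: always go through blk_mq_complete_request() */
static void msg_io_conf(void *priv, int errno)
{
	struct ibnbd_iu *iu = priv;

	iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
	blk_mq_complete_request(iu->rq);
}
```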

> +static void msg_conf(void *priv, int errno)
> +{
> +	struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;

The kernel code I'm familiar with does not cast void pointers explicitly 
into another type. Please follow that convention and leave the cast out 
from the above and also from similar statements.

> +static int send_usr_msg(struct ibtrs_clt *ibtrs, int dir,
> +			struct ibnbd_iu *iu, struct kvec *vec, size_t nr,
> +			size_t len, struct scatterlist *sg, unsigned int sg_len,
> +			void (*conf)(struct work_struct *work),
> +			int *errno, bool wait)
> +{
> +	int err;
> +
> +	INIT_WORK(&iu->work, conf);
> +	err = ibtrs_clt_request(dir, msg_conf, ibtrs, iu->tag,
> +				iu, vec, nr, len, sg, sg_len);
> +	if (!err && wait) {
> +		wait_event(iu->comp.wait, iu->comp.errno != INT_MAX);

This looks weird. Why is this a wait_event() call instead of a 
wait_for_completion() call?
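I.e. something like this (untested sketch; it assumes msg_conf() is changed to store the errno and call complete() instead of writing the INT_MAX sentinel):

```c
/* untested sketch: a struct completion instead of the INT_MAX sentinel */
struct ibnbd_iu {
	/* ... existing members ... */
	struct completion comp;
	int comp_errno;
};

	init_completion(&iu->comp);
	INIT_WORK(&iu->work, conf);
	err = ibtrs_clt_request(dir, msg_conf, ibtrs, iu->tag,
				iu, vec, nr, len, sg, sg_len);
	if (!err && wait) {
		wait_for_completion(&iu->comp);	/* complete() in msg_conf() */
		*errno = iu->comp_errno;
	}
```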

> +static struct blk_mq_ops ibnbd_mq_ops;
> +static int setup_mq_tags(struct ibnbd_clt_session *sess)
> +{
> +	struct blk_mq_tag_set *tags = &sess->tag_set;
> +
> +	memset(tags, 0, sizeof(*tags));
> +	tags->ops		= &ibnbd_mq_ops;
> +	tags->queue_depth	= sess->queue_depth;
> +	tags->numa_node		= NUMA_NO_NODE;
> +	tags->flags		= BLK_MQ_F_SHOULD_MERGE |
> +				  BLK_MQ_F_TAG_SHARED;
> +	tags->cmd_size		= sizeof(struct ibnbd_iu);
> +	tags->nr_hw_queues	= num_online_cpus();
> +
> +	return blk_mq_alloc_tag_set(tags);
> +}

Forward declarations should be avoided when possible. Can the forward 
declaration of ibnbd_mq_ops be avoided by moving the definition of 
setup_mq_tags() down?

> +static inline void wake_up_ibtrs_waiters(struct ibnbd_clt_session *sess)
> +{
> +	/* paired with rmb() in wait_for_ibtrs_connection() */
> +	smp_wmb();
> +	sess->ibtrs_ready = true;
> +	wake_up_all(&sess->ibtrs_waitq);
> +}

The placement of the smp_wmb() call looks wrong to me. Since 
wake_up_all() and wait_event() already guarantee acquire/release 
behavior, I think that the explicit barriers can be left out from this 
function and also from wait_for_ibtrs_connection().

> +static void wait_for_ibtrs_disconnection(struct ibnbd_clt_session *sess)
> +__releases(&sess_lock)
> +__acquires(&sess_lock)
> +{
> +	DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
> +
> +	prepare_to_wait(&sess->ibtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
> +	if (IS_ERR_OR_NULL(sess->ibtrs)) {
> +		finish_wait(&sess->ibtrs_waitq, &wait);
> +		return;
> +	}
> +	mutex_unlock(&sess_lock);
> +	/* After unlock session can be freed, so careful */
> +	schedule();
> +	mutex_lock(&sess_lock);
> +}

This doesn't look right: any random wake_up() call can wake up this 
function. Shouldn't there be a loop in this function that causes the 
schedule() call to be repeated until the disconnect has happened?

> +
> +static struct ibnbd_clt_session *__find_and_get_sess(const char *sessname)
> +__releases(&sess_lock)
> +__acquires(&sess_lock)
> +{
> +	struct ibnbd_clt_session *sess;
> +	int err;
> +
> +again:
> +	list_for_each_entry(sess, &sess_list, list) {
> +		if (strcmp(sessname, sess->sessname))
> +			continue;
> +
> +		if (unlikely(sess->ibtrs_ready && IS_ERR_OR_NULL(sess->ibtrs)))
> +			/*
> +			 * No IBTRS connection, session is dying.
> +			 */
> +			continue;
> +
> +		if (likely(ibnbd_clt_get_sess(sess))) {
> +			/*
> +			 * Alive session is found, wait for IBTRS connection.
> +			 */
> +			mutex_unlock(&sess_lock);
> +			err = wait_for_ibtrs_connection(sess);
> +			if (unlikely(err))
> +				ibnbd_clt_put_sess(sess);
> +			mutex_lock(&sess_lock);
> +
> +			if (unlikely(err))
> +				/* Session is dying, repeat the loop */
> +				goto again;
> +
> +			return sess;
> +		}
> +		/*
> +		 * Ref is 0, session is dying, wait for IBTRS disconnect
> +		 * in order to avoid session names clashes.
> +		 */
> +		wait_for_ibtrs_disconnection(sess);
> +		/*
> +		 * IBTRS is disconnected and soon session will be freed,
> +		 * so repeat a loop.
> +		 */
> +		goto again;
> +	}
> +
> +	return NULL;
> +}
 >
> +
> +static struct ibnbd_clt_session *find_and_get_sess(const char *sessname)
> +{
> +	struct ibnbd_clt_session *sess;
> +
> +	mutex_lock(&sess_lock);
> +	sess = __find_and_get_sess(sessname);
> +	mutex_unlock(&sess_lock);
> +
> +	return sess;
> +}

Shouldn't __find_and_get_sess() function increase the reference count of 
sess before it returns? In other words, what prevents that the session 
is freed from another thread before find_and_get_sess() returns?

> +/*
> + * Get iorio of current task
> + */
> +static short ibnbd_current_ioprio(void)
> +{
> +	struct task_struct *tsp = current;
> +	unsigned short prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
> +
> +	if (likely(tsp->io_context))
> +		prio = tsp->io_context->ioprio;
> +	return prio;
> +}

ibnbd should use req_get_ioprio() and should not look at 
current->io_context->ioprio. I think it is the responsibility of the 
block layer to extract the I/O priority from the task context. As an 
example, here is how the aio code does this:

		req->ki_ioprio = get_current_ioprio();

> +static blk_status_t ibnbd_queue_rq(struct blk_mq_hw_ctx *hctx,
> +				   const struct blk_mq_queue_data *bd)
> +{
> +	struct request *rq = bd->rq;
> +	struct ibnbd_clt_dev *dev = rq->rq_disk->private_data;
> +	struct ibnbd_iu *iu = blk_mq_rq_to_pdu(rq);
> +	int err;
> +
> +	if (unlikely(!ibnbd_clt_dev_is_mapped(dev)))
> +		return BLK_STS_IOERR;
> +
> +	iu->tag = ibnbd_get_tag(dev->sess, IBTRS_IO_CON, IBTRS_TAG_NOWAIT);
> +	if (unlikely(!iu->tag)) {
> +		ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_IFBUSY);
> +		return BLK_STS_RESOURCE;
> +	}
> +
> +	blk_mq_start_request(rq);
> +	err = ibnbd_client_xfer_request(dev, rq, iu);
> +	if (likely(err == 0))
> +		return BLK_STS_OK;
> +	if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
> +		ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_10ms);
> +		ibnbd_put_tag(dev->sess, iu->tag);
> +		return BLK_STS_RESOURCE;
> +	}
> +
> +	ibnbd_put_tag(dev->sess, iu->tag);
> +	return BLK_STS_IOERR;
> +}

Every other block driver relies on the block layer core for tag 
allocation. Why does ibnbd need its own tag management?

> +static void setup_request_queue(struct ibnbd_clt_dev *dev)
> +{
> +	blk_queue_logical_block_size(dev->queue, dev->logical_block_size);
> +	blk_queue_physical_block_size(dev->queue, dev->physical_block_size);
> +	blk_queue_max_hw_sectors(dev->queue, dev->max_hw_sectors);
> +	blk_queue_max_write_same_sectors(dev->queue,
> +					 dev->max_write_same_sectors);
> +
> +	/*
> +	 * we don't support discards to "discontiguous" segments
> +	 * in on request
               ^^
               one?
> +	 */
> +	blk_queue_max_discard_segments(dev->queue, 1);
> +
> +	blk_queue_max_discard_sectors(dev->queue, dev->max_discard_sectors);
> +	dev->queue->limits.discard_granularity	= dev->discard_granularity;
> +	dev->queue->limits.discard_alignment	= dev->discard_alignment;
> +	if (dev->max_discard_sectors)
> +		blk_queue_flag_set(QUEUE_FLAG_DISCARD, dev->queue);
> +	if (dev->secure_discard)
> +		blk_queue_flag_set(QUEUE_FLAG_SECERASE, dev->queue);
> +
> +	blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, dev->queue);
> +	blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, dev->queue);
> +	blk_queue_max_segments(dev->queue, dev->max_segments);
> +	blk_queue_io_opt(dev->queue, dev->sess->max_io_size);
> +	blk_queue_virt_boundary(dev->queue, 4095);
> +	blk_queue_write_cache(dev->queue, true, true);
> +	dev->queue->queuedata = dev;
> +}

> +static void destroy_gen_disk(struct ibnbd_clt_dev *dev)
> +{
> +	del_gendisk(dev->gd);

> +	/*
> +	 * Before marking queue as dying (blk_cleanup_queue() does that)
> +	 * we have to be sure that everything in-flight has gone.
> +	 * Blink with freeze/unfreeze.
> +	 */
> +	blk_mq_freeze_queue(dev->queue);
> +	blk_mq_unfreeze_queue(dev->queue);

Please remove the above seven lines. blk_cleanup_queue() calls 
blk_set_queue_dying() and the second call in blk_set_queue_dying() is 
blk_freeze_queue_start().

> +	blk_cleanup_queue(dev->queue);
> +	put_disk(dev->gd);
> +}

> +
> +static void destroy_sysfs(struct ibnbd_clt_dev *dev,
> +			  const struct attribute *sysfs_self)
> +{
> +	ibnbd_clt_remove_dev_symlink(dev);
> +	if (dev->kobj.state_initialized) {
> +		if (sysfs_self)
> +			/* To avoid deadlock firstly commit suicide */
                                                             ^^^^^^^
Please choose terminology that is more appropriate for a professional
context.

> +			sysfs_remove_file_self(&dev->kobj, sysfs_self);
> +		kobject_del(&dev->kobj);
> +		kobject_put(&dev->kobj);
> +	}
> +}

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
       [not found] ` <20190620150337.7847-26-jinpuwang@gmail.com>
  2019-07-09 15:10   ` [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Leon Romanovsky
@ 2019-09-13 23:56   ` Bart Van Assche
  2019-09-19 10:30     ` Jinpu Wang
  1 sibling, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-13 23:56 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> From: Roman Pen <roman.penyaev@profitbricks.com>
> 
> Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
> Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
> ---
>   MAINTAINERS | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a6954776a37e..0b7fd93f738d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7590,6 +7590,20 @@ IBM ServeRAID RAID DRIVER
>   S:	Orphan
>   F:	drivers/scsi/ips.*
>   
> +IBNBD BLOCK DRIVERS
> +M:	IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> +L:	linux-block@vger.kernel.org
> +S:	Maintained
> +T:	git git://github.com/profitbricks/ibnbd.git
> +F:	drivers/block/ibnbd/
> +
> +IBTRS TRANSPORT DRIVERS
> +M:	IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> +L:	linux-rdma@vger.kernel.org
> +S:	Maintained
> +T:	git git://github.com/profitbricks/ibnbd.git
> +F:	drivers/infiniband/ulp/ibtrs/
> +
>   ICH LPC AND GPIO DRIVER
>   M:	Peter Tyser <ptyser@xes-inc.com>
>   S:	Maintained

I think the T: entry is for kernel trees against which developers should 
prepare their patches. Since the ibnbd repository on github is an 
out-of-tree kernel driver I don't think that it should appear in the 
MAINTAINERS file.

Bart.




* Re: [PATCH v4 24/25] ibnbd: a bit of documentation
       [not found] ` <20190620150337.7847-25-jinpuwang@gmail.com>
@ 2019-09-13 23:58   ` Bart Van Assche
  2019-09-18 12:22     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-13 23:58 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> From: Roman Pen <roman.penyaev@profitbricks.com>
> 
> README with description of major sysfs entries.

Please have a look at Documentation/ABI/README and follow the 
instructions from that document.

Thanks,

Bart.


* Re: [PATCH v4 17/25] ibnbd: client: main functionality
       [not found] ` <20190620150337.7847-18-jinpuwang@gmail.com>
  2019-09-13 23:46   ` [PATCH v4 17/25] ibnbd: client: main functionality Bart Van Assche
@ 2019-09-14  0:00   ` Bart Van Assche
  1 sibling, 0 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-14  0:00 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> From: Roman Pen <roman.penyaev@profitbricks.com>

A "From" address should be a valid email address. For the above address 
I got the following reply:

550 5.1.1 The email account that you tried to reach does not exist. 
Please try double-checking the recipient's email address for typos or 
unnecessary spaces.

Bart.


* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-13 22:10   ` [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers Bart Van Assche
@ 2019-09-15 14:30     ` Jinpu Wang
  2019-09-16  5:27       ` Leon Romanovsky
                         ` (3 more replies)
  0 siblings, 4 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-15 14:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

Thanks Bart for the detailed review; replies are inline.

On Sat, Sep 14, 2019 at 12:10 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#define ibnbd_log(fn, dev, fmt, ...) ({                              \
> > +     __builtin_choose_expr(                                          \
> > +             __builtin_types_compatible_p(                           \
> > +                     typeof(dev), struct ibnbd_clt_dev *),           \
> > +             fn("<%s@%s> " fmt, (dev)->pathname,                     \
> > +             (dev)->sess->sessname,                                  \
> > +                ##__VA_ARGS__),                                      \
> > +             __builtin_choose_expr(                                  \
> > +                     __builtin_types_compatible_p(typeof(dev),       \
> > +                                     struct ibnbd_srv_sess_dev *),   \
> > +                     fn("<%s@%s>: " fmt, (dev)->pathname,            \
> > +                        (dev)->sess->sessname, ##__VA_ARGS__),       \
> > +                     unknown_type()));                               \
> > +})
>
> Please remove the __builtin_choose_expr() /
> __builtin_types_compatible_p() construct and split this macro into two
> macros or inline functions: one for struct ibnbd_clt_dev and another one
> for struct ibnbd_srv_sess_dev.
Ok, will split into two macros.

>
> > +#define IBNBD_PROTO_VER_MAJOR 2
> > +#define IBNBD_PROTO_VER_MINOR 0
> > +
> > +#define IBNBD_PROTO_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
> > +                            __stringify(IBNBD_PROTO_VER_MINOR)
> > +
> > +#ifndef IBNBD_VER_STRING
> > +#define IBNBD_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
> > +                      __stringify(IBNBD_PROTO_VER_MINOR)
>
> Upstream code should not have a version number.
IBNBD_VER_STRING can be removed together with MODULE_VERSION.
>
> > +/* TODO: should be configurable */
> > +#define IBTRS_PORT 1234
>
> How about converting this macro into a kernel module parameter?
Sounds good, will do.
>
> > +enum ibnbd_access_mode {
> > +     IBNBD_ACCESS_RO,
> > +     IBNBD_ACCESS_RW,
> > +     IBNBD_ACCESS_MIGRATION,
> > +};
>
> Some more information about what IBNBD_ACCESS_MIGRATION represents would
> be welcome.
This is a special mode to temporarily allow RW access during VM
migration; will add comments next round.
>
> > +#define _IBNBD_FILEIO  0
> > +#define _IBNBD_BLOCKIO 1
> > +#define _IBNBD_AUTOIO  2
>  >
> > +enum ibnbd_io_mode {
> > +     IBNBD_FILEIO = _IBNBD_FILEIO,
> > +     IBNBD_BLOCKIO = _IBNBD_BLOCKIO,
> > +     IBNBD_AUTOIO = _IBNBD_AUTOIO,
> > +};
>
> Since the IBNBD_* and _IBNBD_* constants have the same numerical value,
> are the former constants really necessary?
Seems we can remove _IBNBD_*.
>
> > +/**
> > + * struct ibnbd_msg_sess_info - initial session info from client to server
> > + * @hdr:             message header
> > + * @ver:             IBNBD protocol version
> > + */
> > +struct ibnbd_msg_sess_info {
> > +     struct ibnbd_msg_hdr hdr;
> > +     u8              ver;
> > +     u8              reserved[31];
> > +};
>
> Since the wire protocol is versioned, is it really necessary to add 31
> reserved bytes?
You never know; we prefer to keep the reserved bytes for future extension.
31 bytes is not much, is it?


>
> > +struct ibnbd_msg_sess_info_rsp {
> > +     struct ibnbd_msg_hdr hdr;
> > +     u8              ver;
> > +     u8              reserved[31];
> > +};
>
> Same comment here.
Dito.
>
> > +/**
> > + * struct ibnbd_msg_open_rsp - response message to IBNBD_MSG_OPEN
> > + * @hdr:             message header
> > + * @nsectors:                number of sectors
>
> What is the size of a single sector?
512 bytes; will mention it explicitly in the next round.
>
> > + * @device_id:               device_id on server side to identify the device
>
> Please use the same order for the members in the kernel-doc header as in
> the structure.
Ok, will fix
>
> > + * @queue_flags:     queue_flags of the device on server side
>
> Where is the queue_flags member?
Oh, will remove it, left over.
>
> > + * @discard_granularity: size of the internal discard allocation unit
> > + * @discard_alignment: offset from internal allocation assignment
> > + * @physical_block_size: physical block size device supports
> > + * @logical_block_size: logical block size device supports
>
> What is the unit for these four members?
Will update to be clearer.
>
> > + * @max_segments:    max segments hardware support in one transfer
>
> Does 'hardware' refer to the RDMA adapter that transfers the IBNBD
> message or to the storage device? In the latter case, I assume that
> transfer refers to a DMA transaction?
"hardware" refers to the storage device on the server-side.

>
> > + * @io_mode:         io_mode device is opened.
>
> Should a reference to enum ibnbd_io_mode be added?
sounds good.
>
> > +     u8                      __padding[10];
>
> Why ten padding bytes? Does alignment really matter for a data structure
> like this one?
It's more of a reserved space for future use; will rename padding to reserved.
>
> > +/**
> > + * struct ibnbd_msg_io_old - message for I/O read/write for
> > + * ver < IBNBD_PROTO_VER_MAJOR
> > + * This structure is there only to know the size of the "old" message format
> > + * @hdr:     message header
> > + * @device_id:       device_id on server side to find the right device
> > + * @sector:  bi_sector attribute from struct bio
> > + * @rw:              bitmask, valid values are defined in enum ibnbd_io_flags
> > + * @bi_size:    number of bytes for I/O read/write
> > + * @prio:       priority
> > + */
> > +struct ibnbd_msg_io_old {
> > +     struct ibnbd_msg_hdr hdr;
> > +     __le32          device_id;
> > +     __le64          sector;
> > +     __le32          rw;
> > +     __le32          bi_size;
> > +};
>
> Since this is the first version of IBNBD that is being sent upstream, I
> think that ibnbd_msg_io_old should be left out.

>
> > +
> > +/**
> > + * struct ibnbd_msg_io - message for I/O read/write
> > + * @hdr:     message header
> > + * @device_id:       device_id on server side to find the right device
> > + * @sector:  bi_sector attribute from struct bio
> > + * @rw:              bitmask, valid values are defined in enum ibnbd_io_flags
>
> enum ibnbd_io_flags doesn't look like a bitmask but rather like a bit
> field (https://en.wikipedia.org/wiki/Bit_field)?
I will remove "bitmask"; I will probably also rename "rw" to "opf".
>
> > +static inline u32 ibnbd_to_bio_flags(u32 ibnbd_flags)
> > +{
> > +     u32 bio_flags;
>
> The names ibnbd_flags and bio_flags are confusing since these two
> variables not only contain flags but also an operation. How about
> changing 'flags' into 'opf' or 'op_flags'?
Sounds good, will change to ibnbd_opf and bio_opf.
>
> > +static inline const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
> > +{
> > +     switch (mode) {
> > +     case IBNBD_FILEIO:
> > +             return "fileio";
> > +     case IBNBD_BLOCKIO:
> > +             return "blockio";
> > +     case IBNBD_AUTOIO:
> > +             return "autoio";
> > +     default:
> > +             return "unknown";
> > +     }
> > +}
> > +
> > +static inline const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
> > +{
> > +     switch (mode) {
> > +     case IBNBD_ACCESS_RO:
> > +             return "ro";
> > +     case IBNBD_ACCESS_RW:
> > +             return "rw";
> > +     case IBNBD_ACCESS_MIGRATION:
> > +             return "migration";
> > +     default:
> > +             return "unknown";
> > +     }
> > +}
>
> These two functions are not in the hot path and hence should not be
> inline functions.
Sounds reasonable, will remove the inline.
>
> Note: I plan to review the entire patch series but it may take some time
> before I have finished reviewing the entire patch series.
>
That will be great, thanks a lot, Bart.
> Bart.


Regards,
-- 
Jack Wang
Linux Kernel Developer
Platform Engineering Compute (IONOS Cloud)

1&1 IONOS SE | Greifswalder Str. 207 | 10405 Berlin | Germany
Phone: +49 30 57700-8042 | Fax: +49 30 57700-8598
E-mail: jinpu.wang@cloud.ionos.com | Web: www.ionos.de


* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-15 14:30     ` Jinpu Wang
@ 2019-09-16  5:27       ` Leon Romanovsky
  2019-09-16 13:45         ` Bart Van Assche
  2019-09-16  7:08       ` Danil Kipnis
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 123+ messages in thread
From: Leon Romanovsky @ 2019-09-16  5:27 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Sun, Sep 15, 2019 at 04:30:04PM +0200, Jinpu Wang wrote:
> Thanks Bart for detailed review, reply inline.
>
> On Sat, Sep 14, 2019 at 12:10 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > +#define ibnbd_log(fn, dev, fmt, ...) ({                              \
> > > +     __builtin_choose_expr(                                          \
> > > +             __builtin_types_compatible_p(                           \
> > > +                     typeof(dev), struct ibnbd_clt_dev *),           \
> > > +             fn("<%s@%s> " fmt, (dev)->pathname,                     \
> > > +             (dev)->sess->sessname,                                  \
> > > +                ##__VA_ARGS__),                                      \
> > > +             __builtin_choose_expr(                                  \
> > > +                     __builtin_types_compatible_p(typeof(dev),       \
> > > +                                     struct ibnbd_srv_sess_dev *),   \
> > > +                     fn("<%s@%s>: " fmt, (dev)->pathname,            \
> > > +                        (dev)->sess->sessname, ##__VA_ARGS__),       \
> > > +                     unknown_type()));                               \
> > > +})
> >
> > Please remove the __builtin_choose_expr() /
> > __builtin_types_compatible_p() construct and split this macro into two
> > macros or inline functions: one for struct ibnbd_clt_dev and another one
> > for struct ibnbd_srv_sess_dev.
> Ok, will split to two macros.
>
> >
> > > +#define IBNBD_PROTO_VER_MAJOR 2
> > > +#define IBNBD_PROTO_VER_MINOR 0
> > > +
> > > +#define IBNBD_PROTO_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
> > > +                            __stringify(IBNBD_PROTO_VER_MINOR)
> > > +
> > > +#ifndef IBNBD_VER_STRING
> > > +#define IBNBD_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
> > > +                      __stringify(IBNBD_PROTO_VER_MINOR)
> >
> > Upstream code should not have a version number.
> IBNBD_VER_STRING can be removed together with MODULE_VERSION.
> >
> > > +/* TODO: should be configurable */
> > > +#define IBTRS_PORT 1234
> >
> > How about converting this macro into a kernel module parameter?
> Sounds good, will do.

Don't rush to do it and defer it to be the last change before merging,
this is controversial request which not everyone will like here.

Thanks


* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-15 14:30     ` Jinpu Wang
  2019-09-16  5:27       ` Leon Romanovsky
@ 2019-09-16  7:08       ` Danil Kipnis
  2019-09-16 14:57       ` Jinpu Wang
  2019-09-16 15:39       ` Jinpu Wang
  3 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-09-16  7:08 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Roman Pen

> > > +/**
> > > + * struct ibnbd_msg_open_rsp - response message to IBNBD_MSG_OPEN
> > > + * @hdr:             message header
> > > + * @nsectors:                number of sectors
> >
> > What is the size of a single sector?
> 512b, will mention explicitly in the next round.
We only have KERNEL_SECTOR_SIZE=512, defined in ibnbd-clt.c. It looks
like we only depend on this exact number to set the capacity of the
block device on the client side. I'm not sure whether it is worth
extending the protocol to send the number from the server instead.

> > > + * @max_segments:    max segments hardware support in one transfer
> >
> > Does 'hardware' refer to the RDMA adapter that transfers the IBNBD
> > message or to the storage device? In the latter case, I assume that
> > transfer refers to a DMA transaction?
> "hardware" refers to the storage device on the server-side.
The field contains queue_max_segments() of the target block device.
And is used to call blk_queue_max_segments() on the corresponding
device on the client side.
We do also have BMAX_SEGMENTS define in ibnbd-clt.h which sets an
upper limit to max_segments and does refer to the capabilities of the
RDMA adapter. This information should only be known to the transport
module and ideally would be returned to IBNBD during the registration
in IBTRS.

> > Note: I plan to review the entire patch series but it may take some time
> > before I have finished reviewing the entire patch series.
> >
> That will be great, thanks a  lot, Bart
Thank you Bart!


* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-16  5:27       ` Leon Romanovsky
@ 2019-09-16 13:45         ` Bart Van Assche
  2019-09-17 15:41           ` Leon Romanovsky
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-16 13:45 UTC (permalink / raw)
  To: Leon Romanovsky, Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On 9/15/19 10:27 PM, Leon Romanovsky wrote:
> On Sun, Sep 15, 2019 at 04:30:04PM +0200, Jinpu Wang wrote:
>> On Sat, Sep 14, 2019 at 12:10 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>> +/* TODO: should be configurable */
>>>> +#define IBTRS_PORT 1234
>>>
>>> How about converting this macro into a kernel module parameter?
>> Sounds good, will do.
> 
> Don't rush to do it and defer it to be the last change before merging,
> this is controversial request which not everyone will like here.

Hi Leon,

If you do not agree with changing this macro into a kernel module 
parameter please suggest an alternative.

Thanks,

Bart.


* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-13 23:46   ` [PATCH v4 17/25] ibnbd: client: main functionality Bart Van Assche
@ 2019-09-16 14:17     ` Danil Kipnis
  2019-09-16 16:46       ` Bart Van Assche
  2019-09-17 13:09     ` Jinpu Wang
  2019-09-18 16:05     ` Jinpu Wang
  2 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-16 14:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang, Roman Pen

On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +MODULE_VERSION(IBNBD_VER_STRING);
>
> No version numbers in upstream code please.
Will drop this, thanks.
>
> > +/*
> > + * This is for closing devices when unloading the module:
> > + * we might be closing a lot (>256) of devices in parallel
> > + * and it is better not to use the system_wq.
> > + */
> > +static struct workqueue_struct *unload_wq;
>
> I think that a better motivation is needed for the introduction of a new
> workqueue.
We didn't want to pollute the system workqueue when unmapping a big
number of devices at once in parallel. Will elaborate on this in the
comment.

>
> > +#define KERNEL_SECTOR_SIZE      512
>
> Please use SECTOR_SIZE instead of redefining it.
Right.

>
> > +static int ibnbd_clt_revalidate_disk(struct ibnbd_clt_dev *dev,
> > +                                  size_t new_nsectors)
> > +{
> > +     int err = 0;
> > +
> > +     ibnbd_info(dev, "Device size changed from %zu to %zu sectors\n",
> > +                dev->nsectors, new_nsectors);
> > +     dev->nsectors = new_nsectors;
> > +     set_capacity(dev->gd,
> > +                  dev->nsectors * (dev->logical_block_size /
> > +                                   KERNEL_SECTOR_SIZE));
> > +     err = revalidate_disk(dev->gd);
> > +     if (err)
> > +             ibnbd_err(dev, "Failed to change device size from"
> > +                       " %zu to %zu, err: %d\n", dev->nsectors,
> > +                       new_nsectors, err);
> > +     return err;
> > +}
>
> Since this function changes the block device size, I think that the name
> ibnbd_clt_revalidate_disk() is confusing. Please rename this function.
I guess ibnbd_clt_resize_disk() would be more appropriate.

>
> > +/**
> > + * ibnbd_get_cpu_qlist() - finds a list with HW queues to be requeued
> > + *
> > + * Description:
> > + *     Each CPU has a list of HW queues, which needs to be requeed.  If a list
> > + *     is not empty - it is marked with a bit.  This function finds first
> > + *     set bit in a bitmap and returns corresponding CPU list.
> > + */
>
> What does it mean to requeue a queue? Queue elements can be requeued but
> a queue in its entirety not. Please make this comment more clear.
Will fix the comment. The right wording should probably be "..., which
need to be rerun". We have a list of "stopped" queues for each CPU. We
need to select a list and a queue on that list to rerun when an IO is
completed.

>
> > +/**
> > + * ibnbd_requeue_if_needed() - requeue if CPU queue is marked as non empty
> > + *
> > + * Description:
> > + *     Each CPU has it's own list of HW queues, which should be requeued.
> > + *     Function finds such list with HW queues, takes a list lock, picks up
> > + *     the first HW queue out of the list and requeues it.
> > + *
> > + * Return:
> > + *     True if the queue was requeued, false otherwise.
> > + *
> > + * Context:
> > + *     Does not matter.
> > + */
>
> Same comment here.
>
> > +/**
> > + * ibnbd_requeue_all_if_idle() - requeue all queues left in the list if
> > + *     session is idling (there are no requests in-flight).
> > + *
> > + * Description:
> > + *     This function tries to rerun all stopped queues if there are no
> > + *     requests in-flight anymore.  This function tries to solve an obvious
> > + *     problem, when number of tags < than number of queues (hctx), which
> > + *     are stopped and put to sleep.  If last tag, which has been just put,
> > + *     does not wake up all left queues (hctxs), IO requests hang forever.
> > + *
> > + *     That can happen when all number of tags, say N, have been exhausted
> > + *     from one CPU, and we have many block devices per session, say M.
> > + *     Each block device has it's own queue (hctx) for each CPU, so eventually
> > + *     we can put that number of queues (hctxs) to sleep: M x nr_cpu_ids.
> > + *     If number of tags N < M x nr_cpu_ids finally we will get an IO hang.
> > + *
> > + *     To avoid this hang last caller of ibnbd_put_tag() (last caller is the
> > + *     one who observes sess->busy == 0) must wake up all remaining queues.
> > + *
> > + * Context:
> > + *     Does not matter.
> > + */
>
> Same comment here.
>
> A more general question is why ibnbd needs its own queue management
> while no other block driver needs this?
Each IBNBD device promises to have a queue_depth (of, say, 512) on each
of its num_cpus hardware queues. In fact we can only process queue_depth
in-flight requests at once on the whole IBTRS session connecting a
given client with a given server. Those 512 inflights (corresponding
to the number of buffers reserved by the server for this particular
client) have to be shared among all the devices mapped on this
session. This leads to the situation that we receive more requests
than we can process at the moment, so we need to stop queues and start
them again later in some fair fashion.

>
> > +static void ibnbd_softirq_done_fn(struct request *rq)
> > +{
> > +     struct ibnbd_clt_dev *dev       = rq->rq_disk->private_data;
> > +     struct ibnbd_clt_session *sess  = dev->sess;
> > +     struct ibnbd_iu *iu;
> > +
> > +     iu = blk_mq_rq_to_pdu(rq);
> > +     ibnbd_put_tag(sess, iu->tag);
> > +     blk_mq_end_request(rq, iu->status);
> > +}
> > +
> > +static void msg_io_conf(void *priv, int errno)
> > +{
> > +     struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
> > +     struct ibnbd_clt_dev *dev = iu->dev;
> > +     struct request *rq = iu->rq;
> > +
> > +     iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
> > +
> > +     if (softirq_enable) {
> > +             blk_mq_complete_request(rq);
> > +     } else {
> > +             ibnbd_put_tag(dev->sess, iu->tag);
> > +             blk_mq_end_request(rq, iu->status);
> > +     }
>
> Block drivers must call blk_mq_complete_request() instead of
> blk_mq_end_request() to complete a request after processing of the
> request has been started. Calling blk_mq_end_request() to complete a
> request is racy in case a timeout occurs while blk_mq_end_request() is
> in progress.
I need some time to give this part a closer look.

>
> > +static void msg_conf(void *priv, int errno)
> > +{
> > +     struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
>
> The kernel code I'm familiar with does not cast void pointers explicitly
> into another type. Please follow that convention and leave the cast out
> from the above and also from similar statements.
msg_conf() is a callback which IBNBD passes down with a request to
IBTRS when calling ibtrs_clt_request(). msg_conf() is called when a
request is completed with a pointer to a struct defined in IBNBD. So
IBTRS as transport doesn't know what's inside the private pointer
which IBNBD passed down with the request, it's opaque, since struct
ibnbd_iu is not visible in IBTRS. I will try to find how others avoid
a cast in similar situations.

>
> > +static int send_usr_msg(struct ibtrs_clt *ibtrs, int dir,
> > +                     struct ibnbd_iu *iu, struct kvec *vec, size_t nr,
> > +                     size_t len, struct scatterlist *sg, unsigned int sg_len,
> > +                     void (*conf)(struct work_struct *work),
> > +                     int *errno, bool wait)
> > +{
> > +     int err;
> > +
> > +     INIT_WORK(&iu->work, conf);
> > +     err = ibtrs_clt_request(dir, msg_conf, ibtrs, iu->tag,
> > +                             iu, vec, nr, len, sg, sg_len);
> > +     if (!err && wait) {
> > +             wait_event(iu->comp.wait, iu->comp.errno != INT_MAX);
>
> This looks weird. Why is this a wait_event() call instead of a
> wait_for_completion() call?
Looks like we could just use wait_for_completion() here.

>
> > +static struct blk_mq_ops ibnbd_mq_ops;
> > +static int setup_mq_tags(struct ibnbd_clt_session *sess)
> > +{
> > +     struct blk_mq_tag_set *tags = &sess->tag_set;
> > +
> > +     memset(tags, 0, sizeof(*tags));
> > +     tags->ops               = &ibnbd_mq_ops;
> > +     tags->queue_depth       = sess->queue_depth;
> > +     tags->numa_node         = NUMA_NO_NODE;
> > +     tags->flags             = BLK_MQ_F_SHOULD_MERGE |
> > +                               BLK_MQ_F_TAG_SHARED;
> > +     tags->cmd_size          = sizeof(struct ibnbd_iu);
> > +     tags->nr_hw_queues      = num_online_cpus();
> > +
> > +     return blk_mq_alloc_tag_set(tags);
> > +}
>
> Forward declarations should be avoided when possible. Can the forward
> declaration of ibnbd_mq_ops be avoided by moving the definition of
> setup_mq_tags() down?
Yes we can by moving a couple of things around, thank you!

>
> > +static inline void wake_up_ibtrs_waiters(struct ibnbd_clt_session *sess)
> > +{
> > +     /* paired with rmb() in wait_for_ibtrs_connection() */
> > +     smp_wmb();
> > +     sess->ibtrs_ready = true;
> > +     wake_up_all(&sess->ibtrs_waitq);
> > +}
>
> The placement of the smp_wmb() call looks wrong to me. Since
> wake_up_all() and wait_event() already guarantee acquire/release
> behavior, I think that the explicit barriers can be left out from this
> function and also from wait_for_ibtrs_connection().
I will have to look into this part again. At first glance the wmb
seems like it needs to come after sess->ibtrs_ready = true.

>
> > +static void wait_for_ibtrs_disconnection(struct ibnbd_clt_session *sess)
> > +__releases(&sess_lock)
> > +__acquires(&sess_lock)
> > +{
> > +     DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
> > +
> > +     prepare_to_wait(&sess->ibtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
> > +     if (IS_ERR_OR_NULL(sess->ibtrs)) {
> > +             finish_wait(&sess->ibtrs_waitq, &wait);
> > +             return;
> > +     }
> > +     mutex_unlock(&sess_lock);
> > +     /* After unlock session can be freed, so careful */
> > +     schedule();
> > +     mutex_lock(&sess_lock);
> > +}
>
> This doesn't look right: any random wake_up() call can wake up this
> function. Shouldn't there be a loop in this function that causes the
> schedule() call to be repeated until the disconnect has happened?
The loop is inside __find_and_get_sess(), which is calling that
function. We need to schedule() here in order for another thread to be
able to remove the dying session we just found and tried to get a
reference to from the list of sessions, so that we can go over the
list again in __find_and_get_sess().

>
> > +
> > +static struct ibnbd_clt_session *__find_and_get_sess(const char *sessname)
> > +__releases(&sess_lock)
> > +__acquires(&sess_lock)
> > +{
> > +     struct ibnbd_clt_session *sess;
> > +     int err;
> > +
> > +again:
> > +     list_for_each_entry(sess, &sess_list, list) {
> > +             if (strcmp(sessname, sess->sessname))
> > +                     continue;
> > +
> > +             if (unlikely(sess->ibtrs_ready && IS_ERR_OR_NULL(sess->ibtrs)))
> > +                     /*
> > +                      * No IBTRS connection, session is dying.
> > +                      */
> > +                     continue;
> > +
> > +             if (likely(ibnbd_clt_get_sess(sess))) {
> > +                     /*
> > +                      * Alive session is found, wait for IBTRS connection.
> > +                      */
> > +                     mutex_unlock(&sess_lock);
> > +                     err = wait_for_ibtrs_connection(sess);
> > +                     if (unlikely(err))
> > +                             ibnbd_clt_put_sess(sess);
> > +                     mutex_lock(&sess_lock);
> > +
> > +                     if (unlikely(err))
> > +                             /* Session is dying, repeat the loop */
> > +                             goto again;
> > +
> > +                     return sess;
> > +             }
> > +             /*
> > +              * Ref is 0, session is dying, wait for IBTRS disconnect
> > +              * in order to avoid session names clashes.
> > +              */
> > +             wait_for_ibtrs_disconnection(sess);
> > +             /*
> > +              * IBTRS is disconnected and soon session will be freed,
> > +              * so repeat a loop.
> > +              */
> > +             goto again;
> > +     }
> > +
> > +     return NULL;
> > +}
>  >
> > +
> > +static struct ibnbd_clt_session *find_and_get_sess(const char *sessname)
> > +{
> > +     struct ibnbd_clt_session *sess;
> > +
> > +     mutex_lock(&sess_lock);
> > +     sess = __find_and_get_sess(sessname);
> > +     mutex_unlock(&sess_lock);
> > +
> > +     return sess;
> > +}
>
> Shouldn't __find_and_get_sess() function increase the reference count of
> sess before it returns? In other words, what prevents that the session
> is freed from another thread before find_and_get_sess() returns?
It does increase the refcount inside __find_and_get_sess()
(...ibnbd_clt_get_sess(sess) call).

> > +/*
> > + * Get ioprio of current task
> > + */
> > +static short ibnbd_current_ioprio(void)
> > +{
> > +     struct task_struct *tsp = current;
> > +     unsigned short prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
> > +
> > +     if (likely(tsp->io_context))
> > +             prio = tsp->io_context->ioprio;
> > +     return prio;
> > +}
>
> ibnbd should use req_get_ioprio() and should not look at
> current->io_context->ioprio. I think it is the responsibility of the
> block layer to extract the I/O priority from the task context. As an
> example, here is how the aio code does this:
>
>                 req->ki_ioprio = get_current_ioprio();
>
Didn't notice get_current_ioprio(), thank you.
ibnbd_current_ioprio() does exactly the same, so I will drop it.

> > +static blk_status_t ibnbd_queue_rq(struct blk_mq_hw_ctx *hctx,
> > +                                const struct blk_mq_queue_data *bd)
> > +{
> > +     struct request *rq = bd->rq;
> > +     struct ibnbd_clt_dev *dev = rq->rq_disk->private_data;
> > +     struct ibnbd_iu *iu = blk_mq_rq_to_pdu(rq);
> > +     int err;
> > +
> > +     if (unlikely(!ibnbd_clt_dev_is_mapped(dev)))
> > +             return BLK_STS_IOERR;
> > +
> > +     iu->tag = ibnbd_get_tag(dev->sess, IBTRS_IO_CON, IBTRS_TAG_NOWAIT);
> > +     if (unlikely(!iu->tag)) {
> > +             ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_IFBUSY);
> > +             return BLK_STS_RESOURCE;
> > +     }
> > +
> > +     blk_mq_start_request(rq);
> > +     err = ibnbd_client_xfer_request(dev, rq, iu);
> > +     if (likely(err == 0))
> > +             return BLK_STS_OK;
> > +     if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
> > +             ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_10ms);
> > +             ibnbd_put_tag(dev->sess, iu->tag);
> > +             return BLK_STS_RESOURCE;
> > +     }
> > +
> > +     ibnbd_put_tag(dev->sess, iu->tag);
> > +     return BLK_STS_IOERR;
> > +}
>
> Every other block driver relies on the block layer core for tag
> allocation. Why does ibnbd need its own tag management?
Those tags are wrappers around the transport layer (ibtrs) "permits"
(ibtrs_tags) - one such ibtrs_tag/"permits" is a reservation of one
particular memory chunk on server side. Those "permits" are shared
among all the devices mapped on a given session and all their hardware
queues. Maybe we should use a different word like "permit" for them to
avoid confusion?

>
> > +static void setup_request_queue(struct ibnbd_clt_dev *dev)
> > +{
> > +     blk_queue_logical_block_size(dev->queue, dev->logical_block_size);
> > +     blk_queue_physical_block_size(dev->queue, dev->physical_block_size);
> > +     blk_queue_max_hw_sectors(dev->queue, dev->max_hw_sectors);
> > +     blk_queue_max_write_same_sectors(dev->queue,
> > +                                      dev->max_write_same_sectors);
> > +
> > +     /*
> > +      * we don't support discards to "discontiguous" segments
> > +      * in on request
>                ^^
>                one?
> > +      */
> > +     blk_queue_max_discard_segments(dev->queue, 1);
> > +
> > +     blk_queue_max_discard_sectors(dev->queue, dev->max_discard_sectors);
> > +     dev->queue->limits.discard_granularity  = dev->discard_granularity;
> > +     dev->queue->limits.discard_alignment    = dev->discard_alignment;
> > +     if (dev->max_discard_sectors)
> > +             blk_queue_flag_set(QUEUE_FLAG_DISCARD, dev->queue);
> > +     if (dev->secure_discard)
> > +             blk_queue_flag_set(QUEUE_FLAG_SECERASE, dev->queue);
> > +
> > +     blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, dev->queue);
> > +     blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, dev->queue);
> > +     blk_queue_max_segments(dev->queue, dev->max_segments);
> > +     blk_queue_io_opt(dev->queue, dev->sess->max_io_size);
> > +     blk_queue_virt_boundary(dev->queue, 4095);
> > +     blk_queue_write_cache(dev->queue, true, true);
> > +     dev->queue->queuedata = dev;
> > +}
>
> > +static void destroy_gen_disk(struct ibnbd_clt_dev *dev)
> > +{
> > +     del_gendisk(dev->gd);
>
> > +     /*
> > +      * Before marking queue as dying (blk_cleanup_queue() does that)
> > +      * we have to be sure that everything in-flight has gone.
> > +      * Blink with freeze/unfreeze.
> > +      */
> > +     blk_mq_freeze_queue(dev->queue);
> > +     blk_mq_unfreeze_queue(dev->queue);
>
> Please remove the above seven lines. blk_cleanup_queue() calls
> blk_set_queue_dying() and the second call in blk_set_queue_dying() is
> blk_freeze_queue_start().
Thanks, will check this out.

>
> > +     blk_cleanup_queue(dev->queue);
> > +     put_disk(dev->gd);
> > +}
>
> > +
> > +static void destroy_sysfs(struct ibnbd_clt_dev *dev,
> > +                       const struct attribute *sysfs_self)
> > +{
> > +     ibnbd_clt_remove_dev_symlink(dev);
> > +     if (dev->kobj.state_initialized) {
> > +             if (sysfs_self)
> > +                     /* To avoid deadlock firstly commit suicide */
>                                                              ^^^^^^^
> Please chose terminology that is more appropriate for a professional
> context.
Will rephrase the comment, thanks.

>
> > +                     sysfs_remove_file_self(&dev->kobj, sysfs_self);
> > +             kobject_del(&dev->kobj);
> > +             kobject_put(&dev->kobj);
> > +     }
> > +}
>
> Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-15 14:30     ` Jinpu Wang
  2019-09-16  5:27       ` Leon Romanovsky
  2019-09-16  7:08       ` Danil Kipnis
@ 2019-09-16 14:57       ` Jinpu Wang
  2019-09-16 17:25         ` Bart Van Assche
  2019-09-16 15:39       ` Jinpu Wang
  3 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-09-16 14:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

> > > +#define _IBNBD_FILEIO  0
> > > +#define _IBNBD_BLOCKIO 1
> > > +#define _IBNBD_AUTOIO  2
> >  >
> > > +enum ibnbd_io_mode {
> > > +     IBNBD_FILEIO = _IBNBD_FILEIO,
> > > +     IBNBD_BLOCKIO = _IBNBD_BLOCKIO,
> > > +     IBNBD_AUTOIO = _IBNBD_AUTOIO,
> > > +};
> >
> > Since the IBNBD_* and _IBNBD_* constants have the same numerical value,
> > are the former constants really necessary?
> Seems we can remove _IBNBD_*.
Sorry, I checked again: we defined the _IBNBD_* constants so that the
numeric values show up correctly in the def_io_mode parameter
description.
If we remove the _IBNBD_*, then the modinfo shows:
def_io_mode:By default, export devices in blockio(IBNBD_BLOCKIO) or
fileio(IBNBD_FILEIO) mode. (default: IBNBD_BLOCKIO (blockio))
instead of:
parm:           def_io_mode:By default, export devices in blockio(1)
or fileio(0) mode. (default: 1 (blockio))


> > > +/**
> > > + * struct ibnbd_msg_io_old - message for I/O read/write for
> > > + * ver < IBNBD_PROTO_VER_MAJOR
> > > + * This structure is there only to know the size of the "old" message format
> > > + * @hdr:     message header
> > > + * @device_id:       device_id on server side to find the right device
> > > + * @sector:  bi_sector attribute from struct bio
> > > + * @rw:              bitmask, valid values are defined in enum ibnbd_io_flags
> > > + * @bi_size:    number of bytes for I/O read/write
> > > + * @prio:       priority
> > > + */
> > > +struct ibnbd_msg_io_old {
> > > +     struct ibnbd_msg_hdr hdr;
> > > +     __le32          device_id;
> > > +     __le64          sector;
> > > +     __le32          rw;
> > > +     __le32          bi_size;
> > > +};
> >
> > Since this is the first version of IBNBD that is being sent upstream, I
> > think that ibnbd_msg_io_old should be left out.
After discuss with Danil, we will remove the ibnbd_msg_io_old next round.

Regards,

--
Jack Wang
Linux Kernel Developer
Platform Engineering Compute (IONOS Cloud)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-15 14:30     ` Jinpu Wang
                         ` (2 preceding siblings ...)
  2019-09-16 14:57       ` Jinpu Wang
@ 2019-09-16 15:39       ` Jinpu Wang
  2019-09-18 15:26         ` Bart Van Assche
  3 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-09-16 15:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

- Roman's pb email address: it's no longer valid, will fix next round.


> >
> > > +static inline const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
> > > +{
> > > +     switch (mode) {
> > > +     case IBNBD_FILEIO:
> > > +             return "fileio";
> > > +     case IBNBD_BLOCKIO:
> > > +             return "blockio";
> > > +     case IBNBD_AUTOIO:
> > > +             return "autoio";
> > > +     default:
> > > +             return "unknown";
> > > +     }
> > > +}
> > > +
> > > +static inline const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
> > > +{
> > > +     switch (mode) {
> > > +     case IBNBD_ACCESS_RO:
> > > +             return "ro";
> > > +     case IBNBD_ACCESS_RW:
> > > +             return "rw";
> > > +     case IBNBD_ACCESS_MIGRATION:
> > > +             return "migration";
> > > +     default:
> > > +             return "unknown";
> > > +     }
> > > +}
> >
> > These two functions are not in the hot path and hence should not be
> > inline functions.
> Sounds reasonable, will remove the inline.
inline was added to fix the -Wunused-function warning, e.g.:

  CC [M]  /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.o
In file included from /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.h:34,
                 from /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.c:33:
/<<PKGBUILDDIR>>/ibnbd/ibnbd-proto.h:362:20: warning:
'ibnbd_access_mode_str' defined but not used [-Wunused-function]
 static const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
                    ^~~~~~~~~~~~~~~~~~~~~
/<<PKGBUILDDIR>>/ibnbd/ibnbd-proto.h:348:20: warning:
'ibnbd_io_mode_str' defined but not used [-Wunused-function]
 static const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)

We would have to move both functions to a separate header file if we
really want to do this. The functions are simple and small, but if you
insist, I will do it.

Thanks,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-16 14:17     ` Danil Kipnis
@ 2019-09-16 16:46       ` Bart Van Assche
  2019-09-17 11:39         ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-16 16:46 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang, Roman Pen

On 9/16/19 7:17 AM, Danil Kipnis wrote:
> On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 6/20/19 8:03 AM, Jack Wang wrote:
>>> +/*
>>> + * This is for closing devices when unloading the module:
>>> + * we might be closing a lot (>256) of devices in parallel
>>> + * and it is better not to use the system_wq.
>>> + */
>>> +static struct workqueue_struct *unload_wq;
>>
>> I think that a better motivation is needed for the introduction of a new
>> workqueue.
 >
> We didn't want to pollute the system workqueue when unmapping a big
> number of devices at once in parallel. Will reiterate on it.

There are multiple system workqueues. From <linux/workqueue.h>:

extern struct workqueue_struct *system_wq;
extern struct workqueue_struct *system_highpri_wq;
extern struct workqueue_struct *system_long_wq;
extern struct workqueue_struct *system_unbound_wq;
extern struct workqueue_struct *system_freezable_wq;
extern struct workqueue_struct *system_power_efficient_wq;
extern struct workqueue_struct *system_freezable_power_efficient_wq;

Has it been considered to use e.g. system_long_wq?

>> A more general question is why ibnbd needs its own queue management
>> while no other block driver needs this?
>
> Each IBNBD device promises to have a queue_depth (of say 512) on each
> of its num_cpus hardware queues. In fact we can only process a
> queue_depth inflights at once on the whole ibtrs session connecting a
> given client with a given server. Those 512 inflights (corresponding
> to the number of buffers reserved by the server for this particular
> client) have to be shared among all the devices mapped on this
> session. This leads to the situation, that we receive more requests
> than we can process at the moment. So we need to stop queues and start
> them again later in some fair fashion.

Can a single CPU really sustain a queue depth of 512 commands? Is it 
really necessary to have one hardware queue per CPU or is e.g. four 
queues per NUMA node sufficient? Has it been considered to send the 
number of hardware queues that the initiator wants to use and also the 
command depth per queue during login to the target side? That would 
allow the target side to allocate an independent set of buffers for each 
initiator hardware queue and would allow to remove the queue management 
at the initiator side. This might even yield better performance.

>>> +static void msg_conf(void *priv, int errno)
>>> +{
>>> +     struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
>>
>> The kernel code I'm familiar with does not cast void pointers explicitly
>> into another type. Please follow that convention and leave the cast out
>> from the above and also from similar statements.
> msg_conf() is a callback which IBNBD passes down with a request to
> IBTRS when calling ibtrs_clt_request(). msg_conf() is called when a
> request is completed with a pointer to a struct defined in IBNBD. So
> IBTRS as transport doesn't know what's inside the private pointer
> which IBNBD passed down with the request, it's opaque, since struct
> ibnbd_iu is not visible in IBTRS. I will try to find how others avoid
> a cast in similar situations.

Are you aware that the C language can cast a void pointer into a 
non-void pointer implicitly, that means, without having to use a cast?


>>> +static void wait_for_ibtrs_disconnection(struct ibnbd_clt_session *sess)
>>> +__releases(&sess_lock)
>>> +__acquires(&sess_lock)
>>> +{
>>> +     DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
>>> +
>>> +     prepare_to_wait(&sess->ibtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
>>> +     if (IS_ERR_OR_NULL(sess->ibtrs)) {
>>> +             finish_wait(&sess->ibtrs_waitq, &wait);
>>> +             return;
>>> +     }
>>> +     mutex_unlock(&sess_lock);
>>> +     /* After unlock session can be freed, so careful */
>>> +     schedule();
>>> +     mutex_lock(&sess_lock);
>>> +}
>>
>> This doesn't look right: any random wake_up() call can wake up this
>> function. Shouldn't there be a loop in this function that causes the
>> schedule() call to be repeated until the disconnect has happened?
> The loop is inside __find_and_get_sess(), which is calling that
> function. We need to schedule() here in order for another thread to be
> able to remove the dying session we just found and tried to get
> reference to from the list of sessions, so that we can go over the
> list again in __find_and_get_sess().

Thanks for the clarification.

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-16 14:57       ` Jinpu Wang
@ 2019-09-16 17:25         ` Bart Van Assche
  2019-09-17 12:27           ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-16 17:25 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On 9/16/19 7:57 AM, Jinpu Wang wrote:
>>>> +#define _IBNBD_FILEIO  0
>>>> +#define _IBNBD_BLOCKIO 1
>>>> +#define _IBNBD_AUTOIO  2
>>>>
>>>> +enum ibnbd_io_mode {
>>>> +     IBNBD_FILEIO = _IBNBD_FILEIO,
>>>> +     IBNBD_BLOCKIO = _IBNBD_BLOCKIO,
>>>> +     IBNBD_AUTOIO = _IBNBD_AUTOIO,
>>>> +};
>>>
>>> Since the IBNBD_* and _IBNBD_* constants have the same numerical value,
>>> are the former constants really necessary?
 >>
>> Seems we can remove _IBNBD_*.
 >
> Sorry, checked again,  we defined _IBNBD_* constants to show the right
> value for def_io_mode description.
> If we remove the _IBNBD_*, then the modinfo shows:
> def_io_mode:By default, export devices in blockio(IBNBD_BLOCKIO) or
> fileio(IBNBD_FILEIO) mode. (default: IBNBD_BLOCKIO (blockio))
> instead of:
> parm:           def_io_mode:By default, export devices in blockio(1)
> or fileio(0) mode. (default: 1 (blockio))

So the user is required to enter def_io_mode as a number? Wouldn't it be 
more friendly towards users to change that parameter from a number into 
a string?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-16 16:46       ` Bart Van Assche
@ 2019-09-17 11:39         ` Danil Kipnis
  2019-09-18  7:14           ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-17 11:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang, Roman Pen

On Mon, Sep 16, 2019 at 6:46 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/16/19 7:17 AM, Danil Kipnis wrote:
> > On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +/*
> >>> + * This is for closing devices when unloading the module:
> >>> + * we might be closing a lot (>256) of devices in parallel
> >>> + * and it is better not to use the system_wq.
> >>> + */
> >>> +static struct workqueue_struct *unload_wq;
> >>
> >> I think that a better motivation is needed for the introduction of a new
> >> workqueue.
>  >
> > We didn't want to pollute the system workqueue when unmapping a big
> > number of devices at once in parallel. Will reiterate on it.
>
> There are multiple system workqueues. From <linux/workqueue.h>:
>
> extern struct workqueue_struct *system_wq;
> extern struct workqueue_struct *system_highpri_wq;
> extern struct workqueue_struct *system_long_wq;
> extern struct workqueue_struct *system_unbound_wq;
> extern struct workqueue_struct *system_freezable_wq;
> extern struct workqueue_struct *system_power_efficient_wq;
> extern struct workqueue_struct *system_freezable_power_efficient_wq;
>
> Has it been considered to use e.g. system_long_wq?
Will try to switch to system_long_wq; I do agree that a new wq just
for closing devices looks like an overreaction.

>
> >> A more general question is why ibnbd needs its own queue management
> >> while no other block driver needs this?
> >
> > Each IBNBD device promises to have a queue_depth (of say 512) on each
> > of its num_cpus hardware queues. In fact we can only process a
> > queue_depth inflights at once on the whole ibtrs session connecting a
> > given client with a given server. Those 512 inflights (corresponding
> > to the number of buffers reserved by the server for this particular
> > client) have to be shared among all the devices mapped on this
> > session. This leads to the situation, that we receive more requests
> > than we can process at the moment. So we need to stop queues and start
> > them again later in some fair fashion.
>
> Can a single CPU really sustain a queue depth of 512 commands? Is it
> really necessary to have one hardware queue per CPU or is e.g. four
> queues per NUMA node sufficient? Has it been considered to send the
> number of hardware queues that the initiator wants to use and also the
> command depth per queue during login to the target side? That would
> allow the target side to allocate an independent set of buffers for each
> initiator hardware queue and would allow to remove the queue management
> at the initiator side. This might even yield better performance.
We needed a way to address one particular requirement: we'd like to be
able to "enforce" that a response to an IO is processed on the same
CPU the IO was originally submitted on. In order to do so we establish
one rdma connection per cpu, each having a separate cq_vector. The
administrator can then assign the corresponding IRQs to distinct CPUs.
The server always replies to an IO on the same connection it received
the request on. If the administrator did configure
/proc/irq/y/smp_affinity accordingly, the response sent by the server
will generate an interrupt on the same cpu the IO was originally
submitted on. The administrator can configure the IRQs differently,
for example assign a given irq (<->cq_vector) to a range of cpus
belonging to a numa node, or whatever assignment is best for his
use-case.

Our transport module IBTRS establishes one connection per CPU between
a client and a server. The user of the transport module (i.e. IBNBD)
has no knowledge of the rdma connections; it only has a pointer to an
abstract "session", which connects it to a remote host. IBNBD as a
user of IBTRS creates block devices and uses a given "session" to send
IOs from all the block devices it created for that session. That means
IBNBD is limited in the maximum number of its inflights toward a given
remote host by the capability of the corresponding "session", so it
needs to share the resources provided by the session (in our current
model those resources are in fact some preregistered buffers on the
server side) among its devices.

It is possible to extend the IBTRS API so that the user (IBNBD) could
specify how many connections it wants to have on the session to be
established. It is also possible to extend the ibtrs_clt_get_tag API
(this is to get a send "permit") with a parameter specifying the
connection the future IO is to be sent on.

We now might have to change our communication model in IBTRS a bit in
order to fix the potential security problem raised during the recent
RDMA MC: https://etherpad.net/p/LPC2019_RDMA.

>
> >>> +static void msg_conf(void *priv, int errno)
> >>> +{
> >>> +     struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
> >>
> >> The kernel code I'm familiar with does not cast void pointers explicitly
> >> into another type. Please follow that convention and leave the cast out
> >> from the above and also from similar statements.
> > msg_conf() is a callback which IBNBD passes down with a request to
> > IBTRS when calling ibtrs_clt_request(). msg_conf() is called when a
> > request is completed with a pointer to a struct defined in IBNBD. So
> > IBTRS as transport doesn't know what's inside the private pointer
> > which IBNBD passed down with the request, it's opaque, since struct
> > ibnbd_iu is not visible in IBTRS. I will try to find how others avoid
> > a cast in similar situations.
>
> Are you aware that the C language can cast a void pointer into a
> non-void pointer implicitly, that means, without having to use a cast?
Oh, I misunderstood your original comment: you suggest to just remove
the explicit (struct ibnbd_iu *) and similar casts from void pointers.
I think an explicit cast makes it easier for readers to follow the
code, but the "Allocating Memory" section of
https://www.kernel.org/doc/html/v4.10/process/coding-style.html does
say "Casting the return value which is a void pointer is redundant."
and it seems others don't do that, at least not when declaring a
variable. Will drop those casts.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-16 17:25         ` Bart Van Assche
@ 2019-09-17 12:27           ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-17 12:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Mon, Sep 16, 2019 at 7:25 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/16/19 7:57 AM, Jinpu Wang wrote:
> >>>> +#define _IBNBD_FILEIO  0
> >>>> +#define _IBNBD_BLOCKIO 1
> >>>> +#define _IBNBD_AUTOIO  2
> >>>>
> >>>> +enum ibnbd_io_mode {
> >>>> +     IBNBD_FILEIO = _IBNBD_FILEIO,
> >>>> +     IBNBD_BLOCKIO = _IBNBD_BLOCKIO,
> >>>> +     IBNBD_AUTOIO = _IBNBD_AUTOIO,
> >>>> +};
> >>>
> >>> Since the IBNBD_* and _IBNBD_* constants have the same numerical value,
> >>> are the former constants really necessary?
>  >>
> >> Seems we can remove _IBNBD_*.
>  >
> > Sorry, checked again,  we defined _IBNBD_* constants to show the right
> > value for def_io_mode description.
> > If we remove the _IBNBD_*, then the modinfo shows:
> > def_io_mode:By default, export devices in blockio(IBNBD_BLOCKIO) or
> > fileio(IBNBD_FILEIO) mode. (default: IBNBD_BLOCKIO (blockio))
> > instead of:
> > parm:           def_io_mode:By default, export devices in blockio(1)
> > or fileio(0) mode. (default: 1 (blockio))
>
> So the user is required to enter def_io_mode as a number? Wouldn't it be
> more friendly towards users to change that parameter from a number into
> a string?
>
Ok, it's a bit more code, but I will change it to allow the user to
set "blockio" or "fileio" as a string.

Thanks,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-13 23:46   ` [PATCH v4 17/25] ibnbd: client: main functionality Bart Van Assche
  2019-09-16 14:17     ` Danil Kipnis
@ 2019-09-17 13:09     ` Jinpu Wang
  2019-09-17 16:46       ` Bart Van Assche
  2019-09-18 16:05     ` Jinpu Wang
  2 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-09-17 13:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

> > +static void ibnbd_softirq_done_fn(struct request *rq)
> > +{
> > +     struct ibnbd_clt_dev *dev       = rq->rq_disk->private_data;
> > +     struct ibnbd_clt_session *sess  = dev->sess;
> > +     struct ibnbd_iu *iu;
> > +
> > +     iu = blk_mq_rq_to_pdu(rq);
> > +     ibnbd_put_tag(sess, iu->tag);
> > +     blk_mq_end_request(rq, iu->status);
> > +}
> > +
> > +static void msg_io_conf(void *priv, int errno)
> > +{
> > +     struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
> > +     struct ibnbd_clt_dev *dev = iu->dev;
> > +     struct request *rq = iu->rq;
> > +
> > +     iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
> > +
> > +     if (softirq_enable) {
> > +             blk_mq_complete_request(rq);
> > +     } else {
> > +             ibnbd_put_tag(dev->sess, iu->tag);
> > +             blk_mq_end_request(rq, iu->status);
> > +     }
>
> Block drivers must call blk_mq_complete_request() instead of
> blk_mq_end_request() to complete a request after processing of the
> request has been started. Calling blk_mq_end_request() to complete a
> request is racy in case a timeout occurs while blk_mq_end_request() is
> in progress.

Hi Bart,

Could you elaborate a bit more? blk_mq_end_request() is an exported
function and is used by a lot of block drivers: scsi, dm, etc.
Is there an open bug report for this problem?

Regards,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-16 13:45         ` Bart Van Assche
@ 2019-09-17 15:41           ` Leon Romanovsky
  2019-09-17 15:52             ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Leon Romanovsky @ 2019-09-17 15:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jinpu Wang, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Mon, Sep 16, 2019 at 06:45:17AM -0700, Bart Van Assche wrote:
> On 9/15/19 10:27 PM, Leon Romanovsky wrote:
> > On Sun, Sep 15, 2019 at 04:30:04PM +0200, Jinpu Wang wrote:
> > > On Sat, Sep 14, 2019 at 12:10 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > > > > +/* TODO: should be configurable */
> > > > > +#define IBTRS_PORT 1234
> > > >
> > > > How about converting this macro into a kernel module parameter?
> > > Sounds good, will do.
> >
Don't rush to do it; defer it to be the last change before merging.
This is a controversial request which not everyone here will like.
>
> Hi Leon,
>
> If you do not agree with changing this macro into a kernel module parameter
> please suggest an alternative.

I didn't review the code, so my answer may not be fully accurate, but opening
some port to use this IB* seems strange from my non-sysadmin POV.
What about using RDMA-CM, like NVMe?

Thanks

>
> Thanks,
>
> Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-17 15:41           ` Leon Romanovsky
@ 2019-09-17 15:52             ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-17 15:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Tue, Sep 17, 2019 at 5:42 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Sep 16, 2019 at 06:45:17AM -0700, Bart Van Assche wrote:
> > On 9/15/19 10:27 PM, Leon Romanovsky wrote:
> > > On Sun, Sep 15, 2019 at 04:30:04PM +0200, Jinpu Wang wrote:
> > > > On Sat, Sep 14, 2019 at 12:10 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > > > > > +/* TODO: should be configurable */
> > > > > > +#define IBTRS_PORT 1234
> > > > >
> > > > > How about converting this macro into a kernel module parameter?
> > > > Sounds good, will do.
> > >
> > > Don't rush to do it; defer it to be the last change before merging.
> > > This is a controversial request which not everyone here will like.
> >
> > Hi Leon,
> >
> > If you do not agree with changing this macro into a kernel module parameter
> > please suggest an alternative.
>
> I didn't review the code, so my answer may not be fully accurate, but opening
> some port to use this IB* seems strange from my non-sysadmin POV.
> What about using RDMA-CM, like NVMe?
Hi Leon,

We are using rdma-cm; the port number here is the same as addr_trsvcid
in NVMe-oF: it controls which port rdma_listen() listens on.

Currently it's hardcoded. I've adapted the code to add a kernel
module parameter port_nr to ibnbd_server, so it's possible
to change it if the sysadmin wants.

Thanks,
-- 
Jack Wang
Linux Kernel Developer
Platform Engineering Compute (IONOS Cloud)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 16/25] ibnbd: client: private header with client structs and functions
  2019-09-13 22:25   ` [PATCH v4 16/25] ibnbd: client: private header with client structs and functions Bart Van Assche
@ 2019-09-17 16:36     ` Jinpu Wang
  2019-09-25 23:43       ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-09-17 16:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Sat, Sep 14, 2019 at 12:25 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +     char                    pathname[NAME_MAX];
> [ ... ]
>  > +    char                    blk_symlink_name[NAME_MAX];
>
> Please allocate path names dynamically instead of hard-coding the upper
> length for a path.
>
> Bart.
Hi Bart,

OK, I will dynamically allocate pathname and blk_symlink_name as you suggested.

Thank you
Jinpu

--
Jack Wang

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-17 13:09     ` Jinpu Wang
@ 2019-09-17 16:46       ` Bart Van Assche
  2019-09-18 12:02         ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-17 16:46 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On 9/17/19 6:09 AM, Jinpu Wang wrote:
>>> +static void ibnbd_softirq_done_fn(struct request *rq)
>>> +{
>>> +     struct ibnbd_clt_dev *dev       = rq->rq_disk->private_data;
>>> +     struct ibnbd_clt_session *sess  = dev->sess;
>>> +     struct ibnbd_iu *iu;
>>> +
>>> +     iu = blk_mq_rq_to_pdu(rq);
>>> +     ibnbd_put_tag(sess, iu->tag);
>>> +     blk_mq_end_request(rq, iu->status);
>>> +}
>>> +
>>> +static void msg_io_conf(void *priv, int errno)
>>> +{
>>> +     struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
>>> +     struct ibnbd_clt_dev *dev = iu->dev;
>>> +     struct request *rq = iu->rq;
>>> +
>>> +     iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
>>> +
>>> +     if (softirq_enable) {
>>> +             blk_mq_complete_request(rq);
>>> +     } else {
>>> +             ibnbd_put_tag(dev->sess, iu->tag);
>>> +             blk_mq_end_request(rq, iu->status);
>>> +     }
>>
>> Block drivers must call blk_mq_complete_request() instead of
>> blk_mq_end_request() to complete a request after processing of the
>> request has been started. Calling blk_mq_end_request() to complete a
>> request is racy in case a timeout occurs while blk_mq_end_request() is
>> in progress.
> 
> Could you elaborate a bit more? blk_mq_end_request() is an exported function
> used by a lot of block drivers: scsi, dm, etc.
> Is there an open bug report for the problem?

Hi Jinpu,

There is only one blk_mq_end_request() call in the SCSI code and it's 
inside the FC timeout handler (fc_bsg_job_timeout()). Calling 
blk_mq_end_request() from inside a timeout handler is fine but not to 
report to the block layer that a request has completed from outside the 
timeout handler after a request has started.

The device mapper calls blk_mq_complete_request() to report request 
completion to the block layer. See also dm_complete_request(). 
blk_mq_end_request() is only called by the device mapper from inside 
dm_softirq_done(). That last function is called from inside 
blk_mq_complete_request() and is not called directly.

The NVMe PCIe driver only calls blk_mq_end_request() from inside 
nvme_complete_rq(). nvme_complete_rq() is called by the PCIe driver from 
inside nvme_pci_complete_rq() and that last function is called from 
inside blk_mq_complete_request().

In other words, the SCSI core, the device mapper and the NVMe PCIe 
driver all use blk_mq_complete_request() to report request completion to 
the block layer from outside timeout handlers after a request has been 
started.

This is not a new requirement. I think that the legacy block layer 
equivalent, blk_complete_request(), was introduced in 2006 and that 
since then block drivers are required to call blk_complete_request() to 
report completion of requests from outside a timeout handler after these 
have been started.

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-17 11:39         ` Danil Kipnis
@ 2019-09-18  7:14           ` Danil Kipnis
  2019-09-18 15:47             ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-18  7:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang, Roman Pen

> > > On Sat, Sep 14, 2019 at 1:46 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > >> A more general question is why ibnbd needs its own queue management
> > >> while no other block driver needs this?
> > >
> > > Each IBNBD device promises to have a queue_depth (of say 512) on each
> > > of its num_cpus hardware queues. In fact we can only process a
> > > queue_depth inflights at once on the whole ibtrs session connecting a
> > > given client with a given server. Those 512 inflights (corresponding
> > > to the number of buffers reserved by the server for this particular
> > > client) have to be shared among all the devices mapped on this
> > > session. This leads to the situation, that we receive more requests
> > > than we can process at the moment. So we need to stop queues and start
> > > them again later in some fair fashion.
> >
> > Can a single CPU really sustain a queue depth of 512 commands? Is it
> > really necessary to have one hardware queue per CPU or is e.g. four
> > queues per NUMA node sufficient? Has it been considered to send the
> > number of hardware queues that the initiator wants to use and also the
> > command depth per queue during login to the target side? That would
> > allow the target side to allocate an independent set of buffers for each
> > initiator hardware queue and would allow to remove the queue management
> > at the initiator side. This might even yield better performance.
> We needed a way which would allow us to address one particular
> requirement: we'd like to be able to "enforce" that a response to an
> IO would be processed on the same CPU the IO was originally submitted
> on. In order to be able to do so we establish one rdma connection per
> cpu, each having a separate cq_vector. The administrator can then
> assign the corresponding IRQs to distinct CPUs. The server always
> replies to an IO on the same connection he received the request on. If
> the administrator did configure the /proc/irq/y/smp_affinity
> accordingly, the response sent by the server will generate interrupt
> on the same cpu, the IO was originally submitted on. The administrator
> can configure IRQs differently, for example assign a given irq
> (<->cq_vector) to a range of cpus belonging to a numa node, or
> whatever assignment is best for his use-case.
> Our transport module IBTRS establishes number of cpus connections
> between a client and a server. The user of the transport module (i.e.
> IBNBD) has no knowledge about the rdma connections, it only has a
> pointer to an abstract "session", which connects  him somehow to a
> remote host. IBNBD as a user of IBTRS creates block devices and uses a
> given "session" to send IOs from all the block devices it created for
> that session. That means IBNBD is limited in maximum number of his
> inflights toward a given remote host by the capability of the
> corresponding "session". So it needs to share the resources provided
> by the session (in our current model those resources are in fact some
> pre registered buffers on server side) among his devices.
> It is possible to extend the IBTRS API so that the user (IBNBD) could
> specify how many connections he wants to have on the session to be
> established. It is also possible to extend the ibtrs_clt_get_tag API
> (this is to get a send "permit") with a parameter specifying the
> connection, the future IO is to be send on.
> We now might have to change our communication model in IBTRS a bit in
> order to fix the potential security problem raised during the recent
> RDMA MC: https://etherpad.net/p/LPC2019_RDMA.
>
I'm not familiar with the dm code, but doesn't it need to deal with the
same situation? If I configure 100 logical volumes on top of a single
NVMe drive with X hardware queues, each queue_depth deep, then each dm
block device would need to advertise X hardware queues in order to
achieve the highest performance when only that one volume is accessed,
while in fact those X physical queues have to be shared among all 100
logical volumes when they are accessed in parallel.
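The IRQ steering described in the quoted text boils down to ordinary procfs writes. A hypothetical example (the IRQ number and CPU masks here are made up for illustration):

```sh
# Pin the completion IRQ of the per-CPU RDMA connection whose cq_vector
# is wired to IRQ 42 onto CPU 3 (mask 0x8), so the server's reply is
# processed on the CPU that submitted the I/O:
echo 8 > /proc/irq/42/smp_affinity

# Or spread it across the CPUs of one NUMA node instead:
echo 0-7 > /proc/irq/42/smp_affinity_list
```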

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-17 16:46       ` Bart Van Assche
@ 2019-09-18 12:02         ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-18 12:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Tue, Sep 17, 2019 at 6:46 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/17/19 6:09 AM, Jinpu Wang wrote:
> >>> +static void ibnbd_softirq_done_fn(struct request *rq)
> >>> +{
> >>> +     struct ibnbd_clt_dev *dev       = rq->rq_disk->private_data;
> >>> +     struct ibnbd_clt_session *sess  = dev->sess;
> >>> +     struct ibnbd_iu *iu;
> >>> +
> >>> +     iu = blk_mq_rq_to_pdu(rq);
> >>> +     ibnbd_put_tag(sess, iu->tag);
> >>> +     blk_mq_end_request(rq, iu->status);
> >>> +}
> >>> +
> >>> +static void msg_io_conf(void *priv, int errno)
> >>> +{
> >>> +     struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
> >>> +     struct ibnbd_clt_dev *dev = iu->dev;
> >>> +     struct request *rq = iu->rq;
> >>> +
> >>> +     iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
> >>> +
> >>> +     if (softirq_enable) {
> >>> +             blk_mq_complete_request(rq);
> >>> +     } else {
> >>> +             ibnbd_put_tag(dev->sess, iu->tag);
> >>> +             blk_mq_end_request(rq, iu->status);
> >>> +     }
> >>
> >> Block drivers must call blk_mq_complete_request() instead of
> >> blk_mq_end_request() to complete a request after processing of the
> >> request has been started. Calling blk_mq_end_request() to complete a
> >> request is racy in case a timeout occurs while blk_mq_end_request() is
> >> in progress.
> >
> > Could you elaborate a bit more? blk_mq_end_request() is an exported function
> > used by a lot of block drivers: scsi, dm, etc.
> > Is there an open bug report for the problem?
>
> Hi Jinpu,
>
> There is only one blk_mq_end_request() call in the SCSI code and it's
> inside the FC timeout handler (fc_bsg_job_timeout()). Calling
> blk_mq_end_request() from inside a timeout handler is fine but not to
> report to the block layer that a request has completed from outside the
> timeout handler after a request has started.
>
> The device mapper calls blk_mq_complete_request() to report request
> completion to the block layer. See also dm_complete_request().
> blk_mq_end_request() is only called by the device mapper from inside
> dm_softirq_done(). That last function is called from inside
> blk_mq_complete_request() and is not called directly.
>
> The NVMe PCIe driver only calls blk_mq_end_request() from inside
> nvme_complete_rq(). nvme_complete_rq() is called by the PCIe driver from
> inside nvme_pci_complete_rq() and that last function is called from
> inside blk_mq_complete_request().
>
> In other words, the SCSI core, the device mapper and the NVMe PCIe
> driver all use blk_mq_complete_request() to report request completion to
> the block layer from outside timeout handlers after a request has been
> started.
>
> This is not a new requirement. I think that the legacy block layer
> equivalent, blk_complete_request(), was introduced in 2006 and that
> since then block drivers are required to call blk_complete_request() to
> report completion of requests from outside a timeout handler after these
> have been started.
>
> Bart.

Thanks for the detailed explanation. I will switch to
blk_mq_complete_request() and will also drop the
softirq_enable module parameter, as it's not useful.
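For reference, with the softirq knob gone, the completion path from the code quoted above would reduce to something like this sketch (the tag is then always put in ibnbd_softirq_done_fn()):

```c
static void msg_io_conf(void *priv, int errno)
{
	struct ibnbd_iu *iu = priv;

	iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
	/*
	 * Always complete through blk_mq_complete_request() so that
	 * completion cannot race with the block layer timeout handler;
	 * ibnbd_softirq_done_fn() then puts the tag and ends the request.
	 */
	blk_mq_complete_request(iu->rq);
}
```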

Regards,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 24/25] ibnbd: a bit of documentation
  2019-09-13 23:58   ` [PATCH v4 24/25] ibnbd: a bit of documentation Bart Van Assche
@ 2019-09-18 12:22     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-18 12:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Sat, Sep 14, 2019 at 1:58 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > From: Roman Pen <roman.penyaev@profitbricks.com>
> >
> > README with description of major sysfs entries.
>
> Please have a look at Documentation/ABI/README and follow the
> instructions from that document.
>
> Thanks,
>
> Bart.

Thanks, I will move the sysfs description to
Documentation/ABI/testing/[sysfs-class-ibnbd-client|sysfs-block-ibnbd],
and will also move the ibtrs sysfs description there.


Regards,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-16 15:39       ` Jinpu Wang
@ 2019-09-18 15:26         ` Bart Van Assche
  2019-09-18 16:11           ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-18 15:26 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On 9/16/19 8:39 AM, Jinpu Wang wrote:
> - Roman's pb email address is no longer valid; will fix next round.
> 
> 
>>>
>>>> +static inline const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
>>>> +{
>>>> +     switch (mode) {
>>>> +     case IBNBD_FILEIO:
>>>> +             return "fileio";
>>>> +     case IBNBD_BLOCKIO:
>>>> +             return "blockio";
>>>> +     case IBNBD_AUTOIO:
>>>> +             return "autoio";
>>>> +     default:
>>>> +             return "unknown";
>>>> +     }
>>>> +}
>>>> +
>>>> +static inline const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
>>>> +{
>>>> +     switch (mode) {
>>>> +     case IBNBD_ACCESS_RO:
>>>> +             return "ro";
>>>> +     case IBNBD_ACCESS_RW:
>>>> +             return "rw";
>>>> +     case IBNBD_ACCESS_MIGRATION:
>>>> +             return "migration";
>>>> +     default:
>>>> +             return "unknown";
>>>> +     }
>>>> +}
>>>
>>> These two functions are not in the hot path and hence should not be
>>> inline functions.
>> Sounds reasonable, will remove the inline.
> inline was added to fix the -Wunused-function warning, e.g.:
> 
>    CC [M]  /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.o
> In file included from /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.h:34,
>                   from /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.c:33:
> /<<PKGBUILDDIR>>/ibnbd/ibnbd-proto.h:362:20: warning:
> 'ibnbd_access_mode_str' defined but not used [-Wunused-function]
>   static const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
>                      ^~~~~~~~~~~~~~~~~~~~~
> /<<PKGBUILDDIR>>/ibnbd/ibnbd-proto.h:348:20: warning:
> 'ibnbd_io_mode_str' defined but not used [-Wunused-function]
>   static const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
> 
> We would have to move both functions out of the header into a separate
> file if we really want to do it.
> The functions are simple and small, but if you insist, I will do it.

Please move these functions into a .c file. That will reduce the size of 
the kernel modules and will also reduce the size of the header file.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-18  7:14           ` Danil Kipnis
@ 2019-09-18 15:47             ` Bart Van Assche
  2019-09-20  8:29               ` Danil Kipnis
  2019-09-25 22:26               ` Danil Kipnis
  0 siblings, 2 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-18 15:47 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang, Roman Pen

On 9/18/19 12:14 AM, Danil Kipnis wrote:
> I'm not familiar with the dm code, but doesn't it need to deal with the
> same situation? If I configure 100 logical volumes on top of a single
> NVMe drive with X hardware queues, each queue_depth deep, then each dm
> block device would need to advertise X hardware queues in order to
> achieve the highest performance when only that one volume is accessed,
> while in fact those X physical queues have to be shared among all 100
> logical volumes when they are accessed in parallel.

Combining multiple queues (a) into a single queue (b) that is smaller 
than the combined source queues without sacrificing performance is 
tricky. We already have one such implementation in the block layer core 
and it took considerable time to get that implementation right. See e.g. 
blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart().

dm drivers are expected to return DM_MAPIO_REQUEUE or 
DM_MAPIO_DELAY_REQUEUE if the queue (b) is full. It turned out to be 
difficult to get this right in the dm-mpath driver and at the same time 
to achieve good performance.

The ibnbd driver introduces a third implementation of code that combines 
multiple (per-cpu) queues into one queue per CPU. It is considered 
important in the Linux kernel to avoid code duplication. Hence my 
question whether ibnbd can reuse the block layer infrastructure for 
sharing tag sets.
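The block layer infrastructure referred to here is a blk_mq_tag_set shared by several request queues, the way the NVMe driver shares one tag set across all namespaces of a controller. A hypothetical sketch of what that could look like for ibnbd (field and variable names are assumed, not taken from the patch):

```c
/* One tag set per IBTRS session, sized to the session's queue depth. */
sess->tag_set.ops = &ibnbd_mq_ops;
sess->tag_set.queue_depth = sess->queue_depth;
sess->tag_set.nr_hw_queues = num_online_cpus();
sess->tag_set.numa_node = NUMA_NO_NODE;
sess->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
sess->tag_set.cmd_size = sizeof(struct ibnbd_iu);

err = blk_mq_alloc_tag_set(&sess->tag_set);
if (err)
	return err;

/*
 * Every device mapped on this session creates its queue from the same
 * tag set; the block layer then throttles and restarts the shared
 * queues itself, instead of the driver managing them by hand.
 */
dev->queue = blk_mq_init_queue(&sess->tag_set);
if (IS_ERR(dev->queue))
	return PTR_ERR(dev->queue);
```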

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-13 23:46   ` [PATCH v4 17/25] ibnbd: client: main functionality Bart Van Assche
  2019-09-16 14:17     ` Danil Kipnis
  2019-09-17 13:09     ` Jinpu Wang
@ 2019-09-18 16:05     ` Jinpu Wang
  2 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-18 16:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

> > +static void destroy_gen_disk(struct ibnbd_clt_dev *dev)
> > +{
> > +     del_gendisk(dev->gd);
>
> > +     /*
> > +      * Before marking queue as dying (blk_cleanup_queue() does that)
> > +      * we have to be sure that everything in-flight has gone.
> > +      * Blink with freeze/unfreeze.
> > +      */
> > +     blk_mq_freeze_queue(dev->queue);
> > +     blk_mq_unfreeze_queue(dev->queue);
>
> Please remove the above seven lines. blk_cleanup_queue() calls
> blk_set_queue_dying() and the second call in blk_set_queue_dying() is
> blk_freeze_queue_start().
>
It was an old bug we had in 2016; we retested with newer kernels
(4.14+) and the bug is fixed.
I will remove the above seven lines.

Thanks
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers
  2019-09-18 15:26         ` Bart Van Assche
@ 2019-09-18 16:11           ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-18 16:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Wed, Sep 18, 2019 at 5:26 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/16/19 8:39 AM, Jinpu Wang wrote:
> > - Roman's pb email address is no longer valid; will fix next round.
> >
> >
> >>>
> >>>> +static inline const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
> >>>> +{
> >>>> +     switch (mode) {
> >>>> +     case IBNBD_FILEIO:
> >>>> +             return "fileio";
> >>>> +     case IBNBD_BLOCKIO:
> >>>> +             return "blockio";
> >>>> +     case IBNBD_AUTOIO:
> >>>> +             return "autoio";
> >>>> +     default:
> >>>> +             return "unknown";
> >>>> +     }
> >>>> +}
> >>>> +
> >>>> +static inline const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
> >>>> +{
> >>>> +     switch (mode) {
> >>>> +     case IBNBD_ACCESS_RO:
> >>>> +             return "ro";
> >>>> +     case IBNBD_ACCESS_RW:
> >>>> +             return "rw";
> >>>> +     case IBNBD_ACCESS_MIGRATION:
> >>>> +             return "migration";
> >>>> +     default:
> >>>> +             return "unknown";
> >>>> +     }
> >>>> +}
> >>>
> >>> These two functions are not in the hot path and hence should not be
> >>> inline functions.
> >> Sounds reasonable, will remove the inline.
> > inline was added to fix the -Wunused-function warning, e.g.:
> >
> >    CC [M]  /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.o
> > In file included from /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.h:34,
> >                   from /<<PKGBUILDDIR>>/ibnbd/ibnbd-clt.c:33:
> > /<<PKGBUILDDIR>>/ibnbd/ibnbd-proto.h:362:20: warning:
> > 'ibnbd_access_mode_str' defined but not used [-Wunused-function]
> >   static const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
> >                      ^~~~~~~~~~~~~~~~~~~~~
> > /<<PKGBUILDDIR>>/ibnbd/ibnbd-proto.h:348:20: warning:
> > 'ibnbd_io_mode_str' defined but not used [-Wunused-function]
> >   static const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
> >
> > We would have to move both functions out of the header into a separate
> > file if we really want to do it.
> > The functions are simple and small, but if you insist, I will do it.
>
> Please move these functions into a .c file. That will reduce the size of
> the kernel modules and will also reduce the size of the header file.
>
> Thanks,
>
> Bart.
>

Ok, will do.

Thanks,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 18/25] ibnbd: client: sysfs interface functions
       [not found] ` <20190620150337.7847-19-jinpuwang@gmail.com>
@ 2019-09-18 16:28   ` Bart Van Assche
  2019-09-19 15:55     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-18 16:28 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt

Including the line number in all messages is too much information. 
Please don't do this. Additionally, this will make the line number occur 
twice in messages produced by pr_debug().
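A conventional pr_fmt without the line number, as seen in most kernel modules, is simply:

```c
/* Prefix messages with the module name only; dynamic debug can add
 * file/line information to pr_debug() output at runtime if needed. */
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
```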

> +static unsigned int ibnbd_opt_mandatory[] = {
> +	IBNBD_OPT_PATH,
> +	IBNBD_OPT_DEV_PATH,
> +	IBNBD_OPT_SESSNAME,
> +};

Should this array have been declared const?

 > +/* remove new line from string */
 > +static void strip(char *s)
 > +{
 > +	char *p = s;
 > +
 > +	while (*s != '\0') {
 > +		if (*s != '\n')
 > +			*p++ = *s++;
 > +		else
 > +			++s;
 > +	}
 > +	*p = '\0';
 > +}

This function can remove a newline from the middle of a string. Are you 
sure that's what you want?

Is it useful to strip newline characters only and to keep other 
whitespace? Could this function be dropped and can the callers use 
strim() instead?

> +static int ibnbd_clt_parse_map_options(const char *buf,
> +				       char *sessname,
> +				       struct ibtrs_addr *paths,
> +				       size_t *path_cnt,
> +				       size_t max_path_cnt,
> +				       char *pathname,
> +				       enum ibnbd_access_mode *access_mode,
> +				       enum ibnbd_io_mode *io_mode)
> +{

Please introduce a structure for all the output parameters of this 
function and pass a pointer to that structure to this function. That 
will make it easier to introduce support for new parameters.

> +	char *options, *sep_opt;
> +	char *p;
> +	substring_t args[MAX_OPT_ARGS];
> +	int opt_mask = 0;
> +	int token;
> +	int ret = -EINVAL;
> +	int i;
> +	int p_cnt = 0;
> +
> +	options = kstrdup(buf, GFP_KERNEL);
> +	if (!options)
> +		return -ENOMEM;
> +
> +	sep_opt = strstrip(options);
> +	strip(sep_opt);

Are you sure that strstrip() does not remove trailing newline characters?

> +	while ((p = strsep(&sep_opt, " ")) != NULL) {
> +		if (!*p)
> +			continue;
> +
> +		token = match_token(p, ibnbd_opt_tokens, args);
> +		opt_mask |= token;
> +
> +		switch (token) {
> +		case IBNBD_OPT_SESSNAME:
> +			p = match_strdup(args);
> +			if (!p) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +			if (strlen(p) > NAME_MAX) {
> +				pr_err("map_device: sessname too long\n");
> +				ret = -EINVAL;
> +				kfree(p);
> +				goto out;
> +			}
> +			strlcpy(sessname, p, NAME_MAX);
> +			kfree(p);
> +			break;

Please change sessname from a fixed size buffer into a dynamically 
allocated buffer. That will remove the need to perform a strlcpy() and 
will also allow to remove the NAME_MAX checks.
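With a dynamically allocated field, the parser can simply adopt the buffer that match_strdup() already allocated. A sketch, assuming the output-structure variant suggested earlier in this mail (the `opt` pointer is hypothetical, not in the patch):

```c
case IBNBD_OPT_SESSNAME:
	p = match_strdup(args);
	if (!p) {
		ret = -ENOMEM;
		goto out;
	}
	kfree(opt->sessname);	/* in case the option was given twice */
	opt->sessname = p;	/* keep the kstrdup'ed buffer, no length cap */
	break;
```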

> +		case IBNBD_OPT_DEV_PATH:
> +			p = match_strdup(args);
> +			if (!p) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +			if (strlen(p) > NAME_MAX) {
> +				pr_err("map_device: Device path too long\n");
> +				ret = -EINVAL;
> +				kfree(p);
> +				goto out;
> +			}
> +			strlcpy(pathname, p, NAME_MAX);
> +			kfree(p);
> +			break;

Same comment here - please change pathname from a fixed-size array into 
a dynamically allocated buffer.

> +static ssize_t ibnbd_clt_state_show(struct kobject *kobj,
> +				    struct kobj_attribute *attr, char *page)
> +{
> +	struct ibnbd_clt_dev *dev;
> +
> +	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
> +
> +	switch (dev->dev_state) {
> +	case (DEV_STATE_INIT):
> +		return scnprintf(page, PAGE_SIZE, "init\n");
> +	case (DEV_STATE_MAPPED):
> +		/* TODO fix cli tool before changing to proper state */
> +		return scnprintf(page, PAGE_SIZE, "open\n");
> +	case (DEV_STATE_MAPPED_DISCONNECTED):
> +		/* TODO fix cli tool before changing to proper state */
> +		return scnprintf(page, PAGE_SIZE, "closed\n");
> +	case (DEV_STATE_UNMAPPED):
> +		return scnprintf(page, PAGE_SIZE, "unmapped\n");
> +	default:
> +		return scnprintf(page, PAGE_SIZE, "unknown\n");
> +	}
> +}

Please remove the superfluous parentheses from around the DEV_STATE_* 
constants.

Additionally, using scnprintf() here is overkill. snprintf() should be 
sufficient.

> +static struct kobj_attribute ibnbd_clt_state_attr =
> +	__ATTR(state, 0444, ibnbd_clt_state_show, NULL);

Please use DEVICE_ATTR_RO() instead of __ATTR() for all read-only 
attributes.
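With device attributes the __ATTR() boilerplate collapses as in the sketch below; note that DEVICE_ATTR_RO() requires the callback to be named <attr>_show and to take a struct device, so the kobj_attribute-based code above would need converting first:

```c
static ssize_t state_show(struct device *dev, struct device_attribute *attr,
			  char *page)
{
	/* ... body as in ibnbd_clt_state_show() above ... */
	return snprintf(page, PAGE_SIZE, "init\n");
}
static DEVICE_ATTR_RO(state);	/* defines dev_attr_state with mode 0444 */
```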

> +static ssize_t ibnbd_clt_unmap_dev_store(struct kobject *kobj,
> +					 struct kobj_attribute *attr,
> +					 const char *buf, size_t count)
> +{
> +	struct ibnbd_clt_dev *dev;
> +	char *opt, *options;
> +	bool force;
> +	int err;
> +
> +	opt = kstrdup(buf, GFP_KERNEL);
> +	if (!opt)
> +		return -ENOMEM;
> +
> +	options = strstrip(opt);
> +	strip(options);
> +
> +	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
> +
> +	if (sysfs_streq(options, "normal")) {
> +		force = false;
> +	} else if (sysfs_streq(options, "force")) {
> +		force = true;
> +	} else {
> +		ibnbd_err(dev, "unmap_device: Invalid value: %s\n", options);
> +		err = -EINVAL;
> +		goto out;
> +	}

Wasn't sysfs_streq() introduced to avoid having to duplicate and strip 
the input string?

> +	/*
> +	 * We take explicit module reference only for one reason: do not
> +	 * race with lockless ibnbd_destroy_sessions().
> +	 */
> +	if (!try_module_get(THIS_MODULE)) {
> +		err = -ENODEV;
> +		goto out;
> +	}
> +	err = ibnbd_clt_unmap_device(dev, force, &attr->attr);
> +	if (unlikely(err)) {
> +		if (unlikely(err != -EALREADY))
> +			ibnbd_err(dev, "unmap_device: %d\n",  err);
> +		goto module_put;
> +	}
> +
> +	/*
> +	 * Here device can be vanished!
> +	 */
> +
> +	err = count;
> +
> +module_put:
> +	module_put(THIS_MODULE);

I've never before seen a module_get() / module_put() pair inside a sysfs
callback function. Can this race be fixed by making
ibnbd_destroy_sessions() remove this sysfs attribute before it tries to 
destroy any sessions?

> +void ibnbd_clt_remove_dev_symlink(struct ibnbd_clt_dev *dev)
> +{
> +	/*
> +	 * The module_is_live() check is crucial and helps to avoid annoying
> +	 * sysfs warning raised in sysfs_remove_link(), when the whole sysfs
> +	 * path was just removed, see ibnbd_close_sessions().
> +	 */
> +	if (strlen(dev->blk_symlink_name) && module_is_live(THIS_MODULE))
> +		sysfs_remove_link(ibnbd_devs_kobj, dev->blk_symlink_name);
> +}

I haven't been able to find any other sysfs code that calls
module_is_live(). Please elaborate on why that check is needed.

> +int ibnbd_clt_create_sysfs_files(void)
> +{
> +	int err;
> +
> +	ibnbd_dev_class = class_create(THIS_MODULE, "ibnbd-client");
> +	if (unlikely(IS_ERR(ibnbd_dev_class)))
> +		return PTR_ERR(ibnbd_dev_class);
> +
> +	ibnbd_dev = device_create(ibnbd_dev_class, NULL,
> +				  MKDEV(0, 0), NULL, "ctl");
> +	if (unlikely(IS_ERR(ibnbd_dev))) {
> +		err = PTR_ERR(ibnbd_dev);
> +		goto cls_destroy;
> +	}
> +	ibnbd_devs_kobj = kobject_create_and_add("devices", &ibnbd_dev->kobj);
> +	if (unlikely(!ibnbd_devs_kobj)) {
> +		err = -ENOMEM;
> +		goto dev_destroy;
> +	}
> +	err = sysfs_create_group(&ibnbd_dev->kobj, &default_attr_group);
> +	if (unlikely(err))
> +		goto put_devs_kobj;
> +
> +	return 0;
> +
> +put_devs_kobj:
> +	kobject_del(ibnbd_devs_kobj);
> +	kobject_put(ibnbd_devs_kobj);
> +dev_destroy:
> +	device_destroy(ibnbd_dev_class, MKDEV(0, 0));
> +cls_destroy:
> +	class_destroy(ibnbd_dev_class);
> +
> +	return err;
> +}

I think this is the wrong way to create a device node because this 
approach will inform udev about device creation before the sysfs group 
has been created. Please use device_create_with_groups() instead of 
calling device_create() and sysfs_create_group() separately.

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 20/25] ibnbd: server: main functionality
       [not found] ` <20190620150337.7847-21-jinpuwang@gmail.com>
@ 2019-09-18 17:41   ` Bart Van Assche
  2019-09-20  7:36     ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-18 17:41 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt

Same comment here as for a previous patch - please do not include line 
number information in pr_fmt().

> +MODULE_AUTHOR("ibnbd@profitbricks.com");
> +MODULE_VERSION(IBNBD_VER_STRING);
> +MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
> +MODULE_LICENSE("GPL");

Please remove the version number (MODULE_VERSION()).

> +static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;

Please change dev_search_path[] into a dynamically allocated string to 
avoid a hard-coded length limit.

> +	if (dup[strlen(dup) - 1] == '\n')
> +		dup[strlen(dup) - 1] = '\0';

Can this be changed into a call to strim()?

> +static void ibnbd_endio(void *priv, int error)
> +{
> +	struct ibnbd_io_private *ibnbd_priv = priv;
> +	struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
> +
> +	ibnbd_put_sess_dev(sess_dev);
> +
> +	ibtrs_srv_resp_rdma(ibnbd_priv->id, error);
> +
> +	kfree(priv);
> +}

Since ibtrs_srv_resp_rdma() starts an RDMA WRITE without waiting for the 
write completion, shouldn't the session reference be retained until the 
completion for that RDMA WRITE has been received? In other words, is 
there a risk with the current approach that the buffer that is being 
transferred to the client will be freed before the RDMA WRITE has finished?

> +static struct ibnbd_srv_sess_dev *
> +ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess)
> +{
> +	struct ibnbd_srv_sess_dev *sess_dev;
> +	int ret = 0;
> +
> +	read_lock(&srv_sess->index_lock);
> +	sess_dev = idr_find(&srv_sess->index_idr, dev_id);
> +	if (likely(sess_dev))
> +		ret = kref_get_unless_zero(&sess_dev->kref);
> +	read_unlock(&srv_sess->index_lock);
> +
> +	if (unlikely(!sess_dev || !ret))
> +		return ERR_PTR(-ENXIO);
> +
> +	return sess_dev;
> +}

Something that is not important: isn't the sess_dev check superfluous in 
the if-statement just above the return statement? If ret == 1, does that 
imply that sess_dev != NULL?

Has it been considered to return -ENODEV instead of -ENXIO if no device 
is found?

> +static int create_sess(struct ibtrs_srv *ibtrs)
> +{
 > [ ... ]
> +	strlcpy(srv_sess->sessname, sessname, sizeof(srv_sess->sessname));

Please change the session name into a dynamically allocated string such 
that strdup() can be used instead of strlcpy().

> +static int process_msg_open(struct ibtrs_srv *ibtrs,
> +			    struct ibnbd_srv_session *srv_sess,
> +			    const void *msg, size_t len,
> +			    void *data, size_t datalen);
> +
> +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> +				 struct ibnbd_srv_session *srv_sess,
> +				 const void *msg, size_t len,
> +				 void *data, size_t datalen);

Can the code be reordered such that these forward declarations can be 
dropped?

> +static struct ibnbd_srv_sess_dev *
> +ibnbd_srv_create_set_sess_dev(struct ibnbd_srv_session *srv_sess,
> +			      const struct ibnbd_msg_open *open_msg,
> +			      struct ibnbd_dev *ibnbd_dev, fmode_t open_flags,
> +			      struct ibnbd_srv_dev *srv_dev)
> +{
> +	struct ibnbd_srv_sess_dev *sdev = ibnbd_sess_dev_alloc(srv_sess);
> +
> +	if (IS_ERR(sdev))
> +		return sdev;
> +
> +	kref_init(&sdev->kref);
> +
> +	strlcpy(sdev->pathname, open_msg->dev_name, sizeof(sdev->pathname));

Can the path name be changed into a dynamically allocated string?

> +static char *ibnbd_srv_get_full_path(struct ibnbd_srv_session *srv_sess,
> +				     const char *dev_name)
> +{
> +	char *full_path;
> +	char *a, *b;
> +
> +	full_path = kmalloc(PATH_MAX, GFP_KERNEL);
> +	if (!full_path)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * Replace %SESSNAME% with a real session name in order to
> +	 * create device namespace.
> +	 */
> +	a = strnstr(dev_search_path, "%SESSNAME%", sizeof(dev_search_path));
> +	if (a) {
> +		int len = a - dev_search_path;
> +
> +		len = snprintf(full_path, PATH_MAX, "%.*s/%s/%s", len,
> +			       dev_search_path, srv_sess->sessname, dev_name);
> +		if (len >= PATH_MAX) {
> +			pr_err("Tooooo looong path: %s, %s, %s\n",
> +			       dev_search_path, srv_sess->sessname, dev_name);
> +			kfree(full_path);
> +			return ERR_PTR(-EINVAL);
> +		}
> +	} else {
> +		snprintf(full_path, PATH_MAX, "%s/%s",
> +			 dev_search_path, dev_name);
> +	}

Has it been considered to use kasprintf() instead of kmalloc() + snprintf()?

> +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> +				 struct ibnbd_srv_session *srv_sess,
> +				 const void *msg, size_t len,
> +				 void *data, size_t datalen)
> +{
> +	const struct ibnbd_msg_sess_info *sess_info_msg = msg;
> +	struct ibnbd_msg_sess_info_rsp *rsp = data;
> +
> +	srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> +	pr_debug("Session %s using protocol version %d (client version: %d,"
> +		 " server version: %d)\n", srv_sess->sessname,
> +		 srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);

Has this patch been verified with checkpatch? I think checkpatch 
recommends not to split literal strings.

> +/**
> + * find_srv_sess_dev() - a dev is already opened by this name
> + *
> + * Return struct ibnbd_srv_sess_dev if srv_sess already opened the dev_name
> + * NULL if the session didn't open the device yet.
> + */
> +static struct ibnbd_srv_sess_dev *
> +find_srv_sess_dev(struct ibnbd_srv_session *srv_sess, const char *dev_name)
> +{
> +	struct ibnbd_srv_sess_dev *sess_dev;
> +
> +	if (list_empty(&srv_sess->sess_dev_list))
> +		return NULL;
> +
> +	list_for_each_entry(sess_dev, &srv_sess->sess_dev_list, sess_list)
> +		if (!strcmp(sess_dev->pathname, dev_name))
> +			return sess_dev;
> +
> +	return NULL;
> +}

Is the explicit list_empty() check really necessary? Would the behavior 
of this function change if that check were left out?

Has the posted code been compiled with W=1? I'm asking this because the 
documentation of the function arguments is missing from the kernel-doc 
header. I expect that a warning will be reported if this code is 
compiled with W=1.

Thanks,

Bart.


* Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
       [not found] ` <20190620150337.7847-22-jinpuwang@gmail.com>
@ 2019-09-18 21:46   ` Bart Van Assche
  2019-09-26 14:04     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-18 21:46 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt

Same comment as for a previous patch: please do not include line number 
information in pr_fmt().

> +static int ibnbd_dev_vfs_open(struct ibnbd_dev *dev, const char *path,
> +			      fmode_t flags)
> +{
> +	int oflags = O_DSYNC; /* enable write-through */
> +
> +	if (flags & FMODE_WRITE)
> +		oflags |= O_RDWR;
> +	else if (flags & FMODE_READ)
> +		oflags |= O_RDONLY;
> +	else
> +		return -EINVAL;
> +
> +	dev->file = filp_open(path, oflags, 0);
> +	return PTR_ERR_OR_ZERO(dev->file);
> +}

Isn't the use of O_DSYNC something that should be configurable?

> +struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
> +				 enum ibnbd_io_mode mode, struct bio_set *bs,
> +				 ibnbd_dev_io_fn io_cb)
> +{
> +	struct ibnbd_dev *dev;
> +	int ret;
> +
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +	if (!dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (mode == IBNBD_BLOCKIO) {
> +		dev->blk_open_flags = flags;
> +		ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> +		if (ret)
> +			goto err;
> +	} else if (mode == IBNBD_FILEIO) {
> +		dev->blk_open_flags = FMODE_READ;
> +		ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> +		if (ret)
> +			goto err;
> +
> +		ret = ibnbd_dev_vfs_open(dev, path, flags);
> +		if (ret)
> +			goto blk_put;

This looks really weird. Why call ibnbd_dev_blk_open() first for file 
I/O mode? Why set dev->blk_open_flags to FMODE_READ in file I/O mode?

> +static int ibnbd_dev_blk_submit_io(struct ibnbd_dev *dev, sector_t sector,
> +				   void *data, size_t len, u32 bi_size,
> +				   enum ibnbd_io_flags flags, short prio,
> +				   void *priv)
> +{
> +	struct request_queue *q = bdev_get_queue(dev->bdev);
> +	struct ibnbd_dev_blk_io *io;
> +	struct bio *bio;
> +
> +	/* check if the buffer is suitable for bdev */
> +	if (unlikely(WARN_ON(!blk_rq_aligned(q, (unsigned long)data, len))))
> +		return -EINVAL;
> +
> +	/* Generate bio with pages pointing to the rdma buffer */
> +	bio = ibnbd_bio_map_kern(q, data, dev->ibd_bio_set, len, GFP_KERNEL);
> +	if (unlikely(IS_ERR(bio)))
> +		return PTR_ERR(bio);
> +
> +	io = kmalloc(sizeof(*io), GFP_KERNEL);
> +	if (unlikely(!io)) {
> +		bio_put(bio);
> +		return -ENOMEM;
> +	}
> +
> +	io->dev		= dev;
> +	io->priv	= priv;
> +
> +	bio->bi_end_io		= ibnbd_dev_bi_end_io;
> +	bio->bi_private		= io;
> +	bio->bi_opf		= ibnbd_to_bio_flags(flags);
> +	bio->bi_iter.bi_sector	= sector;
> +	bio->bi_iter.bi_size	= bi_size;
> +	bio_set_prio(bio, prio);
> +	bio_set_dev(bio, dev->bdev);
> +
> +	submit_bio(bio);
> +
> +	return 0;
> +}

Can struct bio and struct ibnbd_dev_blk_io be combined into a single 
data structure by passing the size of the latter data structure as the 
front_pad argument to bioset_init()?

> +static void ibnbd_dev_file_submit_io_worker(struct work_struct *w)
> +{
> +	struct ibnbd_dev_file_io_work *dev_work;
> +	struct file *f;
> +	int ret, len;
> +	loff_t off;
> +
> +	dev_work = container_of(w, struct ibnbd_dev_file_io_work, work);
> +	off = dev_work->sector * ibnbd_dev_get_logical_bsize(dev_work->dev);
> +	f = dev_work->dev->file;
> +	len = dev_work->bi_size;
> +
> +	if (ibnbd_op(dev_work->flags) == IBNBD_OP_FLUSH) {
> +		ret = ibnbd_dev_file_handle_flush(dev_work, off);
> +		if (unlikely(ret))
> +			goto out;
> +	}
> +
> +	if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE_SAME) {
> +		ret = ibnbd_dev_file_handle_write_same(dev_work);
> +		if (unlikely(ret))
> +			goto out;
> +	}
> +
> +	/* TODO Implement support for DIRECT */
> +	if (dev_work->bi_size) {
> +		loff_t off_tmp = off;
> +
> +		if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE)
> +			ret = kernel_write(f, dev_work->data, dev_work->bi_size,
> +					   &off_tmp);
> +		else
> +			ret = kernel_read(f, dev_work->data, dev_work->bi_size,
> +					  &off_tmp);
> +
> +		if (unlikely(ret < 0)) {
> +			goto out;
> +		} else if (unlikely(ret != dev_work->bi_size)) {
> +			/* TODO implement support for partial completions */
> +			ret = -EIO;
> +			goto out;
> +		} else {
> +			ret = 0;
> +		}
> +	}
> +
> +	if (dev_work->flags & IBNBD_F_FUA)
> +		ret = ibnbd_dev_file_handle_fua(dev_work, off);
> +out:
> +	dev_work->dev->io_cb(dev_work->priv, ret);
> +	kfree(dev_work);
> +}
> +
> +static int ibnbd_dev_file_submit_io(struct ibnbd_dev *dev, sector_t sector,
> +				    void *data, size_t len, size_t bi_size,
> +				    enum ibnbd_io_flags flags, void *priv)
> +{
> +	struct ibnbd_dev_file_io_work *w;
> +
> +	if (!ibnbd_flags_supported(flags)) {
> +		pr_info_ratelimited("Unsupported I/O flags: 0x%x on device "
> +				    "%s\n", flags, dev->name);
> +		return -ENOTSUPP;
> +	}
> +
> +	w = kmalloc(sizeof(*w), GFP_KERNEL);
> +	if (!w)
> +		return -ENOMEM;
> +
> +	w->dev		= dev;
> +	w->priv		= priv;
> +	w->sector	= sector;
> +	w->data		= data;
> +	w->len		= len;
> +	w->bi_size	= bi_size;
> +	w->flags	= flags;
> +	INIT_WORK(&w->work, ibnbd_dev_file_submit_io_worker);
> +
> +	if (unlikely(!queue_work(fileio_wq, &w->work))) {
> +		kfree(w);
> +		return -EEXIST;
> +	}
> +
> +	return 0;
> +}

Please use the in-kernel asynchronous I/O API instead of kernel_read() 
and kernel_write() and remove the fileio_wq workqueue. Examples of how 
to use call_read_iter() and call_write_iter() are available in the loop 
driver and also in drivers/target/target_core_file.c.

> +/** ibnbd_dev_init() - Initialize ibnbd_dev
> + *
> + * This functions initialized the ibnbd-dev component.
> + * It has to be called 1x time before ibnbd_dev_open() is used
> + */
> +int ibnbd_dev_init(void);

It is great to see kernel-doc headers above functions, but I'm not sure 
these should be in .h files. I think most kernel developers prefer to 
see kernel-doc headers for functions in .c files because that makes it 
more likely that the implementation and the documentation stay in sync.

Thanks,

Bart.


* Re: [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
  2019-09-13 23:56   ` Bart Van Assche
@ 2019-09-19 10:30     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-19 10:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Sat, Sep 14, 2019 at 1:56 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > From: Roman Pen <roman.penyaev@profitbricks.com>
> >
> > Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
> > Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
> > ---
> >   MAINTAINERS | 14 ++++++++++++++
> >   1 file changed, 14 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index a6954776a37e..0b7fd93f738d 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -7590,6 +7590,20 @@ IBM ServeRAID RAID DRIVER
> >   S:  Orphan
> >   F:  drivers/scsi/ips.*
> >
> > +IBNBD BLOCK DRIVERS
> > +M:   IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> > +L:   linux-block@vger.kernel.org
> > +S:   Maintained
> > +T:   git git://github.com/profitbricks/ibnbd.git
> > +F:   drivers/block/ibnbd/
> > +
> > +IBTRS TRANSPORT DRIVERS
> > +M:   IBNBD/IBTRS Storage Team <ibnbd@cloud.ionos.com>
> > +L:   linux-rdma@vger.kernel.org
> > +S:   Maintained
> > +T:   git git://github.com/profitbricks/ibnbd.git
> > +F:   drivers/infiniband/ulp/ibtrs/
> > +
> >   ICH LPC AND GPIO DRIVER
> >   M:  Peter Tyser <ptyser@xes-inc.com>
> >   S:  Maintained
>
> I think the T: entry is for kernel trees against which developers should
> prepare their patches. Since the ibnbd repository on github is an
> out-of-tree kernel driver I don't think that it should appear in the
> MAINTAINERS file.
>
> Bart.
>
>
OK, we will remove the link to GitHub.

Thanks,
Jinpu


* Re: [PATCH v4 18/25] ibnbd: client: sysfs interface functions
  2019-09-18 16:28   ` [PATCH v4 18/25] ibnbd: client: sysfs interface functions Bart Van Assche
@ 2019-09-19 15:55     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-19 15:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Wed, Sep 18, 2019 at 6:28 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
>
> Including the line number in all messages is too much information.
> Please don't do this. Additionally, this will make the line number occur
> twice in messages produced by pr_debug().
We feel it's quite handy for debugging to have the line number; I checked
in mainline, and some drivers even include __func__ and __LINE__.
I also did a test: the line number occurs only once from pr_debug().
>
> > +static unsigned int ibnbd_opt_mandatory[] = {
> > +     IBNBD_OPT_PATH,
> > +     IBNBD_OPT_DEV_PATH,
> > +     IBNBD_OPT_SESSNAME,
> > +};
>
> Should this array have been declared const?
Sounds good.
>
>  > +/* remove new line from string */
>  > +static void strip(char *s)
>  > +{
>  > +    char *p = s;
>  > +
>  > +    while (*s != '\0') {
>  > +            if (*s != '\n')
>  > +                    *p++ = *s++;
>  > +            else
>  > +                    ++s;
>  > +    }
>  > +    *p = '\0';
>  > +}
>
> This function can remove a newline from the middle of a string. Are you
> sure that's what you want?
Yes, we want to strip all newlines in the string.
> Is it useful to strip newline characters only and to keep other
> whitespace? Could this function be dropped and can the callers use
> strim() instead?
 We call strstrip()/strim() afterwards to remove the whitespace.
>
> > +static int ibnbd_clt_parse_map_options(const char *buf,
> > +                                    char *sessname,
> > +                                    struct ibtrs_addr *paths,
> > +                                    size_t *path_cnt,
> > +                                    size_t max_path_cnt,
> > +                                    char *pathname,
> > +                                    enum ibnbd_access_mode *access_mode,
> > +                                    enum ibnbd_io_mode *io_mode)
> > +{
>
> Please introduce a structure for all the output parameters of this
> function and pass a pointer to that structure to this function. That
> will make it easier to introduce support for new parameters.
>
> > +     char *options, *sep_opt;
> > +     char *p;
> > +     substring_t args[MAX_OPT_ARGS];
> > +     int opt_mask = 0;
> > +     int token;
> > +     int ret = -EINVAL;
> > +     int i;
> > +     int p_cnt = 0;
> > +
> > +     options = kstrdup(buf, GFP_KERNEL);
> > +     if (!options)
> > +             return -ENOMEM;
> > +
> > +     sep_opt = strstrip(options);
> > +     strip(sep_opt);
>
> Are you sure that strstrip() does not remove trailing newline characters?
It does trim a trailing newline (newlines are whitespace), but only at
the ends of the string; we still need strip() for newlines in the middle.
>
> > +     while ((p = strsep(&sep_opt, " ")) != NULL) {
> > +             if (!*p)
> > +                     continue;
> > +
> > +             token = match_token(p, ibnbd_opt_tokens, args);
> > +             opt_mask |= token;
> > +
> > +             switch (token) {
> > +             case IBNBD_OPT_SESSNAME:
> > +                     p = match_strdup(args);
> > +                     if (!p) {
> > +                             ret = -ENOMEM;
> > +                             goto out;
> > +                     }
> > +                     if (strlen(p) > NAME_MAX) {
> > +                             pr_err("map_device: sessname too long\n");
> > +                             ret = -EINVAL;
> > +                             kfree(p);
> > +                             goto out;
> > +                     }
> > +                     strlcpy(sessname, p, NAME_MAX);
> > +                     kfree(p);
> > +                     break;
>
> Please change sessname from a fixed size buffer into a dynamically
> allocated buffer. That will remove the need to perform a strlcpy() and
> will also allow to remove the NAME_MAX checks.
We can change sessname to be dynamically allocated, but I think the
NAME_MAX check is not in conflict with that; we don't want a sessname
that long anyway.

>
> > +             case IBNBD_OPT_DEV_PATH:
> > +                     p = match_strdup(args);
> > +                     if (!p) {
> > +                             ret = -ENOMEM;
> > +                             goto out;
> > +                     }
> > +                     if (strlen(p) > NAME_MAX) {
> > +                             pr_err("map_device: Device path too long\n");
> > +                             ret = -EINVAL;
> > +                             kfree(p);
> > +                             goto out;
> > +                     }
> > +                     strlcpy(pathname, p, NAME_MAX);
> > +                     kfree(p);
> > +                     break;
>
> Same comment here - please change pathname from a fixed-size array into
> a dynamically allocated buffer.
Ditto
>
> > +static ssize_t ibnbd_clt_state_show(struct kobject *kobj,
> > +                                 struct kobj_attribute *attr, char *page)
> > +{
> > +     struct ibnbd_clt_dev *dev;
> > +
> > +     dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
> > +
> > +     switch (dev->dev_state) {
> > +     case (DEV_STATE_INIT):
> > +             return scnprintf(page, PAGE_SIZE, "init\n");
> > +     case (DEV_STATE_MAPPED):
> > +             /* TODO fix cli tool before changing to proper state */
> > +             return scnprintf(page, PAGE_SIZE, "open\n");
> > +     case (DEV_STATE_MAPPED_DISCONNECTED):
> > +             /* TODO fix cli tool before changing to proper state */
> > +             return scnprintf(page, PAGE_SIZE, "closed\n");
> > +     case (DEV_STATE_UNMAPPED):
> > +             return scnprintf(page, PAGE_SIZE, "unmapped\n");
> > +     default:
> > +             return scnprintf(page, PAGE_SIZE, "unknown\n");
> > +     }
> > +}
>
> Please remove the superfluous parentheses from around the DEV_STATE_*
> constants.
>
> Additionally, using scnprintf() here is overkill. snprintf() should be
> sufficient.
You're right, will address both.
>
> > +static struct kobj_attribute ibnbd_clt_state_attr =
> > +     __ATTR(state, 0444, ibnbd_clt_state_show, NULL);
>
> Please use DEVICE_ATTR_RO() instead of __ATTR() for all read-only
> attributes.
DEVICE_ATTR_RO() doesn't fit here; will use __ATTR_RO(), thanks
>
> > +static ssize_t ibnbd_clt_unmap_dev_store(struct kobject *kobj,
> > +                                      struct kobj_attribute *attr,
> > +                                      const char *buf, size_t count)
> > +{
> > +     struct ibnbd_clt_dev *dev;
> > +     char *opt, *options;
> > +     bool force;
> > +     int err;
> > +
> > +     opt = kstrdup(buf, GFP_KERNEL);
> > +     if (!opt)
> > +             return -ENOMEM;
> > +
> > +     options = strstrip(opt);
> > +     strip(options);
> > +
> > +     dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
> > +
> > +     if (sysfs_streq(options, "normal")) {
> > +             force = false;
> > +     } else if (sysfs_streq(options, "force")) {
> > +             force = true;
> > +     } else {
> > +             ibnbd_err(dev, "unmap_device: Invalid value: %s\n", options);
> > +             err = -EINVAL;
> > +             goto out;
> > +     }
>
> Wasn't sysfs_streq() introduced to avoid having to duplicate and strip
> the input string?
sysfs_streq() is only tolerant of a trailing newline. We use strstrip()
to strip whitespace and strip() for newlines.

>
> > +     /*
> > +      * We take explicit module reference only for one reason: do not
> > +      * race with lockless ibnbd_destroy_sessions().
> > +      */
> > +     if (!try_module_get(THIS_MODULE)) {
> > +             err = -ENODEV;
> > +             goto out;
> > +     }
> > +     err = ibnbd_clt_unmap_device(dev, force, &attr->attr);
> > +     if (unlikely(err)) {
> > +             if (unlikely(err != -EALREADY))
> > +                     ibnbd_err(dev, "unmap_device: %d\n",  err);
> > +             goto module_put;
> > +     }
> > +
> > +     /*
> > +      * Here device can be vanished!
> > +      */
> > +
> > +     err = count;
> > +
> > +module_put:
> > +     module_put(THIS_MODULE);
>
> I've never before seen a module_get() / module_put() pair inside a sysfs
>   callback function. Can this race be fixed by making
> ibnbd_destroy_sessions() remove this sysfs attribute before it tries to
> destroy any sessions?
That's the first thing we do in ibnbd_destroy_sessions already.
>
> > +void ibnbd_clt_remove_dev_symlink(struct ibnbd_clt_dev *dev)
> > +{
> > +     /*
> > +      * The module_is_live() check is crucial and helps to avoid annoying
> > +      * sysfs warning raised in sysfs_remove_link(), when the whole sysfs
> > +      * path was just removed, see ibnbd_close_sessions().
> > +      */
> > +     if (strlen(dev->blk_symlink_name) && module_is_live(THIS_MODULE))
> > +             sysfs_remove_link(ibnbd_devs_kobj, dev->blk_symlink_name);
> > +}
>
> I haven't been able to find any other sysfs code that calls
> module_is_live()? Please elaborate why that check is needed.

The reason might be lost to history; I can retest without the module_*
check to see whether our tests still pass.

>
> > +int ibnbd_clt_create_sysfs_files(void)
> > +{
> > +     int err;
> > +
> > +     ibnbd_dev_class = class_create(THIS_MODULE, "ibnbd-client");
> > +     if (unlikely(IS_ERR(ibnbd_dev_class)))
> > +             return PTR_ERR(ibnbd_dev_class);
> > +
> > +     ibnbd_dev = device_create(ibnbd_dev_class, NULL,
> > +                               MKDEV(0, 0), NULL, "ctl");
> > +     if (unlikely(IS_ERR(ibnbd_dev))) {
> > +             err = PTR_ERR(ibnbd_dev);
> > +             goto cls_destroy;
> > +     }
> > +     ibnbd_devs_kobj = kobject_create_and_add("devices", &ibnbd_dev->kobj);
> > +     if (unlikely(!ibnbd_devs_kobj)) {
> > +             err = -ENOMEM;
> > +             goto dev_destroy;
> > +     }
> > +     err = sysfs_create_group(&ibnbd_dev->kobj, &default_attr_group);
> > +     if (unlikely(err))
> > +             goto put_devs_kobj;
> > +
> > +     return 0;
> > +
> > +put_devs_kobj:
> > +     kobject_del(ibnbd_devs_kobj);
> > +     kobject_put(ibnbd_devs_kobj);
> > +dev_destroy:
> > +     device_destroy(ibnbd_dev_class, MKDEV(0, 0));
> > +cls_destroy:
> > +     class_destroy(ibnbd_dev_class);
> > +
> > +     return err;
> > +}
>
> I think this is the wrong way to create a device node because this
> approach will inform udev about device creation before the sysfs group
> has been created. Please use device_create_with_groups() instead of
> calling device_create() and sysfs_create_group() separately.
>
> Bart.
I wasn't aware of device_create_with_groups(); I'll try it out.

Thanks,
Jinpu


* Re: [PATCH v4 20/25] ibnbd: server: main functionality
  2019-09-18 17:41   ` [PATCH v4 20/25] ibnbd: server: main functionality Bart Van Assche
@ 2019-09-20  7:36     ` Danil Kipnis
  2019-09-20 15:42       ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-20  7:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Roman Pen, Jack Wang

On Wed, Sep 18, 2019 at 7:41 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
>
> Same comment here as for a previous patch - please do not include line
> number information in pr_fmt().

Will drop it, thanks.

> > +MODULE_AUTHOR("ibnbd@profitbricks.com");
> > +MODULE_VERSION(IBNBD_VER_STRING);
> > +MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
> > +MODULE_LICENSE("GPL");
>
> Please remove the version number (MODULE_VERSION()).

OK.

> > +static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
>
> Please change dev_search_path[] into a dynamically allocated string to
> avoid a hard-coded length limit.

OK.

> > +     if (dup[strlen(dup) - 1] == '\n')
> > +             dup[strlen(dup) - 1] = '\0';
>
> Can this be changed into a call to strim()?

A directory name can start and end with spaces; for example, this
works: mkdir "     x      "

> > +static void ibnbd_endio(void *priv, int error)
> > +{
> > +     struct ibnbd_io_private *ibnbd_priv = priv;
> > +     struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
> > +
> > +     ibnbd_put_sess_dev(sess_dev);
> > +
> > +     ibtrs_srv_resp_rdma(ibnbd_priv->id, error);
> > +
> > +     kfree(priv);
> > +}
>
> Since ibtrs_srv_resp_rdma() starts an RDMA WRITE without waiting for the
> write completion, shouldn't the session reference be retained until the
> completion for that RDMA WRITE has been received? In other words, is
> there a risk with the current approach that the buffer that is being
> transferred to the client will be freed before the RDMA WRITE has finished?

ibtrs-srv.c keeps track of inflights. When closing a session, it
first marks the queue as closing, so that no new write requests are
posted when IBNBD calls ibtrs_srv_resp_rdma():
1831         if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING,
1832                                            &old_state)
Then ibtrs-srv schedules ibtrs_srv_close_work, which drains the
queue and then waits for all inflights to return from IBNBD:
...
1274                 ib_drain_qp(con->c.qp);
1275         }
1276         /* Wait for all inflights */
1277         ibtrs_srv_wait_ops_ids(sess);
....
Only then can the resources be deallocated:
1282         unmap_cont_bufs(sess);
1283         ibtrs_srv_free_ops_ids(sess);

>
> > +static struct ibnbd_srv_sess_dev *
> > +ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess)
> > +{
> > +     struct ibnbd_srv_sess_dev *sess_dev;
> > +     int ret = 0;
> > +
> > +     read_lock(&srv_sess->index_lock);
> > +     sess_dev = idr_find(&srv_sess->index_idr, dev_id);
> > +     if (likely(sess_dev))
> > +             ret = kref_get_unless_zero(&sess_dev->kref);
> > +     read_unlock(&srv_sess->index_lock);
> > +
> > +     if (unlikely(!sess_dev || !ret))
> > +             return ERR_PTR(-ENXIO);
> > +
> > +     return sess_dev;
> > +}
>
> Something that is not important: isn't the sess_dev check superfluous in
> the if-statement just above the return statement? If ret == 1, does that
> imply that sess_dev != 0 ?

We want to have found the device (sess_dev != NULL) and to have been
able to take a reference to it (ret != 0)... You are right: if
ret != 0, then sess_dev can't be NULL.

> Has it been considered to return -ENODEV instead of -ENXIO if no device
> is found?

The backend block device, e.g. /dev/nullb0, is still there and might
even still be exported over other session(s). So we thought "No such
device or address" is more appropriate.

>
> > +static int create_sess(struct ibtrs_srv *ibtrs)
> > +{
>  > [ ... ]
> > +     strlcpy(srv_sess->sessname, sessname, sizeof(srv_sess->sessname));
>
> Please change the session name into a dynamically allocated string such
> that strdup() can be used instead of strlcpy().

OK.

>
> > +static int process_msg_open(struct ibtrs_srv *ibtrs,
> > +                         struct ibnbd_srv_session *srv_sess,
> > +                         const void *msg, size_t len,
> > +                         void *data, size_t datalen);
> > +
> > +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> > +                              struct ibnbd_srv_session *srv_sess,
> > +                              const void *msg, size_t len,
> > +                              void *data, size_t datalen);
>
> Can the code be reordered such that these forward declarations can be
> dropped?

Will try to.

>
> > +static struct ibnbd_srv_sess_dev *
> > +ibnbd_srv_create_set_sess_dev(struct ibnbd_srv_session *srv_sess,
> > +                           const struct ibnbd_msg_open *open_msg,
> > +                           struct ibnbd_dev *ibnbd_dev, fmode_t open_flags,
> > +                           struct ibnbd_srv_dev *srv_dev)
> > +{
> > +     struct ibnbd_srv_sess_dev *sdev = ibnbd_sess_dev_alloc(srv_sess);
> > +
> > +     if (IS_ERR(sdev))
> > +             return sdev;
> > +
> > +     kref_init(&sdev->kref);
> > +
> > +     strlcpy(sdev->pathname, open_msg->dev_name, sizeof(sdev->pathname));
>
> Can the path name be changed into a dynamically allocated string?

Probably we could just do strdup() and free it afterwards...

>
> > +static char *ibnbd_srv_get_full_path(struct ibnbd_srv_session *srv_sess,
> > +                                  const char *dev_name)
> > +{
> > +     char *full_path;
> > +     char *a, *b;
> > +
> > +     full_path = kmalloc(PATH_MAX, GFP_KERNEL);
> > +     if (!full_path)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     /*
> > +      * Replace %SESSNAME% with a real session name in order to
> > +      * create device namespace.
> > +      */
> > +     a = strnstr(dev_search_path, "%SESSNAME%", sizeof(dev_search_path));
> > +     if (a) {
> > +             int len = a - dev_search_path;
> > +
> > +             len = snprintf(full_path, PATH_MAX, "%.*s/%s/%s", len,
> > +                            dev_search_path, srv_sess->sessname, dev_name);
> > +             if (len >= PATH_MAX) {
> > +                     pr_err("Tooooo looong path: %s, %s, %s\n",
> > +                            dev_search_path, srv_sess->sessname, dev_name);
> > +                     kfree(full_path);
> > +                     return ERR_PTR(-EINVAL);
> > +             }
> > +     } else {
> > +             snprintf(full_path, PATH_MAX, "%s/%s",
> > +                      dev_search_path, dev_name);
> > +     }
>
> Has it been considered to use kasprintf() instead of kmalloc() + snprintf()?

I didn't know there was a kasprintf()... Looks like it would fit here.
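
A userspace sketch of the suggestion: kasprintf() does a two-pass format (first vsnprintf() to measure, then allocate exactly, then format again), so the fixed PATH_MAX buffer and the manual overflow check both go away. The helper names below are hypothetical; only the %SESSNAME% logic mirrors the quoted ibnbd_srv_get_full_path():

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Two-pass allocate-and-format, the same idea kasprintf() implements. */
static char *xasprintf(const char *fmt, ...)
{
	va_list ap;
	char *p;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(NULL, 0, fmt, ap);	/* measure only */
	va_end(ap);
	if (len < 0)
		return NULL;

	p = malloc(len + 1);
	if (!p)
		return NULL;

	va_start(ap, fmt);
	vsnprintf(p, len + 1, fmt, ap);
	va_end(ap);
	return p;
}

/* The %SESSNAME% substitution, sketched: everything from the token
 * onward is replaced by "<sessname>/<dev_name>". */
static char *get_full_path(const char *search_path, const char *sessname,
			   const char *dev_name)
{
	const char *a = strstr(search_path, "%SESSNAME%");

	if (a)
		return xasprintf("%.*s/%s/%s", (int)(a - search_path),
				 search_path, sessname, dev_name);
	return xasprintf("%s/%s", search_path, dev_name);
}
```

Note that, like the quoted kernel code, this inserts a '/' even when the prefix before the token already ends in one; the resulting double slash is harmless for path lookup.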

> > +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> > +                              struct ibnbd_srv_session *srv_sess,
> > +                              const void *msg, size_t len,
> > +                              void *data, size_t datalen)
> > +{
> > +     const struct ibnbd_msg_sess_info *sess_info_msg = msg;
> > +     struct ibnbd_msg_sess_info_rsp *rsp = data;
> > +
> > +     srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> > +     pr_debug("Session %s using protocol version %d (client version: %d,"
> > +              " server version: %d)\n", srv_sess->sessname,
> > +              srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
>
> Has this patch been verified with checkpatch? I think checkpatch
> recommends not to split literal strings.

Yes, it does complain about our split strings. But it's either a
split string, a line over 80 chars, or "Avoid line continuations in
quoted strings" if we use a backslash on the previous line. I don't know
how to avoid all three of them.

> > +/**
> > + * find_srv_sess_dev() - a dev is already opened by this name
> > + *
> > + * Return struct ibnbd_srv_sess_dev if srv_sess already opened the dev_name
> > + * NULL if the session didn't open the device yet.
> > + */
> > +static struct ibnbd_srv_sess_dev *
> > +find_srv_sess_dev(struct ibnbd_srv_session *srv_sess, const char *dev_name)
> > +{
> > +     struct ibnbd_srv_sess_dev *sess_dev;
> > +
> > +     if (list_empty(&srv_sess->sess_dev_list))
> > +             return NULL;
> > +
> > +     list_for_each_entry(sess_dev, &srv_sess->sess_dev_list, sess_list)
> > +             if (!strcmp(sess_dev->pathname, dev_name))
> > +                     return sess_dev;
> > +
> > +     return NULL;
> > +}
>
> Is explicit the list_empty() check really necessary? Would the behavior
> of this function change if that check is left out?
Will drop the check and fix things if it doesn't work without it
(which I hope it will), thanks.
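
The reason the explicit list_empty() check is redundant can be seen from the shape of the kernel's circular intrusive list: on an empty list, head->next points back at the head, so the iteration condition fails before the body ever runs. A minimal userspace sketch (simplified from list.h; names hypothetical):

```c
#include <assert.h>

/* Minimal circular intrusive list, the same shape as the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * The expansion of list_for_each_entry(): on an empty list
 * head->next == head, so this loop does zero iterations — exactly why
 * a leading list_empty() guard changes nothing.
 */
static int count_entries(struct list_head *head)
{
	struct list_head *pos;
	int n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}
```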

> Has the posted code been compiled with W=1? I'm asking this because the
> documentation of the function arguments is missing from the kernel-doc
> header. I expect that a warning will be reported if this code is
> compiled with W=1.
It hadn't been; I didn't know about W=1. It does report those warnings — will fix them, thank you!

>
> Thanks,
>
> Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-18 15:47             ` Bart Van Assche
@ 2019-09-20  8:29               ` Danil Kipnis
  2019-09-25 22:26               ` Danil Kipnis
  1 sibling, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-09-20  8:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang, Roman Pen

On Wed, Sep 18, 2019 at 5:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/18/19 12:14 AM, Danil Kipnis wrote:
> > I'm not familiar with dm code, but don't they need to deal with the
> > same situation: if I configure 100 logical volumes on top of a single
> > NVME drive with X hardware queues, each queue_depth deep, then each dm
> > block device would need to advertise X hardware queues in order to
> > achieve highest performance in case only this one volume is accessed,
> > while in fact those X physical queues have to be shared among all 100
> > logical volumes, if they are accessed in parallel?
>
> Combining multiple queues (a) into a single queue (b) that is smaller
> than the combined source queues without sacrificing performance is
> tricky. We already have one such implementation in the block layer core
> and it took considerable time to get that implementation right. See e.g.
> blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart().
We will need some time to check whether we can reuse those...

> dm drivers are expected to return DM_MAPIO_REQUEUE or
> DM_MAPIO_DELAY_REQUEUE if the queue (b) is full. It turned out to be
> difficult to get this right in the dm-mpath driver and at the same time
> to achieve good performance.
We also first tried to just return error codes when we can't
process an incoming request, but this caused huge performance
degradation as the number of devices mapped over the same session
grew. Since we introduced those per-CPU, per-device lists of
stopped queues, we scale very well.

>
> The ibnbd driver introduces a third implementation of code that combines
> multiple (per-cpu) queues into one queue per CPU. It is considered
> important in the Linux kernel to avoid code duplication. Hence my
> question whether ibnbd can reuse the block layer infrastructure for
> sharing tag sets.
Yes, we will have to iterate on this.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 20/25] ibnbd: server: main functionality
  2019-09-20  7:36     ` Danil Kipnis
@ 2019-09-20 15:42       ` Bart Van Assche
  2019-09-23 15:19         ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-20 15:42 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Roman Pen, Jack Wang

On 9/20/19 12:36 AM, Danil Kipnis wrote:
> On Wed, Sep 18, 2019 at 7:41 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 6/20/19 8:03 AM, Jack Wang wrote:
>>> +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
>>> +                              struct ibnbd_srv_session *srv_sess,
>>> +                              const void *msg, size_t len,
>>> +                              void *data, size_t datalen)
>>> +{
>>> +     const struct ibnbd_msg_sess_info *sess_info_msg = msg;
>>> +     struct ibnbd_msg_sess_info_rsp *rsp = data;
>>> +
>>> +     srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
>>> +     pr_debug("Session %s using protocol version %d (client version: %d,"
>>> +              " server version: %d)\n", srv_sess->sessname,
>>> +              srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
>>
>> Has this patch been verified with checkpatch? I think checkpatch
>> recommends not to split literal strings.
> 
> Yes it does complain about our splitted strings. But it's either
> splitted string or line over 80 chars or "Avoid line continuations in
> quoted strings" if we use backslash on previous line. I don't know how
> to avoid all three of them.

Checkpatch shouldn't complain about constant strings that exceed 80 
columns. If it complains about such strings then that's a checkpatch bug.

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 20/25] ibnbd: server: main functionality
  2019-09-20 15:42       ` Bart Van Assche
@ 2019-09-23 15:19         ` Danil Kipnis
  0 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-09-23 15:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Roman Pen, Jack Wang

On Fri, Sep 20, 2019 at 5:42 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/20/19 12:36 AM, Danil Kipnis wrote:
> > On Wed, Sep 18, 2019 at 7:41 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
> >>> +                              struct ibnbd_srv_session *srv_sess,
> >>> +                              const void *msg, size_t len,
> >>> +                              void *data, size_t datalen)
> >>> +{
> >>> +     const struct ibnbd_msg_sess_info *sess_info_msg = msg;
> >>> +     struct ibnbd_msg_sess_info_rsp *rsp = data;
> >>> +
> >>> +     srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> >>> +     pr_debug("Session %s using protocol version %d (client version: %d,"
> >>> +              " server version: %d)\n", srv_sess->sessname,
> >>> +              srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
> >>
> >> Has this patch been verified with checkpatch? I think checkpatch
> >> recommends not to split literal strings.
> >
> > Yes it does complain about our splitted strings. But it's either
> > splitted string or line over 80 chars or "Avoid line continuations in
> > quoted strings" if we use backslash on previous line. I don't know how
> > to avoid all three of them.
>
> Checkpatch shouldn't complain about constant strings that exceed 80
> columns. If it complains about such strings then that's a checkpatch bug.
It doesn't indeed... Will concatenate those split quoted strings, thank you.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 01/25] sysfs: export sysfs_remove_file_self()
  2019-06-20 15:03 ` [PATCH v4 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
@ 2019-09-23 17:21   ` Bart Van Assche
  2019-09-25  9:30     ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 17:21 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, linux-kernel

On 6/20/19 8:03 AM, Jack Wang wrote:
> Function is going to be used in transport over RDMA module
> in subsequent patches.

It seems like several words are missing from this patch description.

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections
       [not found] ` <20190620150337.7847-3-jinpuwang@gmail.com>
@ 2019-09-23 17:44   ` Bart Van Assche
  2019-09-25 10:20     ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 17:44 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> From: Roman Pen <roman.penyaev@profitbricks.com>
> 
> Introduce public header which provides set of API functions to
> establish RDMA connections from client to server machine using
> IBTRS protocol, which manages RDMA connections for each session,
> does multipathing and load balancing.
> 
> Main functions for client (active) side:
> 
>   ibtrs_clt_open() - Creates set of RDMA connections incapsulated
                              ^^^                       ^^^^^^^^^^^^
                                a?                      encapsulated?

>                      in IBTRS session and returns pointer on IBTRS
                         ^^^                       ^^^       ^^
                          a?                        a?       to an?
> 		    session object.
[ ... ]
> +/**
> + * enum ibtrs_clt_link_ev - Events about connectivity state of a client
> + * @IBTRS_CLT_LINK_EV_RECONNECTED	Client was reconnected.
> + * @IBTRS_CLT_LINK_EV_DISCONNECTED	Client was disconnected.
> + */
> +enum ibtrs_clt_link_ev {
> +	IBTRS_CLT_LINK_EV_RECONNECTED,
> +	IBTRS_CLT_LINK_EV_DISCONNECTED,
> +};
> +
> +/**
> + * Source and destination address of a path to be established
> + */
> +struct ibtrs_addr {
> +	struct sockaddr_storage *src;
> +	struct sockaddr_storage *dst;
> +};

Is it really useful to define a structure to hold two pointers or can 
these two pointers also be passed as separate arguments?

> +/**
> + * ibtrs_clt_open() - Open a session to a IBTRS client
> + * @priv:		User supplied private data.
> + * @link_ev:		Event notification for connection state changes
> + *	@priv:			user supplied data that was passed to
> + *				ibtrs_clt_open()
> + *	@ev:			Occurred event
> + * @sessname: name of the session
> + * @paths: Paths to be established defined by their src and dst addresses
> + * @path_cnt: Number of elemnts in the @paths array
> + * @port: port to be used by the IBTRS session
> + * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
> + * @max_inflight_msg: Max. number of parallel inflight messages for the session
> + * @max_segments: Max. number of segments per IO request
> + * @reconnect_delay_sec: time between reconnect tries
> + * @max_reconnect_attempts: Number of times to reconnect on error before giving
> + *			    up, 0 for * disabled, -1 for forever
> + *
> + * Starts session establishment with the ibtrs_server. The function can block
> + * up to ~2000ms until it returns.
> + *
> + * Return a valid pointer on success otherwise PTR_ERR.
> + */
> +struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
> +				 const char *sessname,
> +				 const struct ibtrs_addr *paths,
> +				 size_t path_cnt, short port,
> +				 size_t pdu_sz, u8 reconnect_delay_sec,
> +				 u16 max_segments,
> +				 s16 max_reconnect_attempts);

Having detailed kernel-doc headers for describing API functions is great 
but I'm not sure a .h file is the best location for such documentation. 
Many kernel developers keep kernel-doc headers in .c files because that 
makes it more likely that the documentation and the implementation stay 
in sync.

> +
> +/**
> + * ibtrs_clt_close() - Close a session
> + * @sess: Session handler, is freed on return
                      ^^^^^^^
                      handle?

This sentence suggests that the handle is freed on return. I guess that 
you meant that the session is freed upon return?

> +/**
> + * ibtrs_clt_get_tag() - allocates tag for future RDMA operation
> + * @sess:	Current session
> + * @con_type:	Type of connection to use with the tag
> + * @wait:	Wait type
> + *
> + * Description:
> + *    Allocates tag for the following RDMA operation.  Tag is used
> + *    to preallocate all resources and to propagate memory pressure
> + *    up earlier.
> + *
> + * Context:
> + *    Can sleep if @wait == IBTRS_TAG_WAIT
> + */
> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess,
> +				    enum ibtrs_clt_con_type con_type,
> +				    int wait);

Since struct ibtrs_tag has another role than what is called a tag in the 
block layer I think a better description is needed of what struct 
ibtrs_tag actually represents.

> +/*
> + * Here goes IBTRS server API
> + */

Most software either uses the client API or the server API but not both 
at the same time. Has it been considered to use separate header files 
for the client and server APIs?

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/25] ibtrs: client: main functionality
       [not found] ` <20190620150337.7847-7-jinpuwang@gmail.com>
@ 2019-09-23 21:51   ` Bart Van Assche
  2019-09-25 17:36     ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 21:51 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +static const struct ibtrs_ib_dev_pool_ops dev_pool_ops;
> +static struct ibtrs_ib_dev_pool dev_pool = {
> +	.ops = &dev_pool_ops
> +};

Can the definitions in this file be reordered such that the forward 
declaration of dev_pool_ops can be removed?

> +static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
> +static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
> +				     struct rdma_cm_event *ev);
> +static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
> +static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
> +			      bool notify, bool can_wait);
> +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
> +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);

Please also remove these forward declarations.

> +bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
> +{
> +	return sess->state == IBTRS_CLT_CONNECTED;
> +}

Is it really useful to introduce a one line function for testing the 
session state?

> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> +	size_t max_depth = clt->queue_depth;
> +	struct ibtrs_tag *tag;
> +	int cpu, bit;
> +
> +	cpu = get_cpu();
> +	do {
> +		bit = find_first_zero_bit(clt->tags_map, max_depth);
> +		if (unlikely(bit >= max_depth)) {
> +			put_cpu();
> +			return NULL;
> +		}
> +
> +	} while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> +	put_cpu();
> +
> +	tag = GET_TAG(clt, bit);
> +	WARN_ON(tag->mem_id != bit);
> +	tag->cpu_id = cpu;
> +	tag->con_type = con_type;
> +
> +	return tag;
> +}

What is the role of the get_cpu() and put_cpu() calls in this function? 
How can it make sense to assign the cpu number to tag->cpu_id after 
put_cpu() has been called?

> +static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
> +				      struct ibtrs_clt_sess *sess,
> +				      ibtrs_conf_fn *conf,
> +				      struct ibtrs_tag *tag, void *priv,
> +				      const struct kvec *vec, size_t usr_len,
> +				      struct scatterlist *sg, size_t sg_cnt,
> +				      size_t data_len, int dir)
> +{
> +	struct iov_iter iter;
> +	size_t len;
> +
> +	req->tag = tag;
> +	req->in_use = true;
> +	req->usr_len = usr_len;
> +	req->data_len = data_len;
> +	req->sglist = sg;
> +	req->sg_cnt = sg_cnt;
> +	req->priv = priv;
> +	req->dir = dir;
> +	req->con = ibtrs_tag_to_clt_con(sess, tag);
> +	req->conf = conf;
> +	req->need_inv = false;
> +	req->need_inv_comp = false;
> +	req->inv_errno = 0;
> +
> +	iov_iter_kvec(&iter, READ, vec, 1, usr_len);
> +	len = _copy_from_iter(req->iu->buf, usr_len, &iter);
> +	WARN_ON(len != usr_len);
> +
> +	reinit_completion(&req->inv_comp);
> +	if (sess->stats.enable_rdma_lat)
> +		req->start_jiffies = jiffies;
> +}

A comment that explains what "req" stands for would be welcome. Since 
this function copies the entire payload, I assume that it is only used 
for control messages and not for reading or writing data from a block 
device?

> +static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
> +				  struct ibtrs_clt_io_req *fail_req)
> +{
> +	struct ibtrs_clt_sess *alive_sess;
> +	struct ibtrs_clt_io_req *req;
> +	int err = -ECONNABORTED;
> +	struct path_it it;
> +
> +	do_each_path(alive_sess, clt, &it) {
> +		if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
> +			continue;
> +		req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
> +		if (req->dir == DMA_TO_DEVICE)
> +			err = ibtrs_clt_write_req(req);
> +		else
> +			err = ibtrs_clt_read_req(req);
> +		if (unlikely(err)) {
> +			req->in_use = false;
> +			continue;
> +		}
> +		/* Success path */
> +		ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
> +		break;
> +	} while_each_path(&it);
> +
> +	return err;
> +}

Also for this function, a comment that explains the purpose of this 
function would be welcome.

> +static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +	struct ibtrs_clt_io_req *req;
> +	int i, err;
> +
> +	if (!sess->reqs)
> +		return;
> +	for (i = 0; i < sess->queue_depth; ++i) {
> +		req = &sess->reqs[i];
> +		if (!req->in_use)
> +			continue;
> +
> +		/*
> +		 * Safely (without notification) complete failed request.
> +		 * After completion this request is still usebale and can
> +		 * be failovered to another path.
> +		 */
> +		complete_rdma_req(req, -ECONNABORTED, false, true);
> +
> +		err = ibtrs_clt_failover_req(clt, req);
> +		if (unlikely(err))
> +			/* Failover failed, notify anyway */
> +			req->conf(req->priv, err);
> +	}
> +}

What guarantees that this function does not call complete_rdma_req() 
while complete_rdma_req() is called from the regular completion path?

> +static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
> +				     enum ibtrs_clt_state new_state)
> +{
> +	enum ibtrs_clt_state old_state;
> +	bool changed = false;
> +
> +	old_state = sess->state;
> +	switch (new_state) {

Please use lockdep_assert_held() inside this function to verify at 
runtime that session state changes are serialized properly.

> +static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
> +{
> +	enum ibtrs_clt_state state;
> +
> +	spin_lock_irq(&sess->state_wq.lock);
> +	state = sess->state;
> +	spin_unlock_irq(&sess->state_wq.lock);
> +
> +	return state;
> +}

Please remove this function and read sess->state without holding 
state_wq.lock.
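
A userspace sketch of the suggested change: for a naturally aligned enum/int, a locked read adds nothing; the kernel idiom for such lockless accesses is READ_ONCE()/WRITE_ONCE(), which keep the compiler from tearing or refetching the access. The macros below are simplified stand-ins for the kernel ones:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(). */
#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

enum clt_state { CLT_CONNECTING, CLT_CONNECTED, CLT_CLOSED };

struct sess {
	enum clt_state state;
};

/* Lockless equivalent of the quoted ibtrs_clt_state(): the spinlock
 * around a single aligned load provided no extra guarantee anyway. */
static enum clt_state sess_state(const struct sess *s)
{
	return READ_ONCE(s->state);
}
```

This only returns a snapshot, of course; any caller that acts on the state still has to tolerate it changing immediately afterwards, with or without the lock.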

> +static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
> +{
> +	struct ibtrs_clt_con *con;
> +
> +	(void)err;
> +	con = container_of(c, typeof(*con), c);
> +	ibtrs_rdma_error_recovery(con);
> +}

Can "(void)err" be left out?

Can the declaration and assignment of 'con' be merged into a single line 
of code?

> +static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
> +{
> +	struct ibtrs_clt_con *con;
> +
> +	con = kzalloc(sizeof(*con), GFP_KERNEL);
> +	if (unlikely(!con))
> +		return -ENOMEM;
> +
> +	/* Map first two connections to the first CPU */
> +	con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
> +	con->c.cid = cid;
> +	con->c.sess = &sess->s;
> +	atomic_set(&con->io_cnt, 0);
> +
> +	sess->s.con[cid] = &con->c;
> +
> +	return 0;
> +}

The code to map a connection ID to onto a CPU occurs multiple times. Has 
it been considered to introduce a function for that mapping? Although 
one-line inline functions are not recommended in general, such a 
function will also make it easier to experiment with other mapping 
approaches, e.g. mapping hyperthread siblings onto the same connection ID.

> +static inline bool xchg_sessions(struct ibtrs_clt_sess __rcu **rcu_ppcpu_path,
> +				 struct ibtrs_clt_sess *sess,
> +				 struct ibtrs_clt_sess *next)
> +{
> +	struct ibtrs_clt_sess **ppcpu_path;
> +
> +	/* Call cmpxchg() without sparse warnings */
> +	ppcpu_path = (typeof(ppcpu_path))rcu_ppcpu_path;
> +	return (sess == cmpxchg(ppcpu_path, sess, next));
> +}

This looks suspicious. Has it been considered to protect changes of 
rcu_ppcpu_path with a mutex and to protect reads with an RCU read lock?

> +static void ibtrs_clt_add_path_to_arr(struct ibtrs_clt_sess *sess,
> +				      struct ibtrs_addr *addr)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +
> +	mutex_lock(&clt->paths_mutex);
> +	clt->paths_num++;
> +
> +	/*
> +	 * Firstly increase paths_num, wait for GP and then
> +	 * add path to the list.  Why?  Since we add path with
> +	 * !CONNECTED state explanation is similar to what has
> +	 * been written in ibtrs_clt_remove_path_from_arr().
> +	 */
> +	synchronize_rcu();
> +
> +	list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
> +	mutex_unlock(&clt->paths_mutex);
> +}

synchronize_rcu() while a mutex is being held? Really?

> +static void ibtrs_clt_close_work(struct work_struct *work)
> +{
> +	struct ibtrs_clt_sess *sess;
> +
> +	sess = container_of(work, struct ibtrs_clt_sess, close_work);
> +
> +	cancel_delayed_work_sync(&sess->reconnect_dwork);
> +	ibtrs_clt_stop_and_destroy_conns(sess);
> +	/*
> +	 * Sounds stupid, huh?  No, it is not.  Consider this sequence:
> +	 *
> +	 *   #CPU0                              #CPU1
> +	 *   1.  CONNECTED->RECONNECTING
> +	 *   2.                                 RECONNECTING->CLOSING
> +	 *   3.  queue_work(&reconnect_dwork)
> +	 *   4.                                 queue_work(&close_work);
> +	 *   5.  reconnect_work();              close_work();
> +	 *
> +	 * To avoid that case do cancel twice: before and after.
> +	 */
> +	cancel_delayed_work_sync(&sess->reconnect_dwork);
> +	ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSED);
> +}

The above code looks suspicious to me. I think there should be an 
additional state change at the start of this function to prevent that 
reconnect_dwork gets requeued after having been canceled.

> +static void ibtrs_clt_dev_release(struct device *dev)
> +{
> +	/* Nobody plays with device references, so nop */
> +}

That comment sounds wrong. Have you reviewed all of the device driver 
core code and checked that there is no code in there that manipulates 
struct device refcounts? I think the code that frees struct ibtrs_clt 
should be moved from free_clt() into the above function.

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers
       [not found] ` <20190620150337.7847-4-jinpuwang@gmail.com>
@ 2019-09-23 22:50   ` Bart Van Assche
  2019-09-25 21:45     ` Danil Kipnis
  2019-09-27  8:56     ` Jinpu Wang
  0 siblings, 2 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 22:50 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +#define P1 )
> +#define P2 ))
> +#define P3 )))
> +#define P4 ))))
> +#define P(N) P ## N
> +
> +#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
> +#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
> +
> +#define LIST(...)						\
> +	__VA_ARGS__,						\
> +	({ unknown_type(); NULL; })				\
> +	CAT(P, COUNT_ARGS(__VA_ARGS__))				\
> +
> +#define EMPTY()
> +#define DEFER(id) id EMPTY()
> +
> +#define _CASE(obj, type, member)				\
> +	__builtin_choose_expr(					\
> +	__builtin_types_compatible_p(				\
> +		typeof(obj), type),				\
> +		((type)obj)->member
> +#define CASE(o, t, m) DEFER(_CASE)(o, t, m)
> +
> +/*
> + * Below we define retrieving of sessname from common IBTRS types.
> + * Client or server related types have to be defined by special
> + * TYPES_TO_SESSNAME macro.
> + */
> +
> +void unknown_type(void);
> +
> +#ifndef TYPES_TO_SESSNAME
> +#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
> +#endif
> +
> +#define ibtrs_prefix(obj)					\
> +	_CASE(obj, struct ibtrs_con *,  sess->sessname),	\
> +	_CASE(obj, struct ibtrs_sess *, sessname),		\
> +	TYPES_TO_SESSNAME(obj)					\
> +	))

No preprocessor voodoo please. Please remove all of the above and modify 
the logging statements such that these pass the proper name string as 
first argument to logging macros.

> +struct ibtrs_msg_conn_req {
> +	u8		__cma_version; /* Is set to 0 by cma.c in case of
> +					* AF_IB, do not touch that. */
> +	u8		__ip_version;  /* On sender side that should be
> +					* set to 0, or cma_save_ip_info()
> +					* extract garbage and will fail. */
> +	__le16		magic;
> +	__le16		version;
> +	__le16		cid;
> +	__le16		cid_num;
> +	__le16		recon_cnt;
> +	uuid_t		sess_uuid;
> +	uuid_t		paths_uuid;
> +	u8		reserved[12];
> +};

Please remove the reserved[] array and check private_data_len in the 
code that receives the login request.

> +/**
> + * struct ibtrs_msg_conn_rsp - Server connection response to the client
> + * @magic:	   IBTRS magic
> + * @version:	   IBTRS protocol version
> + * @errno:	   If rdma_accept() then 0, if rdma_reject() indicates error
> + * @queue_depth:   max inflight messages (queue-depth) in this session
> + * @max_io_size:   max io size server supports
> + * @max_hdr_size:  max msg header size server supports
> + *
> + * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
> + */
> +struct ibtrs_msg_conn_rsp {
> +	__le16		magic;
> +	__le16		version;
> +	__le16		errno;
> +	__le16		queue_depth;
> +	__le32		max_io_size;
> +	__le32		max_hdr_size;
> +	u8		reserved[40];
> +};

Same comment here: please remove the reserved[] array and check 
private_data_len in the code that processes this data structure.

> +static inline int sockaddr_cmp(const struct sockaddr *a,
> +			       const struct sockaddr *b)
> +{
> +	switch (a->sa_family) {
> +	case AF_IB:
> +		return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
> +			      &((struct sockaddr_ib *)b)->sib_addr,
> +			      sizeof(struct ib_addr));
> +	case AF_INET:
> +		return memcmp(&((struct sockaddr_in *)a)->sin_addr,
> +			      &((struct sockaddr_in *)b)->sin_addr,
> +			      sizeof(struct in_addr));
> +	case AF_INET6:
> +		return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
> +			      &((struct sockaddr_in6 *)b)->sin6_addr,
> +			      sizeof(struct in6_addr));
> +	default:
> +		return -ENOENT;
> +	}
> +}
> +
> +static inline int sockaddr_to_str(const struct sockaddr *addr,
> +				   char *buf, size_t len)
> +{
> +	int cnt;
> +
> +	switch (addr->sa_family) {
> +	case AF_IB:
> +		cnt = scnprintf(buf, len, "gid:%pI6",
> +			&((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
> +		return cnt;
> +	case AF_INET:
> +		cnt = scnprintf(buf, len, "ip:%pI4",
> +			&((struct sockaddr_in *)addr)->sin_addr);
> +		return cnt;
> +	case AF_INET6:
> +		cnt = scnprintf(buf, len, "ip:%pI6c",
> +			  &((struct sockaddr_in6 *)addr)->sin6_addr);
> +		return cnt;
> +	}
> +	cnt = scnprintf(buf, len, "<invalid address family>");
> +	pr_err("Invalid address family\n");
> +	return cnt;
> +}

Since these functions are not in the hot path, please move these into a 
.c file.

> +/**
> + * ibtrs_invalidate_flag() - returns proper flags for invalidation
> + *
> + * NOTE: This function is needed for compat layer, so think twice before
> + *       rename or remove.
> + */
> +static inline u32 ibtrs_invalidate_flag(void)
> +{
> +	return IBTRS_MSG_NEED_INVAL_F;
> +}

An inline function that does nothing else than returning a compile-time 
constant? That does not look useful to me. How about inlining this function?

> +#define STAT_STORE_FUNC(type, store, reset)				\
> +static ssize_t store##_store(struct kobject *kobj,			\
> +			     struct kobj_attribute *attr,		\
> +			     const char *buf, size_t count)		\
> +{									\
> +	int ret = -EINVAL;						\
> +	type *sess = container_of(kobj, type, kobj_stats);		\
> +									\
> +	if (sysfs_streq(buf, "1"))					\
> +		ret = reset(&sess->stats, true);			\
> +	else if (sysfs_streq(buf, "0"))					\
> +		ret = reset(&sess->stats, false);			\
> +	if (ret)							\
> +		return ret;						\
> +									\
> +	return count;							\
> +}

The above macro concatenates the suffix "_store" to a macro argument 
with the name 'store'. Please choose a less confusing name for that 
macro argument. Additionally, using 'reset' as the name of a macro 
argument that refers to a function that stores a value seems confusing 
to me. How about renaming that macro argument to 'set' or 'store_value'?

> +#define STAT_SHOW_FUNC(type, show, print)				\
> +static ssize_t show##_show(struct kobject *kobj,			\
> +			   struct kobj_attribute *attr,			\
> +			   char *page)					\
> +{									\
> +	type *sess = container_of(kobj, type, kobj_stats);		\
> +									\
> +	return print(&sess->stats, page, PAGE_SIZE);			\
> +}

Same comment for the macro argument 'show' in the above macro.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 04/25] ibtrs: core: lib functions shared between client and server modules
       [not found] ` <20190620150337.7847-5-jinpuwang@gmail.com>
@ 2019-09-23 23:03   ` Bart Van Assche
  2019-09-27 10:13     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 23:03 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +static int ibtrs_str_gid_to_sockaddr(const char *addr, size_t len,
> +				     short port, struct sockaddr_storage *dst)
> +{
> +	struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
> +	int ret;
> +
> +	/*
> +	 * We can use some of the I6 functions since GID is a valid
> +	 * IPv6 address format
> +	 */
> +	ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
> +	if (ret == 0)
> +		return -EINVAL;
> +
> +	dst_ib->sib_family = AF_IB;
> +	/*
> +	 * Use the same TCP server port number as the IB service ID
> +	 * on the IB port space range
> +	 */
> +	dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | port);
> +	dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
> +	dst_ib->sib_pkey = cpu_to_be16(0xffff);
> +
> +	return 0;
> +}
> +
> +/**
> + * ibtrs_str_to_sockaddr() - Convert ibtrs address string to sockaddr
> + * @addr	String representation of an addr (IPv4, IPv6 or IB GID):
> + *              - "ip:192.168.1.1"
> + *              - "ip:fe80::200:5aee:feaa:20a2"
> + *              - "gid:fe80::200:5aee:feaa:20a2"
> + * @len         String address length
> + * @port	Destination port
> + * @dst		Destination sockaddr structure
> + *
> + * Returns 0 if conversion successful. Non-zero on error.
> + */
> +static int ibtrs_str_to_sockaddr(const char *addr, size_t len,
> +				 short port, struct sockaddr_storage *dst)
> +{
> +	if (strncmp(addr, "gid:", 4) == 0) {
> +		return ibtrs_str_gid_to_sockaddr(addr + 4, len - 4, port, dst);
> +	} else if (strncmp(addr, "ip:", 3) == 0) {
> +		char port_str[8];
> +		char *cpy;
> +		int err;
> +
> +		snprintf(port_str, sizeof(port_str), "%u", port);
> +		cpy = kstrndup(addr + 3, len - 3, GFP_KERNEL);
> +		err = cpy ? inet_pton_with_scope(&init_net, AF_UNSPEC,
> +						 cpy, port_str, dst) : -ENOMEM;
> +		kfree(cpy);
> +
> +		return err;
> +	}
> +	return -EPROTONOSUPPORT;
> +}

A considerable amount of code is required to support the IB/CM. Does 
supporting the IB/CM add any value? If that code would be left out, 
would anything break? Is it really useful to support IB networks where 
no IP address has been assigned to each IB port?

Thanks,

Bart.


* Re: [PATCH v4 05/25] ibtrs: client: private header with client structs and functions
       [not found] ` <20190620150337.7847-6-jinpuwang@gmail.com>
@ 2019-09-23 23:05   ` Bart Van Assche
  2019-09-27 10:18     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 23:05 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +static inline const char *ibtrs_clt_state_str(enum ibtrs_clt_state state)
> +{
> +	switch (state) {
> +	case IBTRS_CLT_CONNECTING:
> +		return "IBTRS_CLT_CONNECTING";
> +	case IBTRS_CLT_CONNECTING_ERR:
> +		return "IBTRS_CLT_CONNECTING_ERR";
> +	case IBTRS_CLT_RECONNECTING:
> +		return "IBTRS_CLT_RECONNECTING";
> +	case IBTRS_CLT_CONNECTED:
> +		return "IBTRS_CLT_CONNECTED";
> +	case IBTRS_CLT_CLOSING:
> +		return "IBTRS_CLT_CLOSING";
> +	case IBTRS_CLT_CLOSED:
> +		return "IBTRS_CLT_CLOSED";
> +	case IBTRS_CLT_DEAD:
> +		return "IBTRS_CLT_DEAD";
> +	default:
> +		return "UNKNOWN";
> +	}
> +}

Since this code is not in the hot path, please move it from a .h into a 
.c file.

> +static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
> +{
> +	return container_of(c, struct ibtrs_clt_con, c);
> +}
> +
> +static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
> +{
> +	return container_of(s, struct ibtrs_clt_sess, s);
> +}

Is it really useful to define helper functions for these conversions, or 
could the container_of() calls be open-coded at the call sites?

Thanks,

Bart.


* Re: [PATCH v4 07/25] ibtrs: client: statistics functions
       [not found] ` <20190620150337.7847-8-jinpuwang@gmail.com>
@ 2019-09-23 23:15   ` Bart Van Assche
  2019-09-27 12:00     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 23:15 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *stats, bool read,
> +			       unsigned long ms)
> +{
> +	struct ibtrs_clt_stats_pcpu *s;
> +	int id;
> +
> +	id = ibtrs_clt_ms_to_id(ms);
> +	s = this_cpu_ptr(stats->pcpu_stats);
> +	if (read) {
> +		s->rdma_lat_distr[id].read++;
> +		if (s->rdma_lat_max.read < ms)
> +			s->rdma_lat_max.read = ms;
> +	} else {
> +		s->rdma_lat_distr[id].write++;
> +		if (s->rdma_lat_max.write < ms)
> +			s->rdma_lat_max.write = ms;
> +	}
> +}

Can it happen that this function is called simultaneously from thread 
context and from interrupt context?

> +void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_clt_stats *stats = &sess->stats;
> +	struct ibtrs_clt_stats_pcpu *s;
> +	int cpu;
> +
> +	cpu = raw_smp_processor_id();
> +	s = this_cpu_ptr(stats->pcpu_stats);
> +	s->wc_comp.cnt++;
> +	s->wc_comp.total_cnt++;
> +	if (unlikely(con->cpu != cpu)) {
> +		s->cpu_migr.to++;
> +
> +		/* Careful here, override s pointer */
> +		s = per_cpu_ptr(stats->pcpu_stats, con->cpu);
> +		atomic_inc(&s->cpu_migr.from);
> +	}
> +}

Same question here.

> +void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *stats)
> +{
> +	struct ibtrs_clt_stats_pcpu *s;
> +
> +	s = this_cpu_ptr(stats->pcpu_stats);
> +	s->rdma.failover_cnt++;
> +}

And here ...

Thanks,

Bart.


* Re: [PATCH v4 09/25] ibtrs: server: private header with server structs and functions
       [not found] ` <20190620150337.7847-10-jinpuwang@gmail.com>
@ 2019-09-23 23:21   ` Bart Van Assche
  2019-09-27 12:04     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 23:21 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +static inline const char *ibtrs_srv_state_str(enum ibtrs_srv_state state)
> +{
> +	switch (state) {
> +	case IBTRS_SRV_CONNECTING:
> +		return "IBTRS_SRV_CONNECTING";
> +	case IBTRS_SRV_CONNECTED:
> +		return "IBTRS_SRV_CONNECTED";
> +	case IBTRS_SRV_CLOSING:
> +		return "IBTRS_SRV_CLOSING";
> +	case IBTRS_SRV_CLOSED:
> +		return "IBTRS_SRV_CLOSED";
> +	default:
> +		return "UNKNOWN";
> +	}
> +}

Since this function is not in the hot path, please move it into a .c file.

> +/* See ibtrs-log.h */
> +#define TYPES_TO_SESSNAME(obj)						\
> +	LIST(CASE(obj, struct ibtrs_srv_sess *, s.sessname))

Please remove this macro and pass 'sessname' explicitly to logging 
functions.

Thanks,

Bart.


* Re: [PATCH v4 10/25] ibtrs: server: main functionality
       [not found] ` <20190620150337.7847-11-jinpuwang@gmail.com>
@ 2019-09-23 23:49   ` Bart Van Assche
  2019-09-27 15:03     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 23:49 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +module_param_named(max_chunk_size, max_chunk_size, int, 0444);
> +MODULE_PARM_DESC(max_chunk_size,
> +		 "Max size for each IO request, when change the unit is in byte"
> +		 " (default: " __stringify(DEFAULT_MAX_CHUNK_SIZE_KB) "KB)");

Where can I find the definition of DEFAULT_MAX_CHUNK_SIZE_KB?

> +static char cq_affinity_list[256] = "";

No empty initializers for file-scope variables please.

> +	pr_info("cq_affinity_list changed to %*pbl\n",
> +		cpumask_pr_args(&cq_affinity_mask));

Should this pr_info() call perhaps be changed into pr_debug()?

> +static bool __ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
> +				     enum ibtrs_srv_state new_state)
> +{
> +	enum ibtrs_srv_state old_state;
> +	bool changed = false;
> +
> +	old_state = sess->state;
> +	switch (new_state) {

Please add a lockdep_assert_held() statement that checks whether calls 
of this function are serialized properly.

> +/**
> + * rdma_write_sg() - response on successful READ request
> + */
> +static int rdma_write_sg(struct ibtrs_srv_op *id)
> +{
> +	struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
> +	dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
> +	struct ibtrs_srv *srv = sess->srv;
> +	struct ib_send_wr inv_wr, imm_wr;
> +	struct ib_rdma_wr *wr = NULL;
> +	const struct ib_send_wr *bad_wr;
> +	enum ib_send_flags flags;
> +	size_t sg_cnt;
> +	int err, i, offset;
> +	bool need_inval;
> +	u32 rkey = 0;
> +
> +	sg_cnt = le16_to_cpu(id->rd_msg->sg_cnt);
> +	need_inval = le16_to_cpu(id->rd_msg->flags) & IBTRS_MSG_NEED_INVAL_F;
> +	if (unlikely(!sg_cnt))
> +		return -EINVAL;
> +
> +	offset = 0;
> +	for (i = 0; i < sg_cnt; i++) {
> +		struct ib_sge *list;
> +
> +		wr		= &id->tx_wr[i];
> +		list		= &id->tx_sg[i];
> +		list->addr	= dma_addr + offset;
> +		list->length	= le32_to_cpu(id->rd_msg->desc[i].len);
> +
> +		/* WR will fail with length error
> +		 * if this is 0
> +		 */
> +		if (unlikely(list->length == 0)) {
> +			ibtrs_err(sess, "Invalid RDMA-Write sg list length 0\n");
> +			return -EINVAL;
> +		}
> +
> +		list->lkey = sess->s.dev->ib_pd->local_dma_lkey;
> +		offset += list->length;
> +
> +		wr->wr.wr_cqe	= &io_comp_cqe;
> +		wr->wr.sg_list	= list;
> +		wr->wr.num_sge	= 1;
> +		wr->remote_addr	= le64_to_cpu(id->rd_msg->desc[i].addr);
> +		wr->rkey	= le32_to_cpu(id->rd_msg->desc[i].key);
> +		if (rkey == 0)
> +			rkey = wr->rkey;
> +		else
> +			/* Only one key is actually used */
> +			WARN_ON_ONCE(rkey != wr->rkey);
> +
> +		if (i < (sg_cnt - 1))
> +			wr->wr.next = &id->tx_wr[i + 1].wr;
> +		else if (need_inval)
> +			wr->wr.next = &inv_wr;
> +		else
> +			wr->wr.next = &imm_wr;
> +
> +		wr->wr.opcode = IB_WR_RDMA_WRITE;
> +		wr->wr.ex.imm_data = 0;
> +		wr->wr.send_flags  = 0;
> +	}
> +	/*
> +	 * From time to time we have to post signalled sends,
> +	 * or send queue will fill up and only QP reset can help.
> +	 */
> +	flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
> +			0 : IB_SEND_SIGNALED;
> +
> +	if (need_inval) {
> +		inv_wr.next = &imm_wr;
> +		inv_wr.wr_cqe = &io_comp_cqe;
> +		inv_wr.sg_list = NULL;
> +		inv_wr.num_sge = 0;
> +		inv_wr.opcode = IB_WR_SEND_WITH_INV;
> +		inv_wr.send_flags = 0;
> +		inv_wr.ex.invalidate_rkey = rkey;
> +	}
> +	imm_wr.next = NULL;
> +	imm_wr.wr_cqe = &io_comp_cqe;
> +	imm_wr.sg_list = NULL;
> +	imm_wr.num_sge = 0;
> +	imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
> +	imm_wr.send_flags = flags;
> +	imm_wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id,
> +							     0, need_inval));
> +
> +	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, dma_addr,
> +				      offset, DMA_BIDIRECTIONAL);
> +
> +	err = ib_post_send(id->con->c.qp, &id->tx_wr[0].wr, &bad_wr);
> +	if (unlikely(err))
> +		ibtrs_err(sess,
> +			  "Posting RDMA-Write-Request to QP failed, err: %d\n",
> +			  err);
> +
> +	return err;
> +}

All other RDMA server implementations use rdma_rw_ctx_init() and 
rdma_rw_ctx_wrs(). Please use these functions in IBTRS too.

> +static void ibtrs_srv_hb_err_handler(struct ibtrs_con *c, int err)
> +{
> +	(void)err;
> +	close_sess(to_srv_sess(c->sess));
> +}

Is the (void)err statement really necessary?

> +static int ibtrs_srv_rdma_init(struct ibtrs_srv_ctx *ctx, unsigned int port)
> +{
> +	struct sockaddr_in6 sin = {
> +		.sin6_family	= AF_INET6,
> +		.sin6_addr	= IN6ADDR_ANY_INIT,
> +		.sin6_port	= htons(port),
> +	};
> +	struct sockaddr_ib sib = {
> +		.sib_family			= AF_IB,
> +		.sib_addr.sib_subnet_prefix	= 0ULL,
> +		.sib_addr.sib_interface_id	= 0ULL,
> +		.sib_sid	= cpu_to_be64(RDMA_IB_IP_PS_IB | port),
> +		.sib_sid_mask	= cpu_to_be64(0xffffffffffffffffULL),
> +		.sib_pkey	= cpu_to_be16(0xffff),
> +	};
> +	struct rdma_cm_id *cm_ip, *cm_ib;
> +	int ret;
> +
> +	/*
> +	 * We accept both IPoIB and IB connections, so we need to keep
> +	 * two cm id's, one for each socket type and port space.
> +	 * If the cm initialization of one of the id's fails, we abort
> +	 * everything.
> +	 */
> +	cm_ip = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sin, RDMA_PS_TCP);
> +	if (unlikely(IS_ERR(cm_ip)))
> +		return PTR_ERR(cm_ip);
> +
> +	cm_ib = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sib, RDMA_PS_IB);
> +	if (unlikely(IS_ERR(cm_ib))) {
> +		ret = PTR_ERR(cm_ib);
> +		goto free_cm_ip;
> +	}
> +
> +	ctx->cm_id_ip = cm_ip;
> +	ctx->cm_id_ib = cm_ib;
> +
> +	return 0;
> +
> +free_cm_ip:
> +	rdma_destroy_id(cm_ip);
> +
> +	return ret;
> +}

Will the above work if CONFIG_IPV6=n?

> +static int __init ibtrs_server_init(void)
> +{
> +	int err;
> +
> +	if (!strlen(cq_affinity_list))
> +		init_cq_affinity();

Is the above if-test useful? Can that if-test be left out?

Thanks,

Bart.


* Re: [PATCH v4 11/25] ibtrs: server: statistics functions
       [not found] ` <20190620150337.7847-12-jinpuwang@gmail.com>
@ 2019-09-23 23:56   ` Bart Van Assche
  2019-10-02 15:15     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-23 23:56 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
> +				    char *page, size_t len)
> +{
> +	struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> +	struct ibtrs_srv_sess *sess;
> +
> +	sess = container_of(stats, typeof(*sess), stats);
> +
> +	return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
> +			 (s64)atomic64_read(&r->dir[READ].cnt),
> +			 (s64)atomic64_read(&r->dir[READ].size_total),
> +			 (s64)atomic64_read(&r->dir[WRITE].cnt),
> +			 (s64)atomic64_read(&r->dir[WRITE].size_total),
> +			 atomic_read(&sess->ids_inflight));
> +}

Does this follow the sysfs one-value-per-file rule? See also 
Documentation/filesystems/sysfs.txt.

Thanks,

Bart.


* Re: [PATCH v4 12/25] ibtrs: server: sysfs interface functions
       [not found] ` <20190620150337.7847-13-jinpuwang@gmail.com>
@ 2019-09-24  0:00   ` Bart Van Assche
  2019-10-02 15:11     ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-24  0:00 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, jgg, dledford, danil.kipnis, rpenyaev,
	Roman Pen, Jack Wang

On 6/20/19 8:03 AM, Jack Wang wrote:
> +static void ibtrs_srv_dev_release(struct device *dev)
> +{
> +	/* Nobody plays with device references, so nop */
> +}

I doubt that the above comment is correct.

Thanks,

Bart.


* Re: [PATCH v4 01/25] sysfs: export sysfs_remove_file_self()
  2019-09-23 17:21   ` Bart Van Assche
@ 2019-09-25  9:30     ` Danil Kipnis
  0 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25  9:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, linux-kernel

On Mon, Sep 23, 2019 at 7:21 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > Function is going to be used in transport over RDMA module
> > in subsequent patches.
>
> It seems like several words are missing from this patch description.
Will extend it with the corresponding description of the function from
fs/sysfs/file.c and an explanation of why we need it.


* Re: [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections
  2019-09-23 17:44   ` [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections Bart Van Assche
@ 2019-09-25 10:20     ` Danil Kipnis
  2019-09-25 15:38       ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25 10:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Mon, Sep 23, 2019 at 7:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
> > +/**
> > + * enum ibtrs_clt_link_ev - Events about connectivity state of a client
> > + * @IBTRS_CLT_LINK_EV_RECONNECTED    Client was reconnected.
> > + * @IBTRS_CLT_LINK_EV_DISCONNECTED   Client was disconnected.
> > + */
> > +enum ibtrs_clt_link_ev {
> > +     IBTRS_CLT_LINK_EV_RECONNECTED,
> > +     IBTRS_CLT_LINK_EV_DISCONNECTED,
> > +};
> > +
> > +/**
> > + * Source and destination address of a path to be established
> > + */
> > +struct ibtrs_addr {
> > +     struct sockaddr_storage *src;
> > +     struct sockaddr_storage *dst;
> > +};
>
> Is it really useful to define a structure to hold two pointers or can
> these two pointers also be passed as separate arguments?
We always need both src and dst throughout the ibnbd and ibtrs code, and
indeed one reason to introduce this struct is that "f(struct
ibtrs_addr *addr, ...);" is shorter than "f(struct sockaddr_storage
*src, struct sockaddr_storage *dst, ...);". It also makes it easier to
extend the address information describing an ibtrs path in the future.

> > +/**
> > + * ibtrs_clt_open() - Open a session to a IBTRS client
> > + * @priv:            User supplied private data.
> > + * @link_ev:         Event notification for connection state changes
> > + *   @priv:                  user supplied data that was passed to
> > + *                           ibtrs_clt_open()
> > + *   @ev:                    Occurred event
> > + * @sessname: name of the session
> > + * @paths: Paths to be established defined by their src and dst addresses
> > + * @path_cnt: Number of elemnts in the @paths array
> > + * @port: port to be used by the IBTRS session
> > + * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
> > + * @max_inflight_msg: Max. number of parallel inflight messages for the session
> > + * @max_segments: Max. number of segments per IO request
> > + * @reconnect_delay_sec: time between reconnect tries
> > + * @max_reconnect_attempts: Number of times to reconnect on error before giving
> > + *                       up, 0 for * disabled, -1 for forever
> > + *
> > + * Starts session establishment with the ibtrs_server. The function can block
> > + * up to ~2000ms until it returns.
> > + *
> > + * Return a valid pointer on success otherwise PTR_ERR.
> > + */
> > +struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
> > +                              const char *sessname,
> > +                              const struct ibtrs_addr *paths,
> > +                              size_t path_cnt, short port,
> > +                              size_t pdu_sz, u8 reconnect_delay_sec,
> > +                              u16 max_segments,
> > +                              s16 max_reconnect_attempts);
>
> Having detailed kernel-doc headers for describing API functions is great
> but I'm not sure a .h file is the best location for such documentation.
> Many kernel developers keep kernel-doc headers in .c files because that
> makes it more likely that the documentation and the implementation stay
> in sync.
Which is better: moving it, or only copying it to the corresponding .c file?

>
> > +
> > +/**
> > + * ibtrs_clt_close() - Close a session
> > + * @sess: Session handler, is freed on return
>                       ^^^^^^^
>                       handle?
>
> This sentence suggests that the handle is freed on return. I guess that
> you meant that the session is freed upon return?
Right, will fix the wording.

>
> > +/**
> > + * ibtrs_clt_get_tag() - allocates tag for future RDMA operation
> > + * @sess:    Current session
> > + * @con_type:        Type of connection to use with the tag
> > + * @wait:    Wait type
> > + *
> > + * Description:
> > + *    Allocates tag for the following RDMA operation.  Tag is used
> > + *    to preallocate all resources and to propagate memory pressure
> > + *    up earlier.
> > + *
> > + * Context:
> > + *    Can sleep if @wait == IBTRS_TAG_WAIT
> > + */
> > +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess,
> > +                                 enum ibtrs_clt_con_type con_type,
> > +                                 int wait);
>
> Since struct ibtrs_tag has another role than what is called a tag in the
> block layer I think a better description is needed of what struct
> ibtrs_tag actually represents.
I think it would be better to rename it to ibtrs_permit in order to
avoid confusion with block layer tags. Will extend the description
also.

> > +/*
> > + * Here goes IBTRS server API
> > + */
>
> Most software either uses the client API or the server API but not both
> at the same time. Has it been considered to use separate header files
> for the client and server APIs?
I don't have any really good reason to put the server and client APIs
into a single file, except maybe that the reader can see the API calls
corresponding to the full sequence of request -> indication -> response
-> confirmation in one place.


* Re: [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections
  2019-09-25 10:20     ` Danil Kipnis
@ 2019-09-25 15:38       ` Bart Van Assche
  0 siblings, 0 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-25 15:38 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/25/19 3:20 AM, Danil Kipnis wrote:
> On Mon, Sep 23, 2019 at 7:44 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>> +/**
>>> + * ibtrs_clt_open() - Open a session to a IBTRS client
>>> + * @priv:            User supplied private data.
>>> + * @link_ev:         Event notification for connection state changes
>>> + *   @priv:                  user supplied data that was passed to
>>> + *                           ibtrs_clt_open()
>>> + *   @ev:                    Occurred event
>>> + * @sessname: name of the session
>>> + * @paths: Paths to be established defined by their src and dst addresses
>>> + * @path_cnt: Number of elemnts in the @paths array
>>> + * @port: port to be used by the IBTRS session
>>> + * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
>>> + * @max_inflight_msg: Max. number of parallel inflight messages for the session
>>> + * @max_segments: Max. number of segments per IO request
>>> + * @reconnect_delay_sec: time between reconnect tries
>>> + * @max_reconnect_attempts: Number of times to reconnect on error before giving
>>> + *                       up, 0 for * disabled, -1 for forever
>>> + *
>>> + * Starts session establishment with the ibtrs_server. The function can block
>>> + * up to ~2000ms until it returns.
>>> + *
>>> + * Return a valid pointer on success otherwise PTR_ERR.
>>> + */
>>> +struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
>>> +                              const char *sessname,
>>> +                              const struct ibtrs_addr *paths,
>>> +                              size_t path_cnt, short port,
>>> +                              size_t pdu_sz, u8 reconnect_delay_sec,
>>> +                              u16 max_segments,
>>> +                              s16 max_reconnect_attempts);
>>
>> Having detailed kernel-doc headers for describing API functions is great
>> but I'm not sure a .h file is the best location for such documentation.
>> Many kernel developers keep kernel-doc headers in .c files because that
>> makes it more likely that the documentation and the implementation stay
>> in sync.
 >
> What is better: to move it or to only copy it to the corresponding C file?

Please move the kernel-doc header into the corresponding .c file and 
remove the kernel-doc header from the .h file.

Thanks,

Bart.


* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-23 21:51   ` [PATCH v4 06/25] ibtrs: client: main functionality Bart Van Assche
@ 2019-09-25 17:36     ` Danil Kipnis
  2019-09-25 18:55       ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25 17:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

Hello Bart,

On Mon, Sep 23, 2019 at 11:51 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +static const struct ibtrs_ib_dev_pool_ops dev_pool_ops;
> > +static struct ibtrs_ib_dev_pool dev_pool = {
> > +     .ops = &dev_pool_ops
> > +};
>
> Can the definitions in this file be reordered such that the forward
> declaration of dev_pool_ops can be removed?
Will try to.

> > +static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
> > +static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
> > +                                  struct rdma_cm_event *ev);
> > +static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
> > +static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
> > +                           bool notify, bool can_wait);
> > +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
> > +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);
>
> Please also remove these forward declarations.
OK

> > +bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
> > +{
> > +     return sess->state == IBTRS_CLT_CONNECTED;
> > +}
>
> Is it really useful to introduce a one line function for testing the
> session state?
No, not in that case really, thanks.

> > +static inline struct ibtrs_tag *
> > +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> > +{
> > +     size_t max_depth = clt->queue_depth;
> > +     struct ibtrs_tag *tag;
> > +     int cpu, bit;
> > +
> > +     cpu = get_cpu();
> > +     do {
> > +             bit = find_first_zero_bit(clt->tags_map, max_depth);
> > +             if (unlikely(bit >= max_depth)) {
> > +                     put_cpu();
> > +                     return NULL;
> > +             }
> > +
> > +     } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> > +     put_cpu();
> > +
> > +     tag = GET_TAG(clt, bit);
> > +     WARN_ON(tag->mem_id != bit);
> > +     tag->cpu_id = cpu;
> > +     tag->con_type = con_type;
> > +
> > +     return tag;
> > +}
>
> What is the role of the get_cpu() and put_cpu() calls in this function?
> How can it make sense to assign the cpu number to tag->cpu_id after
> put_cpu() has been called?
We disable preemption while looking for a free ibtrs_tag (permit) in
our tags_map. We store the CPU number ibtrs_clt_get_tag() was
originally called on in the ibtrs_tag we just found, so that when the
user later uses this ibtrs_tag for an RDMA operation
(ibtrs_clt_request()), we select the RDMA connection with the
cq_vector corresponding to that CPU. If IRQ affinity is configured
accordingly, this allows an I/O response to be processed on the
same CPU the I/O request was originally submitted on.

> > +static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
> > +                                   struct ibtrs_clt_sess *sess,
> > +                                   ibtrs_conf_fn *conf,
> > +                                   struct ibtrs_tag *tag, void *priv,
> > +                                   const struct kvec *vec, size_t usr_len,
> > +                                   struct scatterlist *sg, size_t sg_cnt,
> > +                                   size_t data_len, int dir)
> > +{
> > +     struct iov_iter iter;
> > +     size_t len;
> > +
> > +     req->tag = tag;
> > +     req->in_use = true;
> > +     req->usr_len = usr_len;
> > +     req->data_len = data_len;
> > +     req->sglist = sg;
> > +     req->sg_cnt = sg_cnt;
> > +     req->priv = priv;
> > +     req->dir = dir;
> > +     req->con = ibtrs_tag_to_clt_con(sess, tag);
> > +     req->conf = conf;
> > +     req->need_inv = false;
> > +     req->need_inv_comp = false;
> > +     req->inv_errno = 0;
> > +
> > +     iov_iter_kvec(&iter, READ, vec, 1, usr_len);
> > +     len = _copy_from_iter(req->iu->buf, usr_len, &iter);
> > +     WARN_ON(len != usr_len);
> > +
> > +     reinit_completion(&req->inv_comp);
> > +     if (sess->stats.enable_rdma_lat)
> > +             req->start_jiffies = jiffies;
> > +}
>
> A comment that explains what "req" stands for would be welcome. Since
> this function copies the entire payload, I assume that it is only used
> for control messages and not for reading or writing data from a block
> device?
Yes, we only copy the control message provided by the user. Will extend
the description.

> > +static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
> > +                               struct ibtrs_clt_io_req *fail_req)
> > +{
> > +     struct ibtrs_clt_sess *alive_sess;
> > +     struct ibtrs_clt_io_req *req;
> > +     int err = -ECONNABORTED;
> > +     struct path_it it;
> > +
> > +     do_each_path(alive_sess, clt, &it) {
> > +             if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
> > +                     continue;
> > +             req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
> > +             if (req->dir == DMA_TO_DEVICE)
> > +                     err = ibtrs_clt_write_req(req);
> > +             else
> > +                     err = ibtrs_clt_read_req(req);
> > +             if (unlikely(err)) {
> > +                     req->in_use = false;
> > +                     continue;
> > +             }
> > +             /* Success path */
> > +             ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
> > +             break;
> > +     } while_each_path(&it);
> > +
> > +     return err;
> > +}
>
> Also for this function, a comment that explains the purpose of this
> function would be welcome.
Will add a description to it.

>
> > +static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess)
> > +{
> > +     struct ibtrs_clt *clt = sess->clt;
> > +     struct ibtrs_clt_io_req *req;
> > +     int i, err;
> > +
> > +     if (!sess->reqs)
> > +             return;
> > +     for (i = 0; i < sess->queue_depth; ++i) {
> > +             req = &sess->reqs[i];
> > +             if (!req->in_use)
> > +                     continue;
> > +
> > +             /*
> > +              * Safely (without notification) complete failed request.
> > +              * After completion this request is still usable and can
> > +              * be failed over to another path.
> > +              */
> > +             complete_rdma_req(req, -ECONNABORTED, false, true);
> > +
> > +             err = ibtrs_clt_failover_req(clt, req);
> > +             if (unlikely(err))
> > +                     /* Failover failed, notify anyway */
> > +                     req->conf(req->priv, err);
> > +     }
> > +}
>
> What guarantees that this function does not call complete_rdma_req()
> while complete_rdma_req() is called from the regular completion path?
Before calling this function all the QPs are drained in
ibtrs_clt_stop_and_destroy_conns(...).

> > +static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
> > +                                  enum ibtrs_clt_state new_state)
> > +{
> > +     enum ibtrs_clt_state old_state;
> > +     bool changed = false;
> > +
> > +     old_state = sess->state;
> > +     switch (new_state) {
>
> Please use lockdep_assert_held() inside this function to verify at
> runtime that session state changes are serialized properly.
I haven't used lockdep_assert_held() before, will look into it.

> > +static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
> > +{
> > +     enum ibtrs_clt_state state;
> > +
> > +     spin_lock_irq(&sess->state_wq.lock);
> > +     state = sess->state;
> > +     spin_unlock_irq(&sess->state_wq.lock);
> > +
> > +     return state;
> > +}
>
> Please remove this function and read sess->state without holding
> state_wq.lock.
ok.

> > +static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
> > +{
> > +     struct ibtrs_clt_con *con;
> > +
> > +     (void)err;
> > +     con = container_of(c, typeof(*con), c);
> > +     ibtrs_rdma_error_recovery(con);
> > +}
>
> Can "(void)err" be left out?
Yes
> Can the declaration and assignment of 'con' be merged into a single line
> of code?
Yes

> > +static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
> > +{
> > +     struct ibtrs_clt_con *con;
> > +
> > +     con = kzalloc(sizeof(*con), GFP_KERNEL);
> > +     if (unlikely(!con))
> > +             return -ENOMEM;
> > +
> > +     /* Map first two connections to the first CPU */
> > +     con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
> > +     con->c.cid = cid;
> > +     con->c.sess = &sess->s;
> > +     atomic_set(&con->io_cnt, 0);
> > +
> > +     sess->s.con[cid] = &con->c;
> > +
> > +     return 0;
> > +}
>
> The code to map a connection ID to onto a CPU occurs multiple times. Has
> it been considered to introduce a function for that mapping? Although
> one-line inline functions are not recommended in general, such a
> function will also make it easier to experiment with other mapping
> approaches, e.g. mapping hyperthread siblings onto the same connection ID.
We have one connection for "user control messages" and as many
connections as CPUs for the actual I/O traffic. They all have different
cq_vectors. This way one can experiment with any mapping by simply
setting a different smp_affinity for the IRQs corresponding to these
cq_vectors under /proc/irq/.

> > +static inline bool xchg_sessions(struct ibtrs_clt_sess __rcu **rcu_ppcpu_path,
> > +                              struct ibtrs_clt_sess *sess,
> > +                              struct ibtrs_clt_sess *next)
> > +{
> > +     struct ibtrs_clt_sess **ppcpu_path;
> > +
> > +     /* Call cmpxchg() without sparse warnings */
> > +     ppcpu_path = (typeof(ppcpu_path))rcu_ppcpu_path;
> > +     return (sess == cmpxchg(ppcpu_path, sess, next));
> > +}
>
> This looks suspicious. Has it been considered to protect changes of
> rcu_ppcpu_path with a mutex and to protect reads with an RCU read lock?
>
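>
> For readers unfamiliar with the pattern behind xchg_sessions(), a minimal
> userspace sketch of the same compare-and-swap idea, using C11 atomics in
> place of the kernel's cmpxchg() (types and names are simplified stand-ins):
>
> ```c
> #include <assert.h>
> #include <stdatomic.h>
> #include <stdio.h>
>
> struct sess { const char *name; };
>
> /* Atomically replace the cached per-CPU path pointer, but only if it
>  * still points at the session we expect (hypothetical userspace model). */
> static int xchg_sessions(struct sess *_Atomic *ppcpu_path,
> 			 struct sess *sess, struct sess *next)
> {
> 	return atomic_compare_exchange_strong(ppcpu_path, &sess, next);
> }
>
> int main(void)
> {
> 	struct sess a = { "path-a" }, b = { "path-b" };
> 	struct sess *_Atomic pcpu_path = &a;
>
> 	assert(xchg_sessions(&pcpu_path, &a, &b));	/* expected value matched */
> 	assert(!xchg_sessions(&pcpu_path, &a, &b));	/* fails: &b is cached now */
> 	assert(atomic_load(&pcpu_path) == &b);
> 	printf("ok\n");
> 	return 0;
> }
> ```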
> > +static void ibtrs_clt_add_path_to_arr(struct ibtrs_clt_sess *sess,
> > +                                   struct ibtrs_addr *addr)
> > +{
> > +     struct ibtrs_clt *clt = sess->clt;
> > +
> > +     mutex_lock(&clt->paths_mutex);
> > +     clt->paths_num++;
> > +
> > +     /*
> > +      * Firstly increase paths_num, wait for GP and then
> > +      * add path to the list.  Why?  Since we add path with
> > +      * !CONNECTED state explanation is similar to what has
> > +      * been written in ibtrs_clt_remove_path_from_arr().
> > +      */
> > +     synchronize_rcu();
> > +
> > +     list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
> > +     mutex_unlock(&clt->paths_mutex);
> > +}
>
> synchronize_rcu() while a mutex is being held? Really?
The construct around our multipath implementation has been reviewed
(https://lkml.org/lkml/2018/5/18/659) and then validated:
https://lkml.org/lkml/2018/5/28/2080.

> > +static void ibtrs_clt_close_work(struct work_struct *work)
> > +{
> > +     struct ibtrs_clt_sess *sess;
> > +
> > +     sess = container_of(work, struct ibtrs_clt_sess, close_work);
> > +
> > +     cancel_delayed_work_sync(&sess->reconnect_dwork);
> > +     ibtrs_clt_stop_and_destroy_conns(sess);
> > +     /*
> > +      * Sounds stupid, huh?  No, it is not.  Consider this sequence:
> > +      *
> > +      *   #CPU0                              #CPU1
> > +      *   1.  CONNECTED->RECONNECTING
> > +      *   2.                                 RECONNECTING->CLOSING
> > +      *   3.  queue_work(&reconnect_dwork)
> > +      *   4.                                 queue_work(&close_work);
> > +      *   5.  reconnect_work();              close_work();
> > +      *
> > +      * To avoid that case do cancel twice: before and after.
> > +      */
> > +     cancel_delayed_work_sync(&sess->reconnect_dwork);
> > +     ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSED);
> > +}
>
> The above code looks suspicious to me. I think there should be an
> additional state change at the start of this function to prevent that
> reconnect_dwork gets requeued after having been canceled
Will look into it again, thanks.

>
> > +static void ibtrs_clt_dev_release(struct device *dev)
> > +{
> > +     /* Nobody plays with device references, so nop */
> > +}
>
> That comment sounds wrong. Have you reviewed all of the device driver
> core code and checked that there is no code in there that manipulates
> struct device refcounts? I think the code that frees struct ibtrs_clt
> should be moved from free_clt() into the above function.

We only use the device to create an entry under /sys/class. free_clt()
destroys the sysfs entries first and unregisters the device afterwards. I
don't really see the need to free from the callback instead... Will
make that clear in the comment.

Thanks a lot,
Danil

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-25 17:36     ` Danil Kipnis
@ 2019-09-25 18:55       ` Bart Van Assche
  2019-09-25 20:50         ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-25 18:55 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/25/19 10:36 AM, Danil Kipnis wrote:
> On Mon, Sep 23, 2019 at 11:51 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 6/20/19 8:03 AM, Jack Wang wrote:
>>> +static void ibtrs_clt_dev_release(struct device *dev)
>>> +{
>>> +     /* Nobody plays with device references, so nop */
>>> +}
>>
>> That comment sounds wrong. Have you reviewed all of the device driver
>> core code and checked that there is no code in there that manipulates
>> struct device refcounts? I think the code that frees struct ibtrs_clt
>> should be moved from free_clt() into the above function.
> 
> We only use the device to create an entry under /sys/class. free_clt()
> is destroying sysfs first and unregisters the device afterwards. I
> don't really see the need to free from the callback instead... Will
> make it clear in the comment.

There is plenty of code under drivers/base that calls get_device() and
put_device(). Are you sure that none of the code under drivers/base will
ever call get_device() and put_device() for the ibtrs client device?

Thanks,

Bart.



* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-25 18:55       ` Bart Van Assche
@ 2019-09-25 20:50         ` Danil Kipnis
  2019-09-25 21:08           ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25 20:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/25/19 10:36 AM, Danil Kipnis wrote:
> > On Mon, Sep 23, 2019 at 11:51 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +static void ibtrs_clt_dev_release(struct device *dev)
> >>> +{
> >>> +     /* Nobody plays with device references, so nop */
> >>> +}
> >>
> >> That comment sounds wrong. Have you reviewed all of the device driver
> >> core code and checked that there is no code in there that manipulates
> >> struct device refcounts? I think the code that frees struct ibtrs_clt
> >> should be moved from free_clt() into the above function.
> >
> > We only use the device to create an entry under /sys/class. free_clt()
> > is destroying sysfs first and unregisters the device afterwards. I
> > don't really see the need to free from the callback instead... Will
> > make it clear in the comment.
>
> There is plenty of code under drivers/base that calls get_device() and
> put_device(). Are you sure that none of the code under drivers/base will
> ever call get_device() and put_device() for the ibtrs client device?
You mean how could multiple kernel modules share the same ibtrs
session...? I really never thought that far...

> Thanks,
>
> Bart.
>


* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-25 20:50         ` Danil Kipnis
@ 2019-09-25 21:08           ` Bart Van Assche
  2019-09-25 21:16             ` Bart Van Assche
  2019-09-25 22:53             ` Danil Kipnis
  0 siblings, 2 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-25 21:08 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/25/19 1:50 PM, Danil Kipnis wrote:
> On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> There is plenty of code under drivers/base that calls get_device() and
>> put_device(). Are you sure that none of the code under drivers/base will
>> ever call get_device() and put_device() for the ibtrs client device?
>
> You mean how could multiple kernel modules share the same ibtrs
> session...? I really never thought that far...

I meant something else: device_register() registers struct device
instances in multiple lists. The driver core may decide to iterate over
these lists and to call get_device() / put_device() on the devices it
finds in these lists.

Bart.



* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-25 21:08           ` Bart Van Assche
@ 2019-09-25 21:16             ` Bart Van Assche
  2019-09-25 22:53             ` Danil Kipnis
  1 sibling, 0 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-25 21:16 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/25/19 2:08 PM, Bart Van Assche wrote:
> On 9/25/19 1:50 PM, Danil Kipnis wrote:
>> On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>> There is plenty of code under drivers/base that calls get_device() and
>>> put_device(). Are you sure that none of the code under drivers/base will
>>> ever call get_device() and put_device() for the ibtrs client device?
>>
>> You mean how could multiple kernel modules share the same ibtrs
>> session...? I really never thought that far...
> 
> I meant something else: device_register() registers struct device
> instances in multiple lists. The driver core may decide to iterate over
> these lists and to call get_device() / put_device() on the devices it
> finds in these lists.

Examples of such functions are device_pm_add() (which is called
indirectly by device_register()) and dpm_prepare(). Although it is
unlikely that this code will be used in combination with suspend/resume,
I don't think these drivers should be written such that they are
incompatible with the runtime power management code.

Bart.



* Re: [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers
  2019-09-23 22:50   ` [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers Bart Van Assche
@ 2019-09-25 21:45     ` Danil Kipnis
  2019-09-25 21:57       ` Bart Van Assche
  2019-09-27  8:56     ` Jinpu Wang
  1 sibling, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25 21:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Roman Pen, Jack Wang

On Tue, Sep 24, 2019 at 12:50 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#define P1 )
> > +#define P2 ))
> > +#define P3 )))
> > +#define P4 ))))
> > +#define P(N) P ## N
> > +
> > +#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
> > +#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
> > +
> > +#define LIST(...)                                            \
> > +     __VA_ARGS__,                                            \
> > +     ({ unknown_type(); NULL; })                             \
> > +     CAT(P, COUNT_ARGS(__VA_ARGS__))                         \
> > +
> > +#define EMPTY()
> > +#define DEFER(id) id EMPTY()
> > +
> > +#define _CASE(obj, type, member)                             \
> > +     __builtin_choose_expr(                                  \
> > +     __builtin_types_compatible_p(                           \
> > +             typeof(obj), type),                             \
> > +             ((type)obj)->member
> > +#define CASE(o, t, m) DEFER(_CASE)(o, t, m)
> > +
> > +/*
> > + * Below we define retrieving of sessname from common IBTRS types.
> > + * Client or server related types have to be defined by special
> > + * TYPES_TO_SESSNAME macro.
> > + */
> > +
> > +void unknown_type(void);
> > +
> > +#ifndef TYPES_TO_SESSNAME
> > +#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
> > +#endif
> > +
> > +#define ibtrs_prefix(obj)                                    \
> > +     _CASE(obj, struct ibtrs_con *,  sess->sessname),        \
> > +     _CASE(obj, struct ibtrs_sess *, sessname),              \
> > +     TYPES_TO_SESSNAME(obj)                                  \
> > +     ))
>
> No preprocessor voodoo please. Please remove all of the above and modify
> the logging statements such that these pass the proper name string as
> first argument to logging macros.
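>
> For reference, a minimal userspace sketch of the mechanism these macros
> rely on: __builtin_types_compatible_p() selecting a struct member via
> __builtin_choose_expr() at compile time. The types here are hypothetical
> stand-ins, not the actual IBTRS structures:
>
> ```c
> #include <assert.h>
> #include <stdio.h>
> #include <string.h>
>
> struct sess { char sessname[32]; };
> struct con  { struct sess *sess; };
>
> /* Pick the right member at compile time based on the pointer type.
>  * The not-taken branch must parse, but is never evaluated. */
> #define sess_name(obj)						\
> 	__builtin_choose_expr(					\
> 		__builtin_types_compatible_p(__typeof__(obj),	\
> 					     struct con *),	\
> 		((struct con *)(void *)(obj))->sess->sessname,	\
> 		((struct sess *)(void *)(obj))->sessname)
>
> int main(void)
> {
> 	struct sess s = { "sess0" };
> 	struct con c = { .sess = &s };
>
> 	assert(strcmp(sess_name(&c), "sess0") == 0);	/* via con->sess */
> 	assert(strcmp(sess_name(&s), "sess0") == 0);	/* direct member */
> 	printf("ok\n");
> 	return 0;
> }
> ```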

Hi Bart,

do you think it would make sense we first submit a new patchset for
IBTRS (with the changes you suggested plus closed security problem)
and later submit a separate one for IBNBD only?

Thank you,
Danil


* Re: [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers
  2019-09-25 21:45     ` Danil Kipnis
@ 2019-09-25 21:57       ` Bart Van Assche
  0 siblings, 0 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-25 21:57 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Roman Pen, Jack Wang

On 9/25/19 2:45 PM, Danil Kipnis wrote:
> do you think it would make sense we first submit a new patchset for
> IBTRS (with the changes you suggested plus closed security problem)
> and later submit a separate one for IBNBD only?

I'm not sure what others prefer. Personally I prefer to see all the 
code, i.e. both IBTRS and IBNBD.

Thanks,

Bart.


* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-18 15:47             ` Bart Van Assche
  2019-09-20  8:29               ` Danil Kipnis
@ 2019-09-25 22:26               ` Danil Kipnis
  2019-09-26  9:55                 ` Roman Penyaev
  1 sibling, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25 22:26 UTC (permalink / raw)
  To: Bart Van Assche, Roman Pen
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Wed, Sep 18, 2019 at 5:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
> Combining multiple queues (a) into a single queue (b) that is smaller
> than the combined source queues without sacrificing performance is
> tricky. We already have one such implementation in the block layer core
> and it took considerable time to get that implementation right. See e.g.
> blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart().

Roma, can you please estimate the performance impact in case we switch to it?


* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-25 21:08           ` Bart Van Assche
  2019-09-25 21:16             ` Bart Van Assche
@ 2019-09-25 22:53             ` Danil Kipnis
  2019-09-25 23:21               ` Bart Van Assche
  1 sibling, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25 22:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Wed, Sep 25, 2019 at 11:08 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/25/19 1:50 PM, Danil Kipnis wrote:
> > On Wed, Sep 25, 2019 at 8:55 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >> There is plenty of code under drivers/base that calls get_device() and
> >> put_device(). Are you sure that none of the code under drivers/base will
> >> ever call get_device() and put_device() for the ibtrs client device?
> >
> > You mean how could multiple kernel modules share the same ibtrs
> > session...? I really never thought that far...
>
> I meant something else: device_register() registers struct device
> instances in multiple lists. The driver core may decide to iterate over
> these lists and to call get_device() / put_device() on the devices it
> finds in these lists.
Oh, you mean we just need stub functions for those, so that nobody
steps on a null?


* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-25 22:53             ` Danil Kipnis
@ 2019-09-25 23:21               ` Bart Van Assche
  2019-09-26  9:16                 ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-25 23:21 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/25/19 3:53 PM, Danil Kipnis wrote:
> Oh, you mean we just need stub functions for those, so that nobody
> steps on a null?

What I meant is that the memory that is backing a device must not be 
freed until the reference count of a device has dropped to zero. If a 
struct device is embedded in a larger structure that means signaling a 
completion from inside the release function (ibtrs_clt_dev_release()) 
and not freeing the struct device memory (kfree(clt) in free_clt()) 
before that completion has been triggered.

Bart.



* Re: [PATCH v4 16/25] ibnbd: client: private header with client structs and functions
  2019-09-17 16:36     ` Jinpu Wang
@ 2019-09-25 23:43       ` Danil Kipnis
  2019-09-26 10:00         ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-25 23:43 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev

On Tue, Sep 17, 2019 at 6:36 PM Jinpu Wang <jinpu.wang@cloud.ionos.com> wrote:
>
> On Sat, Sep 14, 2019 at 12:25 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > +     char                    pathname[NAME_MAX];
> > [ ... ]
> >  > +    char                    blk_symlink_name[NAME_MAX];
> >
> > Please allocate path names dynamically instead of hard-coding the upper
> > length for a path.
Those strings are used to name directories and files under sysfs,
which I think makes NAME_MAX a natural limitation for them. Client and
server only exchange those strings on connection establishment, not in
the I/O path. We do not really need to save 256K on a server with 1000
devices mapped in parallel. A patch to allocate those strings
dynamically makes the code longer, introduces new error paths and in my
opinion doesn't bring any benefits.


* Re: [PATCH v4 06/25] ibtrs: client: main functionality
  2019-09-25 23:21               ` Bart Van Assche
@ 2019-09-26  9:16                 ` Danil Kipnis
  0 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-09-26  9:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Thu, Sep 26, 2019 at 1:21 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/25/19 3:53 PM, Danil Kipnis wrote:
> > Oh, you mean we just need stub functions for those, so that nobody
> > steps on a null?
>
> What I meant is that the memory that is backing a device must not be
> freed until the reference count of a device has dropped to zero. If a
> struct device is embedded in a larger structure that means signaling a
> completion from inside the release function (ibtrs_clt_dev_release())
> and not freeing the struct device memory (kfree(clt) in free_clt())
> before that completion has been triggered.

Got it, thank you. Will move free_clt into the release function.


* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-25 22:26               ` Danil Kipnis
@ 2019-09-26  9:55                 ` Roman Penyaev
  2019-09-26 15:01                   ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Roman Penyaev @ 2019-09-26  9:55 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Thu, Sep 26, 2019 at 12:26 AM Danil Kipnis
<danil.kipnis@cloud.ionos.com> wrote:
>
> On Wed, Sep 18, 2019 at 5:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
> > Combining multiple queues (a) into a single queue (b) that is smaller
> > than the combined source queues without sacrificing performance is
> > tricky. We already have one such implementation in the block layer core
> > and it took considerable time to get that implementation right. See e.g.
> > blk_mq_sched_mark_restart_hctx() and blk_mq_sched_restart().
>
> Roma, can you please estimate the performance impact in case we switch to it?

If I remember correctly, I could not reuse the whole restart machinery
from the block core because shared tags are only shared between the
hardware queues of a single tag set. IBTRS has many hardware queues
(independent RDMA connections) but only one tag set, which is equally
shared between block devices. What I dreamed about is something like
BLK_MQ_F_TAG_GLOBALLY_SHARED support in the block layer.

--
Roman


* Re: [PATCH v4 16/25] ibnbd: client: private header with client structs and functions
  2019-09-25 23:43       ` Danil Kipnis
@ 2019-09-26 10:00         ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-26 10:00 UTC (permalink / raw)
  To: Danil Kipnis, Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev

On Thu, Sep 26, 2019 at 1:43 AM Danil Kipnis
<danil.kipnis@cloud.ionos.com> wrote:
>
> On Tue, Sep 17, 2019 at 6:36 PM Jinpu Wang <jinpu.wang@cloud.ionos.com> wrote:
> >
> > On Sat, Sep 14, 2019 at 12:25 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > >
> > > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > > +     char                    pathname[NAME_MAX];
> > > [ ... ]
> > >  > +    char                    blk_symlink_name[NAME_MAX];
> > >
> > > Please allocate path names dynamically instead of hard-coding the upper
> > > length for a path.
> Those strings are used to name directories and files under sysfs,
> which I think makes NAME_MAX a natural limitation for them. Client and
> server only exchange those strings on connection establishment, not in
> the IO path. We do not really need to safe 256K on a server with 1000
> devices mapped in parallel. A patch to allocate those strings makes
> the code longer, introduces new error paths and in my opinion doesn't
> bring any benefits.
Hi Bart,

We had a draft patch, but it looked ugly for the reason Danil
mentioned, so after discussing it in house we dropped it.

Thanks,
Jinpu


* Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
  2019-09-18 21:46   ` [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev Bart Van Assche
@ 2019-09-26 14:04     ` Jinpu Wang
  2019-09-26 15:11       ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-09-26 14:04 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

Sorry for the slow reply.

On Wed, Sep 18, 2019 at 11:46 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
>
> Same comment as for a previous patch: please do not include line number
> information in pr_fmt().
Ok, will be removed.

>
> > +static int ibnbd_dev_vfs_open(struct ibnbd_dev *dev, const char *path,
> > +                           fmode_t flags)
> > +{
> > +     int oflags = O_DSYNC; /* enable write-through */
> > +
> > +     if (flags & FMODE_WRITE)
> > +             oflags |= O_RDWR;
> > +     else if (flags & FMODE_READ)
> > +             oflags |= O_RDONLY;
> > +     else
> > +             return -EINVAL;
> > +
> > +     dev->file = filp_open(path, oflags, 0);
> > +     return PTR_ERR_OR_ZERO(dev->file);
> > +}
>
> Isn't the use of O_DSYNC something that should be configurable?
I know SCST allows O_DSYNC to be configured, but in our production we
only use O_DSYNC. We can certainly add an option to make it
configurable, but we don't have a need for that yet.
>
> > +struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
> > +                              enum ibnbd_io_mode mode, struct bio_set *bs,
> > +                              ibnbd_dev_io_fn io_cb)
> > +{
> > +     struct ibnbd_dev *dev;
> > +     int ret;
> > +
> > +     dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +     if (!dev)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     if (mode == IBNBD_BLOCKIO) {
> > +             dev->blk_open_flags = flags;
> > +             ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> > +             if (ret)
> > +                     goto err;
> > +     } else if (mode == IBNBD_FILEIO) {
> > +             dev->blk_open_flags = FMODE_READ;
> > +             ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> > +             if (ret)
> > +                     goto err;
> > +
> > +             ret = ibnbd_dev_vfs_open(dev, path, flags);
> > +             if (ret)
> > +                     goto blk_put;
>
> This looks really weird. Why to call ibnbd_dev_blk_open() first for file
> I/O mode? Why to set dev->blk_open_flags to FMODE_READ in file I/O mode?

The reason is that we want to be able to create a symlink to the block
device. And in file I/O mode we only allow exporting block devices.


>
> > +static int ibnbd_dev_blk_submit_io(struct ibnbd_dev *dev, sector_t sector,
> > +                                void *data, size_t len, u32 bi_size,
> > +                                enum ibnbd_io_flags flags, short prio,
> > +                                void *priv)
> > +{
> > +     struct request_queue *q = bdev_get_queue(dev->bdev);
> > +     struct ibnbd_dev_blk_io *io;
> > +     struct bio *bio;
> > +
> > +     /* check if the buffer is suitable for bdev */
> > +     if (unlikely(WARN_ON(!blk_rq_aligned(q, (unsigned long)data, len))))
> > +             return -EINVAL;
> > +
> > +     /* Generate bio with pages pointing to the rdma buffer */
> > +     bio = ibnbd_bio_map_kern(q, data, dev->ibd_bio_set, len, GFP_KERNEL);
> > +     if (unlikely(IS_ERR(bio)))
> > +             return PTR_ERR(bio);
> > +
> > +     io = kmalloc(sizeof(*io), GFP_KERNEL);
> > +     if (unlikely(!io)) {
> > +             bio_put(bio);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     io->dev         = dev;
> > +     io->priv        = priv;
> > +
> > +     bio->bi_end_io          = ibnbd_dev_bi_end_io;
> > +     bio->bi_private         = io;
> > +     bio->bi_opf             = ibnbd_to_bio_flags(flags);
> > +     bio->bi_iter.bi_sector  = sector;
> > +     bio->bi_iter.bi_size    = bi_size;
> > +     bio_set_prio(bio, prio);
> > +     bio_set_dev(bio, dev->bdev);
> > +
> > +     submit_bio(bio);
> > +
> > +     return 0;
> > +}
>
> Can struct bio and struct ibnbd_dev_blk_io be combined into a single
> data structure by passing the size of the latter data structure as the
> front_pad argument to bioset_init()?
Thanks for the suggestion, will look into it. It looks like we can
embed struct bio into struct ibnbd_dev_blk_io.
>
> > +static void ibnbd_dev_file_submit_io_worker(struct work_struct *w)
> > +{
> > +     struct ibnbd_dev_file_io_work *dev_work;
> > +     struct file *f;
> > +     int ret, len;
> > +     loff_t off;
> > +
> > +     dev_work = container_of(w, struct ibnbd_dev_file_io_work, work);
> > +     off = dev_work->sector * ibnbd_dev_get_logical_bsize(dev_work->dev);
> > +     f = dev_work->dev->file;
> > +     len = dev_work->bi_size;
> > +
> > +     if (ibnbd_op(dev_work->flags) == IBNBD_OP_FLUSH) {
> > +             ret = ibnbd_dev_file_handle_flush(dev_work, off);
> > +             if (unlikely(ret))
> > +                     goto out;
> > +     }
> > +
> > +     if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE_SAME) {
> > +             ret = ibnbd_dev_file_handle_write_same(dev_work);
> > +             if (unlikely(ret))
> > +                     goto out;
> > +     }
> > +
> > +     /* TODO Implement support for DIRECT */
> > +     if (dev_work->bi_size) {
> > +             loff_t off_tmp = off;
> > +
> > +             if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE)
> > +                     ret = kernel_write(f, dev_work->data, dev_work->bi_size,
> > +                                        &off_tmp);
> > +             else
> > +                     ret = kernel_read(f, dev_work->data, dev_work->bi_size,
> > +                                       &off_tmp);
> > +
> > +             if (unlikely(ret < 0)) {
> > +                     goto out;
> > +             } else if (unlikely(ret != dev_work->bi_size)) {
> > +                     /* TODO implement support for partial completions */
> > +                     ret = -EIO;
> > +                     goto out;
> > +             } else {
> > +                     ret = 0;
> > +             }
> > +     }
> > +
> > +     if (dev_work->flags & IBNBD_F_FUA)
> > +             ret = ibnbd_dev_file_handle_fua(dev_work, off);
> > +out:
> > +     dev_work->dev->io_cb(dev_work->priv, ret);
> > +     kfree(dev_work);
> > +}
> > +
> > +static int ibnbd_dev_file_submit_io(struct ibnbd_dev *dev, sector_t sector,
> > +                                 void *data, size_t len, size_t bi_size,
> > +                                 enum ibnbd_io_flags flags, void *priv)
> > +{
> > +     struct ibnbd_dev_file_io_work *w;
> > +
> > +     if (!ibnbd_flags_supported(flags)) {
> > +             pr_info_ratelimited("Unsupported I/O flags: 0x%x on device "
> > +                                 "%s\n", flags, dev->name);
> > +             return -ENOTSUPP;
> > +     }
> > +
> > +     w = kmalloc(sizeof(*w), GFP_KERNEL);
> > +     if (!w)
> > +             return -ENOMEM;
> > +
> > +     w->dev          = dev;
> > +     w->priv         = priv;
> > +     w->sector       = sector;
> > +     w->data         = data;
> > +     w->len          = len;
> > +     w->bi_size      = bi_size;
> > +     w->flags        = flags;
> > +     INIT_WORK(&w->work, ibnbd_dev_file_submit_io_worker);
> > +
> > +     if (unlikely(!queue_work(fileio_wq, &w->work))) {
> > +             kfree(w);
> > +             return -EEXIST;
> > +     }
> > +
> > +     return 0;
> > +}
>
> Please use the in-kernel asynchronous I/O API instead of kernel_read()
> and kernel_write() and remove the fileio_wq workqueue. Examples of how
> to use call_read_iter() and call_write_iter() are available in the loop
> driver and also in drivers/target/target_core_file.c.
What are the benefits of using call_read_iter/call_write_iter? Does it
offer better performance?

>
> > +/** ibnbd_dev_init() - Initialize ibnbd_dev
> > + *
> > + * This function initializes the ibnbd-dev component.
> > + * It has to be called once before ibnbd_dev_open() is used
> > + */
> > +int ibnbd_dev_init(void);
>
> It is great so see kernel-doc headers above functions but I'm not sure
> these should be in .h files. I think most kernel developers prefer to
> see kernel-doc headers for functions in .c files because that makes it
> more likely that the implementation and the documentation stay in sync.
>
Ok, will move the kernel-doc to the source code.
I feel that for exported functions it's more common to do it in header files,
but for this case I think it's fine to move the kernel-doc to the .c file.

Thanks,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-26  9:55                 ` Roman Penyaev
@ 2019-09-26 15:01                   ` Bart Van Assche
  2019-09-27  8:52                     ` Roman Penyaev
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-26 15:01 UTC (permalink / raw)
  To: Roman Penyaev, Danil Kipnis
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/26/19 2:55 AM, Roman Penyaev wrote:
> As I remember correctly I could not reuse the whole machinery with those
> restarts from block core because shared tags are shared only between
> hardware queues, i.e. different hardware queues share different tags sets.
> IBTRS has many hardware queues (independent RDMA connections) but only one
> tags set, which is equally shared between block devices.  What I dreamed
> about is something like BLK_MQ_F_TAG_GLOBALLY_SHARED support in block
> layer.

A patch series that adds support for sharing tag sets across hardware 
queues is pending. See also "[PATCH V3 0/8] blk-mq & scsi: fix reply 
queue selection and improve host wide tagset" 
(https://lore.kernel.org/linux-block/20180227100750.32299-1-ming.lei@redhat.com/). 
Would that patch series allow to remove the queue management code from 
ibnbd?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
  2019-09-26 14:04     ` Jinpu Wang
@ 2019-09-26 15:11       ` Bart Van Assche
  2019-09-26 15:25         ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-26 15:11 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On 9/26/19 7:04 AM, Jinpu Wang wrote:
> On Wed, Sep 18, 2019 at 11:46 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 6/20/19 8:03 AM, Jack Wang wrote:
>> Isn't the use of O_DSYNC something that should be configurable?
> I know scst allows O_DSYNC to be configured, but in our production we
> only use it with O_DSYNC.
>   We sure can add an option to make it configurable, but we don't
> have a need for it yet.

Shouldn't upstream code be general purpose instead of only satisfying 
the need of a single user?

>>> +struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
>>> +                              enum ibnbd_io_mode mode, struct bio_set *bs,
>>> +                              ibnbd_dev_io_fn io_cb)
>>> +{
>>> +     struct ibnbd_dev *dev;
>>> +     int ret;
>>> +
>>> +     dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +     if (!dev)
>>> +             return ERR_PTR(-ENOMEM);
>>> +
>>> +     if (mode == IBNBD_BLOCKIO) {
>>> +             dev->blk_open_flags = flags;
>>> +             ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
>>> +             if (ret)
>>> +                     goto err;
>>> +     } else if (mode == IBNBD_FILEIO) {
>>> +             dev->blk_open_flags = FMODE_READ;
>>> +             ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
>>> +             if (ret)
>>> +                     goto err;
>>> +
>>> +             ret = ibnbd_dev_vfs_open(dev, path, flags);
>>> +             if (ret)
>>> +                     goto blk_put;
>>
>> This looks really weird. Why to call ibnbd_dev_blk_open() first for file
>> I/O mode? Why to set dev->blk_open_flags to FMODE_READ in file I/O mode?
> 
> The reason behind it is that we want to be able to symlink to the block device,
> and for file I/O mode we only allow exporting block devices.

This sounds weird to me ...

>> Please use the in-kernel asynchronous I/O API instead of kernel_read()
>> and kernel_write() and remove the fileio_wq workqueue. Examples of how
>> to use call_read_iter() and call_write_iter() are available in the loop
>> driver and also in drivers/target/target_core_file.c.
>
> What are the benefits of using call_read_iter/call_write_iter? Does it
> offer better performance?

The benefits of using in-kernel asynchronous I/O I know of are:
* Better performance due to fewer context switches. For the posted code 
as many kernel threads will be active as the queue depth. So more 
context switches will be triggered than necessary.
* Removal the file I/O workqueue and hence a reduction of the number of 
kernel threads.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
  2019-09-26 15:11       ` Bart Van Assche
@ 2019-09-26 15:25         ` Danil Kipnis
  2019-09-26 15:29           ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-26 15:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jinpu Wang, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev

On Thu, Sep 26, 2019 at 5:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>> +struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
> >>> +                              enum ibnbd_io_mode mode, struct bio_set *bs,
> >>> +                              ibnbd_dev_io_fn io_cb)
> >>> +{
> >>> +     struct ibnbd_dev *dev;
> >>> +     int ret;
> >>> +
> >>> +     dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> >>> +     if (!dev)
> >>> +             return ERR_PTR(-ENOMEM);
> >>> +
> >>> +     if (mode == IBNBD_BLOCKIO) {
> >>> +             dev->blk_open_flags = flags;
> >>> +             ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> >>> +             if (ret)
> >>> +                     goto err;
> >>> +     } else if (mode == IBNBD_FILEIO) {
> >>> +             dev->blk_open_flags = FMODE_READ;
> >>> +             ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
> >>> +             if (ret)
> >>> +                     goto err;
> >>> +
> >>> +             ret = ibnbd_dev_vfs_open(dev, path, flags);
> >>> +             if (ret)
> >>> +                     goto blk_put;
> >>
> >> This looks really weird. Why to call ibnbd_dev_blk_open() first for file
> >> I/O mode? Why to set dev->blk_open_flags to FMODE_READ in file I/O mode?

Bart, would it in your opinion be OK to drop the file_io support in
IBNBD entirely? We implemented this feature in the beginning of the
project to see whether it could be beneficial in some use cases, but
never actually found any.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
  2019-09-26 15:25         ` Danil Kipnis
@ 2019-09-26 15:29           ` Bart Van Assche
  2019-09-26 15:38             ` Danil Kipnis
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-26 15:29 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jinpu Wang, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev

On 9/26/19 8:25 AM, Danil Kipnis wrote:
> On Thu, Sep 26, 2019 at 5:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>> This looks really weird. Why to call ibnbd_dev_blk_open() first for file
>>>> I/O mode? Why to set dev->blk_open_flags to FMODE_READ in file I/O mode?
> 
> Bart, would it in your opinion be OK to drop the file_io support in
> IBNBD entirely? We implemented this feature in the beginning of the
> project to see whether it could be beneficial in some use cases, but
> never actually found any.

I think that's reasonable since the loop driver can be used to convert a 
file into a block device.

Bart.



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
  2019-09-26 15:29           ` Bart Van Assche
@ 2019-09-26 15:38             ` Danil Kipnis
  2019-09-26 15:42               ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-26 15:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jinpu Wang, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev

> > Bart, would it in your opinion be OK to drop the file_io support in
> > IBNBD entirely? We implemented this feature in the beginning of the
> > project to see whether it could be beneficial in some use cases, but
> > never actually found any.
>
> I think that's reasonable since the loop driver can be used to convert a
> file into a block device.
Jack, shall we drop it?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev
  2019-09-26 15:38             ` Danil Kipnis
@ 2019-09-26 15:42               ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-26 15:42 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev

On Thu, Sep 26, 2019 at 5:38 PM Danil Kipnis
<danil.kipnis@cloud.ionos.com> wrote:
>
> > > Bart, would it in your opinion be OK to drop the file_io support in
> > > IBNBD entirely? We implemented this feature in the beginning of the
> > > project to see whether it could be beneficial in some use cases, but
> > > never actually found any.
> >
> > I think that's reasonable since the loop driver can be used to convert a
> > file into a block device.
> Jack, shall we drop it?

Yes, we should drop it in next round.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-26 15:01                   ` Bart Van Assche
@ 2019-09-27  8:52                     ` Roman Penyaev
  2019-09-27  9:32                       ` Danil Kipnis
  2019-09-27 16:37                       ` Bart Van Assche
  0 siblings, 2 replies; 123+ messages in thread
From: Roman Penyaev @ 2019-09-27  8:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Thu, Sep 26, 2019 at 5:01 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/26/19 2:55 AM, Roman Penyaev wrote:
> > As I remember correctly I could not reuse the whole machinery with those
> > restarts from block core because shared tags are shared only between
> > hardware queues, i.e. different hardware queues share different tags sets.
> > IBTRS has many hardware queues (independent RDMA connections) but only one
> > tags set, which is equally shared between block devices.  What I dreamed
> > about is something like BLK_MQ_F_TAG_GLOBALLY_SHARED support in block
> > layer.
>
> A patch series that adds support for sharing tag sets across hardware
> queues is pending. See also "[PATCH V3 0/8] blk-mq & scsi: fix reply
> queue selection and improve host wide tagset"
> (https://lore.kernel.org/linux-block/20180227100750.32299-1-ming.lei@redhat.com/).
> Would that patch series allow to remove the queue management code from
> ibnbd?

Hi Bart,

No, it seems this thingy is a bit different.  According to my
understanding, patches 3 and 4 from this patchset do the
following: 1# split the whole queue depth equally over the number
of hardware queues and 2# return a tag number which is unique
host-wide (more or less similar to unique_tag, right?).

2# is not needed for ibtrs, and 1# can be easily done by dividing
queue_depth by the number of hw queues on tag set allocation, e.g.
something like the following:

    ...
    tags->nr_hw_queues = num_online_cpus();
    tags->queue_depth  = sess->queue_depth / tags->nr_hw_queues;

    blk_mq_alloc_tag_set(tags);


And this trick won't work out performance-wise.  The ibtrs client
has a single resource: the set of buffer chunks received from the
server side.  These buffers should be dynamically distributed
between IO producers according to the load.  With a hard split
of the whole queue depth between hw queues we can forget about
dynamic load distribution; here is an example:

   - say the server shares 1024 buffer chunks for a session (I do
     not remember the actual number).

   - the 1024 buffers are equally divided between the hw queues,
     let's say 64 (number of CPUs), so each queue is 16 requests deep.

   - only several CPUs produce IO, and instead of occupying the
     whole "bandwidth" of a session, i.e. 1024 buffer chunks,
     we limit ourselves to the small queue depth of each hw
     queue.

And performance drops significantly when the number of IO producers
is smaller than the number of hw queues (CPUs); this can be easily
tested and proved.

So for this particular ibtrs case tags should be globally shared,
and (unfortunately) there seem to be no other similar requirements
for other block devices.

--
Roman

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers
  2019-09-23 22:50   ` [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers Bart Van Assche
  2019-09-25 21:45     ` Danil Kipnis
@ 2019-09-27  8:56     ` Jinpu Wang
  1 sibling, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-27  8:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Tue, Sep 24, 2019 at 12:50 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +#define P1 )
> > +#define P2 ))
> > +#define P3 )))
> > +#define P4 ))))
> > +#define P(N) P ## N
> > +
> > +#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
> > +#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
> > +
> > +#define LIST(...)                                            \
> > +     __VA_ARGS__,                                            \
> > +     ({ unknown_type(); NULL; })                             \
> > +     CAT(P, COUNT_ARGS(__VA_ARGS__))                         \
> > +
> > +#define EMPTY()
> > +#define DEFER(id) id EMPTY()
> > +
> > +#define _CASE(obj, type, member)                             \
> > +     __builtin_choose_expr(                                  \
> > +     __builtin_types_compatible_p(                           \
> > +             typeof(obj), type),                             \
> > +             ((type)obj)->member
> > +#define CASE(o, t, m) DEFER(_CASE)(o, t, m)
> > +
> > +/*
> > + * Below we define retrieving of sessname from common IBTRS types.
> > + * Client or server related types have to be defined by special
> > + * TYPES_TO_SESSNAME macro.
> > + */
> > +
> > +void unknown_type(void);
> > +
> > +#ifndef TYPES_TO_SESSNAME
> > +#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
> > +#endif
> > +
> > +#define ibtrs_prefix(obj)                                    \
> > +     _CASE(obj, struct ibtrs_con *,  sess->sessname),        \
> > +     _CASE(obj, struct ibtrs_sess *, sessname),              \
> > +     TYPES_TO_SESSNAME(obj)                                  \
> > +     ))
>
> No preprocessor voodoo please. Please remove all of the above and modify
> the logging statements such that these pass the proper name string as
> first argument to logging macros.
Sure, will do.
>
> > +struct ibtrs_msg_conn_req {
> > +     u8              __cma_version; /* Is set to 0 by cma.c in case of
> > +                                     * AF_IB, do not touch that. */
> > +     u8              __ip_version;  /* On sender side that should be
> > +                                     * set to 0, or cma_save_ip_info()
> > +                                     * extract garbage and will fail. */
> > +     __le16          magic;
> > +     __le16          version;
> > +     __le16          cid;
> > +     __le16          cid_num;
> > +     __le16          recon_cnt;
> > +     uuid_t          sess_uuid;
> > +     uuid_t          paths_uuid;
> > +     u8              reserved[12];
> > +};
>
> Please remove the reserved[] array and check private_data_len in the
> code that receives the login request.
We already check the private_data_len on the server side, see ibtrs_rdma_connect,
and keeping some reserved fields for the future seems to be common practice
for a protocol, IMO.
Also, since we are already running the code in production, we want to
keep the protocol compatible, so a future transition can be smooth.
>
> > +/**
> > + * struct ibtrs_msg_conn_rsp - Server connection response to the client
> > + * @magic:      IBTRS magic
> > + * @version:    IBTRS protocol version
> > + * @errno:      If rdma_accept() then 0, if rdma_reject() indicates error
> > + * @queue_depth:   max inflight messages (queue-depth) in this session
> > + * @max_io_size:   max io size server supports
> > + * @max_hdr_size:  max msg header size server supports
> > + *
> > + * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
> > + */
> > +struct ibtrs_msg_conn_rsp {
> > +     __le16          magic;
> > +     __le16          version;
> > +     __le16          errno;
> > +     __le16          queue_depth;
> > +     __le32          max_io_size;
> > +     __le32          max_hdr_size;
> > +     u8              reserved[40];
> > +};
>
> Same comment here: please remove the reserved[] array and check
> private_data_len in the code that processes this data structure.
Ditto.
>
> > +static inline int sockaddr_cmp(const struct sockaddr *a,
> > +                            const struct sockaddr *b)
> > +{
> > +     switch (a->sa_family) {
> > +     case AF_IB:
> > +             return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
> > +                           &((struct sockaddr_ib *)b)->sib_addr,
> > +                           sizeof(struct ib_addr));
> > +     case AF_INET:
> > +             return memcmp(&((struct sockaddr_in *)a)->sin_addr,
> > +                           &((struct sockaddr_in *)b)->sin_addr,
> > +                           sizeof(struct in_addr));
> > +     case AF_INET6:
> > +             return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
> > +                           &((struct sockaddr_in6 *)b)->sin6_addr,
> > +                           sizeof(struct in6_addr));
> > +     default:
> > +             return -ENOENT;
> > +     }
> > +}
> > +
> > +static inline int sockaddr_to_str(const struct sockaddr *addr,
> > +                                char *buf, size_t len)
> > +{
> > +     int cnt;
> > +
> > +     switch (addr->sa_family) {
> > +     case AF_IB:
> > +             cnt = scnprintf(buf, len, "gid:%pI6",
> > +                     &((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
> > +             return cnt;
> > +     case AF_INET:
> > +             cnt = scnprintf(buf, len, "ip:%pI4",
> > +                     &((struct sockaddr_in *)addr)->sin_addr);
> > +             return cnt;
> > +     case AF_INET6:
> > +             cnt = scnprintf(buf, len, "ip:%pI6c",
> > +                       &((struct sockaddr_in6 *)addr)->sin6_addr);
> > +             return cnt;
> > +     }
> > +     cnt = scnprintf(buf, len, "<invalid address family>");
> > +     pr_err("Invalid address family\n");
> > +     return cnt;
> > +}
>
> Since these functions are not in the hot path, please move these into a
> .c file.
ok.
>
> > +/**
> > + * ibtrs_invalidate_flag() - returns proper flags for invalidation
> > + *
> > + * NOTE: This function is needed for compat layer, so think twice before
> > + *       rename or remove.
> > + */
> > +static inline u32 ibtrs_invalidate_flag(void)
> > +{
> > +     return IBTRS_MSG_NEED_INVAL_F;
> > +}
>
> An inline function that does nothing else than returning a compile-time
> constant? That does not look useful to me. How about inlining this function?
This is needed for the compat layer; we redefine some FR functions to
use FMR for our ConnectX-2/X-3 HCAs.
https://github.com/ionos-enterprise/ibnbd/tree/master/ibtrs/compat
It will finally fade out, but it will take time.


Thanks,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-27  8:52                     ` Roman Penyaev
@ 2019-09-27  9:32                       ` Danil Kipnis
  2019-09-27 12:18                         ` Danil Kipnis
  2019-09-27 16:37                       ` Bart Van Assche
  1 sibling, 1 reply; 123+ messages in thread
From: Danil Kipnis @ 2019-09-27  9:32 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Fri, Sep 27, 2019 at 10:52 AM Roman Penyaev <r.peniaev@gmail.com> wrote:
>
> No, it seems this thingy is a bit different.  According to my
> understanding, patches 3 and 4 from this patchset do the
> following: 1# split the whole queue depth equally over the number
> of hardware queues and 2# return a tag number which is unique
> host-wide (more or less similar to unique_tag, right?).
>
> 2# is not needed for ibtrs, and 1# can be easily done by dividing
> queue_depth by the number of hw queues on tag set allocation, e.g.
> something like the following:
>
>     ...
>     tags->nr_hw_queues = num_online_cpus();
>     tags->queue_depth  = sess->queue_depth / tags->nr_hw_queues;
>
>     blk_mq_alloc_tag_set(tags);
>
>
> And this trick won't work out performance-wise.  The ibtrs client
> has a single resource: the set of buffer chunks received from the
> server side.  These buffers should be dynamically distributed
> between IO producers according to the load.  With a hard split
> of the whole queue depth between hw queues we can forget about
> dynamic load distribution; here is an example:
>
>    - say the server shares 1024 buffer chunks for a session (I do
>      not remember the actual number).
>
>    - the 1024 buffers are equally divided between the hw queues,
>      let's say 64 (number of CPUs), so each queue is 16 requests deep.
>
>    - only several CPUs produce IO, and instead of occupying the
>      whole "bandwidth" of a session, i.e. 1024 buffer chunks,
>      we limit ourselves to the small queue depth of each hw
>      queue.
>
> And performance drops significantly when the number of IO producers
> is smaller than the number of hw queues (CPUs); this can be easily
> tested and proved.
>
> So for this particular ibtrs case tags should be globally shared,
> and (unfortunately) there seem to be no other similar requirements
> for other block devices.
I don't see any difference between what you describe here and 100 dm
volumes sitting on top of a single NVMe device.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 04/25] ibtrs: core: lib functions shared between client and server modules
  2019-09-23 23:03   ` [PATCH v4 04/25] ibtrs: core: lib functions shared between client and server modules Bart Van Assche
@ 2019-09-27 10:13     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-27 10:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Sep 24, 2019 at 1:03 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +static int ibtrs_str_gid_to_sockaddr(const char *addr, size_t len,
> > +                                  short port, struct sockaddr_storage *dst)
> > +{
> > +     struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
> > +     int ret;
> > +
> > +     /*
> > +      * We can use some of the I6 functions since GID is a valid
> > +      * IPv6 address format
> > +      */
> > +     ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
> > +     if (ret == 0)
> > +             return -EINVAL;
> > +
> > +     dst_ib->sib_family = AF_IB;
> > +     /*
> > +      * Use the same TCP server port number as the IB service ID
> > +      * on the IB port space range
> > +      */
> > +     dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | port);
> > +     dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
> > +     dst_ib->sib_pkey = cpu_to_be16(0xffff);
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * ibtrs_str_to_sockaddr() - Convert ibtrs address string to sockaddr
> > + * @addr     String representation of an addr (IPv4, IPv6 or IB GID):
> > + *              - "ip:192.168.1.1"
> > + *              - "ip:fe80::200:5aee:feaa:20a2"
> > + *              - "gid:fe80::200:5aee:feaa:20a2"
> > + * @len         String address length
> > + * @port     Destination port
> > + * @dst              Destination sockaddr structure
> > + *
> > + * Returns 0 if conversion successful. Non-zero on error.
> > + */
> > +static int ibtrs_str_to_sockaddr(const char *addr, size_t len,
> > +                              short port, struct sockaddr_storage *dst)
> > +{
> > +     if (strncmp(addr, "gid:", 4) == 0) {
> > +             return ibtrs_str_gid_to_sockaddr(addr + 4, len - 4, port, dst);
> > +     } else if (strncmp(addr, "ip:", 3) == 0) {
> > +             char port_str[8];
> > +             char *cpy;
> > +             int err;
> > +
> > +             snprintf(port_str, sizeof(port_str), "%u", port);
> > +             cpy = kstrndup(addr + 3, len - 3, GFP_KERNEL);
> > +             err = cpy ? inet_pton_with_scope(&init_net, AF_UNSPEC,
> > +                                              cpy, port_str, dst) : -ENOMEM;
> > +             kfree(cpy);
> > +
> > +             return err;
> > +     }
> > +     return -EPROTONOSUPPORT;
> > +}
>
> A considerable amount of code is required to support the IB/CM. Does
> supporting the IB/CM add any value? If that code would be left out,
> would anything break? Is it really useful to support IB networks where
> no IP address has been assigned to each IB port?

We had quite some problems with IPoIB in the past, especially with
neighbor discovery; from time to time we encountered that some IPs
were not reachable from other hosts.

That's why we want to have AF_IB support, which doesn't rely on IPoIB.

Thanks,
Jinpu

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH v4 05/25] ibtrs: client: private header with client structs and functions
  2019-09-23 23:05   ` [PATCH v4 05/25] ibtrs: client: private header with client structs and functions Bart Van Assche
@ 2019-09-27 10:18     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-27 10:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Tue, Sep 24, 2019 at 1:05 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +static inline const char *ibtrs_clt_state_str(enum ibtrs_clt_state state)
> > +{
> > +     switch (state) {
> > +     case IBTRS_CLT_CONNECTING:
> > +             return "IBTRS_CLT_CONNECTING";
> > +     case IBTRS_CLT_CONNECTING_ERR:
> > +             return "IBTRS_CLT_CONNECTING_ERR";
> > +     case IBTRS_CLT_RECONNECTING:
> > +             return "IBTRS_CLT_RECONNECTING";
> > +     case IBTRS_CLT_CONNECTED:
> > +             return "IBTRS_CLT_CONNECTED";
> > +     case IBTRS_CLT_CLOSING:
> > +             return "IBTRS_CLT_CLOSING";
> > +     case IBTRS_CLT_CLOSED:
> > +             return "IBTRS_CLT_CLOSED";
> > +     case IBTRS_CLT_DEAD:
> > +             return "IBTRS_CLT_DEAD";
> > +     default:
> > +             return "UNKNOWN";
> > +     }
> > +}
>
> Since this code is not in the hot path, please move it from a .h into a
> .c file.
ok.
>
> > +static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
> > +{
> > +     return container_of(c, struct ibtrs_clt_con, c);
> > +}
> > +
> > +static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
> > +{
> > +     return container_of(s, struct ibtrs_clt_sess, s);
> > +}
>
> Is it really useful to define functions for these conversions? Has it
> been considered to inline these functions?
We use them in quite a few places; they do make the code shorter.


Thanks
Jinpu Wang

* Re: [PATCH v4 07/25] ibtrs: client: statistics functions
  2019-09-23 23:15   ` [PATCH v4 07/25] ibtrs: client: statistics functions Bart Van Assche
@ 2019-09-27 12:00     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-27 12:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Sep 24, 2019 at 1:15 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *stats, bool read,
> > +                            unsigned long ms)
> > +{
> > +     struct ibtrs_clt_stats_pcpu *s;
> > +     int id;
> > +
> > +     id = ibtrs_clt_ms_to_id(ms);
> > +     s = this_cpu_ptr(stats->pcpu_stats);
> > +     if (read) {
> > +             s->rdma_lat_distr[id].read++;
> > +             if (s->rdma_lat_max.read < ms)
> > +                     s->rdma_lat_max.read = ms;
> > +     } else {
> > +             s->rdma_lat_distr[id].write++;
> > +             if (s->rdma_lat_max.write < ms)
> > +                     s->rdma_lat_max.write = ms;
> > +     }
> > +}
>
> Can it happen that this function is called simultaneously from thread
> context and from interrupt context?
This can't happen: we only call the function from complete_rdma_req, and
complete_rdma_req is called from the CQ callback, except in
fail_all_outstanding_reqs. The CQ callback context is softirq and
fail_all_outstanding_reqs runs in process context, but we disconnect
and drain the QP before calling into fail_all_outstanding_reqs.

>
> > +void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con)
> > +{
> > +     struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> > +     struct ibtrs_clt_stats *stats = &sess->stats;
> > +     struct ibtrs_clt_stats_pcpu *s;
> > +     int cpu;
> > +
> > +     cpu = raw_smp_processor_id();
> > +     s = this_cpu_ptr(stats->pcpu_stats);
> > +     s->wc_comp.cnt++;
> > +     s->wc_comp.total_cnt++;
> > +     if (unlikely(con->cpu != cpu)) {
> > +             s->cpu_migr.to++;
> > +
> > +             /* Careful here, override s pointer */
> > +             s = per_cpu_ptr(stats->pcpu_stats, con->cpu);
> > +             atomic_inc(&s->cpu_migr.from);
> > +     }
> > +}
>
> Same question here.
This function is only called from the CQ done callback.
>
> > +void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *stats)
> > +{
> > +     struct ibtrs_clt_stats_pcpu *s;
> > +
> > +     s = this_cpu_ptr(stats->pcpu_stats);
> > +     s->rdma.failover_cnt++;
> > +}
>
> And here ...
This function is only called from process context.

>
> Thanks,
>
> Bart.
Thanks,
Jinpu

* Re: [PATCH v4 09/25] ibtrs: server: private header with server structs and functions
  2019-09-23 23:21   ` [PATCH v4 09/25] ibtrs: server: private header with server structs and functions Bart Van Assche
@ 2019-09-27 12:04     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-27 12:04 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Tue, Sep 24, 2019 at 1:21 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +static inline const char *ibtrs_srv_state_str(enum ibtrs_srv_state state)
> > +{
> > +     switch (state) {
> > +     case IBTRS_SRV_CONNECTING:
> > +             return "IBTRS_SRV_CONNECTING";
> > +     case IBTRS_SRV_CONNECTED:
> > +             return "IBTRS_SRV_CONNECTED";
> > +     case IBTRS_SRV_CLOSING:
> > +             return "IBTRS_SRV_CLOSING";
> > +     case IBTRS_SRV_CLOSED:
> > +             return "IBTRS_SRV_CLOSED";
> > +     default:
> > +             return "UNKNOWN";
> > +     }
> > +}
>
> Since this function is not in the hot path, please move it into a .c file.
Ok.
>
> > +/* See ibtrs-log.h */
> > +#define TYPES_TO_SESSNAME(obj)                                               \
> > +     LIST(CASE(obj, struct ibtrs_srv_sess *, s.sessname))
>
> Please remove this macro and pass 'sessname' explicitly to logging
> functions.
Ok.
>
> Thanks,
>
> Bart.
Thanks!

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-27  9:32                       ` Danil Kipnis
@ 2019-09-27 12:18                         ` Danil Kipnis
  0 siblings, 0 replies; 123+ messages in thread
From: Danil Kipnis @ 2019-09-27 12:18 UTC (permalink / raw)
  To: Roman Penyaev, Christoph Hellwig
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Sagi Grimberg, Jason Gunthorpe, Doug Ledford, rpenyaev,
	Jack Wang


On 27.09.19 11:32, Danil Kipnis wrote:
> On Fri, Sep 27, 2019 at 10:52 AM Roman Penyaev <r.peniaev@gmail.com> wrote:
>> No, it seems this thingy is a bit different.  According to my
>> understanding patches 3 and 4 from this patchset do the
>> following: 1# split equally the whole queue depth on number
>> of hardware queues and 2# return tag number which is unique
>> host-wide (more or less similar to unique_tag, right?).
>>
>> 2# is not needed for ibtrs, and 1# can be easy done by dividing
>> queue_depth on number of hw queues on tag set allocation, e.g.
>> something like the following:
>>
>>      ...
>>      tags->nr_hw_queues = num_online_cpus();
>>      tags->queue_depth  = sess->queue_deph / tags->nr_hw_queues;
>>
>>      blk_mq_alloc_tag_set(tags);
>>
>>
>> And this trick won't work out for the performance.  ibtrs client
>> has a single resource: set of buffer chunks received from a
>> server side.  And these buffers should be dynamically distributed
>> between IO producers according to the load.  Having a hard split
>> of the whole queue depth between hw queues we can forget about a
>> dynamic load distribution, here is an example:
>>
>>     - say server shares 1024 buffer chunks for a session (do not
>>       remember what is the actual number).
>>
>>     - 1024 buffers are equally divided between hw queues, let's
>>       say 64 (number of cpus), so each queue is 16 requests depth.
>>
>>     - only several CPUs produce IO, and instead of occupying the
>>       whole "bandwidth" of a session, i.e. 1024 buffer chunks,
>>       we limit ourselves to a small queue depth of an each hw
>>       queue.
>>
>> And performance drops significantly when number of IO producers
>> is smaller than number of hw queues (CPUs), and it can be easily
>> tested and proved.
>>
>> So for this particular ibtrs case tags should be globally shared,
>> and seems (unfortunately) there is no any other similar requirements
>> for other block devices.
> I don't see any difference between what you describe here and 100 dm
> volumes sitting on top of a single NVME device.

Hello Christoph,

am I wrong?

Thank you,

Danil.


* Re: [PATCH v4 10/25] ibtrs: server: main functionality
  2019-09-23 23:49   ` [PATCH v4 10/25] ibtrs: server: main functionality Bart Van Assche
@ 2019-09-27 15:03     ` Jinpu Wang
  2019-09-27 15:11       ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-09-27 15:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Tue, Sep 24, 2019 at 1:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +module_param_named(max_chunk_size, max_chunk_size, int, 0444);
> > +MODULE_PARM_DESC(max_chunk_size,
> > +              "Max size for each IO request, when change the unit is in byte"
> > +              " (default: " __stringify(DEFAULT_MAX_CHUNK_SIZE_KB) "KB)");
>
> Where can I find the definition of DEFAULT_MAX_CHUNK_SIZE_KB?
Oh, it's a typo; it should be DEFAULT_MAX_CHUNK_SIZE.
>
> > +static char cq_affinity_list[256] = "";
>
> No empty initializers for file-scope variables please.
Is it guaranteed by the compiler that file-scope variables will be
zero-initialized?
>
> > +     pr_info("cq_affinity_list changed to %*pbl\n",
> > +             cpumask_pr_args(&cq_affinity_mask));
>
> Should this pr_info() call perhaps be changed into pr_debug()?
Because changing this setting could lead to a performance drop, pr_info
seems more appropriate.

>
> > +static bool __ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
> > +                                  enum ibtrs_srv_state new_state)
> > +{
> > +     enum ibtrs_srv_state old_state;
> > +     bool changed = false;
> > +
> > +     old_state = sess->state;
> > +     switch (new_state) {
>
> Please add a lockdep_assert_held() statement that checks whether calls
> of this function are serialized properly.
will look into it.
>
> > +/**
> > + * rdma_write_sg() - response on successful READ request
> > + */
> > +static int rdma_write_sg(struct ibtrs_srv_op *id)
> > +{
> > +     struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
> > +     dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
> > +     struct ibtrs_srv *srv = sess->srv;
> > +     struct ib_send_wr inv_wr, imm_wr;
> > +     struct ib_rdma_wr *wr = NULL;
> > +     const struct ib_send_wr *bad_wr;
> > +     enum ib_send_flags flags;
> > +     size_t sg_cnt;
> > +     int err, i, offset;
> > +     bool need_inval;
> > +     u32 rkey = 0;
> > +
> > +     sg_cnt = le16_to_cpu(id->rd_msg->sg_cnt);
> > +     need_inval = le16_to_cpu(id->rd_msg->flags) & IBTRS_MSG_NEED_INVAL_F;
> > +     if (unlikely(!sg_cnt))
> > +             return -EINVAL;
> > +
> > +     offset = 0;
> > +     for (i = 0; i < sg_cnt; i++) {
> > +             struct ib_sge *list;
> > +
> > +             wr              = &id->tx_wr[i];
> > +             list            = &id->tx_sg[i];
> > +             list->addr      = dma_addr + offset;
> > +             list->length    = le32_to_cpu(id->rd_msg->desc[i].len);
> > +
> > +             /* WR will fail with length error
> > +              * if this is 0
> > +              */
> > +             if (unlikely(list->length == 0)) {
> > +                     ibtrs_err(sess, "Invalid RDMA-Write sg list length 0\n");
> > +                     return -EINVAL;
> > +             }
> > +
> > +             list->lkey = sess->s.dev->ib_pd->local_dma_lkey;
> > +             offset += list->length;
> > +
> > +             wr->wr.wr_cqe   = &io_comp_cqe;
> > +             wr->wr.sg_list  = list;
> > +             wr->wr.num_sge  = 1;
> > +             wr->remote_addr = le64_to_cpu(id->rd_msg->desc[i].addr);
> > +             wr->rkey        = le32_to_cpu(id->rd_msg->desc[i].key);
> > +             if (rkey == 0)
> > +                     rkey = wr->rkey;
> > +             else
> > +                     /* Only one key is actually used */
> > +                     WARN_ON_ONCE(rkey != wr->rkey);
> > +
> > +             if (i < (sg_cnt - 1))
> > +                     wr->wr.next = &id->tx_wr[i + 1].wr;
> > +             else if (need_inval)
> > +                     wr->wr.next = &inv_wr;
> > +             else
> > +                     wr->wr.next = &imm_wr;
> > +
> > +             wr->wr.opcode = IB_WR_RDMA_WRITE;
> > +             wr->wr.ex.imm_data = 0;
> > +             wr->wr.send_flags  = 0;
> > +     }
> > +     /*
> > +      * From time to time we have to post signalled sends,
> > +      * or send queue will fill up and only QP reset can help.
> > +      */
> > +     flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
> > +                     0 : IB_SEND_SIGNALED;
> > +
> > +     if (need_inval) {
> > +             inv_wr.next = &imm_wr;
> > +             inv_wr.wr_cqe = &io_comp_cqe;
> > +             inv_wr.sg_list = NULL;
> > +             inv_wr.num_sge = 0;
> > +             inv_wr.opcode = IB_WR_SEND_WITH_INV;
> > +             inv_wr.send_flags = 0;
> > +             inv_wr.ex.invalidate_rkey = rkey;
> > +     }
> > +     imm_wr.next = NULL;
> > +     imm_wr.wr_cqe = &io_comp_cqe;
> > +     imm_wr.sg_list = NULL;
> > +     imm_wr.num_sge = 0;
> > +     imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
> > +     imm_wr.send_flags = flags;
> > +     imm_wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id,
> > +                                                          0, need_inval));
> > +
> > +     ib_dma_sync_single_for_device(sess->s.dev->ib_dev, dma_addr,
> > +                                   offset, DMA_BIDIRECTIONAL);
> > +
> > +     err = ib_post_send(id->con->c.qp, &id->tx_wr[0].wr, &bad_wr);
> > +     if (unlikely(err))
> > +             ibtrs_err(sess,
> > +                       "Posting RDMA-Write-Request to QP failed, err: %d\n",
> > +                       err);
> > +
> > +     return err;
> > +}
>
> All other RDMA server implementations use rdma_rw_ctx_init() and
> rdma_rw_ctx_wrs(). Please use these functions in IBTRS too.
The rdma_rw_ctx_* API doesn't support RDMA_WRITE_WITH_IMM, and
ibtrs mainly uses RDMA_WRITE_WITH_IMM.

>
> > +static void ibtrs_srv_hb_err_handler(struct ibtrs_con *c, int err)
> > +{
> > +     (void)err;
> > +     close_sess(to_srv_sess(c->sess));
> > +}
>
> Is the (void)err statement really necessary?
No, will be removed.
>
> > +static int ibtrs_srv_rdma_init(struct ibtrs_srv_ctx *ctx, unsigned int port)
> > +{
> > +     struct sockaddr_in6 sin = {
> > +             .sin6_family    = AF_INET6,
> > +             .sin6_addr      = IN6ADDR_ANY_INIT,
> > +             .sin6_port      = htons(port),
> > +     };
> > +     struct sockaddr_ib sib = {
> > +             .sib_family                     = AF_IB,
> > +             .sib_addr.sib_subnet_prefix     = 0ULL,
> > +             .sib_addr.sib_interface_id      = 0ULL,
> > +             .sib_sid        = cpu_to_be64(RDMA_IB_IP_PS_IB | port),
> > +             .sib_sid_mask   = cpu_to_be64(0xffffffffffffffffULL),
> > +             .sib_pkey       = cpu_to_be16(0xffff),
> > +     };
> > +     struct rdma_cm_id *cm_ip, *cm_ib;
> > +     int ret;
> > +
> > +     /*
> > +      * We accept both IPoIB and IB connections, so we need to keep
> > +      * two cm id's, one for each socket type and port space.
> > +      * If the cm initialization of one of the id's fails, we abort
> > +      * everything.
> > +      */
> > +     cm_ip = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sin, RDMA_PS_TCP);
> > +     if (unlikely(IS_ERR(cm_ip)))
> > +             return PTR_ERR(cm_ip);
> > +
> > +     cm_ib = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sib, RDMA_PS_IB);
> > +     if (unlikely(IS_ERR(cm_ib))) {
> > +             ret = PTR_ERR(cm_ib);
> > +             goto free_cm_ip;
> > +     }
> > +
> > +     ctx->cm_id_ip = cm_ip;
> > +     ctx->cm_id_ib = cm_ib;
> > +
> > +     return 0;
> > +
> > +free_cm_ip:
> > +     rdma_destroy_id(cm_ip);
> > +
> > +     return ret;
> > +}
>
> Will the above work if CONFIG_IPV6=n?
I tested with CONFIG_IPV6=n; it compiles.
>
> > +static int __init ibtrs_server_init(void)
> > +{
> > +     int err;
> > +
> > +     if (!strlen(cq_affinity_list))
> > +             init_cq_affinity();
>
> Is the above if-test useful? Can that if-test be left out?
You're right, will remove.
>
> Thanks,
>
> Bart.
Thanks!

* Re: [PATCH v4 10/25] ibtrs: server: main functionality
  2019-09-27 15:03     ` Jinpu Wang
@ 2019-09-27 15:11       ` Bart Van Assche
  2019-09-27 15:19         ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-27 15:11 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On 9/27/19 8:03 AM, Jinpu Wang wrote:
> On Tue, Sep 24, 2019 at 1:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 6/20/19 8:03 AM, Jack Wang wrote:
>>> +static char cq_affinity_list[256] = "";
>>
>> No empty initializers for file-scope variables please.
 >
> Is it guaranteed by the compiler, the file-scope variables will be
> empty initialized?

That is guaranteed by the C standard. See also 
https://stackoverflow.com/questions/3373108/why-are-static-variables-auto-initialized-to-zero.

Bart.

* Re: [PATCH v4 10/25] ibtrs: server: main functionality
  2019-09-27 15:11       ` Bart Van Assche
@ 2019-09-27 15:19         ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-09-27 15:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Fri, Sep 27, 2019 at 5:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/27/19 8:03 AM, Jinpu Wang wrote:
> > On Tue, Sep 24, 2019 at 1:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >> On 6/20/19 8:03 AM, Jack Wang wrote:
> >>> +static char cq_affinity_list[256] = "";
> >>
> >> No empty initializers for file-scope variables please.
>  >
> > Is it guaranteed by the compiler, the file-scope variables will be
> > empty initialized?
>
> That is guaranteed by the C standard. See also
> https://stackoverflow.com/questions/3373108/why-are-static-variables-auto-initialized-to-zero.
>
> Bart.
Thanks, will remove the initializer.

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-27  8:52                     ` Roman Penyaev
  2019-09-27  9:32                       ` Danil Kipnis
@ 2019-09-27 16:37                       ` Bart Van Assche
  2019-09-27 16:50                         ` Roman Penyaev
  1 sibling, 1 reply; 123+ messages in thread
From: Bart Van Assche @ 2019-09-27 16:37 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/27/19 1:52 AM, Roman Penyaev wrote:
> No, it seems this thingy is a bit different.  According to my
> understanding patches 3 and 4 from this patchset do the
> following: 1# split equally the whole queue depth on number
> of hardware queues and 2# return tag number which is unique
> host-wide (more or less similar to unique_tag, right?).
> 
> 2# is not needed for ibtrs, and 1# can be easy done by dividing
> queue_depth on number of hw queues on tag set allocation, e.g.
> something like the following:
> 
>      ...
>      tags->nr_hw_queues = num_online_cpus();
>      tags->queue_depth  = sess->queue_deph / tags->nr_hw_queues;
> 
>      blk_mq_alloc_tag_set(tags);
> 
> 
> And this trick won't work out for the performance.  ibtrs client
> has a single resource: set of buffer chunks received from a
> server side.  And these buffers should be dynamically distributed
> between IO producers according to the load.  Having a hard split
> of the whole queue depth between hw queues we can forget about a
> dynamic load distribution, here is an example:
> 
>     - say server shares 1024 buffer chunks for a session (do not
>       remember what is the actual number).
> 
>     - 1024 buffers are equally divided between hw queues, let's
>       say 64 (number of cpus), so each queue is 16 requests depth.
> 
>     - only several CPUs produce IO, and instead of occupying the
>       whole "bandwidth" of a session, i.e. 1024 buffer chunks,
>       we limit ourselves to a small queue depth of an each hw
>       queue.
> 
> And performance drops significantly when number of IO producers
> is smaller than number of hw queues (CPUs), and it can be easily
> tested and proved.
> 
> So for this particular ibtrs case tags should be globally shared,
> and seems (unfortunately) there is no any other similar requirements
> for other block devices.

Hi Roman,

I agree that BLK_MQ_F_HOST_TAGS partitions a tag set across hardware 
queues while ibnbd shares a single tag set across multiple hardware 
queues. Since such sharing may be useful for other block drivers, isn't 
that something that should be implemented in the block layer core 
instead of in the ibnbd driver? If that logic would be moved into the 
block layer core, would that allow to reuse the queue restarting logic 
that already exists in the block layer core?

Thanks,

Bart.

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-27 16:37                       ` Bart Van Assche
@ 2019-09-27 16:50                         ` Roman Penyaev
  2019-09-27 17:16                           ` Bart Van Assche
  0 siblings, 1 reply; 123+ messages in thread
From: Roman Penyaev @ 2019-09-27 16:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On Fri, Sep 27, 2019 at 6:37 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/27/19 1:52 AM, Roman Penyaev wrote:
> > No, it seems this thingy is a bit different.  According to my
> > understanding patches 3 and 4 from this patchset do the
> > following: 1# split equally the whole queue depth on number
> > of hardware queues and 2# return tag number which is unique
> > host-wide (more or less similar to unique_tag, right?).
> >
> > 2# is not needed for ibtrs, and 1# can be easy done by dividing
> > queue_depth on number of hw queues on tag set allocation, e.g.
> > something like the following:
> >
> >      ...
> >      tags->nr_hw_queues = num_online_cpus();
> >      tags->queue_depth  = sess->queue_deph / tags->nr_hw_queues;
> >
> >      blk_mq_alloc_tag_set(tags);
> >
> >
> > And this trick won't work out for the performance.  ibtrs client
> > has a single resource: set of buffer chunks received from a
> > server side.  And these buffers should be dynamically distributed
> > between IO producers according to the load.  Having a hard split
> > of the whole queue depth between hw queues we can forget about a
> > dynamic load distribution, here is an example:
> >
> >     - say server shares 1024 buffer chunks for a session (do not
> >       remember what is the actual number).
> >
> >     - 1024 buffers are equally divided between hw queues, let's
> >       say 64 (number of cpus), so each queue is 16 requests depth.
> >
> >     - only several CPUs produce IO, and instead of occupying the
> >       whole "bandwidth" of a session, i.e. 1024 buffer chunks,
> >       we limit ourselves to a small queue depth of an each hw
> >       queue.
> >
> > And performance drops significantly when number of IO producers
> > is smaller than number of hw queues (CPUs), and it can be easily
> > tested and proved.
> >
> > So for this particular ibtrs case tags should be globally shared,
> > and seems (unfortunately) there is no any other similar requirements
> > for other block devices.
>
> Hi Roman,
>
> I agree that BLK_MQ_F_HOST_TAGS partitions a tag set across hardware
> queues while ibnbd shares a single tag set across multiple hardware
> queues. Since such sharing may be useful for other block drivers, isn't
> that something that should be implemented in the block layer core
> instead of in the ibnbd driver? If that logic would be moved into the
> block layer core, would that allow to reuse the queue restarting logic
> that already exists in the block layer core?

Definitely yes, but what other block drivers do you have in mind?

--
Roman

* Re: [PATCH v4 17/25] ibnbd: client: main functionality
  2019-09-27 16:50                         ` Roman Penyaev
@ 2019-09-27 17:16                           ` Bart Van Assche
  0 siblings, 0 replies; 123+ messages in thread
From: Bart Van Assche @ 2019-09-27 17:16 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	rpenyaev, Jack Wang

On 9/27/19 9:50 AM, Roman Penyaev wrote:
> On Fri, Sep 27, 2019 at 6:37 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> I agree that BLK_MQ_F_HOST_TAGS partitions a tag set across hardware
>> queues while ibnbd shares a single tag set across multiple hardware
>> queues. Since such sharing may be useful for other block drivers, isn't
>> that something that should be implemented in the block layer core
>> instead of in the ibnbd driver? If that logic would be moved into the
>> block layer core, would that allow to reuse the queue restarting logic
>> that already exists in the block layer core?
> 
> Definitely yes, but what other block drivers you have in mind?

I'd like to hear the opinion of Jens and Christoph about this topic. My 
concern is that if the code for sharing a tag set across hwqs stays in 
the ibnbd driver, and another block driver that needs the same logic is 
submitted in the future, then the authors of that new driver would have 
to be asked to modify the ibnbd driver in order to end up with a single 
implementation of the tag set sharing code. I think it would be 
inappropriate to ask the authors of such a new driver to modify the 
ibnbd driver.

Bart.


* Re: [PATCH v4 12/25] ibtrs: server: sysfs interface functions
  2019-09-24  0:00   ` [PATCH v4 12/25] ibtrs: server: sysfs interface functions Bart Van Assche
@ 2019-10-02 15:11     ` Jinpu Wang
  0 siblings, 0 replies; 123+ messages in thread
From: Jinpu Wang @ 2019-10-02 15:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev, Roman Pen

On Tue, Sep 24, 2019 at 2:00 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +static void ibtrs_srv_dev_release(struct device *dev)
> > +{
> > +     /* Nobody plays with device references, so nop */
> > +}
>
> I doubt that the above comment is correct.
>
> Thanks,
>
> Bart.
will fix it,

Thank you, Bart!

* Re: [PATCH v4 11/25] ibtrs: server: statistics functions
  2019-09-23 23:56   ` [PATCH v4 11/25] ibtrs: server: statistics functions Bart Van Assche
@ 2019-10-02 15:15     ` Jinpu Wang
  2019-10-02 15:42       ` Leon Romanovsky
  0 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-10-02 15:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Sep 24, 2019 at 1:56 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/20/19 8:03 AM, Jack Wang wrote:
> > +ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
> > +                                 char *page, size_t len)
> > +{
> > +     struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> > +     struct ibtrs_srv_sess *sess;
> > +
> > +     sess = container_of(stats, typeof(*sess), stats);
> > +
> > +     return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
> > +                      (s64)atomic64_read(&r->dir[READ].cnt),
> > +                      (s64)atomic64_read(&r->dir[READ].size_total),
> > +                      (s64)atomic64_read(&r->dir[WRITE].cnt),
> > +                      (s64)atomic64_read(&r->dir[WRITE].size_total),
> > +                      atomic_read(&sess->ids_inflight));
> > +}
>
> Does this follow the sysfs one-value-per-file rule? See also
> Documentation/filesystems/sysfs.txt.
>
> Thanks,
>
> Bart.
Creating one file for each value looks like overkill to me, and there
are enough stats in sysfs that contain multiple values.

Thanks
Jinpu

* Re: [PATCH v4 11/25] ibtrs: server: statistics functions
  2019-10-02 15:15     ` Jinpu Wang
@ 2019-10-02 15:42       ` Leon Romanovsky
  2019-10-02 15:45         ` Jinpu Wang
  0 siblings, 1 reply; 123+ messages in thread
From: Leon Romanovsky @ 2019-10-02 15:42 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Wed, Oct 02, 2019 at 05:15:10PM +0200, Jinpu Wang wrote:
> On Tue, Sep 24, 2019 at 1:56 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > +ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
> > > +                                 char *page, size_t len)
> > > +{
> > > +     struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> > > +     struct ibtrs_srv_sess *sess;
> > > +
> > > +     sess = container_of(stats, typeof(*sess), stats);
> > > +
> > > +     return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
> > > +                      (s64)atomic64_read(&r->dir[READ].cnt),
> > > +                      (s64)atomic64_read(&r->dir[READ].size_total),
> > > +                      (s64)atomic64_read(&r->dir[WRITE].cnt),
> > > +                      (s64)atomic64_read(&r->dir[WRITE].size_total),
> > > +                      atomic_read(&sess->ids_inflight));
> > > +}
> >
> > Does this follow the sysfs one-value-per-file rule? See also
> > Documentation/filesystems/sysfs.txt.
> >
> > Thanks,
> >
> > Bart.
> It looks overkill to create one file for each value to me, and there
> are enough stats in sysfs contain multiple values.

Not for statistics.

Thanks

>
> Thanks
> Jinpu

* Re: [PATCH v4 11/25] ibtrs: server: statistics functions
  2019-10-02 15:42       ` Leon Romanovsky
@ 2019-10-02 15:45         ` Jinpu Wang
  2019-10-02 16:00           ` Leon Romanovsky
  0 siblings, 1 reply; 123+ messages in thread
From: Jinpu Wang @ 2019-10-02 15:45 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Wed, Oct 2, 2019 at 5:42 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Oct 02, 2019 at 05:15:10PM +0200, Jinpu Wang wrote:
> > On Tue, Sep 24, 2019 at 1:56 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > >
> > > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > > +ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
> > > > +                                 char *page, size_t len)
> > > > +{
> > > > +     struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> > > > +     struct ibtrs_srv_sess *sess;
> > > > +
> > > > +     sess = container_of(stats, typeof(*sess), stats);
> > > > +
> > > > +     return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
> > > > +                      (s64)atomic64_read(&r->dir[READ].cnt),
> > > > +                      (s64)atomic64_read(&r->dir[READ].size_total),
> > > > +                      (s64)atomic64_read(&r->dir[WRITE].cnt),
> > > > +                      (s64)atomic64_read(&r->dir[WRITE].size_total),
> > > > +                      atomic_read(&sess->ids_inflight));
> > > > +}
> > >
> > > Does this follow the sysfs one-value-per-file rule? See also
> > > Documentation/filesystems/sysfs.txt.
> > >
> > > Thanks,
> > >
> > > Bart.
> > It looks overkill to create one file for each value to me, and there
> > are enough stats in sysfs contain multiple values.
>
> Not for statistics.
2 examples:
cat /sys/block/nvme0n1/inflight
       0        0
cat /sys/block/nvme0n1/stat
 1267566       53 85396638   927624  4790532  3076340 198306930
19413605        0  2459788 17013620    74392        0 397606816
6864

Thanks

* Re: [PATCH v4 11/25] ibtrs: server: statistics functions
  2019-10-02 15:45         ` Jinpu Wang
@ 2019-10-02 16:00           ` Leon Romanovsky
  0 siblings, 0 replies; 123+ messages in thread
From: Leon Romanovsky @ 2019-10-02 16:00 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Jason Gunthorpe, Doug Ledford,
	Danil Kipnis, rpenyaev

On Wed, Oct 02, 2019 at 05:45:04PM +0200, Jinpu Wang wrote:
> On Wed, Oct 2, 2019 at 5:42 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Oct 02, 2019 at 05:15:10PM +0200, Jinpu Wang wrote:
> > > On Tue, Sep 24, 2019 at 1:56 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > > >
> > > > On 6/20/19 8:03 AM, Jack Wang wrote:
> > > > > +ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
> > > > > +                                 char *page, size_t len)
> > > > > +{
> > > > > +     struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> > > > > +     struct ibtrs_srv_sess *sess;
> > > > > +
> > > > > +     sess = container_of(stats, typeof(*sess), stats);
> > > > > +
> > > > > +     return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
> > > > > +                      (s64)atomic64_read(&r->dir[READ].cnt),
> > > > > +                      (s64)atomic64_read(&r->dir[READ].size_total),
> > > > > +                      (s64)atomic64_read(&r->dir[WRITE].cnt),
> > > > > +                      (s64)atomic64_read(&r->dir[WRITE].size_total),
> > > > > +                      atomic_read(&sess->ids_inflight));
> > > > > +}
> > > >
> > > > Does this follow the sysfs one-value-per-file rule? See also
> > > > Documentation/filesystems/sysfs.txt.
> > > >
> > > > Thanks,
> > > >
> > > > Bart.
> > > It looks like overkill to me to create one file for each value, and there
> > > are enough stats files in sysfs that contain multiple values.
> >
> > Not for statistics.
> 2 examples:
> cat /sys/block/nvme0n1/inflight
>        0        0
> cat /sys/block/nvme0n1/stat
>  1267566       53 85396638   927624  4790532  3076340 198306930
> 19413605        0  2459788 17013620    74392        0 397606816
> 6864

OMG, I feel sorry for the users who now have to go and read the code to see
what column 3 in the second row means.

We respect our users; please don't do what they did.

Thanks

>
> Thanks


Thread overview: 123+ messages
     [not found] <20190620150337.7847-1-jinpuwang@gmail.com>
2019-06-20 15:03 ` [PATCH v4 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
2019-09-23 17:21   ` Bart Van Assche
2019-09-25  9:30     ` Danil Kipnis
2019-07-09  9:55 ` [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Danil Kipnis
2019-07-09 11:00   ` Leon Romanovsky
2019-07-09 11:17     ` Greg KH
2019-07-09 11:57       ` Jinpu Wang
2019-07-09 13:32       ` Leon Romanovsky
2019-07-09 15:39       ` Bart Van Assche
2019-07-09 11:37     ` Jinpu Wang
2019-07-09 12:06       ` Jason Gunthorpe
2019-07-09 13:15         ` Jinpu Wang
2019-07-09 13:19           ` Jason Gunthorpe
2019-07-09 14:17             ` Jinpu Wang
2019-07-09 21:27             ` Sagi Grimberg
2019-07-19 13:12               ` Danil Kipnis
2019-07-10 14:55     ` Danil Kipnis
2019-07-09 12:04   ` Jason Gunthorpe
2019-07-09 19:45   ` Sagi Grimberg
2019-07-10 13:55     ` Jason Gunthorpe
2019-07-10 16:25       ` Sagi Grimberg
2019-07-10 17:25         ` Jason Gunthorpe
2019-07-10 19:11           ` Sagi Grimberg
2019-07-11  7:27             ` Danil Kipnis
2019-07-11  8:54     ` Danil Kipnis
2019-07-12  0:22       ` Sagi Grimberg
2019-07-12  7:57         ` Jinpu Wang
2019-07-12 19:40           ` Sagi Grimberg
2019-07-15 11:21             ` Jinpu Wang
2019-07-12 10:58         ` Danil Kipnis
     [not found] ` <20190620150337.7847-26-jinpuwang@gmail.com>
2019-07-09 15:10   ` [PATCH v4 25/25] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Leon Romanovsky
2019-07-09 15:18     ` Jinpu Wang
2019-07-09 15:51       ` Leon Romanovsky
2019-09-13 23:56   ` Bart Van Assche
2019-09-19 10:30     ` Jinpu Wang
     [not found] ` <20190620150337.7847-16-jinpuwang@gmail.com>
2019-09-13 22:10   ` [PATCH v4 15/25] ibnbd: private headers with IBNBD protocol structs and helpers Bart Van Assche
2019-09-15 14:30     ` Jinpu Wang
2019-09-16  5:27       ` Leon Romanovsky
2019-09-16 13:45         ` Bart Van Assche
2019-09-17 15:41           ` Leon Romanovsky
2019-09-17 15:52             ` Jinpu Wang
2019-09-16  7:08       ` Danil Kipnis
2019-09-16 14:57       ` Jinpu Wang
2019-09-16 17:25         ` Bart Van Assche
2019-09-17 12:27           ` Jinpu Wang
2019-09-16 15:39       ` Jinpu Wang
2019-09-18 15:26         ` Bart Van Assche
2019-09-18 16:11           ` Jinpu Wang
     [not found] ` <20190620150337.7847-17-jinpuwang@gmail.com>
2019-09-13 22:25   ` [PATCH v4 16/25] ibnbd: client: private header with client structs and functions Bart Van Assche
2019-09-17 16:36     ` Jinpu Wang
2019-09-25 23:43       ` Danil Kipnis
2019-09-26 10:00         ` Jinpu Wang
     [not found] ` <20190620150337.7847-18-jinpuwang@gmail.com>
2019-09-13 23:46   ` [PATCH v4 17/25] ibnbd: client: main functionality Bart Van Assche
2019-09-16 14:17     ` Danil Kipnis
2019-09-16 16:46       ` Bart Van Assche
2019-09-17 11:39         ` Danil Kipnis
2019-09-18  7:14           ` Danil Kipnis
2019-09-18 15:47             ` Bart Van Assche
2019-09-20  8:29               ` Danil Kipnis
2019-09-25 22:26               ` Danil Kipnis
2019-09-26  9:55                 ` Roman Penyaev
2019-09-26 15:01                   ` Bart Van Assche
2019-09-27  8:52                     ` Roman Penyaev
2019-09-27  9:32                       ` Danil Kipnis
2019-09-27 12:18                         ` Danil Kipnis
2019-09-27 16:37                       ` Bart Van Assche
2019-09-27 16:50                         ` Roman Penyaev
2019-09-27 17:16                           ` Bart Van Assche
2019-09-17 13:09     ` Jinpu Wang
2019-09-17 16:46       ` Bart Van Assche
2019-09-18 12:02         ` Jinpu Wang
2019-09-18 16:05     ` Jinpu Wang
2019-09-14  0:00   ` Bart Van Assche
     [not found] ` <20190620150337.7847-25-jinpuwang@gmail.com>
2019-09-13 23:58   ` [PATCH v4 24/25] ibnbd: a bit of documentation Bart Van Assche
2019-09-18 12:22     ` Jinpu Wang
     [not found] ` <20190620150337.7847-19-jinpuwang@gmail.com>
2019-09-18 16:28   ` [PATCH v4 18/25] ibnbd: client: sysfs interface functions Bart Van Assche
2019-09-19 15:55     ` Jinpu Wang
     [not found] ` <20190620150337.7847-21-jinpuwang@gmail.com>
2019-09-18 17:41   ` [PATCH v4 20/25] ibnbd: server: main functionality Bart Van Assche
2019-09-20  7:36     ` Danil Kipnis
2019-09-20 15:42       ` Bart Van Assche
2019-09-23 15:19         ` Danil Kipnis
     [not found] ` <20190620150337.7847-22-jinpuwang@gmail.com>
2019-09-18 21:46   ` [PATCH v4 21/25] ibnbd: server: functionality for IO submission to file or block dev Bart Van Assche
2019-09-26 14:04     ` Jinpu Wang
2019-09-26 15:11       ` Bart Van Assche
2019-09-26 15:25         ` Danil Kipnis
2019-09-26 15:29           ` Bart Van Assche
2019-09-26 15:38             ` Danil Kipnis
2019-09-26 15:42               ` Jinpu Wang
     [not found] ` <20190620150337.7847-3-jinpuwang@gmail.com>
2019-09-23 17:44   ` [PATCH v4 02/25] ibtrs: public interface header to establish RDMA connections Bart Van Assche
2019-09-25 10:20     ` Danil Kipnis
2019-09-25 15:38       ` Bart Van Assche
     [not found] ` <20190620150337.7847-7-jinpuwang@gmail.com>
2019-09-23 21:51   ` [PATCH v4 06/25] ibtrs: client: main functionality Bart Van Assche
2019-09-25 17:36     ` Danil Kipnis
2019-09-25 18:55       ` Bart Van Assche
2019-09-25 20:50         ` Danil Kipnis
2019-09-25 21:08           ` Bart Van Assche
2019-09-25 21:16             ` Bart Van Assche
2019-09-25 22:53             ` Danil Kipnis
2019-09-25 23:21               ` Bart Van Assche
2019-09-26  9:16                 ` Danil Kipnis
     [not found] ` <20190620150337.7847-4-jinpuwang@gmail.com>
2019-09-23 22:50   ` [PATCH v4 03/25] ibtrs: private headers with IBTRS protocol structs and helpers Bart Van Assche
2019-09-25 21:45     ` Danil Kipnis
2019-09-25 21:57       ` Bart Van Assche
2019-09-27  8:56     ` Jinpu Wang
     [not found] ` <20190620150337.7847-5-jinpuwang@gmail.com>
2019-09-23 23:03   ` [PATCH v4 04/25] ibtrs: core: lib functions shared between client and server modules Bart Van Assche
2019-09-27 10:13     ` Jinpu Wang
     [not found] ` <20190620150337.7847-6-jinpuwang@gmail.com>
2019-09-23 23:05   ` [PATCH v4 05/25] ibtrs: client: private header with client structs and functions Bart Van Assche
2019-09-27 10:18     ` Jinpu Wang
     [not found] ` <20190620150337.7847-8-jinpuwang@gmail.com>
2019-09-23 23:15   ` [PATCH v4 07/25] ibtrs: client: statistics functions Bart Van Assche
2019-09-27 12:00     ` Jinpu Wang
     [not found] ` <20190620150337.7847-10-jinpuwang@gmail.com>
2019-09-23 23:21   ` [PATCH v4 09/25] ibtrs: server: private header with server structs and functions Bart Van Assche
2019-09-27 12:04     ` Jinpu Wang
     [not found] ` <20190620150337.7847-11-jinpuwang@gmail.com>
2019-09-23 23:49   ` [PATCH v4 10/25] ibtrs: server: main functionality Bart Van Assche
2019-09-27 15:03     ` Jinpu Wang
2019-09-27 15:11       ` Bart Van Assche
2019-09-27 15:19         ` Jinpu Wang
     [not found] ` <20190620150337.7847-12-jinpuwang@gmail.com>
2019-09-23 23:56   ` [PATCH v4 11/25] ibtrs: server: statistics functions Bart Van Assche
2019-10-02 15:15     ` Jinpu Wang
2019-10-02 15:42       ` Leon Romanovsky
2019-10-02 15:45         ` Jinpu Wang
2019-10-02 16:00           ` Leon Romanovsky
     [not found] ` <20190620150337.7847-13-jinpuwang@gmail.com>
2019-09-24  0:00   ` [PATCH v4 12/25] ibtrs: server: sysfs interface functions Bart Van Assche
2019-10-02 15:11     ` Jinpu Wang
