linux-block.vger.kernel.org archive mirror
* [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
@ 2018-05-18 13:03 Roman Pen
  2018-05-18 13:03 ` [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Roman Pen
                   ` (26 more replies)
  0 siblings, 27 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

Hi all,

This is v2 of the series that introduces the IBNBD/IBTRS modules.

This cover letter is split into three parts:

1. Introduction, which almost repeats everything from previous cover
   letters.
2. Changelog.
3. Performance measurements on linux-4.17.0-rc2 with two different
   Mellanox cards (ConnectX-2 and ConnectX-3) and CPUs (Intel and AMD).


 Introduction
 -------------

IBTRS (InfiniBand Transport) is a reliable high speed transport library
which allows for establishing connections between client and server
machines via RDMA. It is optimized to transfer (read/write) IO blocks
in the sense that it follows the BIO semantics of providing the
possibility to either write data from a scatter-gather list to the
remote side or to request ("read") data transfer from the remote side
into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality.  In IBTRS terminology, a path is a set of RDMA CMs, and a
particular path is selected according to the load-balancing policy.

IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.

Why?

   - IBNBD/IBTRS was developed to map thin provisioned volumes,
     thus the internal protocol is simple.
   - IBTRS was developed as an independent RDMA transport library, which
     supports fail-over and load-balancing policies using multipath, thus
     it can be used for other IO needs and not only for block devices.
   - IBNBD/IBTRS is faster than NVME over RDMA in most of the measured
     configurations.
     Old comparison results:
     https://www.spinics.net/lists/linux-rdma/msg48799.html
     New comparison results: see performance measurements section below.

Key features of IBTRS transport library and IBNBD block device:

o High throughput and low latency due to:
   - Only two RDMA messages per IO.
   - IMM InfiniBand messages on responses to reduce round trip latency.
   - Simplified memory management: memory allocation happens once on
     server side when IBTRS session is established.

o IO fail-over and load-balancing by using multipath.  According to
  our test loads, an additional path brings ~20% more bandwidth.

o Simple configuration of IBNBD:
   - Server side is completely passive: volumes do not need to be
     explicitly exported.
   - Only IB port GID and device path are needed on client side to map
     a block device.
   - A device is remapped automatically, e.g. after storage reboot.

Commits for kernel can be found here:
   https://github.com/profitbricks/ibnbd/commits/linux-4.17-rc2

The out-of-tree modules are here:
   https://github.com/profitbricks/ibnbd/

Vault 2017 presentation:
   http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf


 Changelog
 ---------

v2:
  o IBNBD:
     - No legacy request IO mode, only MQ is left.

  o IBTRS:
     - No FMR registration, only FR is left.

     - By default memory is always registered for the sake of security,
       i.e. by default no pd is created with IB_PD_UNSAFE_GLOBAL_RKEY.

     - Server side (target) always does memory registration and exchanges
       MRs dma addresses with client for direct writes from client side.

     - Client side (initiator) has the `noreg_cnt` module option, which
       specifies the number of sg entries from which read IO should be
       registered.  By default 0 is set, i.e. always register memory for
       read IOs.  (IBTRS protocol does not require registration for
       writes, which always go directly to server memory.)

     - Proper DMA sync with ib_dma_sync_single_for_(cpu|device) calls.

     - Do signalled IB_WR_LOCAL_INV.

     - Avoid open-coding of string conversion to IPv4/6 sockaddr,
       inet_pton_with_scope() is used instead.

     - Introduced block device namespaces configuration on server side
       (target) to close a security gap in untrusted environments, where
       a client could map a block device which does not belong to it.
       When device namespaces are enabled on server side, server opens
       device using client's session name in the device path, where
       session name is a random token, e.g. GUID.  If server is configured
       to find device namespaces in a folder /run/ibnbd-guid/, then
       request to map device 'sda1' from client with session 'A' (or any
       token) will be resolved by path /run/ibnbd-guid/A/sda1.

     - README is extended with description of IBTRS and IBNBD protocol,
       e.g. how IB IMM field is used to acknowledge IO requests or
       heartbeats.

     - IBTRS/IBNBD client and server modules are registered as devices in
       the kernel in order to have all sysfs configuration entries under
       /sys/devices/virtual/ and not to spoil the /sys/kernel directory.
       I failed to switch configuration to configfs for several reasons:

       a) configfs entries created from kernel side using the
          configfs_register_group() API call can't be removed from
          userspace side using the rmdir() syscall.  That is required
          behaviour for IBTRS when a session is created by an API call
          and not from userspace.

          Actually, I have a patch for configfs to solve a), but then
          b) comes.

       b) configfs show/store callbacks are racy by design (in
          contrast to kernfs), i.e. even if a dentry is unhashed, an
          opener of it can be faster and a few moments later those
          callbacks can be invoked.  To guarantee that all openers have
          left and nobody is able to access an entry after
          configfs_drop_dentry() has returned, additional hairy code
          would have to be written with wait queues, locks, etc.  I
          didn't like at all what I eventually got, gave up and left it
          as is, i.e. sysfs.
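
To illustrate the device namespaces item above, here is a minimal
user-space sketch of how a map request could be resolved to a path.  It is
not the code from the patches: the helper name resolve_dev_path() and the
rejection of ".." and of "/" in the session name are my assumptions for
illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<ns_dir>/<sessname>/<dev>" into @buf.
 * Returns 0 on success, -1 if a name looks unsafe or the result
 * does not fit into @len bytes.
 * (Hypothetical helper, the validation rules are assumptions.)
 */
static int resolve_dev_path(char *buf, size_t len, const char *ns_dir,
			    const char *sessname, const char *dev)
{
	int n;

	assert(buf && ns_dir && sessname && dev);

	/* Reject empty names and anything that could leave the namespace. */
	if (!*sessname || !*dev)
		return -1;
	if (strchr(sessname, '/') || strstr(sessname, "..") ||
	    strstr(dev, ".."))
		return -1;

	n = snprintf(buf, len, "%s/%s/%s", ns_dir, sessname, dev);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* output error or truncated */

	return 0;
}
```

With ns_dir "/run/ibnbd-guid", session "A" and device "sda1" the helper
produces /run/ibnbd-guid/A/sda1, the path from the example above.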
			    

  What is left unchanged on IBTRS side but was suggested to modify:

     - Bart suggested to use sbitmap instead of calling find_first_zero_bit()
       and friends.  I found that calling the pure bit API is more explicit
       in comparison to sbitmap - there is no need to use sbitmap_queue
       and all the power of wait queues, and there is no benefit in terms
       of LoC either.

     - I made several attempts to unify the approach of wrapping ib_device
       with a ULP device structure (e.g. device pool or using ib_client
       API), but it turned out that none of these approaches brings
       simplicity, so IBTRS still creates a ULP specific device on demand
       and keeps it in the list.

     - Sagi suggested to extend inet_pton_with_scope() with gid to
       sockaddr conversion, but after IPv6 conversion (gid is compliant
       with IPv6) special RDMA magic should be done in order to set up
       the IB port space range, which is very specific and does not fit
       into some generic library helper.  And am I right that gid is not
       used and seems dying?
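
For readers not familiar with the bit API mentioned in the first item
above, this is roughly the idea as a single-threaded user-space sketch.
It is not the IBTRS code: the kernel version would rely on atomic bit
operations, and QUEUE_DEPTH, struct tag_map, tag_get() and tag_put() are
hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define QUEUE_DEPTH	128	/* hypothetical session queue depth */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define NWORDS		((QUEUE_DEPTH + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per tag: 0 means free, 1 means in use. */
struct tag_map {
	unsigned long bits[NWORDS];
};

/* Claim the lowest free tag; returns its index or -1 if all are busy. */
static int tag_get(struct tag_map *map)
{
	size_t i;

	assert(map);
	for (i = 0; i < (size_t)QUEUE_DEPTH; i++) {
		unsigned long *word = &map->bits[i / BITS_PER_WORD];
		unsigned long mask = 1UL << (i % BITS_PER_WORD);

		if (!(*word & mask)) {
			*word |= mask;	/* not atomic, sketch only */
			return (int)i;
		}
	}
	return -1;
}

/* Release a tag previously returned by tag_get(). */
static void tag_put(struct tag_map *map, int tag)
{
	size_t i = (size_t)tag;

	assert(map && tag >= 0 && tag < QUEUE_DEPTH);
	map->bits[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}
```

The whole allocator is a scan plus a bit flip, which is why I see no gain
in LoC from switching to sbitmap here.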


v1:
  - IBTRS: load-balancing and IO fail-over using multipath features were added.

  - Major parts of the code were rewritten, simplified and overall code
    size was reduced by a quarter.

  * https://lwn.net/Articles/746342/

v0:
  - Initial submission

  * https://lwn.net/Articles/718181/


 Performance measurements
 ------------------------

o FR and FMR:

  First, I would like to start the performance measurements with the
  (probably well known) observation that FR is slower than FMR by ~40% on
  Mellanox ConnectX-2 and by ~15% on ConnectX-3.  Those are huge numbers,
  e.g. FIO results on IBNBD:

  - on ConnectX-2 (MT26428)
    x64 CPUs AMD Opteron(tm) Processor 6282 SE

     rw=randread, bandwidth in Kbytes:
     jobs   IBNBD (FMR)   IBNBD (FR)     Change
       x1       1037624       932951     -10.1%
       x8       2569649      1543074     -40.0%
      x16       2751461      1531282     -44.3%
      x24       2360887      1396153     -40.9%
      x32       1873174      1215334     -35.1%
      x40       1995846      1255781     -37.1%
      x48       2004740      1240931     -38.1%
      x56       2076871      1250333     -39.8%
      x64       2051668      1229389     -40.1%

  - on ConnectX-3 (MT4099)
    x40 CPUs Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

     rw=randread, bandwidth in Kbytes:
     jobs   IBNBD (FMR)   IBNBD (FR)     Change
       x1       2243322       1961216    -12.6%
       x8       4389048       4012912     -8.6%
      x16       4473103       4033837     -9.8%
      x24       4570209       3939186    -13.8%
      x32       4576757       3843434    -16.0%
      x40       4468110       3696896    -17.3%
      x48       4848049       4106259    -15.3%
      x56       4872790       4141374    -15.0%
      x64       4967287       4207317    -15.3%
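
  For reference, the Change column in the two tables above is the relative
  difference of the FR bandwidth against the FMR bandwidth.  A minimal
  sketch of that arithmetic (change_pct() is a name made up for this
  example, it is not part of the patches or of the test scripts):

```c
#include <assert.h>

/*
 * Relative change of @val against @base, in percent.
 * A negative result means @val is lower than @base.
 */
static double change_pct(double base, double val)
{
	assert(base > 0.0);
	return (val - base) / base * 100.0;
}
```

  For the x1 row on ConnectX-2, change_pct(1037624, 932951) gives about
  -10.1, which matches the table.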

  I missed the whole history of why FMR is considered outdated and a
  no-go.  I would very much appreciate it if someone could explain to me
  why FR should be preferred.  Is there a link with a clear explanation?

o IBNBD and NVMEoRDMA

  Here I would like to publish IBNBD and NVMEoRDMA comparison results
  with FR memory registration on each IO (i.e. with the following module
  params: 'register_always=Y' for NVME and 'noreg_cnt=0' for IBTRS).
  In the tables below, Change is NVMEoRDMA relative to IBNBD.

  - on ConnectX-2 (MT26428)
    x64 CPUs AMD Opteron(tm) Processor 6282 SE

     rw=randread, bandwidth in Kbytes:
     jobs        IBNBD     NVMEoRDMA     Change
       x1       932951        975425      +4.6%
       x8      1543074       1504416      -2.5%
      x16      1531282       1432937      -6.4%
      x24      1396153       1244858     -10.8%
      x32      1215334       1066607     -12.2%
      x40      1255781       1076841     -14.2%
      x48      1240931       1066453     -14.1%
      x56      1250333       1065879     -14.8%
      x64      1229389       1064199     -13.4%

     rw=randwrite, bandwidth in Kbytes:
     jobs        IBNBD     NVMEoRDMA     Change
       x1      1416413       1181102     -16.6%
       x8      2438615       1977051     -18.9%
      x16      2436924       1854223     -23.9%
      x24      2430527       1714580     -29.5%
      x32      2425552       1641288     -32.3%
      x40      2378784       1592788     -33.0%
      x48      2202260       1511895     -31.3%
      x56      2207013       1493400     -32.3%
      x64      2098949       1432951     -31.7%


  - on ConnectX-3 (MT4099)
    x40 CPUs Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

     rw=randread, bandwidth in Kbytes:
     jobs        IBNBD     NVMEoRDMA     Change
       x1      1961216       2046572      +4.4%
       x8      4012912       4059410      +1.2%
      x16      4033837       3968410      -1.6%
      x24      3939186       3770729      -4.3%
      x32      3843434       3623869      -5.7%
      x40      3696896       3448772      -6.7%
      x48      4106259       3729201      -9.2%
      x56      4141374       3732954      -9.9%
      x64      4207317       3805638      -9.5%

     rw=randwrite, bandwidth in Kbytes:
     jobs        IBNBD     NVMEoRDMA     Change
       x1      3195637       2479068     -22.4%
       x8      4576924       4541743      -0.8%
      x16      4581528       4555459      -0.6%
      x24      4692540       4595963      -2.1%
      x32      4686968       4540456      -3.1%
      x40      4583814       4404859      -3.9%
      x48      4969587       4710902      -5.2%
      x56      4996101       4701814      -5.9%
      x64      5083460       4759663      -6.4%

  The interesting observation is that on the machine with Intel CPUs and
  the ConnectX-3 card the difference between IBNBD and NVME bandwidth is
  significantly smaller compared to AMD and ConnectX-2.  I did not
  thoroughly investigate that behaviour, but suspect that the devil
  is in the Intel vs AMD architecture and probably in how the NUMA nodes
  are organized, i.e. Intel has 2 NUMA nodes against 8 on AMD.  If someone
  is interested in those results and can point out where to dig on the
  NVME side, I can investigate in depth why exactly NVME bandwidth drops
  significantly on the AMD machine with ConnectX-2.

  Shiny graphs are here:
  https://docs.google.com/spreadsheets/d/1vxSoIvfjPbOWD61XMeN2_gPGxsxrbIUOZADk1UX5lj0

Roman Pen (26):
  rculist: introduce list_next_or_null_rr_rcu()
  sysfs: export sysfs_remove_file_self()
  ibtrs: public interface header to establish RDMA connections
  ibtrs: private headers with IBTRS protocol structs and helpers
  ibtrs: core: lib functions shared between client and server modules
  ibtrs: client: private header with client structs and functions
  ibtrs: client: main functionality
  ibtrs: client: statistics functions
  ibtrs: client: sysfs interface functions
  ibtrs: server: private header with server structs and functions
  ibtrs: server: main functionality
  ibtrs: server: statistics functions
  ibtrs: server: sysfs interface functions
  ibtrs: include client and server modules into kernel compilation
  ibtrs: a bit of documentation
  ibnbd: private headers with IBNBD protocol structs and helpers
  ibnbd: client: private header with client structs and functions
  ibnbd: client: main functionality
  ibnbd: client: sysfs interface functions
  ibnbd: server: private header with server structs and functions
  ibnbd: server: main functionality
  ibnbd: server: functionality for IO submission to file or block dev
  ibnbd: server: sysfs interface functions
  ibnbd: include client and server modules into kernel compilation
  ibnbd: a bit of documentation
  MAINTAINERS: Add maintainer for IBNBD/IBTRS modules

 MAINTAINERS                                    |   14 +
 drivers/block/Kconfig                          |    2 +
 drivers/block/Makefile                         |    1 +
 drivers/block/ibnbd/Kconfig                    |   22 +
 drivers/block/ibnbd/Makefile                   |   13 +
 drivers/block/ibnbd/README                     |  299 +++
 drivers/block/ibnbd/ibnbd-clt-sysfs.c          |  669 ++++++
 drivers/block/ibnbd/ibnbd-clt.c                | 1818 +++++++++++++++
 drivers/block/ibnbd/ibnbd-clt.h                |  171 ++
 drivers/block/ibnbd/ibnbd-log.h                |   71 +
 drivers/block/ibnbd/ibnbd-proto.h              |  364 +++
 drivers/block/ibnbd/ibnbd-srv-dev.c            |  410 ++++
 drivers/block/ibnbd/ibnbd-srv-dev.h            |  149 ++
 drivers/block/ibnbd/ibnbd-srv-sysfs.c          |  242 ++
 drivers/block/ibnbd/ibnbd-srv.c                |  922 ++++++++
 drivers/block/ibnbd/ibnbd-srv.h                |  100 +
 drivers/infiniband/Kconfig                     |    1 +
 drivers/infiniband/ulp/Makefile                |    1 +
 drivers/infiniband/ulp/ibtrs/Kconfig           |   20 +
 drivers/infiniband/ulp/ibtrs/Makefile          |   15 +
 drivers/infiniband/ulp/ibtrs/README            |  358 +++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c |  455 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c |  482 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c       | 2814 ++++++++++++++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h       |  304 +++
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h       |   91 +
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h       |  458 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c |  110 +
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c |  271 +++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c       | 1980 +++++++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h       |  174 ++
 drivers/infiniband/ulp/ibtrs/ibtrs.c           |  609 +++++
 drivers/infiniband/ulp/ibtrs/ibtrs.h           |  331 +++
 fs/sysfs/file.c                                |    1 +
 include/linux/rculist.h                        |   19 +
 35 files changed, 13761 insertions(+)
 create mode 100644 drivers/block/ibnbd/Kconfig
 create mode 100644 drivers/block/ibnbd/Makefile
 create mode 100644 drivers/block/ibnbd/README
 create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.h
 create mode 100644 drivers/block/ibnbd/ibnbd-log.h
 create mode 100644 drivers/block/ibnbd/ibnbd-proto.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
 create mode 100644 drivers/infiniband/ulp/ibtrs/README
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>

-- 
2.13.1


* [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 16:56   ` Linus Torvalds
  2018-05-19 16:37   ` Paul E. McKenney
  2018-05-18 13:03 ` [PATCH v2 02/26] sysfs: export sysfs_remove_file_self() Roman Pen
                   ` (25 subsequent siblings)
  26 siblings, 2 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen, Paul E . McKenney, linux-kernel

The function is going to be used in the transport over RDMA module
in subsequent patches.

The function returns the next element in round-robin fashion,
i.e. the head will be skipped.  NULL will be returned if the list
is observed as empty.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org
---
 include/linux/rculist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 127f534fec94..b0840d5ab25a 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 })
 
 /**
+ * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
+ * @head:	the head for the list.
+ * @ptr:        the list head to take the next element from.
+ * @type:       the type of the struct this is embedded in.
+ * @memb:       the name of the list_head within the struct.
+ *
+ * Next element returned in round-robin fashion, i.e. head will be skipped,
+ * but if list is observed as empty, NULL will be returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
+({ \
+	list_next_or_null_rcu(head, ptr, type, memb) ?: \
+		list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
+})
+
+/**
  * list_for_each_entry_rcu	-	iterate over rcu list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
-- 
2.13.1
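
The round-robin semantics described in this patch can be illustrated with
a single-threaded user-space sketch.  It leaves out RCU and READ_ONCE()
entirely, so it only shows the "skip the head, NULL on empty list" logic.
struct path, list_init(), list_add_tail() and path_next_rr() are stand-ins
written for this example, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical path object; only the embedded list node matters here. */
struct path {
	int id;
	struct list_head entry;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Return the entry after @cur, wrapping past @head, or NULL if the
 * list is empty.  No RCU here: single-threaded illustration only.
 */
static struct path *path_next_rr(struct list_head *head,
				 struct list_head *cur)
{
	struct list_head *next;

	assert(head && cur);
	next = cur->next;
	if (next == head)	/* reached the head: skip it */
		next = head->next;
	if (next == head)	/* still the head: list is empty */
		return NULL;
	return container_of(next, struct path, entry);
}
```

With two paths a and b on the list, the successor of a is b and the
successor of b is a again, the head is never returned.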


* [PATCH v2 02/26] sysfs: export sysfs_remove_file_self()
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
  2018-05-18 13:03 ` [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 15:08   ` Tejun Heo
  2018-05-18 13:03 ` [PATCH v2 03/26] ibtrs: public interface header to establish RDMA connections Roman Pen
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen, Tejun Heo, linux-kernel

The function is going to be used in the transport over RDMA module
in subsequent patches.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
---
 fs/sysfs/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 5c13f29bfcdb..ff7443ac2aa7 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -444,6 +444,7 @@ bool sysfs_remove_file_self(struct kobject *kobj, const struct attribute *attr)
 	kernfs_put(kn);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(sysfs_remove_file_self);
 
 void sysfs_remove_files(struct kobject *kobj, const struct attribute **ptr)
 {
-- 
2.13.1


* [PATCH v2 03/26] ibtrs: public interface header to establish RDMA connections
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
  2018-05-18 13:03 ` [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Roman Pen
  2018-05-18 13:03 ` [PATCH v2 02/26] sysfs: export sysfs_remove_file_self() Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 04/26] ibtrs: private headers with IBTRS protocol structs and helpers Roman Pen
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

Introduce a public header which provides a set of API functions to
establish RDMA connections from client to server machine using the
IBTRS protocol, which manages RDMA connections for each session and
does multipathing and load balancing.

Main functions for client (active) side:

 ibtrs_clt_open() - Creates a set of RDMA connections encapsulated
                    in an IBTRS session and returns a pointer to the
                    IBTRS session object.
 ibtrs_clt_close() - Closes RDMA connections associated with an IBTRS
                     session.
 ibtrs_clt_request() - Requests zero-copy RDMA transfer to/from
                       server.

Main functions for server (passive) side:

 ibtrs_srv_open() - Starts listening for IBTRS clients on the specified
                    port and invokes IBTRS callbacks for incoming
                    RDMA requests or link events.
 ibtrs_srv_close() - Closes IBTRS server context.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs.h | 324 +++++++++++++++++++++++++++++++++++
 1 file changed, 324 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.h b/drivers/infiniband/ulp/ibtrs/ibtrs.h
new file mode 100644
index 000000000000..08325e39a41e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.h
@@ -0,0 +1,324 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_H
+#define IBTRS_H
+
+#include <linux/socket.h>
+#include <linux/scatterlist.h>
+
+struct ibtrs_tag;
+struct ibtrs_clt;
+struct ibtrs_srv_ctx;
+struct ibtrs_srv;
+struct ibtrs_srv_op;
+
+/*
+ * Here goes IBTRS client API
+ */
+
+/**
+ * enum ibtrs_clt_link_ev - Events about connectivity state of a client
+ * @IBTRS_CLT_LINK_EV_RECONNECTED	Client was reconnected.
+ * @IBTRS_CLT_LINK_EV_DISCONNECTED	Client was disconnected.
+ */
+enum ibtrs_clt_link_ev {
+	IBTRS_CLT_LINK_EV_RECONNECTED,
+	IBTRS_CLT_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * Source and destination address of a path to be established
+ */
+struct ibtrs_addr {
+	struct sockaddr_storage *src;
+	struct sockaddr_storage *dst;
+};
+
+typedef void (link_clt_ev_fn)(void *priv, enum ibtrs_clt_link_ev ev);
+/**
+ * ibtrs_clt_open() - Open a session to an IBTRS server
+ * @priv:		User supplied private data.
+ * @link_ev:		Event notification for connection state changes
+ *	@priv:			user supplied data that was passed to
+ *				ibtrs_clt_open()
+ *	@ev:			Occurred event
+ * @sessname: name of the session
+ * @paths: Paths to be established defined by their src and dst addresses
+ * @path_cnt: Number of elements in the @paths array
+ * @port: port to be used by the IBTRS session
+ * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
+ * @max_inflight_msg: Max. number of parallel inflight messages for the session
+ * @max_segments: Max. number of segments per IO request
+ * @reconnect_delay_sec: time between reconnect tries
+ * @max_reconnect_attempts: Number of times to reconnect on error before giving
+ *			    up, 0 for disabled, -1 for forever
+ *
+ * Starts session establishment with the ibtrs_server. The function can block
+ * up to ~2000ms until it returns.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct ibtrs_addr *paths,
+				 size_t path_cnt, short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts);
+
+/**
+ * ibtrs_clt_close() - Close a session
+ * @sess: Session handler, is freed on return
+ */
+void ibtrs_clt_close(struct ibtrs_clt *sess);
+
+/**
+ * ibtrs_tag_from_pdu() - converts opaque pdu pointer to ibtrs_tag
+ * @pdu: opaque pointer
+ */
+struct ibtrs_tag *ibtrs_tag_from_pdu(void *pdu);
+
+/**
+ * ibtrs_tag_to_pdu() - converts ibtrs_tag to opaque pdu pointer
+ * @tag: IBTRS tag pointer
+ */
+void *ibtrs_tag_to_pdu(struct ibtrs_tag *tag);
+
+enum {
+	IBTRS_TAG_NOWAIT = 0,
+	IBTRS_TAG_WAIT   = 1,
+};
+
+/**
+ * enum ibtrs_clt_con_type - type of ib connection to use with a given tag
+ * @IBTRS_USR_CON: use a connection reserved for "service" messages
+ * @IBTRS_IO_CON: use a connection reserved for IO
+ */
+enum ibtrs_clt_con_type {
+	IBTRS_USR_CON,
+	IBTRS_IO_CON
+};
+
+/**
+ * ibtrs_clt_get_tag() - allocates tag for future RDMA operation
+ * @sess:	Current session
+ * @con_type:	Type of connection to use with the tag
+ * @wait:	Wait type
+ *
+ * Description:
+ *    Allocates tag for the following RDMA operation.  Tag is used
+ *    to preallocate all resources and to propagate memory pressure
+ *    up earlier.
+ *
+ * Context:
+ *    Can sleep if @wait == IBTRS_TAG_WAIT
+ */
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess,
+				    enum ibtrs_clt_con_type con_type,
+				    int wait);
+
+/**
+ * ibtrs_clt_put_tag() - puts allocated tag
+ * @sess:	Current session
+ * @tag:	Tag to be freed
+ *
+ * Context:
+ *    Does not matter
+ */
+void ibtrs_clt_put_tag(struct ibtrs_clt *sess, struct ibtrs_tag *tag);
+
+typedef void (ibtrs_conf_fn)(void *priv, int errno);
+/**
+ * ibtrs_clt_request() - Request data transfer to/from server via RDMA.
+ *
+ * @dir:	READ/WRITE
+ * @conf:	callback function to be called as confirmation
+ * @sess:	Session
+ * @tag:	Preallocated tag
+ * @priv:	User provided data, passed back with corresponding
+ *		@(conf) confirmation.
+ * @vec:	Message that is sent to server together with the request.
+ *		Sum of len of all @vec elements limited to <= IO_MSG_SIZE.
+ *		Since the msg is copied internally it can be allocated on stack.
+ * @nr:		Number of elements in @vec.
+ * @len:	length of data sent to/from server
+ * @sg:		Pages to be sent/received to/from server.
+ * @sg_cnt:	Number of elements in the @sg
+ *
+ * Return:
+ * 0:		Success
+ * <0:		Error
+ *
+ * On dir=READ ibtrs client will request a data transfer from Server to client.
+ * The data that the server will respond with will be stored in @sg when
+ * the user receives an %IBTRS_CLT_RDMA_EV_RDMA_REQUEST_WRITE_COMPL event.
+ * On dir=WRITE ibtrs client will rdma write data in sg to server side.
+ */
+int ibtrs_clt_request(int dir, ibtrs_conf_fn *conf, struct ibtrs_clt *sess,
+		      struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		      size_t nr, size_t len, struct scatterlist *sg,
+		      unsigned int sg_cnt);
+
+/**
+ * ibtrs_attrs - IBTRS session attributes
+ */
+struct ibtrs_attrs {
+	u32	queue_depth;
+	u32	max_io_size;
+	u8	sessname[NAME_MAX];
+};
+
+/**
+ * ibtrs_clt_query() - queries IBTRS session attributes
+ *
+ * Returns:
+ *    0 on success
+ *    -ECOMM		no connection to the server
+ */
+int ibtrs_clt_query(struct ibtrs_clt *sess, struct ibtrs_attrs *attr);
+
+/*
+ * Here goes IBTRS server API
+ */
+
+/**
+ * enum ibtrs_srv_link_ev - Server link events
+ * @IBTRS_SRV_LINK_EV_CONNECTED:	Connection from client established
+ * @IBTRS_SRV_LINK_EV_DISCONNECTED:	Connection was disconnected, all
+ *					connection IBTRS resources were freed.
+ */
+enum ibtrs_srv_link_ev {
+	IBTRS_SRV_LINK_EV_CONNECTED,
+	IBTRS_SRV_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * rdma_ev_fn():	Event notification for RDMA operations
+ *			If the callback returns a value != 0, an error message
+ *			for the data transfer will be sent to the client.
+ *
+ *	@sess:		Session
+ *	@priv:		Private data set by ibtrs_srv_set_sess_priv()
+ *	@id:		internal IBTRS operation id
+ *	@dir:		READ/WRITE
+ *	@data:		Pointer to (bidirectional) rdma memory area:
+ *			- in case of %IBTRS_SRV_RDMA_EV_RECV contains
+ *			data sent by the client
+ *			- in case of %IBTRS_SRV_RDMA_EV_WRITE_REQ points to the
+ *			memory area where the response is to be written to
+ *	@datalen:	Size of the memory area in @data
+ *	@usr:		The extra user message sent by the client (%vec)
+ *	@usrlen:	Size of the user message
+ */
+typedef int (rdma_ev_fn)(struct ibtrs_srv *sess, void *priv,
+			 struct ibtrs_srv_op *id, int dir,
+			 void *data, size_t datalen, const void *usr,
+			 size_t usrlen);
+
+/**
+ * link_ev_fn():	Events about connectivity state changes
+ *			If the callback returns != 0 for the event
+ *			%IBTRS_SRV_LINK_EV_CONNECTED, the corresponding session
+ *			will be destroyed.
+ *	@sess:		Session
+ *	@ev:		event
+ *	@priv:		Private data from user if previously set with
+ *			ibtrs_srv_set_sess_priv()
+ */
+typedef int (link_ev_fn)(struct ibtrs_srv *sess, enum ibtrs_srv_link_ev ev,
+			 void *priv);
+
+/**
+ * ibtrs_srv_open() - open IBTRS server context
+ * @rdma_ev:		callback for RDMA events
+ * @link_ev:		callback for link events
+ * @port:		port to listen on
+ *
+ * Creates a server context. Returns a valid pointer or ERR_PTR() on failure.
+ */
+struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port);
+
+/**
+ * ibtrs_srv_close() - close IBTRS server context
+ * @ctx: pointer to server context
+ *
+ * Closes IBTRS server context with all client sessions.
+ */
+void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx);
+
+/**
+ * ibtrs_srv_resp_rdma() - Finish an RDMA request
+ *
+ * @id:		Internal IBTRS operation identifier
+ * @errno:	Response code sent to the other side for this operation.
+ *		0 = success, <0 error
+ *
+ * Finish a RDMA operation. A message is sent to the client and the
+ * corresponding memory areas will be released.
+ */
+void ibtrs_srv_resp_rdma(struct ibtrs_srv_op *id, int errno);
+
+/**
+ * ibtrs_srv_set_sess_priv() - Set private pointer in ibtrs_srv.
+ * @sess:	Session
+ * @priv:	The private pointer that is associated with the session.
+ */
+void ibtrs_srv_set_sess_priv(struct ibtrs_srv *sess, void *priv);
+
+/**
+ * ibtrs_srv_get_queue_depth() - Get ibtrs_srv qdepth.
+ * @sess:	Session
+ */
+int ibtrs_srv_get_queue_depth(struct ibtrs_srv *sess);
+
+/**
+ * ibtrs_srv_get_sess_name() - Get ibtrs_srv peer hostname.
+ * @sess:	Session
+ * @sessname:	Sessname buffer
+ * @len:	Length of sessname buffer
+ */
+int ibtrs_srv_get_sess_name(struct ibtrs_srv *sess, char *sessname, size_t len);
+
+/**
+ * ibtrs_addr_to_sockaddr() - convert path string "src,dst" to sockaddrs
+ * @str:	string containing source and destination addr of a path
+ *		separated by comma, e.g. "ip:1.1.1.1,ip:1.1.1.2". If str
+ *		contains only one address it's considered to be destination.
+ * @len:	string length
+ * @addr->dst	will be set to the destination sockaddr.
+ * @addr->src	will be set to the source address or to NULL
+ *		if str doesn't contain any source address.
+ *
+ * Returns zero if conversion successful. Non-zero otherwise.
+ */
+int ibtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct ibtrs_addr *addr);
+#endif
-- 
2.13.1

* [PATCH v2 04/26] ibtrs: private headers with IBTRS protocol structs and helpers
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (2 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 03/26] ibtrs: public interface header to establish RDMA connections Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 05/26] ibtrs: core: lib functions shared between client and server modules Roman Pen
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

These are common private headers with IBTRS protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.
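
To illustrate the immediate-data layout defined in ibtrs-pri.h, here is
a minimal user-space sketch of the packing done by ibtrs_to_imm() and
ibtrs_to_io_rsp_imm(). It is not part of the patch: the sketch_* names
are made up for this example, stdint types stand in for the kernel's
u32, and the I/O response payload is split out into its own helper for
readability.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bit widths copied from enum ibtrs_imm_const in ibtrs-pri.h. */
#define SKETCH_IMM_TYPE_BITS 4
#define SKETCH_IMM_PAYL_BITS 28
#define SKETCH_IMM_TYPE_MASK ((1u << SKETCH_IMM_TYPE_BITS) - 1)
#define SKETCH_IMM_PAYL_MASK ((1u << SKETCH_IMM_PAYL_BITS) - 1)

/* Pack a 4-bit type into the top bits and a 28-bit payload below it. */
static uint32_t sketch_to_imm(uint32_t type, uint32_t payload)
{
	return ((type & SKETCH_IMM_TYPE_MASK) << SKETCH_IMM_PAYL_BITS) |
		(payload & SKETCH_IMM_PAYL_MASK);
}

/* Split a 32-bit immediate back into type and payload. */
static void sketch_from_imm(uint32_t imm, uint32_t *type, uint32_t *payload)
{
	assert(type && payload);
	*payload = imm & SKETCH_IMM_PAYL_MASK;
	*type = imm >> SKETCH_IMM_PAYL_BITS;
}

/* I/O response payload: 9 bits of |err| above 19 bits of msg_id. */
static uint32_t sketch_to_io_rsp_payload(uint32_t msg_id, int err)
{
	return ((uint32_t)abs(err) & 0x1ff) << 19 | (msg_id & 0x7ffff);
}

/* Inverse of the above; the error comes back as a negative value. */
static void sketch_from_io_rsp_payload(uint32_t payload, uint32_t *msg_id,
				       int *err)
{
	assert(msg_id && err);
	*msg_id = payload & 0x7ffff;
	*err = -(int)((payload >> 19) & 0x1ff);
}
```

Packing type 1 (the value of IBTRS_IO_RSP_IMM) with msg_id 42 and
error -5, then unpacking, gives back the same three values. Only 9 bits
of the error code survive the round trip, so codes above 511 would be
truncated.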

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h |  91 ++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h | 459 +++++++++++++++++++++++++++++++
 2 files changed, 550 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-log.h b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
new file mode 100644
index 000000000000..f56257eabdee
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
@@ -0,0 +1,91 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_LOG_H
+#define IBTRS_LOG_H
+
+#define P1 )
+#define P2 ))
+#define P3 )))
+#define P4 ))))
+#define P(N) P ## N
+
+#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
+#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
+
+#define LIST(...)						\
+	__VA_ARGS__,						\
+	({ unknown_type(); NULL; })				\
+	CAT(P, COUNT_ARGS(__VA_ARGS__))				\
+
+#define EMPTY()
+#define DEFER(id) id EMPTY()
+
+#define _CASE(obj, type, member)				\
+	__builtin_choose_expr(					\
+	__builtin_types_compatible_p(				\
+		typeof(obj), type),				\
+		((type)obj)->member
+#define CASE(o, t, m) DEFER(_CASE)(o,t,m)
+
+/*
+ * Below we define how to retrieve sessname from common IBTRS types.
+ * Client or server related types have to be defined by the special
+ * TYPES_TO_SESSNAME macro.
+ */
+
+void unknown_type(void);
+
+#ifndef TYPES_TO_SESSNAME
+#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
+#endif
+
+#define ibtrs_prefix(obj)					\
+	_CASE(obj, struct ibtrs_con *,  sess->sessname),	\
+	_CASE(obj, struct ibtrs_sess *, sessname),		\
+	TYPES_TO_SESSNAME(obj)					\
+	))
+
+#define ibtrs_log(fn, obj, fmt, ...)				\
+	fn("<%s>: " fmt, ibtrs_prefix(obj), ##__VA_ARGS__)
+
+#define ibtrs_err(obj, fmt, ...)	\
+	ibtrs_log(pr_err, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_err_rl(obj, fmt, ...)	\
+	ibtrs_log(pr_err_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn(obj, fmt, ...)	\
+	ibtrs_log(pr_warn, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn_rl(obj, fmt, ...) \
+	ibtrs_log(pr_warn_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info(obj, fmt, ...) \
+	ibtrs_log(pr_info, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info_rl(obj, fmt, ...) \
+	ibtrs_log(pr_info_ratelimited, obj, fmt, ##__VA_ARGS__)
+
+#endif /* IBTRS_LOG_H */
diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
new file mode 100644
index 000000000000..40647f066840
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
@@ -0,0 +1,459 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_PRI_H
+#define IBTRS_PRI_H
+
+#include <linux/uuid.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib.h>
+
+#include "ibtrs.h"
+
+#define IBTRS_PROTO_VER_MAJOR 2
+#define IBTRS_PROTO_VER_MINOR 0
+
+#define IBTRS_PROTO_VER_STRING __stringify(IBTRS_PROTO_VER_MAJOR) "." \
+			       __stringify(IBTRS_PROTO_VER_MINOR)
+
+#ifndef IBTRS_VER_STRING
+#define IBTRS_VER_STRING __stringify(IBTRS_PROTO_VER_MAJOR) "." \
+			 __stringify(IBTRS_PROTO_VER_MINOR)
+#endif
+
+enum ibtrs_imm_const {
+	MAX_IMM_TYPE_BITS = 4,
+	MAX_IMM_TYPE_MASK = ((1 << MAX_IMM_TYPE_BITS) - 1),
+	MAX_IMM_PAYL_BITS = 28,
+	MAX_IMM_PAYL_MASK = ((1 << MAX_IMM_PAYL_BITS) - 1),
+};
+
+enum ibtrs_imm_type {
+	IBTRS_IO_REQ_IMM       = 0, /* client to server */
+	IBTRS_IO_RSP_IMM       = 1, /* server to client */
+	IBTRS_IO_RSP_W_INV_IMM = 2, /* server to client */
+
+	IBTRS_HB_MSG_IMM = 8,
+	IBTRS_HB_ACK_IMM = 9,
+
+	IBTRS_LAST_IMM,
+};
+
+enum {
+	SERVICE_CON_QUEUE_DEPTH = 512,
+
+	MIN_RTR_CNT = 1,
+	MAX_RTR_CNT = 7,
+
+	MAX_PATHS_NUM = 128,
+
+	/*
+	 * With the current size of the tag allocated on the client, 4K
+	 * is the maximum number of tags we can allocate.  This number is
+	 * also used on the client to allocate the IU for the user connection
+	 * to receive the RDMA addresses from the server.
+	 */
+	MAX_SESS_QUEUE_DEPTH = 4096,
+
+	IBTRS_HB_INTERVAL_MS = 5000,
+	IBTRS_HB_MISSED_MAX = 5,
+
+	IBTRS_MAGIC = 0x1BBD,
+	IBTRS_PROTO_VER = (IBTRS_PROTO_VER_MAJOR << 8) | IBTRS_PROTO_VER_MINOR,
+};
+
+struct ibtrs_ib_dev;
+
+struct ibtrs_ib_dev_pool_ops {
+	struct ibtrs_ib_dev *(*alloc)(void);
+	void (*free)(struct ibtrs_ib_dev *);
+	int (*init)(struct ibtrs_ib_dev *);
+	void (*deinit)(struct ibtrs_ib_dev *);
+};
+
+struct ibtrs_ib_dev_pool {
+	struct mutex		mutex;
+	struct list_head	list;
+	enum ib_pd_flags	pd_flags;
+	const struct ibtrs_ib_dev_pool_ops *ops;
+};
+
+struct ibtrs_ib_dev {
+	struct ib_device	 *ib_dev;
+	struct ib_pd		 *ib_pd;
+	struct kref		 ref;
+	struct list_head	 entry;
+	struct ibtrs_ib_dev_pool *pool;
+};
+
+struct ibtrs_con {
+	struct ibtrs_sess	*sess;
+	struct ib_qp		*qp;
+	struct ib_cq		*cq;
+	struct rdma_cm_id	*cm_id;
+	unsigned		cid;
+};
+
+typedef void (ibtrs_hb_handler_t)(struct ibtrs_con *con, int err);
+
+struct ibtrs_sess {
+	struct list_head	entry;
+	struct sockaddr_storage dst_addr;
+	struct sockaddr_storage src_addr;
+	char			sessname[NAME_MAX];
+	uuid_t			uuid;
+	struct ibtrs_con	**con;
+	unsigned int		con_num;
+	unsigned int		recon_cnt;
+	struct ibtrs_ib_dev	*dev;
+	int			dev_ref;
+	struct ib_cqe		*hb_cqe;
+	ibtrs_hb_handler_t	*hb_err_handler;
+	struct workqueue_struct *hb_wq;
+	struct delayed_work	hb_dwork;
+	unsigned		hb_interval_ms;
+	unsigned		hb_missed_cnt;
+	unsigned		hb_missed_max;
+};
+
+struct ibtrs_iu {
+	struct list_head        list;
+	struct ib_cqe           cqe;
+	dma_addr_t              dma_addr;
+	void                    *buf;
+	size_t                  size;
+	enum dma_data_direction direction;
+	u32			tag;
+};
+
+/**
+ * enum ibtrs_msg_types - IBTRS message types.
+ * @IBTRS_MSG_INFO_REQ:		Client additional info request to the server
+ * @IBTRS_MSG_INFO_RSP:		Server additional info response to the client
+ * @IBTRS_MSG_WRITE:		Client writes data via RDMA to server
+ * @IBTRS_MSG_READ:		Client requests data transfer from server
+ */
+enum ibtrs_msg_types {
+	IBTRS_MSG_INFO_REQ,
+	IBTRS_MSG_INFO_RSP,
+	IBTRS_MSG_WRITE,
+	IBTRS_MSG_READ,
+};
+
+/**
+ * enum ibtrs_msg_flags - IBTRS message flags.
+ * @IBTRS_MSG_NEED_INVAL_F:	Send invalidation in response.
+ */
+enum ibtrs_msg_flags {
+	IBTRS_MSG_NEED_INVAL_F = 1<<0
+};
+
+/**
+ * struct ibtrs_sg_desc - RDMA-Buffer entry description
+ * @addr:	Address of RDMA destination buffer
+ * @key:	Authorization rkey to write to the buffer
+ * @len:	Size of the buffer
+ */
+struct ibtrs_sg_desc {
+	__le64			addr;
+	__le32			key;
+	__le32			len;
+};
+
+/**
+ * struct ibtrs_msg_conn_req - Client connection request to the server
+ * @magic:	   IBTRS magic
+ * @version:	   IBTRS protocol version
+ * @cid:	   Current connection id
+ * @cid_num:	   Number of connections per session
+ * @recon_cnt:	   Reconnections counter
+ * @sess_uuid:	   UUID of a session (path)
+ * @paths_uuid:	   UUID of a group of sessions (paths)
+ *
+ * NOTE: max size 56 bytes, see man rdma_connect().
+ */
+struct ibtrs_msg_conn_req {
+	u8		__cma_version; /* Is set to 0 by cma.c in case of
+					* AF_IB, do not touch that. */
+	u8		__ip_version;  /* On sender side that should be set
+					* to 0, or cma_save_ip_info() will
+					* extract garbage and fail. */
+	__le16		magic;
+	__le16		version;
+	__le16		cid;
+	__le16		cid_num;
+	__le16		recon_cnt;
+	uuid_t		sess_uuid;
+	uuid_t		paths_uuid;
+	u8		reserved[12];
+};
+
+/**
+ * struct ibtrs_msg_conn_rsp - Server connection response to the client
+ * @magic:	   IBTRS magic
+ * @version:	   IBTRS protocol version
+ * @errno:	   If rdma_accept() then 0, if rdma_reject() indicates error
+ * @queue_depth:   max inflight messages (queue-depth) in this session
+ * @max_io_size:   max io size server supports
+ * @max_hdr_size:  max msg header size server supports
+ *
+ * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
+ */
+struct ibtrs_msg_conn_rsp {
+	__le16		magic;
+	__le16		version;
+	__le16		errno;
+	__le16		queue_depth;
+	__le32		max_io_size;
+	__le32		max_hdr_size;
+	u8		reserved[40];
+};
+
+/**
+ * struct ibtrs_msg_info_req
+ * @type:		@IBTRS_MSG_INFO_REQ
+ * @sessname:		Session name chosen by client
+ */
+struct ibtrs_msg_info_req {
+	__le16		type;
+	u8		sessname[NAME_MAX];
+	u8		reserved[15];
+};
+
+/**
+ * struct ibtrs_msg_info_rsp
+ * @type:		@IBTRS_MSG_INFO_RSP
+ * @sg_cnt:		Number of @desc entries
+ * @desc:		RDMA buffers where the client can write to server
+ */
+struct ibtrs_msg_info_rsp {
+	__le16		type;
+	__le16          sg_cnt;
+	u8              reserved[4];
+	struct ibtrs_sg_desc desc[];
+};
+
+/**
+ * struct ibtrs_msg_rdma_read - RDMA data transfer request from client
+ * @type:		always @IBTRS_MSG_READ
+ * @usr_len:		length of user payload
+ * @sg_cnt:		number of @desc entries
+ * @desc:		RDMA buffers where the server can write the result to
+ */
+struct ibtrs_msg_rdma_read {
+	__le16			type;
+	__le16			usr_len;
+	__le16			flags;
+	__le16			sg_cnt;
+	struct ibtrs_sg_desc    desc[];
+};
+
+/**
+ * struct ibtrs_msg_rdma_write - Message transferred to server with RDMA-Write
+ * @type:		always @IBTRS_MSG_WRITE
+ * @usr_len:		length of user payload
+ */
+struct ibtrs_msg_rdma_write {
+	__le16			type;
+	__le16			usr_len;
+};
+
+/* ibtrs.c */
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t t,
+				struct ib_device *dev, enum dma_data_direction,
+				void (*done)(struct ib_cq *cq, struct ib_wc *wc));
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+		   struct ib_device *dev);
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu);
+int ibtrs_iu_post_send(struct ibtrs_con *con, struct ibtrs_iu *iu, size_t size,
+		       struct ib_send_wr *head);
+int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
+				 struct ib_sge *sge, unsigned int num_sge,
+				 u32 rkey, u64 rdma_addr, u32 imm_data,
+				 enum ib_send_flags flags,
+				 struct ib_send_wr *head);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe);
+int ibtrs_post_recv_empty_x2(struct ibtrs_con *con, struct ib_cqe *cqe);
+int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
+				    u32 imm_data, enum ib_send_flags flags,
+				    struct ib_send_wr *head);
+
+int ibtrs_cq_qp_create(struct ibtrs_sess *ibtrs_sess, struct ibtrs_con *con,
+		       u32 max_send_sge, int cq_vector, u16 cq_size,
+		       u16 wr_queue_size, enum ib_poll_context poll_ctx);
+void ibtrs_cq_qp_destroy(struct ibtrs_con *con);
+
+void ibtrs_init_hb(struct ibtrs_sess *sess, struct ib_cqe *cqe,
+		   unsigned interval_ms, unsigned missed_max,
+		   ibtrs_hb_handler_t *err_handler,
+		   struct workqueue_struct *wq);
+void ibtrs_start_hb(struct ibtrs_sess *sess);
+void ibtrs_stop_hb(struct ibtrs_sess *sess);
+void ibtrs_send_hb_ack(struct ibtrs_sess *sess);
+
+void ibtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
+			    struct ibtrs_ib_dev_pool *pool);
+void ibtrs_ib_dev_pool_deinit(struct ibtrs_ib_dev_pool *pool);
+
+struct ibtrs_ib_dev *ibtrs_ib_dev_find_or_add(struct ib_device *ib_dev,
+					      struct ibtrs_ib_dev_pool *pool);
+int ibtrs_ib_dev_put(struct ibtrs_ib_dev *dev);
+
+static inline int sockaddr_cmp(const struct sockaddr *a,
+			       const struct sockaddr *b)
+{
+	switch (a->sa_family) {
+	case AF_IB:
+		return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
+			      &((struct sockaddr_ib *)b)->sib_addr,
+			      sizeof(struct ib_addr));
+	case AF_INET:
+		return memcmp(&((struct sockaddr_in *)a)->sin_addr,
+			      &((struct sockaddr_in *)b)->sin_addr,
+			      sizeof(struct in_addr));
+	case AF_INET6:
+		return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
+			      &((struct sockaddr_in6 *)b)->sin6_addr,
+			      sizeof(struct in6_addr));
+	default:
+		return -ENOENT;
+	}
+}
+
+static inline void sockaddr_to_str(const struct sockaddr *addr,
+				   char *buf, size_t len)
+{
+	switch (addr->sa_family) {
+	case AF_IB:
+		scnprintf(buf, len, "gid:%pI6",
+			  &((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
+		return;
+	case AF_INET:
+		scnprintf(buf, len, "ip:%pI4",
+			  &((struct sockaddr_in *)addr)->sin_addr);
+		return;
+	case AF_INET6:
+		scnprintf(buf, len, "ip:%pI6c",
+			  &((struct sockaddr_in6 *)addr)->sin6_addr);
+		return;
+	}
+	scnprintf(buf, len, "<invalid address family>");
+	pr_err("Invalid address family\n");
+}
+
+/**
+ * ibtrs_invalidate_flag() - returns proper flags for invalidation
+ *
+ * NOTE: This function is needed for the compat layer, so think twice before
+ *       renaming or removing it.
+ */
+static inline u32 ibtrs_invalidate_flag(void)
+{
+	return IBTRS_MSG_NEED_INVAL_F;
+}
+
+static inline u32 ibtrs_to_imm(u32 type, u32 payload)
+{
+	BUILD_BUG_ON(32 != MAX_IMM_PAYL_BITS + MAX_IMM_TYPE_BITS);
+	BUILD_BUG_ON(IBTRS_LAST_IMM > (1<<MAX_IMM_TYPE_BITS));
+	return ((type & MAX_IMM_TYPE_MASK) << MAX_IMM_PAYL_BITS) |
+		(payload & MAX_IMM_PAYL_MASK);
+}
+
+static inline void ibtrs_from_imm(u32 imm, u32 *type, u32 *payload)
+{
+	*payload = (imm & MAX_IMM_PAYL_MASK);
+	*type = (imm >> MAX_IMM_PAYL_BITS);
+}
+
+static inline u32 ibtrs_to_io_req_imm(u32 addr)
+{
+	return ibtrs_to_imm(IBTRS_IO_REQ_IMM, addr);
+}
+
+static inline u32 ibtrs_to_io_rsp_imm(u32 msg_id, int errno, bool w_inval)
+{
+	enum ibtrs_imm_type type;
+	u32 payload;
+
+	/* 9 bits for errno, 19 bits for msg_id */
+	payload = (abs(errno) & 0x1ff) << 19 | (msg_id & 0x7ffff);
+	type = (w_inval ? IBTRS_IO_RSP_W_INV_IMM : IBTRS_IO_RSP_IMM);
+
+	return ibtrs_to_imm(type, payload);
+}
+
+static inline void ibtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
+{
+	/* 9 bits for errno, 19 bits for msg_id */
+	*msg_id = (payload & 0x7ffff);
+	*errno = -(int)((payload >> 19) & 0x1ff);
+}
+
+#define STAT_STORE_FUNC(type, store, reset)				\
+static ssize_t store##_store(struct kobject *kobj,			\
+			     struct kobj_attribute *attr,		\
+			     const char *buf, size_t count)		\
+{									\
+	int ret = -EINVAL;						\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	if (sysfs_streq(buf, "1"))					\
+		ret = reset(&sess->stats, true);			\
+	else if (sysfs_streq(buf, "0"))					\
+		ret = reset(&sess->stats, false);			\
+	if (ret)							\
+		return ret;						\
+									\
+	return count;							\
+}
+
+#define STAT_SHOW_FUNC(type, show, print)				\
+static ssize_t show##_show(struct kobject *kobj,			\
+			   struct kobj_attribute *attr,			\
+			   char *page)					\
+{									\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	return print(&sess->stats, page, PAGE_SIZE);			\
+}
+
+#define STAT_ATTR(type, stat, print, reset)				\
+STAT_STORE_FUNC(type, stat, reset)					\
+STAT_SHOW_FUNC(type, stat, print)					\
+static struct kobj_attribute stat##_attr =				\
+		__ATTR(stat, 0644,					\
+		       stat##_show,					\
+		       stat##_store)
+
+#endif /* IBTRS_PRI_H */
-- 
2.13.1

* [PATCH v2 05/26] ibtrs: core: lib functions shared between client and server modules
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (3 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 04/26] ibtrs: private headers with IBTRS protocol structs and helpers Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 06/26] ibtrs: client: private header with client structs and functions Roman Pen
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is a set of library functions that live in the ibtrs-core module
and are used by both the client and server modules.

Mostly these functions wrap IB and RDMA calls and provide a slightly
higher level of abstraction for implementing the IBTRS protocol on the
client and server sides.
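
The heartbeat helper is the one piece here with its own state, so a
small user-space model of the decision made on each timer tick in
hb_work() may help review. This is only a sketch: the sketch_* names
are hypothetical, and resetting the counter when the peer is seen alive
is an assumption about the callers, since that code is not in this
patch.

```c
#include <assert.h>

enum sketch_hb_action {
	SKETCH_HB_SEND,		/* post a heartbeat and re-arm the timer */
	SKETCH_HB_SKIP,		/* re-arm the timer without sending */
	SKETCH_HB_FAIL,		/* report -ETIMEDOUT to the error handler */
};

struct sketch_hb {
	unsigned int missed_cnt;
	unsigned int missed_max;
};

/* One timer tick, with the checks in the same order as in hb_work(). */
static enum sketch_hb_action sketch_hb_tick(struct sketch_hb *hb)
{
	assert(hb);
	if (hb->missed_cnt > hb->missed_max)
		return SKETCH_HB_FAIL;
	if (hb->missed_cnt++)
		return SKETCH_HB_SKIP;
	return SKETCH_HB_SEND;
}

/*
 * Assumption: whoever handles the peer's heartbeat reply resets the
 * counter. That code lives outside this patch.
 */
static void sketch_hb_on_peer_alive(struct sketch_hb *hb)
{
	assert(hb);
	hb->missed_cnt = 0;
}
```

With missed_max set to 2 and no reply from the peer, four consecutive
ticks yield SEND, SKIP, SKIP, FAIL. In other words a heartbeat is sent
only while the counter is zero, and the error handler fires on the
first tick after the counter exceeds missed_max.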

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs.c | 609 +++++++++++++++++++++++++++++++++++
 1 file changed, 609 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.c b/drivers/infiniband/ulp/ibtrs/ibtrs.c
new file mode 100644
index 000000000000..39a933fe528e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.c
@@ -0,0 +1,609 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/inet.h>
+
+#include "ibtrs-pri.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Core");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t gfp_mask,
+				struct ib_device *dma_dev,
+				enum dma_data_direction direction,
+				void (*done)(struct ib_cq *cq,
+					     struct ib_wc *wc))
+{
+	struct ibtrs_iu *iu;
+
+	iu = kmalloc(sizeof(*iu), gfp_mask);
+	if (unlikely(!iu))
+		return NULL;
+
+	iu->buf = kzalloc(size, gfp_mask);
+	if (unlikely(!iu->buf))
+		goto err1;
+
+	iu->dma_addr = ib_dma_map_single(dma_dev, iu->buf, size, direction);
+	if (unlikely(ib_dma_mapping_error(dma_dev, iu->dma_addr)))
+		goto err2;
+
+	iu->cqe.done  = done;
+	iu->size      = size;
+	iu->direction = direction;
+	iu->tag       = tag;
+
+	return iu;
+
+err2:
+	kfree(iu->buf);
+err1:
+	kfree(iu);
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_alloc);
+
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+		   struct ib_device *ibdev)
+{
+	if (!iu)
+		return;
+
+	ib_dma_unmap_single(ibdev, iu->dma_addr, iu->size, dir);
+	kfree(iu->buf);
+	kfree(iu);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_free);
+
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu)
+{
+	struct ibtrs_sess *sess = con->sess;
+	struct ib_recv_wr wr, *bad_wr;
+	struct ib_sge list;
+
+	list.addr   = iu->dma_addr;
+	list.length = iu->size;
+	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
+
+	if (WARN_ON(list.length == 0)) {
+		ibtrs_wrn(con, "Posting receive work request failed,"
+			  " sg list is empty\n");
+		return -EINVAL;
+	}
+
+	wr.next    = NULL;
+	wr.wr_cqe  = &iu->cqe;
+	wr.sg_list = &list;
+	wr.num_sge = 1;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_recv);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+	struct ib_recv_wr wr, *bad_wr;
+
+	wr.next    = NULL;
+	wr.wr_cqe  = cqe;
+	wr.sg_list = NULL;
+	wr.num_sge = 0;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);
+
+int ibtrs_post_recv_empty_x2(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+	struct ib_recv_wr wr_arr[2], *wr, *bad_wr;
+	int i;
+
+	memset(wr_arr, 0, sizeof(wr_arr));
+	for (i = 0; i < ARRAY_SIZE(wr_arr); i++) {
+		wr = &wr_arr[i];
+		wr->wr_cqe  = cqe;
+		if (i)
+			/* Chain backwards */
+			wr->next = &wr_arr[i - 1];
+	}
+
+	return ib_post_recv(con->qp, wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty_x2);
+
+int ibtrs_iu_post_send(struct ibtrs_con *con, struct ibtrs_iu *iu, size_t size,
+		       struct ib_send_wr *head)
+{
+	struct ibtrs_sess *sess = con->sess;
+	struct ib_send_wr wr, *bad_wr;
+	struct ib_sge list;
+
+	if (WARN_ON(size == 0))
+		return -EINVAL;
+
+	list.addr   = iu->dma_addr;
+	list.length = size;
+	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.next       = NULL;
+	wr.wr_cqe     = &iu->cqe;
+	wr.sg_list    = &list;
+	wr.num_sge    = 1;
+	wr.opcode     = IB_WR_SEND;
+	wr.send_flags = IB_SEND_SIGNALED;
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr;
+	}
+	else
+		head = &wr;
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_send);
+
+int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
+				 struct ib_sge *sge, unsigned int num_sge,
+				 u32 rkey, u64 rdma_addr, u32 imm_data,
+				 enum ib_send_flags flags,
+				 struct ib_send_wr *head)
+{
+	struct ib_send_wr *bad_wr;
+	struct ib_rdma_wr wr;
+	int i;
+
+	wr.wr.next	  = NULL;
+	wr.wr.wr_cqe	  = &iu->cqe;
+	wr.wr.sg_list	  = sge;
+	wr.wr.num_sge	  = num_sge;
+	wr.rkey		  = rkey;
+	wr.remote_addr	  = rdma_addr;
+	wr.wr.opcode	  = IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.wr.ex.imm_data = cpu_to_be32(imm_data);
+	wr.wr.send_flags  = flags;
+
+	/*
+	 * If one of the sges has 0 size, the operation will fail with a
+	 * length error
+	 */
+	for (i = 0; i < num_sge; i++)
+		if (WARN_ON(sge[i].length == 0))
+			return -EINVAL;
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr.wr;
+	}
+	else
+		head = &wr.wr;
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_rdma_write_imm);
+
+int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
+				    u32 imm_data, enum ib_send_flags flags,
+				    struct ib_send_wr *head)
+{
+	struct ib_send_wr wr, *bad_wr;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.wr_cqe	= cqe;
+	wr.send_flags	= flags;
+	wr.opcode	= IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.ex.imm_data	= cpu_to_be32(imm_data);
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr;
+	}
+	else
+		head = &wr;
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_rdma_write_imm_empty);
+
+static void qp_event_handler(struct ib_event *ev, void *ctx)
+{
+	struct ibtrs_con *con = ctx;
+
+	switch (ev->event) {
+	case IB_EVENT_COMM_EST:
+		ibtrs_info(con, "QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		rdma_notify(con->cm_id, IB_EVENT_COMM_EST);
+		break;
+	default:
+		ibtrs_info(con, "Unhandled QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		break;
+	}
+}
+
+static int create_cq(struct ibtrs_con *con, int cq_vector, u16 cq_size,
+		     enum ib_poll_context poll_ctx)
+{
+	struct rdma_cm_id *cm_id = con->cm_id;
+	struct ib_cq *cq;
+
+	cq = ib_alloc_cq(cm_id->device, con, cq_size,
+			 cq_vector, poll_ctx);
+	if (unlikely(IS_ERR(cq))) {
+		ibtrs_err(con, "Creating completion queue failed, errno: %ld\n",
+			  PTR_ERR(cq));
+		return PTR_ERR(cq);
+	}
+	con->cq = cq;
+
+	return 0;
+}
+
+static int create_qp(struct ibtrs_con *con, struct ib_pd *pd,
+		     u16 wr_queue_size, u32 max_send_sge)
+{
+	struct ib_qp_init_attr init_attr = {NULL};
+	struct rdma_cm_id *cm_id = con->cm_id;
+	int ret;
+
+	init_attr.cap.max_send_wr = wr_queue_size;
+	init_attr.cap.max_recv_wr = wr_queue_size;
+	init_attr.cap.max_recv_sge = 1;
+	init_attr.event_handler = qp_event_handler;
+	init_attr.qp_context = con;
+	init_attr.cap.max_send_sge = max_send_sge;
+
+	init_attr.qp_type = IB_QPT_RC;
+	init_attr.send_cq = con->cq;
+	init_attr.recv_cq = con->cq;
+	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+
+	ret = rdma_create_qp(cm_id, pd, &init_attr);
+	if (unlikely(ret)) {
+		ibtrs_err(con, "Creating QP failed, err: %d\n", ret);
+		return ret;
+	}
+	con->qp = cm_id->qp;
+
+	return ret;
+}
+
+int ibtrs_cq_qp_create(struct ibtrs_sess *sess, struct ibtrs_con *con,
+		       u32 max_send_sge, int cq_vector, u16 cq_size,
+		       u16 wr_queue_size, enum ib_poll_context poll_ctx)
+{
+	int err;
+
+	err = create_cq(con, cq_vector, cq_size, poll_ctx);
+	if (unlikely(err))
+		return err;
+
+	err = create_qp(con, sess->dev->ib_pd, wr_queue_size, max_send_sge);
+	if (unlikely(err)) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+		return err;
+	}
+	con->sess = sess;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ibtrs_cq_qp_create);
+
+void ibtrs_cq_qp_destroy(struct ibtrs_con *con)
+{
+	if (con->qp) {
+		rdma_destroy_qp(con->cm_id);
+		con->qp = NULL;
+	}
+	if (con->cq) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(ibtrs_cq_qp_destroy);
+
+static void schedule_hb(struct ibtrs_sess *sess)
+{
+	queue_delayed_work(sess->hb_wq, &sess->hb_dwork,
+			   msecs_to_jiffies(sess->hb_interval_ms));
+}
+
+void ibtrs_send_hb_ack(struct ibtrs_sess *sess)
+{
+	struct ibtrs_con *usr_con = sess->con[0];
+	u32 imm;
+	int err;
+
+	imm = ibtrs_to_imm(IBTRS_HB_ACK_IMM, 0);
+	err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe, imm,
+					      IB_SEND_SIGNALED, NULL);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con, err);
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(ibtrs_send_hb_ack);
+
+static void hb_work(struct work_struct *work)
+{
+	struct ibtrs_con *usr_con;
+	struct ibtrs_sess *sess;
+	u32 imm;
+	int err;
+
+	sess = container_of(to_delayed_work(work), typeof(*sess), hb_dwork);
+	usr_con = sess->con[0];
+
+	if (sess->hb_missed_cnt > sess->hb_missed_max) {
+		sess->hb_err_handler(usr_con, -ETIMEDOUT);
+		return;
+	}
+	if (sess->hb_missed_cnt++) {
+		/* Reschedule work without sending hb */
+		schedule_hb(sess);
+		return;
+	}
+	imm = ibtrs_to_imm(IBTRS_HB_MSG_IMM, 0);
+	err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe, imm,
+					      IB_SEND_SIGNALED, NULL);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con, err);
+		return;
+	}
+
+	schedule_hb(sess);
+}
+
+void ibtrs_init_hb(struct ibtrs_sess *sess, struct ib_cqe *cqe,
+		   unsigned int interval_ms, unsigned int missed_max,
+		   ibtrs_hb_handler_t *err_handler,
+		   struct workqueue_struct *wq)
+{
+	sess->hb_cqe = cqe;
+	sess->hb_interval_ms = interval_ms;
+	sess->hb_err_handler = err_handler;
+	sess->hb_wq = wq;
+	sess->hb_missed_max = missed_max;
+	sess->hb_missed_cnt = 0;
+	INIT_DELAYED_WORK(&sess->hb_dwork, hb_work);
+}
+EXPORT_SYMBOL_GPL(ibtrs_init_hb);
+
+void ibtrs_start_hb(struct ibtrs_sess *sess)
+{
+	schedule_hb(sess);
+}
+EXPORT_SYMBOL_GPL(ibtrs_start_hb);
+
+void ibtrs_stop_hb(struct ibtrs_sess *sess)
+{
+	cancel_delayed_work_sync(&sess->hb_dwork);
+	sess->hb_missed_cnt = 0;
+	sess->hb_missed_max = 0;
+}
+EXPORT_SYMBOL_GPL(ibtrs_stop_hb);
+
+static int ibtrs_str_gid_to_sockaddr(const char *addr, size_t len,
+				     short port, struct sockaddr_storage *dst)
+{
+	struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
+	int ret;
+
+	/*
+	 * We can use some of the IPv6 functions since GID is a valid
+	 * IPv6 address format
+	 */
+	ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
+	if (ret == 0)
+		return -EINVAL;
+
+	dst_ib->sib_family = AF_IB;
+	/*
+	 * Use the same TCP server port number as the IB service ID
+	 * on the IB port space range
+	 */
+	dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | (u16)port);
+	dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+	dst_ib->sib_pkey = cpu_to_be16(0xffff);
+
+	return 0;
+}
+
+/**
+ * ibtrs_str_to_sockaddr() - Convert ibtrs address string to sockaddr
+ * @addr:	String representation of an addr (IPv4, IPv6 or IB GID):
+ *              - "ip:192.168.1.1"
+ *              - "ip:fe80::200:5aee:feaa:20a2"
+ *              - "gid:fe80::200:5aee:feaa:20a2"
+ * @len:	String address length
+ * @port:	Destination port
+ * @dst:	Destination sockaddr structure
+ *
+ * Return: 0 if the conversion succeeds, a negative errno otherwise.
+ */
+static int ibtrs_str_to_sockaddr(const char *addr, size_t len,
+				 short port, struct sockaddr_storage *dst)
+{
+	if (strncmp(addr, "gid:", 4) == 0) {
+		return ibtrs_str_gid_to_sockaddr(addr + 4, len - 4, port, dst);
+	} else if (strncmp(addr, "ip:", 3) == 0) {
+		char port_str[8];
+		char *cpy;
+		int err;
+
+		snprintf(port_str, sizeof(port_str), "%hu", (u16)port);
+		cpy = kstrndup(addr + 3, len - 3, GFP_KERNEL);
+		err = cpy ? inet_pton_with_scope(&init_net, AF_UNSPEC,
+						 cpy, port_str, dst) : -ENOMEM;
+		kfree(cpy);
+
+		return err;
+	}
+	return -EPROTONOSUPPORT;
+}
+
+int ibtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct ibtrs_addr *addr)
+{
+	const char *d;
+	int ret;
+
+	d = strchr(str, ',');
+	if (d) {
+		if (ibtrs_str_to_sockaddr(str, d - str, 0, addr->src))
+			return -EINVAL;
+		d += 1;
+		len -= d - str;
+		str  = d;
+
+	} else {
+		addr->src = NULL;
+	}
+	ret = ibtrs_str_to_sockaddr(str, len, port, addr->dst);
+
+	return ret;
+}
+EXPORT_SYMBOL(ibtrs_addr_to_sockaddr);
+
+void ibtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
+			    struct ibtrs_ib_dev_pool *pool)
+{
+	WARN_ON(pool->ops && (!pool->ops->alloc ^ !pool->ops->free));
+	INIT_LIST_HEAD(&pool->list);
+	mutex_init(&pool->mutex);
+	pool->pd_flags = pd_flags;
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_pool_init);
+
+void ibtrs_ib_dev_pool_deinit(struct ibtrs_ib_dev_pool *pool)
+{
+	WARN_ON(!list_empty(&pool->list));
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_pool_deinit);
+
+static void dev_free(struct kref *ref)
+{
+	struct ibtrs_ib_dev_pool *pool;
+	struct ibtrs_ib_dev *dev;
+
+	dev = container_of(ref, typeof(*dev), ref);
+	pool = dev->pool;
+
+	mutex_lock(&pool->mutex);
+	list_del(&dev->entry);
+	mutex_unlock(&pool->mutex);
+
+	if (pool->ops && pool->ops->deinit)
+		pool->ops->deinit(dev);
+
+	ib_dealloc_pd(dev->ib_pd);
+
+	if (pool->ops && pool->ops->free)
+		pool->ops->free(dev);
+	else
+		kfree(dev);
+}
+
+int ibtrs_ib_dev_put(struct ibtrs_ib_dev *dev)
+{
+	return kref_put(&dev->ref, dev_free);
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_put);
+
+static int ibtrs_ib_dev_get(struct ibtrs_ib_dev *dev)
+{
+	return kref_get_unless_zero(&dev->ref);
+}
+
+struct ibtrs_ib_dev *
+ibtrs_ib_dev_find_or_add(struct ib_device *ib_dev,
+			 struct ibtrs_ib_dev_pool *pool)
+{
+	struct ibtrs_ib_dev *dev;
+
+	mutex_lock(&pool->mutex);
+	list_for_each_entry(dev, &pool->list, entry) {
+		if (dev->ib_dev->node_guid == ib_dev->node_guid &&
+		    ibtrs_ib_dev_get(dev))
+			goto out_unlock;
+	}
+	if (pool->ops && pool->ops->alloc)
+		dev = pool->ops->alloc();
+	else
+		dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (unlikely(IS_ERR_OR_NULL(dev)))
+		goto out_err;
+
+	kref_init(&dev->ref);
+	dev->pool = pool;
+	dev->ib_dev = ib_dev;
+	dev->ib_pd = ib_alloc_pd(ib_dev, pool->pd_flags);
+	if (unlikely(IS_ERR(dev->ib_pd)))
+		goto out_free_dev;
+
+	if (pool->ops && pool->ops->init && pool->ops->init(dev))
+		goto out_free_pd;
+
+	list_add(&dev->entry, &pool->list);
+out_unlock:
+	mutex_unlock(&pool->mutex);
+	return dev;
+
+out_free_pd:
+	ib_dealloc_pd(dev->ib_pd);
+out_free_dev:
+	if (pool->ops && pool->ops->free)
+		pool->ops->free(dev);
+	else
+		kfree(dev);
+out_err:
+	mutex_unlock(&pool->mutex);
+	return NULL;
+}
+EXPORT_SYMBOL(ibtrs_ib_dev_find_or_add);
-- 
2.13.1

* [PATCH v2 06/26] ibtrs: client: private header with client structs and functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (4 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 05/26] ibtrs: core: lib functions shared between client and server modules Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 07/26] ibtrs: client: main functionality Roman Pen
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This header describes the main structs and functions used by the
ibtrs-client module, mainly for managing IBTRS sessions, creating and
destroying sysfs entries, and accounting statistics on the client side.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h | 315 +++++++++++++++++++++++++++++++
 1 file changed, 315 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
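
Note on the tag layout behind TAG_SIZE() and GET_TAG(): each slot of
clt->tags holds a struct ibtrs_tag immediately followed by pdu_sz bytes
of caller-private data, which is why ibtrs_tag_to_pdu() and
ibtrs_tag_from_pdu() are plain pointer arithmetic.  Below is a minimal
userspace sketch of that arithmetic, not the kernel code.  The names
struct tag, struct tag_pool and tag_pool_init() are hypothetical
stand-ins, and the sketch assumes that mem_id is set to the slot index
at allocation time (as the WARN_ON() in __ibtrs_get_tag() suggests) and
that pdu_sz keeps every slot aligned for struct tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct ibtrs_tag. */
struct tag {
	int con_type;
	unsigned int cpu_id;
	unsigned int mem_id;
	unsigned int mem_off;
};

/* Hypothetical stand-in for the struct ibtrs_clt fields used here. */
struct tag_pool {
	unsigned char *tags;	/* queue_depth slots of tag + PDU */
	size_t queue_depth;
	size_t pdu_sz;
};

/* Mirrors TAG_SIZE(): one slot is the tag header plus the PDU. */
static size_t tag_size(const struct tag_pool *p)
{
	return sizeof(struct tag) + p->pdu_sz;
}

/* Mirrors GET_TAG(): index into the flat slot array. */
static struct tag *get_tag(const struct tag_pool *p, size_t idx)
{
	assert(idx < p->queue_depth);
	return (struct tag *)(p->tags + tag_size(p) * idx);
}

/* The PDU starts right behind the tag header. */
static void *tag_to_pdu(struct tag *tag)
{
	return tag + 1;
}

/* Inverse of tag_to_pdu(): step back over the tag header. */
static struct tag *tag_from_pdu(void *pdu)
{
	return (struct tag *)((unsigned char *)pdu - sizeof(struct tag));
}

/* Assumption: mem_id is initialised to the slot index. */
static int tag_pool_init(struct tag_pool *p, size_t queue_depth,
			 size_t pdu_sz)
{
	size_t i;

	/* Assumption: pdu_sz keeps each slot aligned for struct tag. */
	assert(pdu_sz % sizeof(unsigned int) == 0);

	p->queue_depth = queue_depth;
	p->pdu_sz = pdu_sz;
	p->tags = calloc(queue_depth, sizeof(struct tag) + pdu_sz);
	if (!p->tags)
		return -1;
	for (i = 0; i < queue_depth; i++)
		get_tag(p, i)->mem_id = (unsigned int)i;

	return 0;
}
```

With this layout a caller only needs to carry the PDU pointer around:
tag_from_pdu(tag_to_pdu(tag)) returns the original tag, and the bit
index in tags_map is the same number as tag->mem_id.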

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
new file mode 100644
index 000000000000..0323da91ca01
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
@@ -0,0 +1,315 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_CLT_H
+#define IBTRS_CLT_H
+
+#include <linux/device.h>
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_clt_state - Client states.
+ */
+enum ibtrs_clt_state {
+	IBTRS_CLT_CONNECTING,
+	IBTRS_CLT_CONNECTING_ERR,
+	IBTRS_CLT_RECONNECTING,
+	IBTRS_CLT_CONNECTED,
+	IBTRS_CLT_CLOSING,
+	IBTRS_CLT_CLOSED,
+	IBTRS_CLT_DEAD,
+};
+
+static inline const char *ibtrs_clt_state_str(enum ibtrs_clt_state state)
+{
+	switch (state) {
+	case IBTRS_CLT_CONNECTING:
+		return "IBTRS_CLT_CONNECTING";
+	case IBTRS_CLT_CONNECTING_ERR:
+		return "IBTRS_CLT_CONNECTING_ERR";
+	case IBTRS_CLT_RECONNECTING:
+		return "IBTRS_CLT_RECONNECTING";
+	case IBTRS_CLT_CONNECTED:
+		return "IBTRS_CLT_CONNECTED";
+	case IBTRS_CLT_CLOSING:
+		return "IBTRS_CLT_CLOSING";
+	case IBTRS_CLT_CLOSED:
+		return "IBTRS_CLT_CLOSED";
+	case IBTRS_CLT_DEAD:
+		return "IBTRS_CLT_DEAD";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+enum ibtrs_mp_policy {
+	MP_POLICY_RR,
+	MP_POLICY_MIN_INFLIGHT,
+};
+
+struct ibtrs_clt_stats_reconnects {
+	int successful_cnt;
+	int fail_cnt;
+};
+
+struct ibtrs_clt_stats_wc_comp {
+	u32 cnt;
+	u64 total_cnt;
+};
+
+struct ibtrs_clt_stats_cpu_migr {
+	atomic_t from;
+	int to;
+};
+
+struct ibtrs_clt_stats_rdma {
+	struct {
+		u64 cnt;
+		u64 size_total;
+	} dir[2];
+
+	u64 failover_cnt;
+};
+
+struct ibtrs_clt_stats_rdma_lat {
+	u64 read;
+	u64 write;
+};
+
+#define MIN_LOG_SG 2
+#define MAX_LOG_SG 5
+#define MAX_LIN_SG BIT(MIN_LOG_SG)
+#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)
+
+#define MAX_LOG_LAT 16
+#define MIN_LOG_LAT 0
+#define LOG_LAT_SZ (MAX_LOG_LAT - MIN_LOG_LAT + 2)
+
+struct ibtrs_clt_stats_pcpu {
+	struct ibtrs_clt_stats_cpu_migr		cpu_migr;
+	struct ibtrs_clt_stats_rdma		rdma;
+	u64					sg_list_total;
+	u64					sg_list_distr[SG_DISTR_SZ];
+	struct ibtrs_clt_stats_rdma_lat		rdma_lat_distr[LOG_LAT_SZ];
+	struct ibtrs_clt_stats_rdma_lat		rdma_lat_max;
+	struct ibtrs_clt_stats_wc_comp		wc_comp;
+};
+
+struct ibtrs_clt_stats {
+	bool					enable_rdma_lat;
+	struct ibtrs_clt_stats_pcpu    __percpu	*pcpu_stats;
+	struct ibtrs_clt_stats_reconnects	reconnects;
+	atomic_t				inflight;
+};
+
+struct ibtrs_clt_con {
+	struct ibtrs_con	c;
+	unsigned		cpu;
+	atomic_t		io_cnt;
+	int			cm_err;
+};
+
+/**
+ * struct ibtrs_tag - tags the memory allocation for a future RDMA operation
+ */
+struct ibtrs_tag {
+	enum ibtrs_clt_con_type con_type;
+	unsigned int cpu_id;
+	unsigned int mem_id;
+	unsigned int mem_off;
+};
+
+struct ibtrs_clt_io_req {
+	struct list_head        list;
+	struct ibtrs_iu		*iu;
+	struct scatterlist	*sglist; /* list holding user data */
+	unsigned int		sg_cnt;
+	unsigned int		sg_size;
+	unsigned int		data_len;
+	unsigned int		usr_len;
+	void			*priv;
+	bool			in_use;
+	struct ibtrs_clt_con	*con;
+	struct ibtrs_sg_desc	*desc;
+	struct ib_sge		*sge;
+	struct ibtrs_tag	*tag;
+	enum dma_data_direction dir;
+	ibtrs_conf_fn		*conf;
+	unsigned long		start_jiffies;
+
+	struct ib_mr		*mr;
+	struct ib_cqe		inv_cqe;
+	struct completion	inv_comp;
+	int			inv_errno;
+	bool			need_inv_comp;
+	bool			need_inv;
+};
+
+struct ibtrs_rbuf {
+	u64 addr;
+	u32 rkey;
+};
+
+struct ibtrs_clt_sess {
+	struct ibtrs_sess	s;
+	struct ibtrs_clt	*clt;
+	wait_queue_head_t	state_wq;
+	enum ibtrs_clt_state	state;
+	atomic_t		connected_cnt;
+	struct mutex		init_mutex;
+	struct ibtrs_clt_io_req	*reqs;
+	struct delayed_work	reconnect_dwork;
+	struct work_struct	close_work;
+	unsigned		reconnect_attempts;
+	bool			established;
+	struct ibtrs_rbuf	*rbufs;
+	size_t			max_io_size;
+	u32			max_hdr_size;
+	u32			chunk_size;
+	size_t			queue_depth;
+	u32			max_pages_per_mr;
+	int			max_sge;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct ibtrs_clt_stats  stats;
+	/* cache hca_port and hca_name to display in sysfs */
+	u8			hca_port;
+	char                    hca_name[IB_DEVICE_NAME_MAX];
+	struct list_head __percpu
+				*mp_skip_entry;
+};
+
+struct ibtrs_clt {
+	struct list_head   /* __rcu */ paths_list;
+	size_t			       paths_num;
+	struct ibtrs_clt_sess
+		      __percpu * __rcu *pcpu_path;
+
+	bool			opened;
+	uuid_t			paths_uuid;
+	int			paths_up;
+	struct mutex		paths_mutex;
+	struct mutex		paths_ev_mutex;
+	char			sessname[NAME_MAX];
+	short			port;
+	unsigned		max_reconnect_attempts;
+	unsigned		reconnect_delay_sec;
+	unsigned		max_segments;
+	void			*tags;
+	unsigned long		*tags_map;
+	size_t			queue_depth;
+	size_t			max_io_size;
+	wait_queue_head_t	tags_wait;
+	size_t			pdu_sz;
+	void			*priv;
+	link_clt_ev_fn		*link_ev;
+	struct device		dev;
+	struct kobject		kobj_paths;
+	enum ibtrs_mp_policy	mp_policy;
+};
+
+static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
+{
+	return container_of(c, struct ibtrs_clt_con, c);
+}
+
+static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
+{
+	return container_of(s, struct ibtrs_clt_sess, s);
+}
+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj)						\
+	LIST(CASE(obj, struct ibtrs_clt_sess *, s.sessname),		\
+	     CASE(obj, struct ibtrs_clt *, sessname))
+
+#define TAG_SIZE(clt) (sizeof(struct ibtrs_tag) + (clt)->pdu_sz)
+#define GET_TAG(clt, idx) ((clt)->tags + TAG_SIZE(clt) * (idx))
+
+int ibtrs_clt_reconnect_from_sysfs(struct ibtrs_clt_sess *sess);
+int ibtrs_clt_disconnect_from_sysfs(struct ibtrs_clt_sess *sess);
+int ibtrs_clt_create_path_from_sysfs(struct ibtrs_clt *clt,
+				     struct ibtrs_addr *addr);
+int ibtrs_clt_remove_path_from_sysfs(struct ibtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self);
+
+void ibtrs_clt_set_max_reconnect_attempts(struct ibtrs_clt *clt, int value);
+int ibtrs_clt_get_max_reconnect_attempts(const struct ibtrs_clt *clt);
+
+/* ibtrs-clt-stats.c */
+
+int ibtrs_clt_init_stats(struct ibtrs_clt_stats *stats);
+void ibtrs_clt_free_stats(struct ibtrs_clt_stats *stats);
+
+void ibtrs_clt_decrease_inflight(struct ibtrs_clt_stats *s);
+void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *s);
+
+void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *s, bool read,
+			       unsigned long ms);
+void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con);
+void ibtrs_clt_update_all_stats(struct ibtrs_clt_io_req *req, int dir);
+
+int ibtrs_clt_reset_sg_list_distr_stats(struct ibtrs_clt_stats *stats,
+					bool enable);
+int ibtrs_clt_stats_sg_list_distr_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len);
+int ibtrs_clt_reset_rdma_lat_distr_stats(struct ibtrs_clt_stats *stats,
+					 bool enable);
+ssize_t ibtrs_clt_stats_rdma_lat_distr_to_str(struct ibtrs_clt_stats *stats,
+					      char *page, size_t len);
+int ibtrs_clt_reset_cpu_migr_stats(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_migration_cnt_to_str(struct ibtrs_clt_stats *stats, char *buf,
+					 size_t len);
+int ibtrs_clt_reset_reconnects_stat(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_reconnects_to_str(struct ibtrs_clt_stats *stats, char *buf,
+				      size_t len);
+int ibtrs_clt_reset_wc_comp_stats(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_wc_completion_to_str(struct ibtrs_clt_stats *stats, char *buf,
+					 size_t len);
+int ibtrs_clt_reset_rdma_stats(struct ibtrs_clt_stats *stats, bool enable);
+ssize_t ibtrs_clt_stats_rdma_to_str(struct ibtrs_clt_stats *stats,
+				    char *page, size_t len);
+bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess);
+int ibtrs_clt_reset_all_stats(struct ibtrs_clt_stats *stats, bool enable);
+ssize_t ibtrs_clt_reset_all_help(struct ibtrs_clt_stats *stats,
+				 char *page, size_t len);
+
+/* ibtrs-clt-sysfs.c */
+
+int ibtrs_clt_create_sysfs_root_folders(struct ibtrs_clt *clt);
+int ibtrs_clt_create_sysfs_root_files(struct ibtrs_clt *clt);
+void ibtrs_clt_destroy_sysfs_root_folders(struct ibtrs_clt *clt);
+void ibtrs_clt_destroy_sysfs_root_files(struct ibtrs_clt *clt);
+
+int ibtrs_clt_create_sess_files(struct ibtrs_clt_sess *sess);
+void ibtrs_clt_destroy_sess_files(struct ibtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self);
+
+#endif /* IBTRS_CLT_H */
-- 
2.13.1

* [PATCH v2 07/26] ibtrs: client: main functionality
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (5 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 06/26] ibtrs: client: private header with client structs and functions Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 08/26] ibtrs: client: statistics functions Roman Pen
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the main functionality of the ibtrs-client module, which manages
a set of RDMA connections for each IBTRS session and does multipathing,
load balancing and failover of RDMA requests.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c | 2818 ++++++++++++++++++++++++++++++
 1 file changed, 2818 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
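
Note on two small pieces of arithmetic in the IO path that are easy to
misread.  ibtrs_tag_to_clt_con() maps the CPU recorded in the tag onto
IO connections 1..con_num-1 (connection 0 is reserved for user
messages), and ibtrs_post_send_rdma() requests a signalled completion
on every queue_depth-th send per connection.  Below is a minimal
userspace sketch of both, not the kernel code.  The function names
tag_to_con_id() and send_is_signalled() are hypothetical, and the
sketch assumes con_num > 1 whenever an IO connection is requested.

```c
#include <assert.h>

/*
 * Connection 0 carries user messages; IO connections are numbered
 * 1..con_num-1 and are picked by the CPU recorded in the tag.
 * Assumption: con_num > 1 when an IO connection is requested.
 */
static unsigned int tag_to_con_id(int is_io_con, unsigned int cpu_id,
				  unsigned int con_num)
{
	if (!is_io_con)
		return 0;
	assert(con_num > 1);

	return cpu_id % (con_num - 1) + 1;
}

/*
 * io_cnt is the per-connection counter value after the increment.
 * Returns non-zero when this send should be posted as signalled,
 * i.e. once per queue_depth sends.
 */
static int send_is_signalled(unsigned int io_cnt, unsigned int queue_depth)
{
	assert(queue_depth > 0);

	return io_cnt % queue_depth == 0;
}
```

For example, with con_num = 5 there are four IO connections, so CPUs
0, 1, 2, 3 map to connections 1, 2, 3, 4 and CPU 4 wraps around to
connection 1 again.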

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
new file mode 100644
index 000000000000..0983f0939b19
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
@@ -0,0 +1,2818 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/rculist.h>
+
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+#define MAX_SEGMENTS 31
+#define IBTRS_CONNECT_TIMEOUT_MS 5000
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Client");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static ushort nr_cons_per_session;
+module_param(nr_cons_per_session, ushort, 0444);
+MODULE_PARM_DESC(nr_cons_per_session, "Number of connections per session."
+		 " (default: nr_cpu_ids)");
+
+static int retry_cnt = 7;
+module_param_named(retry_cnt, retry_cnt, int, 0644);
+MODULE_PARM_DESC(retry_cnt, "Number of times to send the message if the"
+		 " remote side didn't respond with Ack or Nack (default: 7,"
+		 " min: " __stringify(MIN_RTR_CNT) ", max: "
+		 __stringify(MAX_RTR_CNT) ")");
+
+static int __read_mostly noreg_cnt = 0;
+module_param_named(noreg_cnt, noreg_cnt, int, 0444);
+MODULE_PARM_DESC(noreg_cnt, "Max number of SG entries when MR registration "
+		 "does not happen (default: 0)");
+
+static const struct ibtrs_ib_dev_pool_ops dev_pool_ops;
+static struct ibtrs_ib_dev_pool dev_pool = {
+	.ops = &dev_pool_ops
+};
+static struct workqueue_struct *ibtrs_wq;
+static struct class *ibtrs_dev_class;
+
+static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
+static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev);
+static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
+			      bool notify, bool can_wait);
+static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
+static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);
+
+bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
+{
+	return sess->state == IBTRS_CLT_CONNECTED;
+}
+
+static inline bool ibtrs_clt_is_connected(const struct ibtrs_clt *clt)
+{
+	struct ibtrs_clt_sess *sess;
+	bool connected = false;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry)
+		connected |= ibtrs_clt_sess_is_connected(sess);
+	rcu_read_unlock();
+
+	return connected;
+}
+
+static inline struct ibtrs_tag *
+__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
+{
+	size_t max_depth = clt->queue_depth;
+	struct ibtrs_tag *tag;
+	int cpu, bit;
+
+	cpu = get_cpu();
+	do {
+		bit = find_first_zero_bit(clt->tags_map, max_depth);
+		if (unlikely(bit >= max_depth)) {
+			put_cpu();
+			return NULL;
+		}
+
+	} while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
+	put_cpu();
+
+	tag = GET_TAG(clt, bit);
+	WARN_ON(tag->mem_id != bit);
+	tag->cpu_id = cpu;
+	tag->con_type = con_type;
+
+	return tag;
+}
+
+static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
+				   struct ibtrs_tag *tag)
+{
+	clear_bit_unlock(tag->mem_id, clt->tags_map);
+}
+
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
+				    enum ibtrs_clt_con_type con_type,
+				    int can_wait)
+{
+	struct ibtrs_tag *tag;
+	DEFINE_WAIT(wait);
+
+	tag = __ibtrs_get_tag(clt, con_type);
+	if (likely(tag) || !can_wait)
+		return tag;
+
+	do {
+		prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
+		tag = __ibtrs_get_tag(clt, con_type);
+		if (likely(tag))
+			break;
+
+		io_schedule();
+	} while (1);
+
+	finish_wait(&clt->tags_wait, &wait);
+
+	return tag;
+}
+EXPORT_SYMBOL(ibtrs_clt_get_tag);
+
+void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
+{
+	if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
+		return;
+
+	__ibtrs_put_tag(clt, tag);
+
+	/*
+	 * Putting a tag is a barrier, so we will observe
+	 * new entry in the wait list, no worries.
+	 */
+	if (waitqueue_active(&clt->tags_wait))
+		wake_up(&clt->tags_wait);
+}
+EXPORT_SYMBOL(ibtrs_clt_put_tag);
+
+struct ibtrs_tag *ibtrs_tag_from_pdu(void *pdu)
+{
+	return pdu - sizeof(struct ibtrs_tag);
+}
+EXPORT_SYMBOL(ibtrs_tag_from_pdu);
+
+void *ibtrs_tag_to_pdu(struct ibtrs_tag *tag)
+{
+	return tag + 1;
+}
+EXPORT_SYMBOL(ibtrs_tag_to_pdu);
+
+/**
+ * ibtrs_tag_to_clt_con() - returns the RDMA connection selected by the tag
+ *
+ * Note:
+ *     IO connections are numbered from 1.
+ *     Connection 0 is for user messages.
+ */
+static struct ibtrs_clt_con *ibtrs_tag_to_clt_con(struct ibtrs_clt_sess *sess,
+						  struct ibtrs_tag *tag)
+{
+	int id = 0;
+
+	if (likely(tag->con_type == IBTRS_IO_CON))
+		id = (tag->cpu_id % (sess->s.con_num - 1)) + 1;
+
+	return to_clt_con(sess->s.con[id]);
+}
+
+static void ibtrs_clt_fast_reg_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+       struct ibtrs_clt_con *con = cq->cq_context;
+       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+       if (unlikely(wc->status != IB_WC_SUCCESS)) {
+               ibtrs_err(sess, "Failed IB_WR_REG_MR: %s\n",
+                         ib_wc_status_msg(wc->status));
+               ibtrs_rdma_error_recovery(con);
+       }
+}
+
+static struct ib_cqe fast_reg_cqe = {
+       .done = ibtrs_clt_fast_reg_done
+};
+
+static void ibtrs_clt_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_io_req *req =
+		container_of(wc->wr_cqe, typeof(*req), inv_cqe);
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Failed IB_WR_LOCAL_INV: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_rdma_error_recovery(con);
+	}
+	req->need_inv = false;
+	if (likely(req->need_inv_comp))
+		complete(&req->inv_comp);
+	else
+		/* Complete request from INV callback */
+		complete_rdma_req(req, req->inv_errno, true, false);
+}
+
+static int ibtrs_inv_rkey(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ib_send_wr *bad_wr;
+	struct ib_send_wr wr = {
+		.opcode		    = IB_WR_LOCAL_INV,
+		.wr_cqe		    = &req->inv_cqe,
+		.next		    = NULL,
+		.num_sge	    = 0,
+		.send_flags	    = IB_SEND_SIGNALED,
+		.ex.invalidate_rkey = req->mr->rkey,
+	};
+	req->inv_cqe.done = ibtrs_clt_inv_rkey_done;
+
+	return ib_post_send(con->c.qp, &wr, &bad_wr);
+}
+
+static int ibtrs_post_send_rdma(struct ibtrs_clt_con *con,
+				struct ibtrs_clt_io_req *req,
+				struct ibtrs_rbuf *rbuf, u32 off,
+				u32 imm, struct ib_send_wr *wr)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	enum ib_send_flags flags;
+	struct ib_sge sge;
+
+	if (unlikely(!req->sg_size)) {
+		ibtrs_wrn(sess, "Doing RDMA Write failed, no data supplied\n");
+		return -EINVAL;
+	}
+
+	/* user data and user message in the first list element */
+	sge.addr   = req->iu->dma_addr;
+	sge.length = req->sg_size;
+	sge.lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or the send queue will fill up and only a QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
+				      req->sg_size, DMA_TO_DEVICE);
+
+	return ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, &sge, 1,
+					    rbuf->rkey, rbuf->addr + off,
+					    imm, flags, wr);
+}
+
+static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
+			      bool notify, bool can_wait)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess;
+	struct ibtrs_clt *clt;
+	int err;
+
+	if (WARN_ON(!req->in_use))
+		return;
+	if (WARN_ON(!req->con))
+		return;
+	sess = to_clt_sess(con->c.sess);
+	clt = sess->clt;
+
+	if (req->sg_cnt) {
+		if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
+			/*
+			 * We are here to invalidate RDMA read requests
+			 * ourselves.  In normal scenario server should
+			 * send INV for all requested RDMA reads, but
+			 * we are here, thus two things could happen:
+			 *
+			 *    1.  this is failover, when errno != 0
+			 *        and can_wait == 1,
+			 *
+			 *    2.  something totally bad happened and
+			 *        server forgot to send INV, so we
+			 *        should do that ourselves.
+			 */
+
+			if (likely(can_wait))
+				req->need_inv_comp = true;
+			else {
+				/* This should be IO path, so always notify */
+				WARN_ON(!notify);
+				/* Save errno for INV callback */
+				req->inv_errno = errno;
+			}
+
+			err = ibtrs_inv_rkey(req);
+			if (unlikely(err))
+				ibtrs_err(sess, "Send INV WR key=%#x: %d\n",
+					  req->mr->rkey, err);
+			else if (likely(can_wait))
+				wait_for_completion(&req->inv_comp);
+			else {
+				/*
+				 * Something went wrong, so request will be
+				 * completed from INV callback.
+				 */
+				WARN_ON_ONCE(1);
+
+				return;
+			}
+		}
+		ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
+				req->sg_cnt, req->dir);
+	}
+	if (sess->stats.enable_rdma_lat)
+		ibtrs_clt_update_rdma_lat(&sess->stats,
+				req->dir == DMA_FROM_DEVICE,
+				jiffies_to_msecs(jiffies - req->start_jiffies));
+	ibtrs_clt_decrease_inflight(&sess->stats);
+
+	req->in_use = false;
+	req->con = NULL;
+
+	if (notify)
+		req->conf(req->priv, errno);
+}
+
+static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id,
+			   s16 errno, bool w_inval)
+{
+	struct ibtrs_clt_io_req *req;
+
+	if (WARN_ON(msg_id >= sess->queue_depth))
+		return;
+
+	req = &sess->reqs[msg_id];
+	/* Drop need_inv if server responded with invalidation */
+	req->need_inv &= !w_inval;
+	complete_rdma_req(req, errno, true, false);
+}
+
+static struct ib_cqe io_comp_cqe = {
+	.done = ibtrs_clt_rdma_done
+};
+
+static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u32 imm_type, imm_payload;
+	bool w_inval = false;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			ibtrs_err(sess, "RDMA failed: %s\n",
+				  ib_wc_status_msg(wc->status));
+			ibtrs_rdma_error_recovery(con);
+		}
+		return;
+	}
+	ibtrs_clt_update_wc_stats(con);
+
+	switch (wc->opcode) {
+	case IB_WC_RDMA_WRITE:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+	case IB_WC_RECV:
+		/*
+		 * Key invalidations from server side
+		 */
+		WARN_ON(!(wc->wc_flags & IB_WC_WITH_INVALIDATE));
+		WARN_ON(wc->wr_cqe != &io_comp_cqe);
+		break;
+
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
+			return;
+
+		ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == IBTRS_IO_RSP_IMM ||
+			   imm_type == IBTRS_IO_RSP_W_INV_IMM)) {
+			u32 msg_id;
+
+			w_inval = (imm_type == IBTRS_IO_RSP_W_INV_IMM);
+			ibtrs_from_io_rsp_imm(imm_payload, &msg_id, &err);
+			process_io_rsp(sess, msg_id, err, w_inval);
+		} else if (imm_type == IBTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			ibtrs_send_hb_ack(&sess->s);
+		} else if (imm_type == IBTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+		} else {
+			ibtrs_wrn(sess, "Unknown IMM type %u\n", imm_type);
+		}
+		if (w_inval)
+			/*
+			 * Post x2 empty WRs: first is for this RDMA with IMM,
+			 * second is for RECV with INV, which happened earlier.
+			 */
+			err = ibtrs_post_recv_empty_x2(&con->c, &io_comp_cqe);
+		else
+			err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "ibtrs_post_recv_empty(): %d\n", err);
+			ibtrs_rdma_error_recovery(con);
+			break;
+		}
+		break;
+	default:
+		ibtrs_wrn(sess, "Unexpected WC type: %d\n", wc->opcode);
+		return;
+	}
+}
+
+static int post_recv_io(struct ibtrs_clt_con *con, size_t q_size)
+{
+	int err, i;
+
+	for (i = 0; i < q_size; i++) {
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct ibtrs_clt_sess *sess)
+{
+	size_t q_size;
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (cid == 0)
+			q_size = SERVICE_CON_QUEUE_DEPTH;
+		else
+			q_size = sess->queue_depth;
+
+		/*
+		 * x2 for RDMA read responses + FR key invalidations,
+		 * RDMA writes do not require any FR registrations.
+		 */
+		q_size *= 2;
+
+		err = post_recv_io(to_clt_con(sess->s.con[cid]), q_size);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+struct path_it {
+	int i;
+	struct list_head skip_list;
+	struct ibtrs_clt *clt;
+	struct ibtrs_clt_sess *(*next_path)(struct path_it *);
+};
+
+#define do_each_path(path, clt, it) {					\
+	path_it_init(it, clt);						\
+	rcu_read_lock();						\
+	for ((it)->i = 0; ((path) = ((it)->next_path)(it)) &&		\
+			  (it)->i < (it)->clt->paths_num;		\
+	     (it)->i++)
+
+#define while_each_path(it)						\
+	path_it_deinit(it);						\
+	rcu_read_unlock();						\
+	}
+
+/**
+ * get_next_path_rr() - Returns path in round-robin fashion.
+ *
+ * Related to @MP_POLICY_RR
+ *
+ * Locks:
+ *    rcu_read_lock() must be held.
+ */
+static struct ibtrs_clt_sess *get_next_path_rr(struct path_it *it)
+{
+	struct ibtrs_clt_sess __percpu * __rcu *ppcpu_path, *path;
+	struct ibtrs_clt *clt = it->clt;
+
+	ppcpu_path = this_cpu_ptr(clt->pcpu_path);
+	path = rcu_dereference(*ppcpu_path);
+	if (unlikely(!path))
+		path = list_first_or_null_rcu(&clt->paths_list,
+					      typeof(*path), s.entry);
+	else
+		path = list_next_or_null_rr_rcu(&clt->paths_list,
+						&path->s.entry,
+						typeof(*path),
+						s.entry);
+	rcu_assign_pointer(*ppcpu_path, path);
+
+	return path;
+}
+
+/**
+ * get_next_path_min_inflight() - Returns path with minimal inflight count.
+ *
+ * Related to @MP_POLICY_MIN_INFLIGHT
+ *
+ * Locks:
+ *    rcu_read_lock() must be held.
+ */
+static struct ibtrs_clt_sess *get_next_path_min_inflight(struct path_it *it)
+{
+	struct ibtrs_clt_sess *min_path = NULL;
+	struct ibtrs_clt *clt = it->clt;
+	struct ibtrs_clt_sess *sess;
+	int min_inflight = INT_MAX;
+	int inflight;
+
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry) {
+		if (unlikely(!list_empty(raw_cpu_ptr(sess->mp_skip_entry))))
+			continue;
+
+		inflight = atomic_read(&sess->stats.inflight);
+
+		if (inflight < min_inflight) {
+			min_inflight = inflight;
+			min_path = sess;
+		}
+	}
+
+	/*
+	 * add the path to the skip list, so that next time we can get
+	 * a different one
+	 */
+	if (min_path)
+		list_add(raw_cpu_ptr(min_path->mp_skip_entry), &it->skip_list);
+
+	return min_path;
+}
+
+static inline void path_it_init(struct path_it *it, struct ibtrs_clt *clt)
+{
+	INIT_LIST_HEAD(&it->skip_list);
+	it->clt = clt;
+	it->i = 0;
+
+	if (clt->mp_policy == MP_POLICY_RR)
+		it->next_path = get_next_path_rr;
+	else
+		it->next_path = get_next_path_min_inflight;
+}
+
+static inline void path_it_deinit(struct path_it *it)
+{
+	struct list_head *skip, *tmp;
+	/*
+	 * The skip_list is used only for the MIN_INFLIGHT policy.
+	 * We need to remove paths from it, so that next IO can insert
+	 * paths (->mp_skip_entry) into a skip_list again.
+	 */
+	list_for_each_safe(skip, tmp, &it->skip_list)
+		list_del_init(skip);
+}
+
+static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
+				      struct ibtrs_clt_sess *sess,
+				      ibtrs_conf_fn *conf,
+				      struct ibtrs_tag *tag, void *priv,
+				      const struct kvec *vec, size_t usr_len,
+				      struct scatterlist *sg, size_t sg_cnt,
+				      size_t data_len, int dir)
+{
+	struct iov_iter iter;
+	size_t len;
+
+	req->tag = tag;
+	req->in_use = true;
+	req->usr_len = usr_len;
+	req->data_len = data_len;
+	req->sglist = sg;
+	req->sg_cnt = sg_cnt;
+	req->priv = priv;
+	req->dir = dir;
+	req->con = ibtrs_tag_to_clt_con(sess, tag);
+	req->conf = conf;
+	req->need_inv = false;
+	req->need_inv_comp = false;
+	req->inv_errno = 0;
+
+	iov_iter_kvec(&iter, ITER_KVEC, vec, 1, usr_len);
+	len = _copy_from_iter(req->iu->buf, usr_len, &iter);
+	WARN_ON(len != usr_len);
+
+	reinit_completion(&req->inv_comp);
+	if (sess->stats.enable_rdma_lat)
+		req->start_jiffies = jiffies;
+}
+
+static inline struct ibtrs_clt_io_req *
+ibtrs_clt_get_req(struct ibtrs_clt_sess *sess, ibtrs_conf_fn *conf,
+		  struct ibtrs_tag *tag, void *priv,
+		  const struct kvec *vec, size_t usr_len,
+		  struct scatterlist *sg, size_t sg_cnt,
+		  size_t data_len, int dir)
+{
+	struct ibtrs_clt_io_req *req;
+
+	req = &sess->reqs[tag->mem_id];
+	ibtrs_clt_init_req(req, sess, conf, tag, priv, vec, usr_len,
+			   sg, sg_cnt, data_len, dir);
+	return req;
+}
+
+static inline struct ibtrs_clt_io_req *
+ibtrs_clt_get_copy_req(struct ibtrs_clt_sess *alive_sess,
+		       struct ibtrs_clt_io_req *fail_req)
+{
+	struct ibtrs_clt_io_req *req;
+	struct kvec vec = {
+		.iov_base = fail_req->iu->buf,
+		.iov_len  = fail_req->usr_len
+	};
+
+	req = &alive_sess->reqs[fail_req->tag->mem_id];
+	ibtrs_clt_init_req(req, alive_sess, fail_req->conf, fail_req->tag,
+			   fail_req->priv, &vec, fail_req->usr_len,
+			   fail_req->sglist, fail_req->sg_cnt,
+			   fail_req->data_len, fail_req->dir);
+	return req;
+}
+
+static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
+				  struct ibtrs_clt_io_req *fail_req)
+{
+	struct ibtrs_clt_sess *alive_sess;
+	struct ibtrs_clt_io_req *req;
+	int err = -ECONNABORTED;
+	struct path_it it;
+
+	do_each_path(alive_sess, clt, &it) {
+		if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
+			continue;
+		req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
+		if (req->dir == DMA_TO_DEVICE)
+			err = ibtrs_clt_write_req(req);
+		else
+			err = ibtrs_clt_read_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+
+static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_clt_io_req *req;
+	int i, err;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		if (!req->in_use)
+			continue;
+
+		/*
+		 * Safely (without notification) complete failed request.
+		 * After completion this request is still usable and can
+		 * be failed over to another path.
+		 */
+		complete_rdma_req(req, -ECONNABORTED, false, true);
+
+		err = ibtrs_clt_failover_req(clt, req);
+		if (unlikely(err))
+			/* Failover failed, notify anyway */
+			req->conf(req->priv, err);
+	}
+}
+
+static void free_sess_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_io_req *req;
+	int i;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		if (req->mr)
+			ib_dereg_mr(req->mr);
+		kfree(req->sge);
+		ibtrs_iu_free(req->iu, DMA_TO_DEVICE,
+			      sess->s.dev->ib_dev);
+	}
+	kfree(sess->reqs);
+	sess->reqs = NULL;
+}
+
+static int alloc_sess_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_io_req *req;
+	struct ibtrs_clt *clt = sess->clt;
+	int i, err = -ENOMEM;
+
+	sess->reqs = kcalloc(sess->queue_depth, sizeof(*sess->reqs),
+			     GFP_KERNEL);
+	if (unlikely(!sess->reqs))
+		return -ENOMEM;
+
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		req->iu = ibtrs_iu_alloc(i, sess->max_hdr_size, GFP_KERNEL,
+					 sess->s.dev->ib_dev, DMA_TO_DEVICE,
+					 ibtrs_clt_rdma_done);
+		if (unlikely(!req->iu))
+			goto out;
+
+		req->sge = kmalloc_array(clt->max_segments + 1,
+					 sizeof(*req->sge), GFP_KERNEL);
+		if (unlikely(!req->sge))
+			goto out;
+
+		req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
+				      clt->max_segments + 1);
+		if (unlikely(IS_ERR(req->mr))) {
+			err = PTR_ERR(req->mr);
+			req->mr = NULL;
+			goto out;
+		}
+
+		init_completion(&req->inv_comp);
+	}
+
+	return 0;
+
+out:
+	free_sess_reqs(sess);
+
+	return err;
+}
+
+static int alloc_tags(struct ibtrs_clt *clt)
+{
+	unsigned int chunk_bits;
+	int err, i;
+
+	clt->tags_map = kcalloc(BITS_TO_LONGS(clt->queue_depth), sizeof(long),
+				GFP_KERNEL);
+	if (unlikely(!clt->tags_map)) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+	clt->tags = kcalloc(clt->queue_depth, TAG_SIZE(clt), GFP_KERNEL);
+	if (unlikely(!clt->tags)) {
+		err = -ENOMEM;
+		goto err_map;
+	}
+	chunk_bits = ilog2(clt->queue_depth - 1) + 1;
+	for (i = 0; i < clt->queue_depth; i++) {
+		struct ibtrs_tag *tag;
+
+		tag = GET_TAG(clt, i);
+		tag->mem_id = i;
+		tag->mem_off = i << (MAX_IMM_PAYL_BITS - chunk_bits);
+	}
+
+	return 0;
+
+err_map:
+	kfree(clt->tags_map);
+	clt->tags_map = NULL;
+out_err:
+	return err;
+}
+
+static void free_tags(struct ibtrs_clt *clt)
+{
+	kfree(clt->tags_map);
+	clt->tags_map = NULL;
+	kfree(clt->tags);
+	clt->tags = NULL;
+}
+
+static void query_fast_reg_mode(struct ibtrs_clt_sess *sess)
+{
+	struct ib_device *ib_dev;
+	u64 max_pages_per_mr;
+	int mr_page_shift;
+
+	ib_dev = sess->s.dev->ib_dev;
+
+	/*
+	 * Use the smallest page size supported by the HCA, down to a
+	 * minimum of 4096 bytes. We're unlikely to build large sglists
+	 * out of smaller entries.
+	 */
+	mr_page_shift      = max(12, ffs(ib_dev->attrs.page_size_cap) - 1);
+	max_pages_per_mr   = ib_dev->attrs.max_mr_size;
+	do_div(max_pages_per_mr, (1ull << mr_page_shift));
+	sess->max_pages_per_mr =
+		min3(sess->max_pages_per_mr, (u32)max_pages_per_mr,
+		     ib_dev->attrs.max_fast_reg_page_list_len);
+	sess->max_sge = ib_dev->attrs.max_sge;
+}
+
+static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
+				     enum ibtrs_clt_state new_state)
+{
+	enum ibtrs_clt_state old_state;
+	bool changed = false;
+
+	old_state = sess->state;
+	switch (new_state) {
+	case IBTRS_CLT_CONNECTING:
+		switch (old_state) {
+		case IBTRS_CLT_RECONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_RECONNECTING:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTED:
+		case IBTRS_CLT_CONNECTING_ERR:
+		case IBTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CONNECTED:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CONNECTING_ERR:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CLOSING:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+		case IBTRS_CLT_CONNECTING_ERR:
+		case IBTRS_CLT_RECONNECTING:
+		case IBTRS_CLT_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CLOSED:
+		switch (old_state) {
+		case IBTRS_CLT_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_DEAD:
+		switch (old_state) {
+		case IBTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed) {
+		sess->state = new_state;
+		wake_up_locked(&sess->state_wq);
+	}
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state_from_to(struct ibtrs_clt_sess *sess,
+					   enum ibtrs_clt_state old_state,
+					   enum ibtrs_clt_state new_state)
+{
+	bool changed = false;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	if (sess->state == old_state)
+		changed = __ibtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state_get_old(struct ibtrs_clt_sess *sess,
+					   enum ibtrs_clt_state new_state,
+					   enum ibtrs_clt_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	*old_state = sess->state;
+	changed = __ibtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
+				   enum ibtrs_clt_state new_state)
+{
+	enum ibtrs_clt_state old_state;
+
+	return ibtrs_clt_change_state_get_old(sess, new_state, &old_state);
+}
+
+static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
+{
+	enum ibtrs_clt_state state;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	state = sess->state;
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return state;
+}
+
+static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
+{
+	struct ibtrs_clt_con *con;
+
+	(void)err;
+	con = container_of(c, typeof(*con), c);
+	ibtrs_rdma_error_recovery(con);
+}
+
+static void ibtrs_clt_init_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_init_hb(&sess->s, &io_comp_cqe,
+		      IBTRS_HB_INTERVAL_MS,
+		      IBTRS_HB_MISSED_MAX,
+		      ibtrs_clt_hb_err_handler,
+		      ibtrs_wq);
+}
+
+static void ibtrs_clt_start_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_start_hb(&sess->s);
+}
+
+static void ibtrs_clt_stop_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_stop_hb(&sess->s);
+}
+
+static void ibtrs_clt_reconnect_work(struct work_struct *work);
+static void ibtrs_clt_close_work(struct work_struct *work);
+
+static struct ibtrs_clt_sess *alloc_sess(struct ibtrs_clt *clt,
+					 const struct ibtrs_addr *path,
+					 size_t con_num, u16 max_segments)
+{
+	struct ibtrs_clt_sess *sess;
+	int err = -ENOMEM;
+	int cpu;
+
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	/* Extra connection for user messages */
+	con_num += 1;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_sess;
+
+	mutex_init(&sess->init_mutex);
+	uuid_gen(&sess->s.uuid);
+	memcpy(&sess->s.dst_addr, path->dst,
+	       rdma_addr_size((struct sockaddr *)path->dst));
+
+	/*
+	 * rdma_resolve_addr() passes src_addr to cma_bind_addr, which
+	 * checks the sa_family to be non-zero. If user passed src_addr=NULL
+	 * the sess->src_addr will contain only zeros, which is then fine.
+	 */
+	if (path->src)
+		memcpy(&sess->s.src_addr, path->src,
+		       rdma_addr_size((struct sockaddr *)path->src));
+	strlcpy(sess->s.sessname, clt->sessname, sizeof(sess->s.sessname));
+	sess->s.con_num = con_num;
+	sess->clt = clt;
+	sess->max_pages_per_mr = max_segments;
+	init_waitqueue_head(&sess->state_wq);
+	sess->state = IBTRS_CLT_CONNECTING;
+	atomic_set(&sess->connected_cnt, 0);
+	INIT_WORK(&sess->close_work, ibtrs_clt_close_work);
+	INIT_DELAYED_WORK(&sess->reconnect_dwork, ibtrs_clt_reconnect_work);
+	ibtrs_clt_init_hb(sess);
+
+	sess->mp_skip_entry = alloc_percpu(typeof(*sess->mp_skip_entry));
+	if (unlikely(!sess->mp_skip_entry))
+		goto err_free_con;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(sess->mp_skip_entry, cpu));
+
+	err = ibtrs_clt_init_stats(&sess->stats);
+	if (unlikely(err))
+		goto err_free_percpu;
+
+	return sess;
+
+err_free_percpu:
+	free_percpu(sess->mp_skip_entry);
+err_free_con:
+	kfree(sess->s.con);
+err_free_sess:
+	kfree(sess);
+err:
+	return ERR_PTR(err);
+}
+
+static void free_sess(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_clt_free_stats(&sess->stats);
+	free_percpu(sess->mp_skip_entry);
+	kfree(sess->s.con);
+	kfree(sess->rbufs);
+	kfree(sess);
+}
+
+static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
+{
+	struct ibtrs_clt_con *con;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con))
+		return -ENOMEM;
+
+	/* Map first two connections to the first CPU */
+	con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
+	con->c.cid = cid;
+	con->c.sess = &sess->s;
+	atomic_set(&con->io_cnt, 0);
+
+	sess->s.con[cid] = &con->c;
+
+	return 0;
+}
+
+static void destroy_con(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	sess->s.con[con->c.cid] = NULL;
+	kfree(con);
+}
+
+static int create_con_cq_qp(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u16 cq_size, wr_queue_size;
+	int err, cq_vector;
+
+	/*
+	 * This function can fail, but destroy_con_cq_qp() should still be
+	 * called afterwards.  create_con_cq_qp() runs on the CM event path,
+	 * so the caller/waiter never knows whether the failure happened
+	 * before or after create_con_cq_qp().  To solve this dilemma
+	 * without introducing additional flags, destroy_con_cq_qp() is
+	 * allowed to be called many times.
+	 */
+
+	if (con->c.cid == 0) {
+		/*
+		 * One completion for each receive and two for each send
+		 * (send request + registration)
+		 * + 2 for drain and heartbeat
+		 * in case qp gets into error state
+		 */
+		cq_size = wr_queue_size = SERVICE_CON_QUEUE_DEPTH * 3 + 2;
+		/* We must be the first here */
+		if (WARN_ON(sess->s.dev))
+			return -EINVAL;
+
+		/*
+		 * The whole session uses device from user connection.
+		 * Be careful not to close user connection before ib dev
+		 * is gracefully put.
+		 */
+		sess->s.dev = ibtrs_ib_dev_find_or_add(
+			con->c.cm_id->device, &dev_pool);
+		if (unlikely(!sess->s.dev)) {
+			ibtrs_wrn(sess, "ibtrs_ib_dev_find_or_add(): no memory\n");
+			return -ENOMEM;
+		}
+		sess->s.dev_ref = 1;
+		query_fast_reg_mode(sess);
+	} else {
+		/*
+		 * Here we assume that session members are correctly set.
+		 * This is always true if user connection (cid == 0) is
+		 * established first.
+		 */
+		if (WARN_ON(!sess->s.dev))
+			return -EINVAL;
+		if (WARN_ON(!sess->queue_depth))
+			return -EINVAL;
+
+		/* Shared between connections */
+		sess->s.dev_ref++;
+		cq_size = wr_queue_size =
+			min_t(int, sess->s.dev->ib_dev->attrs.max_qp_wr,
+			      /* QD * (REQ + RSP + FR REGS or INVS) + drain */
+			      sess->queue_depth * 3 + 1);
+	}
+	cq_vector = con->cpu % sess->s.dev->ib_dev->num_comp_vectors;
+	err = ibtrs_cq_qp_create(&sess->s, &con->c, sess->max_sge,
+				 cq_vector, cq_size, wr_queue_size,
+				 IB_POLL_SOFTIRQ);
+	/*
+	 * In case of error we do not bother to clean previous allocations,
+	 * since destroy_con_cq_qp() must be called.
+	 */
+
+	if (unlikely(err))
+		return err;
+
+	return err;
+}
+
+static void destroy_con_cq_qp(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	/*
+	 * Be careful here: destroy_con_cq_qp() can be called even if
+	 * create_con_cq_qp() failed, see comments there.
+	 */
+
+	ibtrs_cq_qp_destroy(&con->c);
+	if (sess->s.dev_ref && !--sess->s.dev_ref) {
+		ibtrs_ib_dev_put(sess->s.dev);
+		sess->s.dev = NULL;
+	}
+}
+
+static void stop_cm(struct ibtrs_clt_con *con)
+{
+	rdma_disconnect(con->c.cm_id);
+	if (con->c.qp)
+		ib_drain_qp(con->c.qp);
+}
+
+static void destroy_cm(struct ibtrs_clt_con *con)
+{
+	rdma_destroy_id(con->c.cm_id);
+	con->c.cm_id = NULL;
+}
+
+static int create_cm(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rdma_cm_id *cm_id;
+	int err;
+
+	cm_id = rdma_create_id(&init_net, ibtrs_clt_rdma_cm_handler, con,
+			       sess->s.dst_addr.ss_family == AF_IB ?
+			       RDMA_PS_IB : RDMA_PS_TCP, IB_QPT_RC);
+	if (unlikely(IS_ERR(cm_id))) {
+		err = PTR_ERR(cm_id);
+		ibtrs_err(sess, "Failed to create CM ID, err: %d\n", err);
+
+		return err;
+	}
+	con->c.cm_id = cm_id;
+	con->cm_err = 0;
+	/* allow the port to be reused */
+	err = rdma_set_reuseaddr(cm_id, 1);
+	if (err != 0) {
+		ibtrs_err(sess, "Set address reuse failed, err: %d\n", err);
+		goto destroy_cm;
+	}
+	err = rdma_resolve_addr(cm_id, (struct sockaddr *)&sess->s.src_addr,
+				(struct sockaddr *)&sess->s.dst_addr,
+				IBTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "Failed to resolve address, err: %d\n", err);
+		goto destroy_cm;
+	}
+	/*
+	 * Combine connection status and session events. This is needed
+	 * for waiting two possible cases: cm_err has something meaningful
+	 * or session state was really changed to error by device removal.
+	 */
+	err = wait_event_interruptible_timeout(sess->state_wq,
+			con->cm_err || sess->state != IBTRS_CLT_CONNECTING,
+			msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(err == 0 || err == -ERESTARTSYS)) {
+		if (err == 0)
+			err = -ETIMEDOUT;
+		/* Timed out or interrupted */
+		goto errr;
+	}
+	if (unlikely(con->cm_err < 0)) {
+		err = con->cm_err;
+		goto errr;
+	}
+	if (unlikely(sess->state != IBTRS_CLT_CONNECTING)) {
+		/* Device removal */
+		err = -ECONNABORTED;
+		goto errr;
+	}
+
+	return 0;
+
+errr:
+	stop_cm(con);
+	/* It is safe to call destroy even if cq_qp is not initialized */
+	destroy_con_cq_qp(con);
+destroy_cm:
+	destroy_cm(con);
+
+	return err;
+}
+
+static void ibtrs_clt_sess_up(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	int up;
+
+	/*
+	 * We can fire RECONNECTED event only when all paths were
+	 * connected on ibtrs_clt_open(), then each was disconnected
+	 * and the first one connected again.  That's why this nasty
+	 * game with counter value.
+	 */
+
+	mutex_lock(&clt->paths_ev_mutex);
+	up = ++clt->paths_up;
+	/*
+	 * Here it is safe to access paths num directly since up counter
+	 * is greater than MAX_PATHS_NUM only while ibtrs_clt_open() is
+	 * in progress, thus path removals are impossible.
+	 */
+	if (up > MAX_PATHS_NUM && up == MAX_PATHS_NUM + clt->paths_num)
+		clt->paths_up = clt->paths_num;
+	else if (up == 1)
+		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_RECONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+	sess->reconnect_attempts = 0;
+	sess->stats.reconnects.successful_cnt++;
+}
+
+static void ibtrs_clt_sess_down(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&clt->paths_ev_mutex);
+	WARN_ON(!clt->paths_up);
+	if (--clt->paths_up == 0)
+		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_DISCONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+}
+
+static void ibtrs_clt_stop_and_destroy_conns(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_con *con;
+	unsigned int cid;
+
+	WARN_ON(sess->state == IBTRS_CLT_CONNECTED);
+
+	/*
+	 * Possible race with ibtrs_clt_open(), when DEVICE_REMOVAL comes
+	 * exactly in between.  Start destroying after it finishes.
+	 */
+	mutex_lock(&sess->init_mutex);
+	mutex_unlock(&sess->init_mutex);
+
+	/*
+	 * All IO paths must observe !CONNECTED state before we
+	 * free everything.
+	 */
+	synchronize_rcu();
+
+	ibtrs_clt_stop_hb(sess);
+
+	/*
+	 * The order is utterly crucial: first disconnect and complete all
+	 * rdma requests with error (thus set in_use=false for requests),
+	 * then fail outstanding requests checking in_use for each, and
+	 * eventually notify upper layer about session disconnection.
+	 */
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (!sess->s.con[cid])
+			break;
+		con = to_clt_con(sess->s.con[cid]);
+		stop_cm(con);
+	}
+	fail_all_outstanding_reqs(sess);
+	free_sess_reqs(sess);
+	ibtrs_clt_sess_down(sess);
+
+	/*
+	 * Wait for graceful shutdown, namely when peer side invokes
+	 * rdma_disconnect(). 'connected_cnt' is decremented only on
+	 * CM events, thus if the other side has crashed and hb has detected
+	 * that something is wrong, we will be stuck here for exactly timeout
+	 * ms, since CM does not fire anything.  That is fine, we are not in
+	 * a hurry.
+	 */
+	wait_event_timeout(sess->state_wq, !atomic_read(&sess->connected_cnt),
+			   msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (!sess->s.con[cid])
+			break;
+		con = to_clt_con(sess->s.con[cid]);
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+}
+
+static void ibtrs_clt_remove_path_from_arr(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_clt_sess *next;
+	int cpu;
+
+	mutex_lock(&clt->paths_mutex);
+	list_del_rcu(&sess->s.entry);
+
+	/* Make sure everybody observes path removal. */
+	synchronize_rcu();
+
+	/*
+	 * Decrement paths number only after grace period, because
+	 * caller of do_each_path() must first observe the list without
+	 * the path and only then the decremented paths number.
+	 *
+	 * Otherwise there can be the following situation:
+	 *    o Two paths exist and IO is coming.
+	 *    o One path is removed:
+	 *      CPU#0                          CPU#1
+	 *      do_each_path():                ibtrs_clt_remove_path_from_arr():
+	 *          path = get_next_path()
+	 *          ^^^                            list_del_rcu(path)
+	 *          [!CONNECTED path]              clt->paths_num--
+	 *                                              ^^^^^^^^^
+	 *          load clt->paths_num                 from 2 to 1
+	 *                    ^^^^^^^^^
+	 *                    sees 1
+	 *
+	 *      path is observed as !CONNECTED, but do_each_path() loop
+	 *      ends, because expression i < clt->paths_num is false.
+	 */
+	clt->paths_num--;
+
+	next = list_next_or_null_rr_rcu(&clt->paths_list, &sess->s.entry,
+					typeof(*next), s.entry);
+
+	/*
+	 * Pcpu paths can still point to the path which is going to be
+	 * removed, so change the pointer manually.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct ibtrs_clt_sess **ppcpu_path;
+
+		ppcpu_path = per_cpu_ptr(clt->pcpu_path, cpu);
+		if (*ppcpu_path != sess)
+			/*
+			 * synchronize_rcu() was called just after deleting
+			 * entry from the list, thus IO code path cannot
+			 * change pointer back to the pointer which is going
+			 * to be removed, we are safe here.
+			 */
+			continue;
+
+		/*
+		 * We race with IO code path, which also changes pointer,
+		 * thus we have to be careful not to override it.
+		 */
+		cmpxchg(ppcpu_path, sess, next);
+	}
+	mutex_unlock(&clt->paths_mutex);
+}
+
+static inline bool __ibtrs_clt_path_exists(struct ibtrs_clt *clt,
+					   struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt_sess *sess;
+
+	list_for_each_entry(sess, &clt->paths_list, s.entry)
+		if (!sockaddr_cmp((struct sockaddr *)&sess->s.dst_addr,
+				  (struct sockaddr *)addr->dst))
+			return true;
+
+	return false;
+}
+
+static bool ibtrs_clt_path_exists(struct ibtrs_clt *clt,
+				  struct ibtrs_addr *addr)
+{
+	bool res;
+
+	mutex_lock(&clt->paths_mutex);
+	res = __ibtrs_clt_path_exists(clt, addr);
+	mutex_unlock(&clt->paths_mutex);
+
+	return res;
+}
+
+static int ibtrs_clt_add_path_to_arr(struct ibtrs_clt_sess *sess,
+				     struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	int err = 0;
+
+	mutex_lock(&clt->paths_mutex);
+	if (!__ibtrs_clt_path_exists(clt, addr)) {
+
+		clt->paths_num++;
+
+		/*
+		 * First increase paths_num, wait for GP and only then
+		 * add the path to the list.  Why?  Since we add the path in
+		 * !CONNECTED state, the explanation is similar to what has
+		 * been written in ibtrs_clt_remove_path_from_arr().
+		 */
+		synchronize_rcu();
+
+		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+	} else
+		err = -EEXIST;
+	mutex_unlock(&clt->paths_mutex);
+
+	return err;
+}
+
+static void ibtrs_clt_close_work(struct work_struct *work)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(work, struct ibtrs_clt_sess, close_work);
+
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	ibtrs_clt_stop_and_destroy_conns(sess);
+	/*
+	 * Sounds stupid, huh?  No, it is not.  Consider this sequence:
+	 *
+	 *   #CPU0                              #CPU1
+	 *   1.  CONNECTED->RECONNECTING
+	 *   2.                                 RECONNECTING->CLOSING
+	 *   3.  queue_work(&reconnect_dwork)
+	 *   4.                                 queue_work(&close_work);
+	 *   5.  reconnect_work();              close_work();
+	 *
+	 * To avoid that case do cancel twice: before and after.
+	 */
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSED);
+}
+
+static void ibtrs_clt_close_conns(struct ibtrs_clt_sess *sess, bool wait)
+{
+	if (ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSING))
+		queue_work(ibtrs_wq, &sess->close_work);
+	if (wait)
+		flush_work(&sess->close_work);
+}
+
+static int init_conns(struct ibtrs_clt_sess *sess)
+{
+	unsigned int cid;
+	int err;
+
+	/*
+	 * Each time session connections are (re)established, increase the
+	 * reconnect counter to avoid clashes with previous sessions which
+	 * are not yet closed on the server side.
+	 */
+	sess->s.recon_cnt++;
+
+	/* Establish all RDMA connections */
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		err = create_con(sess, cid);
+		if (unlikely(err))
+			goto destroy;
+
+		err = create_cm(to_clt_con(sess->s.con[cid]));
+		if (unlikely(err)) {
+			destroy_con(to_clt_con(sess->s.con[cid]));
+			goto destroy;
+		}
+	}
+	err = alloc_sess_reqs(sess);
+	if (unlikely(err))
+		goto destroy;
+
+	ibtrs_clt_start_hb(sess);
+
+	return 0;
+
+destroy:
+	while (cid--) {
+		struct ibtrs_clt_con *con = to_clt_con(sess->s.con[cid]);
+
+		stop_cm(con);
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+	/*
+	 * If we've never taken async path and got an error, say,
+	 * doing rdma_resolve_addr(), switch to CONNECTING_ERR state
+	 * manually to keep reconnecting.
+	 */
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+static int ibtrs_rdma_addr_resolved(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int err;
+
+	err = create_con_cq_qp(con);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "create_con_cq_qp(), err: %d\n", err);
+		return err;
+	}
+	err = rdma_resolve_route(con->c.cm_id, IBTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "Resolving route failed, err: %d\n", err);
+		destroy_con_cq_qp(con);
+	}
+
+	return err;
+}
+
+static int ibtrs_rdma_route_resolved(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_msg_conn_req msg;
+	struct rdma_conn_param param;
+
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.retry_count = clamp(retry_cnt, MIN_RTR_CNT, MAX_RTR_CNT);
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	/*
+	 * These two fields are part of struct cma_hdr, which is shared
+	 * with private_data in case of AF_IB, so zero them to avoid
+	 * wrong validation inside cma.c on receiver side.
+	 */
+	msg.__cma_version = 0;
+	msg.__ip_version = 0;
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_PROTO_VER);
+	msg.cid = cpu_to_le16(con->c.cid);
+	msg.cid_num = cpu_to_le16(sess->s.con_num);
+	msg.recon_cnt = cpu_to_le16(sess->s.recon_cnt);
+	uuid_copy(&msg.sess_uuid, &sess->s.uuid);
+	uuid_copy(&msg.paths_uuid, &clt->paths_uuid);
+
+	err = rdma_connect(con->c.cm_id, &param);
+	if (err)
+		ibtrs_err(sess, "rdma_connect(): %d\n", err);
+
+	return err;
+}
+
+static int ibtrs_rdma_conn_established(struct ibtrs_clt_con *con,
+				       struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt *clt = sess->clt;
+	const struct ibtrs_msg_conn_rsp *msg;
+	u16 version, queue_depth;
+	int errno;
+	u8 len;
+
+	msg = ev->param.conn.private_data;
+	len = ev->param.conn.private_data_len;
+	if (unlikely(len < sizeof(*msg))) {
+		ibtrs_err(sess, "Invalid IBTRS connection response\n");
+		return -ECONNRESET;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != IBTRS_MAGIC)) {
+		ibtrs_err(sess, "Invalid IBTRS magic\n");
+		return -ECONNRESET;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != IBTRS_PROTO_VER_MAJOR)) {
+		ibtrs_err(sess, "Unsupported major IBTRS version: %d, expected %d\n",
+			  version >> 8, IBTRS_PROTO_VER_MAJOR);
+		return -ECONNRESET;
+	}
+	errno = le16_to_cpu(msg->errno);
+	if (unlikely(errno)) {
+		ibtrs_err(sess, "Invalid IBTRS message: errno %d\n",
+			  errno);
+		return -ECONNRESET;
+	}
+	if (con->c.cid == 0) {
+		queue_depth = le16_to_cpu(msg->queue_depth);
+
+		if (queue_depth > MAX_SESS_QUEUE_DEPTH) {
+			ibtrs_err(sess, "Invalid IBTRS message: queue=%d\n",
+				  queue_depth);
+			return -ECONNRESET;
+		}
+		if (!sess->rbufs || sess->queue_depth < queue_depth) {
+			kfree(sess->rbufs);
+			sess->rbufs = kcalloc(queue_depth, sizeof(*sess->rbufs),
+					      GFP_KERNEL);
+			if (unlikely(!sess->rbufs)) {
+				ibtrs_err(sess, "Failed to allocate "
+					  "queue_depth=%d\n", queue_depth);
+				return -ENOMEM;
+			}
+		}
+		sess->queue_depth = queue_depth;
+		sess->max_hdr_size = le32_to_cpu(msg->max_hdr_size);
+		sess->max_io_size = le32_to_cpu(msg->max_io_size);
+		sess->chunk_size = sess->max_io_size + sess->max_hdr_size;
+
+		/*
+		 * Global queue depth and IO size are always a minimum.
+		 * If on reconnection the server sends a slightly higher
+		 * value, the client ignores it and uses the cached minimum.
+		 *
+		 * Since we can have several sessions (paths) reestablishing
+		 * connections in parallel, use a lock.
+		 */
+		mutex_lock(&clt->paths_mutex);
+		clt->queue_depth = min_not_zero(sess->queue_depth,
+						clt->queue_depth);
+		clt->max_io_size = min_not_zero(sess->max_io_size,
+						clt->max_io_size);
+		mutex_unlock(&clt->paths_mutex);
+
+		/*
+		 * Cache the hca_port and hca_name for sysfs
+		 */
+		sess->hca_port = con->c.cm_id->port_num;
+		scnprintf(sess->hca_name, sizeof(sess->hca_name), "%s",
+			  sess->s.dev->ib_dev->name);
+	}
+
+	return 0;
+}
+
+static int ibtrs_rdma_conn_rejected(struct ibtrs_clt_con *con,
+				    struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	const struct ibtrs_msg_conn_rsp *msg;
+	const char *rej_msg;
+	int status, errno;
+	u8 data_len;
+
+	status = ev->status;
+	rej_msg = rdma_reject_msg(con->c.cm_id, status);
+	msg = rdma_consumer_reject_data(con->c.cm_id, ev, &data_len);
+
+	if (msg && data_len >= sizeof(*msg)) {
+		errno = (int16_t)le16_to_cpu(msg->errno);
+		if (errno == -EBUSY)
+			ibtrs_err(sess,
+				  "Previous session still exists on the "
+				  "server, please reconnect later\n");
+		else
+			ibtrs_err(sess,
+				  "Connect rejected: status %d (%s), ibtrs "
+				  "errno %d\n", status, rej_msg, errno);
+	} else {
+		ibtrs_err(sess,
+			  "Connect rejected but with malformed message: "
+			  "status %d (%s)\n", status, rej_msg);
+	}
+
+	return -ECONNRESET;
+}
+
+static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (ibtrs_clt_change_state_from_to(sess,
+					   IBTRS_CLT_CONNECTED,
+					   IBTRS_CLT_RECONNECTING)) {
+		/*
+		 * Normal scenario, reconnect if we were successfully connected
+		 */
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
+	} else {
+		/*
+		 * The error happened while establishing a new connection,
+		 * so notify the waiter with an error state; the waiter must
+		 * clean up the rest and reconnect if needed.
+		 */
+		ibtrs_clt_change_state_from_to(sess,
+					       IBTRS_CLT_CONNECTING,
+					       IBTRS_CLT_CONNECTING_ERR);
+	}
+}
+
+static inline void flag_success_on_conn(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	atomic_inc(&sess->connected_cnt);
+	con->cm_err = 1;
+}
+
+static inline void flag_error_on_conn(struct ibtrs_clt_con *con, int cm_err)
+{
+	if (con->cm_err == 1) {
+		struct ibtrs_clt_sess *sess;
+
+		sess = to_clt_sess(con->c.sess);
+		if (atomic_dec_and_test(&sess->connected_cnt))
+			wake_up(&sess->state_wq);
+	}
+	con->cm_err = cm_err;
+}
+
+static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_con *con = cm_id->context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int cm_err = 0;
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		cm_err = ibtrs_rdma_addr_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ROUTE_RESOLVED:
+		cm_err = ibtrs_rdma_route_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ESTABLISHED:
+		con->cm_err = ibtrs_rdma_conn_established(con, ev);
+		if (likely(!con->cm_err)) {
+			/*
+			 * Report success and wake up. Here we abuse state_wq,
+			 * i.e. wake up without state change, but we set cm_err.
+			 */
+			flag_success_on_conn(con);
+			wake_up(&sess->state_wq);
+			return 0;
+		}
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+		cm_err = ibtrs_rdma_conn_rejected(con, ev);
+		break;
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		ibtrs_wrn(sess, "CM error event %d\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_ADDR_ERROR:
+	case RDMA_CM_EVENT_ROUTE_ERROR:
+		cm_err = -EHOSTUNREACH;
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		/*
+		 * Device removal is a special case.  Queue close and return 0.
+		 */
+		ibtrs_clt_close_conns(sess, false);
+		return 0;
+	default:
+		ibtrs_err(sess, "Unexpected RDMA CM event (%d)\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	}
+
+	if (cm_err) {
+		/*
+		 * A CM error matters only during connection establishment;
+		 * in other cases we rely on the normal reconnect procedure.
+		 */
+		flag_error_on_conn(con, cm_err);
+		ibtrs_rdma_error_recovery(con);
+	}
+
+	return 0;
+}
+
+static void ibtrs_clt_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	ibtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info request send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+		return;
+	}
+
+	ibtrs_clt_update_wc_stats(con);
+}
+
+static int process_info_rsp(struct ibtrs_clt_sess *sess,
+			    const struct ibtrs_msg_info_rsp *msg)
+{
+	unsigned int sg_cnt, total_len;
+	int i, sgi;
+
+	sg_cnt = le16_to_cpu(msg->sg_cnt);
+	if (unlikely(!sg_cnt))
+		return -EINVAL;
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and
+	 * the offset inside the memory chunk.
+	 */
+	if (unlikely((ilog2(sg_cnt-1)+1) + (ilog2(sess->chunk_size-1)+1) >
+		     MAX_IMM_PAYL_BITS)) {
+		ibtrs_err(sess, "RDMA immediate size (%db) not enough to "
+			  "encode %d buffers of size %dB\n", MAX_IMM_PAYL_BITS,
+			  sg_cnt, sess->chunk_size);
+		return -EINVAL;
+	}
+	if (unlikely(!sg_cnt || (sess->queue_depth % sg_cnt))) {
+		ibtrs_err(sess, "Incorrect sg_cnt %d, is not multiple\n",
+			  sg_cnt);
+		return -EINVAL;
+	}
+	total_len = 0;
+	for (sgi = 0, i = 0; sgi < sg_cnt && i < sess->queue_depth; sgi++) {
+		const struct ibtrs_sg_desc *desc = &msg->desc[sgi];
+		u32 len, rkey;
+		u64 addr;
+
+		addr = le64_to_cpu(desc->addr);
+		rkey = le32_to_cpu(desc->key);
+		len  = le32_to_cpu(desc->len);
+
+		total_len += len;
+
+		if (unlikely(!len || (len % sess->chunk_size))) {
+			ibtrs_err(sess, "Incorrect [%d].len %d\n", sgi, len);
+			return -EINVAL;
+		}
+		for ( ; len && i < sess->queue_depth; i++) {
+			sess->rbufs[i].addr = addr;
+			sess->rbufs[i].rkey = rkey;
+
+			len  -= sess->chunk_size;
+			addr += sess->chunk_size;
+		}
+	}
+	/* Sanity check */
+	if (unlikely(sgi != sg_cnt || i != sess->queue_depth)) {
+		ibtrs_err(sess, "Incorrect sg vector, not fully mapped\n");
+		return -EINVAL;
+	}
+	if (unlikely(total_len != sess->chunk_size * sess->queue_depth)) {
+		ibtrs_err(sess, "Incorrect total_len %d\n", total_len);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void ibtrs_clt_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_info_rsp *msg;
+	enum ibtrs_clt_state state;
+	struct ibtrs_iu *iu;
+	size_t rx_sz;
+	int err;
+
+	state = IBTRS_CLT_CONNECTING_ERR;
+
+	WARN_ON(con->c.cid);
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info response recv failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto out;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, iu->dma_addr,
+				   iu->size, DMA_FROM_DEVICE);
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != IBTRS_MSG_INFO_RSP)) {
+		ibtrs_err(sess, "Sess info response is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto out;
+	}
+	rx_sz  = sizeof(*msg);
+	rx_sz += sizeof(msg->desc[0]) * le16_to_cpu(msg->sg_cnt);
+	if (unlikely(wc->byte_len < rx_sz)) {
+		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	err = process_info_rsp(sess, msg);
+	if (unlikely(err))
+		goto out;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err))
+		goto out;
+
+	state = IBTRS_CLT_CONNECTED;
+
+out:
+	ibtrs_clt_update_wc_stats(con);
+	ibtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+	ibtrs_clt_change_state(sess, state);
+}
+
+static int ibtrs_send_sess_info(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_con *usr_con = to_clt_con(sess->s.con[0]);
+	struct ibtrs_msg_info_req *msg;
+	struct ibtrs_iu *tx_iu, *rx_iu;
+	size_t rx_sz;
+	int err;
+
+	rx_sz  = sizeof(struct ibtrs_msg_info_rsp);
+	rx_sz += sizeof(u64) * MAX_SESS_QUEUE_DEPTH;
+
+	tx_iu = ibtrs_iu_alloc(0, sizeof(struct ibtrs_msg_info_req), GFP_KERNEL,
+			       sess->s.dev->ib_dev, DMA_TO_DEVICE,
+			       ibtrs_clt_info_req_done);
+	rx_iu = ibtrs_iu_alloc(0, rx_sz, GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_FROM_DEVICE, ibtrs_clt_info_rsp_done);
+	if (unlikely(!tx_iu || !rx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(): no memory\n");
+		err = -ENOMEM;
+		goto out;
+	}
+	/* Prepare for getting info response */
+	err = ibtrs_iu_post_recv(&usr_con->c, rx_iu);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_recv(), err: %d\n", err);
+		goto out;
+	}
+	rx_iu = NULL;
+
+	msg = tx_iu->buf;
+	msg->type = cpu_to_le16(IBTRS_MSG_INFO_REQ);
+	memcpy(msg->sessname, sess->s.sessname, sizeof(msg->sessname));
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, tx_iu->dma_addr,
+				      tx_iu->size, DMA_TO_DEVICE);
+
+	/* Send info request */
+	err = ibtrs_iu_post_send(&usr_con->c, tx_iu, sizeof(*msg), NULL);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_send(), err: %d\n", err);
+		goto out;
+	}
+	tx_iu = NULL;
+
+	/* Wait for state change */
+	wait_event_interruptible_timeout(sess->state_wq,
+				sess->state != IBTRS_CLT_CONNECTING,
+				msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(sess->state != IBTRS_CLT_CONNECTED)) {
+		if (sess->state == IBTRS_CLT_CONNECTING_ERR)
+			err = -ECONNRESET;
+		else
+			err = -ETIMEDOUT;
+		goto out;
+	}
+
+out:
+	if (tx_iu)
+		ibtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+	if (rx_iu)
+		ibtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+	if (unlikely(err))
+		/* If we've never taken async path because of malloc problems */
+		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+/**
+ * init_sess() - establishes all session connections and does handshake
+ *
+ * In case of error the caller must run the full close or reconnect
+ * procedure, because async reconnect or close work may have been started.
+ */
+static int init_sess(struct ibtrs_clt_sess *sess)
+{
+	int err;
+
+	mutex_lock(&sess->init_mutex);
+	err = init_conns(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "init_conns(), err: %d\n", err);
+		goto out;
+	}
+	err = ibtrs_send_sess_info(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_send_sess_info(), err: %d\n", err);
+		goto out;
+	}
+	ibtrs_clt_sess_up(sess);
+out:
+	mutex_unlock(&sess->init_mutex);
+
+	return err;
+}
+
+static void ibtrs_clt_reconnect_work(struct work_struct *work)
+{
+	struct ibtrs_clt_sess *sess;
+	struct ibtrs_clt *clt;
+	unsigned int delay_ms;
+	int err;
+
+	sess = container_of(to_delayed_work(work), struct ibtrs_clt_sess,
+			    reconnect_dwork);
+	clt = sess->clt;
+
+	if (ibtrs_clt_state(sess) == IBTRS_CLT_CLOSING)
+		/* User requested closing */
+		return;
+
+	if (sess->reconnect_attempts >= clt->max_reconnect_attempts) {
+		/* Close a session completely if max attempts is reached */
+		ibtrs_clt_close_conns(sess, false);
+		return;
+	}
+	sess->reconnect_attempts++;
+
+	/* Stop everything */
+	ibtrs_clt_stop_and_destroy_conns(sess);
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING);
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto reconnect_again;
+
+	return;
+
+reconnect_again:
+	if (ibtrs_clt_change_state(sess, IBTRS_CLT_RECONNECTING)) {
+		sess->stats.reconnects.fail_cnt++;
+		delay_ms = clt->reconnect_delay_sec * 1000;
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork,
+				   msecs_to_jiffies(delay_ms));
+	}
+}
+
+static void ibtrs_clt_dev_release(struct device *dev)
+{
+	/* Nobody plays with device references, so nop */
+}
+
+static struct ibtrs_clt *alloc_clt(const char *sessname, size_t paths_num,
+				   short port, size_t pdu_sz,
+				   void *priv, link_clt_ev_fn *link_ev,
+				   unsigned int max_segments,
+				   unsigned int reconnect_delay_sec,
+				   unsigned int max_reconnect_attempts)
+{
+	struct ibtrs_clt *clt;
+	int err;
+
+	if (unlikely(!paths_num || paths_num > MAX_PATHS_NUM))
+		return ERR_PTR(-EINVAL);
+
+	if (unlikely(strlen(sessname) >= sizeof(clt->sessname)))
+		return ERR_PTR(-EINVAL);
+
+	clt = kzalloc(sizeof(*clt), GFP_KERNEL);
+	if (unlikely(!clt))
+		return ERR_PTR(-ENOMEM);
+
+	clt->pcpu_path = alloc_percpu(typeof(*clt->pcpu_path));
+	if (unlikely(!clt->pcpu_path)) {
+		kfree(clt);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	uuid_gen(&clt->paths_uuid);
+	INIT_LIST_HEAD_RCU(&clt->paths_list);
+	clt->paths_num = paths_num;
+	clt->paths_up = MAX_PATHS_NUM;
+	clt->port = port;
+	clt->pdu_sz = pdu_sz;
+	clt->max_segments = max_segments;
+	clt->reconnect_delay_sec = reconnect_delay_sec;
+	clt->max_reconnect_attempts = max_reconnect_attempts;
+	clt->priv = priv;
+	clt->link_ev = link_ev;
+	clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	strlcpy(clt->sessname, sessname, sizeof(clt->sessname));
+	init_waitqueue_head(&clt->tags_wait);
+	mutex_init(&clt->paths_ev_mutex);
+	mutex_init(&clt->paths_mutex);
+
+	clt->dev.class = ibtrs_dev_class;
+	clt->dev.release = ibtrs_clt_dev_release;
+	dev_set_name(&clt->dev, "%s", sessname);
+
+	err = device_register(&clt->dev);
+	if (unlikely(err))
+		goto percpu_free;
+
+	err = ibtrs_clt_create_sysfs_root_folders(clt);
+	if (unlikely(err))
+		goto dev_unregister;
+
+	return clt;
+
+dev_unregister:
+	/* Nobody plays with dev refs, so dev.release() is nop */
+	device_unregister(&clt->dev);
+percpu_free:
+	free_percpu(clt->pcpu_path);
+	kfree(clt);
+
+	return ERR_PTR(err);
+}
+
+static void wait_for_inflight_tags(struct ibtrs_clt *clt)
+{
+	if (clt->tags_map) {
+		size_t sz = clt->queue_depth;
+
+		wait_event(clt->tags_wait,
+			   find_first_bit(clt->tags_map, sz) >= sz);
+	}
+}
+
+static void free_clt(struct ibtrs_clt *clt)
+{
+	ibtrs_clt_destroy_sysfs_root_folders(clt);
+	wait_for_inflight_tags(clt);
+	free_tags(clt);
+	free_percpu(clt->pcpu_path);
+	/* Nobody plays with dev refs, so dev.release() is nop */
+	device_unregister(&clt->dev);
+	kfree(clt);
+}
+
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct ibtrs_addr *paths,
+				 size_t paths_num,
+				 short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts)
+{
+	struct ibtrs_clt_sess *sess, *tmp;
+	struct ibtrs_clt *clt;
+	int err, i;
+
+	clt = alloc_clt(sessname, paths_num, port, pdu_sz, priv, link_ev,
+			max_segments, reconnect_delay_sec,
+			max_reconnect_attempts);
+	if (unlikely(IS_ERR(clt))) {
+		err = PTR_ERR(clt);
+		goto out;
+	}
+	for (i = 0; i < paths_num; i++) {
+		struct ibtrs_clt_sess *sess;
+
+		sess = alloc_sess(clt, &paths[i], nr_cons_per_session,
+				  max_segments);
+		if (unlikely(IS_ERR(sess))) {
+			err = PTR_ERR(sess);
+			ibtrs_err(clt, "alloc_sess(), err: %d\n", err);
+			goto close_all_sess;
+		}
+		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+
+		err = init_sess(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+
+		err = ibtrs_clt_create_sess_files(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+	}
+	err = alloc_tags(clt);
+	if (unlikely(err)) {
+		ibtrs_err(clt, "alloc_tags(), err: %d\n", err);
+		goto close_all_sess;
+	}
+	err = ibtrs_clt_create_sysfs_root_files(clt);
+	if (unlikely(err))
+		goto close_all_sess;
+
+	/*
+	 * There is a race if someone decides to completely remove a
+	 * newly created path using its sysfs entry.  To avoid the race we
+	 * use a simple 'opened' flag, see ibtrs_clt_remove_path_from_sysfs().
+	 */
+	clt->opened = true;
+
+	/* Do not let module be unloaded if client is alive */
+	__module_get(THIS_MODULE);
+
+	return clt;
+
+close_all_sess:
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		ibtrs_clt_destroy_sess_files(sess, NULL);
+		ibtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+
+out:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(ibtrs_clt_open);
+
+void ibtrs_clt_close(struct ibtrs_clt *clt)
+{
+	struct ibtrs_clt_sess *sess, *tmp;
+
+	/* Firstly forbid sysfs access */
+	ibtrs_clt_destroy_sysfs_root_files(clt);
+	ibtrs_clt_destroy_sysfs_root_folders(clt);
+
+	/* Now it is safe to iterate over all paths without locks */
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		ibtrs_clt_destroy_sess_files(sess, NULL);
+		ibtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(ibtrs_clt_close);
+
+int ibtrs_clt_reconnect_from_sysfs(struct ibtrs_clt_sess *sess)
+{
+	enum ibtrs_clt_state old_state;
+	int err = -EBUSY;
+	bool changed;
+
+	changed = ibtrs_clt_change_state_get_old(sess, IBTRS_CLT_RECONNECTING,
+						 &old_state);
+	if (changed) {
+		sess->reconnect_attempts = 0;
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
+	}
+	if (changed || old_state == IBTRS_CLT_RECONNECTING) {
+		/*
+		 * flush_delayed_work() queues pending work for immediate
+		 * execution, so do the flush if we have queued something
+		 * right now or work is pending.
+		 */
+		flush_delayed_work(&sess->reconnect_dwork);
+		err = ibtrs_clt_sess_is_connected(sess) ? 0 : -ENOTCONN;
+	}
+
+	return err;
+}
+
+int ibtrs_clt_disconnect_from_sysfs(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_clt_close_conns(sess, true);
+
+	return 0;
+}
+
+int ibtrs_clt_remove_path_from_sysfs(struct ibtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	enum ibtrs_clt_state old_state;
+	bool changed;
+
+	/*
+	 * That can happen only when userspace tries to remove path
+	 * very early, when ibtrs_clt_open() is not yet finished.
+	 */
+	if (unlikely(!clt->opened))
+		return -EBUSY;
+
+	/*
+	 * Continue stopping path till state was changed to DEAD or
+	 * state was observed as DEAD:
+	 * 1. State was changed to DEAD - we were fast and nobody
+	 *    invoked ibtrs_clt_reconnect(), which can again start
+	 *    reconnecting.
+	 * 2. State was observed as DEAD - we have someone in parallel
+	 *    removing the path.
+	 */
+	do {
+		ibtrs_clt_close_conns(sess, true);
+	} while (!(changed = ibtrs_clt_change_state_get_old(sess,
+							    IBTRS_CLT_DEAD,
+							    &old_state)) &&
+		   old_state != IBTRS_CLT_DEAD);
+
+	/*
+	 * If state was successfully changed to DEAD, commit suicide.
+	 */
+	if (likely(changed)) {
+		ibtrs_clt_destroy_sess_files(sess, sysfs_self);
+		ibtrs_clt_remove_path_from_arr(sess);
+		free_sess(sess);
+	}
+
+	return 0;
+}
+
+void ibtrs_clt_set_max_reconnect_attempts(struct ibtrs_clt *clt, int value)
+{
+	clt->max_reconnect_attempts = (unsigned int)value;
+}
+
+int ibtrs_clt_get_max_reconnect_attempts(const struct ibtrs_clt *clt)
+{
+	return (int)clt->max_reconnect_attempts;
+}
+
+static int ibtrs_post_rdma_write_sg(struct ibtrs_clt_con *con,
+				    struct ibtrs_clt_io_req *req,
+				    struct ibtrs_rbuf *rbuf,
+				    u32 size, u32 imm)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ib_sge *sge = req->sge;
+	enum ib_send_flags flags;
+	struct scatterlist *sg;
+	size_t num_sge;
+	int i;
+
+	for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+		sge[i].addr   = sg_dma_address(sg);
+		sge[i].length = sg_dma_len(sg);
+		sge[i].lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+	}
+	sge[i].addr   = req->iu->dma_addr;
+	sge[i].length = size;
+	sge[i].lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+
+	num_sge = 1 + req->sg_cnt;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
+				      size, DMA_TO_DEVICE);
+
+	return ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, sge, num_sge,
+					    rbuf->rkey, rbuf->addr, imm,
+					    flags, NULL);
+}
+
+static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_rdma_write *msg;
+
+	struct ibtrs_rbuf *rbuf;
+	int ret, count = 0;
+	u32 imm, buf_id;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		ibtrs_wrn(sess, "Write request failed, size too big %zu > %d\n",
+			  tsize, sess->chunk_size);
+		return -EMSGSIZE;
+	}
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(sess->s.dev->ib_dev, req->sglist,
+				      req->sg_cnt, req->dir);
+		if (unlikely(!count)) {
+			ibtrs_wrn(sess, "Write request failed, map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put ibtrs msg after sg and user message */
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(IBTRS_MSG_WRITE);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	/* ibtrs message on server side will be after user data and message */
+	imm = req->tag->mem_off + req->data_len + req->usr_len;
+	imm = ibtrs_to_io_req_imm(imm);
+	buf_id = req->tag->mem_id;
+	req->sg_size = tsize;
+	rbuf = &sess->rbufs[buf_id];
+
+	/*
+	 * Update stats now, after request is successfully sent it is not
+	 * safe anymore to touch it.
+	 */
+	ibtrs_clt_update_all_stats(req, WRITE);
+
+	ret = ibtrs_post_rdma_write_sg(req->con, req, rbuf,
+				       req->usr_len + sizeof(*msg),
+				       imm);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "Write request failed: %d\n", ret);
+		ibtrs_clt_decrease_inflight(&sess->stats);
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+static int ibtrs_map_sg_fr(struct ibtrs_clt_io_req *req, size_t count)
+{
+	int nr;
+
+	/* Align the MR to a 4K page size to match the block virt boundary */
+	nr = ib_map_mr_sg(req->mr, req->sglist, count, NULL, SZ_4K);
+	if (unlikely(nr < req->sg_cnt)) {
+		if (nr < 0)
+			return nr;
+		return -EINVAL;
+	}
+	ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
+
+	return nr;
+}
+
+static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_rdma_read *msg;
+	struct ibtrs_ib_dev *dev;
+	struct scatterlist *sg;
+
+	struct ib_reg_wr rwr;
+	struct ib_send_wr *wr = NULL;
+
+	int i, ret, count = 0;
+	u32 imm, buf_id;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	dev = sess->s.dev;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		ibtrs_wrn(sess, "Read request failed, message size is"
+			  " %zu, bigger than CHUNK_SIZE %d\n", tsize,
+			  sess->chunk_size);
+		return -EMSGSIZE;
+	}
+
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(dev->ib_dev, req->sglist, req->sg_cnt,
+				      req->dir);
+		if (unlikely(!count)) {
+			ibtrs_wrn(sess, "Read request failed, "
+				  "dma map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put our message into req->buf after user message */
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(IBTRS_MSG_READ);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	if (count > noreg_cnt) {
+		ret = ibtrs_map_sg_fr(req, count);
+		if (ret < 0) {
+			ibtrs_err_rl(sess,
+				     "Read request failed, failed to map "
+				     "fast reg. data, err: %d\n", ret);
+			ib_dma_unmap_sg(dev->ib_dev, req->sglist, req->sg_cnt,
+					req->dir);
+			return ret;
+		}
+		memset(&rwr, 0, sizeof(rwr));
+		rwr.wr.next = NULL;
+		rwr.wr.opcode = IB_WR_REG_MR;
+		rwr.wr.wr_cqe = &fast_reg_cqe;
+		rwr.wr.num_sge = 0;
+		rwr.mr = req->mr;
+		rwr.key = req->mr->rkey;
+		rwr.access = (IB_ACCESS_LOCAL_WRITE |
+			      IB_ACCESS_REMOTE_WRITE);
+		wr = &rwr.wr;
+
+		msg->sg_cnt = cpu_to_le16(1);
+		msg->flags = cpu_to_le16(ibtrs_invalidate_flag());
+
+		msg->desc[0].addr = cpu_to_le64(req->mr->iova);
+		msg->desc[0].key = cpu_to_le32(req->mr->rkey);
+		msg->desc[0].len = cpu_to_le32(req->mr->length);
+
+		/* Further invalidation is required */
+		req->need_inv = !!ibtrs_invalidate_flag();
+
+	} else {
+		msg->sg_cnt = cpu_to_le16(count);
+		msg->flags = 0;
+
+		for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+			msg->desc[i].addr = cpu_to_le64(sg_dma_address(sg));
+			msg->desc[i].key = cpu_to_le32(dev->ib_pd->unsafe_global_rkey);
+			msg->desc[i].len = cpu_to_le32(sg_dma_len(sg));
+		}
+	}
+	/*
+	 * ibtrs message will be after the space reserved for disk data and
+	 * user message
+	 */
+	imm = req->tag->mem_off + req->data_len + req->usr_len;
+	imm = ibtrs_to_io_req_imm(imm);
+	buf_id = req->tag->mem_id;
+
+	req->sg_size  = sizeof(*msg);
+	req->sg_size += le16_to_cpu(msg->sg_cnt) * sizeof(struct ibtrs_sg_desc);
+	req->sg_size += req->usr_len;
+
+	/*
+	 * Update stats now, after request is successfully sent it is not
+	 * safe anymore to touch it.
+	 */
+	ibtrs_clt_update_all_stats(req, READ);
+
+	ret = ibtrs_post_send_rdma(req->con, req, &sess->rbufs[buf_id],
+				   req->data_len, imm, wr);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "Read request failed: %d\n", ret);
+		ibtrs_clt_decrease_inflight(&sess->stats);
+		req->need_inv = false;
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(dev->ib_dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+int ibtrs_clt_request(int dir, ibtrs_conf_fn *conf, struct ibtrs_clt *clt,
+		      struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		      size_t nr, size_t data_len, struct scatterlist *sg,
+		      unsigned int sg_cnt)
+{
+	struct ibtrs_clt_io_req *req;
+	struct ibtrs_clt_sess *sess;
+
+	enum dma_data_direction dma_dir;
+	int err = -ECONNABORTED, i;
+	size_t usr_len, hdr_len;
+	struct path_it it;
+
+	/* Get kvec length */
+	for (i = 0, usr_len = 0; i < nr; i++)
+		usr_len += vec[i].iov_len;
+
+	if (dir == READ) {
+		hdr_len = sizeof(struct ibtrs_msg_rdma_read) +
+			  sg_cnt * sizeof(struct ibtrs_sg_desc);
+		dma_dir = DMA_FROM_DEVICE;
+	} else {
+		hdr_len = sizeof(struct ibtrs_msg_rdma_write);
+		dma_dir = DMA_TO_DEVICE;
+	}
+
+	do_each_path(sess, clt, &it) {
+		if (unlikely(sess->state != IBTRS_CLT_CONNECTED))
+			continue;
+
+		if (unlikely(usr_len + hdr_len > sess->max_hdr_size)) {
+			ibtrs_wrn_rl(sess, "%s request failed, user message "
+				     "size is %zu and header length %zu, but "
+				     "max size is %u\n",
+				     dir == READ ? "Read" : "Write",
+				     usr_len, hdr_len, sess->max_hdr_size);
+			err = -EMSGSIZE;
+			break;
+		}
+		req = ibtrs_clt_get_req(sess, conf, tag, priv, vec, usr_len,
+					sg, sg_cnt, data_len, dma_dir);
+		if (dir == READ)
+			err = ibtrs_clt_read_req(req);
+		else
+			err = ibtrs_clt_write_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+EXPORT_SYMBOL(ibtrs_clt_request);
+
+int ibtrs_clt_query(struct ibtrs_clt *clt, struct ibtrs_attrs *attr)
+{
+	if (unlikely(!ibtrs_clt_is_connected(clt)))
+		return -ECOMM;
+
+	attr->queue_depth      = clt->queue_depth;
+	attr->max_io_size      = clt->max_io_size;
+	strlcpy(attr->sessname, clt->sessname, sizeof(attr->sessname));
+
+	return 0;
+}
+EXPORT_SYMBOL(ibtrs_clt_query);
+
+int ibtrs_clt_create_path_from_sysfs(struct ibtrs_clt *clt,
+				     struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt_sess *sess;
+	int err;
+
+	if (ibtrs_clt_path_exists(clt, addr))
+		return -EEXIST;
+
+	sess = alloc_sess(clt, addr, nr_cons_per_session, clt->max_segments);
+	if (unlikely(IS_ERR(sess)))
+		return PTR_ERR(sess);
+
+	/*
+	 * It is totally safe to add path in CONNECTING state: coming
+	 * IO will never grab it.  Also it is very important to add
+	 * path before init, since init fires LINK_CONNECTED event.
+	 */
+	err = ibtrs_clt_add_path_to_arr(sess, addr);
+	if (unlikely(err))
+		goto free_sess;
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	err = ibtrs_clt_create_sess_files(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	return 0;
+
+close_sess:
+	ibtrs_clt_remove_path_from_arr(sess);
+	ibtrs_clt_close_conns(sess, true);
+free_sess:
+	free_sess(sess);
+
+	return err;
+}
+
+static int check_module_params(void)
+{
+	if (nr_cons_per_session == 0)
+		nr_cons_per_session = min_t(unsigned int, nr_cpu_ids, U16_MAX);
+
+	return 0;
+}
+
+static int ibtrs_clt_ib_dev_init(struct ibtrs_ib_dev *dev)
+{
+	if (!(dev->ib_dev->attrs.device_cap_flags &
+	      IB_DEVICE_MEM_MGT_EXTENSIONS)) {
+		pr_err("Memory registrations not supported.\n");
+		return -ENOTSUPP;
+	}
+
+	return 0;
+}
+
+static const struct ibtrs_ib_dev_pool_ops dev_pool_ops = {
+	.init = ibtrs_clt_ib_dev_init
+};
+
+static int __init ibtrs_client_init(void)
+{
+	int err;
+
+	pr_info("Loading module %s, version %s, proto %s: "
+		"(retry_cnt: %d, noreg_cnt: %d)\n",
+		KBUILD_MODNAME, IBTRS_VER_STRING, IBTRS_PROTO_VER_STRING,
+		retry_cnt, noreg_cnt);
+
+	ibtrs_ib_dev_pool_init(noreg_cnt ? IB_PD_UNSAFE_GLOBAL_RKEY : 0,
+			       &dev_pool);
+
+	err = check_module_params();
+	if (unlikely(err)) {
+		pr_err("Failed to load module, invalid module parameters,"
+		       " err: %d\n", err);
+		return err;
+	}
+	ibtrs_dev_class = class_create(THIS_MODULE, "ibtrs-client");
+	if (unlikely(IS_ERR(ibtrs_dev_class))) {
+		pr_err("Failed to create ibtrs-client dev class\n");
+		return PTR_ERR(ibtrs_dev_class);
+	}
+	ibtrs_wq = alloc_workqueue("ibtrs_client_wq", WQ_MEM_RECLAIM, 0);
+	if (unlikely(!ibtrs_wq)) {
+		pr_err("Failed to load module, alloc ibtrs_client_wq failed\n");
+		class_destroy(ibtrs_dev_class);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void __exit ibtrs_client_exit(void)
+{
+	destroy_workqueue(ibtrs_wq);
+	class_destroy(ibtrs_dev_class);
+	ibtrs_ib_dev_pool_deinit(&dev_pool);
+}
+
+module_init(ibtrs_client_init);
+module_exit(ibtrs_client_exit);
-- 
2.13.1

* [PATCH v2 08/26] ibtrs: client: statistics functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (6 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 07/26] ibtrs: client: main functionality Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 09/26] ibtrs: client: sysfs interface functions Roman Pen
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This introduces a set of functions used on the client side to account
statistics: RDMA data sent/received, number of IOs in flight, latency,
CPU migrations, etc.  Almost all statistics are collected using percpu
variables.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c | 455 +++++++++++++++++++++++++
 1 file changed, 455 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
new file mode 100644
index 000000000000..af2ed05d2900
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
@@ -0,0 +1,455 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-clt.h"
+
+static inline int ibtrs_clt_ms_to_id(unsigned long ms)
+{
+	int id = ms ? ilog2(ms) - MIN_LOG_LAT + 1 : 0;
+
+	return clamp(id, 0, LOG_LAT_SZ - 1);
+}
+
+void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *stats, bool read,
+			       unsigned long ms)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int id;
+
+	id = ibtrs_clt_ms_to_id(ms);
+	s = this_cpu_ptr(stats->pcpu_stats);
+	if (read) {
+		s->rdma_lat_distr[id].read++;
+		if (s->rdma_lat_max.read < ms)
+			s->rdma_lat_max.read = ms;
+	} else {
+		s->rdma_lat_distr[id].write++;
+		if (s->rdma_lat_max.write < ms)
+			s->rdma_lat_max.write = ms;
+	}
+}
+
+void ibtrs_clt_decrease_inflight(struct ibtrs_clt_stats *stats)
+{
+	atomic_dec(&stats->inflight);
+}
+
+void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt_stats *stats = &sess->stats;
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	cpu = raw_smp_processor_id();
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->wc_comp.cnt++;
+	s->wc_comp.total_cnt++;
+	if (unlikely(con->cpu != cpu)) {
+		s->cpu_migr.to++;
+
+		/* Careful here, override s pointer */
+		s = per_cpu_ptr(stats->pcpu_stats, con->cpu);
+		atomic_inc(&s->cpu_migr.from);
+	}
+}
+
+void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *stats)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.failover_cnt++;
+}
+
+static inline u32 ibtrs_clt_stats_get_avg_wc_cnt(struct ibtrs_clt_stats *stats)
+{
+	u32 cnt = 0;
+	u64 sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct ibtrs_clt_stats_pcpu *s;
+
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		sum += s->wc_comp.total_cnt;
+		cnt += s->wc_comp.cnt;
+	}
+
+	return cnt ? div_u64(sum, cnt) : 0;
+}
+
+int ibtrs_clt_stats_wc_completion_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	return scnprintf(buf, len, "%u\n",
+			 ibtrs_clt_stats_get_avg_wc_cnt(stats));
+}
+
+ssize_t ibtrs_clt_stats_rdma_lat_distr_to_str(struct ibtrs_clt_stats *stats,
+					      char *page, size_t len)
+{
+	struct ibtrs_clt_stats_rdma_lat res[LOG_LAT_SZ];
+	struct ibtrs_clt_stats_rdma_lat max;
+	struct ibtrs_clt_stats_pcpu *s;
+
+	ssize_t cnt = 0;
+	int i, cpu;
+
+	max.write = 0;
+	max.read = 0;
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+		if (max.write < s->rdma_lat_max.write)
+			max.write = s->rdma_lat_max.write;
+		if (max.read < s->rdma_lat_max.read)
+			max.read = s->rdma_lat_max.read;
+	}
+	for (i = 0; i < ARRAY_SIZE(res); i++) {
+		res[i].write = 0;
+		res[i].read = 0;
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+			res[i].write += s->rdma_lat_distr[i].write;
+			res[i].read += s->rdma_lat_distr[i].read;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(res) - 1; i++)
+		cnt += scnprintf(page + cnt, len - cnt,
+				 "< %6d ms: %llu %llu\n",
+				 1 << (i + MIN_LOG_LAT), res[i].read,
+				 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, ">= %5d ms: %llu %llu\n",
+			 1 << (i - 1 + MIN_LOG_LAT), res[i].read,
+			 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, " maximum ms: %llu %llu\n",
+			 max.read, max.write);
+
+	return cnt;
+}
+
+int ibtrs_clt_stats_migration_cnt_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	size_t used;
+	int cpu;
+
+	used = scnprintf(buf, len, "    ");
+	for_each_possible_cpu(cpu)
+		used += scnprintf(buf + used, len - used, " CPU%u", cpu);
+
+	used += scnprintf(buf + used, len - used, "\nfrom:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  atomic_read(&s->cpu_migr.from));
+	}
+
+	used += scnprintf(buf + used, len - used, "\nto  :");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  s->cpu_migr.to);
+	}
+	used += scnprintf(buf + used, len - used, "\n");
+
+	return used;
+}
+
+int ibtrs_clt_stats_reconnects_to_str(struct ibtrs_clt_stats *stats, char *buf,
+				      size_t len)
+{
+	return scnprintf(buf, len, "%d %d\n",
+			 stats->reconnects.successful_cnt,
+			 stats->reconnects.fail_cnt);
+}
+
+ssize_t ibtrs_clt_stats_rdma_to_str(struct ibtrs_clt_stats *stats,
+				    char *page, size_t len)
+{
+	struct ibtrs_clt_stats_rdma sum;
+	struct ibtrs_clt_stats_rdma *r;
+	int cpu;
+
+	memset(&sum, 0, sizeof(sum));
+
+	for_each_possible_cpu(cpu) {
+		r = &per_cpu_ptr(stats->pcpu_stats, cpu)->rdma;
+
+		sum.dir[READ].cnt	  += r->dir[READ].cnt;
+		sum.dir[READ].size_total  += r->dir[READ].size_total;
+		sum.dir[WRITE].cnt	  += r->dir[WRITE].cnt;
+		sum.dir[WRITE].size_total += r->dir[WRITE].size_total;
+		sum.failover_cnt	  += r->failover_cnt;
+	}
+
+	return scnprintf(page, len, "%llu %llu %llu %llu %u %llu\n",
+			 sum.dir[READ].cnt, sum.dir[READ].size_total,
+			 sum.dir[WRITE].cnt, sum.dir[WRITE].size_total,
+			 atomic_read(&stats->inflight), sum.failover_cnt);
+}
+
+int ibtrs_clt_stats_sg_list_distr_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	int i, cpu, cnt;
+
+	cnt = scnprintf(buf, len, "n\\cpu:");
+	for_each_possible_cpu(cpu)
+		cnt += scnprintf(buf + cnt, len - cnt, "%5d", cpu);
+
+	for (i = 0; i < SG_DISTR_SZ; i++) {
+		if (i <= MAX_LIN_SG)
+			cnt += scnprintf(buf + cnt, len - cnt, "\n= %3d:", i);
+		else if (i < SG_DISTR_SZ - 1)
+			cnt += scnprintf(buf + cnt, len - cnt,
+					 "\n< %3d:",
+					 1 << (i + MIN_LOG_SG - MAX_LIN_SG));
+		else
+			cnt += scnprintf(buf + cnt, len - cnt,
+					 "\n>=%3d:",
+					 1 << (i + MIN_LOG_SG - MAX_LIN_SG - 1));
+
+		for_each_possible_cpu(cpu) {
+			unsigned int p, p_i, p_f;
+			u64 total, distr;
+
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			total = s->sg_list_total;
+			distr = s->sg_list_distr[i];
+
+			p = total ? div64_u64(distr * 1000, total) : 0;
+			p_i = p / 10;
+			p_f = p % 10;
+
+			if (distr)
+				cnt += scnprintf(buf + cnt, len - cnt,
+						 " %2u.%01u", p_i, p_f);
+			else
+				cnt += scnprintf(buf + cnt, len - cnt, "    0");
+		}
+	}
+
+	cnt += scnprintf(buf + cnt, len - cnt, "\ntotal:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		cnt += scnprintf(buf + cnt, len - cnt, " %llu",
+				 s->sg_list_total);
+	}
+	cnt += scnprintf(buf + cnt, len - cnt, "\n");
+
+	return cnt;
+}
+
+ssize_t ibtrs_clt_reset_all_help(struct ibtrs_clt_stats *s,
+				 char *page, size_t len)
+{
+	return scnprintf(page, len, "echo 1 to reset all statistics\n");
+}
+
+int ibtrs_clt_reset_rdma_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->rdma, 0, sizeof(s->rdma));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_rdma_lat_distr_stats(struct ibtrs_clt_stats *stats,
+					 bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (enable) {
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			memset(&s->rdma_lat_max, 0, sizeof(s->rdma_lat_max));
+			memset(&s->rdma_lat_distr, 0,
+			       sizeof(s->rdma_lat_distr));
+		}
+	}
+	stats->enable_rdma_lat = enable;
+
+	return 0;
+}
+
+int ibtrs_clt_reset_sg_list_distr_stats(struct ibtrs_clt_stats *stats,
+					bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->sg_list_total, 0, sizeof(s->sg_list_total));
+		memset(&s->sg_list_distr, 0, sizeof(s->sg_list_distr));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_cpu_migr_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->cpu_migr, 0, sizeof(s->cpu_migr));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_reconnects_stat(struct ibtrs_clt_stats *stats, bool enable)
+{
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	memset(&stats->reconnects, 0, sizeof(stats->reconnects));
+
+	return 0;
+}
+
+int ibtrs_clt_reset_wc_comp_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->wc_comp, 0, sizeof(s->wc_comp));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_all_stats(struct ibtrs_clt_stats *s, bool enable)
+{
+	if (enable) {
+		ibtrs_clt_reset_rdma_stats(s, enable);
+		ibtrs_clt_reset_rdma_lat_distr_stats(s, enable);
+		ibtrs_clt_reset_sg_list_distr_stats(s, enable);
+		ibtrs_clt_reset_cpu_migr_stats(s, enable);
+		ibtrs_clt_reset_reconnects_stat(s, enable);
+		ibtrs_clt_reset_wc_comp_stats(s, enable);
+		atomic_set(&s->inflight, 0);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static inline void ibtrs_clt_record_sg_distr(u64 stat[SG_DISTR_SZ], u64 *total,
+					     unsigned int cnt)
+{
+	int i;
+
+	i = cnt > MAX_LIN_SG ? ilog2(cnt) + MAX_LIN_SG - MIN_LOG_SG + 1 : cnt;
+	i = i < SG_DISTR_SZ ? i : SG_DISTR_SZ - 1;
+
+	stat[i]++;
+	(*total)++;
+}
+
+static inline void ibtrs_clt_update_rdma_stats(struct ibtrs_clt_stats *stats,
+					       size_t size, int d)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.dir[d].cnt++;
+	s->rdma.dir[d].size_total += size;
+}
+
+void ibtrs_clt_update_all_stats(struct ibtrs_clt_io_req *req, int dir)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt_stats *stats = &sess->stats;
+	unsigned int len;
+
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	ibtrs_clt_record_sg_distr(s->sg_list_distr, &s->sg_list_total,
+				  req->sg_cnt);
+	len = req->usr_len + req->data_len;
+	ibtrs_clt_update_rdma_stats(stats, len, dir);
+	atomic_inc(&stats->inflight);
+}
+
+int ibtrs_clt_init_stats(struct ibtrs_clt_stats *stats)
+{
+	stats->enable_rdma_lat = false;
+	stats->pcpu_stats = alloc_percpu(typeof(*stats->pcpu_stats));
+	if (unlikely(!stats->pcpu_stats))
+		return -ENOMEM;
+
+	/*
+	 * successful_cnt will be set to 0 after session
+	 * is established for the first time
+	 */
+	stats->reconnects.successful_cnt = -1;
+
+	return 0;
+}
+
+void ibtrs_clt_free_stats(struct ibtrs_clt_stats *stats)
+{
+	free_percpu(stats->pcpu_stats);
+}
-- 
2.13.1


* [PATCH v2 09/26] ibtrs: client: sysfs interface functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (7 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 08/26] ibtrs: client: statistics functions Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 10/26] ibtrs: server: private header with server structs and functions Roman Pen
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the sysfs interface to IBTRS sessions on the client side:

  /sys/devices/virtual/ibtrs-client/<SESS-NAME>/
    *** IBTRS session created by ibtrs_clt_open() API call
    |
    |- max_reconnect_attempts
    |  *** maximum number of reconnect attempts for the session
    |
    |- mpath_policy
    |  *** multipath policy: round-robin or min-inflight
    |
    |- add_path
    |  *** adds another connection path into IBTRS session
    |
    |- paths/<DEST-IP>/
       *** established paths to server in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- reconnect
       |  *** reconnect path
       |
       |- remove_path
       |  *** remove current path
       |
       |- state
       |  *** retrieve current path state
       |
       |- hca_port
       |  *** HCA port number
       |
       |- hca_name
       |  *** HCA name
       |
       |- stats/
          *** current path statistics
          |
          |- cpu_migration
          |- rdma
          |- rdma_lat
          |- reconnects
          |- reset_all
          |- sg_entries
          |- wc_completion
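
For readers who want to poke at the tree from user space, here is a
minimal sketch of how a path to one of the per-path statistics files
could be composed and read.  The helper names ibtrs_stat_path() and
read_attr() are hypothetical, and the session and destination names
are placeholders: the actual directory names depend on the name passed
to ibtrs_clt_open() and on how the destination address is formatted.

```c
#include <stdio.h>
#include <string.h>

#define IBTRS_CLT_SYSFS_ROOT "/sys/devices/virtual/ibtrs-client"

/*
 * Hypothetical helper: build the path of a statistics attribute of
 * one path of a session.  Returns 0 on success, -1 if the result does
 * not fit into 'out'.
 */
static int ibtrs_stat_path(char *out, size_t len, const char *sess,
			   const char *dest, const char *attr)
{
	int n = snprintf(out, len, "%s/%s/paths/%s/stats/%s",
			 IBTRS_CLT_SYSFS_ROOT, sess, dest, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: read the first line of an attribute file and
 * strip the trailing newline.  Returns 0 on success, -1 on error.
 */
static int read_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

Single-line attributes such as rdma or reconnects fit this pattern;
multi-line ones such as rdma_lat or sg_entries would need a loop over
fgets() instead.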

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c | 482 +++++++++++++++++++++++++
 1 file changed, 482 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
new file mode 100644
index 000000000000..c185bbc4fd5c
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
@@ -0,0 +1,482 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+#define MIN_MAX_RECONN_ATT -1
+#define MAX_MAX_RECONN_ATT 9999
+
+static struct kobj_type ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static ssize_t max_reconnect_attempts_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *page)
+{
+	struct ibtrs_clt *clt;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	return sprintf(page, "%d\n", ibtrs_clt_get_max_reconnect_attempts(clt));
+}
+
+static ssize_t max_reconnect_attempts_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf,
+					    size_t count)
+{
+	struct ibtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (unlikely(ret)) {
+		ibtrs_err(clt, "%s: failed to convert string '%s' to int\n",
+			  attr->attr.name, buf);
+		return ret;
+	}
+	if (unlikely(value > MAX_MAX_RECONN_ATT ||
+		     value < MIN_MAX_RECONN_ATT)) {
+		ibtrs_err(clt,
+			  "%s: invalid range (provided: '%s', accepted: min: %d, max: %d)\n",
+			  attr->attr.name, buf, MIN_MAX_RECONN_ATT,
+			  MAX_MAX_RECONN_ATT);
+		return -EINVAL;
+	}
+	ibtrs_clt_set_max_reconnect_attempts(clt, value);
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(max_reconnect_attempts);
+
+static ssize_t mpath_policy_show(struct device *dev,
+				 struct device_attribute *attr,
+				 char *page)
+{
+	struct ibtrs_clt *clt;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	switch (clt->mp_policy) {
+	case MP_POLICY_RR:
+		return sprintf(page, "round-robin (RR: %d)\n", clt->mp_policy);
+	case MP_POLICY_MIN_INFLIGHT:
+		return sprintf(page, "min-inflight (MI: %d)\n", clt->mp_policy);
+	default:
+		return sprintf(page, "Unknown (%d)\n", clt->mp_policy);
+	}
+}
+
+static ssize_t mpath_policy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf,
+				  size_t count)
+{
+	struct ibtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (!ret && (value == MP_POLICY_RR || value == MP_POLICY_MIN_INFLIGHT)) {
+		clt->mp_policy = value;
+		return count;
+	}
+
+	if (!strncasecmp(buf, "round-robin", 11) ||
+	    !strncasecmp(buf, "rr", 2))
+		clt->mp_policy = MP_POLICY_RR;
+	else if (!strncasecmp(buf, "min-inflight", 12) ||
+		 !strncasecmp(buf, "mi", 2))
+		clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(mpath_policy);
+
+static ssize_t add_path_show(struct device *dev,
+			     struct device_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE,
+			 "Usage: echo [<source addr>,]<destination addr> > %s\n\n"
+			 "*addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]\n",
+			 attr->attr.name);
+}
+
+static ssize_t add_path_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct sockaddr_storage srcaddr, dstaddr;
+	struct ibtrs_addr addr = {
+		.src = &srcaddr,
+		.dst = &dstaddr
+	};
+	struct ibtrs_clt *clt;
+	const char *nl;
+	size_t len;
+	int err;
+
+	clt = container_of(dev, struct ibtrs_clt, dev);
+
+	nl = strchr(buf, '\n');
+	if (nl)
+		len = nl - buf;
+	else
+		len = count;
+	err = ibtrs_addr_to_sockaddr(buf, len, clt->port, &addr);
+	if (unlikely(err))
+		return -EINVAL;
+
+	err = ibtrs_clt_create_path_from_sysfs(clt, &addr);
+	if (unlikely(err))
+		return err;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(add_path);
+
+static ssize_t ibtrs_clt_state_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *page)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (ibtrs_clt_sess_is_connected(sess))
+		return sprintf(page, "connected\n");
+
+	return sprintf(page, "disconnected\n");
+}
+
+static struct kobj_attribute ibtrs_clt_state_attr =
+	__ATTR(state, 0444, ibtrs_clt_state_show, NULL);
+
+static ssize_t ibtrs_clt_reconnect_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_reconnect_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_reconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_reconnect_attr =
+	__ATTR(reconnect, 0644, ibtrs_clt_reconnect_show,
+	       ibtrs_clt_reconnect_store);
+
+static ssize_t ibtrs_clt_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_disconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_disconnect_attr =
+	__ATTR(disconnect, 0644, ibtrs_clt_disconnect_show,
+	       ibtrs_clt_disconnect_store);
+
+static ssize_t ibtrs_clt_remove_path_show(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_remove_path_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_remove_path_from_sysfs(sess, &attr->attr);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_remove_path_attr =
+	__ATTR(remove_path, 0644, ibtrs_clt_remove_path_show,
+	       ibtrs_clt_remove_path_store);
+
+STAT_ATTR(struct ibtrs_clt_sess, cpu_migration,
+	  ibtrs_clt_stats_migration_cnt_to_str,
+	  ibtrs_clt_reset_cpu_migr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, sg_entries,
+	  ibtrs_clt_stats_sg_list_distr_to_str,
+	  ibtrs_clt_reset_sg_list_distr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, reconnects,
+	  ibtrs_clt_stats_reconnects_to_str,
+	  ibtrs_clt_reset_reconnects_stat);
+
+STAT_ATTR(struct ibtrs_clt_sess, rdma_lat,
+	  ibtrs_clt_stats_rdma_lat_distr_to_str,
+	  ibtrs_clt_reset_rdma_lat_distr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, wc_completion,
+	  ibtrs_clt_stats_wc_completion_to_str,
+	  ibtrs_clt_reset_wc_comp_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, rdma,
+	  ibtrs_clt_stats_rdma_to_str,
+	  ibtrs_clt_reset_rdma_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, reset_all,
+	  ibtrs_clt_reset_all_help,
+	  ibtrs_clt_reset_all_stats);
+
+static struct attribute *ibtrs_clt_stats_attrs[] = {
+	&sg_entries_attr.attr,
+	&cpu_migration_attr.attr,
+	&reconnects_attr.attr,
+	&rdma_lat_attr.attr,
+	&wc_completion_attr.attr,
+	&rdma_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_stats_attr_group = {
+	.attrs = ibtrs_clt_stats_attrs,
+};
+
+static int ibtrs_clt_create_stats_files(struct kobject *kobj,
+					struct kobject *kobj_stats)
+{
+	int ret;
+
+	ret = kobject_init_and_add(kobj_stats, &ktype, kobj, "stats");
+	if (ret) {
+		pr_err("Failed to init and add stats kobject, err: %d\n",
+		       ret);
+		return ret;
+	}
+
+	ret = sysfs_create_group(kobj_stats, &ibtrs_clt_stats_attr_group);
+	if (ret) {
+		pr_err("failed to create stats sysfs group, err: %d\n",
+		       ret);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_del(kobj_stats);
+	kobject_put(kobj_stats);
+
+	return ret;
+}
+
+static ssize_t ibtrs_clt_hca_port_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(kobj, typeof(*sess), kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%u\n", sess->hca_port);
+}
+
+static struct kobj_attribute ibtrs_clt_hca_port_attr =
+	__ATTR(hca_port, 0444, ibtrs_clt_hca_port_show, NULL);
+
+static ssize_t ibtrs_clt_hca_name_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", sess->hca_name);
+}
+
+static struct kobj_attribute ibtrs_clt_hca_name_attr =
+	__ATTR(hca_name, 0444, ibtrs_clt_hca_name_show, NULL);
+
+static struct attribute *ibtrs_clt_sess_attrs[] = {
+	&ibtrs_clt_hca_name_attr.attr,
+	&ibtrs_clt_hca_port_attr.attr,
+	&ibtrs_clt_state_attr.attr,
+	&ibtrs_clt_reconnect_attr.attr,
+	&ibtrs_clt_disconnect_attr.attr,
+	&ibtrs_clt_remove_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_sess_attr_group = {
+	.attrs = ibtrs_clt_sess_attrs,
+};
+
+int ibtrs_clt_create_sess_files(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	char str[MAXHOSTNAMELEN];
+	int err;
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &clt->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add: %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj, &ibtrs_clt_sess_attr_group);
+	if (unlikely(err)) {
+		pr_err("sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = ibtrs_clt_create_stats_files(&sess->kobj, &sess->kobj_stats);
+	if (unlikely(err))
+		goto put_kobj;
+
+	return 0;
+
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+
+	return err;
+}
+
+void ibtrs_clt_destroy_sess_files(struct ibtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		if (sysfs_self)
+			/* To avoid deadlock firstly commit suicide */
+			sysfs_remove_file_self(&sess->kobj, sysfs_self);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+	}
+}
+
+static struct attribute *ibtrs_clt_attrs[] = {
+	&dev_attr_max_reconnect_attempts.attr,
+	&dev_attr_mpath_policy.attr,
+	&dev_attr_add_path.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_attr_group = {
+	.attrs = ibtrs_clt_attrs,
+};
+
+int ibtrs_clt_create_sysfs_root_folders(struct ibtrs_clt *clt)
+{
+	return kobject_init_and_add(&clt->kobj_paths, &ktype,
+				    &clt->dev.kobj, "paths");
+}
+
+int ibtrs_clt_create_sysfs_root_files(struct ibtrs_clt *clt)
+{
+	return sysfs_create_group(&clt->dev.kobj, &ibtrs_clt_attr_group);
+}
+
+void ibtrs_clt_destroy_sysfs_root_folders(struct ibtrs_clt *clt)
+{
+	if (clt->kobj_paths.state_in_sysfs) {
+		kobject_del(&clt->kobj_paths);
+		kobject_put(&clt->kobj_paths);
+	}
+}
+
+void ibtrs_clt_destroy_sysfs_root_files(struct ibtrs_clt *clt)
+{
+	sysfs_remove_group(&clt->dev.kobj, &ibtrs_clt_attr_group);
+}
-- 
2.13.1


* [PATCH v2 10/26] ibtrs: server: private header with server structs and functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (8 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 09/26] ibtrs: client: sysfs interface functions Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 11/26] ibtrs: server: main functionality Roman Pen
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This header describes the main structs and functions used by the
ibtrs-server module, mainly for accepting IBTRS sessions,
creating/destroying sysfs entries, and accounting statistics on the
server side.
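
To give an idea of the server-side accounting, the following
user-space sketch models the per-direction counters of
struct ibtrs_srv_stats_rdma_stats with C11 atomics.  It is only an
analogy under stated assumptions: the kernel code uses atomic64_t, the
actual update and reset routines live in ibtrs-srv-stats.c (not part
of this patch), and the names srv_rdma_stats, srv_update_rdma_stats()
and srv_reset_rdma_stats() are made up for the example.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Direction indices, assumed to match the kernel's READ/WRITE (0/1). */
enum { DIR_READ = 0, DIR_WRITE = 1 };

/* User-space analogue of struct ibtrs_srv_stats_rdma_stats. */
struct srv_rdma_stats {
	struct {
		atomic_ullong cnt;		/* number of transfers */
		atomic_ullong size_total;	/* bytes transferred */
	} dir[2];
};

/* Account one transfer of 'size' bytes in direction 'd'. */
static void srv_update_rdma_stats(struct srv_rdma_stats *s, size_t size,
				  int d)
{
	atomic_fetch_add_explicit(&s->dir[d].cnt, 1ULL,
				  memory_order_relaxed);
	atomic_fetch_add_explicit(&s->dir[d].size_total,
				  (unsigned long long)size,
				  memory_order_relaxed);
}

/* Zero all counters, roughly what a sysfs "reset" would trigger. */
static void srv_reset_rdma_stats(struct srv_rdma_stats *s)
{
	int d;

	for (d = 0; d < 2; d++) {
		atomic_store(&s->dir[d].cnt, 0ULL);
		atomic_store(&s->dir[d].size_total, 0ULL);
	}
}
```

Relaxed ordering is enough here because the counters are independent
and only read for reporting; nothing synchronizes on their values.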

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h | 175 +++++++++++++++++++++++++++++++
 1 file changed, 175 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
new file mode 100644
index 000000000000..8193d568e67e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
@@ -0,0 +1,175 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_SRV_H
+#define IBTRS_SRV_H
+
+#include <linux/device.h>
+#include <linux/refcount.h>
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_srv_state - Server states.
+ */
+enum ibtrs_srv_state {
+	IBTRS_SRV_CONNECTING,
+	IBTRS_SRV_CONNECTED,
+	IBTRS_SRV_CLOSING,
+	IBTRS_SRV_CLOSED,
+};
+
+static inline const char *ibtrs_srv_state_str(enum ibtrs_srv_state state)
+{
+	switch (state) {
+	case IBTRS_SRV_CONNECTING:
+		return "IBTRS_SRV_CONNECTING";
+	case IBTRS_SRV_CONNECTED:
+		return "IBTRS_SRV_CONNECTED";
+	case IBTRS_SRV_CLOSING:
+		return "IBTRS_SRV_CLOSING";
+	case IBTRS_SRV_CLOSED:
+		return "IBTRS_SRV_CLOSED";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+struct ibtrs_stats_wc_comp {
+	atomic64_t	calls;
+	atomic64_t	total_wc_cnt;
+};
+
+struct ibtrs_srv_stats_rdma_stats {
+	struct {
+		atomic64_t	cnt;
+		atomic64_t	size_total;
+	} dir[2];
+};
+
+struct ibtrs_srv_stats {
+	struct ibtrs_srv_stats_rdma_stats	rdma_stats;
+	atomic_t				apm_cnt;
+	struct ibtrs_stats_wc_comp		wc_comp;
+};
+
+struct ibtrs_srv_con {
+	struct ibtrs_con	c;
+	atomic_t		wr_cnt;
+};
+
+struct ibtrs_srv_op {
+	struct ibtrs_srv_con		*con;
+	u32				msg_id;
+	u8				dir;
+	struct ibtrs_msg_rdma_read	*rd_msg;
+	struct ib_rdma_wr		*tx_wr;
+	struct ib_sge			*tx_sg;
+};
+
+struct ibtrs_srv_mr {
+	struct ib_mr	*mr;
+	struct sg_table	sgt;
+};
+
+struct ibtrs_srv_sess {
+	struct ibtrs_sess	s;
+	struct ibtrs_srv	*srv;
+	struct work_struct	close_work;
+	enum ibtrs_srv_state	state;
+	spinlock_t		state_lock;
+	int			cur_cq_vector;
+	struct ibtrs_srv_op	**ops_ids;
+	atomic_t		ids_inflight;
+	wait_queue_head_t	ids_waitq;
+	struct ibtrs_srv_mr	*mrs;
+	unsigned int		mrs_num;
+	dma_addr_t		*dma_addr;
+	bool			established;
+	unsigned int		mem_bits;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct ibtrs_srv_stats	stats;
+};
+
+struct ibtrs_srv {
+	struct list_head	paths_list;
+	int			paths_up;
+	struct mutex		paths_ev_mutex;
+	size_t			paths_num;
+	struct mutex		paths_mutex;
+	uuid_t			paths_uuid;
+	refcount_t		refcount;
+	struct ibtrs_srv_ctx	*ctx;
+	struct list_head	ctx_list;
+	void			*priv;
+	size_t			queue_depth;
+	struct page		**chunks;
+	struct device		dev;
+	unsigned		dev_ref;
+	struct kobject		kobj_paths;
+};
+
+struct ibtrs_srv_ctx {
+	rdma_ev_fn *rdma_ev;
+	link_ev_fn *link_ev;
+	struct rdma_cm_id *cm_id_ip;
+	struct rdma_cm_id *cm_id_ib;
+	struct mutex srv_mutex;
+	struct list_head srv_list;
+};
+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj)						\
+	LIST(CASE(obj, struct ibtrs_srv_sess *, s.sessname))
+
+void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess);
+
+/* ibtrs-srv-stats.c */
+
+void ibtrs_srv_update_rdma_stats(struct ibtrs_srv_stats *s, size_t size, int d);
+void ibtrs_srv_update_wc_stats(struct ibtrs_srv_stats *s);
+
+int ibtrs_srv_reset_rdma_stats(struct ibtrs_srv_stats *stats, bool enable);
+ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
+				    char *page, size_t len);
+int ibtrs_srv_reset_wc_completion_stats(struct ibtrs_srv_stats *stats,
+					bool enable);
+int ibtrs_srv_stats_wc_completion_to_str(struct ibtrs_srv_stats *stats, char *buf,
+					 size_t len);
+int ibtrs_srv_reset_all_stats(struct ibtrs_srv_stats *stats, bool enable);
+ssize_t ibtrs_srv_reset_all_help(struct ibtrs_srv_stats *stats,
+				 char *page, size_t len);
+
+/* ibtrs-srv-sysfs.c */
+
+int ibtrs_srv_create_sess_files(struct ibtrs_srv_sess *sess);
+void ibtrs_srv_destroy_sess_files(struct ibtrs_srv_sess *sess);
+
+#endif /* IBTRS_SRV_H */
-- 
2.13.1

* [PATCH v2 11/26] ibtrs: server: main functionality
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (9 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 10/26] ibtrs: server: private header with server structs and functions Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:03 ` [PATCH v2 12/26] ibtrs: server: statistics functions Roman Pen
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the main functionality of the ibtrs-server module, which accepts
a set of RDMA connections (a so-called IBTRS session), creates/destroys
sysfs entries associated with the IBTRS session and notifies the upper
layer (the user of the IBTRS API) about RDMA requests or link events.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
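
A note on the link events mentioned above: judging from
ibtrs_srv_sess_up()/ibtrs_srv_sess_down() in this patch, the upper layer
is told "connected" when the first path of a server comes up and
"disconnected" when the last one goes down. Below is a minimal userspace
sketch of that accounting, not the kernel code itself. The names
srv_model, path_up(), path_down() and count_ev() are hypothetical, and
the locking (paths_ev_mutex in the patch) is left out on purpose.

```c
#include <assert.h>

/* Hypothetical userspace model; names are illustrative, not the IBTRS API. */
enum link_ev { LINK_EV_CONNECTED, LINK_EV_DISCONNECTED };

struct srv_model {
	int paths_up;			/* number of established paths */
	void (*link_ev)(struct srv_model *srv, enum link_ev ev);
	int connected_events;		/* bookkeeping for the example only */
	int disconnected_events;
};

/* Example upper-layer callback: just counts the events it receives. */
static void count_ev(struct srv_model *srv, enum link_ev ev)
{
	if (ev == LINK_EV_CONNECTED)
		srv->connected_events++;
	else
		srv->disconnected_events++;
}

/* First path up: report the link as connected. */
static void path_up(struct srv_model *srv)
{
	if (++srv->paths_up == 1)
		srv->link_ev(srv, LINK_EV_CONNECTED);
}

/* Last path down: report the link as disconnected. */
static void path_down(struct srv_model *srv)
{
	assert(srv->paths_up > 0);	/* the patch uses WARN_ON() here */
	if (--srv->paths_up == 0)
		srv->link_ev(srv, LINK_EV_DISCONNECTED);
}
```

With two paths, calling path_up() twice should deliver a single
connected event, and only the second path_down() should deliver the
disconnected event.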
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1981 ++++++++++++++++++++++++++++++
 1 file changed, 1981 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
new file mode 100644
index 000000000000..d57fa6af5a5c
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
@@ -0,0 +1,1981 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/mempool.h>
+
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Server");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+/* Must be power of 2, see mask from mr->page_size in ib_sg_to_pages() */
+#define DEFAULT_MAX_CHUNK_SIZE (128 << 10)
+#define DEFAULT_SESS_QUEUE_DEPTH 512
+#define MAX_HDR_SIZE PAGE_SIZE
+#define MAX_SG_COUNT ((MAX_HDR_SIZE - sizeof(struct ibtrs_msg_rdma_read)) \
+		      / sizeof(struct ibtrs_sg_desc))
+
+/* We guarantee to serve 10 paths at least */
+#define CHUNK_POOL_SZ 10
+
+static struct ibtrs_ib_dev_pool dev_pool;
+static mempool_t *chunk_pool;
+struct class *ibtrs_dev_class;
+
+static int retry_count = 7;
+static int __read_mostly max_chunk_size = DEFAULT_MAX_CHUNK_SIZE;
+static int __read_mostly sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
+
+module_param_named(max_chunk_size, max_chunk_size, int, 0444);
+MODULE_PARM_DESC(max_chunk_size,
+		 "Max size for each IO request, in bytes"
+		 " (default: 131072, i.e. 128KB)");
+
+module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
+MODULE_PARM_DESC(sess_queue_depth,
+		 "Number of buffers for pending I/O requests to allocate"
+		 " per session. Maximum: " __stringify(MAX_SESS_QUEUE_DEPTH)
+		 " (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
+
+static int retry_count_set(const char *val, const struct kernel_param *kp)
+{
+	int err, ival;
+
+	err = kstrtoint(val, 0, &ival);
+	if (err)
+		return err;
+
+	if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT) {
+		pr_err("Invalid retry count value %d, has to be"
+		       " >= %d, <= %d\n", ival, MIN_RTR_CNT, MAX_RTR_CNT);
+		return -EINVAL;
+	}
+
+	retry_count = ival;
+	pr_info("QP retry count changed to %d\n", ival);
+
+	return 0;
+}
+
+static const struct kernel_param_ops retry_count_ops = {
+	.set		= retry_count_set,
+	.get		= param_get_int,
+};
+module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
+
+MODULE_PARM_DESC(retry_count, "Number of times to send the message if the"
+		 " remote side didn't respond with Ack or Nack (default: 7,"
+		 " min: " __stringify(MIN_RTR_CNT) ", max: "
+		 __stringify(MAX_RTR_CNT) ")");
+
+static char cq_affinity_list[256] = "";
+static cpumask_t cq_affinity_mask = { CPU_BITS_ALL };
+
+static void init_cq_affinity(void)
+{
+	sprintf(cq_affinity_list, "0-%d", nr_cpu_ids - 1);
+}
+
+static int cq_affinity_list_set(const char *val, const struct kernel_param *kp)
+{
+	int ret = 0, len = strlen(val);
+	cpumask_var_t new_value;
+
+	if (!strlen(cq_affinity_list))
+		init_cq_affinity();
+
+	if (len >= sizeof(cq_affinity_list))
+		return -EINVAL;
+	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(val, new_value);
+	if (ret) {
+		pr_err("Can't set cq_affinity_list \"%s\": %d\n", val,
+		       ret);
+		goto free_cpumask;
+	}
+
+	strlcpy(cq_affinity_list, val, sizeof(cq_affinity_list));
+	*strchrnul(cq_affinity_list, '\n') = '\0';
+	cpumask_copy(&cq_affinity_mask, new_value);
+
+	pr_info("cq_affinity_list changed to %*pbl\n",
+		cpumask_pr_args(&cq_affinity_mask));
+free_cpumask:
+	free_cpumask_var(new_value);
+	return ret;
+}
+
+static struct kparam_string cq_affinity_list_kparam_str = {
+	.maxlen	= sizeof(cq_affinity_list),
+	.string	= cq_affinity_list
+};
+
+static const struct kernel_param_ops cq_affinity_list_ops = {
+	.set	= cq_affinity_list_set,
+	.get	= param_get_string,
+};
+
+module_param_cb(cq_affinity_list, &cq_affinity_list_ops,
+		&cq_affinity_list_kparam_str, 0644);
+MODULE_PARM_DESC(cq_affinity_list, "Sets the list of cpus to use as cq vectors."
+		 " (default: use all possible CPUs)");
+
+static struct workqueue_struct *ibtrs_wq;
+
+static void close_sess(struct ibtrs_srv_sess *sess);
+
+static inline struct ibtrs_srv_con *to_srv_con(struct ibtrs_con *c)
+{
+	return container_of(c, struct ibtrs_srv_con, c);
+}
+
+static inline struct ibtrs_srv_sess *to_srv_sess(struct ibtrs_sess *s)
+{
+	return container_of(s, struct ibtrs_srv_sess, s);
+}
+
+static bool __ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
+				     enum ibtrs_srv_state new_state)
+{
+	enum ibtrs_srv_state old_state;
+	bool changed = false;
+
+	old_state = sess->state;
+	switch (new_state) {
+	case IBTRS_SRV_CONNECTED:
+		switch (old_state) {
+		case IBTRS_SRV_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_SRV_CLOSING:
+		switch (old_state) {
+		case IBTRS_SRV_CONNECTING:
+		case IBTRS_SRV_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_SRV_CLOSED:
+		switch (old_state) {
+		case IBTRS_SRV_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed)
+		sess->state = new_state;
+
+	return changed;
+}
+
+static bool ibtrs_srv_change_state_get_old(struct ibtrs_srv_sess *sess,
+					   enum ibtrs_srv_state new_state,
+					   enum ibtrs_srv_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_lock);
+	*old_state = sess->state;
+	changed = __ibtrs_srv_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_lock);
+
+	return changed;
+}
+
+static bool ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
+				   enum ibtrs_srv_state new_state)
+{
+	enum ibtrs_srv_state old_state;
+
+	return ibtrs_srv_change_state_get_old(sess, new_state, &old_state);
+}
+
+static void free_id(struct ibtrs_srv_op *id)
+{
+	if (!id)
+		return;
+	kfree(id->tx_wr);
+	kfree(id->tx_sg);
+	kfree(id);
+}
+
+static void ibtrs_srv_free_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int i;
+
+	WARN_ON(atomic_read(&sess->ids_inflight));
+	if (sess->ops_ids) {
+		for (i = 0; i < srv->queue_depth; i++)
+			free_id(sess->ops_ids[i]);
+		kfree(sess->ops_ids);
+		sess->ops_ids = NULL;
+	}
+}
+
+static int ibtrs_srv_alloc_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_op *id;
+	int i;
+
+	sess->ops_ids = kcalloc(srv->queue_depth, sizeof(*sess->ops_ids),
+				GFP_KERNEL);
+	if (unlikely(!sess->ops_ids))
+		goto err;
+
+	for (i = 0; i < srv->queue_depth; ++i) {
+		id = kzalloc(sizeof(*id), GFP_KERNEL);
+		if (unlikely(!id))
+			goto err;
+
+		sess->ops_ids[i] = id;
+		id->tx_wr = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_wr),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_wr))
+			goto err;
+
+		id->tx_sg = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_sg),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_sg))
+			goto err;
+	}
+	init_waitqueue_head(&sess->ids_waitq);
+	atomic_set(&sess->ids_inflight, 0);
+
+	return 0;
+
+err:
+	ibtrs_srv_free_ops_ids(sess);
+	return -ENOMEM;
+}
+
+static void ibtrs_srv_get_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	atomic_inc(&sess->ids_inflight);
+}
+
+static void ibtrs_srv_put_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	if (atomic_dec_and_test(&sess->ids_inflight))
+		wake_up(&sess->ids_waitq);
+}
+
+static void ibtrs_srv_wait_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	wait_event(sess->ids_waitq, !atomic_read(&sess->ids_inflight));
+}
+
+static void ibtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+
+static struct ib_cqe io_comp_cqe = {
+	.done = ibtrs_srv_rdma_done
+};
+
+/**
+ * rdma_write_sg() - response on successful READ request
+ */
+static int rdma_write_sg(struct ibtrs_srv_op *id)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
+	dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
+	struct ibtrs_srv *srv = sess->srv;
+	struct ib_send_wr inv_wr, imm_wr;
+	struct ib_rdma_wr *wr = NULL;
+	struct ib_send_wr *bad_wr;
+	enum ib_send_flags flags;
+	size_t sg_cnt;
+	int err, i, offset;
+	bool need_inval;
+	u32 rkey = 0;
+
+	BUG_ON(id->dir != READ);
+	sg_cnt = le16_to_cpu(id->rd_msg->sg_cnt);
+	need_inval = le16_to_cpu(id->rd_msg->flags) & IBTRS_MSG_NEED_INVAL_F;
+	if (unlikely(!sg_cnt))
+		return -EINVAL;
+
+	offset = 0;
+	for (i = 0; i < sg_cnt; i++) {
+		struct ib_sge *list;
+
+		wr		= &id->tx_wr[i];
+		list		= &id->tx_sg[i];
+		list->addr	= dma_addr + offset;
+		list->length	= le32_to_cpu(id->rd_msg->desc[i].len);
+
+		/* WR will fail with length error
+		 * if this is 0
+		 */
+		if (unlikely(list->length == 0)) {
+			ibtrs_err(sess, "Invalid RDMA-Write sg list length 0\n");
+			return -EINVAL;
+		}
+
+		list->lkey = sess->s.dev->ib_pd->local_dma_lkey;
+		offset += list->length;
+
+		wr->wr.wr_cqe	= &io_comp_cqe;
+		wr->wr.sg_list	= list;
+		wr->wr.num_sge	= 1;
+		wr->remote_addr	= le64_to_cpu(id->rd_msg->desc[i].addr);
+		wr->rkey	= le32_to_cpu(id->rd_msg->desc[i].key);
+		if (rkey == 0)
+			rkey = wr->rkey;
+		else
+			/* Only one key is actually used */
+			WARN_ON_ONCE(rkey != wr->rkey);
+
+		if (i < (sg_cnt - 1))
+			wr->wr.next = &id->tx_wr[i + 1].wr;
+		else if (need_inval)
+			wr->wr.next = &inv_wr;
+		else
+			wr->wr.next = &imm_wr;
+
+		wr->wr.opcode = IB_WR_RDMA_WRITE;
+		wr->wr.ex.imm_data = 0;
+		wr->wr.send_flags  = 0;
+
+	}
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	if (need_inval) {
+		inv_wr.next = &imm_wr;
+		inv_wr.wr_cqe = &io_comp_cqe;
+		inv_wr.sg_list = NULL;
+		inv_wr.num_sge = 0;
+		inv_wr.opcode = IB_WR_SEND_WITH_INV;
+		inv_wr.send_flags = 0;
+		inv_wr.ex.invalidate_rkey = rkey;
+	}
+	imm_wr.next = NULL;
+	imm_wr.wr_cqe = &io_comp_cqe;
+	imm_wr.sg_list = NULL;
+	imm_wr.num_sge = 0;
+	imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
+	imm_wr.send_flags = flags;
+	imm_wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id,
+							     0, need_inval));
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, dma_addr,
+				      offset, DMA_BIDIRECTIONAL);
+
+	err = ib_post_send(id->con->c.qp, &id->tx_wr[0].wr, &bad_wr);
+	if (unlikely(err))
+		ibtrs_err(sess,
+			  "Posting RDMA-Write-Request to QP failed, err: %d\n",
+			  err);
+
+	return err;
+}
+
+/**
+ * send_io_resp_imm() - response with empty IMM on failed READ/WRITE requests or
+ *                      on successful WRITE request.
+ */
+static int send_io_resp_imm(struct ibtrs_srv_con *con, struct ibtrs_srv_op *id,
+			    int errno)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ib_send_wr inv_wr, *wr = NULL;
+	struct ibtrs_srv *srv = sess->srv;
+	bool need_inval = false;
+	enum ib_send_flags flags;
+	u32 imm;
+	int err;
+
+	if (id->dir == READ) {
+		struct ibtrs_msg_rdma_read *rd_msg = id->rd_msg;
+		size_t sg_cnt;
+
+		need_inval = le16_to_cpu(rd_msg->flags) & IBTRS_MSG_NEED_INVAL_F;
+		sg_cnt = le16_to_cpu(rd_msg->sg_cnt);
+
+		if (need_inval) {
+			if (likely(sg_cnt)) {
+				inv_wr.next = NULL;
+				inv_wr.wr_cqe = &io_comp_cqe;
+				inv_wr.sg_list = NULL;
+				inv_wr.num_sge = 0;
+				inv_wr.opcode = IB_WR_SEND_WITH_INV;
+				inv_wr.send_flags = 0;
+				/* Only one key is actually used */
+				inv_wr.ex.invalidate_rkey =
+					le32_to_cpu(rd_msg->desc[0].key);
+				wr = &inv_wr;
+			} else {
+				WARN_ON_ONCE(1);
+				need_inval = false;
+			}
+		}
+	}
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+	imm = ibtrs_to_io_rsp_imm(id->msg_id, errno, need_inval);
+	err = ibtrs_post_rdma_write_imm_empty(&con->c, &io_comp_cqe, imm,
+					      flags, wr);
+	if (unlikely(err))
+		ibtrs_err_rl(sess, "ib_post_send(), err: %d\n", err);
+
+	return err;
+}
+
+/*
+ * ibtrs_srv_resp_rdma() - sends response to the client.
+ *
+ * Context: any
+ */
+void ibtrs_srv_resp_rdma(struct ibtrs_srv_op *id, int status)
+{
+	struct ibtrs_srv_sess *sess;
+	int err;
+
+	if (WARN_ON(!id))
+		return;
+
+	sess = to_srv_sess(id->con->c.sess);
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Sending I/O response failed,"
+			     " session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		goto out;
+	}
+	if (status || id->dir == WRITE || !id->rd_msg->sg_cnt)
+		err = send_io_resp_imm(id->con, id, status);
+	else
+		err = rdma_write_sg(id);
+	if (unlikely(err)) {
+		ibtrs_err_rl(sess, "IO response failed: %d\n", err);
+		close_sess(sess);
+	}
+out:
+	ibtrs_srv_put_ops_ids(sess);
+}
+EXPORT_SYMBOL(ibtrs_srv_resp_rdma);
+
+void ibtrs_srv_set_sess_priv(struct ibtrs_srv *srv, void *priv)
+{
+	srv->priv = priv;
+}
+EXPORT_SYMBOL(ibtrs_srv_set_sess_priv);
+
+static void unmap_cont_bufs(struct ibtrs_srv_sess *sess)
+{
+	int i;
+
+	for (i = 0; i < sess->mrs_num; i++) {
+		struct ibtrs_srv_mr *srv_mr;
+
+		srv_mr = &sess->mrs[i];
+		ib_dereg_mr(srv_mr->mr);
+		ib_dma_unmap_sg(sess->s.dev->ib_dev, srv_mr->sgt.sgl,
+				srv_mr->sgt.nents, DMA_BIDIRECTIONAL);
+		sg_free_table(&srv_mr->sgt);
+	}
+	kfree(sess->mrs);
+}
+
+static int map_cont_bufs(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int i, mri, err, mrs_num;
+	unsigned int chunk_bits;
+	int chunks_per_mr;
+
+	/*
+	 * Here we map queue_depth chunks to MR.  Firstly we have to
+	 * figure out how many chunks can we map per MR.
+	 */
+
+	chunks_per_mr = sess->s.dev->ib_dev->attrs.max_fast_reg_page_list_len;
+	mrs_num = DIV_ROUND_UP(srv->queue_depth, chunks_per_mr);
+	chunks_per_mr = DIV_ROUND_UP(srv->queue_depth, mrs_num);
+
+	sess->mrs = kcalloc(mrs_num, sizeof(*sess->mrs), GFP_KERNEL);
+	if (unlikely(!sess->mrs))
+		return -ENOMEM;
+
+	sess->mrs_num = mrs_num;
+
+	for (mri = 0; mri < mrs_num; mri++) {
+		struct ibtrs_srv_mr *srv_mr = &sess->mrs[mri];
+		struct sg_table *sgt = &srv_mr->sgt;
+		struct scatterlist *s;
+		struct ib_mr *mr;
+		int nr, chunks;
+
+		chunks = chunks_per_mr * mri;
+		chunks_per_mr = min_t(int, chunks_per_mr,
+				      srv->queue_depth - chunks);
+
+		err = sg_alloc_table(sgt, chunks_per_mr, GFP_KERNEL);
+		if (unlikely(err))
+			goto err;
+
+		for_each_sg(sgt->sgl, s, chunks_per_mr, i)
+			sg_set_page(s, srv->chunks[chunks + i],
+				    max_chunk_size, 0);
+
+		nr = ib_dma_map_sg(sess->s.dev->ib_dev, sgt->sgl,
+				   sgt->nents, DMA_BIDIRECTIONAL);
+		if (unlikely(nr < sgt->nents)) {
+			err = nr < 0 ? nr : -EINVAL;
+			goto free_sg;
+		}
+		mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
+				 sgt->nents);
+		if (unlikely(IS_ERR(mr))) {
+			err = PTR_ERR(mr);
+			goto unmap_sg;
+		}
+		nr = ib_map_mr_sg(mr, sgt->sgl, sgt->nents,
+				  NULL, max_chunk_size);
+		if (unlikely(nr < sgt->nents)) {
+			err = nr < 0 ? nr : -EINVAL;
+			goto dereg_mr;
+		}
+
+		/* Eventually dma addr for each chunk can be cached */
+		for_each_sg(sgt->sgl, s, sgt->orig_nents, i)
+			sess->dma_addr[chunks + i] = sg_dma_address(s);
+
+		ib_update_fast_reg_key(mr, ib_inc_rkey(mr->rkey));
+
+		srv_mr->mr = mr;
+
+		continue;
+err:
+		while (mri--) {
+			srv_mr = &sess->mrs[mri];
+			sgt = &srv_mr->sgt;
+			mr = srv_mr->mr;
+dereg_mr:
+			ib_dereg_mr(mr);
+unmap_sg:
+			ib_dma_unmap_sg(sess->s.dev->ib_dev, sgt->sgl,
+					sgt->nents, DMA_BIDIRECTIONAL);
+free_sg:
+			sg_free_table(sgt);
+		}
+		kfree(sess->mrs);
+
+		return err;
+	}
+
+	chunk_bits = ilog2(srv->queue_depth - 1) + 1;
+	sess->mem_bits = (MAX_IMM_PAYL_BITS - chunk_bits);
+
+	return 0;
+}
+
+static void ibtrs_srv_hb_err_handler(struct ibtrs_con *c, int err)
+{
+	(void)err;
+	close_sess(to_srv_sess(c->sess));
+}
+
+static void ibtrs_srv_init_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_init_hb(&sess->s, &io_comp_cqe,
+		      IBTRS_HB_INTERVAL_MS,
+		      IBTRS_HB_MISSED_MAX,
+		      ibtrs_srv_hb_err_handler,
+		      ibtrs_wq);
+}
+
+static void ibtrs_srv_start_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_start_hb(&sess->s);
+}
+
+static void ibtrs_srv_stop_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_stop_hb(&sess->s);
+}
+
+static void ibtrs_srv_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	ibtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info response send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+		return;
+	}
+	WARN_ON(wc->opcode != IB_WC_SEND);
+	ibtrs_srv_update_wc_stats(&sess->stats);
+}
+
+static void ibtrs_srv_sess_up(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	int up;
+
+	mutex_lock(&srv->paths_ev_mutex);
+	up = ++srv->paths_up;
+	if (up == 1)
+		ctx->link_ev(srv, IBTRS_SRV_LINK_EV_CONNECTED, NULL);
+	mutex_unlock(&srv->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+}
+
+static void ibtrs_srv_sess_down(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&srv->paths_ev_mutex);
+	WARN_ON(!srv->paths_up);
+	if (--srv->paths_up == 0)
+		ctx->link_ev(srv, IBTRS_SRV_LINK_EV_DISCONNECTED, srv->priv);
+	mutex_unlock(&srv->paths_ev_mutex);
+}
+
+static void ibtrs_srv_reg_mr_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "REG MR failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+		return;
+	}
+}
+
+static struct ib_cqe local_reg_cqe = {
+	.done = ibtrs_srv_reg_mr_done
+};
+
+static int post_recv_sess(struct ibtrs_srv_sess *sess);
+
+static int process_info_req(struct ibtrs_srv_con *con,
+			    struct ibtrs_msg_info_req *msg)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ib_send_wr *reg_wr = NULL;
+	struct ibtrs_msg_info_rsp *rsp;
+	struct ibtrs_iu *tx_iu;
+	struct ib_reg_wr *rwr;
+	int mri, err;
+	size_t tx_sz;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "post_recv_sess(), err: %d\n", err);
+		return err;
+	}
+	rwr = kcalloc(sess->mrs_num, sizeof(*rwr), GFP_KERNEL);
+	if (unlikely(!rwr)) {
+		ibtrs_err(sess, "No memory\n");
+		return -ENOMEM;
+	}
+	memcpy(sess->s.sessname, msg->sessname, sizeof(sess->s.sessname));
+
+	tx_sz  = sizeof(*rsp);
+	tx_sz += sizeof(rsp->desc[0]) * sess->mrs_num;
+	tx_iu = ibtrs_iu_alloc(0, tx_sz, GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_TO_DEVICE, ibtrs_srv_info_rsp_done);
+	if (unlikely(!tx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(), err: %d\n", -ENOMEM);
+		err = -ENOMEM;
+		goto rwr_free;
+	}
+
+	rsp = tx_iu->buf;
+	rsp->type = cpu_to_le16(IBTRS_MSG_INFO_RSP);
+	rsp->sg_cnt = cpu_to_le16(sess->mrs_num);
+
+	for (mri = 0; mri < sess->mrs_num; mri++) {
+		struct ib_mr *mr = sess->mrs[mri].mr;
+
+		rsp->desc[mri].addr = cpu_to_le64(mr->iova);
+		rsp->desc[mri].key  = cpu_to_le32(mr->rkey);
+		rsp->desc[mri].len  = cpu_to_le32(mr->length);
+
+		/*
+		 * Fill in reg MR request and chain them *backwards*
+		 */
+		rwr[mri].wr.next = mri ? &rwr[mri-1].wr : NULL;
+		rwr[mri].wr.opcode = IB_WR_REG_MR;
+		rwr[mri].wr.wr_cqe = &local_reg_cqe;
+		rwr[mri].wr.num_sge = 0;
+		rwr[mri].wr.send_flags = 0;
+		rwr[mri].mr = mr;
+		rwr[mri].key = mr->rkey;
+		rwr[mri].access = (IB_ACCESS_LOCAL_WRITE |
+				   IB_ACCESS_REMOTE_WRITE);
+		reg_wr = &rwr[mri].wr;
+	}
+
+	err = ibtrs_srv_create_sess_files(sess);
+	if (unlikely(err))
+		goto iu_free;
+
+	ibtrs_srv_change_state(sess, IBTRS_SRV_CONNECTED);
+	ibtrs_srv_start_hb(sess);
+
+	/*
+	 * We do not account number of established connections at the current
+	 * moment, we rely on the client, which should send info request when
+	 * all connections are successfully established.  Thus, simply notify
+	 * listener with a proper event if we are the first path.
+	 */
+	ibtrs_srv_sess_up(sess);
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, tx_iu->dma_addr,
+				      tx_iu->size, DMA_TO_DEVICE);
+
+	/* Send info response */
+	err = ibtrs_iu_post_send(&con->c, tx_iu, tx_sz, reg_wr);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_send(), err: %d\n", err);
+iu_free:
+		ibtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.dev->ib_dev);
+	}
+rwr_free:
+	kfree(rwr);
+
+	return err;
+}
+
+static void ibtrs_srv_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_msg_info_req *msg;
+	struct ibtrs_iu *iu;
+	int err;
+
+	WARN_ON(con->c.cid);
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info request receive failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto close;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		ibtrs_err(sess, "Sess info request is malformed: size %d\n",
+			  wc->byte_len);
+		goto close;
+	}
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, iu->dma_addr,
+				   iu->size, DMA_FROM_DEVICE);
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != IBTRS_MSG_INFO_REQ)) {
+		ibtrs_err(sess, "Sess info request is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto close;
+	}
+	err = process_info_req(con, msg);
+	if (unlikely(err))
+		goto close;
+
+out:
+	ibtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+	return;
+close:
+	close_sess(sess);
+	goto out;
+}
+
+static int post_recv_info_req(struct ibtrs_srv_con *con)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_iu *rx_iu;
+	int err;
+
+	rx_iu = ibtrs_iu_alloc(0, sizeof(struct ibtrs_msg_info_req),
+			       GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_FROM_DEVICE, ibtrs_srv_info_req_done);
+	if (unlikely(!rx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(): no memory\n");
+		return -ENOMEM;
+	}
+	/* Prepare for receiving the info request */
+	err = ibtrs_iu_post_recv(&con->c, rx_iu);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_recv(), err: %d\n", err);
+		ibtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev);
+		return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_io(struct ibtrs_srv_con *con, size_t q_size)
+{
+	int i, err;
+
+	for (i = 0; i < q_size; i++) {
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	size_t q_size;
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (cid == 0)
+			q_size = SERVICE_CON_QUEUE_DEPTH;
+		else
+			q_size = srv->queue_depth;
+
+		err = post_recv_io(to_srv_con(sess->s.con[cid]), q_size);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static void process_read(struct ibtrs_srv_con *con,
+			 struct ibtrs_msg_rdma_read *msg,
+			 u32 buf_id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	struct ibtrs_srv_op *id;
+
+	size_t usr_len, data_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Processing read request failed,"
+			     " session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		return;
+	}
+	ibtrs_srv_get_ops_ids(sess);
+	ibtrs_srv_update_rdma_stats(&sess->stats, off, READ);
+	id = sess->ops_ids[buf_id];
+	id->con		= con;
+	id->dir		= READ;
+	id->msg_id	= buf_id;
+	id->rd_msg	= msg;
+	usr_len = le16_to_cpu(msg->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	ret = ctx->rdma_ev(srv, srv->priv, id, READ, data, data_len,
+			   data + data_len, usr_len);
+
+	if (unlikely(ret)) {
+		ibtrs_err_rl(sess, "Processing read request failed, user "
+			     "module cb reported for msg_id %d, err: %d\n",
+			     buf_id, ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, id, ret);
+	if (ret < 0) {
+		ibtrs_err_rl(sess, "Sending err msg for failed RDMA-Write-Req"
+			     " failed, msg_id %d, err: %d\n", buf_id, ret);
+		close_sess(sess);
+	}
+	ibtrs_srv_put_ops_ids(sess);
+}
+
+static void process_write(struct ibtrs_srv_con *con,
+			  struct ibtrs_msg_rdma_write *req,
+			  u32 buf_id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	struct ibtrs_srv_op *id;
+
+	size_t data_len, usr_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Processing write request failed,"
+			     " session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		return;
+	}
+	ibtrs_srv_get_ops_ids(sess);
+	ibtrs_srv_update_rdma_stats(&sess->stats, off, WRITE);
+	id = sess->ops_ids[buf_id];
+	id->con    = con;
+	id->dir    = WRITE;
+	id->msg_id = buf_id;
+
+	usr_len = le16_to_cpu(req->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	ret = ctx->rdma_ev(srv, srv->priv, id, WRITE, data, data_len,
+			   data + data_len, usr_len);
+	if (unlikely(ret)) {
+		ibtrs_err_rl(sess, "Processing write request failed, user"
+			     " module callback reports err: %d\n", ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, id, ret);
+	if (ret < 0) {
+		ibtrs_err_rl(sess, "Processing write request failed, sending"
+			     " I/O response failed, msg_id %d, err: %d\n",
+			     buf_id, ret);
+		close_sess(sess);
+	}
+	ibtrs_srv_put_ops_ids(sess);
+}
+
+static void process_io_req(struct ibtrs_srv_con *con, void *msg,
+			   u32 id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	unsigned int type;
+
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, sess->dma_addr[id],
+				   max_chunk_size, DMA_BIDIRECTIONAL);
+	type = le16_to_cpu(*(__le16 *)msg);
+
+	switch (type) {
+	case IBTRS_MSG_WRITE:
+		process_write(con, msg, id, off);
+		break;
+	case IBTRS_MSG_READ:
+		process_read(con, msg, id, off);
+		break;
+	default:
+		ibtrs_err(sess, "Processing I/O request failed, "
+			  "unknown message type received: 0x%02x\n", type);
+		goto err;
+	}
+
+	return;
+
+err:
+	close_sess(sess);
+}
+
+static void ibtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	u32 imm_type, imm_payload;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			ibtrs_err(sess, "%s (wr_cqe: %p,"
+				  " type: %d, vendor_err: 0x%x, len: %u)\n",
+				  ib_wc_status_msg(wc->status), wc->wr_cqe,
+				  wc->opcode, wc->vendor_err, wc->byte_len);
+			close_sess(sess);
+		}
+		return;
+	}
+	ibtrs_srv_update_wc_stats(&sess->stats);
+
+	switch (wc->opcode) {
+	case IB_WC_RDMA_WRITE:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
+			return;
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "ibtrs_post_recv(), err: %d\n", err);
+			close_sess(sess);
+			break;
+		}
+		ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == IBTRS_IO_REQ_IMM)) {
+			u32 msg_id, off;
+			void *data;
+
+			msg_id = imm_payload >> sess->mem_bits;
+			off = imm_payload & ((1 << sess->mem_bits) - 1);
+			if (unlikely(msg_id >= srv->queue_depth ||
+				     off >= max_chunk_size)) {
+				ibtrs_err(sess, "Wrong msg_id %u, off %u\n",
+					  msg_id, off);
+				close_sess(sess);
+				return;
+			}
+			data = page_address(srv->chunks[msg_id]) + off;
+			process_io_req(con, data, msg_id, off);
+		} else if (imm_type == IBTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			ibtrs_send_hb_ack(&sess->s);
+		} else if (imm_type == IBTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+		} else {
+			ibtrs_wrn(sess, "Unknown IMM type %u\n", imm_type);
+		}
+		break;
+	default:
+		ibtrs_wrn(sess, "Unexpected WC type: %d\n", wc->opcode);
+		return;
+	}
+}
+
+int ibtrs_srv_get_sess_name(struct ibtrs_srv *srv, char *sessname, size_t len)
+{
+	struct ibtrs_srv_sess *sess;
+	int err = -ENOTCONN;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (sess->state != IBTRS_SRV_CONNECTED)
+			continue;
+		memcpy(sessname, sess->s.sessname,
+		       min_t(size_t, sizeof(sess->s.sessname), len));
+		err = 0;
+		break;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL(ibtrs_srv_get_sess_name);
+
+int ibtrs_srv_get_queue_depth(struct ibtrs_srv *srv)
+{
+	return srv->queue_depth;
+}
+EXPORT_SYMBOL(ibtrs_srv_get_queue_depth);
+
+static int find_next_bit_ring(int cur)
+{
+	int v = cpumask_next(cur, &cq_affinity_mask);
+
+	if (v >= nr_cpu_ids)
+		v = cpumask_first(&cq_affinity_mask);
+	return v;
+}
+
+static int ibtrs_srv_get_next_cq_vector(struct ibtrs_srv_sess *sess)
+{
+	sess->cur_cq_vector = find_next_bit_ring(sess->cur_cq_vector);
+
+	return sess->cur_cq_vector;
+}
+
+static struct ibtrs_srv *__alloc_srv(struct ibtrs_srv_ctx *ctx,
+				     const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+	int i;
+
+	srv = kzalloc(sizeof(*srv), GFP_KERNEL);
+	if (unlikely(!srv))
+		return NULL;
+
+	refcount_set(&srv->refcount, 1);
+	INIT_LIST_HEAD(&srv->paths_list);
+	mutex_init(&srv->paths_mutex);
+	mutex_init(&srv->paths_ev_mutex);
+	uuid_copy(&srv->paths_uuid, paths_uuid);
+	srv->queue_depth = sess_queue_depth;
+	srv->ctx = ctx;
+
+	srv->chunks = kcalloc(srv->queue_depth, sizeof(*srv->chunks),
+			      GFP_KERNEL);
+	if (unlikely(!srv->chunks))
+		goto err_free_srv;
+
+	for (i = 0; i < srv->queue_depth; i++) {
+		srv->chunks[i] = mempool_alloc(chunk_pool, GFP_KERNEL);
+		if (unlikely(!srv->chunks[i])) {
+			pr_err("mempool_alloc() failed\n");
+			goto err_free_chunks;
+		}
+	}
+	list_add(&srv->ctx_list, &ctx->srv_list);
+
+	return srv;
+
+err_free_chunks:
+	while (i--)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+
+err_free_srv:
+	kfree(srv);
+
+	return NULL;
+}
+
+static void free_srv(struct ibtrs_srv *srv)
+{
+	int i;
+
+	WARN_ON(refcount_read(&srv->refcount));
+	for (i = 0; i < srv->queue_depth; i++)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+	kfree(srv);
+}
+
+static inline struct ibtrs_srv *__find_srv_and_get(struct ibtrs_srv_ctx *ctx,
+						   const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list) {
+		if (uuid_equal(&srv->paths_uuid, paths_uuid) &&
+		    refcount_inc_not_zero(&srv->refcount))
+			return srv;
+	}
+
+	return NULL;
+}
+
+static struct ibtrs_srv *get_or_create_srv(struct ibtrs_srv_ctx *ctx,
+					   const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	srv = __find_srv_and_get(ctx, paths_uuid);
+	if (!srv)
+		srv = __alloc_srv(ctx, paths_uuid);
+	mutex_unlock(&ctx->srv_mutex);
+
+	return srv;
+}
+
+static void put_srv(struct ibtrs_srv *srv)
+{
+	if (refcount_dec_and_test(&srv->refcount)) {
+		struct ibtrs_srv_ctx *ctx = srv->ctx;
+
+		WARN_ON(srv->dev.kobj.state_in_sysfs);
+		WARN_ON(srv->kobj_paths.state_in_sysfs);
+
+		mutex_lock(&ctx->srv_mutex);
+		list_del(&srv->ctx_list);
+		mutex_unlock(&ctx->srv_mutex);
+		free_srv(srv);
+	}
+}
+
+static void __add_path_to_srv(struct ibtrs_srv *srv,
+			      struct ibtrs_srv_sess *sess)
+{
+	list_add_tail(&sess->s.entry, &srv->paths_list);
+	srv->paths_num++;
+	WARN_ON(srv->paths_num > MAX_PATHS_NUM);
+}
+
+static void del_path_from_srv(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+
+	if (WARN_ON(!srv))
+		return;
+
+	mutex_lock(&srv->paths_mutex);
+	list_del(&sess->s.entry);
+	WARN_ON(!srv->paths_num);
+	srv->paths_num--;
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static void ibtrs_srv_close_work(struct work_struct *work)
+{
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_srv_ctx *ctx;
+	struct ibtrs_srv_con *con;
+	int i;
+
+	sess = container_of(work, typeof(*sess), close_work);
+	ctx = sess->srv->ctx;
+
+	ibtrs_srv_destroy_sess_files(sess);
+	ibtrs_srv_stop_hb(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		if (!sess->s.con[i])
+			continue;
+		con = to_srv_con(sess->s.con[i]);
+		rdma_disconnect(con->c.cm_id);
+		ib_drain_qp(con->c.qp);
+	}
+	/* Wait for all inflights */
+	ibtrs_srv_wait_ops_ids(sess);
+
+	/* Notify upper layer if we are the last path */
+	ibtrs_srv_sess_down(sess);
+
+	unmap_cont_bufs(sess);
+	ibtrs_srv_free_ops_ids(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		if (!sess->s.con[i])
+			continue;
+		con = to_srv_con(sess->s.con[i]);
+		ibtrs_cq_qp_destroy(&con->c);
+		rdma_destroy_id(con->c.cm_id);
+		kfree(con);
+	}
+	ibtrs_ib_dev_put(sess->s.dev);
+
+	del_path_from_srv(sess);
+	put_srv(sess->srv);
+	sess->srv = NULL;
+	ibtrs_srv_change_state(sess, IBTRS_SRV_CLOSED);
+
+	kfree(sess->dma_addr);
+	kfree(sess->s.con);
+	kfree(sess);
+}
+
+static int ibtrs_rdma_do_accept(struct ibtrs_srv_sess *sess,
+				struct rdma_cm_id *cm_id)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_msg_conn_rsp msg;
+	struct rdma_conn_param param;
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.retry_count = retry_count;
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_PROTO_VER);
+	msg.errno = 0;
+	msg.queue_depth = cpu_to_le16(srv->queue_depth);
+	msg.max_io_size = cpu_to_le32(max_chunk_size - MAX_HDR_SIZE);
+	msg.max_hdr_size = cpu_to_le32(MAX_HDR_SIZE);
+
+	err = rdma_accept(cm_id, &param);
+	if (err)
+		pr_err("rdma_accept(), err: %d\n", err);
+
+	return err;
+}
+
+static int ibtrs_rdma_do_reject(struct rdma_cm_id *cm_id, int errno)
+{
+	struct ibtrs_msg_conn_rsp msg;
+	int err;
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_PROTO_VER);
+	msg.errno = cpu_to_le16(errno);
+
+	err = rdma_reject(cm_id, &msg, sizeof(msg));
+	if (err)
+		pr_err("rdma_reject(), err: %d\n", err);
+
+	/* Bounce errno back */
+	return errno;
+}
+
+static struct ibtrs_srv_sess *
+__find_sess(struct ibtrs_srv *srv, const uuid_t *sess_uuid)
+{
+	struct ibtrs_srv_sess *sess;
+
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (uuid_equal(&sess->s.uuid, sess_uuid))
+			return sess;
+	}
+
+	return NULL;
+}
+
+static int create_con(struct ibtrs_srv_sess *sess,
+		      struct rdma_cm_id *cm_id,
+		      unsigned int cid)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_con *con;
+
+	u16 cq_size, wr_queue_size;
+	int err, cq_vector;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con)) {
+		ibtrs_err(sess, "kzalloc() failed\n");
+		err = -ENOMEM;
+		goto err;
+	}
+
+	con->c.cm_id = cm_id;
+	con->c.sess = &sess->s;
+	con->c.cid = cid;
+	atomic_set(&con->wr_cnt, 0);
+
+	if (con->c.cid == 0) {
+		/*
+		 * All receive and all send (each requiring invalidate)
+		 * + 2 for drain and heartbeat
+		 */
+		cq_size = wr_queue_size = SERVICE_CON_QUEUE_DEPTH * 3 + 2;
+	} else {
+		/*
+		 * Worst case: all receive requests and all write requests
+		 * are posted, each read request requires an invalidate
+		 * request, and one more entry is needed for the drain when
+		 * the QP gets into the error state.
+		 */
+		cq_size = srv->queue_depth * 3 + 1;
+		/*
+		 * In theory we might have queue_depth * 32
+		 * outstanding requests if an unsafe global key is used
+		 * and we have queue_depth read requests each consisting
+		 * of 32 different addresses.
+		 */
+		wr_queue_size = sess->s.dev->ib_dev->attrs.max_qp_wr;
+	}
+
+	cq_vector = ibtrs_srv_get_next_cq_vector(sess);
+
+	/* TODO: SOFTIRQ can be faster, but be careful with softirq context */
+	err = ibtrs_cq_qp_create(&sess->s, &con->c, 1, cq_vector, cq_size,
+				 wr_queue_size, IB_POLL_WORKQUEUE);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_cq_qp_create(), err: %d\n", err);
+		goto free_con;
+	}
+	if (con->c.cid == 0) {
+		err = post_recv_info_req(con);
+		if (unlikely(err))
+			goto free_cqqp;
+	}
+	WARN_ON(sess->s.con[cid]);
+	sess->s.con[cid] = &con->c;
+
+	/*
+	 * Change context from server to current connection.  The other
+	 * way is to use cm_id->qp->qp_context, which does not work on OFED.
+	 */
+	cm_id->context = &con->c;
+
+	return 0;
+
+free_cqqp:
+	ibtrs_cq_qp_destroy(&con->c);
+free_con:
+	kfree(con);
+
+err:
+	return err;
+}
+
+static struct ibtrs_srv_sess *__alloc_sess(struct ibtrs_srv *srv,
+					   struct rdma_cm_id *cm_id,
+					   unsigned int con_num,
+					   unsigned int recon_cnt,
+					   const uuid_t *uuid)
+{
+	struct ibtrs_srv_sess *sess;
+	int err = -ENOMEM;
+
+	if (unlikely(srv->paths_num >= MAX_PATHS_NUM)) {
+		err = -ECONNRESET;
+		goto err;
+	}
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	sess->dma_addr = kcalloc(srv->queue_depth, sizeof(*sess->dma_addr),
+				 GFP_KERNEL);
+	if (unlikely(!sess->dma_addr))
+		goto err_free_sess;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_dma_addr;
+
+	sess->state = IBTRS_SRV_CONNECTING;
+	sess->srv = srv;
+	sess->cur_cq_vector = -1;
+	sess->s.dst_addr = cm_id->route.addr.dst_addr;
+	sess->s.con_num = con_num;
+	sess->s.recon_cnt = recon_cnt;
+	uuid_copy(&sess->s.uuid, uuid);
+	spin_lock_init(&sess->state_lock);
+	INIT_WORK(&sess->close_work, ibtrs_srv_close_work);
+	ibtrs_srv_init_hb(sess);
+
+	sess->s.dev = ibtrs_ib_dev_find_or_add(cm_id->device, &dev_pool);
+	if (unlikely(!sess->s.dev)) {
+		err = -ENOMEM;
+		ibtrs_wrn(sess, "Failed to alloc ibtrs_device\n");
+		goto err_free_con;
+	}
+	err = map_cont_bufs(sess);
+	if (unlikely(err))
+		goto err_put_dev;
+
+	err = ibtrs_srv_alloc_ops_ids(sess);
+	if (unlikely(err))
+		goto err_unmap_bufs;
+
+	__add_path_to_srv(srv, sess);
+
+	return sess;
+
+err_unmap_bufs:
+	unmap_cont_bufs(sess);
+err_put_dev:
+	ibtrs_ib_dev_put(sess->s.dev);
+err_free_con:
+	kfree(sess->s.con);
+err_free_dma_addr:
+	kfree(sess->dma_addr);
+err_free_sess:
+	kfree(sess);
+
+err:
+	return ERR_PTR(err);
+}
+
+static int ibtrs_rdma_connect(struct rdma_cm_id *cm_id,
+			      const struct ibtrs_msg_conn_req *msg,
+			      size_t len)
+{
+	struct ibtrs_srv_ctx *ctx = cm_id->context;
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_srv *srv;
+
+	u16 version, con_num, cid;
+	u16 recon_cnt;
+	int err;
+
+	if (unlikely(len < sizeof(*msg))) {
+		pr_err("Invalid IBTRS connection request\n");
+		goto reject_w_econnreset;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != IBTRS_MAGIC)) {
+		pr_err("Invalid IBTRS magic\n");
+		goto reject_w_econnreset;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != IBTRS_PROTO_VER_MAJOR)) {
+		pr_err("Unsupported major IBTRS version: %d, expected %d\n",
+		       version >> 8, IBTRS_PROTO_VER_MAJOR);
+		goto reject_w_econnreset;
+	}
+	con_num = le16_to_cpu(msg->cid_num);
+	if (unlikely(con_num > 4096)) {
+		/* Sanity check */
+		pr_err("Too many connections requested: %d\n", con_num);
+		goto reject_w_econnreset;
+	}
+	cid = le16_to_cpu(msg->cid);
+	if (unlikely(cid >= con_num)) {
+		/* Sanity check */
+		pr_err("Incorrect cid: %d >= %d\n", cid, con_num);
+		goto reject_w_econnreset;
+	}
+	recon_cnt = le16_to_cpu(msg->recon_cnt);
+	srv = get_or_create_srv(ctx, &msg->paths_uuid);
+	if (unlikely(!srv)) {
+		err = -ENOMEM;
+		goto reject_w_err;
+	}
+	mutex_lock(&srv->paths_mutex);
+	sess = __find_sess(srv, &msg->sess_uuid);
+	if (sess) {
+		/* Session already holds a reference */
+		put_srv(srv);
+
+		if (unlikely(sess->s.recon_cnt != recon_cnt)) {
+			ibtrs_err(sess, "Reconnect detected %d != %d, but "
+				  "previous session is still alive, reconnect "
+				  "later\n", sess->s.recon_cnt, recon_cnt);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_ebusy;
+		}
+		if (unlikely(sess->state != IBTRS_SRV_CONNECTING)) {
+			ibtrs_err(sess, "Session in wrong state: %s\n",
+				  ibtrs_srv_state_str(sess->state));
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		/*
+		 * Sanity checks
+		 */
+		if (unlikely(con_num != sess->s.con_num ||
+			     cid >= sess->s.con_num)) {
+			ibtrs_err(sess, "Incorrect request: %d, %d\n",
+				  cid, con_num);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		if (unlikely(sess->s.con[cid])) {
+			ibtrs_err(sess, "Connection already exists: %d\n",
+				  cid);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+	} else {
+		sess = __alloc_sess(srv, cm_id, con_num, recon_cnt,
+				    &msg->sess_uuid);
+		if (unlikely(IS_ERR(sess))) {
+			mutex_unlock(&srv->paths_mutex);
+			put_srv(srv);
+			err = PTR_ERR(sess);
+			goto reject_w_err;
+		}
+	}
+	err = create_con(sess, cm_id, cid);
+	if (unlikely(err)) {
+		(void)ibtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since session has other connections we follow normal way
+		 * through workqueue, but still return an error to tell cma.c
+		 * to call rdma_destroy_id() for current connection.
+		 */
+		goto close_and_return_err;
+	}
+	err = ibtrs_rdma_do_accept(sess, cm_id);
+	if (unlikely(err)) {
+		(void)ibtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since current connection was successfully added to the
+		 * session we follow normal way through workqueue to close the
+		 * session, thus return 0 to tell cma.c we call
+		 * rdma_destroy_id() ourselves.
+		 */
+		err = 0;
+		goto close_and_return_err;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return 0;
+
+reject_w_err:
+	return ibtrs_rdma_do_reject(cm_id, err);
+
+reject_w_econnreset:
+	return ibtrs_rdma_do_reject(cm_id, -ECONNRESET);
+
+reject_w_ebusy:
+	return ibtrs_rdma_do_reject(cm_id, -EBUSY);
+
+close_and_return_err:
+	close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+
+static int ibtrs_srv_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct ibtrs_srv_sess *sess = NULL;
+
+	if (ev->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+		struct ibtrs_con *c = cm_id->context;
+
+		sess = to_srv_sess(c->sess);
+	}
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_CONNECT_REQUEST:
+		/*
+		 * In case of error cma.c will destroy cm_id,
+		 * see cma_process_remove()
+		 */
+		return ibtrs_rdma_connect(cm_id, ev->param.conn.private_data,
+					  ev->param.conn.private_data_len);
+	case RDMA_CM_EVENT_ESTABLISHED:
+		/* Nothing here */
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		ibtrs_err(sess, "CM error (CM event: %s, err: %d)\n",
+			  rdma_event_msg(ev->event), ev->status);
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		close_sess(sess);
+		break;
+	default:
+		pr_err("Ignoring unexpected CM event %s, err %d\n",
+		       rdma_event_msg(ev->event), ev->status);
+		break;
+	}
+
+	return 0;
+}
+
+static struct rdma_cm_id *ibtrs_srv_cm_init(struct ibtrs_srv_ctx *ctx,
+					    struct sockaddr *addr,
+					    enum rdma_ucm_port_space ps)
+{
+	struct rdma_cm_id *cm_id;
+	int ret;
+
+	cm_id = rdma_create_id(&init_net, ibtrs_srv_rdma_cm_handler,
+			       ctx, ps, IB_QPT_RC);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		pr_err("Creating id for RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_out;
+	}
+	ret = rdma_bind_addr(cm_id, addr);
+	if (ret) {
+		pr_err("Binding RDMA address failed, err: %d\n", ret);
+		goto err_cm;
+	}
+	ret = rdma_listen(cm_id, 64);
+	if (ret) {
+		pr_err("Listening on RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_cm;
+	}
+
+	return cm_id;
+
+err_cm:
+	rdma_destroy_id(cm_id);
+err_out:
+
+	return ERR_PTR(ret);
+}
+
+static int ibtrs_srv_rdma_init(struct ibtrs_srv_ctx *ctx, unsigned int port)
+{
+	struct sockaddr_in6 sin = {
+		.sin6_family	= AF_INET6,
+		.sin6_addr	= IN6ADDR_ANY_INIT,
+		.sin6_port	= htons(port),
+	};
+	struct sockaddr_ib sib = {
+		.sib_family			= AF_IB,
+		.sib_addr.sib_subnet_prefix	= 0ULL,
+		.sib_addr.sib_interface_id	= 0ULL,
+		.sib_sid	= cpu_to_be64(RDMA_IB_IP_PS_IB | port),
+		.sib_sid_mask	= cpu_to_be64(0xffffffffffffffffULL),
+		.sib_pkey	= cpu_to_be16(0xffff),
+	};
+	struct rdma_cm_id *cm_ip, *cm_ib;
+	int ret;
+
+	/*
+	 * We accept both IPoIB and IB connections, so we need to keep
+	 * two cm id's, one for each socket type and port space.
+	 * If the cm initialization of one of the id's fails, we abort
+	 * everything.
+	 */
+	cm_ip = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sin, RDMA_PS_TCP);
+	if (unlikely(IS_ERR(cm_ip)))
+		return PTR_ERR(cm_ip);
+
+	cm_ib = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sib, RDMA_PS_IB);
+	if (unlikely(IS_ERR(cm_ib))) {
+		ret = PTR_ERR(cm_ib);
+		goto free_cm_ip;
+	}
+
+	ctx->cm_id_ip = cm_ip;
+	ctx->cm_id_ib = cm_ib;
+
+	return 0;
+
+free_cm_ip:
+	rdma_destroy_id(cm_ip);
+
+	return ret;
+}
+
+static struct ibtrs_srv_ctx *alloc_srv_ctx(rdma_ev_fn *rdma_ev,
+					   link_ev_fn *link_ev)
+{
+	struct ibtrs_srv_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->rdma_ev = rdma_ev;
+	ctx->link_ev = link_ev;
+	mutex_init(&ctx->srv_mutex);
+	INIT_LIST_HEAD(&ctx->srv_list);
+
+	return ctx;
+}
+
+static void free_srv_ctx(struct ibtrs_srv_ctx *ctx)
+{
+	WARN_ON(!list_empty(&ctx->srv_list));
+	kfree(ctx);
+}
+
+struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port)
+{
+	struct ibtrs_srv_ctx *ctx;
+	int err;
+
+	ctx = alloc_srv_ctx(rdma_ev, link_ev);
+	if (unlikely(!ctx))
+		return ERR_PTR(-ENOMEM);
+
+	err = ibtrs_srv_rdma_init(ctx, port);
+	if (unlikely(err)) {
+		free_srv_ctx(ctx);
+		return ERR_PTR(err);
+	}
+	/* Do not let module be unloaded if server context is alive */
+	__module_get(THIS_MODULE);
+
+	return ctx;
+}
+EXPORT_SYMBOL(ibtrs_srv_open);
+
+void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess)
+{
+	close_sess(sess);
+}
+
+static void close_sess(struct ibtrs_srv_sess *sess)
+{
+	enum ibtrs_srv_state old_state;
+
+	if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING,
+					   &old_state))
+		queue_work(ibtrs_wq, &sess->close_work);
+	WARN_ON(sess->state != IBTRS_SRV_CLOSING);
+}
+
+static void close_sessions(struct ibtrs_srv *srv)
+{
+	struct ibtrs_srv_sess *sess;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry)
+		close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static void close_ctx(struct ibtrs_srv_ctx *ctx)
+{
+	struct ibtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list)
+		close_sessions(srv);
+	mutex_unlock(&ctx->srv_mutex);
+	flush_workqueue(ibtrs_wq);
+}
+
+void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx)
+{
+	rdma_destroy_id(ctx->cm_id_ip);
+	rdma_destroy_id(ctx->cm_id_ib);
+	close_ctx(ctx);
+	free_srv_ctx(ctx);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(ibtrs_srv_close);
+
+static int check_module_params(void)
+{
+	if (sess_queue_depth < 1 || sess_queue_depth > MAX_SESS_QUEUE_DEPTH) {
+		pr_err("Invalid sess_queue_depth value %d, has to be"
+		       " >= %d, <= %d.\n",
+		       sess_queue_depth, 1, MAX_SESS_QUEUE_DEPTH);
+		return -EINVAL;
+	}
+	if (max_chunk_size < 4096 || !is_power_of_2(max_chunk_size)) {
+		pr_err("Invalid max_chunk_size value %d, has to be"
+		       " >= %d and should be power of two.\n",
+		       max_chunk_size, 4096);
+		return -EINVAL;
+	}
+
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and the
+	 * offset inside the memory chunk
+	 */
+	if ((ilog2(sess_queue_depth-1)+1) + (ilog2(max_chunk_size-1)+1) >
+	    MAX_IMM_PAYL_BITS) {
+		pr_err("RDMA immediate size (%db) not enough to encode "
+		       "%d buffers of size %dB. Reduce 'sess_queue_depth' "
+		       "or 'max_chunk_size' parameters.\n", MAX_IMM_PAYL_BITS,
+		       sess_queue_depth, max_chunk_size);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init ibtrs_server_init(void)
+{
+	int err;
+
+	if (!strlen(cq_affinity_list))
+		init_cq_affinity();
+
+	pr_info("Loading module %s, version %s, proto %s: "
+		"(retry_count: %d, cq_affinity_list: %s, "
+		"max_chunk_size: %d (pure IO %ld, headers %ld), "
+		"sess_queue_depth: %d)\n",
+		KBUILD_MODNAME, IBTRS_VER_STRING, IBTRS_PROTO_VER_STRING,
+		retry_count, cq_affinity_list, max_chunk_size,
+		max_chunk_size - MAX_HDR_SIZE, MAX_HDR_SIZE,
+		sess_queue_depth);
+
+	ibtrs_ib_dev_pool_init(0, &dev_pool);
+
+	err = check_module_params();
+	if (err) {
+		pr_err("Failed to load module, invalid module parameters,"
+		       " err: %d\n", err);
+		return err;
+	}
+	chunk_pool = mempool_create_page_pool(sess_queue_depth * CHUNK_POOL_SZ,
+					      get_order(max_chunk_size));
+	if (unlikely(!chunk_pool)) {
+		pr_err("Failed to preallocate pool of chunks\n");
+		return -ENOMEM;
+	}
+	ibtrs_dev_class = class_create(THIS_MODULE, "ibtrs-server");
+	if (unlikely(IS_ERR(ibtrs_dev_class))) {
+		pr_err("Failed to create ibtrs-server dev class\n");
+		err = PTR_ERR(ibtrs_dev_class);
+		goto out_chunk_pool;
+	}
+	ibtrs_wq = alloc_workqueue("ibtrs_server_wq", WQ_MEM_RECLAIM, 0);
+	if (unlikely(!ibtrs_wq)) {
+		pr_err("Failed to load module, alloc ibtrs_server_wq failed\n");
+		err = -ENOMEM;
+		goto out_dev_class;
+	}
+
+	return 0;
+
+out_dev_class:
+	class_destroy(ibtrs_dev_class);
+out_chunk_pool:
+	mempool_destroy(chunk_pool);
+
+	return err;
+}
+
+static void __exit ibtrs_server_exit(void)
+{
+	destroy_workqueue(ibtrs_wq);
+	class_destroy(ibtrs_dev_class);
+	mempool_destroy(chunk_pool);
+	ibtrs_ib_dev_pool_deinit(&dev_pool);
+}
+
+module_init(ibtrs_server_init);
+module_exit(ibtrs_server_exit);
-- 
2.13.1
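
Note on the immediate-data layout used above: ibtrs_srv_rdma_done() splits
the immediate payload into a buffer id (upper bits) and an offset inside the
chunk (lower mem_bits bits), and check_module_params() verifies that both
fields fit into the payload.  The userspace sketch below only illustrates
that arithmetic and is not part of the patch.  All function names in it are
hypothetical, the payload width is a parameter because MAX_IMM_PAYL_BITS is
defined outside this file, and it assumes that mem_bits equals
ilog2(max_chunk_size - 1) + 1.  imm_valid() uses strict bounds, which is
what indexing chunks[] with the buffer id requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits needed to encode the values 0..n-1, i.e. ilog2(n - 1) + 1, n >= 2. */
static unsigned int bits_for(uint32_t n)
{
	unsigned int bits = 0;

	assert(n >= 2);
	for (n -= 1; n; n >>= 1)
		bits++;
	return bits;
}

/* True if a buffer id and an in-chunk offset fit into payload_bits bits. */
static bool imm_fits(uint32_t queue_depth, uint32_t chunk_size,
		     unsigned int payload_bits)
{
	return bits_for(queue_depth) + bits_for(chunk_size) <= payload_bits;
}

/* Sender side: buffer id in the upper bits, offset in the lower mem_bits. */
static uint32_t imm_pack(uint32_t msg_id, uint32_t off, unsigned int mem_bits)
{
	assert(mem_bits > 0 && mem_bits < 32);
	assert(off < (UINT32_C(1) << mem_bits));
	return (msg_id << mem_bits) | off;
}

/* Receiver side: the same shift and mask as in ibtrs_srv_rdma_done(). */
static void imm_unpack(uint32_t payload, unsigned int mem_bits,
		       uint32_t *msg_id, uint32_t *off)
{
	*msg_id = payload >> mem_bits;
	*off = payload & ((UINT32_C(1) << mem_bits) - 1);
}

/* Valid ids are 0..queue_depth-1, valid offsets are 0..chunk_size-1. */
static bool imm_valid(uint32_t msg_id, uint32_t off,
		      uint32_t queue_depth, uint32_t chunk_size)
{
	return msg_id < queue_depth && off < chunk_size;
}
```

For example, a 128 KiB chunk needs 17 offset bits, so imm_pack(5, 4096, 17)
yields 659456 and imm_unpack() on that value returns 5 and 4096 again.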


* [PATCH v2 12/26] ibtrs: server: statistics functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (10 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 11/26] ibtrs: server: main functionality Roman Pen
@ 2018-05-18 13:03 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 13/26] ibtrs: server: sysfs interface functions Roman Pen
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:03 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This introduces a set of functions used on the server side to account
for statistics of RDMA data sent and received.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
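For illustration only, a rough userspace sketch of the accounting scheme:
one counter pair per direction, updated atomically, reset on request and
printed as a single line.  It uses C11 <stdatomic.h> in place of the kernel
atomic64_t API.  The names rdma_stats, stats_update(), stats_reset() and
stats_to_str() are assumptions made for the sketch, and the inflight
counter that the kernel version appends to the output is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

enum { DIR_READ, DIR_WRITE, DIR_NUM };

struct dir_stats {
	atomic_llong cnt;		/* number of transfers */
	atomic_llong size_total;	/* bytes transferred */
};

struct rdma_stats {
	struct dir_stats dir[DIR_NUM];
};

/* Account one transfer of 'size' bytes in direction 'd'. */
static void stats_update(struct rdma_stats *s, size_t size, int d)
{
	assert(d >= 0 && d < DIR_NUM);
	atomic_fetch_add(&s->dir[d].cnt, 1);
	atomic_fetch_add(&s->dir[d].size_total, (long long)size);
}

/* Zero every counter, one atomic store at a time. */
static void stats_reset(struct rdma_stats *s)
{
	for (int d = 0; d < DIR_NUM; d++) {
		atomic_store(&s->dir[d].cnt, 0);
		atomic_store(&s->dir[d].size_total, 0);
	}
}

/* Print "<read cnt> <read bytes> <write cnt> <write bytes>\n" into buf. */
static int stats_to_str(struct rdma_stats *s, char *buf, size_t len)
{
	return snprintf(buf, len, "%lld %lld %lld %lld\n",
			(long long)atomic_load(&s->dir[DIR_READ].cnt),
			(long long)atomic_load(&s->dir[DIR_READ].size_total),
			(long long)atomic_load(&s->dir[DIR_WRITE].cnt),
			(long long)atomic_load(&s->dir[DIR_WRITE].size_total));
}
```

The counters are independent, so a reader may observe a count that is one
transfer ahead of the matching byte total; for statistics that is usually
acceptable and avoids taking a lock on the I/O path.
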
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c | 110 +++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
new file mode 100644
index 000000000000..5933cfc03f95
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
@@ -0,0 +1,110 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-srv.h"
+
+void ibtrs_srv_update_rdma_stats(struct ibtrs_srv_stats *s,
+				 size_t size, int d)
+{
+	atomic64_inc(&s->rdma_stats.dir[d].cnt);
+	atomic64_add(size, &s->rdma_stats.dir[d].size_total);
+}
+
+void ibtrs_srv_update_wc_stats(struct ibtrs_srv_stats *s)
+{
+	atomic64_inc(&s->wc_comp.calls);
+	atomic64_inc(&s->wc_comp.total_wc_cnt);
+}
+
+int ibtrs_srv_reset_rdma_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+
+		memset(r, 0, sizeof(*r));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
+				    char *page, size_t len)
+{
+	struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+	struct ibtrs_srv_sess *sess;
+
+	sess = container_of(stats, typeof(*sess), stats);
+
+	return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
+			 (s64)atomic64_read(&r->dir[READ].cnt),
+			 (s64)atomic64_read(&r->dir[READ].size_total),
+			 (s64)atomic64_read(&r->dir[WRITE].cnt),
+			 (s64)atomic64_read(&r->dir[WRITE].size_total),
+			 atomic_read(&sess->ids_inflight));
+}
+
+int ibtrs_srv_reset_wc_completion_stats(struct ibtrs_srv_stats *stats,
+					bool enable)
+{
+	if (enable) {
+		memset(&stats->wc_comp, 0, sizeof(stats->wc_comp));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+int ibtrs_srv_stats_wc_completion_to_str(struct ibtrs_srv_stats *stats,
+					 char *buf, size_t len)
+{
+	return scnprintf(buf, len, "%lld %lld\n",
+			(s64)atomic64_read(&stats->wc_comp.total_wc_cnt),
+			(s64)atomic64_read(&stats->wc_comp.calls));
+}
+
+ssize_t ibtrs_srv_reset_all_help(struct ibtrs_srv_stats *stats,
+				 char *page, size_t len)
+{
+	return scnprintf(page, len, "echo 1 to reset all statistics\n");
+}
+
+int ibtrs_srv_reset_all_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		ibtrs_srv_reset_wc_completion_stats(stats, enable);
+		ibtrs_srv_reset_rdma_stats(stats, enable);
+		return 0;
+	}
+
+	return -EINVAL;
+}
-- 
2.13.1


* [PATCH v2 13/26] ibtrs: server: sysfs interface functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (11 preceding siblings ...)
  2018-05-18 13:03 ` [PATCH v2 12/26] ibtrs: server: statistics functions Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation Roman Pen
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the sysfs interface to IBTRS sessions on the server side:

  /sys/devices/virtual/ibtrs-server/<SESS-NAME>/
    *** IBTRS session accepted from a client peer
    |
    |- paths/<SOURCE-IP>/
       *** established paths from a client in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- hca_name
       |  *** HCA name
       |
       |- hca_port
       |  *** HCA port
       |
       |- stats/
          *** current path statistics
          |
          |- rdma
          |- reset_all
          |- wc_completions

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
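For illustration only, and not part of the patch: a userspace sketch of how
a tool might address the attributes in the tree above.  The directory
layout is taken from the commit message; the helper names attr_path() and
attr_write_one() and any session name or address passed to them are
hypothetical.  Writing "1" follows the usage text printed by the disconnect
attribute.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define IBTRS_SRV_SYSFS_ROOT "/sys/devices/virtual/ibtrs-server"

/*
 * Build "<root>/<sess_name>/paths/<src_addr>/<attr>" into buf.
 * Returns 0 on success, -1 on a bad component or a too-small buffer.
 */
static int attr_path(char *buf, size_t len, const char *sess_name,
		     const char *src_addr, const char *attr)
{
	int n;

	assert(buf && sess_name && src_addr && attr);

	/* Directory names must be non-empty and must not contain '/'. */
	if (!*sess_name || strchr(sess_name, '/') ||
	    !*src_addr || strchr(src_addr, '/') || !*attr)
		return -1;

	n = snprintf(buf, len, "%s/%s/paths/%s/%s", IBTRS_SRV_SYSFS_ROOT,
		     sess_name, src_addr, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" to an attribute such as disconnect or stats/reset_all. */
static int attr_write_one(const char *path)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fputs("1\n", f) < 0)
		ret = -1;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

A caller would first build the path, for instance with the attribute name
"stats/reset_all" or "disconnect", and then pass it to attr_write_one().
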
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c | 271 +++++++++++++++++++++++++
 1 file changed, 271 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
new file mode 100644
index 000000000000..96d9d9f08e0e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
@@ -0,0 +1,271 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+extern struct class *ibtrs_dev_class;
+
+static struct kobj_type ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+};
+
+static ssize_t ibtrs_srv_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_srv_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibtrs_srv_sess *sess;
+	char str[MAXHOSTNAMELEN];
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: invalid value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	ibtrs_info(sess, "disconnect for path %s requested\n", str);
+	ibtrs_srv_queue_close(sess);
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_srv_disconnect_attr =
+	__ATTR(disconnect, 0644,
+	       ibtrs_srv_disconnect_show, ibtrs_srv_disconnect_store);
+
+static ssize_t ibtrs_srv_hca_port_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_con *usr_con;
+
+	sess = container_of(kobj, typeof(*sess), kobj);
+	usr_con = sess->s.con[0];
+
+	return scnprintf(page, PAGE_SIZE, "%u\n",
+			 usr_con->cm_id->port_num);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_port_attr =
+	__ATTR(hca_port, 0444, ibtrs_srv_hca_port_show, NULL);
+
+static ssize_t ibtrs_srv_hca_name_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 sess->s.dev->ib_dev->name);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_name_attr =
+	__ATTR(hca_name, 0444, ibtrs_srv_hca_name_show, NULL);
+
+static struct attribute *ibtrs_srv_sess_attrs[] = {
+	&ibtrs_srv_hca_name_attr.attr,
+	&ibtrs_srv_hca_port_attr.attr,
+	&ibtrs_srv_disconnect_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_srv_sess_attr_group = {
+	.attrs = ibtrs_srv_sess_attrs,
+};
+
+STAT_ATTR(struct ibtrs_srv_sess, rdma,
+	  ibtrs_srv_stats_rdma_to_str,
+	  ibtrs_srv_reset_rdma_stats);
+
+STAT_ATTR(struct ibtrs_srv_sess, wc_completion,
+	  ibtrs_srv_stats_wc_completion_to_str,
+	  ibtrs_srv_reset_wc_completion_stats);
+
+STAT_ATTR(struct ibtrs_srv_sess, reset_all,
+	  ibtrs_srv_reset_all_help,
+	  ibtrs_srv_reset_all_stats);
+
+static struct attribute *ibtrs_srv_stats_attrs[] = {
+	&rdma_attr.attr,
+	&wc_completion_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_srv_stats_attr_group = {
+	.attrs = ibtrs_srv_stats_attrs,
+};
+
+static void ibtrs_srv_dev_release(struct device *dev)
+{
+	/* Nobody plays with device references, so nop */
+}
+
+static int ibtrs_srv_create_once_sysfs_root_folders(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int err = 0;
+
+	mutex_lock(&srv->paths_mutex);
+	if (srv->dev_ref++) {
+		/*
+		 * Just increase device reference.  We can't use get_device()
+		 * because we need to unregister device when ref goes to 0,
+		 * not just to put it.
+		 */
+		goto unlock;
+	}
+	srv->dev.class = ibtrs_dev_class;
+	srv->dev.release = ibtrs_srv_dev_release;
+	dev_set_name(&srv->dev, "%s", sess->s.sessname);
+
+	err = device_register(&srv->dev);
+	if (unlikely(err)) {
+		pr_err("device_register(): %d\n", err);
+		goto unlock;
+	}
+	err = kobject_init_and_add(&srv->kobj_paths, &ktype,
+				   &srv->dev.kobj, "paths");
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add(): %d\n", err);
+		device_unregister(&srv->dev);
+		goto unlock;
+	}
+unlock:
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+
+static void ibtrs_srv_destroy_once_sysfs_root_folders(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+
+	mutex_lock(&srv->paths_mutex);
+	if (!--srv->dev_ref) {
+		kobject_put(&srv->kobj_paths);
+		device_unregister(&srv->dev);
+	}
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static int ibtrs_srv_create_stats_files(struct ibtrs_srv_sess *sess)
+{
+	int err;
+
+	err = kobject_init_and_add(&sess->kobj_stats, &ktype,
+				   &sess->kobj, "stats");
+	if (unlikely(err)) {
+		ibtrs_err(sess, "kobject_init_and_add(): %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj_stats,
+				 &ibtrs_srv_stats_attr_group);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "sysfs_create_group(): %d\n", err);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_put(&sess->kobj_stats);
+
+	return err;
+}
+
+int ibtrs_srv_create_sess_files(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	char str[MAXHOSTNAMELEN];
+	int err;
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	err = ibtrs_srv_create_once_sysfs_root_folders(sess);
+	if (unlikely(err))
+		return err;
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &srv->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "kobject_init_and_add(): %d\n", err);
+		goto destroy_root;
+	}
+	err = sysfs_create_group(&sess->kobj, &ibtrs_srv_sess_attr_group);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = ibtrs_srv_create_stats_files(sess);
+	if (unlikely(err))
+		goto remove_group;
+
+	return 0;
+
+remove_group:
+	sysfs_remove_group(&sess->kobj, &ibtrs_srv_sess_attr_group);
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+destroy_root:
+	ibtrs_srv_destroy_once_sysfs_root_folders(sess);
+
+	return err;
+}
+
+void ibtrs_srv_destroy_sess_files(struct ibtrs_srv_sess *sess)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+
+		ibtrs_srv_destroy_once_sysfs_root_folders(sess);
+	}
+}
-- 
2.13.1
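
For readers following ibtrs_srv_create_once_sysfs_root_folders() and its
destroy counterpart above, the create-once/destroy-once counting can be
sketched in plain userspace C. This is only an illustration, not the driver
code: struct root, root_get(), root_put() and the demo callbacks are
hypothetical names, and the caller is assumed to hold a lock that plays the
role of paths_mutex. Rolling the counter back when the one-time setup fails
is an assumption about the intended behaviour, since the patch as posted
leaves dev_ref incremented on that path.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-server state guarded by paths_mutex. */
struct root {
	int ref;			/* number of sessions using the root */
	int (*setup)(void *ctx);	/* runs only for the first user */
	void (*teardown)(void *ctx);	/* runs only for the last user */
	void *ctx;
};

/* Take a reference; the first caller performs the one-time setup. */
static int root_get(struct root *r)
{
	int err;

	if (r->ref++)
		return 0;		/* already set up, just count */

	err = r->setup(r->ctx);
	if (err)
		r->ref--;		/* assumption: roll back on failure */

	return err;
}

/* Drop a reference; the last caller performs the teardown. */
static void root_put(struct root *r)
{
	assert(r->ref > 0);
	if (!--r->ref)
		r->teardown(r->ctx);
}

/* Demo callbacks: count how often setup and teardown really ran. */
struct counters {
	int setups;
	int teardowns;
	int fail_setup;			/* non-zero makes demo_setup() fail */
};

static int demo_setup(void *ctx)
{
	struct counters *c = ctx;

	if (c->fail_setup)
		return -1;
	c->setups++;
	return 0;
}

static void demo_teardown(void *ctx)
{
	struct counters *c = ctx;

	c->teardowns++;
}
```

With two root_get() calls followed by two root_put() calls, demo_setup() and
demo_teardown() each run exactly once, which mirrors how only the first and
the last session touch the "paths" folder.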

* [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (12 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 13/26] ibtrs: server: sysfs interface functions Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-20 22:14   ` kbuild test robot
                     ` (2 more replies)
  2018-05-18 13:04 ` [PATCH v2 15/26] ibtrs: a bit of documentation Roman Pen
                   ` (12 subsequent siblings)
  26 siblings, 3 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

Add IBTRS Makefile, Kconfig and also corresponding lines into upper
layer infiniband/ulp files.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/Kconfig            |  1 +
 drivers/infiniband/ulp/Makefile       |  1 +
 drivers/infiniband/ulp/ibtrs/Kconfig  | 20 ++++++++++++++++++++
 drivers/infiniband/ulp/ibtrs/Makefile | 15 +++++++++++++++
 4 files changed, 37 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..787bd286fb08 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
+source "drivers/infiniband/ulp/ibtrs/Kconfig"
 
 source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index 437813c7b481..1c4f10dc8d49 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)		+= srpt/
 obj-$(CONFIG_INFINIBAND_ISER)		+= iser/
 obj-$(CONFIG_INFINIBAND_ISERT)		+= isert/
 obj-$(CONFIG_INFINIBAND_OPA_VNIC)	+= opa_vnic/
+obj-$(CONFIG_INFINIBAND_IBTRS)		+= ibtrs/
diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
new file mode 100644
index 000000000000..eaeb8f3f6b4e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Kconfig
@@ -0,0 +1,20 @@
+config INFINIBAND_IBTRS
+	tristate
+	depends on INFINIBAND_ADDR_TRANS
+
+config INFINIBAND_IBTRS_CLIENT
+	tristate "IBTRS client module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_IBTRS
+	help
+	  IBTRS client allows for simplified data transfer and connection
+	  establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
+	  READ/WRITE semantics and provides multipath capabilities.
+
+config INFINIBAND_IBTRS_SERVER
+	tristate "IBTRS server module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_IBTRS
+	help
+	  IBTRS server module processing connection and IO requests received
+	  from the IBTRS client module.
diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
new file mode 100644
index 000000000000..e6ea858745ad
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Makefile
@@ -0,0 +1,15 @@
+ibtrs-client-y := ibtrs-clt.o \
+		  ibtrs-clt-stats.o \
+		  ibtrs-clt-sysfs.o
+
+ibtrs-server-y := ibtrs-srv.o \
+		  ibtrs-srv-stats.o \
+		  ibtrs-srv-sysfs.o
+
+ibtrs-core-y := ibtrs.o
+
+obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o
+obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
+obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.13.1

* [PATCH v2 15/26] ibtrs: a bit of documentation
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (13 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 16/26] ibnbd: private headers with IBNBD protocol structs and helpers Roman Pen
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

README with description of major sysfs entries.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/README | 358 ++++++++++++++++++++++++++++++++++++
 1 file changed, 358 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/README

diff --git a/drivers/infiniband/ulp/ibtrs/README b/drivers/infiniband/ulp/ibtrs/README
new file mode 100644
index 000000000000..010a93b02d9c
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/README
@@ -0,0 +1,358 @@
+****************************
+InfiniBand Transport (IBTRS)
+****************************
+
+IBTRS (InfiniBand Transport) is a reliable high speed transport library
+which provides support to establish optimal number of connections
+between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
+transport. It is optimized to transfer (read/write) IO blocks.
+
+In its core interface it follows the BIO semantics of providing the
+possibility to either write data from an sg list to the remote side
+or to request ("read") data transfer from the remote side into a given
+sg list.
+
+IBTRS provides I/O fail-over and load-balancing capabilities by using
+multipath I/O (see "add_path" and "mp_policy" configuration entries).
+
+IBTRS is used by the IBNBD (InfiniBand Network Block Device) modules.
+
+======================
+Client Sysfs Interface
+======================
+
+This chapter describes only the most important files of the sysfs interface
+on the client side.
+
+Entries under /sys/devices/virtual/ibtrs-client/
+================================================
+
+When a user of IBTRS API creates a new session, a directory entry with
+the name of that session is created.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/
+===============================================================
+
+add_path (RW)
+-------------
+
+Adds a new path (connection) to an existing session. Expected format is the
+following:
+
+  <[source addr,]destination addr>
+
+  *addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]
+
+max_reconnect_attempts (RW)
+---------------------------
+
+Maximum number of reconnect attempts the client should make before giving up
+after a connection breaks unexpectedly.
+
+mp_policy (RW)
+--------------
+
+Multipath policy specifies which path should be selected on each IO:
+
+   round-robin (0):
+       select a path in a per-CPU round-robin manner.
+
+   min-inflight (1):
+       select the path with the fewest inflight requests.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/paths/
+=====================================================================
+
+
+Each path belonging to a given session is listed here by its destination
+address. When a new path is added to a session by writing to the "add_path"
+entry, a directory with the corresponding destination address is created.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/paths/<dest-addr>/
+=================================================================================
+
+state (R)
+---------
+
+Contains "connected" if the session is connected to the peer and fully
+functional.  Otherwise the file contains "disconnected".
+
+reconnect (RW)
+--------------
+
+Write "1" to the file in order to reconnect the path.
+Operation is blocking and returns 0 if reconnect was successful.
+
+disconnect (RW)
+---------------
+
+Write "1" to the file in order to disconnect the path.
+Operation blocks until IBTRS path is disconnected.
+
+remove_path (RW)
+----------------
+
+Write "1" to the file in order to disconnect and remove the path
+from the session.  Operation blocks until the path is disconnected
+and removed from the session.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/paths/<dest-addr>/stats/
+=======================================================================================
+
+Write "0" to any file in that directory to reset corresponding statistics.
+
+reset_all (RW)
+--------------
+
+Read will return usage help, write 0 will clear all the statistics.
+
+sg_entries (RW)
+---------------
+
+Data to be transferred via RDMA is passed to IBTRS as a scatter-gather
+list. A scatter-gather list can contain multiple entries.
+Scatter-gather lists with fewer entries require less processing power
+and can therefore be transferred faster. The file sg_entries outputs a
+per-CPU distribution table for the number of entries in the
+scatter-gather lists that were passed to the IBTRS API function
+ibtrs_clt_request (READ or WRITE).
+
+cpu_migration (RW)
+------------------
+
+IBTRS expects that each HCA IRQ is pinned to a separate CPU. If that is
+not the case, an I/O response could be processed on a different CPU
+than the one it was originally submitted on.  This file shows how many
+interrupts were generated on an unexpected CPU.
+"from:" is the CPU on which the IRQ was expected, but not generated.
+"to:" is the CPU on which the IRQ was generated, but not expected.
+
+reconnects (RW)
+---------------
+
+Contains 2 unsigned int values, the first one records number of successful
+reconnects in the path lifetime, the second one records number of failed
+reconnects in the path lifetime.
+
+rdma_lat (RW)
+-------------
+
+Latency distribution of IBTRS requests.
+The format is:
+   1 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   2 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   4 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   8 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  16 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  ...
+  65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  >= 65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  maximum ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+
+wc_completion (RW)
+------------------
+
+Contains 2 unsigned int values, the first one records max number of work
+requests processed in work_completion in session lifetime, the second
+one records average number of work requests processed in work_completion
+in session lifetime.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 6 values:
+
+<read-count> <read-total-size> <write-count> <write-total-size> \
+<inflights> <failovered>
+
+======================
+Server Sysfs Interface
+======================
+
+Entries under /sys/devices/virtual/ibtrs-server/
+================================================
+
+When a user of IBTRS API creates a new session on a client side, a
+directory entry with the name of that session is created in here.
+
+Entries under /sys/devices/virtual/ibtrs-server/<session-name>/paths/
+=====================================================================
+
+When a new path is created by writing to the "add_path" entry on the client
+side, a directory named after the source address is created on the server.
+
+Entries under /sys/devices/virtual/ibtrs-server/<session-name>/paths/<source-addr>/
+===================================================================================
+
+disconnect (RW)
+---------------
+
+When "1" is written to the file, the IBTRS session is disconnected.
+The operation is non-blocking and returns control immediately to the caller.
+
+hca_name (R)
+------------
+
+Contains the name of the HCA the connection is established on.
+
+hca_port (R)
+------------
+
+Contains the number of the active port the traffic is going through.
+
+Entries under /sys/devices/virtual/ibtrs-server/<session-name>/paths/<source-addr>/stats/
+=========================================================================================
+
+When "0" is written to a file in this directory, the corresponding counters
+will be reset.
+
+reset_all (RW)
+--------------
+
+Read will return usage help, write 0 will clear all the counters about
+stats.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 5 values:
+
+<read-count> <read-total-size> <write-count> <write-total-size> <inflights>
+
+wc_completion (RW)
+------------------
+
+Contains 3 values: the first one, an int, records the max number of work
+requests processed in work_completion in session lifetime; the second
+one, a long int, records the total number of work requests processed in
+work_completion in session lifetime; and the third one, a long int, records
+the total number of calls to the cq completion handler. Dividing the 2nd
+number by the 3rd gives the average number of completions processed
+in the completion handler.
+
+==================
+Transport protocol
+==================
+
+Overview
+--------
+An established connection between a client and a server is called an ibtrs
+session. A session is associated with a set of memory chunks reserved on the
+server side for a given client for rdma transfer. A session
+consists of multiple paths, each representing a separate physical link
+between client and server. Those are used for load balancing and failover.
+Each path consists of as many connections (QPs) as there are cpus on
+the client.
+
+When processing an incoming rdma write or read request, the ibtrs client uses
+memory chunks reserved for it on the server side. Their number, size and
+addresses need to be exchanged between client and server during the connection
+establishment phase. Apart from the memory related information, the client
+needs to inform the server about the session name and identify each path and
+connection individually.
+
+On an established session the client sends write or read messages to the
+server. The server uses the immediate field to tell the client which request
+is being acknowledged and to carry the errno. The client uses the immediate
+field to tell the server which of the memory chunks has been accessed and at
+which offset the message can be found.
+
+Connection establishment
+------------------------
+
+1. Client starts establishing connections belonging to a path of a session one
+by one via attaching IBTRS_MSG_CON_REQ messages to the rdma_connect requests.
+Those include uuid of the session and uuid of the path to be
+established. They are used by the server to find a persisting session/path or
+to create a new one when necessary. The message also contains the protocol
+version and magic for compatibility, total number of connections per session
+(as many as cpus on the client), the id of the current connection and
+the reconnect counter, which is used to resolve the situations where
+client is trying to reconnect a path, while server is still destroying the old
+one.
+
+2. Server accepts the connection requests one by one and attaches
+IBTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
+protocol version, the messages include error code, queue depth supported by
+the server (number of memory chunks which are going to be allocated for that
+session) and the maximum size of one io.
+
+3. After all connections of a path are established client sends to server the
+IBTRS_MSG_INFO_REQ message, containing the name of the session. This message
+requests the address information from the server.
+
+4. Server replies to the session info request message with IBTRS_MSG_INFO_RSP,
+which contains the addresses and keys of the RDMA buffers allocated for that
+session.
+
+5. Session becomes connected after all paths to be established are connected
+(i.e. steps 1-4 finished for all paths requested for a session)
+
+6. Server and client periodically exchange heartbeat messages (empty rdma
+messages with an immediate field) which are used to detect a crash on the
+remote side or a network outage in the absence of IO.
+
+7. On any RDMA related error or in the case of a heartbeat timeout, the
+corresponding path is disconnected, all the inflight IO are failed over to a
+healthy path, if any, and the reconnect mechanism is triggered.
+
+CLT                                     SRV
+*for each connection belonging to a path and for each path:
+IBTRS_MSG_CON_REQ  ------------------->
+                   <------------------- IBTRS_MSG_CON_RSP
+...
+*after all connections are established:
+IBTRS_MSG_INFO_REQ ------------------->
+                   <------------------- IBTRS_MSG_INFO_RSP
+*heartbeat is started from both sides:
+                   -------------------> [IBTRS_HB_MSG_IMM]
+[IBTRS_HB_MSG_ACK] <-------------------
+[IBTRS_HB_MSG_IMM] <-------------------
+                   -------------------> [IBTRS_HB_MSG_ACK]
+
+IO path
+-------
+
+* Write *
+
+1. When processing a write request client selects one of the memory chunks
+on the server side and rdma writes there the user data, user header and the
+IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
+contains size of the user header. The client tells the server which chunk has
+been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by
+using the IMM field.
+
+2. When confirming a write request server sends an "empty" rdma message with
+an immediate field. The 32 bit field is used to specify the outstanding
+inflight IO and for the error code.
+
+CLT                                                          SRV
+usr_data + usr_hdr + ibtrs_msg_rdma_write -----------------> [IBTRS_IO_REQ_IMM]
+[IBTRS_IO_RSP_IMM]                        <----------------- (id + errno)
+
+* Read *
+
+1. When processing a read request client selects one of the memory chunks
+on the server side and rdma writes there the user header and the
+IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of
+the user header, flags (specifying if memory invalidation is necessary) and the
+list of addresses along with keys for the data to be read into.
+
+2. When confirming a read request server transfers the requested data first,
+attaches an invalidation message if requested and finally an "empty" rdma
+message with an immediate field. The 32 bit field is used to specify the
+outstanding inflight IO and the error code.
+
+CLT                                           SRV
+usr_hdr + ibtrs_msg_rdma_read --------------> [IBTRS_IO_REQ_IMM]
+[IBTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
+or in case client requested invalidation:
+[IBTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
+
+
+Contact
+-------
+
+Mailing list: "IBNBD/IBTRS Storage Team" <ibnbd@profitbricks.com>
-- 
2.13.1
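
As a small illustration of the server-side wc_completion entry described in
the README above, the average can be derived from the three values as
sketched below. This is a minimal userspace sketch, not part of the patch:
struct wc_stats, wc_stats_parse() and wc_stats_avg() are hypothetical names,
and the assumption that the file prints the three values on one line
separated by whitespace is not confirmed by the README.

```c
#include <assert.h>
#include <stdio.h>

/* The three values the README lists for the server wc_completion file. */
struct wc_stats {
	int max_wc;	/* 1st: max work requests processed in one call */
	long total_wc;	/* 2nd: total work requests processed */
	long calls;	/* 3rd: total calls to the cq completion handler */
};

/*
 * Parse one line of the form "<max> <total> <calls>".
 * Assumption: whitespace-separated decimal values.
 * Returns 0 on success, -1 if fewer than three values were found.
 */
static int wc_stats_parse(const char *line, struct wc_stats *st)
{
	assert(line && st);

	if (sscanf(line, "%d %ld %ld",
		   &st->max_wc, &st->total_wc, &st->calls) != 3)
		return -1;

	return 0;
}

/* Average completions per handler call: 2nd value divided by the 3rd. */
static double wc_stats_avg(const struct wc_stats *st)
{
	if (st->calls <= 0)
		return 0.0;	/* no calls yet, avoid dividing by zero */

	return (double)st->total_wc / (double)st->calls;
}
```

A caller would read the content of the stats/wc_completion file of a path
into a buffer, pass it to wc_stats_parse() and then call wc_stats_avg() on
the result.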

* [PATCH v2 16/26] ibnbd: private headers with IBNBD protocol structs and helpers
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (14 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 15/26] ibtrs: a bit of documentation Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 17/26] ibnbd: client: private header with client structs and functions Roman Pen
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

These are common private headers with IBNBD protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-log.h   |  71 ++++++++
 drivers/block/ibnbd/ibnbd-proto.h | 364 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 435 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-log.h
 create mode 100644 drivers/block/ibnbd/ibnbd-proto.h

diff --git a/drivers/block/ibnbd/ibnbd-log.h b/drivers/block/ibnbd/ibnbd-log.h
new file mode 100644
index 000000000000..489343a61171
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-log.h
@@ -0,0 +1,71 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_LOG_H
+#define IBNBD_LOG_H
+
+#include "ibnbd-clt.h"
+#include "ibnbd-srv.h"
+
+#define ibnbd_diskname(dev) ({						\
+	struct gendisk *gd = ((struct ibnbd_clt_dev *)dev)->gd;		\
+	gd ? gd->disk_name : "<no dev>";				\
+})
+
+void unknown_type(void);
+
+#define ibnbd_log(fn, dev, fmt, ...) ({					\
+	__builtin_choose_expr(						\
+		__builtin_types_compatible_p(				\
+			typeof(dev), struct ibnbd_clt_dev *),		\
+		fn("<%s@%s> %s: " fmt, (dev)->pathname,		\
+		   (dev)->sess->sessname, ibnbd_diskname(dev),		\
+		   ##__VA_ARGS__),					\
+		__builtin_choose_expr(					\
+			__builtin_types_compatible_p(typeof(dev),	\
+					struct ibnbd_srv_sess_dev *),	\
+			fn("<%s@%s>: " fmt, (dev)->pathname,	\
+			   (dev)->sess->sessname, ##__VA_ARGS__),		\
+			unknown_type()));				\
+})
+
+#define ibnbd_err(dev, fmt, ...)	\
+	ibnbd_log(pr_err, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_err_rl(dev, fmt, ...)	\
+	ibnbd_log(pr_err_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_wrn(dev, fmt, ...)	\
+	ibnbd_log(pr_warn, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_wrn_rl(dev, fmt, ...) \
+	ibnbd_log(pr_warn_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_info(dev, fmt, ...) \
+	ibnbd_log(pr_info, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_info_rl(dev, fmt, ...) \
+	ibnbd_log(pr_info_ratelimited, dev, fmt, ##__VA_ARGS__)
+
+#endif /* IBNBD_LOG_H */
diff --git a/drivers/block/ibnbd/ibnbd-proto.h b/drivers/block/ibnbd/ibnbd-proto.h
new file mode 100644
index 000000000000..050d3fa4c1bf
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-proto.h
@@ -0,0 +1,364 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_PROTO_H
+#define IBNBD_PROTO_H
+
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/limits.h>
+#include <linux/inet.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <rdma/ib.h>
+
+#define IBNBD_PROTO_VER_MAJOR 1
+#define IBNBD_PROTO_VER_MINOR 0
+
+#define IBNBD_PROTO_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
+			       __stringify(IBNBD_PROTO_VER_MINOR)
+
+#ifndef IBNBD_VER_STRING
+#define IBNBD_VER_STRING __stringify(IBNBD_PROTO_VER_MAJOR) "." \
+			 __stringify(IBNBD_PROTO_VER_MINOR)
+#endif
+
+/* TODO: should be configurable */
+#define IBTRS_PORT 1234
+
+/**
+ * enum ibnbd_msg_type - IBNBD message types
+ * @IBNBD_MSG_SESS_INFO:	initial session info from client to server
+ * @IBNBD_MSG_SESS_INFO_RSP:	initial session info from server to client
+ * @IBNBD_MSG_OPEN:		open (map) device request
+ * @IBNBD_MSG_OPEN_RSP:		response to an @IBNBD_MSG_OPEN
+ * @IBNBD_MSG_IO:		block IO request operation
+ * @IBNBD_MSG_CLOSE:		close (unmap) device request
+ */
+enum ibnbd_msg_type {
+	IBNBD_MSG_SESS_INFO,
+	IBNBD_MSG_SESS_INFO_RSP,
+	IBNBD_MSG_OPEN,
+	IBNBD_MSG_OPEN_RSP,
+	IBNBD_MSG_IO,
+	IBNBD_MSG_CLOSE,
+};
+
+/**
+ * struct ibnbd_msg_hdr - header of IBNBD messages
+ * @type:	Message type, for valid values see enum ibnbd_msg_type
+ */
+struct ibnbd_msg_hdr {
+	__le16		type;
+	__le16		__padding;
+};
+
+enum ibnbd_access_mode {
+	IBNBD_ACCESS_RO,
+	IBNBD_ACCESS_RW,
+	IBNBD_ACCESS_MIGRATION,
+};
+
+#define _IBNBD_FILEIO  0
+#define _IBNBD_BLOCKIO 1
+#define _IBNBD_AUTOIO  2
+
+enum ibnbd_io_mode {
+	IBNBD_FILEIO = _IBNBD_FILEIO,
+	IBNBD_BLOCKIO = _IBNBD_BLOCKIO,
+	IBNBD_AUTOIO = _IBNBD_AUTOIO,
+};
+
+/**
+ * struct ibnbd_msg_sess_info - initial session info from client to server
+ * @hdr:		message header
+ * @ver:		IBNBD protocol version
+ */
+struct ibnbd_msg_sess_info {
+	struct ibnbd_msg_hdr hdr;
+	u8		ver;
+	u8		reserved[31];
+};
+
+/**
+ * struct ibnbd_msg_sess_info_rsp - initial session info from server to client
+ * @hdr:		message header
+ * @ver:		IBNBD protocol version
+ */
+struct ibnbd_msg_sess_info_rsp {
+	struct ibnbd_msg_hdr hdr;
+	u8		ver;
+	u8		reserved[31];
+};
+
+/**
+ * struct ibnbd_msg_open - request to open a remote device.
+ * @hdr:		message header
+ * @access_mode:	the mode to open remote device, valid values see:
+ *			enum ibnbd_access_mode
+ * @io_mode:		Open volume on server as block device or as file
+ * @dev_name:		device path on remote side
+ */
+struct ibnbd_msg_open {
+	struct ibnbd_msg_hdr hdr;
+	u8		access_mode;
+	u8		io_mode;
+	s8		dev_name[NAME_MAX];
+	u8		__padding[3];
+};
+
+/**
+ * struct ibnbd_msg_close - request to close a remote device.
+ * @hdr:	message header
+ * @device_id:	device_id on server side to identify the device
+ */
+struct ibnbd_msg_close {
+	struct ibnbd_msg_hdr hdr;
+	__le32		device_id;
+};
+
+/**
+ * struct ibnbd_msg_open_rsp - response message to IBNBD_MSG_OPEN
+ * @hdr:		message header
+ * @nsectors:		number of sectors
+ * @device_id:		device_id on server side to identify the device
+ * @queue_flags:	queue_flags of the device on server side
+ * @max_hw_sectors:	max hardware sectors in the usual 512-byte units
+ * @max_write_same_sectors: max sectors for WRITE SAME in 512-byte units
+ * @max_discard_sectors: max sectors that can be discarded at once
+ * @discard_granularity: size of the internal discard allocation unit
+ * @discard_alignment: offset from internal allocation assignment
+ * @physical_block_size: physical block size the device supports
+ * @logical_block_size: logical block size the device supports
+ * @max_segments:	max segments the hardware supports in one transfer
+ * @secure_discard:	supports secure discard
+ * @rotational:		is a rotational disk?
+ * @io_mode:		io_mode the device is opened with
+ */
+struct ibnbd_msg_open_rsp {
+	struct ibnbd_msg_hdr	hdr;
+	__le32			device_id;
+	__le64			nsectors;
+	__le32			max_hw_sectors;
+	__le32			max_write_same_sectors;
+	__le32			max_discard_sectors;
+	__le32			discard_granularity;
+	__le32			discard_alignment;
+	__le16			physical_block_size;
+	__le16			logical_block_size;
+	__le16			max_segments;
+	__le16			secure_discard;
+	u8			rotational;
+	u8			io_mode;
+	u8			__padding[10];
+};
+
+/**
+ * struct ibnbd_msg_io - message for I/O read/write
+ * @hdr:	message header
+ * @device_id:	device_id on server side to find the right device
+ * @sector:	bi_sector attribute from struct bio
+ * @rw:		bitmask, valid values are defined in enum ibnbd_io_flags
+ * @bi_size:   number of bytes for I/O read/write
+ */
+struct ibnbd_msg_io {
+	struct ibnbd_msg_hdr hdr;
+	__le32		device_id;
+	__le64		sector;
+	__le32		rw;
+	__le32		bi_size;
+};
+
+#define IBNBD_OP_BITS  8
+#define IBNBD_OP_MASK  ((1 << IBNBD_OP_BITS) - 1)
+
+/**
+ * enum ibnbd_io_flags - IBNBD request types from rq_flag_bits
+ * @IBNBD_OP_READ:	     read sectors from the device
+ * @IBNBD_OP_WRITE:	     write sectors to the device
+ * @IBNBD_OP_FLUSH:	     flush the volatile write cache
+ * @IBNBD_OP_DISCARD:        discard sectors
+ * @IBNBD_OP_SECURE_ERASE:   securely erase sectors
+ * @IBNBD_OP_WRITE_SAME:     write the same sectors many times
+ *
+ * @IBNBD_F_SYNC:	     request is sync (sync write or read)
+ * @IBNBD_F_FUA:             forced unit access
+ */
+enum ibnbd_io_flags {
+
+	/* Operations */
+
+	IBNBD_OP_READ		= 0,
+	IBNBD_OP_WRITE		= 1,
+	IBNBD_OP_FLUSH		= 2,
+	IBNBD_OP_DISCARD	= 3,
+	IBNBD_OP_SECURE_ERASE	= 4,
+	IBNBD_OP_WRITE_SAME	= 5,
+
+	IBNBD_OP_LAST,
+
+	/* Flags */
+
+	IBNBD_F_SYNC  = 1<<(IBNBD_OP_BITS + 0),
+	IBNBD_F_FUA   = 1<<(IBNBD_OP_BITS + 1),
+
+	IBNBD_F_ALL   = (IBNBD_F_SYNC | IBNBD_F_FUA)
+
+};
+
+static inline u32 ibnbd_op(u32 flags)
+{
+	return (flags & IBNBD_OP_MASK);
+}
+
+static inline u32 ibnbd_flags(u32 flags)
+{
+	return (flags & ~IBNBD_OP_MASK);
+}
+
+static inline bool ibnbd_flags_supported(u32 flags)
+{
+	u32 op;
+
+	op = ibnbd_op(flags);
+	flags = ibnbd_flags(flags);
+
+	if (op >= IBNBD_OP_LAST)
+		return false;
+	if (flags & ~IBNBD_F_ALL)
+		return false;
+
+	return true;
+}
+
+static inline u32 ibnbd_to_bio_flags(u32 ibnbd_flags)
+{
+	u32 bio_flags;
+
+	switch (ibnbd_op(ibnbd_flags)) {
+	case IBNBD_OP_READ:
+		bio_flags = REQ_OP_READ;
+		break;
+	case IBNBD_OP_WRITE:
+		bio_flags = REQ_OP_WRITE;
+		break;
+	case IBNBD_OP_FLUSH:
+		bio_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
+		break;
+	case IBNBD_OP_DISCARD:
+		bio_flags = REQ_OP_DISCARD;
+		break;
+	case IBNBD_OP_SECURE_ERASE:
+		bio_flags = REQ_OP_SECURE_ERASE;
+		break;
+	case IBNBD_OP_WRITE_SAME:
+		bio_flags = REQ_OP_WRITE_SAME;
+		break;
+	default:
+		WARN(1, "Unknown IBNBD type: %u (flags %u)\n",
+		     ibnbd_op(ibnbd_flags), ibnbd_flags);
+		bio_flags = 0;
+	}
+
+	if (ibnbd_flags & IBNBD_F_SYNC)
+		bio_flags |= REQ_SYNC;
+
+	if (ibnbd_flags & IBNBD_F_FUA)
+		bio_flags |= REQ_FUA;
+
+	return bio_flags;
+}
+
+static inline u32 rq_to_ibnbd_flags(struct request *rq)
+{
+	u32 ibnbd_flags;
+
+	switch (req_op(rq)) {
+	case REQ_OP_READ:
+		ibnbd_flags = IBNBD_OP_READ;
+		break;
+	case REQ_OP_WRITE:
+		ibnbd_flags = IBNBD_OP_WRITE;
+		break;
+	case REQ_OP_DISCARD:
+		ibnbd_flags = IBNBD_OP_DISCARD;
+		break;
+	case REQ_OP_SECURE_ERASE:
+		ibnbd_flags = IBNBD_OP_SECURE_ERASE;
+		break;
+	case REQ_OP_WRITE_SAME:
+		ibnbd_flags = IBNBD_OP_WRITE_SAME;
+		break;
+	case REQ_OP_FLUSH:
+		ibnbd_flags = IBNBD_OP_FLUSH;
+		break;
+	default:
+		WARN(1, "Unknown request type %d (flags %llu)\n",
+		     req_op(rq), (unsigned long long)rq->cmd_flags);
+		ibnbd_flags = 0;
+	}
+
+	if (op_is_sync(rq->cmd_flags))
+		ibnbd_flags |= IBNBD_F_SYNC;
+
+	if (op_is_flush(rq->cmd_flags))
+		ibnbd_flags |= IBNBD_F_FUA;
+
+	return ibnbd_flags;
+}
+
+static inline const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
+{
+	switch (mode) {
+	case IBNBD_FILEIO:
+		return "fileio";
+	case IBNBD_BLOCKIO:
+		return "blockio";
+	case IBNBD_AUTOIO:
+		return "autoio";
+	default:
+		return "unknown";
+	}
+}
+
+static inline const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
+{
+	switch (mode) {
+	case IBNBD_ACCESS_RO:
+		return "ro";
+	case IBNBD_ACCESS_RW:
+		return "rw";
+	case IBNBD_ACCESS_MIGRATION:
+		return "migration";
+	default:
+		return "unknown";
+	}
+}
+
+#endif /* IBNBD_PROTO_H */
-- 
2.13.1
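
A note on the rw encoding in the header above: the low IBNBD_OP_BITS bits
of the rw field in struct ibnbd_msg_io carry the operation, and the bits
above them carry modifier flags.  The following is only a userspace sketch
of that split, not part of the patch.  It copies the constants and the
three helpers from ibnbd-proto.h, swaps u32 for uint32_t so it builds
outside the kernel, and adds a static_assert which is an assumption of
this note, not something the patch contains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Copied from ibnbd-proto.h: the low 8 bits hold the operation. */
#define IBNBD_OP_BITS  8
#define IBNBD_OP_MASK  ((1 << IBNBD_OP_BITS) - 1)

enum ibnbd_io_flags {
	/* Operations */
	IBNBD_OP_READ		= 0,
	IBNBD_OP_WRITE		= 1,
	IBNBD_OP_FLUSH		= 2,
	IBNBD_OP_DISCARD	= 3,
	IBNBD_OP_SECURE_ERASE	= 4,
	IBNBD_OP_WRITE_SAME	= 5,

	IBNBD_OP_LAST,

	/* Flags live above the operation bits */
	IBNBD_F_SYNC  = 1 << (IBNBD_OP_BITS + 0),
	IBNBD_F_FUA   = 1 << (IBNBD_OP_BITS + 1),

	IBNBD_F_ALL   = (IBNBD_F_SYNC | IBNBD_F_FUA)
};

/* Not in the patch: operation codes must never spill into the flag bits. */
static_assert(IBNBD_OP_LAST <= IBNBD_OP_MASK, "operation codes overflow mask");

/* Extract the operation code from a packed rw value. */
static inline uint32_t ibnbd_op(uint32_t flags)
{
	return flags & IBNBD_OP_MASK;
}

/* Extract the modifier flags from a packed rw value. */
static inline uint32_t ibnbd_flags(uint32_t flags)
{
	return flags & ~IBNBD_OP_MASK;
}

/* Reject unknown operations and unknown flag bits. */
static inline bool ibnbd_flags_supported(uint32_t flags)
{
	uint32_t op = ibnbd_op(flags);

	flags = ibnbd_flags(flags);

	if (op >= IBNBD_OP_LAST)
		return false;
	if (flags & ~IBNBD_F_ALL)
		return false;

	return true;
}
```

With these definitions a FUA write is packed as IBNBD_OP_WRITE | IBNBD_F_FUA,
i.e. 0x201, and any value with a bit set above bit 9 is rejected by
ibnbd_flags_supported().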

* [PATCH v2 17/26] ibnbd: client: private header with client structs and functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (15 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 16/26] ibnbd: private headers with IBNBD protocol structs and helpers Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 18/26] ibnbd: client: main functionality Roman Pen
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This header describes the main structs and functions used by the
ibnbd-client module, mainly for managing IBNBD sessions and mapped block
devices, and for creating and destroying sysfs entries.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
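For reviewers, here is a small userspace model of how the device states
declared in this header appear to be driven by an Open response, as read
from process_msg_open_rsp() in the client code.  Everything prefixed
demo_ or DEMO_ is hypothetical and exists only for this sketch; locking,
attribute parsing and the resize check are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mirrors enum ibnbd_clt_dev_state from ibnbd-clt.h (names are made up). */
enum demo_dev_state {
	DEMO_STATE_INIT,
	DEMO_STATE_MAPPED,
	DEMO_STATE_MAPPED_DISCONNECTED,
	DEMO_STATE_UNMAPPED,
};

/*
 * Model of the state change on an Open response:
 *  - an unmapped device ignores the response (-ENOENT),
 *  - a zero logical block size is rejected (-EINVAL), state is untouched,
 *  - otherwise the device ends up mapped.
 */
static int demo_on_open_rsp(enum demo_dev_state *state,
			    unsigned int logical_block_size)
{
	assert(state != NULL);

	if (*state == DEMO_STATE_UNMAPPED)
		return -ENOENT;
	if (!logical_block_size)
		return -EINVAL;

	*state = DEMO_STATE_MAPPED;
	return 0;
}
```

Both DEMO_STATE_INIT and DEMO_STATE_MAPPED_DISCONNECTED lead to
DEMO_STATE_MAPPED on success, which matches the remap path where a
reconnected device comes back online.
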
 drivers/block/ibnbd/ibnbd-clt.h | 172 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.h

diff --git a/drivers/block/ibnbd/ibnbd-clt.h b/drivers/block/ibnbd/ibnbd-clt.h
new file mode 100644
index 000000000000..c5f6f08ec338
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt.h
@@ -0,0 +1,172 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_CLT_H
+#define IBNBD_CLT_H
+
+#include <linux/wait.h>
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/blk-mq.h>
+#include <linux/refcount.h>
+
+#include "ibtrs.h"
+#include "ibnbd-proto.h"
+#include "ibnbd-log.h"
+
+#define BMAX_SEGMENTS 31
+#define RECONNECT_DELAY 30
+#define MAX_RECONNECTS -1
+
+enum ibnbd_clt_dev_state {
+	DEV_STATE_INIT,
+	DEV_STATE_MAPPED,
+	DEV_STATE_MAPPED_DISCONNECTED,
+	DEV_STATE_UNMAPPED,
+};
+
+struct ibnbd_iu_comp {
+	wait_queue_head_t wait;
+	int errno;
+};
+
+struct ibnbd_iu {
+	union {
+		struct request *rq; /* for block io */
+		void *buf; /* for user messages */
+	};
+	struct ibtrs_tag	*tag;
+	union {
+		/* use to send msg associated with a dev */
+		struct ibnbd_clt_dev *dev;
+		/* use to send msg associated with a sess */
+		struct ibnbd_clt_session *sess;
+	};
+	blk_status_t		status;
+	struct scatterlist	sglist[BMAX_SEGMENTS];
+	struct work_struct	work;
+	int			errno;
+	struct ibnbd_iu_comp	*comp;
+};
+
+struct ibnbd_cpu_qlist {
+	struct list_head	requeue_list;
+	spinlock_t		requeue_lock;
+	unsigned int		cpu;
+};
+
+struct ibnbd_clt_session {
+	struct list_head        list;
+	struct ibtrs_clt        *ibtrs;
+	wait_queue_head_t       ibtrs_waitq;
+	bool                    ibtrs_ready;
+	struct ibnbd_cpu_qlist	__percpu
+				*cpu_queues;
+	DECLARE_BITMAP(cpu_queues_bm, NR_CPUS);
+	int	__percpu	*cpu_rr; /* per-cpu var for CPU round-robin */
+	atomic_t		busy;
+	int			queue_depth;
+	u32			max_io_size;
+	struct blk_mq_tag_set	tag_set;
+	struct mutex		lock; /* protects state and devs_list */
+	struct list_head        devs_list; /* list of struct ibnbd_clt_dev */
+	refcount_t		refcount;
+	char			sessname[NAME_MAX];
+	u8			ver; /* protocol version */
+};
+
+/**
+ * Submission queues.
+ */
+struct ibnbd_queue {
+	struct list_head	requeue_list;
+	unsigned long		in_list;
+	struct ibnbd_clt_dev	*dev;
+	struct blk_mq_hw_ctx	*hctx;
+};
+
+struct ibnbd_clt_dev {
+	struct ibnbd_clt_session	*sess;
+	struct request_queue	*queue;
+	struct ibnbd_queue	*hw_queues;
+	u32			device_id;
+	/* local Idr index - used to track minor number allocations. */
+	u32			clt_device_id;
+	struct mutex		lock;
+	enum ibnbd_clt_dev_state	dev_state;
+	enum ibnbd_io_mode	io_mode; /* user requested */
+	enum ibnbd_io_mode	remote_io_mode; /* server really used */
+	char			pathname[NAME_MAX];
+	enum ibnbd_access_mode	access_mode;
+	bool			read_only;
+	bool			rotational;
+	u32			max_hw_sectors;
+	u32			max_write_same_sectors;
+	u32			max_discard_sectors;
+	u32			discard_granularity;
+	u32			discard_alignment;
+	u16			secure_discard;
+	u16			physical_block_size;
+	u16			logical_block_size;
+	u16			max_segments;
+	size_t			nsectors;
+	u64			size;		/* device size in bytes */
+	struct list_head        list;
+	struct gendisk		*gd;
+	struct kobject		kobj;
+	char			blk_symlink_name[NAME_MAX];
+	refcount_t		refcount;
+	struct work_struct	unmap_on_rmmod_work;
+};
+
+/* ibnbd-clt.c */
+
+struct ibnbd_clt_dev *ibnbd_clt_map_device(const char *sessname,
+					   struct ibtrs_addr *paths,
+					   size_t path_cnt,
+					   const char *pathname,
+					   enum ibnbd_access_mode access_mode,
+					   enum ibnbd_io_mode io_mode);
+int ibnbd_clt_unmap_device(struct ibnbd_clt_dev *dev, bool force,
+			   const struct attribute *sysfs_self);
+
+int ibnbd_clt_remap_device(struct ibnbd_clt_dev *dev);
+int ibnbd_clt_resize_disk(struct ibnbd_clt_dev *dev, size_t newsize);
+
+/* ibnbd-clt-sysfs.c */
+
+int ibnbd_clt_create_sysfs_files(void);
+
+void ibnbd_clt_destroy_sysfs_files(void);
+void ibnbd_clt_destroy_default_group(void);
+
+void ibnbd_clt_remove_dev_symlink(struct ibnbd_clt_dev *dev);
+
+#endif /* IBNBD_CLT_H */
-- 
2.13.1

* [PATCH v2 18/26] ibnbd: client: main functionality
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (16 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 17/26] ibnbd: client: private header with client structs and functions Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 19/26] ibnbd: client: sysfs interface functions Roman Pen
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the main functionality of the ibnbd-client module, which provides
an interface to map a remote device as a local block device /dev/ibnbd<N>
and feeds IBTRS with IO requests.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
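To make the requeue fairness easier to review, here is a userspace sketch
of the wrap-around bitmap scan that ibnbd_get_cpu_qlist() performs: search
from the given CPU to the end, then from 0 up to that CPU.  The bitmap is
modelled as a single uint64_t, so it assumes at most 64 CPUs, and all
demo_ names are hypothetical stand-ins for find_next_bit(), nxt_cpu() and
the per-cpu list lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for find_next_bit(): first set bit in [start, size), else size. */
static int demo_find_next_bit(uint64_t bm, int size, int start)
{
	int bit;

	for (bit = start; bit < size; bit++)
		if (bm & (UINT64_C(1) << bit))
			return bit;

	return size;
}

/* Same arithmetic as nxt_cpu(), with the CPU count passed explicitly. */
static int demo_nxt_cpu(int cpu, int nr_cpus)
{
	return (cpu + 1) % nr_cpus;
}

/*
 * Returns the first marked CPU at or after @cpu, wrapping around to 0,
 * or -1 if no bit is set (the kernel code returns NULL in that case).
 */
static int demo_next_marked_cpu(uint64_t bm, int nr_cpus, int cpu)
{
	int bit;

	assert(nr_cpus > 0 && nr_cpus <= 64);
	assert(cpu >= 0 && cpu < nr_cpus);

	/* First half: [cpu, nr_cpus) */
	bit = demo_find_next_bit(bm, nr_cpus, cpu);
	if (bit < nr_cpus)
		return bit;

	/* Second half: [0, cpu) */
	if (cpu != 0) {
		bit = demo_find_next_bit(bm, cpu, 0);
		if (bit < cpu)
			return bit;
	}

	return -1;
}
```

Starting each scan at the CPU after the last serviced one, rather than at
0, is what keeps a busy low-numbered CPU from starving the others.
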
 drivers/block/ibnbd/ibnbd-clt.c | 1819 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 1819 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.c

diff --git a/drivers/block/ibnbd/ibnbd-clt.c b/drivers/block/ibnbd/ibnbd-clt.c
new file mode 100644
index 000000000000..06524e33e19f
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt.c
@@ -0,0 +1,1819 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/scatterlist.h>
+#include <linux/idr.h>
+
+#include "ibnbd-clt.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("InfiniBand Network Block Device Client");
+MODULE_VERSION(IBNBD_VER_STRING);
+MODULE_LICENSE("GPL");
+
+/*
+ * This is for closing devices when unloading the module:
+ * we might be closing a lot (>256) of devices in parallel
+ * and it is better not to use the system_wq.
+ */
+static struct workqueue_struct *unload_wq;
+static int ibnbd_client_major;
+static DEFINE_IDA(index_ida);
+static DEFINE_MUTEX(ida_lock);
+static DEFINE_MUTEX(sess_lock);
+static LIST_HEAD(sess_list);
+
+static bool softirq_enable;
+module_param(softirq_enable, bool, 0444);
+MODULE_PARM_DESC(softirq_enable, "finish request in softirq_fn."
+		 " (default: 0)");
+/*
+ * Maximum number of partitions an instance can have.
+ * 6 bits = 64 minors = 63 partitions (one minor is used for the device itself)
+ */
+#define IBNBD_PART_BITS		6
+#define KERNEL_SECTOR_SIZE      512
+
+static inline bool ibnbd_clt_get_sess(struct ibnbd_clt_session *sess)
+{
+	return refcount_inc_not_zero(&sess->refcount);
+}
+
+static void free_sess(struct ibnbd_clt_session *sess);
+
+static void ibnbd_clt_put_sess(struct ibnbd_clt_session *sess)
+{
+	might_sleep();
+
+	if (refcount_dec_and_test(&sess->refcount))
+		free_sess(sess);
+}
+
+static inline bool ibnbd_clt_dev_is_mapped(struct ibnbd_clt_dev *dev)
+{
+	return dev->dev_state == DEV_STATE_MAPPED;
+}
+
+static void ibnbd_clt_put_dev(struct ibnbd_clt_dev *dev)
+{
+	might_sleep();
+
+	if (refcount_dec_and_test(&dev->refcount)) {
+		mutex_lock(&ida_lock);
+		ida_simple_remove(&index_ida, dev->clt_device_id);
+		mutex_unlock(&ida_lock);
+		kfree(dev->hw_queues);
+		ibnbd_clt_put_sess(dev->sess);
+		kfree(dev);
+	}
+}
+
+static inline bool ibnbd_clt_get_dev(struct ibnbd_clt_dev *dev)
+{
+	return refcount_inc_not_zero(&dev->refcount);
+}
+
+static int ibnbd_clt_set_dev_attr(struct ibnbd_clt_dev *dev,
+				  const struct ibnbd_msg_open_rsp *rsp)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+
+	if (unlikely(!rsp->logical_block_size))
+		return -EINVAL;
+
+	dev->device_id		    = le32_to_cpu(rsp->device_id);
+	dev->nsectors		    = le64_to_cpu(rsp->nsectors);
+	dev->logical_block_size	    = le16_to_cpu(rsp->logical_block_size);
+	dev->physical_block_size    = le16_to_cpu(rsp->physical_block_size);
+	dev->max_write_same_sectors = le32_to_cpu(rsp->max_write_same_sectors);
+	dev->max_discard_sectors    = le32_to_cpu(rsp->max_discard_sectors);
+	dev->discard_granularity    = le32_to_cpu(rsp->discard_granularity);
+	dev->discard_alignment	    = le32_to_cpu(rsp->discard_alignment);
+	dev->secure_discard	    = le16_to_cpu(rsp->secure_discard);
+	dev->rotational		    = rsp->rotational;
+	dev->remote_io_mode	    = rsp->io_mode;
+
+	dev->max_hw_sectors = sess->max_io_size / dev->logical_block_size;
+	dev->max_segments = BMAX_SEGMENTS;
+
+	if (dev->remote_io_mode == IBNBD_BLOCKIO) {
+		dev->max_hw_sectors = min_t(u32, dev->max_hw_sectors,
+					    le32_to_cpu(rsp->max_hw_sectors));
+		dev->max_segments = min_t(u16, dev->max_segments,
+					  le16_to_cpu(rsp->max_segments));
+	}
+
+	return 0;
+}
+
+static int ibnbd_clt_revalidate_disk(struct ibnbd_clt_dev *dev,
+				     size_t new_nsectors)
+{
+	int err = 0;
+
+	ibnbd_info(dev, "Device size changed from %zu to %zu sectors\n",
+		   dev->nsectors, new_nsectors);
+	dev->nsectors = new_nsectors;
+	set_capacity(dev->gd,
+		     dev->nsectors * (dev->logical_block_size /
+				      KERNEL_SECTOR_SIZE));
+	err = revalidate_disk(dev->gd);
+	if (err)
+		ibnbd_err(dev, "Failed to change device size"
+			  " to %zu sectors, err: %d\n",
+			  new_nsectors, err);
+	return err;
+}
+
+static int process_msg_open_rsp(struct ibnbd_clt_dev *dev,
+				struct ibnbd_msg_open_rsp *rsp)
+{
+	int err = 0;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state == DEV_STATE_UNMAPPED) {
+		ibnbd_info(dev, "Ignoring Open-Response message from server for "
+			   "unmapped device\n");
+		err = -ENOENT;
+		goto out;
+	}
+	if (dev->dev_state == DEV_STATE_MAPPED_DISCONNECTED) {
+		u64 nsectors = le64_to_cpu(rsp->nsectors);
+
+		/*
+		 * If the device was remapped and the size changed in the
+		 * meantime we need to revalidate it
+		 */
+		if (dev->nsectors != nsectors)
+			ibnbd_clt_revalidate_disk(dev, nsectors);
+		ibnbd_info(dev, "Device online, device remapped successfully\n");
+	}
+	err = ibnbd_clt_set_dev_attr(dev, rsp);
+	if (unlikely(err))
+		goto out;
+	dev->dev_state = DEV_STATE_MAPPED;
+
+out:
+	mutex_unlock(&dev->lock);
+
+	return err;
+}
+
+int ibnbd_clt_resize_disk(struct ibnbd_clt_dev *dev, size_t newsize)
+{
+	int ret = 0;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state != DEV_STATE_MAPPED) {
+		pr_err("Failed to set new size of the device, "
+		       "device is not opened\n");
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = ibnbd_clt_revalidate_disk(dev, newsize);
+
+out:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static inline void ibnbd_clt_dev_requeue(struct ibnbd_queue *q)
+{
+	if (WARN_ON(!q->hctx))
+		return;
+
+	/* We can come here from interrupt, thus async=true */
+	blk_mq_run_hw_queue(q->hctx, true);
+}
+
+enum {
+	IBNBD_DELAY_10ms   = 10,
+	IBNBD_DELAY_IFBUSY = -1,
+};
+
+/**
+ * ibnbd_get_cpu_qlist() - finds a list with HW queues to be requeued
+ *
+ * Description:
+ *     Each CPU has a list of HW queues, which need to be requeued.  If a list
+ *     is not empty - it is marked with a bit.  This function finds the first
+ *     set bit in a bitmap and returns the corresponding CPU list.
+ */
+static struct ibnbd_cpu_qlist *
+ibnbd_get_cpu_qlist(struct ibnbd_clt_session *sess, int cpu)
+{
+	int bit;
+
+	/* First half */
+	bit = find_next_bit(sess->cpu_queues_bm, nr_cpu_ids, cpu);
+	if (bit < nr_cpu_ids) {
+		return per_cpu_ptr(sess->cpu_queues, bit);
+	} else if (cpu != 0) {
+		/* Second half */
+		bit = find_next_bit(sess->cpu_queues_bm, cpu, 0);
+		if (bit < cpu)
+			return per_cpu_ptr(sess->cpu_queues, bit);
+	}
+
+	return NULL;
+}
+
+static inline int nxt_cpu(int cpu)
+{
+	return (cpu + 1) % nr_cpu_ids;
+}
+
+/**
+ * ibnbd_requeue_if_needed() - requeue if CPU queue is marked as non empty
+ *
+ * Description:
+ *     Each CPU has its own list of HW queues, which should be requeued.
+ *     The function finds such a list with HW queues, takes the list lock,
+ *     picks up the first HW queue out of the list and requeues it.
+ *
+ * Return:
+ *     True if the queue was requeued, false otherwise.
+ *
+ * Context:
+ *     Does not matter.
+ */
+static inline bool ibnbd_requeue_if_needed(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_queue *q = NULL;
+	struct ibnbd_cpu_qlist *cpu_q;
+	unsigned long flags;
+	int *cpup;
+
+	/*
+	 * To keep fairness and not to let other queues starve we always
+	 * try to wake up someone else in round-robin manner.  That of course
+	 * increases latency but queues always have a chance to be executed.
+	 */
+	cpup = get_cpu_ptr(sess->cpu_rr);
+	for (cpu_q = ibnbd_get_cpu_qlist(sess, nxt_cpu(*cpup)); cpu_q;
+	     cpu_q = ibnbd_get_cpu_qlist(sess, nxt_cpu(cpu_q->cpu))) {
+		if (!spin_trylock_irqsave(&cpu_q->requeue_lock, flags))
+			continue;
+		if (likely(test_bit(cpu_q->cpu, sess->cpu_queues_bm))) {
+			q = list_first_entry_or_null(&cpu_q->requeue_list,
+						     typeof(*q), requeue_list);
+			if (WARN_ON(!q))
+				goto clear_bit;
+			list_del_init(&q->requeue_list);
+			clear_bit_unlock(0, &q->in_list);
+
+			if (list_empty(&cpu_q->requeue_list)) {
+				/* Clear bit if nothing is left */
+clear_bit:
+				clear_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			}
+		}
+		spin_unlock_irqrestore(&cpu_q->requeue_lock, flags);
+
+		if (q)
+			break;
+	}
+
+	/*
+	 * Saves the CPU that is going to be requeued on the per-cpu var. Just
+	 * incrementing it doesn't work because ibnbd_get_cpu_qlist() will
+	 * always return the first CPU with something on the queue list when the
+	 * value stored on the var is greater than the last CPU with something
+	 * on the list.
+	 */
+	if (cpu_q)
+		*cpup = cpu_q->cpu;
+	put_cpu_var(sess->cpu_rr);
+
+	if (q)
+		ibnbd_clt_dev_requeue(q);
+
+	return !!q;
+}
+
+/**
+ * ibnbd_requeue_all_if_idle() - requeue all queues left in the list if
+ *     session is idling (there are no requests in-flight).
+ *
+ * Description:
+ *     This function tries to rerun all stopped queues if there are no
+ *     requests in-flight anymore.  It solves an obvious problem, when the
+ *     number of tags is less than the number of queues (hctx), which are
+ *     stopped and put to sleep.  If the last tag, which has just been put,
+ *     does not wake up all remaining queues (hctxs), IO requests hang forever.
+ *
+ *     That can happen when all tags, say N, have been exhausted from one
+ *     CPU, and we have many block devices per session, say M.  Each block
+ *     device has its own queue (hctx) for each CPU, so eventually we can
+ *     put that number of queues (hctxs) to sleep: M x nr_cpu_ids.  If the
+ *     number of tags N < M x nr_cpu_ids, we will eventually get an IO hang.
+ *
+ *     To avoid this hang, the last caller of ibnbd_put_tag() (the one who
+ *     observes sess->busy == 0) must wake up all remaining queues.
+ *
+ * Context:
+ *     Does not matter.
+ */
+static inline void ibnbd_requeue_all_if_idle(struct ibnbd_clt_session *sess)
+{
+	bool requeued;
+
+	do {
+		requeued = ibnbd_requeue_if_needed(sess);
+	} while (atomic_read(&sess->busy) == 0 && requeued);
+}
+
+static struct ibtrs_tag *ibnbd_get_tag(struct ibnbd_clt_session *sess,
+				       enum ibtrs_clt_con_type con_type,
+				       int wait)
+{
+	struct ibtrs_tag *tag;
+
+	tag = ibtrs_clt_get_tag(sess->ibtrs, con_type,
+				wait ? IBTRS_TAG_WAIT : IBTRS_TAG_NOWAIT);
+	if (likely(tag))
+		/* We have a subtle rare case here, when all tags can be
+		 * consumed before busy counter increased.  This is safe,
+		 * because loser will get NULL as a tag, observe 0 busy
+		 * counter and immediately restart the queue himself.
+		 */
+		atomic_inc(&sess->busy);
+
+	return tag;
+}
+
+static void ibnbd_put_tag(struct ibnbd_clt_session *sess, struct ibtrs_tag *tag)
+{
+	ibtrs_clt_put_tag(sess->ibtrs, tag);
+	atomic_dec(&sess->busy);
+	/* Paired with ibnbd_clt_dev_add_to_requeue().  Decrement first
+	 * and then check queue bits.
+	 */
+	smp_mb__after_atomic();
+	ibnbd_requeue_all_if_idle(sess);
+}
+
+static struct ibnbd_iu *ibnbd_get_iu(struct ibnbd_clt_session *sess,
+				     enum ibtrs_clt_con_type con_type,
+				     int wait)
+{
+	struct ibnbd_iu *iu;
+	struct ibtrs_tag *tag;
+
+	tag = ibnbd_get_tag(sess, con_type,
+			    wait ? IBTRS_TAG_WAIT : IBTRS_TAG_NOWAIT);
+	if (unlikely(!tag))
+		return NULL;
+	iu = ibtrs_tag_to_pdu(tag);
+	iu->tag = tag; /* yes, ibtrs_tag_from_pdu() can be nice here,
+			* but also we have to think about MQ mode
+			*/
+
+	return iu;
+}
+
+static void ibnbd_put_iu(struct ibnbd_clt_session *sess, struct ibnbd_iu *iu)
+{
+	ibnbd_put_tag(sess, iu->tag);
+}
+
+static void ibnbd_softirq_done_fn(struct request *rq)
+{
+	struct ibnbd_clt_dev *dev	= rq->rq_disk->private_data;
+	struct ibnbd_clt_session *sess	= dev->sess;
+	struct ibnbd_iu *iu;
+
+	iu = blk_mq_rq_to_pdu(rq);
+	ibnbd_put_tag(sess, iu->tag);
+	blk_mq_end_request(rq, iu->status);
+}
+
+static void msg_io_conf(void *priv, int errno)
+{
+	struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
+	struct ibnbd_clt_dev *dev = iu->dev;
+	struct request *rq = iu->rq;
+
+	iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
+
+	if (softirq_enable) {
+		blk_mq_complete_request(rq);
+	} else {
+		ibnbd_put_tag(dev->sess, iu->tag);
+		blk_mq_end_request(rq, iu->status);
+	}
+
+	if (errno)
+		ibnbd_info_rl(dev, "%s I/O failed with err: %d\n",
+			      rq_data_dir(rq) == READ ? "read" : "write",
+			      errno);
+}
+
+static void init_iu_comp(struct ibnbd_iu *iu, struct ibnbd_iu_comp *comp)
+{
+	init_waitqueue_head(&comp->wait);
+	comp->errno = INT_MAX;
+	iu->comp = comp;
+}
+
+static void deinit_iu_comp(struct ibnbd_iu *iu)
+{
+	iu->comp = NULL;
+}
+
+static void wake_up_iu_comp(struct ibnbd_iu *iu, int errno)
+{
+	struct ibnbd_iu_comp *comp = iu->comp;
+
+	if (comp) {
+		comp->errno = errno;
+		wake_up(&comp->wait);
+		deinit_iu_comp(iu);
+	}
+}
+
+static void wait_iu_comp(struct ibnbd_iu_comp *comp)
+{
+	wait_event(comp->wait, comp->errno != INT_MAX);
+}
+
+static void msg_conf(void *priv, int errno)
+{
+	struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
+
+	iu->errno = errno;
+	schedule_work(&iu->work);
+}
+
+enum {
+	NO_WAIT = 0,
+	WAIT    = 1
+};
+
+static int send_usr_msg(struct ibtrs_clt *ibtrs, int dir,
+			struct ibnbd_iu *iu, struct kvec *vec, size_t nr,
+			size_t len, struct scatterlist *sg, unsigned int sg_len,
+			void (*conf)(struct work_struct *work),
+			int *errno, bool wait)
+{
+	struct ibnbd_iu_comp comp;
+	int err;
+
+	if (wait)
+		init_iu_comp(iu, &comp);
+	INIT_WORK(&iu->work, conf);
+	err = ibtrs_clt_request(dir, msg_conf, ibtrs, iu->tag,
+				iu, vec, nr, len, sg, sg_len);
+	if (unlikely(err)) {
+		deinit_iu_comp(iu);
+	} else if (wait) {
+		wait_iu_comp(&comp);
+		*errno = comp.errno;
+	} else {
+		*errno = 0;
+	}
+
+	return err;
+}
+
+static void msg_close_conf(struct work_struct *work)
+{
+	struct ibnbd_iu *iu = container_of(work, struct ibnbd_iu, work);
+	struct ibnbd_clt_dev *dev = iu->dev;
+
+	wake_up_iu_comp(iu, iu->errno);
+	ibnbd_put_iu(dev->sess, iu);
+	ibnbd_clt_put_dev(dev);
+}
+
+static int send_msg_close(struct ibnbd_clt_dev *dev, u32 device_id, bool wait)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	struct ibnbd_msg_close msg;
+	struct ibnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	iu = ibnbd_get_iu(sess, IBTRS_USR_CON, IBTRS_TAG_WAIT);
+	if (unlikely(!iu))
+		return -ENOMEM;
+
+	iu->buf = NULL;
+	iu->dev = dev;
+
+	sg_mark_end(&iu->sglist[0]);
+
+	msg.hdr.type	= cpu_to_le16(IBNBD_MSG_CLOSE);
+	msg.device_id	= cpu_to_le32(device_id);
+
+	WARN_ON(!ibnbd_clt_get_dev(dev));
+	err = send_usr_msg(sess->ibtrs, WRITE, iu, &vec, 1, 0, NULL, 0,
+			   msg_close_conf, &errno, wait);
+	if (unlikely(err)) {
+		ibnbd_clt_put_dev(dev);
+		ibnbd_put_iu(sess, iu);
+	} else {
+		err = errno;
+	}
+
+	return err;
+}
+
+static void msg_open_conf(struct work_struct *work)
+{
+	struct ibnbd_iu *iu = container_of(work, struct ibnbd_iu, work);
+	struct ibnbd_msg_open_rsp *rsp = iu->buf;
+	struct ibnbd_clt_dev *dev = iu->dev;
+	int errno = iu->errno;
+
+	if (errno) {
+		ibnbd_err(dev, "Opening failed, server responded: %d\n", errno);
+	} else {
+		errno = process_msg_open_rsp(dev, rsp);
+		if (unlikely(errno)) {
+			u32 device_id = le32_to_cpu(rsp->device_id);
+			/*
+			 * If server thinks it's fine, but we fail to process
+			 * then be nice and send a close to server.
+			 */
+			(void)send_msg_close(dev, device_id, NO_WAIT);
+		}
+	}
+	kfree(rsp);
+	wake_up_iu_comp(iu, errno);
+	ibnbd_put_iu(dev->sess, iu);
+	ibnbd_clt_put_dev(dev);
+}
+
+static void msg_sess_info_conf(struct work_struct *work)
+{
+	struct ibnbd_iu *iu = container_of(work, struct ibnbd_iu, work);
+	struct ibnbd_msg_sess_info_rsp *rsp = iu->buf;
+	struct ibnbd_clt_session *sess = iu->sess;
+
+	if (likely(!iu->errno))
+		sess->ver = min_t(u8, rsp->ver, IBNBD_PROTO_VER_MAJOR);
+
+	kfree(rsp);
+	wake_up_iu_comp(iu, iu->errno);
+	ibnbd_put_iu(sess, iu);
+	ibnbd_clt_put_sess(sess);
+}
+
+static int send_msg_open(struct ibnbd_clt_dev *dev, bool wait)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	struct ibnbd_msg_open_rsp *rsp;
+	struct ibnbd_msg_open msg;
+	struct ibnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+	if (unlikely(!rsp))
+		return -ENOMEM;
+
+	iu = ibnbd_get_iu(sess, IBTRS_USR_CON, IBTRS_TAG_WAIT);
+	if (unlikely(!iu)) {
+		kfree(rsp);
+		return -ENOMEM;
+	}
+
+	iu->buf = rsp;
+	iu->dev = dev;
+
+	sg_init_one(iu->sglist, rsp, sizeof(*rsp));
+
+	msg.hdr.type	= cpu_to_le16(IBNBD_MSG_OPEN);
+	msg.access_mode	= dev->access_mode;
+	msg.io_mode	= dev->io_mode;
+	strlcpy(msg.dev_name, dev->pathname, sizeof(msg.dev_name));
+
+	WARN_ON(!ibnbd_clt_get_dev(dev));
+	err = send_usr_msg(sess->ibtrs, READ, iu,
+			   &vec, 1, sizeof(*rsp), iu->sglist, 1,
+			   msg_open_conf, &errno, wait);
+	if (unlikely(err)) {
+		ibnbd_clt_put_dev(dev);
+		ibnbd_put_iu(sess, iu);
+		kfree(rsp);
+	} else {
+		err = errno;
+	}
+
+	return err;
+}
+
+static int send_msg_sess_info(struct ibnbd_clt_session *sess, bool wait)
+{
+	struct ibnbd_msg_sess_info_rsp *rsp;
+	struct ibnbd_msg_sess_info msg;
+	struct ibnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+	if (unlikely(!rsp))
+		return -ENOMEM;
+
+	iu = ibnbd_get_iu(sess, IBTRS_USR_CON, IBTRS_TAG_WAIT);
+	if (unlikely(!iu)) {
+		kfree(rsp);
+		return -ENOMEM;
+	}
+
+	iu->buf = rsp;
+	iu->sess = sess;
+
+	sg_init_one(iu->sglist, rsp, sizeof(*rsp));
+
+	msg.hdr.type = cpu_to_le16(IBNBD_MSG_SESS_INFO);
+	msg.ver      = IBNBD_PROTO_VER_MAJOR;
+
+	if (unlikely(!ibnbd_clt_get_sess(sess))) {
+		/*
+		 * That can happen only in one case: IBTRS has reestablished
+		 * the connection and link_ev() is called, but the session is
+		 * almost dead, the last reference on the session is put and
+		 * the caller is waiting for IBTRS to close everything.
+		 */
+		err = -ENODEV;
+		goto put_iu;
+	}
+	err = send_usr_msg(sess->ibtrs, READ, iu,
+			   &vec, 1, sizeof(*rsp), iu->sglist, 1,
+			   msg_sess_info_conf, &errno, wait);
+	if (unlikely(err)) {
+		ibnbd_clt_put_sess(sess);
+put_iu:
+		ibnbd_put_iu(sess, iu);
+		kfree(rsp);
+	} else {
+		err = errno;
+	}
+
+	return err;
+}
+
+static void set_dev_states_to_disconnected(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_clt_dev *dev;
+
+	mutex_lock(&sess->lock);
+	list_for_each_entry(dev, &sess->devs_list, list) {
+		ibnbd_err(dev, "Device disconnected.\n");
+
+		mutex_lock(&dev->lock);
+		if (dev->dev_state == DEV_STATE_MAPPED)
+			dev->dev_state = DEV_STATE_MAPPED_DISCONNECTED;
+		mutex_unlock(&dev->lock);
+	}
+	mutex_unlock(&sess->lock);
+}
+
+static void remap_devs(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_clt_dev *dev;
+	struct ibtrs_attrs attrs;
+	int err;
+
+	/*
+	 * Careful here: we are called from IBTRS link event directly,
+	 * thus we can't send any IBTRS request and wait for response
+	 * or IBTRS will not be able to complete request with failure
+	 * if something goes wrong (failing of outstanding requests
+	 * happens exactly from the context where we are blocking now).
+	 *
+	 * So to avoid deadlocks each usr message sent from here must
+	 * be asynchronous.
+	 */
+
+	err = send_msg_sess_info(sess, NO_WAIT);
+	if (unlikely(err)) {
+		pr_err("send_msg_sess_info(\"%s\"): %d\n", sess->sessname, err);
+		return;
+	}
+
+	ibtrs_clt_query(sess->ibtrs, &attrs);
+	mutex_lock(&sess->lock);
+	sess->max_io_size = attrs.max_io_size;
+
+	list_for_each_entry(dev, &sess->devs_list, list) {
+		bool skip;
+
+		mutex_lock(&dev->lock);
+		skip = (dev->dev_state == DEV_STATE_INIT);
+		mutex_unlock(&dev->lock);
+		if (skip)
+			/*
+			 * When device is establishing connection for the first
+			 * time - do not remap, it will be closed soon.
+			 */
+			continue;
+
+		ibnbd_info(dev, "session reconnected, remapping device\n");
+		err = send_msg_open(dev, NO_WAIT);
+		if (unlikely(err)) {
+			ibnbd_err(dev, "send_msg_open(): %d\n", err);
+			break;
+		}
+	}
+	mutex_unlock(&sess->lock);
+}
+
+static void ibnbd_clt_link_ev(void *priv, enum ibtrs_clt_link_ev ev)
+{
+	struct ibnbd_clt_session *sess = priv;
+
+	switch (ev) {
+	case IBTRS_CLT_LINK_EV_DISCONNECTED:
+		set_dev_states_to_disconnected(sess);
+		break;
+	case IBTRS_CLT_LINK_EV_RECONNECTED:
+		remap_devs(sess);
+		break;
+	default:
+		pr_err("Unknown session event received (%d), session: %s\n",
+		       ev, sess->sessname);
+	}
+}
+
+static void ibnbd_init_cpu_qlists(struct ibnbd_cpu_qlist __percpu *cpu_queues)
+{
+	unsigned int cpu;
+	struct ibnbd_cpu_qlist *cpu_q;
+
+	for_each_possible_cpu(cpu) {
+		cpu_q = per_cpu_ptr(cpu_queues, cpu);
+
+		cpu_q->cpu = cpu;
+		INIT_LIST_HEAD(&cpu_q->requeue_list);
+		spin_lock_init(&cpu_q->requeue_lock);
+	}
+}
+
+static struct blk_mq_ops ibnbd_mq_ops;
+static int setup_mq_tags(struct ibnbd_clt_session *sess)
+{
+	struct blk_mq_tag_set *tags = &sess->tag_set;
+
+	memset(tags, 0, sizeof(*tags));
+	tags->ops		= &ibnbd_mq_ops;
+	tags->queue_depth	= sess->queue_depth;
+	tags->numa_node		= NUMA_NO_NODE;
+	tags->flags		= BLK_MQ_F_SHOULD_MERGE |
+				  BLK_MQ_F_SG_MERGE     |
+				  BLK_MQ_F_TAG_SHARED;
+	tags->cmd_size		= sizeof(struct ibnbd_iu);
+	tags->nr_hw_queues	= num_online_cpus();
+
+	return blk_mq_alloc_tag_set(tags);
+}
+
+static void destroy_mq_tags(struct ibnbd_clt_session *sess)
+{
+	if (sess->tag_set.tags)
+		blk_mq_free_tag_set(&sess->tag_set);
+}
+
+static inline void wake_up_ibtrs_waiters(struct ibnbd_clt_session *sess)
+{
+	/* paired with rmb() in wait_for_ibtrs_connection() */
+	smp_wmb();
+	sess->ibtrs_ready = true;
+	wake_up_all(&sess->ibtrs_waitq);
+}
+
+static void close_ibtrs(struct ibnbd_clt_session *sess)
+{
+	might_sleep();
+
+	if (!IS_ERR_OR_NULL(sess->ibtrs)) {
+		ibtrs_clt_close(sess->ibtrs);
+		sess->ibtrs = NULL;
+		wake_up_ibtrs_waiters(sess);
+	}
+}
+
+static void free_sess(struct ibnbd_clt_session *sess)
+{
+	WARN_ON(!list_empty(&sess->devs_list));
+
+	might_sleep();
+
+	close_ibtrs(sess);
+	destroy_mq_tags(sess);
+	if (!list_empty(&sess->list)) {
+		mutex_lock(&sess_lock);
+		list_del(&sess->list);
+		mutex_unlock(&sess_lock);
+	}
+	free_percpu(sess->cpu_queues);
+	free_percpu(sess->cpu_rr);
+	kfree(sess);
+}
+
+static struct ibnbd_clt_session *alloc_sess(const char *sessname,
+					    const struct ibtrs_addr *paths,
+					    size_t path_cnt)
+{
+	struct ibnbd_clt_session *sess;
+	int err, cpu;
+
+	sess = kzalloc_node(sizeof(*sess), GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!sess)) {
+		pr_err("Failed to create session %s,"
+		       " allocating session struct failed\n", sessname);
+		return ERR_PTR(-ENOMEM);
+	}
+	strlcpy(sess->sessname, sessname, sizeof(sess->sessname));
+	atomic_set(&sess->busy, 0);
+	mutex_init(&sess->lock);
+	INIT_LIST_HEAD(&sess->devs_list);
+	INIT_LIST_HEAD(&sess->list);
+	bitmap_zero(sess->cpu_queues_bm, NR_CPUS);
+	init_waitqueue_head(&sess->ibtrs_waitq);
+	refcount_set(&sess->refcount, 1);
+
+	sess->cpu_queues = alloc_percpu(struct ibnbd_cpu_qlist);
+	if (unlikely(!sess->cpu_queues)) {
+		pr_err("Failed to create session to %s,"
+		       " alloc of percpu var (cpu_queues) failed\n", sessname);
+		err = -ENOMEM;
+		goto err;
+	}
+	ibnbd_init_cpu_qlists(sess->cpu_queues);
+
+	/*
+	 * That is a simple percpu variable which stores cpu indices, which are
+	 * incremented on each access.  We need that for the sake of fairness
+	 * to wake up queues in a round-robin manner.
+	 */
+	sess->cpu_rr = alloc_percpu(int);
+	if (unlikely(!sess->cpu_rr)) {
+		pr_err("Failed to create session %s,"
+		       " alloc of percpu var (cpu_rr) failed\n", sessname);
+		err = -ENOMEM;
+		goto err;
+	}
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(sess->cpu_rr, cpu) = cpu;
+
+	return sess;
+
+err:
+	free_sess(sess);
+
+	return ERR_PTR(err);
+}
+
+static int wait_for_ibtrs_connection(struct ibnbd_clt_session *sess)
+{
+	wait_event(sess->ibtrs_waitq, sess->ibtrs_ready);
+	/* paired with wmb() in wake_up_ibtrs_waiters() */
+	smp_rmb();
+	if (unlikely(IS_ERR_OR_NULL(sess->ibtrs)))
+		return -ECONNRESET;
+
+	return 0;
+}
+
+static void wait_for_ibtrs_disconnection(struct ibnbd_clt_session *sess)
+__releases(&sess_lock)
+__acquires(&sess_lock)
+{
+	DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
+
+	prepare_to_wait(&sess->ibtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
+	if (IS_ERR_OR_NULL(sess->ibtrs)) {
+		finish_wait(&sess->ibtrs_waitq, &wait);
+		return;
+	}
+	mutex_unlock(&sess_lock);
+	/* After unlock session can be freed, so careful */
+	schedule();
+	mutex_lock(&sess_lock);
+}
+
+static struct ibnbd_clt_session *__find_and_get_sess(const char *sessname)
+__releases(&sess_lock)
+__acquires(&sess_lock)
+{
+	struct ibnbd_clt_session *sess;
+	int err;
+
+again:
+	list_for_each_entry(sess, &sess_list, list) {
+		if (strcmp(sessname, sess->sessname))
+			continue;
+
+		if (unlikely(sess->ibtrs_ready && IS_ERR_OR_NULL(sess->ibtrs)))
+			/*
+			 * No IBTRS connection, session is dying.
+			 */
+			continue;
+
+		if (likely(ibnbd_clt_get_sess(sess))) {
+			/*
+			 * Alive session is found, wait for IBTRS connection.
+			 */
+			mutex_unlock(&sess_lock);
+			err = wait_for_ibtrs_connection(sess);
+			if (unlikely(err))
+				ibnbd_clt_put_sess(sess);
+			mutex_lock(&sess_lock);
+
+			if (unlikely(err))
+				/* Session is dying, repeat the loop */
+				goto again;
+
+			return sess;
+		} else {
+			/*
+			 * Ref is 0, session is dying, wait for IBTRS disconnect
+			 * in order to avoid session name clashes.
+			 */
+			wait_for_ibtrs_disconnection(sess);
+			/*
+			 * IBTRS is disconnected and soon session will be freed,
+			 * so repeat the loop.
+			 */
+			goto again;
+		}
+	}
+
+	return NULL;
+}
+
+static struct ibnbd_clt_session *find_and_get_sess(const char *sessname)
+{
+	struct ibnbd_clt_session *sess;
+
+	mutex_lock(&sess_lock);
+	sess = __find_and_get_sess(sessname);
+	mutex_unlock(&sess_lock);
+
+	return sess;
+}
+
+static struct ibnbd_clt_session *
+find_and_get_or_insert_sess(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_clt_session *found;
+
+	mutex_lock(&sess_lock);
+	found = __find_and_get_sess(sess->sessname);
+	if (!found)
+		list_add(&sess->list, &sess_list);
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static struct ibnbd_clt_session *
+find_and_get_or_create_sess(const char *sessname,
+			    const struct ibtrs_addr *paths,
+			    size_t path_cnt)
+{
+	struct ibnbd_clt_session *sess, *found;
+	struct ibtrs_attrs attrs;
+	int err;
+
+	sess = find_and_get_sess(sessname);
+	if (sess)
+		return sess;
+
+	sess = alloc_sess(sessname, paths, path_cnt);
+	if (unlikely(IS_ERR(sess)))
+		return sess;
+
+	found = find_and_get_or_insert_sess(sess);
+	if (unlikely(found)) {
+		free_sess(sess);
+
+		return found;
+	}
+	/*
+	 * Nothing was found, establish ibtrs connection and proceed further.
+	 */
+	sess->ibtrs = ibtrs_clt_open(sess, ibnbd_clt_link_ev, sessname,
+				     paths, path_cnt, IBTRS_PORT,
+				     sizeof(struct ibnbd_iu),
+				     RECONNECT_DELAY, BMAX_SEGMENTS,
+				     MAX_RECONNECTS);
+	if (unlikely(IS_ERR(sess->ibtrs))) {
+		err = PTR_ERR(sess->ibtrs);
+		goto wake_up_and_put;
+	}
+	ibtrs_clt_query(sess->ibtrs, &attrs);
+	sess->max_io_size = attrs.max_io_size;
+	sess->queue_depth = attrs.queue_depth;
+
+	err = setup_mq_tags(sess);
+	if (unlikely(err))
+		goto close_ibtrs;
+
+	err = send_msg_sess_info(sess, WAIT);
+	if (unlikely(err))
+		goto close_ibtrs;
+
+	wake_up_ibtrs_waiters(sess);
+
+	return sess;
+
+close_ibtrs:
+	close_ibtrs(sess);
+put_sess:
+	ibnbd_clt_put_sess(sess);
+
+	return ERR_PTR(err);
+
+wake_up_and_put:
+	wake_up_ibtrs_waiters(sess);
+	goto put_sess;
+}
+
+static int ibnbd_client_open(struct block_device *block_device, fmode_t mode)
+{
+	struct ibnbd_clt_dev *dev = block_device->bd_disk->private_data;
+
+	if (dev->read_only && (mode & FMODE_WRITE))
+		return -EPERM;
+
+	if (dev->dev_state == DEV_STATE_UNMAPPED ||
+	    !ibnbd_clt_get_dev(dev))
+		return -EIO;
+
+	return 0;
+}
+
+static void ibnbd_client_release(struct gendisk *gen, fmode_t mode)
+{
+	struct ibnbd_clt_dev *dev = gen->private_data;
+
+	ibnbd_clt_put_dev(dev);
+}
+
+static int ibnbd_client_getgeo(struct block_device *block_device,
+			       struct hd_geometry *geo)
+{
+	u64 size;
+	struct ibnbd_clt_dev *dev;
+
+	dev = block_device->bd_disk->private_data;
+	size = dev->size * (dev->logical_block_size / KERNEL_SECTOR_SIZE);
+	geo->cylinders	= (size & ~0x3f) >> 6;	/* size/64 */
+	geo->heads	= 4;
+	geo->sectors	= 16;
+	geo->start	= 0;
+
+	return 0;
+}
+
+static const struct block_device_operations ibnbd_client_ops = {
+	.owner		= THIS_MODULE,
+	.open		= ibnbd_client_open,
+	.release	= ibnbd_client_release,
+	.getgeo		= ibnbd_client_getgeo
+};
+
+static size_t ibnbd_clt_get_sg_size(struct scatterlist *sglist, u32 len)
+{
+	struct scatterlist *sg;
+	size_t tsize = 0;
+	int i;
+
+	for_each_sg(sglist, sg, len, i)
+		tsize += sg->length;
+	return tsize;
+}
+
+static int ibnbd_client_xfer_request(struct ibnbd_clt_dev *dev,
+				     struct request *rq,
+				     struct ibnbd_iu *iu)
+{
+	struct ibtrs_clt *ibtrs = dev->sess->ibtrs;
+	struct ibtrs_tag *tag = iu->tag;
+	struct ibnbd_msg_io msg;
+	unsigned int sg_cnt = 0;
+	struct kvec vec;
+	size_t size;
+	int err;
+
+	iu->rq		= rq;
+	iu->dev		= dev;
+	msg.sector	= cpu_to_le64(blk_rq_pos(rq));
+	msg.bi_size	= cpu_to_le32(blk_rq_bytes(rq));
+	msg.rw		= cpu_to_le32(rq_to_ibnbd_flags(rq));
+
+	/* We only support discards with single segment for now. See queue limits. */
+	if (req_op(rq) != REQ_OP_DISCARD)
+		sg_cnt = blk_rq_map_sg(dev->queue, rq, iu->sglist);
+
+	if (sg_cnt == 0)
+		/* Do not forget to mark the end */
+		sg_mark_end(&iu->sglist[0]);
+
+	msg.hdr.type	= cpu_to_le16(IBNBD_MSG_IO);
+	msg.device_id	= cpu_to_le32(dev->device_id);
+
+	vec = (struct kvec) {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+
+	size = ibnbd_clt_get_sg_size(iu->sglist, sg_cnt);
+	err = ibtrs_clt_request(rq_data_dir(rq), msg_io_conf, ibtrs, tag,
+				iu, &vec, 1, size, iu->sglist, sg_cnt);
+	if (unlikely(err)) {
+		ibnbd_err_rl(dev, "IBTRS failed to transfer IO, err: %d\n",
+			     err);
+		return err;
+	}
+
+	return 0;
+}
+
+/**
+ * ibnbd_clt_dev_add_to_requeue() - add device to requeue if session is busy
+ *
+ * Description:
+ *     If session is busy, that means someone will requeue us when resources
+ *     are freed.  If session is not doing anything - device is not added to
+ *     the list and @false is returned.
+ */
+static inline bool ibnbd_clt_dev_add_to_requeue(struct ibnbd_clt_dev *dev,
+						struct ibnbd_queue *q)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	struct ibnbd_cpu_qlist *cpu_q;
+	unsigned long flags;
+	bool added = true;
+	bool need_set;
+
+	cpu_q = get_cpu_ptr(sess->cpu_queues);
+	spin_lock_irqsave(&cpu_q->requeue_lock, flags);
+
+	if (likely(!test_and_set_bit_lock(0, &q->in_list))) {
+		if (WARN_ON(!list_empty(&q->requeue_list)))
+			goto unlock;
+
+		need_set = !test_bit(cpu_q->cpu, sess->cpu_queues_bm);
+		if (need_set) {
+			set_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			/* Paired with ibnbd_put_tag().	 Set a bit first
+			 * and then observe the busy counter.
+			 */
+			smp_mb__before_atomic();
+		}
+		if (likely(atomic_read(&sess->busy))) {
+			list_add_tail(&q->requeue_list, &cpu_q->requeue_list);
+		} else {
+			/* Very unlikely, but possible: busy counter was
+			 * observed as zero.  Drop all bits and return
+			 * false to restart the queue by ourselves.
+			 */
+			if (need_set)
+				clear_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			clear_bit_unlock(0, &q->in_list);
+			added = false;
+		}
+	}
+unlock:
+	spin_unlock_irqrestore(&cpu_q->requeue_lock, flags);
+	put_cpu_ptr(sess->cpu_queues);
+
+	return added;
+}
+
+static void ibnbd_clt_dev_kick_mq_queue(struct ibnbd_clt_dev *dev,
+					struct blk_mq_hw_ctx *hctx,
+					int delay)
+{
+	struct ibnbd_queue *q = hctx->driver_data;
+
+	if (delay != IBNBD_DELAY_IFBUSY)
+		blk_mq_delay_run_hw_queue(hctx, delay);
+	else if (unlikely(!ibnbd_clt_dev_add_to_requeue(dev, q)))
+		/*
+		 * If session is not busy we have to restart
+		 * the queue ourselves.
+		 */
+		blk_mq_delay_run_hw_queue(hctx, IBNBD_DELAY_10ms);
+}
+
+static blk_status_t ibnbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+				   const struct blk_mq_queue_data *bd)
+{
+	struct request *rq = bd->rq;
+	struct ibnbd_clt_dev *dev = rq->rq_disk->private_data;
+	struct ibnbd_iu *iu = blk_mq_rq_to_pdu(rq);
+	int err;
+
+	if (unlikely(!ibnbd_clt_dev_is_mapped(dev)))
+		return BLK_STS_IOERR;
+
+	iu->tag = ibnbd_get_tag(dev->sess, IBTRS_IO_CON, IBTRS_TAG_NOWAIT);
+	if (unlikely(!iu->tag)) {
+		ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_IFBUSY);
+		return BLK_STS_RESOURCE;
+	}
+
+	blk_mq_start_request(rq);
+	err = ibnbd_client_xfer_request(dev, rq, iu);
+	if (likely(err == 0))
+		return BLK_STS_OK;
+	if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
+		ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_10ms);
+		ibnbd_put_tag(dev->sess, iu->tag);
+		return BLK_STS_RESOURCE;
+	}
+
+	ibnbd_put_tag(dev->sess, iu->tag);
+	return BLK_STS_IOERR;
+}
+
+static int ibnbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			      unsigned int hctx_idx, unsigned int numa_node)
+{
+	struct ibnbd_iu *iu = blk_mq_rq_to_pdu(rq);
+
+	sg_init_table(iu->sglist, BMAX_SEGMENTS);
+	return 0;
+}
+
+static inline void ibnbd_init_hw_queue(struct ibnbd_clt_dev *dev,
+				       struct ibnbd_queue *q,
+				       struct blk_mq_hw_ctx *hctx)
+{
+	INIT_LIST_HEAD(&q->requeue_list);
+	q->dev  = dev;
+	q->hctx = hctx;
+}
+
+static void ibnbd_init_mq_hw_queues(struct ibnbd_clt_dev *dev)
+{
+	int i;
+	struct blk_mq_hw_ctx *hctx;
+	struct ibnbd_queue *q;
+
+	queue_for_each_hw_ctx(dev->queue, hctx, i) {
+		q = &dev->hw_queues[i];
+		ibnbd_init_hw_queue(dev, q, hctx);
+		hctx->driver_data = q;
+	}
+}
+
+static struct blk_mq_ops ibnbd_mq_ops = {
+	.queue_rq	= ibnbd_queue_rq,
+	.init_request	= ibnbd_init_request,
+	.complete	= ibnbd_softirq_done_fn,
+};
+
+static int index_to_minor(int index)
+{
+	return index << IBNBD_PART_BITS;
+}
+
+static int minor_to_index(int minor)
+{
+	return minor >> IBNBD_PART_BITS;
+}
+
+static int setup_mq_dev(struct ibnbd_clt_dev *dev)
+{
+	dev->queue = blk_mq_init_queue(&dev->sess->tag_set);
+	if (IS_ERR(dev->queue)) {
+		ibnbd_err(dev,
+			  "Initializing multiqueue queue failed, err: %ld\n",
+			  PTR_ERR(dev->queue));
+		return PTR_ERR(dev->queue);
+	}
+	ibnbd_init_mq_hw_queues(dev);
+	return 0;
+}
+
+static void setup_request_queue(struct ibnbd_clt_dev *dev)
+{
+	blk_queue_logical_block_size(dev->queue, dev->logical_block_size);
+	blk_queue_physical_block_size(dev->queue, dev->physical_block_size);
+	blk_queue_max_hw_sectors(dev->queue, dev->max_hw_sectors);
+	blk_queue_max_write_same_sectors(dev->queue,
+					 dev->max_write_same_sectors);
+
+	/* we don't support discards to "discontiguous" segments in one request */
+	blk_queue_max_discard_segments(dev->queue, 1);
+
+	blk_queue_max_discard_sectors(dev->queue, dev->max_discard_sectors);
+	dev->queue->limits.discard_granularity	= dev->discard_granularity;
+	dev->queue->limits.discard_alignment	= dev->discard_alignment;
+	if (dev->max_discard_sectors)
+		blk_queue_flag_set(QUEUE_FLAG_DISCARD, dev->queue);
+	if (dev->secure_discard)
+		blk_queue_flag_set(QUEUE_FLAG_SECERASE, dev->queue);
+
+	blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, dev->queue);
+	blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, dev->queue);
+	/* our HCA supports only 32 SG entries, proto uses one, so 31 are left */
+	blk_queue_max_segments(dev->queue, dev->max_segments);
+	blk_queue_io_opt(dev->queue, dev->sess->max_io_size);
+	blk_queue_virt_boundary(dev->queue, 4095);
+	blk_queue_write_cache(dev->queue, true, true);
+	dev->queue->queuedata = dev;
+}
+
+static void ibnbd_clt_setup_gen_disk(struct ibnbd_clt_dev *dev, int idx)
+{
+	dev->gd->major		= ibnbd_client_major;
+	dev->gd->first_minor	= index_to_minor(idx);
+	dev->gd->fops		= &ibnbd_client_ops;
+	dev->gd->queue		= dev->queue;
+	dev->gd->private_data	= dev;
+	snprintf(dev->gd->disk_name, sizeof(dev->gd->disk_name), "ibnbd%d",
+		 idx);
+	pr_debug("disk_name=%s, capacity=%zu\n",
+		 dev->gd->disk_name,
+		 dev->nsectors * (dev->logical_block_size / KERNEL_SECTOR_SIZE)
+		 );
+
+	set_capacity(dev->gd, dev->nsectors * (dev->logical_block_size /
+					       KERNEL_SECTOR_SIZE));
+
+	if (dev->access_mode == IBNBD_ACCESS_RO) {
+		dev->read_only = true;
+		set_disk_ro(dev->gd, true);
+	} else {
+		dev->read_only = false;
+	}
+
+	if (!dev->rotational)
+		blk_queue_flag_set(QUEUE_FLAG_NONROT, dev->queue);
+}
+
+static void ibnbd_clt_add_gen_disk(struct ibnbd_clt_dev *dev)
+{
+	add_disk(dev->gd);
+}
+
+static int ibnbd_client_setup_device(struct ibnbd_clt_session *sess,
+				     struct ibnbd_clt_dev *dev, int idx)
+{
+	int err;
+
+	dev->size = dev->nsectors * dev->logical_block_size;
+
+	err = setup_mq_dev(dev);
+	if (err)
+		return err;
+
+	setup_request_queue(dev);
+
+	dev->gd = alloc_disk_node(1 << IBNBD_PART_BITS,	NUMA_NO_NODE);
+	if (!dev->gd) {
+		ibnbd_err(dev, "Failed to allocate disk node\n");
+		blk_cleanup_queue(dev->queue);
+		return -ENOMEM;
+	}
+
+	ibnbd_clt_setup_gen_disk(dev, idx);
+
+	return 0;
+}
+
+static struct ibnbd_clt_dev *init_dev(struct ibnbd_clt_session *sess,
+				      enum ibnbd_access_mode access_mode,
+				      enum ibnbd_io_mode io_mode,
+				      const char *pathname)
+{
+	struct ibnbd_clt_dev *dev;
+	int ret;
+
+	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, NUMA_NO_NODE);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	dev->hw_queues = kcalloc(nr_cpu_ids, sizeof(*dev->hw_queues), GFP_KERNEL);
+	if (unlikely(!dev->hw_queues)) {
+		pr_err("Failed to initialize device '%s' from session"
+		       " %s, allocating hw_queues failed.\n", pathname,
+		       sess->sessname);
+		ret = -ENOMEM;
+		goto out_alloc;
+	}
+
+	mutex_lock(&ida_lock);
+	ret = ida_simple_get(&index_ida, 0, minor_to_index(1 << MINORBITS),
+			     GFP_KERNEL);
+	mutex_unlock(&ida_lock);
+	if (ret < 0) {
+		pr_err("Failed to initialize device '%s' from session %s,"
+		       " allocating ida failed, err: %d\n", pathname,
+		       sess->sessname, ret);
+		goto out_queues;
+	}
+	dev->clt_device_id	= ret;
+	dev->sess		= sess;
+	dev->access_mode	= access_mode;
+	dev->io_mode		= io_mode;
+	strlcpy(dev->pathname, pathname, sizeof(dev->pathname));
+	mutex_init(&dev->lock);
+	refcount_set(&dev->refcount, 1);
+	dev->dev_state = DEV_STATE_INIT;
+
+	/*
+	 * Here we called from sysfs entry, thus clt-sysfs is
+	 * responsible that session will not disappear.
+	 */
+	WARN_ON(!ibnbd_clt_get_sess(sess));
+
+	return dev;
+
+out_queues:
+	kfree(dev->hw_queues);
+out_alloc:
+	kfree(dev);
+	return ERR_PTR(ret);
+}
+
+static bool __exists_dev(const char *pathname)
+{
+	struct ibnbd_clt_session *sess;
+	struct ibnbd_clt_dev *dev;
+	bool found = false;
+
+	list_for_each_entry(sess, &sess_list, list) {
+		mutex_lock(&sess->lock);
+		list_for_each_entry(dev, &sess->devs_list, list) {
+			if (!strncmp(dev->pathname, pathname,
+				     sizeof(dev->pathname))) {
+				found = true;
+				break;
+			}
+		}
+		mutex_unlock(&sess->lock);
+		if (found)
+			break;
+	}
+
+	return found;
+}
+
+static bool exists_devpath(const char *pathname)
+{
+	bool found;
+
+	mutex_lock(&sess_lock);
+	found = __exists_dev(pathname);
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static bool insert_dev_if_not_exists_devpath(const char *pathname,
+					     struct ibnbd_clt_session *sess,
+					     struct ibnbd_clt_dev *dev)
+{
+	bool found;
+
+	mutex_lock(&sess_lock);
+	found = __exists_dev(pathname);
+	if (!found) {
+		mutex_lock(&sess->lock);
+		list_add_tail(&dev->list, &sess->devs_list);
+		mutex_unlock(&sess->lock);
+	}
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static void delete_dev(struct ibnbd_clt_dev *dev)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+
+	mutex_lock(&sess->lock);
+	list_del(&dev->list);
+	mutex_unlock(&sess->lock);
+}
+
+struct ibnbd_clt_dev *ibnbd_clt_map_device(const char *sessname,
+					   struct ibtrs_addr *paths,
+					   size_t path_cnt,
+					   const char *pathname,
+					   enum ibnbd_access_mode access_mode,
+					   enum ibnbd_io_mode io_mode)
+{
+	struct ibnbd_clt_session *sess;
+	struct ibnbd_clt_dev *dev;
+	int ret;
+
+	if (unlikely(exists_devpath(pathname)))
+		return ERR_PTR(-EEXIST);
+
+	sess = find_and_get_or_create_sess(sessname, paths, path_cnt);
+	if (unlikely(IS_ERR(sess)))
+		return ERR_CAST(sess);
+
+	dev = init_dev(sess, access_mode, io_mode, pathname);
+	if (unlikely(IS_ERR(dev))) {
+		pr_err("map_device: failed to map device '%s' from session %s,"
+		       " can't initialize device, err: %ld\n", pathname,
+		       sess->sessname, PTR_ERR(dev));
+		ret = PTR_ERR(dev);
+		goto put_sess;
+	}
+	if (unlikely(insert_dev_if_not_exists_devpath(pathname, sess, dev))) {
+		ret = -EEXIST;
+		goto put_dev;
+	}
+	ret = send_msg_open(dev, WAIT);
+	if (unlikely(ret)) {
+		ibnbd_err(dev, "map_device: failed, can't open remote device,"
+			  " err: %d\n", ret);
+		goto del_dev;
+	}
+	mutex_lock(&dev->lock);
+	pr_debug("Opened remote device: session=%s, path='%s'\n",
+		 sess->sessname, pathname);
+	ret = ibnbd_client_setup_device(sess, dev, dev->clt_device_id);
+	if (ret) {
+		ibnbd_err(dev, "map_device: Failed to configure device, err: %d\n",
+			  ret);
+		mutex_unlock(&dev->lock);
+		goto del_dev;
+	}
+
+	ibnbd_info(dev, "map_device: Device mapped as %s (nsectors: %zu,"
+		   " logical_block_size: %d, physical_block_size: %d,"
+		   " max_write_same_sectors: %d, max_discard_sectors: %d,"
+		   " discard_granularity: %d, discard_alignment: %d, "
+		   "secure_discard: %d, max_segments: %d, max_hw_sectors: %d, "
+		   "rotational: %d)\n",
+		   dev->gd->disk_name, dev->nsectors, dev->logical_block_size,
+		   dev->physical_block_size, dev->max_write_same_sectors,
+		   dev->max_discard_sectors, dev->discard_granularity,
+		   dev->discard_alignment, dev->secure_discard,
+		   dev->max_segments, dev->max_hw_sectors, dev->rotational);
+
+	mutex_unlock(&dev->lock);
+
+	ibnbd_clt_add_gen_disk(dev);
+	ibnbd_clt_put_sess(sess);
+
+	return dev;
+
+del_dev:
+	delete_dev(dev);
+put_dev:
+	ibnbd_clt_put_dev(dev);
+put_sess:
+	ibnbd_clt_put_sess(sess);
+
+	return ERR_PTR(ret);
+}
+
+static void destroy_gen_disk(struct ibnbd_clt_dev *dev)
+{
+	del_gendisk(dev->gd);
+	/*
+	 * Before marking queue as dying (blk_cleanup_queue() does that)
+	 * we have to be sure that everything in-flight has gone.
+	 * Blink with freeze/unfreeze.
+	 */
+	blk_mq_freeze_queue(dev->queue);
+	blk_mq_unfreeze_queue(dev->queue);
+	blk_cleanup_queue(dev->queue);
+	put_disk(dev->gd);
+}
+
+static void destroy_sysfs(struct ibnbd_clt_dev *dev,
+			  const struct attribute *sysfs_self)
+{
+	ibnbd_clt_remove_dev_symlink(dev);
+	if (dev->kobj.state_initialized) {
+		if (sysfs_self)
+			/* To avoid deadlock firstly commit suicide */
+			sysfs_remove_file_self(&dev->kobj, sysfs_self);
+		kobject_del(&dev->kobj);
+		kobject_put(&dev->kobj);
+	}
+}
+
+int ibnbd_clt_unmap_device(struct ibnbd_clt_dev *dev, bool force,
+			   const struct attribute *sysfs_self)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	int refcount, ret = 0;
+	bool was_mapped;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state == DEV_STATE_UNMAPPED) {
+		ibnbd_info(dev, "Device is already being unmapped\n");
+		ret = -EALREADY;
+		goto err;
+	}
+	refcount = refcount_read(&dev->refcount);
+	if (!force && refcount > 1) {
+		ibnbd_err(dev, "Closing device failed, device is in use,"
+			  " (%d device users)\n", refcount - 1);
+		ret = -EBUSY;
+		goto err;
+	}
+	was_mapped = (dev->dev_state == DEV_STATE_MAPPED);
+	dev->dev_state = DEV_STATE_UNMAPPED;
+	mutex_unlock(&dev->lock);
+
+	delete_dev(dev);
+	destroy_sysfs(dev, sysfs_self);
+	destroy_gen_disk(dev);
+	if (was_mapped && sess->ibtrs)
+		send_msg_close(dev, dev->device_id, WAIT);
+
+	ibnbd_info(dev, "Device is unmapped\n");
+
+	/* Likely last reference put */
+	ibnbd_clt_put_dev(dev);
+
+	/*
+	 * At this point the device and session may already be gone!
+	 */
+
+	return 0;
+err:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+int ibnbd_clt_remap_device(struct ibnbd_clt_dev *dev)
+{
+	int err;
+
+	mutex_lock(&dev->lock);
+	if (likely(dev->dev_state == DEV_STATE_MAPPED_DISCONNECTED))
+		err = 0;
+	else if (dev->dev_state == DEV_STATE_UNMAPPED)
+		err = -ENODEV;
+	else if (dev->dev_state == DEV_STATE_MAPPED)
+		err = -EALREADY;
+	else
+		err = -EBUSY;
+	mutex_unlock(&dev->lock);
+	if (likely(!err)) {
+		ibnbd_info(dev, "Remapping device.\n");
+		err = send_msg_open(dev, WAIT);
+		if (unlikely(err))
+			ibnbd_err(dev, "remap_device: %d\n", err);
+	}
+
+	return err;
+}
+
+static void unmap_device_work(struct work_struct *work)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(work, typeof(*dev), unmap_on_rmmod_work);
+	ibnbd_clt_unmap_device(dev, true, NULL);
+}
+
+static void ibnbd_destroy_sessions(void)
+{
+	struct ibnbd_clt_session *sess, *sn;
+	struct ibnbd_clt_dev *dev, *tn;
+
+	/* Firstly forbid access through sysfs interface */
+	ibnbd_clt_destroy_default_group();
+	ibnbd_clt_destroy_sysfs_files();
+
+	/*
+	 * At this point there is no concurrent access to the sessions
+	 * list and devices list:
+	 *   1. New session or device can't be created - session sysfs files
+	 *      are removed.
+	 *   2. Device or session can't be removed - module reference is taken
+	 *      into account in unmap device sysfs callback.
+	 *   3. No IO requests inflight - each file open of block_dev increases
+	 *      module reference in get_disk().
+	 *
+	 * But still there can be user requests inflight, which are sent by
+	 * asynchronous send_msg_*() functions, thus before unmapping devices
+	 * IBTRS session must be explicitly closed.
+	 */
+
+	list_for_each_entry_safe(sess, sn, &sess_list, list) {
+		WARN_ON(!ibnbd_clt_get_sess(sess));
+		close_ibtrs(sess);
+		list_for_each_entry_safe(dev, tn, &sess->devs_list, list) {
+			/*
+			 * Here unmap happens in parallel for only one reason:
+			 * blk_cleanup_queue() takes around half a second, so
+			 * with a huge number of devices the whole module unload
+			 * procedure takes minutes.
+			 */
+			INIT_WORK(&dev->unmap_on_rmmod_work, unmap_device_work);
+			queue_work(unload_wq, &dev->unmap_on_rmmod_work);
+		}
+		ibnbd_clt_put_sess(sess);
+	}
+	/* Wait for all scheduled unmap works */
+	flush_workqueue(unload_wq);
+	WARN_ON(!list_empty(&sess_list));
+}
+
+static int __init ibnbd_client_init(void)
+{
+	int err;
+
+	pr_info("Loading module %s, version %s, proto %s: "
+		"(softirq_enable: %d)\n", KBUILD_MODNAME,
+		IBNBD_VER_STRING, IBNBD_PROTO_VER_STRING,
+		softirq_enable);
+
+	ibnbd_client_major = register_blkdev(ibnbd_client_major, "ibnbd");
+	if (ibnbd_client_major <= 0) {
+		pr_err("Failed to load module,"
+		       " block device registration failed\n");
+		err = -EBUSY;
+		goto out;
+	}
+
+	err = ibnbd_clt_create_sysfs_files();
+	if (err) {
+		pr_err("Failed to load module,"
+		       " creating sysfs device files failed, err: %d\n",
+		       err);
+		goto out_unregister_blk;
+	}
+
+	unload_wq = alloc_workqueue("ibnbd_unload_wq", WQ_MEM_RECLAIM, 0);
+	if (!unload_wq) {
+		pr_err("Failed to load module, alloc ibnbd_unload_wq failed\n");
+		err = -ENOMEM;
+		goto out_destroy_sysfs_files;
+	}
+
+	return 0;
+
+out_destroy_sysfs_files:
+	ibnbd_clt_destroy_sysfs_files();
+out_unregister_blk:
+	unregister_blkdev(ibnbd_client_major, "ibnbd");
+out:
+	return err;
+}
+
+static void __exit ibnbd_client_exit(void)
+{
+	pr_info("Unloading module\n");
+	ibnbd_destroy_sessions();
+	unregister_blkdev(ibnbd_client_major, "ibnbd");
+	ida_destroy(&index_ida);
+	destroy_workqueue(unload_wq);
+	pr_info("Module unloaded\n");
+}
+
+module_init(ibnbd_client_init);
+module_exit(ibnbd_client_exit);
-- 
2.13.1
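
A side note on the minor and geometry arithmetic in the patch above:
index_to_minor() and minor_to_index() are plain shifts by IBNBD_PART_BITS,
and ibnbd_client_getgeo() reports 4 heads and 16 sectors, so one cylinder
covers 4 * 16 = 64 units of `size`. That is why the cylinder count is
`size` rounded down to a multiple of 64 and then shifted right by 6.
Below is a minimal userspace sketch of that arithmetic. EXAMPLE_PART_BITS
is an assumed value picked only for illustration (the real IBNBD_PART_BITS
is defined elsewhere in the series), and struct example_geo and
size_to_geo() are hypothetical names, not part of the driver.

```c
#include <stdint.h>

/* Assumed value, for illustration only; not taken from the series. */
#define EXAMPLE_PART_BITS 6

/* Each device index owns a block of (1 << EXAMPLE_PART_BITS) minors. */
static int index_to_minor(int index)
{
	return index << EXAMPLE_PART_BITS;
}

/* Inverse mapping: any minor inside the block maps back to its index. */
static int minor_to_index(int minor)
{
	return minor >> EXAMPLE_PART_BITS;
}

/* Hypothetical stand-in for the fields filled in struct hd_geometry. */
struct example_geo {
	uint64_t cylinders;
	unsigned int heads;
	unsigned int sectors;
};

/* Same expression as in the patch: round down to 64, then divide by 64. */
static struct example_geo size_to_geo(uint64_t size)
{
	struct example_geo geo = {
		.cylinders = (size & ~(uint64_t)0x3f) >> 6,
		.heads     = 4,
		.sectors   = 16,
	};

	return geo;
}
```

With EXAMPLE_PART_BITS set to 6, index_to_minor(3) gives 192, and a
`size` of 1000 gives 15 cylinders (960 / 64).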


* [PATCH v2 19/26] ibnbd: client: sysfs interface functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (17 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 18/26] ibnbd: client: main functionality Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 20/26] ibnbd: server: private header with server structs and functions Roman Pen
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the sysfs interface to IBNBD block devices on the client side:

  /sys/devices/virtual/ibnbd-client/ctl/
    |- map_device
    |  *** maps remote device
    |
    |- devices/
       *** all mapped devices

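  /sys/block/ibnbd<N>/ibnbd/
    |- unmap_device
    |  *** unmaps device
    |
    |- resize
    |  *** resizes device (new size in sectors)
    |
    |- remap_device
    |  *** remaps device
    |
    |- state
    |  *** device state
    |
    |- session
    |  *** session name
    |
    |- io_mode
    |  *** io mode used on the remote (server) side
    |
    |- mapping_path
       *** path of the dev that was mapped on server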
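
As a rough illustration of the map_device interface above, a userspace
helper could build the option string and write it to the sysfs file.  This
is only a sketch and not part of the patch: append_opt(),
build_map_request() and write_request() are hypothetical names.  The
option keys are the ones accepted by the parser in this patch (sessname=,
path=, device_path=, access_mode=, io_mode=).

```c
#include <stdio.h>
#include <stddef.h>

/* Path taken from the sysfs layout described in this commit message. */
#define IBNBD_MAP_DEVICE "/sys/devices/virtual/ibnbd-client/ctl/map_device"

/* Append "key=value" (space separated) at *off; -1 if buf is too small. */
int append_opt(char *buf, size_t len, size_t *off,
	       const char *key, const char *val)
{
	int n = snprintf(buf + *off, len - *off, "%s%s=%s",
			 *off ? " " : "", key, val);

	if (n < 0 || (size_t)n >= len - *off)
		return -1;
	*off += (size_t)n;
	return 0;
}

/*
 * Build a map_device request.  sessname, path and device_path are
 * mandatory; access_mode and io_mode may be NULL to use the defaults.
 */
int build_map_request(char *buf, size_t len, const char *sessname,
		      const char *path, const char *device_path,
		      const char *access_mode, const char *io_mode)
{
	size_t off = 0;

	if (append_opt(buf, len, &off, "sessname", sessname) ||
	    append_opt(buf, len, &off, "path", path) ||
	    append_opt(buf, len, &off, "device_path", device_path))
		return -1;
	if (access_mode &&
	    append_opt(buf, len, &off, "access_mode", access_mode))
		return -1;
	if (io_mode && append_opt(buf, len, &off, "io_mode", io_mode))
		return -1;
	return 0;
}

/* Write the request string to the given file; 0 on success, -1 on error. */
int write_request(const char *file, const char *req)
{
	FILE *f = fopen(file, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fputs(req, f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)	/* buffered write errors surface here */
		ret = -1;
	return ret;
}
```

For example, build_map_request(buf, sizeof(buf), "sess1", "ip:10.0.0.1",
"/dev/sdb", "ro", NULL) followed by write_request(IBNBD_MAP_DEVICE, buf)
would request a read-only mapping.  The session name, address and device
path here are made-up values.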

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-clt-sysfs.c | 675 ++++++++++++++++++++++++++++++++++
 1 file changed, 675 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c

diff --git a/drivers/block/ibnbd/ibnbd-clt-sysfs.c b/drivers/block/ibnbd/ibnbd-clt-sysfs.c
new file mode 100644
index 000000000000..ca3e59b28c54
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt-sysfs.c
@@ -0,0 +1,675 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Swapnil Ingle <swapnil.ingle@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/parser.h>
+#include <linux/module.h>
+#include <linux/in6.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/device.h>
+#include <rdma/ib.h>
+#include <rdma/rdma_cm.h>
+
+#include "ibnbd-clt.h"
+
+static struct device *ibnbd_dev;
+static struct class *ibnbd_dev_class;
+static struct kobject *ibnbd_devs_kobj;
+
+enum {
+	IBNBD_OPT_ERR		= 0,
+	IBNBD_OPT_PATH		= 1 << 0,
+	IBNBD_OPT_DEV_PATH	= 1 << 1,
+	IBNBD_OPT_ACCESS_MODE	= 1 << 3,
+	IBNBD_OPT_IO_MODE	= 1 << 5,
+	IBNBD_OPT_SESSNAME	= 1 << 6,
+};
+
+static unsigned int ibnbd_opt_mandatory[] = {
+	IBNBD_OPT_PATH,
+	IBNBD_OPT_DEV_PATH,
+	IBNBD_OPT_SESSNAME,
+};
+
+static const match_table_t ibnbd_opt_tokens = {
+	{	IBNBD_OPT_PATH,		"path=%s"		},
+	{	IBNBD_OPT_DEV_PATH,	"device_path=%s"	},
+	{	IBNBD_OPT_ACCESS_MODE,	"access_mode=%s"	},
+	{	IBNBD_OPT_IO_MODE,	"io_mode=%s"		},
+	{	IBNBD_OPT_SESSNAME,	"sessname=%s"		},
+	{	IBNBD_OPT_ERR,		NULL			},
+};
+
+/* remove new line from string */
+static void strip(char *s)
+{
+	char *p = s;
+
+	while (*s != '\0') {
+		if (*s != '\n')
+			*p++ = *s++;
+		else
+			++s;
+	}
+	*p = '\0';
+}
+
+static int ibnbd_clt_parse_map_options(const char *buf,
+				       char *sessname,
+				       struct ibtrs_addr *paths,
+				       size_t *path_cnt,
+				       size_t max_path_cnt,
+				       char *pathname,
+				       enum ibnbd_access_mode *access_mode,
+				       enum ibnbd_io_mode *io_mode)
+{
+	char *options, *sep_opt;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int opt_mask = 0;
+	int token;
+	int ret = -EINVAL;
+	int i;
+	int p_cnt = 0;
+
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	sep_opt = strstrip(options);
+	strip(sep_opt);
+	while ((p = strsep(&sep_opt, " ")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, ibnbd_opt_tokens, args);
+		opt_mask |= token;
+
+		switch (token) {
+		case IBNBD_OPT_SESSNAME:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (strlen(p) >= NAME_MAX) {
+				pr_err("map_device: sessname too long\n");
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			strlcpy(sessname, p, NAME_MAX);
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_PATH:
+			if (p_cnt >= max_path_cnt) {
+				pr_err("map_device: too many (> %zu) paths "
+				       "provided\n", max_path_cnt);
+				ret = -ENOMEM;
+				goto out;
+			}
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			ret = ibtrs_addr_to_sockaddr(p, strlen(p), IBTRS_PORT,
+						     &paths[p_cnt]);
+			if (ret) {
+				pr_err("Can't parse path %s: %d\n", p, ret);
+				kfree(p);
+				goto out;
+			}
+
+			p_cnt++;
+
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_DEV_PATH:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (strlen(p) >= NAME_MAX) {
+				pr_err("map_device: Device path too long\n");
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			strlcpy(pathname, p, NAME_MAX);
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_ACCESS_MODE:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			if (!strcmp(p, "ro")) {
+				*access_mode = IBNBD_ACCESS_RO;
+			} else if (!strcmp(p, "rw")) {
+				*access_mode = IBNBD_ACCESS_RW;
+			} else if (!strcmp(p, "migration")) {
+				*access_mode = IBNBD_ACCESS_MIGRATION;
+			} else {
+				pr_err("map_device: Invalid access_mode:"
+				       " '%s'\n", p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_IO_MODE:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (!strcmp(p, "blockio")) {
+				*io_mode = IBNBD_BLOCKIO;
+			} else if (!strcmp(p, "fileio")) {
+				*io_mode = IBNBD_FILEIO;
+			} else {
+				pr_err("map_device: Invalid io_mode: '%s'.\n",
+				       p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			kfree(p);
+			break;
+
+		default:
+			pr_err("map_device: Unknown parameter or missing value"
+			       " '%s'\n", p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(ibnbd_opt_mandatory); i++) {
+		if ((opt_mask & ibnbd_opt_mandatory[i])) {
+			ret = 0;
+		} else {
+			pr_err("map_device: Parameters missing\n");
+			ret = -EINVAL;
+			break;
+		}
+	}
+
+out:
+	*path_cnt = p_cnt;
+	kfree(options);
+	return ret;
+}
+
+static ssize_t ibnbd_clt_state_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	switch (dev->dev_state) {
+	case (DEV_STATE_INIT):
+		return scnprintf(page, PAGE_SIZE, "init\n");
+	case (DEV_STATE_MAPPED):
+		/* TODO fix cli tool before changing to proper state */
+		return scnprintf(page, PAGE_SIZE, "open\n");
+	case (DEV_STATE_MAPPED_DISCONNECTED):
+		/* TODO fix cli tool before changing to proper state */
+		return scnprintf(page, PAGE_SIZE, "closed\n");
+	case (DEV_STATE_UNMAPPED):
+		return scnprintf(page, PAGE_SIZE, "unmapped\n");
+	default:
+		return scnprintf(page, PAGE_SIZE, "unknown\n");
+	}
+}
+
+static struct kobj_attribute ibnbd_clt_state_attr =
+	__ATTR(state, 0444, ibnbd_clt_state_show, NULL);
+
+static ssize_t ibnbd_clt_mapping_path_show(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", dev->pathname);
+}
+
+static struct kobj_attribute ibnbd_clt_mapping_path_attr =
+	__ATTR(mapping_path, 0444, ibnbd_clt_mapping_path_show, NULL);
+
+static ssize_t ibnbd_clt_io_mode_show(struct kobject *kobj,
+				      struct kobj_attribute *attr, char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 ibnbd_io_mode_str(dev->remote_io_mode));
+}
+
+static struct kobj_attribute ibnbd_clt_io_mode =
+	__ATTR(io_mode, 0444, ibnbd_clt_io_mode_show, NULL);
+
+static ssize_t ibnbd_clt_unmap_dev_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo <normal|force> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibnbd_clt_unmap_dev_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct ibnbd_clt_dev *dev;
+	char *opt, *options;
+	bool force;
+	int err;
+
+	opt = kstrdup(buf, GFP_KERNEL);
+	if (!opt)
+		return -ENOMEM;
+
+	options = strstrip(opt);
+	strip(options);
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	if (sysfs_streq(options, "normal")) {
+		force = false;
+	} else if (sysfs_streq(options, "force")) {
+		force = true;
+	} else {
+		ibnbd_err(dev, "unmap_device: Invalid value: %s\n", options);
+		err = -EINVAL;
+		goto out;
+	}
+
+	ibnbd_info(dev, "Unmapping device, option: %s.\n",
+		   force ? "force" : "normal");
+
+	/*
+	 * We take explicit module reference only for one reason: do not
+	 * race with lockless ibnbd_destroy_sessions().
+	 */
+	if (!try_module_get(THIS_MODULE)) {
+		err = -ENODEV;
+		goto out;
+	}
+	err = ibnbd_clt_unmap_device(dev, force, &attr->attr);
+	if (unlikely(err)) {
+		if (unlikely(err != -EALREADY))
+			ibnbd_err(dev, "unmap_device: %d\n", err);
+		goto module_put;
+	}
+
+	/*
+	 * The device may already be gone at this point!
+	 */
+
+	err = count;
+
+module_put:
+	module_put(THIS_MODULE);
+out:
+	kfree(opt);
+
+	return err;
+}
+
+static struct kobj_attribute ibnbd_clt_unmap_device_attr =
+	__ATTR(unmap_device, 0644, ibnbd_clt_unmap_dev_show,
+	       ibnbd_clt_unmap_dev_store);
+
+static ssize_t ibnbd_clt_resize_dev_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE,
+			 "Usage: echo <new size in sectors> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibnbd_clt_resize_dev_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int ret;
+	unsigned long sectors;
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	ret = kstrtoul(buf, 0, &sectors);
+	if (ret)
+		return ret;
+
+	ret = ibnbd_clt_resize_disk(dev, (size_t)sectors);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibnbd_clt_resize_dev_attr =
+	__ATTR(resize, 0644, ibnbd_clt_resize_dev_show,
+	       ibnbd_clt_resize_dev_store);
+
+static ssize_t ibnbd_clt_remap_dev_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo <1> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibnbd_clt_remap_dev_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct ibnbd_clt_dev *dev;
+	char *opt, *options;
+	int err;
+
+	opt = kstrdup(buf, GFP_KERNEL);
+	if (!opt)
+		return -ENOMEM;
+
+	options = strstrip(opt);
+	strip(options);
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+	if (!sysfs_streq(options, "1")) {
+		ibnbd_err(dev, "remap_device: Invalid value: %s\n", options);
+		err = -EINVAL;
+		goto out;
+	}
+	err = ibnbd_clt_remap_device(dev);
+	if (likely(!err))
+		err = count;
+
+out:
+	kfree(opt);
+
+	return err;
+}
+
+static struct kobj_attribute ibnbd_clt_remap_device_attr =
+	__ATTR(remap_device, 0644, ibnbd_clt_remap_dev_show,
+	       ibnbd_clt_remap_dev_store);
+
+static ssize_t ibnbd_clt_session_show(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", dev->sess->sessname);
+}
+
+static struct kobj_attribute ibnbd_clt_session_attr =
+	__ATTR(session, 0444, ibnbd_clt_session_show, NULL);
+
+static struct attribute *ibnbd_dev_attrs[] = {
+	&ibnbd_clt_unmap_device_attr.attr,
+	&ibnbd_clt_resize_dev_attr.attr,
+	&ibnbd_clt_remap_device_attr.attr,
+	&ibnbd_clt_mapping_path_attr.attr,
+	&ibnbd_clt_state_attr.attr,
+	&ibnbd_clt_session_attr.attr,
+	&ibnbd_clt_io_mode.attr,
+	NULL,
+};
+
+void ibnbd_clt_remove_dev_symlink(struct ibnbd_clt_dev *dev)
+{
+	/*
+	 * The module_is_live() check is crucial and helps to avoid annoying
+	 * sysfs warning raised in sysfs_remove_link(), when the whole sysfs
+	 * path was just removed, see ibnbd_destroy_sessions().
+	 */
+	if (strlen(dev->blk_symlink_name) && module_is_live(THIS_MODULE))
+		sysfs_remove_link(ibnbd_devs_kobj, dev->blk_symlink_name);
+}
+
+static struct kobj_type ibnbd_dev_ktype = {
+	.sysfs_ops      = &kobj_sysfs_ops,
+	.default_attrs  = ibnbd_dev_attrs,
+};
+
+static int ibnbd_clt_add_dev_kobj(struct ibnbd_clt_dev *dev)
+{
+	int ret;
+	struct kobject *gd_kobj = &disk_to_dev(dev->gd)->kobj;
+
+	ret = kobject_init_and_add(&dev->kobj, &ibnbd_dev_ktype, gd_kobj, "%s",
+				   "ibnbd");
+	if (ret)
+		ibnbd_err(dev, "Failed to create device sysfs dir, err: %d\n",
+			  ret);
+
+	return ret;
+}
+
+static ssize_t ibnbd_clt_map_device_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo \""
+			 "sessname=<name of the ibtrs session>"
+			 " path=<[srcaddr,]dstaddr>"
+			 " [path=<[srcaddr,]dstaddr>]"
+			 " device_path=<full path on remote side>"
+			 " [access_mode=<ro|rw|migration>]"
+			 " [io_mode=<fileio|blockio>]\" > %s\n\n"
+			 "addr ::= [ ip:<ipv4> | ip:<ipv6> | gid:<gid> ]\n",
+			 attr->attr.name);
+}
+
+static int ibnbd_clt_get_path_name(struct ibnbd_clt_dev *dev, char *buf,
+				   size_t len)
+{
+	int ret;
+	char pathname[NAME_MAX], *s;
+
+	strlcpy(pathname, dev->pathname, sizeof(pathname));
+	while ((s = strchr(pathname, '/')))
+		s[0] = '!';
+
+	ret = snprintf(buf, len, "%s", pathname);
+	if (ret >= len)
+		return -ENAMETOOLONG;
+
+	return 0;
+}
+
+static int ibnbd_clt_add_dev_symlink(struct ibnbd_clt_dev *dev)
+{
+	struct kobject *gd_kobj = &disk_to_dev(dev->gd)->kobj;
+	int ret;
+
+	ret = ibnbd_clt_get_path_name(dev, dev->blk_symlink_name,
+				      sizeof(dev->blk_symlink_name));
+	if (ret) {
+		ibnbd_err(dev, "Failed to get /sys/block symlink path, err: %d\n",
+			  ret);
+		goto out_err;
+	}
+
+	ret = sysfs_create_link(ibnbd_devs_kobj, gd_kobj,
+				dev->blk_symlink_name);
+	if (ret) {
+		ibnbd_err(dev, "Creating /sys/block symlink failed, err: %d\n",
+			  ret);
+		goto out_err;
+	}
+
+	return 0;
+
+out_err:
+	dev->blk_symlink_name[0] = '\0';
+	return ret;
+}
+
+static ssize_t ibnbd_clt_map_device_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibnbd_clt_dev *dev;
+	int ret;
+	char pathname[NAME_MAX];
+	char sessname[NAME_MAX];
+	enum ibnbd_access_mode access_mode = IBNBD_ACCESS_RW;
+	enum ibnbd_io_mode io_mode = IBNBD_AUTOIO;
+
+	size_t path_cnt;
+	struct ibtrs_addr paths[3];
+	struct sockaddr_storage saddr[ARRAY_SIZE(paths)];
+	struct sockaddr_storage daddr[ARRAY_SIZE(paths)];
+
+	for (path_cnt = 0; path_cnt < ARRAY_SIZE(paths); path_cnt++) {
+		paths[path_cnt].src = &saddr[path_cnt];
+		paths[path_cnt].dst = &daddr[path_cnt];
+	}
+
+	ret = ibnbd_clt_parse_map_options(buf, sessname, paths,
+					  &path_cnt, ARRAY_SIZE(paths),
+					  pathname, &access_mode, &io_mode);
+	if (ret)
+		return ret;
+
+	pr_info("Mapping device %s on session %s, (access_mode: %s, "
+		"io_mode: %s)\n", pathname, sessname,
+		ibnbd_access_mode_str(access_mode), ibnbd_io_mode_str(io_mode));
+
+	dev = ibnbd_clt_map_device(sessname, paths, path_cnt, pathname,
+				   access_mode, io_mode);
+	if (unlikely(IS_ERR(dev)))
+		return PTR_ERR(dev);
+
+	ret = ibnbd_clt_add_dev_kobj(dev);
+	if (unlikely(ret))
+		goto unmap_dev;
+
+	ret = ibnbd_clt_add_dev_symlink(dev);
+	if (ret)
+		goto unmap_dev;
+
+	return count;
+
+unmap_dev:
+	ibnbd_clt_unmap_device(dev, true, NULL);
+
+	return ret;
+}
+
+static struct kobj_attribute ibnbd_clt_map_device_attr =
+	__ATTR(map_device, 0644,
+	       ibnbd_clt_map_device_show, ibnbd_clt_map_device_store);
+
+static struct attribute *default_attrs[] = {
+	&ibnbd_clt_map_device_attr.attr,
+	NULL,
+};
+
+static struct attribute_group default_attr_group = {
+	.attrs = default_attrs,
+};
+
+int ibnbd_clt_create_sysfs_files(void)
+{
+	int err;
+
+	ibnbd_dev_class = class_create(THIS_MODULE, "ibnbd-client");
+	if (unlikely(IS_ERR(ibnbd_dev_class)))
+		return PTR_ERR(ibnbd_dev_class);
+
+	ibnbd_dev = device_create(ibnbd_dev_class, NULL,
+				  MKDEV(0, 0), NULL, "ctl");
+	if (unlikely(IS_ERR(ibnbd_dev))) {
+		err = PTR_ERR(ibnbd_dev);
+		goto cls_destroy;
+	}
+	ibnbd_devs_kobj = kobject_create_and_add("devices", &ibnbd_dev->kobj);
+	if (unlikely(!ibnbd_devs_kobj)) {
+		err = -ENOMEM;
+		goto dev_destroy;
+	}
+	err = sysfs_create_group(&ibnbd_dev->kobj, &default_attr_group);
+	if (unlikely(err))
+		goto put_devs_kobj;
+
+	return 0;
+
+put_devs_kobj:
+	kobject_del(ibnbd_devs_kobj);
+	kobject_put(ibnbd_devs_kobj);
+dev_destroy:
+	device_destroy(ibnbd_dev_class, MKDEV(0, 0));
+cls_destroy:
+	class_destroy(ibnbd_dev_class);
+
+	return err;
+}
+
+void ibnbd_clt_destroy_default_group(void)
+{
+	sysfs_remove_group(&ibnbd_dev->kobj, &default_attr_group);
+}
+
+void ibnbd_clt_destroy_sysfs_files(void)
+{
+	kobject_del(ibnbd_devs_kobj);
+	kobject_put(ibnbd_devs_kobj);
+	device_destroy(ibnbd_dev_class, MKDEV(0, 0));
+	class_destroy(ibnbd_dev_class);
+}
-- 
2.13.1


* [PATCH v2 20/26] ibnbd: server: private header with server structs and functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (18 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 19/26] ibnbd: client: sysfs interface functions Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 21/26] ibnbd: server: main functionality Roman Pen
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This header describes the main structs and functions used by the
ibnbd-server module, namely structs for managing sessions from different
clients and mapped (opened) devices.
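
The header below binds sessions to devices through struct
ibnbd_srv_sess_dev, which carries two list entries so that the same object
is linked into both its device's list and its session's list.  A
simplified userspace sketch of that layout, assuming a hand-rolled
circular list instead of the kernel's <linux/list.h> (all names here are
illustrative stand-ins, not the driver's code):

```c
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's idea. */
struct list_node {
	struct list_node *prev, *next;
};

void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

void list_add_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

size_t list_len(const struct list_node *head)
{
	size_t n = 0;

	for (const struct list_node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Recover the enclosing object from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct session { struct list_node sess_dev_list; };
struct device  { struct list_node sess_dev_list; };

/* One object per (session, device) pair, linked into both lists. */
struct sess_dev {
	struct list_node dev_list;	/* entry in device->sess_dev_list */
	struct list_node sess_list;	/* entry in session->sess_dev_list */
	struct session *sess;
	struct device *dev;
};

void sess_dev_bind(struct sess_dev *sd, struct session *s, struct device *d)
{
	sd->sess = s;
	sd->dev = d;
	list_add_tail(&sd->sess_list, &s->sess_dev_list);
	list_add_tail(&sd->dev_list, &d->sess_dev_list);
}
```

Walking a device's list then yields every session that has it open, and
walking a session's list yields every device mapped through it.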

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-srv.h | 100 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.h

diff --git a/drivers/block/ibnbd/ibnbd-srv.h b/drivers/block/ibnbd/ibnbd-srv.h
new file mode 100644
index 000000000000..191a1650bc1d
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.h
@@ -0,0 +1,100 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_SRV_H
+#define IBNBD_SRV_H
+
+#include <linux/types.h>
+#include <linux/idr.h>
+#include <linux/kref.h>
+
+#include "ibtrs.h"
+#include "ibnbd-proto.h"
+#include "ibnbd-log.h"
+
+struct ibnbd_srv_session {
+	/* Entry inside global sess_list */
+	struct list_head        list;
+	struct ibtrs_srv	*ibtrs;
+	char			sessname[NAME_MAX];
+	int			queue_depth;
+	struct bio_set		*sess_bio_set;
+
+	rwlock_t                index_lock ____cacheline_aligned;
+	struct idr              index_idr;
+	/* List of struct ibnbd_srv_sess_dev */
+	struct list_head        sess_dev_list;
+	struct mutex		lock;
+	u8			ver;
+};
+
+struct ibnbd_srv_dev {
+	/* Entry inside global dev_list */
+	struct list_head                list;
+	struct kobject                  dev_kobj;
+	struct kobject                  dev_sessions_kobj;
+	struct kref                     kref;
+	char				id[NAME_MAX];
+	/* List of ibnbd_srv_sess_dev structs */
+	struct list_head		sess_dev_list;
+	struct mutex			lock;
+	int				open_write_cnt;
+	enum ibnbd_io_mode		mode;
+};
+
+/* Structure which binds N devices and N sessions */
+struct ibnbd_srv_sess_dev {
+	/* Entry inside ibnbd_srv_dev struct */
+	struct list_head		dev_list;
+	/* Entry inside ibnbd_srv_session struct */
+	struct list_head		sess_list;
+	struct ibnbd_dev		*ibnbd_dev;
+	struct ibnbd_srv_session        *sess;
+	struct ibnbd_srv_dev		*dev;
+	struct kobject                  kobj;
+	struct completion		*sysfs_release_compl;
+	u32                             device_id;
+	fmode_t                         open_flags;
+	struct kref			kref;
+	struct completion               *destroy_comp;
+	char				pathname[NAME_MAX];
+};
+
+/* ibnbd-srv-sysfs.c */
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+			       struct block_device *bdev,
+			       const char *dir_name);
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev);
+int ibnbd_srv_create_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+void ibnbd_srv_destroy_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+int ibnbd_srv_create_sysfs_files(void);
+void ibnbd_srv_destroy_sysfs_files(void);
+
+#endif /* IBNBD_SRV_H */
-- 
2.13.1


* [PATCH v2 21/26] ibnbd: server: main functionality
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (19 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 20/26] ibnbd: server: private header with server structs and functions Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 22/26] ibnbd: server: functionality for IO submission to file or block dev Roman Pen
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the main functionality of the ibnbd-server module, which handles
IBTRS events and IBNBD protocol requests such as map (open) or unmap
(close) device.  The server side is also responsible for processing
incoming IBTRS IO requests and forwarding them to locally mapped devices.
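
The I/O path in this file pins the session device for the lifetime of each
request: ibnbd_get_sess_dev() takes a reference with
kref_get_unless_zero(), and ibnbd_endio() drops it again.  A minimal
userspace model of that "get unless zero" rule, assuming C11 atomics in
place of the kernel's kref (struct ref, ref_init(), ref_get_unless_zero()
and ref_put() are illustrative names, not kernel API):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ref {
	atomic_int count;
};

/* The creator holds the initial reference. */
void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

/*
 * Take a reference only if the object is still alive.  Once the count
 * has reached zero the object is being torn down and must not be
 * resurrected, so the lookup fails instead.
 */
bool ref_get_unless_zero(struct ref *r)
{
	int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
bool ref_put(struct ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

In the driver the final put completes destroy_comp, which is what lets
ibnbd_destroy_sess_dev() wait until all in-flight I/O has finished before
closing the device.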

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-srv.c | 922 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 922 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.c

diff --git a/drivers/block/ibnbd/ibnbd-srv.c b/drivers/block/ibnbd/ibnbd-srv.c
new file mode 100644
index 000000000000..a42a9191dad9
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.c
@@ -0,0 +1,922 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/blkdev.h>
+
+#include "ibnbd-srv.h"
+#include "ibnbd-srv-dev.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_VERSION(IBNBD_VER_STRING);
+MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_DEV_SEARCH_PATH "/"
+
+static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
+
+static int dev_search_path_set(const char *val, const struct kernel_param *kp)
+{
+	char *dup;
+
+	if (strlen(val) >= sizeof(dev_search_path))
+		return -EINVAL;
+
+	dup = kstrdup(val, GFP_KERNEL);
+	if (!dup)
+		return -ENOMEM;
+	if (*dup && dup[strlen(dup) - 1] == '\n')
+		dup[strlen(dup) - 1] = '\0';
+	strlcpy(dev_search_path, dup, sizeof(dev_search_path));
+
+	kfree(dup);
+	pr_info("dev_search_path changed to '%s'\n", dev_search_path);
+
+	return 0;
+}
+
+static struct kparam_string dev_search_path_kparam_str = {
+	.maxlen	= sizeof(dev_search_path),
+	.string	= dev_search_path
+};
+
+static const struct kernel_param_ops dev_search_path_ops = {
+	.set	= dev_search_path_set,
+	.get	= param_get_string,
+};
+
+module_param_cb(dev_search_path, &dev_search_path_ops,
+		&dev_search_path_kparam_str, 0444);
+MODULE_PARM_DESC(dev_search_path, "Sets the dev_search_path."
+		 " When a device is mapped this path is prepended to the"
+		 " device path from the map device operation.  If %SESSNAME%"
+		 " is specified in a path, then device will be searched in a"
+		 " session namespace."
+		 " (default: " DEFAULT_DEV_SEARCH_PATH ")");
+
+static int def_io_mode = IBNBD_BLOCKIO;
+module_param(def_io_mode, int, 0444);
+MODULE_PARM_DESC(def_io_mode, "By default, export devices in"
+		 " blockio(" __stringify(_IBNBD_BLOCKIO) ") or"
+		 " fileio(" __stringify(_IBNBD_FILEIO) ") mode."
+		 " (default: " __stringify(_IBNBD_BLOCKIO) " (blockio))");
+
+static DEFINE_MUTEX(sess_lock);
+static DEFINE_SPINLOCK(dev_lock);
+
+static LIST_HEAD(sess_list);
+static LIST_HEAD(dev_list);
+
+struct ibnbd_io_private {
+	struct ibtrs_srv_op		*id;
+	struct ibnbd_srv_sess_dev	*sess_dev;
+};
+
+static void ibnbd_sess_dev_release(struct kref *kref)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kref, struct ibnbd_srv_sess_dev, kref);
+	complete(sess_dev->destroy_comp);
+}
+
+static inline void ibnbd_put_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	kref_put(&sess_dev->kref, ibnbd_sess_dev_release);
+}
+
+static void ibnbd_endio(void *priv, int error)
+{
+	struct ibnbd_io_private *ibnbd_priv = priv;
+	struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
+
+	ibnbd_put_sess_dev(sess_dev);
+
+	ibtrs_srv_resp_rdma(ibnbd_priv->id, error);
+
+	kfree(priv);
+}
+
+static struct ibnbd_srv_sess_dev *
+ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+	int ret = 0;
+
+	read_lock(&srv_sess->index_lock);
+	sess_dev = idr_find(&srv_sess->index_idr, dev_id);
+	if (likely(sess_dev))
+		ret = kref_get_unless_zero(&sess_dev->kref);
+	read_unlock(&srv_sess->index_lock);
+
+	if (unlikely(!sess_dev || !ret))
+		return ERR_PTR(-ENXIO);
+
+	return sess_dev;
+}
+
+static int process_rdma(struct ibtrs_srv *sess,
+			struct ibnbd_srv_session *srv_sess,
+			struct ibtrs_srv_op *id, void *data, u32 datalen,
+			const void *usr, size_t usrlen)
+{
+	const struct ibnbd_msg_io *msg = usr;
+	struct ibnbd_io_private *priv;
+	struct ibnbd_srv_sess_dev *sess_dev;
+	u32 dev_id;
+	int err;
+
+	priv = kmalloc(sizeof(*priv), GFP_KERNEL);
+	if (unlikely(!priv))
+		return -ENOMEM;
+
+	dev_id = le32_to_cpu(msg->device_id);
+
+	sess_dev = ibnbd_get_sess_dev(dev_id, srv_sess);
+	if (unlikely(IS_ERR(sess_dev))) {
+		pr_err_ratelimited("Got I/O request on session %s for "
+				   "unknown device id %u\n",
+				   srv_sess->sessname, dev_id);
+		err = -ENOTCONN;
+		goto err;
+	}
+
+	priv->sess_dev = sess_dev;
+	priv->id = id;
+
+	err = ibnbd_dev_submit_io(sess_dev->ibnbd_dev, le64_to_cpu(msg->sector),
+				  data, datalen, le32_to_cpu(msg->bi_size),
+				  le32_to_cpu(msg->rw), priv);
+	if (unlikely(err)) {
+		ibnbd_err(sess_dev,
+			  "Submitting I/O to device failed, err: %d\n", err);
+		goto sess_dev_put;
+	}
+
+	return 0;
+
+sess_dev_put:
+	ibnbd_put_sess_dev(sess_dev);
+err:
+	kfree(priv);
+	return err;
+}
+
+static void destroy_device(struct ibnbd_srv_dev *dev)
+{
+	WARN(!list_empty(&dev->sess_dev_list),
+	     "Device %s is being destroyed but still in use!\n",
+	     dev->id);
+
+	spin_lock(&dev_lock);
+	list_del(&dev->list);
+	spin_unlock(&dev_lock);
+
+	if (dev->dev_kobj.state_in_sysfs)
+		/*
+		 * Destroy kobj only if it was really created.
+		 * The following call should be sync, because
+		 *  we free the memory afterwards.
+		 */
+		ibnbd_srv_destroy_dev_sysfs(dev);
+
+	kfree(dev);
+}
+
+static void destroy_device_cb(struct kref *kref)
+{
+	struct ibnbd_srv_dev *dev;
+
+	dev = container_of(kref, struct ibnbd_srv_dev, kref);
+
+	destroy_device(dev);
+}
+
+static void ibnbd_put_srv_dev(struct ibnbd_srv_dev *dev)
+{
+	kref_put(&dev->kref, destroy_device_cb);
+}
+
+static void ibnbd_destroy_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	DECLARE_COMPLETION_ONSTACK(dc);
+
+	write_lock(&sess_dev->sess->index_lock);
+	idr_remove(&sess_dev->sess->index_idr, sess_dev->device_id);
+	write_unlock(&sess_dev->sess->index_lock);
+
+	sess_dev->destroy_comp = &dc;
+	ibnbd_put_sess_dev(sess_dev);
+	wait_for_completion(&dc);
+
+	ibnbd_dev_close(sess_dev->ibnbd_dev);
+	list_del(&sess_dev->sess_list);
+	mutex_lock(&sess_dev->dev->lock);
+	list_del(&sess_dev->dev_list);
+	if (sess_dev->open_flags & FMODE_WRITE)
+		sess_dev->dev->open_write_cnt--;
+	mutex_unlock(&sess_dev->dev->lock);
+
+	ibnbd_put_srv_dev(sess_dev->dev);
+
+	ibnbd_info(sess_dev, "Device closed\n");
+	kfree(sess_dev);
+}
+
+static void destroy_sess(struct ibnbd_srv_session *srv_sess)
+{
+	struct ibnbd_srv_sess_dev *sess_dev, *tmp;
+
+	if (list_empty(&srv_sess->sess_dev_list))
+		goto out;
+
+	mutex_lock(&srv_sess->lock);
+	list_for_each_entry_safe(sess_dev, tmp, &srv_sess->sess_dev_list,
+				 sess_list) {
+		ibnbd_srv_destroy_dev_session_sysfs(sess_dev);
+		ibnbd_destroy_sess_dev(sess_dev);
+	}
+	mutex_unlock(&srv_sess->lock);
+
+out:
+	idr_destroy(&srv_sess->index_idr);
+	bioset_free(srv_sess->sess_bio_set);
+
+	pr_info("IBTRS Session %s disconnected\n", srv_sess->sessname);
+
+	mutex_lock(&sess_lock);
+	list_del(&srv_sess->list);
+	mutex_unlock(&sess_lock);
+
+	kfree(srv_sess);
+}
+
+static int create_sess(struct ibtrs_srv *ibtrs)
+{
+	struct ibnbd_srv_session *srv_sess;
+	char sessname[NAME_MAX];
+	int err;
+
+	err = ibtrs_srv_get_sess_name(ibtrs, sessname, sizeof(sessname));
+	if (unlikely(err)) {
+		pr_err("ibtrs_srv_get_sess_name(), err: %d\n", err);
+
+		return err;
+	}
+	srv_sess = kzalloc(sizeof(*srv_sess), GFP_KERNEL);
+	if (!srv_sess)
+		return -ENOMEM;
+	srv_sess->queue_depth = ibtrs_srv_get_queue_depth(ibtrs);
+	srv_sess->sess_bio_set = bioset_create(srv_sess->queue_depth, 0,
+					       BIOSET_NEED_BVECS);
+	if (!srv_sess->sess_bio_set) {
+		pr_err("Allocating bio_set for session %s failed\n",
+		       sessname);
+		kfree(srv_sess);
+		return -ENOMEM;
+	}
+
+	idr_init(&srv_sess->index_idr);
+	rwlock_init(&srv_sess->index_lock);
+	INIT_LIST_HEAD(&srv_sess->sess_dev_list);
+	mutex_init(&srv_sess->lock);
+	mutex_lock(&sess_lock);
+	list_add(&srv_sess->list, &sess_list);
+	mutex_unlock(&sess_lock);
+
+	srv_sess->ibtrs = ibtrs;
+	strlcpy(srv_sess->sessname, sessname, sizeof(srv_sess->sessname));
+
+	ibtrs_srv_set_sess_priv(ibtrs, srv_sess);
+
+	return 0;
+}
+
+static int ibnbd_srv_link_ev(struct ibtrs_srv *ibtrs,
+			     enum ibtrs_srv_link_ev ev, void *priv)
+{
+	struct ibnbd_srv_session *srv_sess = priv;
+
+	switch (ev) {
+	case IBTRS_SRV_LINK_EV_CONNECTED:
+		return create_sess(ibtrs);
+
+	case IBTRS_SRV_LINK_EV_DISCONNECTED:
+		if (WARN_ON(!srv_sess))
+			return -EINVAL;
+
+		destroy_sess(srv_sess);
+		return 0;
+
+	default:
+		pr_warn("Received unknown IBTRS session event %d from session %s\n",
+			ev, srv_sess ? srv_sess->sessname : "(unknown)");
+		return -EINVAL;
+	}
+}
+
+static int process_msg_close(struct ibtrs_srv *ibtrs,
+			     struct ibnbd_srv_session *srv_sess,
+			     void *data, size_t datalen, const void *usr,
+			     size_t usrlen)
+{
+	const struct ibnbd_msg_close *close_msg = usr;
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = ibnbd_get_sess_dev(le32_to_cpu(close_msg->device_id),
+				      srv_sess);
+	if (unlikely(IS_ERR(sess_dev)))
+		return 0;
+
+	ibnbd_srv_destroy_dev_session_sysfs(sess_dev);
+	ibnbd_put_sess_dev(sess_dev);
+	mutex_lock(&srv_sess->lock);
+	ibnbd_destroy_sess_dev(sess_dev);
+	mutex_unlock(&srv_sess->lock);
+	return 0;
+}
+
+static int process_msg_open(struct ibtrs_srv *ibtrs,
+			    struct ibnbd_srv_session *srv_sess,
+			    const void *msg, size_t len,
+			    void *data, size_t datalen);
+
+static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
+				 struct ibnbd_srv_session *srv_sess,
+				 const void *msg, size_t len,
+				 void *data, size_t datalen);
+
+static int ibnbd_srv_rdma_ev(struct ibtrs_srv *ibtrs, void *priv,
+			     struct ibtrs_srv_op *id, int dir,
+			     void *data, size_t datalen, const void *usr,
+			     size_t usrlen)
+{
+	struct ibnbd_srv_session *srv_sess = priv;
+	const struct ibnbd_msg_hdr *hdr = usr;
+	int ret = 0;
+	u16 type;
+
+	if (unlikely(WARN_ON(!srv_sess)))
+		return -ENODEV;
+
+	type = le16_to_cpu(hdr->type);
+
+	switch (type) {
+	case IBNBD_MSG_IO:
+		return process_rdma(ibtrs, srv_sess, id, data, datalen, usr,
+				    usrlen);
+	case IBNBD_MSG_CLOSE:
+		ret = process_msg_close(ibtrs, srv_sess, data, datalen,
+					usr, usrlen);
+		break;
+	case IBNBD_MSG_OPEN:
+		ret = process_msg_open(ibtrs, srv_sess, usr, usrlen,
+				       data, datalen);
+		break;
+	case IBNBD_MSG_SESS_INFO:
+		ret = process_msg_sess_info(ibtrs, srv_sess, usr, usrlen,
+					    data, datalen);
+		break;
+	default:
+		pr_warn("Received unexpected message type %d with dir %d from"
+			" session %s\n", type, dir, srv_sess->sessname);
+		return -EINVAL;
+	}
+
+	ibtrs_srv_resp_rdma(id, ret);
+	return 0;
+}
+
+static struct ibnbd_srv_sess_dev
+*ibnbd_sess_dev_alloc(struct ibnbd_srv_session *srv_sess)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+	int error;
+
+	sess_dev = kzalloc(sizeof(*sess_dev), GFP_KERNEL);
+	if (!sess_dev)
+		return ERR_PTR(-ENOMEM);
+
+	idr_preload(GFP_KERNEL);
+	write_lock(&srv_sess->index_lock);
+
+	error = idr_alloc(&srv_sess->index_idr, sess_dev, 0, -1, GFP_NOWAIT);
+	if (error < 0) {
+		pr_warn("Allocating idr failed, err: %d\n", error);
+		goto out_unlock;
+	}
+
+	sess_dev->device_id = error;
+	error = 0;
+
+out_unlock:
+	write_unlock(&srv_sess->index_lock);
+	idr_preload_end();
+	if (error) {
+		kfree(sess_dev);
+		return ERR_PTR(error);
+	}
+
+	return sess_dev;
+}
+
+static struct ibnbd_srv_dev *ibnbd_srv_init_srv_dev(const char *id,
+						    enum ibnbd_io_mode mode)
+{
+	struct ibnbd_srv_dev *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	strlcpy(dev->id, id, sizeof(dev->id));
+	dev->mode = mode;
+	kref_init(&dev->kref);
+	INIT_LIST_HEAD(&dev->sess_dev_list);
+	mutex_init(&dev->lock);
+
+	return dev;
+}
+
+static struct ibnbd_srv_dev *
+ibnbd_srv_find_or_add_srv_dev(struct ibnbd_srv_dev *new_dev)
+{
+	struct ibnbd_srv_dev *dev;
+
+	spin_lock(&dev_lock);
+	list_for_each_entry(dev, &dev_list, list) {
+		if (!strncmp(dev->id, new_dev->id, sizeof(dev->id))) {
+			if (!kref_get_unless_zero(&dev->kref))
+				/*
+				 * We lost the race, device is almost dead.
+				 *  Continue traversing to find a valid one.
+				 */
+				continue;
+			spin_unlock(&dev_lock);
+			return dev;
+		}
+	}
+	list_add(&new_dev->list, &dev_list);
+	spin_unlock(&dev_lock);
+
+	return new_dev;
+}
+
+static int ibnbd_srv_check_update_open_perm(struct ibnbd_srv_dev *srv_dev,
+					    struct ibnbd_srv_session *srv_sess,
+					    enum ibnbd_io_mode io_mode,
+					    enum ibnbd_access_mode access_mode)
+{
+	int ret = -EPERM;
+
+	mutex_lock(&srv_dev->lock);
+
+	if (srv_dev->mode != io_mode) {
+		pr_err("Mapping device '%s' for session %s in %s mode forbidden,"
+		       " device is already mapped from other client(s) in"
+		       " %s mode\n", srv_dev->id, srv_sess->sessname,
+		       ibnbd_io_mode_str(io_mode),
+		       ibnbd_io_mode_str(srv_dev->mode));
+		goto out;
+	}
+
+	switch (access_mode) {
+	case IBNBD_ACCESS_RO:
+		ret = 0;
+		break;
+	case IBNBD_ACCESS_RW:
+		if (srv_dev->open_write_cnt == 0)  {
+			srv_dev->open_write_cnt++;
+			ret = 0;
+		} else {
+			pr_err("Mapping device '%s' for session %s with"
+			       " RW permissions failed. Device already opened"
+			       " as 'RW' by %d client(s) in %s mode.\n",
+			       srv_dev->id, srv_sess->sessname,
+			       srv_dev->open_write_cnt,
+			       ibnbd_io_mode_str(srv_dev->mode));
+		}
+		break;
+	case IBNBD_ACCESS_MIGRATION:
+		if (srv_dev->open_write_cnt < 2) {
+			srv_dev->open_write_cnt++;
+			ret = 0;
+		} else {
+			pr_err("Mapping device '%s' for session %s with"
+			       " migration permissions failed. Device already"
+			       " opened as 'RW' by %d client(s) in %s mode.\n",
+			       srv_dev->id, srv_sess->sessname,
+			       srv_dev->open_write_cnt,
+			       ibnbd_io_mode_str(srv_dev->mode));
+		}
+		break;
+	default:
+		pr_err("Received mapping request for device '%s' on session %s"
+		       " with invalid access mode: %d\n", srv_dev->id,
+		       srv_sess->sessname, access_mode);
+		ret = -EINVAL;
+	}
+
+out:
+	mutex_unlock(&srv_dev->lock);
+
+	return ret;
+}
+
+static struct ibnbd_srv_dev *
+ibnbd_srv_get_or_create_srv_dev(struct ibnbd_dev *ibnbd_dev,
+				struct ibnbd_srv_session *srv_sess,
+				enum ibnbd_io_mode io_mode,
+				enum ibnbd_access_mode access_mode)
+{
+	int ret;
+	struct ibnbd_srv_dev *new_dev, *dev;
+
+	new_dev = ibnbd_srv_init_srv_dev(ibnbd_dev->name, io_mode);
+	if (IS_ERR(new_dev))
+		return new_dev;
+
+	dev = ibnbd_srv_find_or_add_srv_dev(new_dev);
+	if (dev != new_dev)
+		kfree(new_dev);
+
+	ret = ibnbd_srv_check_update_open_perm(dev, srv_sess, io_mode,
+					       access_mode);
+	if (ret) {
+		ibnbd_put_srv_dev(dev);
+		return ERR_PTR(ret);
+	}
+
+	return dev;
+}
+
+static void ibnbd_srv_fill_msg_open_rsp(struct ibnbd_msg_open_rsp *rsp,
+					struct ibnbd_srv_sess_dev *sess_dev)
+{
+	struct ibnbd_dev *ibnbd_dev = sess_dev->ibnbd_dev;
+
+	rsp->hdr.type = cpu_to_le16(IBNBD_MSG_OPEN_RSP);
+	rsp->device_id =
+		cpu_to_le32(sess_dev->device_id);
+	rsp->nsectors =
+		cpu_to_le64(get_capacity(ibnbd_dev->bdev->bd_disk));
+	rsp->logical_block_size	=
+		cpu_to_le16(ibnbd_dev_get_logical_bsize(ibnbd_dev));
+	rsp->physical_block_size =
+		cpu_to_le16(ibnbd_dev_get_phys_bsize(ibnbd_dev));
+	rsp->max_segments =
+		cpu_to_le16(ibnbd_dev_get_max_segs(ibnbd_dev));
+	rsp->max_hw_sectors =
+		cpu_to_le32(ibnbd_dev_get_max_hw_sects(ibnbd_dev));
+	rsp->max_write_same_sectors =
+		cpu_to_le32(ibnbd_dev_get_max_write_same_sects(ibnbd_dev));
+	rsp->max_discard_sectors =
+		cpu_to_le32(ibnbd_dev_get_max_discard_sects(ibnbd_dev));
+	rsp->discard_granularity =
+		cpu_to_le32(ibnbd_dev_get_discard_granularity(ibnbd_dev));
+	rsp->discard_alignment =
+		cpu_to_le32(ibnbd_dev_get_discard_alignment(ibnbd_dev));
+	rsp->secure_discard =
+		cpu_to_le16(ibnbd_dev_get_secure_discard(ibnbd_dev));
+	rsp->rotational =
+		!blk_queue_nonrot(bdev_get_queue(ibnbd_dev->bdev));
+	rsp->io_mode =
+		ibnbd_dev->mode;
+}
+
+static struct ibnbd_srv_sess_dev *
+ibnbd_srv_create_set_sess_dev(struct ibnbd_srv_session *srv_sess,
+			      const struct ibnbd_msg_open *open_msg,
+			      struct ibnbd_dev *ibnbd_dev, fmode_t open_flags,
+			      struct ibnbd_srv_dev *srv_dev)
+{
+	struct ibnbd_srv_sess_dev *sdev = ibnbd_sess_dev_alloc(srv_sess);
+
+	if (IS_ERR(sdev))
+		return sdev;
+
+	kref_init(&sdev->kref);
+
+	strlcpy(sdev->pathname, open_msg->dev_name, sizeof(sdev->pathname));
+
+	sdev->ibnbd_dev		= ibnbd_dev;
+	sdev->sess		= srv_sess;
+	sdev->dev		= srv_dev;
+	sdev->open_flags	= open_flags;
+
+	return sdev;
+}
+
+static char *ibnbd_srv_get_full_path(struct ibnbd_srv_session *srv_sess,
+				     const char *dev_name)
+{
+	char *full_path;
+	char *a, *b;
+
+	full_path = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!full_path)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Replace %SESSNAME% with a real session name in order to
+	 * create device namespace.
+	 */
+	a = strnstr(dev_search_path, "%SESSNAME%", sizeof(dev_search_path));
+	if (a) {
+		int len = a - dev_search_path;
+
+		len = snprintf(full_path, PATH_MAX, "%.*s/%s/%s", len,
+			       dev_search_path, srv_sess->sessname, dev_name);
+		if (len >= PATH_MAX) {
+			pr_err("Path too long: %s, %s, %s\n",
+			       dev_search_path, srv_sess->sessname, dev_name);
+			kfree(full_path);
+			return ERR_PTR(-EINVAL);
+		}
+	} else {
+		snprintf(full_path, PATH_MAX, "%s/%s",
+			 dev_search_path, dev_name);
+	}
+	/* eliminate duplicated slashes */
+	a = strchr(full_path, '/');
+	b = a;
+	while (*b != '\0') {
+		if (*b == '/' && *a == '/') {
+			b++;
+		} else {
+			a++;
+			*a = *b;
+			b++;
+		}
+	}
+	a++;
+	*a = '\0';
+
+	return full_path;
+}
+
+static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
+				 struct ibnbd_srv_session *srv_sess,
+				 const void *msg, size_t len,
+				 void *data, size_t datalen)
+{
+	const struct ibnbd_msg_sess_info *sess_info_msg = msg;
+	struct ibnbd_msg_sess_info_rsp *rsp = data;
+
+	srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
+	pr_debug("Session %s using protocol version %d (client version: %d,"
+		 " server version: %d)\n", srv_sess->sessname,
+		 srv_sess->ver, sess_info_msg->ver, IBNBD_PROTO_VER_MAJOR);
+
+	rsp->hdr.type = cpu_to_le16(IBNBD_MSG_SESS_INFO_RSP);
+	rsp->ver = srv_sess->ver;
+
+	return 0;
+}
+
+/**
+ * find_srv_sess_dev() - find a device already opened under this name
+ *
+ * Return: the ibnbd_srv_sess_dev if @srv_sess already opened @dev_name,
+ * or NULL if the session has not opened the device yet.
+ */
+static struct ibnbd_srv_sess_dev *
+find_srv_sess_dev(struct ibnbd_srv_session *srv_sess, const char *dev_name)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	if (list_empty(&srv_sess->sess_dev_list))
+		return NULL;
+
+	list_for_each_entry(sess_dev, &srv_sess->sess_dev_list, sess_list)
+		if (!strcmp(sess_dev->pathname, dev_name))
+			return sess_dev;
+
+	return NULL;
+}
+
+static int process_msg_open(struct ibtrs_srv *ibtrs,
+			    struct ibnbd_srv_session *srv_sess,
+			    const void *msg, size_t len,
+			    void *data, size_t datalen)
+{
+	int ret;
+	struct ibnbd_srv_dev *srv_dev;
+	struct ibnbd_srv_sess_dev *srv_sess_dev;
+	const struct ibnbd_msg_open *open_msg = msg;
+	fmode_t open_flags;
+	char *full_path;
+	struct ibnbd_dev *ibnbd_dev;
+	enum ibnbd_io_mode io_mode;
+	struct ibnbd_msg_open_rsp *rsp = data;
+
+	pr_debug("Open message received: session='%s' path='%s' access_mode=%d"
+		 " io_mode=%d\n", srv_sess->sessname, open_msg->dev_name,
+		 open_msg->access_mode, open_msg->io_mode);
+	open_flags = FMODE_READ;
+	if (open_msg->access_mode != IBNBD_ACCESS_RO)
+		open_flags |= FMODE_WRITE;
+
+	mutex_lock(&srv_sess->lock);
+
+	srv_sess_dev = find_srv_sess_dev(srv_sess, open_msg->dev_name);
+	if (srv_sess_dev)
+		goto fill_response;
+
+	if ((strlen(dev_search_path) + strlen(open_msg->dev_name))
+	    >= PATH_MAX) {
+		pr_err("Opening device for session %s failed, device path too"
+		       " long. '%s/%s' is longer than PATH_MAX (%d)\n",
+		       srv_sess->sessname, dev_search_path, open_msg->dev_name,
+		       PATH_MAX);
+		ret = -EINVAL;
+		goto reject;
+	}
+	full_path = ibnbd_srv_get_full_path(srv_sess, open_msg->dev_name);
+	if (IS_ERR(full_path)) {
+		ret = PTR_ERR(full_path);
+		pr_err("Opening device '%s' for client %s failed,"
+		       " failed to get device full path, err: %d\n",
+		       open_msg->dev_name, srv_sess->sessname, ret);
+		goto reject;
+	}
+
+	if (open_msg->io_mode == IBNBD_BLOCKIO)
+		io_mode = IBNBD_BLOCKIO;
+	else if (open_msg->io_mode == IBNBD_FILEIO)
+		io_mode = IBNBD_FILEIO;
+	else
+		io_mode = def_io_mode;
+
+	ibnbd_dev = ibnbd_dev_open(full_path, open_flags, io_mode,
+				   srv_sess->sess_bio_set, ibnbd_endio);
+	if (IS_ERR(ibnbd_dev)) {
+		pr_err("Opening device '%s' on session %s failed,"
+		       " failed to open the block device, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(ibnbd_dev));
+		ret = PTR_ERR(ibnbd_dev);
+		goto free_path;
+	}
+
+	srv_dev = ibnbd_srv_get_or_create_srv_dev(ibnbd_dev, srv_sess, io_mode,
+						  open_msg->access_mode);
+	if (IS_ERR(srv_dev)) {
+		pr_err("Opening device '%s' on session %s failed,"
+		       " creating srv_dev failed, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(srv_dev));
+		ret = PTR_ERR(srv_dev);
+		goto ibnbd_dev_close;
+	}
+
+	srv_sess_dev = ibnbd_srv_create_set_sess_dev(srv_sess, open_msg,
+						     ibnbd_dev, open_flags,
+						     srv_dev);
+	if (IS_ERR(srv_sess_dev)) {
+		pr_err("Opening device '%s' on session %s failed,"
+		       " creating sess_dev failed, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(srv_sess_dev));
+		ret = PTR_ERR(srv_sess_dev);
+		goto srv_dev_put;
+	}
+
+	/* Create the srv_dev sysfs files if they haven't been created yet. The
+	 * reason to delay the creation is not to create the sysfs files before
+	 * we are sure the device can be opened.
+	 */
+	mutex_lock(&srv_dev->lock);
+	if (!srv_dev->dev_kobj.state_in_sysfs) {
+		ret = ibnbd_srv_create_dev_sysfs(srv_dev, ibnbd_dev->bdev,
+						 ibnbd_dev->name);
+		if (ret) {
+			mutex_unlock(&srv_dev->lock);
+			ibnbd_err(srv_sess_dev, "Opening device failed, failed to"
+				  " create device sysfs files, err: %d\n",
+				  ret);
+			goto free_srv_sess_dev;
+		}
+	}
+
+	ret = ibnbd_srv_create_dev_session_sysfs(srv_sess_dev);
+	if (ret) {
+		mutex_unlock(&srv_dev->lock);
+		ibnbd_err(srv_sess_dev, "Opening device failed, failed to create"
+			  " dev client sysfs files, err: %d\n", ret);
+		goto free_srv_sess_dev;
+	}
+
+	list_add(&srv_sess_dev->dev_list, &srv_dev->sess_dev_list);
+	mutex_unlock(&srv_dev->lock);
+
+	list_add(&srv_sess_dev->sess_list, &srv_sess->sess_dev_list);
+
+	ibnbd_info(srv_sess_dev, "Opened device '%s' in %s mode\n",
+		   srv_dev->id, ibnbd_io_mode_str(io_mode));
+
+	kfree(full_path);
+
+fill_response:
+	ibnbd_srv_fill_msg_open_rsp(rsp, srv_sess_dev);
+	mutex_unlock(&srv_sess->lock);
+	return 0;
+
+free_srv_sess_dev:
+	write_lock(&srv_sess->index_lock);
+	idr_remove(&srv_sess->index_idr, srv_sess_dev->device_id);
+	write_unlock(&srv_sess->index_lock);
+	kfree(srv_sess_dev);
+srv_dev_put:
+	if (open_msg->access_mode != IBNBD_ACCESS_RO) {
+		mutex_lock(&srv_dev->lock);
+		srv_dev->open_write_cnt--;
+		mutex_unlock(&srv_dev->lock);
+	}
+	ibnbd_put_srv_dev(srv_dev);
+ibnbd_dev_close:
+	ibnbd_dev_close(ibnbd_dev);
+free_path:
+	kfree(full_path);
+reject:
+	mutex_unlock(&srv_sess->lock);
+	return ret;
+}
+
+static struct ibtrs_srv_ctx *ibtrs_ctx;
+
+static int __init ibnbd_srv_init_module(void)
+{
+	int err;
+
+	pr_info("Loading module %s, version %s, proto %s\n",
+		KBUILD_MODNAME, IBNBD_VER_STRING, IBNBD_PROTO_VER_STRING);
+
+	ibtrs_ctx = ibtrs_srv_open(ibnbd_srv_rdma_ev, ibnbd_srv_link_ev,
+				   IBTRS_PORT);
+	if (unlikely(IS_ERR(ibtrs_ctx))) {
+		err = PTR_ERR(ibtrs_ctx);
+		pr_err("ibtrs_srv_open(), err: %d\n", err);
+		goto out;
+	}
+	err = ibnbd_dev_init();
+	if (err) {
+		pr_err("ibnbd_dev_init(), err: %d\n", err);
+		goto srv_close;
+	}
+
+	err = ibnbd_srv_create_sysfs_files();
+	if (err) {
+		pr_err("ibnbd_srv_create_sysfs_files(), err: %d\n", err);
+		goto dev_destroy;
+	}
+
+	return 0;
+
+dev_destroy:
+	ibnbd_dev_destroy();
+srv_close:
+	ibtrs_srv_close(ibtrs_ctx);
+out:
+
+	return err;
+}
+
+static void __exit ibnbd_srv_cleanup_module(void)
+{
+	ibtrs_srv_close(ibtrs_ctx);
+	WARN_ON(!list_empty(&sess_list));
+	ibnbd_srv_destroy_sysfs_files();
+	ibnbd_dev_destroy();
+	pr_info("Module unloaded\n");
+}
+
+module_init(ibnbd_srv_init_module);
+module_exit(ibnbd_srv_cleanup_module);
-- 
2.13.1
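
Note on the duplicate-slash elimination in ibnbd_srv_get_full_path():
the in-place loop can be tried out in userspace. The following is only
a minimal sketch of the same idea, not the code from the patch; the
name sketch_collapse_slashes() is hypothetical, and the loop is written
with separate source and destination pointers for readability.

```c
#include <assert.h>
#include <string.h>

/*
 * Collapse every run of '/' in @path into a single '/', in place.
 * Hypothetical userspace helper that mirrors the loop in
 * ibnbd_srv_get_full_path(); it is not part of the patch.
 */
static void sketch_collapse_slashes(char *path)
{
	char *dst, *src;

	assert(path != NULL);

	/* dst always points at the last character that was kept */
	dst = strchr(path, '/');
	if (!dst)
		return;		/* no slash at all, nothing to collapse */

	for (src = dst + 1; *src != '\0'; src++) {
		if (*src == '/' && *dst == '/')
			continue;	/* repeated slash, drop it */
		*++dst = *src;
	}
	*++dst = '\0';
}
```

Unlike the loop in the patch, the sketch returns early when the string
contains no '/' at all; in the patch this cannot happen because the
path is always built from a "%s/%s" format.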

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 22/26] ibnbd: server: functionality for IO submission to file or block dev
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (20 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 21/26] ibnbd: server: main functionality Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 23/26] ibnbd: server: sysfs interface functions Roman Pen
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This patch provides helper functions for submitting I/O to a file or a
block device.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
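Note on ibnbd_bio_map_kern(): the function walks the RDMA buffer page
by page and adds one segment per page to the bio. The arithmetic it
uses (page count, then offset and byte count per page) can be sketched
in userspace as below. This is only an illustration under stated
assumptions: 4 KiB pages, and the names SKETCH_PAGE_SIZE, struct chunk,
sketch_nr_pages() and sketch_split() are hypothetical and do not appear
in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12				/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

struct chunk {
	size_t offset;	/* offset of the segment inside its page */
	size_t bytes;	/* number of bytes taken from that page */
};

/* Number of pages touched by the byte range [addr, addr + len). */
static size_t sketch_nr_pages(uintptr_t addr, size_t len)
{
	uintptr_t end = (addr + len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	uintptr_t start = addr >> SKETCH_PAGE_SHIFT;

	return end - start;
}

/*
 * Split [addr, addr + len) into per-page segments, the way the loop in
 * ibnbd_bio_map_kern() does before handing each one to the bio.
 * Returns the number of segments written to @out (at most @max).
 */
static size_t sketch_split(uintptr_t addr, size_t len,
			   struct chunk *out, size_t max)
{
	size_t offset = addr & (SKETCH_PAGE_SIZE - 1);
	size_t n = 0;

	assert(out != NULL);

	while (len > 0 && n < max) {
		size_t bytes = SKETCH_PAGE_SIZE - offset;

		if (bytes > len)
			bytes = len;

		out[n].offset = offset;
		out[n].bytes = bytes;
		n++;

		len -= bytes;
		offset = 0;	/* only the first page can start mid-page */
	}

	return n;
}
```

For example, a 4096-byte buffer that starts 0x800 bytes into a page
spans two pages and yields two segments of 0x800 bytes each.
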
 drivers/block/ibnbd/ibnbd-srv-dev.c | 410 ++++++++++++++++++++++++++++++++++++
 drivers/block/ibnbd/ibnbd-srv-dev.h | 149 +++++++++++++
 2 files changed, 559 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h

diff --git a/drivers/block/ibnbd/ibnbd-srv-dev.c b/drivers/block/ibnbd/ibnbd-srv-dev.c
new file mode 100644
index 000000000000..a5894849b9d5
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-dev.c
@@ -0,0 +1,410 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibnbd-srv-dev.h"
+#include "ibnbd-log.h"
+
+#define IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS 0
+
+struct ibnbd_dev_file_io_work {
+	struct ibnbd_dev	*dev;
+	void			*priv;
+
+	sector_t		sector;
+	void			*data;
+	size_t			len;
+	size_t			bi_size;
+	enum ibnbd_io_flags	flags;
+
+	struct work_struct	work;
+};
+
+struct ibnbd_dev_blk_io {
+	struct ibnbd_dev *dev;
+	void		 *priv;
+};
+
+static struct workqueue_struct *fileio_wq;
+
+int ibnbd_dev_init(void)
+{
+	fileio_wq = alloc_workqueue("%s", WQ_UNBOUND,
+				    IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS,
+				    "ibnbd_server_fileio_wq");
+	if (!fileio_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void ibnbd_dev_destroy(void)
+{
+	destroy_workqueue(fileio_wq);
+}
+
+static inline struct block_device *ibnbd_dev_open_bdev(const char *path,
+						       fmode_t flags)
+{
+	return blkdev_get_by_path(path, flags, THIS_MODULE);
+}
+
+static int ibnbd_dev_blk_open(struct ibnbd_dev *dev, const char *path,
+			      fmode_t flags)
+{
+	dev->bdev = ibnbd_dev_open_bdev(path, flags);
+	return PTR_ERR_OR_ZERO(dev->bdev);
+}
+
+static int ibnbd_dev_vfs_open(struct ibnbd_dev *dev, const char *path,
+			      fmode_t flags)
+{
+	int oflags = O_DSYNC; /* enable write-through */
+
+	if (flags & FMODE_WRITE)
+		oflags |= O_RDWR;
+	else if (flags & FMODE_READ)
+		oflags |= O_RDONLY;
+	else
+		return -EINVAL;
+
+	dev->file = filp_open(path, oflags, 0);
+	return PTR_ERR_OR_ZERO(dev->file);
+}
+
+struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
+				 enum ibnbd_io_mode mode, struct bio_set *bs,
+				 ibnbd_dev_io_fn io_cb)
+{
+	struct ibnbd_dev *dev;
+	int ret;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	if (mode == IBNBD_BLOCKIO) {
+		dev->blk_open_flags = flags;
+		ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+		if (ret)
+			goto err;
+	} else if (mode == IBNBD_FILEIO) {
+		dev->blk_open_flags = FMODE_READ;
+		ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+		if (ret)
+			goto err;
+
+		ret = ibnbd_dev_vfs_open(dev, path, flags);
+		if (ret)
+			goto blk_put;
+	}
+
+	/* blk_open_flags already holds the flags the bdev was opened with */
+	dev->mode		= mode;
+	dev->io_cb		= io_cb;
+	bdevname(dev->bdev, dev->name);
+	dev->ibd_bio_set	= bs;
+
+	return dev;
+
+blk_put:
+	blkdev_put(dev->bdev, dev->blk_open_flags);
+err:
+	kfree(dev);
+	return ERR_PTR(ret);
+}
+
+void ibnbd_dev_close(struct ibnbd_dev *dev)
+{
+	flush_workqueue(fileio_wq);
+	blkdev_put(dev->bdev, dev->blk_open_flags);
+	if (dev->mode == IBNBD_FILEIO)
+		filp_close(dev->file, dev->file);
+	kfree(dev);
+}
+
+static void ibnbd_dev_bi_end_io(struct bio *bio)
+{
+	struct ibnbd_dev_blk_io *io = bio->bi_private;
+
+	io->dev->io_cb(io->priv, blk_status_to_errno(bio->bi_status));
+	bio_put(bio);
+	kfree(io);
+}
+
+static void bio_map_kern_endio(struct bio *bio)
+{
+	bio_put(bio);
+}
+
+/**
+ *	ibnbd_bio_map_kern	-	map kernel address into bio
+ *	@q: the struct request_queue for the bio
+ *	@data: pointer to buffer to map
+ *	@bs: bio_set to use.
+ *	@len: length in bytes
+ *	@gfp_mask: allocation flags for bio allocation
+ *
+ *	Map the kernel address into a bio suitable for io to a block
+ *	device. Returns an error pointer in case of error.
+ */
+static struct bio *ibnbd_bio_map_kern(struct request_queue *q, void *data,
+				      struct bio_set *bs,
+				      unsigned int len, gfp_t gfp_mask)
+{
+	unsigned long kaddr = (unsigned long)data;
+	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	unsigned long start = kaddr >> PAGE_SHIFT;
+	const int nr_pages = end - start;
+	int offset, i;
+	struct bio *bio;
+
+	bio = bio_alloc_bioset(gfp_mask, nr_pages, bs);
+	if (!bio)
+		return ERR_PTR(-ENOMEM);
+
+	offset = offset_in_page(kaddr);
+	for (i = 0; i < nr_pages; i++) {
+		unsigned int bytes = PAGE_SIZE - offset;
+
+		if (!len)
+			break;
+
+		if (bytes > len)
+			bytes = len;
+
+		if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
+				    offset) < bytes) {
+			/* we don't support partial mappings */
+			bio_put(bio);
+			return ERR_PTR(-EINVAL);
+		}
+
+		data += bytes;
+		len -= bytes;
+		offset = 0;
+	}
+
+	bio->bi_end_io = bio_map_kern_endio;
+	return bio;
+}
+
+static int ibnbd_dev_blk_submit_io(struct ibnbd_dev *dev, sector_t sector,
+				   void *data, size_t len, u32 bi_size,
+				   enum ibnbd_io_flags flags, void *priv)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+	struct ibnbd_dev_blk_io *io;
+	struct bio *bio;
+
+	/* check if the buffer is suitable for bdev */
+	if (unlikely(WARN_ON(!blk_rq_aligned(q, (unsigned long)data, len))))
+		return -EINVAL;
+
+	/* Generate bio with pages pointing to the rdma buffer */
+	bio = ibnbd_bio_map_kern(q, data, dev->ibd_bio_set, len, GFP_KERNEL);
+	if (unlikely(IS_ERR(bio)))
+		return PTR_ERR(bio);
+
+	io = kmalloc(sizeof(*io), GFP_KERNEL);
+	if (unlikely(!io)) {
+		bio_put(bio);
+		return -ENOMEM;
+	}
+
+	io->dev		= dev;
+	io->priv	= priv;
+
+	bio->bi_end_io		= ibnbd_dev_bi_end_io;
+	bio->bi_private		= io;
+	bio->bi_opf		= ibnbd_to_bio_flags(flags);
+	bio->bi_iter.bi_sector	= sector;
+	bio->bi_iter.bi_size	= bi_size;
+	bio_set_dev(bio, dev->bdev);
+
+	submit_bio(bio);
+
+	return 0;
+}
+
+static int ibnbd_dev_file_handle_flush(struct ibnbd_dev_file_io_work *w,
+				       loff_t start)
+{
+	int ret;
+	loff_t end;
+	int len = w->bi_size;
+
+	if (len)
+		end = start + len - 1;
+	else
+		end = LLONG_MAX;
+
+	ret = vfs_fsync_range(w->dev->file, start, end, 1);
+	if (unlikely(ret))
+		pr_info_ratelimited("I/O FLUSH failed on %s, vfs_fsync_range err: %d\n",
+				    w->dev->name, ret);
+	return ret;
+}
+
+static int ibnbd_dev_file_handle_fua(struct ibnbd_dev_file_io_work *w,
+				     loff_t start)
+{
+	int ret;
+	loff_t end;
+	int len = w->bi_size;
+
+	if (len)
+		end = start + len - 1;
+	else
+		end = LLONG_MAX;
+
+	ret = vfs_fsync_range(w->dev->file, start, end, 1);
+	if (unlikely(ret))
+		pr_info_ratelimited("I/O FUA failed on %s, vfs_fsync_range err: %d\n",
+				    w->dev->name, ret);
+	return ret;
+}
+
+static int ibnbd_dev_file_handle_write_same(struct ibnbd_dev_file_io_work *w)
+{
+	int i;
+
+	if (unlikely(WARN_ON(w->bi_size % w->len)))
+		return -EINVAL;
+
+	for (i = 1; i < w->bi_size / w->len; i++)
+		memcpy(w->data + i * w->len, w->data, w->len);
+
+	return 0;
+}
+
+static void ibnbd_dev_file_submit_io_worker(struct work_struct *w)
+{
+	struct ibnbd_dev_file_io_work *dev_work;
+	struct file *f;
+	int ret = 0, len;
+	loff_t off;
+
+	dev_work = container_of(w, struct ibnbd_dev_file_io_work, work);
+	off = dev_work->sector * ibnbd_dev_get_logical_bsize(dev_work->dev);
+	f = dev_work->dev->file;
+	len = dev_work->bi_size;
+
+	if (ibnbd_op(dev_work->flags) == IBNBD_OP_FLUSH) {
+		ret = ibnbd_dev_file_handle_flush(dev_work, off);
+		if (unlikely(ret))
+			goto out;
+	}
+
+	if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE_SAME) {
+		ret = ibnbd_dev_file_handle_write_same(dev_work);
+		if (unlikely(ret))
+			goto out;
+	}
+
+	/* TODO Implement support for DIRECT */
+	if (dev_work->bi_size) {
+		loff_t off_tmp = off;
+
+		if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE)
+			ret = kernel_write(f, dev_work->data, dev_work->bi_size,
+					   &off_tmp);
+		else
+			ret = kernel_read(f, dev_work->data, dev_work->bi_size,
+					  &off_tmp);
+
+		if (unlikely(ret < 0)) {
+			goto out;
+		} else if (unlikely(ret != dev_work->bi_size)) {
+			/* TODO implement support for partial completions */
+			ret = -EIO;
+			goto out;
+		} else {
+			ret = 0;
+		}
+	}
+
+	if (dev_work->flags & IBNBD_F_FUA)
+		ret = ibnbd_dev_file_handle_fua(dev_work, off);
+out:
+	dev_work->dev->io_cb(dev_work->priv, ret);
+	kfree(dev_work);
+}
+
+static int ibnbd_dev_file_submit_io(struct ibnbd_dev *dev, sector_t sector,
+				    void *data, size_t len, size_t bi_size,
+				    enum ibnbd_io_flags flags, void *priv)
+{
+	struct ibnbd_dev_file_io_work *w;
+
+	if (!ibnbd_flags_supported(flags)) {
+		pr_info_ratelimited("Unsupported I/O flags: 0x%x on device "
+				    "%s\n", flags, dev->name);
+		return -EOPNOTSUPP;
+	}
+
+	w = kmalloc(sizeof(*w), GFP_KERNEL);
+	if (!w)
+		return -ENOMEM;
+
+	w->dev		= dev;
+	w->priv		= priv;
+	w->sector	= sector;
+	w->data		= data;
+	w->len		= len;
+	w->bi_size	= bi_size;
+	w->flags	= flags;
+	INIT_WORK(&w->work, ibnbd_dev_file_submit_io_worker);
+
+	if (unlikely(!queue_work(fileio_wq, &w->work))) {
+		kfree(w);
+		return -EEXIST;
+	}
+
+	return 0;
+}
+
+int ibnbd_dev_submit_io(struct ibnbd_dev *dev, sector_t sector, void *data,
+			size_t len, u32 bi_size, enum ibnbd_io_flags flags,
+			void *priv)
+{
+	if (dev->mode == IBNBD_FILEIO)
+		return ibnbd_dev_file_submit_io(dev, sector, data, len, bi_size,
+						flags, priv);
+	else if (dev->mode == IBNBD_BLOCKIO)
+		return ibnbd_dev_blk_submit_io(dev, sector, data, len, bi_size,
+					       flags, priv);
+
+	pr_warn("Submitting I/O to %s failed, dev->mode contains invalid "
+		"value: '%d', memory corrupted?\n", dev->name, dev->mode);
+
+	return -EINVAL;
+}
diff --git a/drivers/block/ibnbd/ibnbd-srv-dev.h b/drivers/block/ibnbd/ibnbd-srv-dev.h
new file mode 100644
index 000000000000..2c02038d1f36
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-dev.h
@@ -0,0 +1,149 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_SRV_DEV_H
+#define IBNBD_SRV_DEV_H
+
+#include <linux/fs.h>
+#include "ibnbd-proto.h"
+
+typedef void ibnbd_dev_io_fn(void *priv, int error);
+
+struct ibnbd_dev {
+	struct block_device	*bdev;
+	struct bio_set		*ibd_bio_set;
+	struct file		*file;
+	fmode_t			blk_open_flags;
+	enum ibnbd_io_mode	mode;
+	char			name[BDEVNAME_SIZE];
+	ibnbd_dev_io_fn		*io_cb;
+};
+
+/** ibnbd_dev_init() - Initialize ibnbd_dev
+ *
+ * This function initializes the ibnbd-dev component.
+ * It has to be called once before ibnbd_dev_open() is used.
+ */
+int ibnbd_dev_init(void);
+
+/** ibnbd_dev_destroy() - Destroy ibnbd_dev
+ *
+ * This function destroys the ibnbd-dev component.
+ * It has to be called after the last device was closed.
+ */
+void ibnbd_dev_destroy(void);
+
+/**
+ * ibnbd_dev_open() - Open a device
+ * @flags:	open flags
+ * @mode:	open via VFS or block layer
+ * @bs:		bio_set to use during block io
+ * @io_cb:	called when I/O has finished
+ */
+struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
+				 enum ibnbd_io_mode mode, struct bio_set *bs,
+				 ibnbd_dev_io_fn io_cb);
+
+/**
+ * ibnbd_dev_close() - Close a device
+ */
+void ibnbd_dev_close(struct ibnbd_dev *dev);
+
+static inline int ibnbd_dev_get_logical_bsize(const struct ibnbd_dev *dev)
+{
+	return bdev_logical_block_size(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_phys_bsize(const struct ibnbd_dev *dev)
+{
+	return bdev_physical_block_size(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_max_segs(const struct ibnbd_dev *dev)
+{
+	return queue_max_segments(bdev_get_queue(dev->bdev));
+}
+
+static inline int ibnbd_dev_get_max_hw_sects(const struct ibnbd_dev *dev)
+{
+	return queue_max_hw_sectors(bdev_get_queue(dev->bdev));
+}
+
+static inline int
+ibnbd_dev_get_max_write_same_sects(const struct ibnbd_dev *dev)
+{
+	return bdev_write_same(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_secure_discard(const struct ibnbd_dev *dev)
+{
+	if (dev->mode == IBNBD_BLOCKIO)
+		return blk_queue_secure_erase(bdev_get_queue(dev->bdev));
+	return 0;
+}
+
+static inline int ibnbd_dev_get_max_discard_sects(const struct ibnbd_dev *dev)
+{
+	if (!blk_queue_discard(bdev_get_queue(dev->bdev)))
+		return 0;
+
+	if (dev->mode == IBNBD_BLOCKIO)
+		return blk_queue_get_max_sectors(bdev_get_queue(dev->bdev),
+						 REQ_OP_DISCARD);
+	return 0;
+}
+
+static inline int ibnbd_dev_get_discard_granularity(const struct ibnbd_dev *dev)
+{
+	if (dev->mode == IBNBD_BLOCKIO)
+		return bdev_get_queue(dev->bdev)->limits.discard_granularity;
+	return 0;
+}
+
+static inline int ibnbd_dev_get_discard_alignment(const struct ibnbd_dev *dev)
+{
+	if (dev->mode == IBNBD_BLOCKIO)
+		return bdev_get_queue(dev->bdev)->limits.discard_alignment;
+	return 0;
+}
+
+/**
+ * ibnbd_dev_submit_io() - Submit an I/O to the disk
+ * @dev:	device to which the I/O is submitted
+ * @sector:	address to read/write data to
+ * @data:	I/O data to write or buffer to read I/O data into
+ * @len:	length of @data
+ * @bi_size:	Amount of data that will be read/written
+ * @priv:	private data passed to the io_cb completion callback
+ */
+int ibnbd_dev_submit_io(struct ibnbd_dev *dev, sector_t sector, void *data,
+			size_t len, u32 bi_size, enum ibnbd_io_flags flags,
+			void *priv);
+
+#endif /* IBNBD_SRV_DEV_H */
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 23/26] ibnbd: server: sysfs interface functions
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (21 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 22/26] ibnbd: server: functionality for IO submission to file or block dev Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation Roman Pen
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

This is the sysfs interface to IBNBD mapped devices on server side:

  /sys/devices/virtual/ibnbd-server/ctl/devices/<device_name>/
    |- block_dev
    |  *** link pointing to the corresponding block device sysfs entry
    |
    |- sessions/<session-name>/
    |  *** sessions directory
       |
       |- read_only
       |  *** whether the device is mapped read-only
       |
       |- mapping_path
          *** relative device path provided by the client during mapping
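
As a usage illustration only (not part of this patch), the layout above
could be walked from user space roughly as follows.  The helper name
ibnbd_srv_list_sessions is hypothetical, and the sketch assumes nothing
beyond the directory and attribute names listed above:

```sh
# Hypothetical helper, for illustration only: print one line per session
# of an exported device.  $1 is the device directory, e.g.
# /sys/devices/virtual/ibnbd-server/ctl/devices/<device_name>
ibnbd_srv_list_sessions() {
	dev_dir=$1
	for sess in "$dev_dir"/sessions/*/; do
		# Skip the unexpanded glob when no session exists.
		[ -d "$sess" ] || continue
		name=$(basename "$sess")
		ro=$(cat "$sess/read_only")
		path=$(cat "$sess/mapping_path")
		printf '%s read_only=%s mapping_path=%s\n' "$name" "$ro" "$path"
	done
}
```

Taking the device directory as an argument keeps the helper independent
of where sysfs is mounted and lets it be tried against a scratch tree.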

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-srv-sysfs.c | 242 ++++++++++++++++++++++++++++++++++
 1 file changed, 242 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c

diff --git a/drivers/block/ibnbd/ibnbd-srv-sysfs.c b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
new file mode 100644
index 000000000000..5bf77cdb09c8
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
@@ -0,0 +1,242 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <uapi/linux/limits.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/stat.h>
+#include <linux/genhd.h>
+#include <linux/list.h>
+#include <linux/moduleparam.h>
+#include <linux/device.h>
+
+#include "ibnbd-srv.h"
+
+static struct device *ibnbd_dev;
+static struct class *ibnbd_dev_class;
+static struct kobject *ibnbd_devs_kobj;
+
+static struct attribute *ibnbd_srv_default_dev_attrs[] = {
+	NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_attr_group = {
+	.attrs = ibnbd_srv_default_dev_attrs,
+};
+
+static struct kobj_type ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+};
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+			       struct block_device *bdev,
+			       const char *dir_name)
+{
+	struct kobject *bdev_kobj;
+	int ret;
+
+	ret = kobject_init_and_add(&dev->dev_kobj, &ktype,
+				   ibnbd_devs_kobj, dir_name);
+	if (ret)
+		return ret;
+
+	ret = kobject_init_and_add(&dev->dev_sessions_kobj,
+				   &ktype,
+				   &dev->dev_kobj, "sessions");
+	if (ret)
+		goto err;
+
+	ret = sysfs_create_group(&dev->dev_kobj,
+				 &ibnbd_srv_default_dev_attr_group);
+	if (ret)
+		goto err2;
+
+	bdev_kobj = &disk_to_dev(bdev->bd_disk)->kobj;
+	ret = sysfs_create_link(&dev->dev_kobj, bdev_kobj, "block_dev");
+	if (ret)
+		goto err3;
+
+	return 0;
+
+err3:
+	sysfs_remove_group(&dev->dev_kobj,
+			   &ibnbd_srv_default_dev_attr_group);
+err2:
+	kobject_del(&dev->dev_sessions_kobj);
+	kobject_put(&dev->dev_sessions_kobj);
+err:
+	kobject_del(&dev->dev_kobj);
+	kobject_put(&dev->dev_kobj);
+	return ret;
+}
+
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev)
+{
+	sysfs_remove_link(&dev->dev_kobj, "block_dev");
+	sysfs_remove_group(&dev->dev_kobj, &ibnbd_srv_default_dev_attr_group);
+	kobject_del(&dev->dev_sessions_kobj);
+	kobject_put(&dev->dev_sessions_kobj);
+	kobject_del(&dev->dev_kobj);
+	kobject_put(&dev->dev_kobj);
+}
+
+static ssize_t ibnbd_srv_dev_session_ro_show(struct kobject *kobj,
+					     struct kobj_attribute *attr,
+					     char *page)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 (sess_dev->open_flags & FMODE_WRITE) ? "0" : "1");
+}
+
+static struct kobj_attribute ibnbd_srv_dev_session_ro_attr =
+	__ATTR(read_only, 0444,
+	       ibnbd_srv_dev_session_ro_show,
+	       NULL);
+
+static ssize_t
+ibnbd_srv_dev_session_mapping_path_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", sess_dev->pathname);
+}
+
+static struct kobj_attribute ibnbd_srv_dev_session_mapping_path_attr =
+	__ATTR(mapping_path, 0444,
+	       ibnbd_srv_dev_session_mapping_path_show,
+	       NULL);
+
+static struct attribute *ibnbd_srv_default_dev_sessions_attrs[] = {
+	&ibnbd_srv_dev_session_ro_attr.attr,
+	&ibnbd_srv_dev_session_mapping_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_session_attr_group = {
+	.attrs = ibnbd_srv_default_dev_sessions_attrs,
+};
+
+void ibnbd_srv_destroy_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	DECLARE_COMPLETION_ONSTACK(sysfs_compl);
+
+	sysfs_remove_group(&sess_dev->kobj,
+			   &ibnbd_srv_default_dev_session_attr_group);
+
+	sess_dev->sysfs_release_compl = &sysfs_compl;
+	kobject_del(&sess_dev->kobj);
+	kobject_put(&sess_dev->kobj);
+	wait_for_completion(&sysfs_compl);
+}
+
+static void ibnbd_srv_sess_dev_release(struct kobject *kobj)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+	if (sess_dev->sysfs_release_compl)
+		complete_all(sess_dev->sysfs_release_compl);
+}
+
+static struct kobj_type ibnbd_srv_sess_dev_ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+	.release	= ibnbd_srv_sess_dev_release,
+};
+
+int ibnbd_srv_create_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	int ret;
+
+	ret = kobject_init_and_add(&sess_dev->kobj, &ibnbd_srv_sess_dev_ktype,
+				   &sess_dev->dev->dev_sessions_kobj, "%s",
+				   sess_dev->sess->sessname);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_group(&sess_dev->kobj,
+				 &ibnbd_srv_default_dev_session_attr_group);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	kobject_del(&sess_dev->kobj);
+	kobject_put(&sess_dev->kobj);
+
+	return ret;
+}
+
+int ibnbd_srv_create_sysfs_files(void)
+{
+	int err;
+
+	ibnbd_dev_class = class_create(THIS_MODULE, "ibnbd-server");
+	if (unlikely(IS_ERR(ibnbd_dev_class)))
+		return PTR_ERR(ibnbd_dev_class);
+
+	ibnbd_dev = device_create(ibnbd_dev_class, NULL,
+				  MKDEV(0, 0), NULL, "ctl");
+	if (unlikely(IS_ERR(ibnbd_dev))) {
+		err = PTR_ERR(ibnbd_dev);
+		goto cls_destroy;
+	}
+	ibnbd_devs_kobj = kobject_create_and_add("devices", &ibnbd_dev->kobj);
+	if (unlikely(!ibnbd_devs_kobj)) {
+		err = -ENOMEM;
+		goto dev_destroy;
+	}
+
+	return 0;
+
+dev_destroy:
+	device_destroy(ibnbd_dev_class, MKDEV(0, 0));
+cls_destroy:
+	class_destroy(ibnbd_dev_class);
+
+	return err;
+}
+
+void ibnbd_srv_destroy_sysfs_files(void)
+{
+	kobject_del(ibnbd_devs_kobj);
+	kobject_put(ibnbd_devs_kobj);
+	device_destroy(ibnbd_dev_class, MKDEV(0, 0));
+	class_destroy(ibnbd_dev_class);
+}
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (22 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 23/26] ibnbd: server: sysfs interface functions Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-20 17:21   ` kbuild test robot
                     ` (2 more replies)
  2018-05-18 13:04 ` [PATCH v2 25/26] ibnbd: a bit of documentation Roman Pen
                   ` (2 subsequent siblings)
  26 siblings, 3 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

Add the IBNBD Makefile and Kconfig, and hook them into the parent
drivers/block Kconfig and Makefile.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/Kconfig        |  2 ++
 drivers/block/Makefile       |  1 +
 drivers/block/ibnbd/Kconfig  | 22 ++++++++++++++++++++++
 drivers/block/ibnbd/Makefile | 13 +++++++++++++
 4 files changed, 38 insertions(+)
 create mode 100644 drivers/block/ibnbd/Kconfig
 create mode 100644 drivers/block/ibnbd/Makefile

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index ad9b687a236a..d8c1590411c8 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -481,4 +481,6 @@ config BLK_DEV_RSXX
 	  To compile this driver as a module, choose M here: the
 	  module will be called rsxx.
 
+source "drivers/block/ibnbd/Kconfig"
+
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dc061158b403..65346a1d0b1a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_BLK_DEV_NULL_BLK)	+= null_blk.o
 obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_BLK_DEV_IBNBD)	+= ibnbd/
 
 skd-y		:= skd_main.o
 swim_mod-y	:= swim.o swim_asm.o
diff --git a/drivers/block/ibnbd/Kconfig b/drivers/block/ibnbd/Kconfig
new file mode 100644
index 000000000000..b381c6c084d2
--- /dev/null
+++ b/drivers/block/ibnbd/Kconfig
@@ -0,0 +1,22 @@
+config BLK_DEV_IBNBD
+	bool
+
+config BLK_DEV_IBNBD_CLIENT
+	tristate "Network block device driver on top of IBTRS transport"
+	depends on INFINIBAND_IBTRS_CLIENT
+	select BLK_DEV_IBNBD
+	help
+	  IBNBD client allows for mapping of remote block devices over
+	  IBTRS protocol from a target system where IBNBD server is running.
+
+	  If unsure, say N.
+
+config BLK_DEV_IBNBD_SERVER
+	tristate "Network block device over RDMA Infiniband server support"
+	depends on INFINIBAND_IBTRS_SERVER
+	select BLK_DEV_IBNBD
+	help
+	  IBNBD server allows for exporting local block devices to a remote client
+	  over IBTRS protocol.
+
+	  If unsure, say N.
diff --git a/drivers/block/ibnbd/Makefile b/drivers/block/ibnbd/Makefile
new file mode 100644
index 000000000000..5f20e72e0633
--- /dev/null
+++ b/drivers/block/ibnbd/Makefile
@@ -0,0 +1,13 @@
+ccflags-y := -Idrivers/infiniband/ulp/ibtrs
+
+ibnbd-client-y := ibnbd-clt.o \
+		  ibnbd-clt-sysfs.o
+
+ibnbd-server-y := ibnbd-srv.o \
+		  ibnbd-srv-dev.o \
+		  ibnbd-srv-sysfs.o
+
+obj-$(CONFIG_BLK_DEV_IBNBD_CLIENT) += ibnbd-client.o
+obj-$(CONFIG_BLK_DEV_IBNBD_SERVER) += ibnbd-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 25/26] ibnbd: a bit of documentation
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (23 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-18 13:04 ` [PATCH v2 26/26] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Roman Pen
  2018-05-22 16:45 ` [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Jason Gunthorpe
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

README with description of major sysfs entries.
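
A minimal sketch of two conventions the README describes: the option
string written to map_device, and the <device_id> derived from
"device_path".  The function names are hypothetical and the code is
illustrative only; it is not part of the README:

```sh
# Hypothetical helpers, for illustration only.

# Build the option string for map_device.
# usage: ibnbd_map_args <sessname> <device_path> <path>...
ibnbd_map_args() {
	sessname=$1
	device_path=$2
	shift 2
	args="sessname=$sessname"
	# Each remaining argument becomes one "path=" option (multipath).
	for p in "$@"; do
		args="$args path=$p"
	done
	printf '%s device_path=%s\n' "$args" "$device_path"
}

# Derive the <device_id>: every "/" in device_path becomes "!".
ibnbd_device_id() {
	printf '%s\n' "$1" | tr '/' '!'
}
```

The output of "ibnbd_map_args blya /dev/ram0 ip:10.50.100.66" is the
string used in the README quick start and could be redirected to
/sys/devices/virtual/ibnbd-client/ctl/map_device.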

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/README | 299 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 299 insertions(+)
 create mode 100644 drivers/block/ibnbd/README

diff --git a/drivers/block/ibnbd/README b/drivers/block/ibnbd/README
new file mode 100644
index 000000000000..bbaddd02c1c5
--- /dev/null
+++ b/drivers/block/ibnbd/README
@@ -0,0 +1,299 @@
+***************************************
+InfiniBand Network Block Device (IBNBD)
+***************************************
+
+Introduction
+------------
+
+IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
+(client and server) that allow for remote access of a block device on
+the server over IBTRS protocol using the RDMA (InfiniBand, RoCE, iWarp)
+transport. After being mapped, the remote block devices can be accessed
+on the client side as local block devices.
+
+I/O is transferred between client and server by the IBTRS transport
+modules. The administration of IBNBD and IBTRS modules is done via
+sysfs entries.
+
+Requirements
+------------
+
+  IBTRS kernel modules
+
+Quick Start
+-----------
+
+Server side:
+  # modprobe ibnbd_server
+
+Client side:
+  # modprobe ibnbd_client
+  # echo "sessname=blya path=ip:10.50.100.66 device_path=/dev/ram0" > \
+            /sys/devices/virtual/ibnbd-client/ctl/map_device
+
+  Where "sessname=" is a session name, a string to identify the session
+  on client and on server sides; "path=" is a destination IP address or
+  a pair of source and destination IPs, separated by a comma.  Multiple
+  "path=" options can be specified in order to use multipath (see IBTRS
+  description for details); "device_path=" is the block device to be
+  mapped from the server side. After the session to the server machine is
+  established, the mapped device will appear on the client side under
+  /dev/ibnbd<N>.
+
+
+======================
+Client Sysfs Interface
+======================
+
+All sysfs files that are not read-only provide the usage information on read:
+
+Example:
+  # cat /sys/devices/virtual/ibnbd-client/ctl/map_device
+
+  > Usage: echo "sessname=<name of the ibtrs session> path=<[srcaddr,]dstaddr>
+  > [path=<[srcaddr,]dstaddr>] device_path=<full path on remote side>
+  > [access_mode=<ro|rw|migration>]
+  > [io_mode=<fileio|blockio>]" > map_device
+  >
+  > addr ::= [ ip:<ipv4> | ip:<ipv6> | gid:<gid> ]
+
+Entries under /sys/devices/virtual/ibnbd-client/ctl/
+====================================================
+
+map_device (RW)
+---------------
+
+Expected format is the following:
+
+    sessname=<name of the ibtrs session>
+    path=<[srcaddr,]dstaddr> [path=<[srcaddr,]dstaddr> ...]
+    device_path=<full path on remote side>
+    [access_mode=<ro|rw|migration>]
+    [io_mode=<fileio|blockio>]
+
+Where:
+
+sessname: accepts a string of at most 256 chars, which identifies
+          a given session on the client and on the server.
+          E.g. "clt_hostname-srv_hostname" could be a natural choice.
+
+path:     describes a connection between the client and the server by
+          specifying destination and, when required, the source address.
+          The addresses are to be provided in the following format:
+
+            ip:<IPv6>
+            ip:<IPv4>
+            gid:<GID>
+
+          for example:
+
+          path=ip:10.0.0.66
+                         The single addr is treated as the destination.
+                         The connection will be established to this
+                         server from any client IP address.
+
+          path=ip:10.0.0.66,ip:10.0.1.66
+                         First addr is the source address and the second
+                         is the destination.
+
+          If multiple "path=" options are specified, multiple connections
+          will be established and data will be sent according to
+          the selected multipath policy (see IBTRS mp_policy sysfs entry
+          description).
+
+device_path: Path to the block device on the server side. Path is specified
+         relative to the directory on server side configured in the
+         'dev_search_path' module parameter of the ibnbd_server.
+         The ibnbd_server prepends the <device_path> received from client
+         with <dev_search_path> and tries to open the
+         <dev_search_path>/<device_path> block device.  On success,
+         a /dev/ibnbd<N> device file, a /sys/block/ibnbd<N>/ibnbd_client/
+         directory and an entry in /sys/devices/virtual/ibnbd-client/ctl/devices
+         will be created.
+
+         If 'dev_search_path' contains '%SESSNAME%', then each session can
+         have its own device namespace, e.g. if the server is configured with
+         the parameter "dev_search_path=/run/ibnbd-devs/%SESSNAME%" and the
+         client passes "sessname=blya device_path=sda", then the server
+         will try to open: /run/ibnbd-devs/blya/sda.
+
+access_mode: the access_mode parameter specifies if the device is to be
+             mapped as "ro" read-only or "rw" read-write. The server allows
+             a device to be exported in rw mode only once. The "migration"
+             access mode has to be specified if a second mapping in read-write
+             mode is desired.
+
+             By default "rw" is used.
+
+io_mode:  the io_mode parameter specifies if the device on the server
+          will be opened as block device "blockio" or as file "fileio".
+          When the device is opened as file, the VFS page cache is used
+          for read I/O operations, write I/O operations bypass the page
+          cache and go directly to disk (except meta updates, like file
+          access time).
+
+          By default "blockio" mode is used.
+
+Exit Codes:
+
+If the device is already mapped it will fail with EEXIST. If the input
+has an invalid format it will return EINVAL. If the device path cannot
+be found on the server, it will fail with ENOENT.
+
+Finding device file after mapping
+---------------------------------
+
+After mapping, the device file can be found by:
+ o  The symlink /sys/devices/virtual/ibnbd-client/ctl/devices/<device_id>
+    points to /sys/block/<dev-name>. The last part of the symlink destination
+    is the same as the device name.  By extracting the last part of the
+    path the path to the device /dev/<dev-name> can be built.
+
+ o /dev/block/$(cat /sys/devices/virtual/ibnbd-client/ctl/devices/<device_id>/dev)
+
+How to find the <device_id> of the device is described in the next
+section.
+
+Entries under /sys/devices/virtual/ibnbd-client/ctl/devices/
+============================================================
+
+For each device mapped on the client a new symbolic link is created as
+/sys/devices/virtual/ibnbd-client/ctl/devices/<device_id>, which points
+to the block device created by ibnbd (/sys/block/ibnbd<N>/).
+The <device_id> of each device is created as follows:
+
+- If the 'device_path' provided during mapping contains slashes ("/"),
+  they are replaced by exclamation marks ("!") and the result is used as
+  the <device_id>. Otherwise, the <device_id> will be the same as the
+  "device_path" provided.
+
+Entries under /sys/block/ibnbd<N>/ibnbd_client/
+===============================================
+
+unmap_device (RW)
+-----------------
+
+To unmap a volume, "normal" or "force" has to be written to:
+  /sys/block/ibnbd<N>/ibnbd_client/unmap_device
+
+When "normal" is used, the operation will fail with EBUSY if any process
+is using the device.  When "force" is used, the device is unmapped even
+if it is in use.  All I/Os that are in progress will fail.
+
+Example:
+
+   # echo "normal" > /sys/block/ibnbd0/ibnbd_client/unmap_device
+
+state (RO)
+----------
+
+The file contains the current state of the block device. The state file
+returns "open" when the device is successfully mapped from the server
+and accepting I/O requests. When the connection to the server gets
+disconnected in case of an error (e.g. link failure), the state file
+returns "closed" and all I/O requests submitted to it will fail with -EIO.
+
+session (RO)
+------------
+
+IBNBD uses an IBTRS session to transport the data between client and
+server.  The entry "session" contains the name of the session that
+was used to establish the IBTRS session.  It's the same name that
+was passed as the "sessname" parameter to the map_device entry.
+
+mapping_path (RO)
+-----------------
+
+Contains the path that was passed as "device_path" to the map_device
+operation.
+
+======================
+Server Sysfs Interface
+======================
+
+Entries under /sys/devices/virtual/ibnbd-server/ctl/
+====================================================
+
+When a client maps a device, a directory entry with the name of the
+block device is created under /sys/devices/virtual/ibnbd-server/ctl/devices/.
+
+Entries under /sys/devices/virtual/ibnbd-server/ctl/devices/<device_name>/
+==========================================================================
+
+block_dev (link)
+----------------
+
+Is a symlink to the sysfs entry of the exported device.
+
+Example:
+
+  block_dev -> ../../../../devices/virtual/block/ram0
+
+Entries under /sys/devices/virtual/ibnbd-server/ctl/devices/<device_name>/sessions/
+===================================================================================
+
+For each client a particular device is exported to, the following directory
+will be created:
+
+/sys/devices/virtual/ibnbd-server/ctl/devices/<device_name>/sessions/<session-name>/
+
+When the device is unmapped by that client, the directory will be removed.
+
+Entries under /sys/devices/virtual/ibnbd-server/ctl/devices/<device_name>/sessions/<session-name>
+=================================================================================================
+
+read_only (RO)
+--------------
+
+Contains '1' if device is mapped read-only, otherwise '0'.
+
+mapping_path (RO)
+-----------------
+
+Contains the relative device path provided by the user during mapping.
+
+==============================
+IBNBD-Server Module Parameters
+==============================
+
+dev_search_path
+---------------
+
+When a device is mapped from the client, the server generates the path
+to the block device on the server side by concatenating dev_search_path
+and the "device_path" that was specified in the map_device operation.
+
+The default dev_search_path is: "/".
+
+dev_search_path option can also contain %SESSNAME% in order to provide
+different device namespaces for different sessions.  See "device_path"
+option for details.
+
+==============================
+Protocol (ibnbd/ibnbd-proto.h)
+==============================
+
+1. Before mapping the first device from a given server, client sends an
+IBNBD_MSG_SESS_INFO to the server. Server responds with
+IBNBD_MSG_SESS_INFO_RSP. Currently the messages only contain the protocol
+version for backward compatibility.
+
+2. Client requests to open a device by sending IBNBD_MSG_OPEN message. This
+contains the path to the device, access mode (read-only or writable), and
+io_mode which specifies if the device should be opened as block device or
+using file io. Server responds to the message with IBNBD_MSG_OPEN_RSP. This
+contains a 32 bit device id to be used for IOs and device "geometry" related
+information: size, max_hw_sectors, etc.
+
+3. Client attaches IBNBD_MSG_IO to each IO message sent to a device. This
+message contains device id, provided by server in its ibnbd_msg_open_rsp,
+sector to be accessed, read-write flags and bi_size.
+
+4. Client closes a device by sending IBNBD_MSG_CLOSE which contains only the
+device id provided by the server.
+
+
+Contact
+-------
+
+Mailing list: "IBNBD/IBTRS Storage Team" <ibnbd@profitbricks.com>
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 26/26] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (24 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 25/26] ibnbd: a bit of documentation Roman Pen
@ 2018-05-18 13:04 ` Roman Pen
  2018-05-22 16:45 ` [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Jason Gunthorpe
  26 siblings, 0 replies; 55+ messages in thread
From: Roman Pen @ 2018-05-18 13:04 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang,
	Roman Pen

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 MAINTAINERS | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 92be777d060a..e5a001bd0f05 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6786,6 +6786,20 @@ IBM ServeRAID RAID DRIVER
 S:	Orphan
 F:	drivers/scsi/ips.*
 
+IBNBD BLOCK DRIVERS
+M:	IBNBD/IBTRS Storage Team <ibnbd@profitbricks.com>
+L:	linux-block@vger.kernel.org
+S:	Maintained
+T:	git git://github.com/profitbricks/ibnbd.git
+F:	drivers/block/ibnbd/
+
+IBTRS TRANSPORT DRIVERS
+M:	IBNBD/IBTRS Storage Team <ibnbd@profitbricks.com>
+L:	linux-rdma@vger.kernel.org
+S:	Maintained
+T:	git git://github.com/profitbricks/ibnbd.git
+F:	drivers/infiniband/ulp/ibtrs/
+
 ICH LPC AND GPIO DRIVER
 M:	Peter Tyser <ptyser@xes-inc.com>
 S:	Maintained
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 02/26] sysfs: export sysfs_remove_file_self()
  2018-05-18 13:03 ` [PATCH v2 02/26] sysfs: export sysfs_remove_file_self() Roman Pen
@ 2018-05-18 15:08   ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2018-05-18 15:08 UTC (permalink / raw)
  To: Roman Pen
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang, linux-kernel

On Fri, May 18, 2018 at 03:03:49PM +0200, Roman Pen wrote:
> Function is going to be used in transport over RDMA module
> in subsequent patches.
> 
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org

Acked-by: Tejun Heo <tj@kernel.org>

Please feel free to apply with other patches.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-18 13:03 ` [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Roman Pen
@ 2018-05-18 16:56   ` Linus Torvalds
  2018-05-19 20:25     ` Roman Penyaev
  2018-05-19 16:37   ` Paul E. McKenney
  1 sibling, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2018-05-18 16:56 UTC (permalink / raw)
  To: Roman Pen
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	swapnil.ingle, danil.kipnis, jinpu.wang, Paul McKenney,
	Linux Kernel Mailing List

On Fri, May 18, 2018 at 6:07 AM Roman Pen <roman.penyaev@profitbricks.com>
wrote:

> Function is going to be used in transport over RDMA module
> in subsequent patches.

Does this really merit its own helper macro in a generic header?

It honestly smells more like "just have an inline helper function that is
specific to rdma" to me. Particularly since it's probably just one specific
list where you want this oddly specific behavior.

Also, if we really want a round-robin list traversal macro, this isn't the
way it should be implemented, I suspect, and it probably shouldn't be
RCU-specific to begin with.

Side note: I notice that I should already have been more critical of even
the much simpler "list_next_or_null_rcu()" macro. The "documentation"
comment above the macro is pure and utter cut-and-paste garbage.

Paul, mind giving this a look?

                 Linus


* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-18 13:03 ` [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Roman Pen
  2018-05-18 16:56   ` Linus Torvalds
@ 2018-05-19 16:37   ` Paul E. McKenney
  2018-05-19 20:20     ` Roman Penyaev
  1 sibling, 1 reply; 55+ messages in thread
From: Paul E. McKenney @ 2018-05-19 16:37 UTC (permalink / raw)
  To: Roman Pen
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang, linux-kernel

On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> Function is going to be used in transport over RDMA module
> in subsequent patches.
> 
> Function returns next element in round-robin fashion,
> i.e. head will be skipped.  NULL will be returned if list
> is observed as empty.
> 
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/linux/rculist.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 127f534fec94..b0840d5ab25a 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>  })
> 
>  /**
> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> + * @head:	the head for the list.
> + * @ptr:        the list head to take the next element from.
> + * @type:       the type of the struct this is embedded in.
> + * @memb:       the name of the list_head within the struct.
> + *
> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> + * but if list is observed as empty, NULL will be returned.
> + *
> + * This primitive may safely run concurrently with the _rcu list-mutation
> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().

Of course, the whole set of list_next_or_null_rr_rcu() invocations that
are round-robining a given list must be under the same RCU read-side
critical section.  For example, the following will break badly:

	struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
	{
		struct foo *ret;

		rcu_read_lock();
		ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
		rcu_read_unlock();  /* BUG */
		return ret;
	}

You need a big fat comment stating this, at the very least.  The resulting
bug can be very hard to trigger and even harder to debug.

And yes, I know that the same restriction applies to list_next_rcu()
and friends.  The difference is that if you try to invoke those in an
infinite loop, you will be rapped on the knuckles as soon as you hit
the list header.  Without that knuckle-rapping, RCU CPU stall warnings
might tempt people to do something broken like take_rr_step() above.

Is it possible to instead do some sort of list_for_each_entry_rcu()-like
macro that makes it more obvious that the whole thing needs to be under
a single RCU read-side critical section?  Such a macro would of course be
an infinite loop if the list never went empty, so presumably there would
be a break or return statement in there somewhere.

> + */
> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> +({ \
> +	list_next_or_null_rcu(head, ptr, type, memb) ?: \
> +		list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \

Are there any uses for this outside of RDMA?  If not, I am with Linus.
Define this within RDMA, where a smaller number of people can more
easily be kept aware of the restrictions on use.  If it turns out to be
more generally useful, we can take a look at exactly what makes sense
more globally.

Even within RDMA, I strongly recommend the big fat comment called out above.
And the list_for_each_entry_rcu()-like formulation, if that can be made to
work within RDMA's code structure.

Seem reasonable, or am I missing something here?

								Thanx, Paul

> +})
> +
> +/**
>   * list_for_each_entry_rcu	-	iterate over rcu list of given type
>   * @pos:	the type * to use as a loop cursor.
>   * @head:	the head for your list.
> -- 
> 2.13.1
> 


* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-19 16:37   ` Paul E. McKenney
@ 2018-05-19 20:20     ` Roman Penyaev
  2018-05-19 20:56       ` Linus Torvalds
  2018-05-20  0:43       ` Paul E. McKenney
  0 siblings, 2 replies; 55+ messages in thread
From: Roman Penyaev @ 2018-05-19 20:20 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang, linux-kernel

On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> Function is going to be used in transport over RDMA module
>> in subsequent patches.
>>
>> Function returns next element in round-robin fashion,
>> i.e. head will be skipped.  NULL will be returned if list
>> is observed as empty.
>>
>> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  include/linux/rculist.h | 19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> index 127f534fec94..b0840d5ab25a 100644
>> --- a/include/linux/rculist.h
>> +++ b/include/linux/rculist.h
>> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>>  })
>>
>>  /**
>> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> + * @head:    the head for the list.
>> + * @ptr:        the list head to take the next element from.
>> + * @type:       the type of the struct this is embedded in.
>> + * @memb:       the name of the list_head within the struct.
>> + *
>> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> + * but if list is observed as empty, NULL will be returned.
>> + *
>> + * This primitive may safely run concurrently with the _rcu list-mutation
>> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>
> Of course, the whole set of list_next_or_null_rr_rcu() invocations that
> are round-robining a given list must be under the same RCU read-side
> critical section.  For example, the following will break badly:
>
>         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>         {
>                 struct foo *ret;
>
>                 rcu_read_lock();
>                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>                 rcu_read_unlock();  /* BUG */
>                 return ret;
>         }
>
> You need a big fat comment stating this, at the very least.  The resulting
> bug can be very hard to trigger and even harder to debug.
>
> And yes, I know that the same restriction applies to list_next_rcu()
> and friends.  The difference is that if you try to invoke those in an
> infinite loop, you will be rapped on the knuckles as soon as you hit
> the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> might tempt people to do something broken like take_rr_step() above.

Hi Paul,

I need -rr behaviour for doing IO load-balancing when I choose next RDMA
connection from the list in order to send a request, i.e. my code is
something like the following:

        static struct conn *get_and_set_next_conn(void)
        {
                struct conn *conn;

                conn = rcu_dereference(rcu_conn);
                if (unlikely(!conn))
                    return conn;
                conn = list_next_or_null_rr_rcu(&conn_list,
                                                &conn->entry,
                                                typeof(*conn),
                                                entry);
                rcu_assign_pointer(rcu_conn, conn);
                return conn;
        }

        rcu_read_lock();
        conn = get_and_set_next_conn();
        if (unlikely(!conn)) {
                /* ... */
        }
        err = rdma_io(conn, request);
        rcu_read_unlock();

i.e. usage of the @next pointer is under an RCU critical section.

> Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> macro that makes it more obvious that the whole thing needs to be under
> a single RCU read-side critical section?  Such a macro would of course be
> an infinite loop if the list never went empty, so presumably there would
> be a break or return statement in there somewhere.

The difference is that I do not need a loop, I take the @next conn pointer,
save it for the following IO request and do IO for current IO request.

It seems list_for_each_entry_rcu()-like with immediate "break" in the body
of the loop does not look nice, I personally do not like it, i.e.:


        static struct conn *get_and_set_next_conn(void)
        {
                struct conn *conn;

                conn = rcu_dereference(rcu_conn);
                if (unlikely(!conn))
                    return conn;
                list_for_each_entry_rr_rcu(conn, &conn_list,
                                           entry) {
                        break;
                }
                rcu_assign_pointer(rcu_conn, conn);
                return conn;
        }


or maybe I did not fully get your idea?

>> + */
>> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> +({ \
>> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>
> Are there any uses for this outside of RDMA?  If not, I am with Linus.
> Define this within RDMA, where a smaller number of people can more
> easily be kept aware of the restrictions on use.  If it turns out to be
> more generally useful, we can take a look at exactly what makes sense
> more globally.

The only list_for_each_entry_rcu()-like macro with -rr semantics that I am
aware of is list_for_each_entry_rcu_rr(), used in block/blk-mq-sched.c:

https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370

Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
my list_next_or_null_rr_rcu() variant?

> Even within RDMA, I strongly recommend the big fat comment called out above.
> And the list_for_each_entry_rcu()-like formulation, if that can be made to
> work within RDMA's code structure.
>
> Seem reasonable, or am I missing something here?

Thanks for clear explanation.

--
Roman


* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-18 16:56   ` Linus Torvalds
@ 2018-05-19 20:25     ` Roman Penyaev
  2018-05-19 21:04       ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Roman Penyaev @ 2018-05-19 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	swapnil.ingle, Danil Kipnis, Jinpu Wang, Paul McKenney,
	Linux Kernel Mailing List

On Fri, May 18, 2018 at 6:56 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, May 18, 2018 at 6:07 AM Roman Pen <roman.penyaev@profitbricks.com>
> wrote:
>
>> Function is going to be used in transport over RDMA module
>> in subsequent patches.
>
> Does this really merit its own helper macro in a generic header?
>
> It honestly smells more like "just have an inline helper function that is
> specific to rdma" to me. Particularly since it's probably just one specific
> list where you want this oddly specific behavior.
>
> Also, if we really want a round-robin list traversal macro, this isn't the
> way it should be implemented, I suspect, and it probably shouldn't be
> RCU-specific to begin with.

Hi Linus,

Another list_for_each_entry_rcu()-like macro I am aware of is
list_for_each_entry_rcu_rr(), used in block/blk-mq-sched.c:

https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370

Can we do something generic with -rr semantics to cover both cases?

--
Roman

>
> Side note: I notice that I should already have been more critical of even
> the much simpler "list_next_or_null_rcu()" macro. The "documentation"
> comment above the macro is pure and utter cut-and-paste garbage.
>
> Paul, mind giving this a look?
>
>                  Linus


* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-19 20:20     ` Roman Penyaev
@ 2018-05-19 20:56       ` Linus Torvalds
  2018-05-20  0:43       ` Paul E. McKenney
  1 sibling, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2018-05-19 20:56 UTC (permalink / raw)
  To: Roman Pen
  Cc: Paul McKenney, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, swapnil.ingle, danil.kipnis, jinpu.wang,
	Linux Kernel Mailing List

On Sat, May 19, 2018 at 1:21 PM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> connection from the list in order to send a request, i.e. my code is
> something like the following:
[ incomplete pseudocode ]
> i.e. usage of the @next pointer is under an RCU critical section.

That's not enough. The whole chain to look up the pointer you are taking
'next' of needs to be under RCU, and that's not clear from your example.

It's *probably* the case, but basically you have to prove that the starting
point is still on the same RCU list. That wasn't clear from your example.

The above is (as Paul said) true of list_next_rcu() too, so it's not like
this is anything specific to the 'rr' version.

              Linus


* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-19 20:25     ` Roman Penyaev
@ 2018-05-19 21:04       ` Linus Torvalds
  0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2018-05-19 21:04 UTC (permalink / raw)
  To: Roman Pen
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	swapnil.ingle, danil.kipnis, jinpu.wang, Paul McKenney,
	Linux Kernel Mailing List

On Sat, May 19, 2018 at 1:25 PM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> Another list_for_each_entry_rcu()-like macro I am aware of is
> list_for_each_entry_rcu_rr(), used in block/blk-mq-sched.c:
>
> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>
> Can we do something generic with -rr semantics to cover both cases?

That loop actually looks more like what Paul was asking for, and makes it
(perhaps) a bit more obvious that the whole loop has to be done under the
same RCU read sequence that looked up that first 'skip' entry.

(Again, stronger locking than RCU is obviously also acceptable for the
"look up skip entry").

But another reason I really dislike that list_next_or_null_rr_rcu() macro
in the patch under discussion is that it's *really* not the right way to
skip one entry. It may work, but it's really ugly. Again, the
list_for_each_entry_rcu_rr() in blk-mq-sched.c looks better in that regard,
in that the skipping seems at least a _bit_ more explicit about what it's
doing.

And again, if you make this specific to one particular list (and it really
likely is just one particular list that wants this), you can use a nice
legible helper inline function instead of the macro with the member name.

Don't get me wrong - I absolutely adore our generic list handling macros,
but I think they work because they are simple. Once we get to "take the
next entry, but skip it if it's the head entry, and then return NULL if you
get back to the entry you started with" kind of semantics, an inline
function that takes a particular list and has a big comment about *why* you
want those semantics for that particular case sounds _much_ better to me
than adding some complex "generic" macro for a very very unusual special
case.
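
As an illustration of the kind of list-specific helper described above,
here is a minimal userspace sketch.  It is not code from the patch:
struct list_node, struct conn and conn_next_rr() are hypothetical names,
the list is a plain single-threaded circular list, and RCU is left out
entirely.  It follows the semantics as worded above (skip the head,
return NULL on getting back to the starting entry).

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical connection type; only the list linkage matters here. */
struct conn {
	int id;
	struct list_node entry;
};

/* An empty circular list is a head that points at itself. */
static void conn_list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Link @c in just before the head, i.e. at the tail of the list. */
static void conn_list_add_tail(struct list_node *head, struct conn *c)
{
	c->entry.prev = head->prev;
	c->entry.next = head;
	head->prev->next = &c->entry;
	head->prev = &c->entry;
}

/* Map an embedded list node back to its containing struct conn. */
static struct conn *conn_of(struct list_node *n)
{
	return (struct conn *)((char *)n - offsetof(struct conn, entry));
}

/*
 * Return the connection that follows @cur in round-robin order.
 * The head is not a real entry, so it is stepped over.  NULL means
 * the walk came straight back to @cur: there is no other connection.
 * Assumes @cur is currently linked on the list rooted at @head.
 */
static struct conn *conn_next_rr(struct list_node *head, struct conn *cur)
{
	struct list_node *n;

	assert(head && cur);
	n = cur->entry.next;
	if (n == head)		/* skip the head entry */
		n = n->next;
	if (n == &cur->entry)	/* back where we started */
		return NULL;
	return conn_of(n);
}
```

With entries a, b and c on the list, conn_next_rr(&head, &c) wraps around
to a; with a single entry on the list it returns NULL.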

                  Linus


* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-19 20:20     ` Roman Penyaev
  2018-05-19 20:56       ` Linus Torvalds
@ 2018-05-20  0:43       ` Paul E. McKenney
  2018-05-21 13:50         ` Roman Penyaev
  1 sibling, 1 reply; 55+ messages in thread
From: Paul E. McKenney @ 2018-05-20  0:43 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang, linux-kernel

On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> Function is going to be used in transport over RDMA module
> >> in subsequent patches.
> >>
> >> Function returns next element in round-robin fashion,
> >> i.e. head will be skipped.  NULL will be returned if list
> >> is observed as empty.
> >>
> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> Cc: linux-kernel@vger.kernel.org
> >> ---
> >>  include/linux/rculist.h | 19 +++++++++++++++++++
> >>  1 file changed, 19 insertions(+)
> >>
> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> index 127f534fec94..b0840d5ab25a 100644
> >> --- a/include/linux/rculist.h
> >> +++ b/include/linux/rculist.h
> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >>  })
> >>
> >>  /**
> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> + * @head:    the head for the list.
> >> + * @ptr:        the list head to take the next element from.
> >> + * @type:       the type of the struct this is embedded in.
> >> + * @memb:       the name of the list_head within the struct.
> >> + *
> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> + * but if list is observed as empty, NULL will be returned.
> >> + *
> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >
> > Of course, the whole set of list_next_or_null_rr_rcu() invocations that
> > are round-robining a given list must be under the same RCU read-side
> > critical section.  For example, the following will break badly:
> >
> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >         {
> >                 struct foo *ret;
> >
> >                 rcu_read_lock();
> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >                 rcu_read_unlock();  /* BUG */
> >                 return ret;
> >         }
> >
> > You need a big fat comment stating this, at the very least.  The resulting
> > bug can be very hard to trigger and even harder to debug.
> >
> > And yes, I know that the same restriction applies to list_next_rcu()
> > and friends.  The difference is that if you try to invoke those in an
> > infinite loop, you will be rapped on the knuckles as soon as you hit
> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> > might tempt people to do something broken like take_rr_step() above.
> 
> Hi Paul,
> 
> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> connection from the list in order to send a request, i.e. my code is
> something like the following:
> 
>         static struct conn *get_and_set_next_conn(void)
>         {
>                 struct conn *conn;
> 
>                 conn = rcu_dereference(rcu_conn);
>                 if (unlikely(!conn))
>                     return conn;

Wait.  Don't you need to restart from the beginning of the list in
this case?  Or does the list never have anything added to it and is
rcu_conn initially the first element in the list?

>                 conn = list_next_or_null_rr_rcu(&conn_list,
>                                                 &conn->entry,
>                                                 typeof(*conn),
>                                                 entry);
>                 rcu_assign_pointer(rcu_conn, conn);

Linus is correct to doubt this code.  You assign a pointer to the current
element to rcu_conn, which is presumably a per-CPU or global variable.
So far, so good ...

>                 return conn;
>         }
> 
>         rcu_read_lock();
>         conn = get_and_set_next_conn();
>         if (unlikely(!conn)) {
>                 /* ... */
>         }
>         err = rdma_io(conn, request);
>         rcu_read_unlock();

... except that some other CPU might well remove the entry referenced by
rcu_conn at this point.  It would have to wait for a grace period (e.g.,
synchronize_rcu()), but the current CPU has exited its RCU read-side
critical section, and therefore is not blocking the grace period.
Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
might well be referencing the freelist, or, even worse, some other type
of structure.

What is your code doing to prevent this from happening?  (There are ways,
but I want to know what you were doing in this case.)

> i.e. usage of the @next pointer is under an RCU critical section.
> 
> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> > macro that makes it more obvious that the whole thing needs to be under
> > a single RCU read-side critical section?  Such a macro would of course be
> > an infinite loop if the list never went empty, so presumably there would
> > be a break or return statement in there somewhere.
> 
> The difference is that I do not need a loop, I take the @next conn pointer,
> save it for the following IO request and do IO for current IO request.
> 
> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
> of the loop does not look nice, I personally do not like it, i.e.:
> 
> 
>         static struct conn *get_and_set_next_conn(void)
>         {
>                 struct conn *conn;
> 
>                 conn = rcu_dereference(rcu_conn);
>                 if (unlikely(!conn))
>                     return conn;
>                 list_for_each_entry_rr_rcu(conn, &conn_list,
>                                            entry) {
>                         break;
>                 }
>                 rcu_assign_pointer(rcu_conn, conn);
>                 return conn;
>         }
> 
> 
> or maybe I did not fully get your idea?

That would not help at all because you are still leaking the pointer out
of the RCU read-side critical section.  That is completely and utterly
broken unless you are somehow cleaning up rcu_conn when you remove
the element.  And getting that cleanup right is -extremely- tricky.
Unless you have some sort of proof of correctness, you will get a NACK
from me.

More like this:

	list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
		do_something_with(conn);
		if (done_for_now())
			break;
	}

> >> + */
> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> >> +({ \
> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
> >
> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
> > Define this within RDMA, where a smaller number of people can more
> > easily be kept aware of the restrictions on use.  If it turns out to be
> > more generally useful, we can take a look at exactly what makes sense
> > more globally.
> 
> The only list_for_each_entry_rcu()-like macro with -rr semantics that I am
> aware of is list_for_each_entry_rcu_rr(), used in block/blk-mq-sched.c:
> 
> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
> 
> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
> my list_next_or_null_rr_rcu() variant?

Let's start with the basics:  It absolutely does not make sense to leak
pointers across rcu_read_unlock() unless you have arranged something else
to protect the pointed-to data in the meantime.  There are a number of ways
of implementing this protection.  Again, what protection are you using?
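
One commonly used form of such protection is a reference count that is
taken while still inside the read-side critical section.  The following
is only a userspace sketch of the counting part, assuming C11 atomics;
struct conn, conn_get_unless_zero() and conn_put() are hypothetical names
and none of this is taken from the patch.  In kernel code the "get" would
have to happen before rcu_read_unlock(), and freeing the memory would
still have to wait for a grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection object guarded by a reference count. */
struct conn {
	atomic_int refs;	/* 0 means the object is being torn down */
};

/*
 * Take a reference only if the object is still live.  A count that
 * has already dropped to zero is never raised again, so a reader
 * cannot resurrect an object whose teardown has started.
 */
static bool conn_get_unless_zero(struct conn *c)
{
	int old;

	assert(c);
	old = atomic_load(&c->refs);
	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&c->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool conn_put(struct conn *c)
{
	assert(c);
	return atomic_fetch_sub(&c->refs, 1) == 1;
}
```

A caller that gets true from conn_get_unless_zero() may keep using the
pointer after leaving the critical section and must pair it with
conn_put(); a caller that gets false must treat the connection as gone.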

Your code at the above URL looks plausible to me at first glance: You
do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
rcu_read_unlock().  But at second glance, it looks like hctx->queue
might have the same vulnerability as rcu_conn in your earlier code.

							Thanx, Paul

> > Even within RDMA, I strongly recommend the big fat comment called out above.
> > And the list_for_each_entry_rcu()-like formulation, if that can be made to
> > work within RDMA's code structure.
> >
> > Seem reasonable, or am I missing something here?
> 
> Thanks for clear explanation.
> 
> --
> Roman
> 


* Re: [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation
  2018-05-18 13:04 ` [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation Roman Pen
@ 2018-05-20 17:21   ` kbuild test robot
  2018-05-20 22:14   ` kbuild test robot
  2018-05-21  5:33   ` kbuild test robot
  2 siblings, 0 replies; 55+ messages in thread
From: kbuild test robot @ 2018-05-20 17:21 UTC (permalink / raw)
  To: Roman Pen
  Cc: kbuild-all, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang, Roman Pen

[-- Attachment #1: Type: text/plain, Size: 11480 bytes --]

Hi Roman,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc5 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Roman-Pen/InfiniBand-Transport-IBTRS-and-Network-Block-Device-IBNBD/20180520-222445
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sh 

All warnings (new ones prefixed by >>):

   In file included from include/linux/printk.h:7:0,
                    from include/linux/kernel.h:14,
                    from include/linux/list.h:9,
                    from include/linux/module.h:9,
                    from drivers/block/ibnbd/ibnbd-clt-sysfs.c:37:
   drivers/block/ibnbd/ibnbd-clt-sysfs.c: In function 'ibnbd_clt_parse_map_options':
   include/linux/kern_levels.h:5:18: warning: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type 'size_t {aka unsigned int}' [-Wformat=]
    #define KERN_SOH "\001"  /* ASCII Start Of Header */
                     ^
   include/linux/kern_levels.h:11:18: note: in expansion of macro 'KERN_SOH'
    #define KERN_ERR KERN_SOH "3" /* error conditions */
                     ^~~~~~~~
   include/linux/printk.h:304:9: note: in expansion of macro 'KERN_ERR'
     printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
            ^~~~~~~~
>> drivers/block/ibnbd/ibnbd-clt-sysfs.c:139:5: note: in expansion of macro 'pr_err'
        pr_err("map_device: too many (> %lu) paths "
        ^~~~~~
   drivers/block/ibnbd/ibnbd-clt-sysfs.c: In function 'ibnbd_clt_map_device_store':
>> drivers/block/ibnbd/ibnbd-clt-sysfs.c:613:1: warning: the frame size of 1616 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^
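
The first warning comes from printing a size_t with '%lu', which only
matches on ABIs where size_t is unsigned long.  The '%zu' length modifier
matches size_t everywhere.  A minimal userspace sketch of the corrected
format follows; format_path_limit() is a hypothetical helper written for
illustration and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the "too many paths" message into @buf.  Using %zu keeps the
 * format string correct whether size_t is unsigned int (as on sh) or
 * unsigned long.  Returns what snprintf() returns.
 */
static int format_path_limit(char *buf, size_t len, size_t max_path_cnt)
{
	assert(buf && len > 0);
	return snprintf(buf, len,
			"map_device: too many (> %zu) paths provided",
			max_path_cnt);
}
```

The same substitution in the pr_err() call at line 139 of the listing
below would be expected to silence the -Wformat= warning.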

vim +/pr_err +139 drivers/block/ibnbd/ibnbd-clt-sysfs.c

ea541da7d Roman Pen 2018-05-18   88  
ea541da7d Roman Pen 2018-05-18   89  static int ibnbd_clt_parse_map_options(const char *buf,
ea541da7d Roman Pen 2018-05-18   90  				       char *sessname,
ea541da7d Roman Pen 2018-05-18   91  				       struct ibtrs_addr *paths,
ea541da7d Roman Pen 2018-05-18   92  				       size_t *path_cnt,
ea541da7d Roman Pen 2018-05-18   93  				       size_t max_path_cnt,
ea541da7d Roman Pen 2018-05-18   94  				       char *pathname,
ea541da7d Roman Pen 2018-05-18   95  				       enum ibnbd_access_mode *access_mode,
ea541da7d Roman Pen 2018-05-18   96  				       enum ibnbd_io_mode *io_mode)
ea541da7d Roman Pen 2018-05-18   97  {
ea541da7d Roman Pen 2018-05-18   98  	char *options, *sep_opt;
ea541da7d Roman Pen 2018-05-18   99  	char *p;
ea541da7d Roman Pen 2018-05-18  100  	substring_t args[MAX_OPT_ARGS];
ea541da7d Roman Pen 2018-05-18  101  	int opt_mask = 0;
ea541da7d Roman Pen 2018-05-18  102  	int token;
ea541da7d Roman Pen 2018-05-18  103  	int ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  104  	int i;
ea541da7d Roman Pen 2018-05-18  105  	int p_cnt = 0;
ea541da7d Roman Pen 2018-05-18  106  
ea541da7d Roman Pen 2018-05-18  107  	options = kstrdup(buf, GFP_KERNEL);
ea541da7d Roman Pen 2018-05-18  108  	if (!options)
ea541da7d Roman Pen 2018-05-18  109  		return -ENOMEM;
ea541da7d Roman Pen 2018-05-18  110  
ea541da7d Roman Pen 2018-05-18  111  	sep_opt = strstrip(options);
ea541da7d Roman Pen 2018-05-18  112  	strip(sep_opt);
ea541da7d Roman Pen 2018-05-18  113  	while ((p = strsep(&sep_opt, " ")) != NULL) {
ea541da7d Roman Pen 2018-05-18  114  		if (!*p)
ea541da7d Roman Pen 2018-05-18  115  			continue;
ea541da7d Roman Pen 2018-05-18  116  
ea541da7d Roman Pen 2018-05-18  117  		token = match_token(p, ibnbd_opt_tokens, args);
ea541da7d Roman Pen 2018-05-18  118  		opt_mask |= token;
ea541da7d Roman Pen 2018-05-18  119  
ea541da7d Roman Pen 2018-05-18  120  		switch (token) {
ea541da7d Roman Pen 2018-05-18  121  		case IBNBD_OPT_SESSNAME:
ea541da7d Roman Pen 2018-05-18  122  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  123  			if (!p) {
ea541da7d Roman Pen 2018-05-18  124  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  125  				goto out;
ea541da7d Roman Pen 2018-05-18  126  			}
ea541da7d Roman Pen 2018-05-18  127  			if (strlen(p) > NAME_MAX) {
ea541da7d Roman Pen 2018-05-18  128  				pr_err("map_device: sessname too long\n");
ea541da7d Roman Pen 2018-05-18  129  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  130  				kfree(p);
ea541da7d Roman Pen 2018-05-18  131  				goto out;
ea541da7d Roman Pen 2018-05-18  132  			}
ea541da7d Roman Pen 2018-05-18  133  			strlcpy(sessname, p, NAME_MAX);
ea541da7d Roman Pen 2018-05-18  134  			kfree(p);
ea541da7d Roman Pen 2018-05-18  135  			break;
ea541da7d Roman Pen 2018-05-18  136  
ea541da7d Roman Pen 2018-05-18  137  		case IBNBD_OPT_PATH:
ea541da7d Roman Pen 2018-05-18  138  			if (p_cnt >= max_path_cnt) {
ea541da7d Roman Pen 2018-05-18 @139  				pr_err("map_device: too many (> %lu) paths "
ea541da7d Roman Pen 2018-05-18  140  				       "provided\n", max_path_cnt);
ea541da7d Roman Pen 2018-05-18  141  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  142  				goto out;
ea541da7d Roman Pen 2018-05-18  143  			}
ea541da7d Roman Pen 2018-05-18  144  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  145  			if (!p) {
ea541da7d Roman Pen 2018-05-18  146  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  147  				goto out;
ea541da7d Roman Pen 2018-05-18  148  			}
ea541da7d Roman Pen 2018-05-18  149  
ea541da7d Roman Pen 2018-05-18  150  			ret = ibtrs_addr_to_sockaddr(p, strlen(p), IBTRS_PORT,
ea541da7d Roman Pen 2018-05-18  151  						     &paths[p_cnt]);
ea541da7d Roman Pen 2018-05-18  152  			if (ret) {
ea541da7d Roman Pen 2018-05-18  153  				pr_err("Can't parse path %s: %d\n", p, ret);
ea541da7d Roman Pen 2018-05-18  154  				kfree(p);
ea541da7d Roman Pen 2018-05-18  155  				goto out;
ea541da7d Roman Pen 2018-05-18  156  			}
ea541da7d Roman Pen 2018-05-18  157  
ea541da7d Roman Pen 2018-05-18  158  			p_cnt++;
ea541da7d Roman Pen 2018-05-18  159  
ea541da7d Roman Pen 2018-05-18  160  			kfree(p);
ea541da7d Roman Pen 2018-05-18  161  			break;
ea541da7d Roman Pen 2018-05-18  162  
ea541da7d Roman Pen 2018-05-18  163  		case IBNBD_OPT_DEV_PATH:
ea541da7d Roman Pen 2018-05-18  164  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  165  			if (!p) {
ea541da7d Roman Pen 2018-05-18  166  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  167  				goto out;
ea541da7d Roman Pen 2018-05-18  168  			}
ea541da7d Roman Pen 2018-05-18  169  			if (strlen(p) > NAME_MAX) {
ea541da7d Roman Pen 2018-05-18  170  				pr_err("map_device: Device path too long\n");
ea541da7d Roman Pen 2018-05-18  171  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  172  				kfree(p);
ea541da7d Roman Pen 2018-05-18  173  				goto out;
ea541da7d Roman Pen 2018-05-18  174  			}
ea541da7d Roman Pen 2018-05-18  175  			strlcpy(pathname, p, NAME_MAX);
ea541da7d Roman Pen 2018-05-18  176  			kfree(p);
ea541da7d Roman Pen 2018-05-18  177  			break;
ea541da7d Roman Pen 2018-05-18  178  
ea541da7d Roman Pen 2018-05-18  179  		case IBNBD_OPT_ACCESS_MODE:
ea541da7d Roman Pen 2018-05-18  180  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  181  			if (!p) {
ea541da7d Roman Pen 2018-05-18  182  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  183  				goto out;
ea541da7d Roman Pen 2018-05-18  184  			}
ea541da7d Roman Pen 2018-05-18  185  
ea541da7d Roman Pen 2018-05-18  186  			if (!strcmp(p, "ro")) {
ea541da7d Roman Pen 2018-05-18  187  				*access_mode = IBNBD_ACCESS_RO;
ea541da7d Roman Pen 2018-05-18  188  			} else if (!strcmp(p, "rw")) {
ea541da7d Roman Pen 2018-05-18  189  				*access_mode = IBNBD_ACCESS_RW;
ea541da7d Roman Pen 2018-05-18  190  			} else if (!strcmp(p, "migration")) {
ea541da7d Roman Pen 2018-05-18  191  				*access_mode = IBNBD_ACCESS_MIGRATION;
ea541da7d Roman Pen 2018-05-18  192  			} else {
ea541da7d Roman Pen 2018-05-18  193  				pr_err("map_device: Invalid access_mode:"
ea541da7d Roman Pen 2018-05-18  194  				       " '%s'\n", p);
ea541da7d Roman Pen 2018-05-18  195  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  196  				kfree(p);
ea541da7d Roman Pen 2018-05-18  197  				goto out;
ea541da7d Roman Pen 2018-05-18  198  			}
ea541da7d Roman Pen 2018-05-18  199  
ea541da7d Roman Pen 2018-05-18  200  			kfree(p);
ea541da7d Roman Pen 2018-05-18  201  			break;
ea541da7d Roman Pen 2018-05-18  202  
ea541da7d Roman Pen 2018-05-18  203  		case IBNBD_OPT_IO_MODE:
ea541da7d Roman Pen 2018-05-18  204  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  205  			if (!p) {
ea541da7d Roman Pen 2018-05-18  206  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  207  				goto out;
ea541da7d Roman Pen 2018-05-18  208  			}
ea541da7d Roman Pen 2018-05-18  209  			if (!strcmp(p, "blockio")) {
ea541da7d Roman Pen 2018-05-18  210  				*io_mode = IBNBD_BLOCKIO;
ea541da7d Roman Pen 2018-05-18  211  			} else if (!strcmp(p, "fileio")) {
ea541da7d Roman Pen 2018-05-18  212  				*io_mode = IBNBD_FILEIO;
ea541da7d Roman Pen 2018-05-18  213  			} else {
ea541da7d Roman Pen 2018-05-18  214  				pr_err("map_device: Invalid io_mode: '%s'.\n",
ea541da7d Roman Pen 2018-05-18  215  				       p);
ea541da7d Roman Pen 2018-05-18  216  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  217  				kfree(p);
ea541da7d Roman Pen 2018-05-18  218  				goto out;
ea541da7d Roman Pen 2018-05-18  219  			}
ea541da7d Roman Pen 2018-05-18  220  			kfree(p);
ea541da7d Roman Pen 2018-05-18  221  			break;
ea541da7d Roman Pen 2018-05-18  222  
ea541da7d Roman Pen 2018-05-18  223  		default:
ea541da7d Roman Pen 2018-05-18  224  			pr_err("map_device: Unknown parameter or missing value"
ea541da7d Roman Pen 2018-05-18  225  			       " '%s'\n", p);
ea541da7d Roman Pen 2018-05-18  226  			ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  227  			goto out;
ea541da7d Roman Pen 2018-05-18  228  		}
ea541da7d Roman Pen 2018-05-18  229  	}
ea541da7d Roman Pen 2018-05-18  230  
ea541da7d Roman Pen 2018-05-18  231  	for (i = 0; i < ARRAY_SIZE(ibnbd_opt_mandatory); i++) {
ea541da7d Roman Pen 2018-05-18  232  		if ((opt_mask & ibnbd_opt_mandatory[i])) {
ea541da7d Roman Pen 2018-05-18  233  			ret = 0;
ea541da7d Roman Pen 2018-05-18  234  		} else {
ea541da7d Roman Pen 2018-05-18  235  			pr_err("map_device: Parameters missing\n");
ea541da7d Roman Pen 2018-05-18  236  			ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  237  			break;
ea541da7d Roman Pen 2018-05-18  238  		}
ea541da7d Roman Pen 2018-05-18  239  	}
ea541da7d Roman Pen 2018-05-18  240  
ea541da7d Roman Pen 2018-05-18  241  out:
ea541da7d Roman Pen 2018-05-18  242  	*path_cnt = p_cnt;
ea541da7d Roman Pen 2018-05-18  243  	kfree(options);
ea541da7d Roman Pen 2018-05-18  244  	return ret;
ea541da7d Roman Pen 2018-05-18  245  }
ea541da7d Roman Pen 2018-05-18  246  

:::::: The code at line 139 was first introduced by commit
:::::: ea541da7d8b2518d2b1d68d23d19bb13cca1119b ibnbd: client: sysfs interface functions

:::::: TO: Roman Pen <roman.penyaev@profitbricks.com>
:::::: CC: 0day robot <lkp@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 47812 bytes --]

* Re: [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation
  2018-05-18 13:04 ` [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation Roman Pen
  2018-05-20 17:21   ` kbuild test robot
@ 2018-05-20 22:14   ` kbuild test robot
  2018-05-21  5:33   ` kbuild test robot
  2 siblings, 0 replies; 55+ messages in thread
From: kbuild test robot @ 2018-05-20 22:14 UTC (permalink / raw)
  To: Roman Pen
  Cc: kbuild-all, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang, Roman Pen

Hi Roman,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc5 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Roman-Pen/InfiniBand-Transport-IBTRS-and-Network-Block-Device-IBNBD/20180520-222445
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/block/ibnbd/ibnbd-clt.c:133:39: sparse: expression using sizeof(void)
>> drivers/block/ibnbd/ibnbd-clt.c:133:39: sparse: expression using sizeof(void)
   drivers/block/ibnbd/ibnbd-clt.c:135:37: sparse: expression using sizeof(void)
   drivers/block/ibnbd/ibnbd-clt.c:135:37: sparse: expression using sizeof(void)
   drivers/block/ibnbd/ibnbd-clt.c:592:29: sparse: expression using sizeof(void)
--
>> drivers/block/ibnbd/ibnbd-srv.c:357:48: sparse: incorrect type in argument 1 (different base types) @@    expected int [signed] dev_id @@    got restricted __le32 const [usertype] device_id @@
   drivers/block/ibnbd/ibnbd-srv.c:357:48:    expected int [signed] dev_id
   drivers/block/ibnbd/ibnbd-srv.c:357:48:    got restricted __le32 const [usertype] device_id
>> drivers/block/ibnbd/ibnbd-srv.c:696:25: sparse: expression using sizeof(void)
   include/linux/blkdev.h:1105:24: sparse: expression using sizeof(void)
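
The __le32 warning at ibnbd-srv.c:357 says a little-endian wire field (device_id) is passed where a host-order int (dev_id) is expected. The usual remedy is to convert first, much as the client does with le32_to_cpu(rsp->device_id). A minimal userspace sketch of that pattern follows; le32_to_host(), lookup_dev() and handle_msg() are hypothetical names for illustration and are not taken from the patch:

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's le32_to_cpu(). */
static uint32_t le32_to_host(const unsigned char b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}

/* Hypothetical lookup taking a host-order id, like an 'int dev_id' parameter. */
static int lookup_dev(int dev_id)
{
	return dev_id;
}

/* Convert the wire value to host order before handing it on. */
static int handle_msg(const unsigned char wire_device_id[4])
{
	return lookup_dev((int)le32_to_host(wire_device_id));
}
```

In the kernel the conversion would presumably be a single le32_to_cpu() at the call site, which also lets sparse verify the types.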

vim +133 drivers/block/ibnbd/ibnbd-clt.c

563b98df Roman Pen 2018-05-18  108  
563b98df Roman Pen 2018-05-18  109  static int ibnbd_clt_set_dev_attr(struct ibnbd_clt_dev *dev,
563b98df Roman Pen 2018-05-18  110  				  const struct ibnbd_msg_open_rsp *rsp)
563b98df Roman Pen 2018-05-18  111  {
563b98df Roman Pen 2018-05-18  112  	struct ibnbd_clt_session *sess = dev->sess;
563b98df Roman Pen 2018-05-18  113  
563b98df Roman Pen 2018-05-18  114  	if (unlikely(!rsp->logical_block_size))
563b98df Roman Pen 2018-05-18  115  		return -EINVAL;
563b98df Roman Pen 2018-05-18  116  
563b98df Roman Pen 2018-05-18  117  	dev->device_id		    = le32_to_cpu(rsp->device_id);
563b98df Roman Pen 2018-05-18  118  	dev->nsectors		    = le64_to_cpu(rsp->nsectors);
563b98df Roman Pen 2018-05-18  119  	dev->logical_block_size	    = le16_to_cpu(rsp->logical_block_size);
563b98df Roman Pen 2018-05-18  120  	dev->physical_block_size    = le16_to_cpu(rsp->physical_block_size);
563b98df Roman Pen 2018-05-18  121  	dev->max_write_same_sectors = le32_to_cpu(rsp->max_write_same_sectors);
563b98df Roman Pen 2018-05-18  122  	dev->max_discard_sectors    = le32_to_cpu(rsp->max_discard_sectors);
563b98df Roman Pen 2018-05-18  123  	dev->discard_granularity    = le32_to_cpu(rsp->discard_granularity);
563b98df Roman Pen 2018-05-18  124  	dev->discard_alignment	    = le32_to_cpu(rsp->discard_alignment);
563b98df Roman Pen 2018-05-18  125  	dev->secure_discard	    = le16_to_cpu(rsp->secure_discard);
563b98df Roman Pen 2018-05-18  126  	dev->rotational		    = rsp->rotational;
563b98df Roman Pen 2018-05-18  127  	dev->remote_io_mode	    = rsp->io_mode;
563b98df Roman Pen 2018-05-18  128  
563b98df Roman Pen 2018-05-18  129  	dev->max_hw_sectors = sess->max_io_size / dev->logical_block_size;
563b98df Roman Pen 2018-05-18  130  	dev->max_segments = BMAX_SEGMENTS;
563b98df Roman Pen 2018-05-18  131  
563b98df Roman Pen 2018-05-18  132  	if (dev->remote_io_mode == IBNBD_BLOCKIO) {
563b98df Roman Pen 2018-05-18 @133  		dev->max_hw_sectors = min_t(u32, dev->max_hw_sectors,
563b98df Roman Pen 2018-05-18  134  					    le32_to_cpu(rsp->max_hw_sectors));
563b98df Roman Pen 2018-05-18  135  		dev->max_segments = min_t(u16, dev->max_segments,
563b98df Roman Pen 2018-05-18  136  					  le16_to_cpu(rsp->max_segments));
563b98df Roman Pen 2018-05-18  137  	}
563b98df Roman Pen 2018-05-18  138  
563b98df Roman Pen 2018-05-18  139  	return 0;
563b98df Roman Pen 2018-05-18  140  }
563b98df Roman Pen 2018-05-18  141  

:::::: The code at line 133 was first introduced by commit
:::::: 563b98df79220ea51ec7d61fa671c810eef1db6b ibnbd: client: main functionality

:::::: TO: Roman Pen <roman.penyaev@profitbricks.com>
:::::: CC: 0day robot <lkp@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
  2018-05-18 13:04 ` [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation Roman Pen
@ 2018-05-20 22:14   ` kbuild test robot
  2018-05-21  6:36   ` kbuild test robot
  2018-05-22  5:05   ` Leon Romanovsky
  2 siblings, 0 replies; 55+ messages in thread
From: kbuild test robot @ 2018-05-20 22:14 UTC (permalink / raw)
  To: Roman Pen
  Cc: kbuild-all, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang, Roman Pen

[-- Attachment #1: Type: text/plain, Size: 6640 bytes --]

Hi Roman,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc5 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Roman-Pen/InfiniBand-Transport-IBTRS-and-Network-Block-Device-IBNBD/20180520-222445
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All warnings (new ones prefixed by >>):

   In file included from arch/m68k/include/asm/atomic.h:7:0,
                    from include/linux/atomic.h:5,
                    from include/linux/spinlock.h:399,
                    from include/linux/seqlock.h:36,
                    from include/linux/time.h:6,
                    from include/linux/stat.h:19,
                    from include/linux/module.h:10,
                    from drivers/infiniband//ulp/ibtrs/ibtrs-clt.c:34:
   drivers/infiniband//ulp/ibtrs/ibtrs-clt.c: In function 'ibtrs_clt_remove_path_from_arr':
   arch/m68k/include/asm/cmpxchg.h:122:3: warning: value computed is not used [-Wunused-value]
     ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o),     \
     ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       (unsigned long)(n), sizeof(*(ptr))))
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/infiniband//ulp/ibtrs/ibtrs-clt.c:1456:3: note: in expansion of macro 'cmpxchg'
      cmpxchg(ppcpu_path, sess, next);
      ^~~~~~~
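
On m68k, cmpxchg() expands to a cast expression, so gcc emits -Wunused-value when the result is discarded. One common way to silence this is to state explicitly that the old value is not needed, for example with a void cast. The sketch below is a userspace approximation that assumes the GCC/Clang __sync_val_compare_and_swap builtin; cmpxchg_sketch() and replace_if_same() are hypothetical names, not kernel API:

```c
struct sess {
	int id;
};

/*
 * Userspace stand-in for the kernel's cmpxchg(): evaluates to the old value.
 * The real kernel macro is architecture specific.
 */
#define cmpxchg_sketch(ptr, old, new) \
	((__typeof__(*(ptr)))__sync_val_compare_and_swap((ptr), (old), (new)))

/* Replace *slot with next only if it still points at victim. */
static void replace_if_same(struct sess **slot, struct sess *victim,
			    struct sess *next)
{
	/* The result is deliberately ignored; the void cast says so. */
	(void)cmpxchg_sketch(slot, victim, next);
}
```

Whether a void cast or an explicit check of the return value is the better fix for line 1456 is a call for the patch author.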

vim +/cmpxchg +1456 drivers/infiniband//ulp/ibtrs/ibtrs-clt.c

44463323 Roman Pen 2018-05-18  1396  
44463323 Roman Pen 2018-05-18  1397  static void ibtrs_clt_remove_path_from_arr(struct ibtrs_clt_sess *sess)
44463323 Roman Pen 2018-05-18  1398  {
44463323 Roman Pen 2018-05-18  1399  	struct ibtrs_clt *clt = sess->clt;
44463323 Roman Pen 2018-05-18  1400  	struct ibtrs_clt_sess *next;
44463323 Roman Pen 2018-05-18  1401  	int cpu;
44463323 Roman Pen 2018-05-18  1402  
44463323 Roman Pen 2018-05-18  1403  	mutex_lock(&clt->paths_mutex);
44463323 Roman Pen 2018-05-18  1404  	list_del_rcu(&sess->s.entry);
44463323 Roman Pen 2018-05-18  1405  
44463323 Roman Pen 2018-05-18  1406  	/* Make sure everybody observes path removal. */
44463323 Roman Pen 2018-05-18  1407  	synchronize_rcu();
44463323 Roman Pen 2018-05-18  1408  
44463323 Roman Pen 2018-05-18  1409  	/*
44463323 Roman Pen 2018-05-18  1410  	 * Decrement paths number only after grace period, because
44463323 Roman Pen 2018-05-18  1411  	 * caller of do_each_path() must firstly observe list without
44463323 Roman Pen 2018-05-18  1412  	 * path and only then decremented paths number.
44463323 Roman Pen 2018-05-18  1413  	 *
44463323 Roman Pen 2018-05-18  1414  	 * Otherwise there can be the following situation:
44463323 Roman Pen 2018-05-18  1415  	 *    o Two paths exist and IO is coming.
44463323 Roman Pen 2018-05-18  1416  	 *    o One path is removed:
44463323 Roman Pen 2018-05-18  1417  	 *      CPU#0                          CPU#1
44463323 Roman Pen 2018-05-18  1418  	 *      do_each_path():                ibtrs_clt_remove_path_from_arr():
44463323 Roman Pen 2018-05-18  1419  	 *          path = get_next_path()
44463323 Roman Pen 2018-05-18  1420  	 *          ^^^                            list_del_rcu(path)
44463323 Roman Pen 2018-05-18  1421  	 *          [!CONNECTED path]              clt->paths_num--
44463323 Roman Pen 2018-05-18  1422  	 *                                              ^^^^^^^^^
44463323 Roman Pen 2018-05-18  1423  	 *          load clt->paths_num                 from 2 to 1
44463323 Roman Pen 2018-05-18  1424  	 *                    ^^^^^^^^^
44463323 Roman Pen 2018-05-18  1425  	 *                    sees 1
44463323 Roman Pen 2018-05-18  1426  	 *
44463323 Roman Pen 2018-05-18  1427  	 *      path is observed as !CONNECTED, but do_each_path() loop
44463323 Roman Pen 2018-05-18  1428  	 *      ends, because expression i < clt->paths_num is false.
44463323 Roman Pen 2018-05-18  1429  	 */
44463323 Roman Pen 2018-05-18  1430  	clt->paths_num--;
44463323 Roman Pen 2018-05-18  1431  
44463323 Roman Pen 2018-05-18  1432  	next = list_next_or_null_rr_rcu(&clt->paths_list, &sess->s.entry,
44463323 Roman Pen 2018-05-18  1433  					typeof(*next), s.entry);
44463323 Roman Pen 2018-05-18  1434  
44463323 Roman Pen 2018-05-18  1435  	/*
44463323 Roman Pen 2018-05-18  1436  	 * Pcpu paths can still point to the path which is going to be
44463323 Roman Pen 2018-05-18  1437  	 * removed, so change the pointer manually.
44463323 Roman Pen 2018-05-18  1438  	 */
44463323 Roman Pen 2018-05-18  1439  	for_each_possible_cpu(cpu) {
44463323 Roman Pen 2018-05-18  1440  		struct ibtrs_clt_sess **ppcpu_path;
44463323 Roman Pen 2018-05-18  1441  
44463323 Roman Pen 2018-05-18  1442  		ppcpu_path = per_cpu_ptr(clt->pcpu_path, cpu);
44463323 Roman Pen 2018-05-18  1443  		if (*ppcpu_path != sess)
44463323 Roman Pen 2018-05-18  1444  			/*
44463323 Roman Pen 2018-05-18  1445  			 * synchronize_rcu() was called just after deleting
44463323 Roman Pen 2018-05-18  1446  			 * entry from the list, thus IO code path cannot
44463323 Roman Pen 2018-05-18  1447  			 * change pointer back to the pointer which is going
44463323 Roman Pen 2018-05-18  1448  			 * to be removed, we are safe here.
44463323 Roman Pen 2018-05-18  1449  			 */
44463323 Roman Pen 2018-05-18  1450  			continue;
44463323 Roman Pen 2018-05-18  1451  
44463323 Roman Pen 2018-05-18  1452  		/*
44463323 Roman Pen 2018-05-18  1453  		 * We race with IO code path, which also changes pointer,
44463323 Roman Pen 2018-05-18  1454  		 * thus we have to be careful not to override it.
44463323 Roman Pen 2018-05-18  1455  		 */
44463323 Roman Pen 2018-05-18 @1456  		cmpxchg(ppcpu_path, sess, next);
44463323 Roman Pen 2018-05-18  1457  	}
44463323 Roman Pen 2018-05-18  1458  	mutex_unlock(&clt->paths_mutex);
44463323 Roman Pen 2018-05-18  1459  }
44463323 Roman Pen 2018-05-18  1460  

:::::: The code at line 1456 was first introduced by commit
:::::: 4446332354bf8cf878755e55a221c59eb55a0f1f ibtrs: client: main functionality

:::::: TO: Roman Pen <roman.penyaev@profitbricks.com>
:::::: CC: 0day robot <lkp@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45123 bytes --]

* Re: [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation
  2018-05-18 13:04 ` [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation Roman Pen
  2018-05-20 17:21   ` kbuild test robot
  2018-05-20 22:14   ` kbuild test robot
@ 2018-05-21  5:33   ` kbuild test robot
  2 siblings, 0 replies; 55+ messages in thread
From: kbuild test robot @ 2018-05-21  5:33 UTC (permalink / raw)
  To: Roman Pen
  Cc: kbuild-all, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang, Roman Pen

[-- Attachment #1: Type: text/plain, Size: 10593 bytes --]

Hi Roman,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc6 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Roman-Pen/InfiniBand-Transport-IBTRS-and-Network-Block-Device-IBNBD/20180520-222445
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/block/ibnbd/ibnbd-clt-sysfs.c: In function 'ibnbd_clt_parse_map_options':
>> drivers/block/ibnbd/ibnbd-clt-sysfs.c:139:12: warning: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type 'size_t {aka unsigned int}' [-Wformat=]
        pr_err("map_device: too many (> %lu) paths "
               ^~~~~~
   drivers/block/ibnbd/ibnbd-clt-sysfs.c: In function 'ibnbd_clt_map_device_store':
   drivers/block/ibnbd/ibnbd-clt-sysfs.c:613:1: warning: the frame size of 1612 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^
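
The -Wformat warning arises because size_t is unsigned int on i386, while '%lu' expects unsigned long. The 'z' length modifier ('%zu') matches size_t on both 32-bit and 64-bit targets, and is accepted by printk-style functions as well as C99 printf. A minimal userspace sketch, using snprintf() as a stand-in for pr_err() (format_too_many() is a hypothetical helper name):

```c
#include <stdio.h>
#include <stddef.h>

/* Formats the "too many paths" message; returns snprintf()'s result. */
static int format_too_many(char *buf, size_t len, size_t max_path_cnt)
{
	/* %zu matches size_t regardless of the target's word size. */
	return snprintf(buf, len,
			"map_device: too many (> %zu) paths provided\n",
			max_path_cnt);
}
```

Keeping the message on a single line also avoids splitting the format string across two literals.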

vim +139 drivers/block/ibnbd/ibnbd-clt-sysfs.c

ea541da7d Roman Pen 2018-05-18   88  
ea541da7d Roman Pen 2018-05-18   89  static int ibnbd_clt_parse_map_options(const char *buf,
ea541da7d Roman Pen 2018-05-18   90  				       char *sessname,
ea541da7d Roman Pen 2018-05-18   91  				       struct ibtrs_addr *paths,
ea541da7d Roman Pen 2018-05-18   92  				       size_t *path_cnt,
ea541da7d Roman Pen 2018-05-18   93  				       size_t max_path_cnt,
ea541da7d Roman Pen 2018-05-18   94  				       char *pathname,
ea541da7d Roman Pen 2018-05-18   95  				       enum ibnbd_access_mode *access_mode,
ea541da7d Roman Pen 2018-05-18   96  				       enum ibnbd_io_mode *io_mode)
ea541da7d Roman Pen 2018-05-18   97  {
ea541da7d Roman Pen 2018-05-18   98  	char *options, *sep_opt;
ea541da7d Roman Pen 2018-05-18   99  	char *p;
ea541da7d Roman Pen 2018-05-18  100  	substring_t args[MAX_OPT_ARGS];
ea541da7d Roman Pen 2018-05-18  101  	int opt_mask = 0;
ea541da7d Roman Pen 2018-05-18  102  	int token;
ea541da7d Roman Pen 2018-05-18  103  	int ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  104  	int i;
ea541da7d Roman Pen 2018-05-18  105  	int p_cnt = 0;
ea541da7d Roman Pen 2018-05-18  106  
ea541da7d Roman Pen 2018-05-18  107  	options = kstrdup(buf, GFP_KERNEL);
ea541da7d Roman Pen 2018-05-18  108  	if (!options)
ea541da7d Roman Pen 2018-05-18  109  		return -ENOMEM;
ea541da7d Roman Pen 2018-05-18  110  
ea541da7d Roman Pen 2018-05-18  111  	sep_opt = strstrip(options);
ea541da7d Roman Pen 2018-05-18  112  	strip(sep_opt);
ea541da7d Roman Pen 2018-05-18  113  	while ((p = strsep(&sep_opt, " ")) != NULL) {
ea541da7d Roman Pen 2018-05-18  114  		if (!*p)
ea541da7d Roman Pen 2018-05-18  115  			continue;
ea541da7d Roman Pen 2018-05-18  116  
ea541da7d Roman Pen 2018-05-18  117  		token = match_token(p, ibnbd_opt_tokens, args);
ea541da7d Roman Pen 2018-05-18  118  		opt_mask |= token;
ea541da7d Roman Pen 2018-05-18  119  
ea541da7d Roman Pen 2018-05-18  120  		switch (token) {
ea541da7d Roman Pen 2018-05-18  121  		case IBNBD_OPT_SESSNAME:
ea541da7d Roman Pen 2018-05-18  122  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  123  			if (!p) {
ea541da7d Roman Pen 2018-05-18  124  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  125  				goto out;
ea541da7d Roman Pen 2018-05-18  126  			}
ea541da7d Roman Pen 2018-05-18  127  			if (strlen(p) > NAME_MAX) {
ea541da7d Roman Pen 2018-05-18  128  				pr_err("map_device: sessname too long\n");
ea541da7d Roman Pen 2018-05-18  129  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  130  				kfree(p);
ea541da7d Roman Pen 2018-05-18  131  				goto out;
ea541da7d Roman Pen 2018-05-18  132  			}
ea541da7d Roman Pen 2018-05-18  133  			strlcpy(sessname, p, NAME_MAX);
ea541da7d Roman Pen 2018-05-18  134  			kfree(p);
ea541da7d Roman Pen 2018-05-18  135  			break;
ea541da7d Roman Pen 2018-05-18  136  
ea541da7d Roman Pen 2018-05-18  137  		case IBNBD_OPT_PATH:
ea541da7d Roman Pen 2018-05-18  138  			if (p_cnt >= max_path_cnt) {
ea541da7d Roman Pen 2018-05-18 @139  				pr_err("map_device: too many (> %lu) paths "
ea541da7d Roman Pen 2018-05-18  140  				       "provided\n", max_path_cnt);
ea541da7d Roman Pen 2018-05-18  141  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  142  				goto out;
ea541da7d Roman Pen 2018-05-18  143  			}
ea541da7d Roman Pen 2018-05-18  144  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  145  			if (!p) {
ea541da7d Roman Pen 2018-05-18  146  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  147  				goto out;
ea541da7d Roman Pen 2018-05-18  148  			}
ea541da7d Roman Pen 2018-05-18  149  
ea541da7d Roman Pen 2018-05-18  150  			ret = ibtrs_addr_to_sockaddr(p, strlen(p), IBTRS_PORT,
ea541da7d Roman Pen 2018-05-18  151  						     &paths[p_cnt]);
ea541da7d Roman Pen 2018-05-18  152  			if (ret) {
ea541da7d Roman Pen 2018-05-18  153  				pr_err("Can't parse path %s: %d\n", p, ret);
ea541da7d Roman Pen 2018-05-18  154  				kfree(p);
ea541da7d Roman Pen 2018-05-18  155  				goto out;
ea541da7d Roman Pen 2018-05-18  156  			}
ea541da7d Roman Pen 2018-05-18  157  
ea541da7d Roman Pen 2018-05-18  158  			p_cnt++;
ea541da7d Roman Pen 2018-05-18  159  
ea541da7d Roman Pen 2018-05-18  160  			kfree(p);
ea541da7d Roman Pen 2018-05-18  161  			break;
ea541da7d Roman Pen 2018-05-18  162  
ea541da7d Roman Pen 2018-05-18  163  		case IBNBD_OPT_DEV_PATH:
ea541da7d Roman Pen 2018-05-18  164  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  165  			if (!p) {
ea541da7d Roman Pen 2018-05-18  166  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  167  				goto out;
ea541da7d Roman Pen 2018-05-18  168  			}
ea541da7d Roman Pen 2018-05-18  169  			if (strlen(p) > NAME_MAX) {
ea541da7d Roman Pen 2018-05-18  170  				pr_err("map_device: Device path too long\n");
ea541da7d Roman Pen 2018-05-18  171  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  172  				kfree(p);
ea541da7d Roman Pen 2018-05-18  173  				goto out;
ea541da7d Roman Pen 2018-05-18  174  			}
ea541da7d Roman Pen 2018-05-18  175  			strlcpy(pathname, p, NAME_MAX);
ea541da7d Roman Pen 2018-05-18  176  			kfree(p);
ea541da7d Roman Pen 2018-05-18  177  			break;
ea541da7d Roman Pen 2018-05-18  178  
ea541da7d Roman Pen 2018-05-18  179  		case IBNBD_OPT_ACCESS_MODE:
ea541da7d Roman Pen 2018-05-18  180  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  181  			if (!p) {
ea541da7d Roman Pen 2018-05-18  182  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  183  				goto out;
ea541da7d Roman Pen 2018-05-18  184  			}
ea541da7d Roman Pen 2018-05-18  185  
ea541da7d Roman Pen 2018-05-18  186  			if (!strcmp(p, "ro")) {
ea541da7d Roman Pen 2018-05-18  187  				*access_mode = IBNBD_ACCESS_RO;
ea541da7d Roman Pen 2018-05-18  188  			} else if (!strcmp(p, "rw")) {
ea541da7d Roman Pen 2018-05-18  189  				*access_mode = IBNBD_ACCESS_RW;
ea541da7d Roman Pen 2018-05-18  190  			} else if (!strcmp(p, "migration")) {
ea541da7d Roman Pen 2018-05-18  191  				*access_mode = IBNBD_ACCESS_MIGRATION;
ea541da7d Roman Pen 2018-05-18  192  			} else {
ea541da7d Roman Pen 2018-05-18  193  				pr_err("map_device: Invalid access_mode:"
ea541da7d Roman Pen 2018-05-18  194  				       " '%s'\n", p);
ea541da7d Roman Pen 2018-05-18  195  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  196  				kfree(p);
ea541da7d Roman Pen 2018-05-18  197  				goto out;
ea541da7d Roman Pen 2018-05-18  198  			}
ea541da7d Roman Pen 2018-05-18  199  
ea541da7d Roman Pen 2018-05-18  200  			kfree(p);
ea541da7d Roman Pen 2018-05-18  201  			break;
ea541da7d Roman Pen 2018-05-18  202  
ea541da7d Roman Pen 2018-05-18  203  		case IBNBD_OPT_IO_MODE:
ea541da7d Roman Pen 2018-05-18  204  			p = match_strdup(args);
ea541da7d Roman Pen 2018-05-18  205  			if (!p) {
ea541da7d Roman Pen 2018-05-18  206  				ret = -ENOMEM;
ea541da7d Roman Pen 2018-05-18  207  				goto out;
ea541da7d Roman Pen 2018-05-18  208  			}
ea541da7d Roman Pen 2018-05-18  209  			if (!strcmp(p, "blockio")) {
ea541da7d Roman Pen 2018-05-18  210  				*io_mode = IBNBD_BLOCKIO;
ea541da7d Roman Pen 2018-05-18  211  			} else if (!strcmp(p, "fileio")) {
ea541da7d Roman Pen 2018-05-18  212  				*io_mode = IBNBD_FILEIO;
ea541da7d Roman Pen 2018-05-18  213  			} else {
ea541da7d Roman Pen 2018-05-18  214  				pr_err("map_device: Invalid io_mode: '%s'.\n",
ea541da7d Roman Pen 2018-05-18  215  				       p);
ea541da7d Roman Pen 2018-05-18  216  				ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  217  				kfree(p);
ea541da7d Roman Pen 2018-05-18  218  				goto out;
ea541da7d Roman Pen 2018-05-18  219  			}
ea541da7d Roman Pen 2018-05-18  220  			kfree(p);
ea541da7d Roman Pen 2018-05-18  221  			break;
ea541da7d Roman Pen 2018-05-18  222  
ea541da7d Roman Pen 2018-05-18  223  		default:
ea541da7d Roman Pen 2018-05-18  224  			pr_err("map_device: Unknown parameter or missing value"
ea541da7d Roman Pen 2018-05-18  225  			       " '%s'\n", p);
ea541da7d Roman Pen 2018-05-18  226  			ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  227  			goto out;
ea541da7d Roman Pen 2018-05-18  228  		}
ea541da7d Roman Pen 2018-05-18  229  	}
ea541da7d Roman Pen 2018-05-18  230  
ea541da7d Roman Pen 2018-05-18  231  	for (i = 0; i < ARRAY_SIZE(ibnbd_opt_mandatory); i++) {
ea541da7d Roman Pen 2018-05-18  232  		if ((opt_mask & ibnbd_opt_mandatory[i])) {
ea541da7d Roman Pen 2018-05-18  233  			ret = 0;
ea541da7d Roman Pen 2018-05-18  234  		} else {
ea541da7d Roman Pen 2018-05-18  235  			pr_err("map_device: Parameters missing\n");
ea541da7d Roman Pen 2018-05-18  236  			ret = -EINVAL;
ea541da7d Roman Pen 2018-05-18  237  			break;
ea541da7d Roman Pen 2018-05-18  238  		}
ea541da7d Roman Pen 2018-05-18  239  	}
ea541da7d Roman Pen 2018-05-18  240  
ea541da7d Roman Pen 2018-05-18  241  out:
ea541da7d Roman Pen 2018-05-18  242  	*path_cnt = p_cnt;
ea541da7d Roman Pen 2018-05-18  243  	kfree(options);
ea541da7d Roman Pen 2018-05-18  244  	return ret;
ea541da7d Roman Pen 2018-05-18  245  }
ea541da7d Roman Pen 2018-05-18  246  

:::::: The code at line 139 was first introduced by commit
:::::: ea541da7d8b2518d2b1d68d23d19bb13cca1119b ibnbd: client: sysfs interface functions

:::::: TO: Roman Pen <roman.penyaev@profitbricks.com>
:::::: CC: 0day robot <lkp@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 62305 bytes --]

* Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
  2018-05-18 13:04 ` [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation Roman Pen
  2018-05-20 22:14   ` kbuild test robot
@ 2018-05-21  6:36   ` kbuild test robot
  2018-05-22  5:05   ` Leon Romanovsky
  2 siblings, 0 replies; 55+ messages in thread
From: kbuild test robot @ 2018-05-21  6:36 UTC (permalink / raw)
  To: Roman Pen
  Cc: kbuild-all, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, Swapnil Ingle, Danil Kipnis, Jack Wang, Roman Pen

[-- Attachment #1: Type: text/plain, Size: 1410 bytes --]

Hi Roman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc6 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Roman-Pen/InfiniBand-Transport-IBTRS-and-Network-Block-Device-IBNBD/20180520-222445
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   drivers/mtd/nand/raw/nand_base.o: In function `nand_soft_waitrdy':
   nand_base.c:(.text+0x1022): undefined reference to `__udivdi3'
   drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.o: In function `ibtrs_clt_stats_wc_completion_to_str':
>> ibtrs-clt-stats.c:(.text+0x172): undefined reference to `__udivdi3'
   drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.o: In function `ibtrs_clt_stats_sg_list_distr_to_str':
   ibtrs-clt-stats.c:(.text+0x49c): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45432 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-20  0:43       ` Paul E. McKenney
@ 2018-05-21 13:50         ` Roman Penyaev
  2018-05-21 15:16           ` Linus Torvalds
  2018-05-21 15:31           ` Paul E. McKenney
  0 siblings, 2 replies; 55+ messages in thread
From: Roman Penyaev @ 2018-05-21 13:50 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang,
	Linux Kernel Mailing List

On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> Function is going to be used in transport over RDMA module
>> >> in subsequent patches.
>> >>
>> >> Function returns next element in round-robin fashion,
>> >> i.e. head will be skipped.  NULL will be returned if list
>> >> is observed as empty.
>> >>
>> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> Cc: linux-kernel@vger.kernel.org
>> >> ---
>> >>  include/linux/rculist.h | 19 +++++++++++++++++++
>> >>  1 file changed, 19 insertions(+)
>> >>
>> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> index 127f534fec94..b0840d5ab25a 100644
>> >> --- a/include/linux/rculist.h
>> >> +++ b/include/linux/rculist.h
>> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >>  })
>> >>
>> >>  /**
>> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> + * @head:    the head for the list.
>> >> + * @ptr:        the list head to take the next element from.
>> >> + * @type:       the type of the struct this is embedded in.
>> >> + * @memb:       the name of the list_head within the struct.
>> >> + *
>> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> + * but if list is observed as empty, NULL will be returned.
>> >> + *
>> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >
>> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> > are round-robining a given list must all be under the same RCU read-side
>> > critical section.  For example, the following will break badly:
>> >
>> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> >         {
>> >                 struct foo *ret;
>> >
>> >                 rcu_read_lock();
>> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> >                 rcu_read_unlock();  /* BUG */
>> >                 return ret;
>> >         }
>> >
>> > You need a big fat comment stating this, at the very least.  The resulting
>> > bug can be very hard to trigger and even harder to debug.
>> >
>> > And yes, I know that the same restriction applies to list_next_rcu()
>> > and friends.  The difference is that if you try to invoke those in an
>> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
>> > might tempt people to do something broken like take_rr_step() above.
>>
>> Hi Paul,
>>
>> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> connection from the list in order to send a request, i.e. my code is
>> something like the following:
>>
>>         static struct conn *get_and_set_next_conn(void)
>>         {
>>                 struct conn *conn;
>>
>>                 conn = rcu_dereference(rcu_conn);
>>                 if (unlikely(!conn))
>>                     return conn;
>
> Wait.  Don't you need to restart from the beginning of the list in
> this case?  Or does the list never have anything added to it and is
> rcu_conn initially the first element in the list?

Hi Paul,

No, I continue from the pointer, which I assigned on the previous IO
in order to send IO fairly and keep load balanced.

Initially @rcu_conn points to the first element, but elements can
be deleted from the list and list can become empty.

The deletion code is below.

>
>>                 conn = list_next_or_null_rr_rcu(&conn_list,
>>                                                 &conn->entry,
>>                                                 typeof(*conn),
>>                                                 entry);
>>                 rcu_assign_pointer(rcu_conn, conn);
>
> Linus is correct to doubt this code.  You assign a pointer to the current
> element to rcu_conn, which is presumably a per-CPU or global variable.
> So far, so good ...

I use a per-CPU variable; I did not show that in the first example so as
not to overcomplicate the code.

>
>>                 return conn;
>>         }
>>
>>         rcu_read_lock();
>>         conn = get_and_set_next_conn();
>>         if (unlikely(!conn)) {
>>                 /* ... */
>>         }
>>         err = rdma_io(conn, request);
>>         rcu_read_unlock();
>
> ... except that some other CPU might well remove the entry referenced by
> rcu_conn at this point.  It would have to wait for a grace period (e.g.,
> synchronize_rcu()), but the current CPU has exited its RCU read-side
> critical section, and therefore is not blocking the grace period.
> Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> might well be referencing the freelist, or, even worse, some other type
> of structure.
>
> What is your code doing to prevent this from happening?  (There are ways,
> but I want to know what you were doing in this case.)

Probably I should have shown the way of removal at the very beginning,
my fault.  Deletion looks like the following (slightly changed and
simplified for the sake of clarity):

        static void remove_connection(struct conn *conn)
        {
                struct conn *next;
                bool need_to_wait = false;
                int cpu;

                /* Do not let RCU list add/delete happen in parallel */
                mutex_lock(&conn_lock);

                list_del_rcu(&conn->entry);

                /* Make sure everybody observes element removal */
                synchronize_rcu();

                /*
                 * At this point nobody sees @conn in the list, but still we
                 * have dangling pointer @rcu_conn which _can_ point to @conn.
                 * Since nobody can observe @conn in the list, we guarantee
                 * that IO path will not assign @conn to @rcu_conn, i.e.
                 * @rcu_conn can be equal to @conn, but can never again
                 * become @conn.
                 */

                /*
                 * Get @next connection from current @conn which is going to be
                 * removed.
                 */
                next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
                                                typeof(*next), entry);

                /*
                 * Here @rcu_conn can be changed by reader side, so use @cmpxchg
                 * in order to keep fairness in load-balancing and do not touch
                 * the pointer which can be already changed by the IO path.
                 *
                 * Current path can be faster than IO path and the following
                 * race exists:
                 *
                 *   CPU0                         CPU1
                 *   ----                         ----
                 *   conn = rcu_dereference(rcu_conn);
                 *   next = list_next_or_null_rr_rcu(conn)
                 *
                 *                                conn == cmpxchg(rcu_conn, conn, next);
                 *                                synchronize_rcu();
                 *
                 *   rcu_assign_pointer(rcu_conn, next);
                 *   ^^^^^^^^^^^^^^^^^^
                 *
                 *   Here @rcu_conn is already equal to @next (done by @cmpxchg),
                 *   so assignment to the same pointer is harmless.
                 *
                 */
                for_each_possible_cpu(cpu) {
                        struct conn **rcu_conn;

                        rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
                        if (*rcu_conn != conn)
                                /*
                                 * This @cpu will never again pick up @conn,
                                 * so it is safe just to choose next CPU.
                                 */
                                continue;

                        if (conn == cmpxchg(rcu_conn, conn, next))
                                /*
                                 * @rcu_conn was successfully replaced with
                                 * @next, that means that someone can also hold
                                 * a @conn and dereferencing it, so wait for a
                                 * grace period is required.
                                 */
                                need_to_wait = true;
                }
                if (need_to_wait)
                        synchronize_rcu();

                mutex_unlock(&conn_lock);

                kfree(conn);
        }
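
The per-CPU retargeting step of remove_connection() above can be modelled
in userspace with C11 atomics.  This is only a sketch under stated
assumptions: NR_SLOTS, slots[] and retarget_slots() are hypothetical
stand-ins for the per-CPU @rcu_conn pointers and the cmpxchg loop, and the
model has no RCU, no grace periods and no concurrent readers.  It only
shows that a slot is rewritten solely when it still equals the victim, and
that the return value says whether a second grace period would be needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

struct conn {
	int id;
};

/* Hypothetical model of the per-CPU @rcu_conn pointers. */
static _Atomic(struct conn *) slots[NR_SLOTS];

/*
 * Replace every slot that still points to @victim with @next.
 * Returns true if at least one slot was replaced, i.e. the caller
 * would have to wait for one more grace period before freeing @victim.
 */
static bool retarget_slots(struct conn *victim, struct conn *next)
{
	bool need_to_wait = false;
	int i;

	assert(victim != NULL);

	for (i = 0; i < NR_SLOTS; i++) {
		struct conn *expected = victim;

		/* Succeeds only if the slot still equals @victim. */
		if (atomic_compare_exchange_strong(&slots[i], &expected, next))
			need_to_wait = true;
	}
	return need_to_wait;
}
```

A slot that the IO path has already advanced past the victim is left
untouched, which is the fairness property the comment in
remove_connection() describes.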


>
>> i.e. usage of the @next pointer is under an RCU critical section.
>>
>> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
>> > macro that makes it more obvious that the whole thing need to be under
>> > a single RCU read-side critical section?  Such a macro would of course be
>> > an infinite loop if the list never went empty, so presumably there would
>> > be a break or return statement in there somewhere.
>>
>> The difference is that I do not need a loop, I take the @next conn pointer,
>> save it for the following IO request and do IO for current IO request.
>>
>> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
>> of the loop does not look nice, I personally do not like it, i.e.:
>>
>>
>>         static struct conn *get_and_set_next_conn(void)
>>         {
>>                 struct conn *conn;
>>
>>                 conn = rcu_dereference(rcu_conn);
>>                 if (unlikely(!conn))
>>                     return conn;
>>                 list_for_each_entry_rr_rcu(conn, &conn_list,
>>                                            entry) {
>>                         break;
>>                 }
>>                 rcu_assign_pointer(rcu_conn, conn);
>>                 return conn;
>>         }
>>
>>
>> or maybe I did not fully get your idea?
>
> That would not help at all because you are still leaking the pointer out
> of the RCU read-side critical section.  That is completely and utterly
> broken unless you are somehow cleaning up rcu_conn when you remove
> the element.  And getting that cleanup right is -extremely- tricky.
> Unless you have some sort of proof of correctness, you will get a NACK
> from me.

I understand all the consequences of the leaking pointer, and of course
a loop wrapped in RCU lock/unlock is simpler, but in order to keep
load-balancing and IO fairness while avoiding any locks on the IO path,
I've come up with these RCU tricks and the list_next_or_null_rr_rcu()
macro.

> More like this:
>
>         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
>                 do_something_with(conn);
>                 if (done_for_now())
>                         break;
>         }
>
>> >> + */
>> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> >> +({ \
>> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>> >
>> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
>> > Define this within RDMA, where a smaller number of people can more
>> > easily be kept aware of the restrictions on use.  If it turns out to be
>> > more generally useful, we can take a look at exactly what makes sense
>> > more globally.
>>
>> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
>> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
>>
>> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>>
>> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
>> my list_next_or_null_rr_rcu() variant?
>
> Let's start with the basics:  It absolutely does not make sense to leak
> pointers across rcu_read_unlock() unless you have arranged something else
> to protect the pointed-to data in the meantime.  There are a number of ways
> of implementing this protection.  Again, what protection are you using?
>
> Your code at the above URL looks plausible to me at first glance: You
> do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> rcu_read_unlock().  But at second glance, it looks like hctx->queue
> might have the same vulnerability as rcu_conn in your earlier code.

I am not the author of the code at the URL I specified. I provided the
link answering the question to show other possible users of round-robin
semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
case I can't use a loop, I leak the pointer and indeed have to be very
careful.  But perhaps we can come up with some generic solution to cover
both cases: -rr loop and -rr next.

--
Roman

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-21 13:50         ` Roman Penyaev
@ 2018-05-21 15:16           ` Linus Torvalds
  2018-05-21 15:33             ` Paul E. McKenney
  2018-05-21 15:31           ` Paul E. McKenney
  1 sibling, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2018-05-21 15:16 UTC (permalink / raw)
  To: Roman Pen
  Cc: Paul McKenney, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, swapnil.ingle, danil.kipnis, Jinpu Wang,
	Linux Kernel Mailing List

On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.

Right. And that's exactly what has both me and Paul nervous. You're no
longer in the RCU domain. You're using a pointer where the lifetime has
nothing to do with RCU any more.

Can it be done? Sure. But you need *other* locking for it (that you haven't
explained), and it's fragile as hell.

It's probably best to not use RCU for it at all, but depend on that "other
locking" that you have to have anyway, to keep the pointer valid over the
non-RCU region.

                Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-21 13:50         ` Roman Penyaev
  2018-05-21 15:16           ` Linus Torvalds
@ 2018-05-21 15:31           ` Paul E. McKenney
  2018-05-22  9:09             ` Roman Penyaev
  1 sibling, 1 reply; 55+ messages in thread
From: Paul E. McKenney @ 2018-05-21 15:31 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang,
	Linux Kernel Mailing List

On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> Function is going to be used in transport over RDMA module
> >> >> in subsequent patches.
> >> >>
> >> >> Function returns next element in round-robin fashion,
> >> >> i.e. head will be skipped.  NULL will be returned if list
> >> >> is observed as empty.
> >> >>
> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >> Cc: linux-kernel@vger.kernel.org
> >> >> ---
> >> >>  include/linux/rculist.h | 19 +++++++++++++++++++
> >> >>  1 file changed, 19 insertions(+)
> >> >>
> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> --- a/include/linux/rculist.h
> >> >> +++ b/include/linux/rculist.h
> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >> >>  })
> >> >>
> >> >>  /**
> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> >> + * @head:    the head for the list.
> >> >> + * @ptr:        the list head to take the next element from.
> >> >> + * @type:       the type of the struct this is embedded in.
> >> >> + * @memb:       the name of the list_head within the struct.
> >> >> + *
> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> + *
> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >> >
> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> > are round-robining a given list must all be under the same RCU read-side
> >> > critical section.  For example, the following will break badly:
> >> >
> >> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >> >         {
> >> >                 struct foo *ret;
> >> >
> >> >                 rcu_read_lock();
> >> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >> >                 rcu_read_unlock();  /* BUG */
> >> >                 return ret;
> >> >         }
> >> >
> >> > You need a big fat comment stating this, at the very least.  The resulting
> >> > bug can be very hard to trigger and even harder to debug.
> >> >
> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> > and friends.  The difference is that if you try to invoke those in an
> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> >> > might tempt people to do something broken like take_rr_step() above.
> >>
> >> Hi Paul,
> >>
> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> connection from the list in order to send a request, i.e. my code is
> >> something like the following:
> >>
> >>         static struct conn *get_and_set_next_conn(void)
> >>         {
> >>                 struct conn *conn;
> >>
> >>                 conn = rcu_dereference(rcu_conn);
> >>                 if (unlikely(!conn))
> >>                     return conn;
> >
> > Wait.  Don't you need to restart from the beginning of the list in
> > this case?  Or does the list never have anything added to it and is
> > rcu_conn initially the first element in the list?
> 
> Hi Paul,
> 
> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.
> 
> Initially @rcu_conn points to the first element, but elements can
> be deleted from the list and list can become empty.
> 
> The deletion code is below.
> 
> >
> >>                 conn = list_next_or_null_rr_rcu(&conn_list,
> >>                                                 &conn->entry,
> >>                                                 typeof(*conn),
> >>                                                 entry);
> >>                 rcu_assign_pointer(rcu_conn, conn);
> >
> > Linus is correct to doubt this code.  You assign a pointer to the current
> > element to rcu_conn, which is presumably a per-CPU or global variable.
> > So far, so good ...
> 
> I use a per-CPU variable; I did not show that in the first example so as
> not to overcomplicate the code.
> 
> >
> >>                 return conn;
> >>         }
> >>
> >>         rcu_read_lock();
> >>         conn = get_and_set_next_conn();
> >>         if (unlikely(!conn)) {
> >>                 /* ... */
> >>         }
> >>         err = rdma_io(conn, request);
> >>         rcu_read_unlock();
> >
> > ... except that some other CPU might well remove the entry referenced by
> > rcu_conn at this point.  It would have to wait for a grace period (e.g.,
> > synchronize_rcu()), but the current CPU has exited its RCU read-side
> > critical section, and therefore is not blocking the grace period.
> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> > might well be referencing the freelist, or, even worse, some other type
> > of structure.
> >
> > What is your code doing to prevent this from happening?  (There are ways,
> > but I want to know what you were doing in this case.)
> 
> Probably I should have shown the way of removal at the very beginning,
> my fault.  Deletion looks like the following (slightly changed and
> simplified for the sake of clarity):

Thank you!  Let's see...

>         static void remove_connection(struct conn *conn)
>         {
>                 struct conn *next;
>                 bool need_to_wait = false;
>                 int cpu;
> 
>                 /* Do not let RCU list add/delete happen in parallel */
>                 mutex_lock(&conn_lock);
> 
>                 list_del_rcu(&conn->entry);
> 
>                 /* Make sure everybody observes element removal */
>                 synchronize_rcu();

At this point, any reader who saw the element in the list is done, as your
comment in fact says.  There might still be a pointer to that element in
the per-CPU variables, but from this point forward none of the per-CPU
variables can get set to the newly deleted element.  Which is your next
block of code...

>                 /*
>                  * At this point nobody sees @conn in the list, but still we
>                  * have dangling pointer @rcu_conn which _can_ point to @conn.
>                  * Since nobody can observe @conn in the list, we guarantee
>                  * that IO path will not assign @conn to @rcu_conn, i.e.
>                  * @rcu_conn can be equal to @conn, but can never again
>                  * become @conn.
>                  */
> 
>                 /*
>                  * Get @next connection from current @conn which is going to be
>                  * removed.
>                  */
>                 next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
>                                                 typeof(*next), entry);
> 
>                 /*
>                  * Here @rcu_conn can be changed by reader side, so use @cmpxchg
>                  * in order to keep fairness in load-balancing and do not touch
>                  * the pointer which can be already changed by the IO path.
>                  *
>                  * Current path can be faster than IO path and the following
>                  * race exists:
>                  *
>                  *   CPU0                         CPU1
>                  *   ----                         ----
>                  *   conn = rcu_dereference(rcu_conn);
>                  *   next = list_next_or_null_rr_rcu(conn)
>                  *
>                  *                                conn == cmpxchg(rcu_conn, conn, next);
>                  *                                synchronize_rcu();
>                  *
>                  *   rcu_assign_pointer(rcu_conn, next);
>                  *   ^^^^^^^^^^^^^^^^^^
>                  *
>                  *   Here @rcu_conn is already equal to @next (done by @cmpxchg),
>                  *   so assignment to the same pointer is harmless.
>                  *
>                  */
>                 for_each_possible_cpu(cpu) {
>                         struct conn **rcu_conn;
> 
>                         rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
>                         if (*rcu_conn != conn)
>                                 /*
>                                  * This @cpu will never again pick up @conn,
>                                  * so it is safe just to choose next CPU.
>                                  */
>                                 continue;

... Someone else might have picked up rcu_conn at this point...

>                         if (conn == cmpxchg(rcu_conn, conn, next))
>                                 /*
>                                  * @rcu_conn was successfully replaced with
>                                  * @next, that means that someone can also hold
>                                  * a @conn and dereferencing it, so wait for a
>                                  * grace period is required.
>                                  */
>                                 need_to_wait = true;

... But if there was any possibility of that, need_to_wait is true, and
readers still cannot find the newly deleted element in the list, so the
pcpu_rcu_conn variables cannot be set to it.

>                 }
>                 if (need_to_wait)
>                         synchronize_rcu();

And at this point, the reader that might have picked up rcu_conn
just before the cmpxchg must have completed.  (Good show, by the way!
Many people miss the fact that they need this second synchronize_rcu().)

Hmmm...  What happens if this was the last element in the list, and
the relevant pcpu_rcu_conn variable references that newly removed
element?  Taking a look at list_next_or_null_rr_rcu() and thus at
list_next_or_null_rcu(), it does appear that you get NULL in that
case, as is right and good.
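
The NULL-on-empty behaviour can be checked with a small userspace model.
This is only a sketch under stated assumptions: struct node, list_init(),
list_add_tail(), list_del_keep_next(), next_or_null() and next_or_null_rr()
are hypothetical stand-ins written for illustration, not the kernel macros,
and there is no RCU or memory ordering here.  list_del_keep_next() mimics
the one property that matters for this question, namely that
list_del_rcu() leaves ->next of the removed entry intact.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list; @head is a sentinel, not an entry. */
struct node {
	struct node *next, *prev;
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink @n but keep n->next, as list_del_rcu() does for readers. */
static void list_del_keep_next(struct node *n)
{
	assert(n->prev != NULL);	/* must still be linked */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
}

/* Next entry after @ptr, or NULL if the next node is the head. */
static struct node *next_or_null(struct node *head, struct node *ptr)
{
	struct node *next = ptr->next;

	return next != head ? next : NULL;
}

/* Round-robin variant: skip the head once, NULL only if list is empty. */
static struct node *next_or_null_rr(struct node *head, struct node *ptr)
{
	struct node *next = next_or_null(head, ptr);

	return next ? next : next_or_null(head, ptr->next);
}
```

In this model, calling next_or_null_rr() on the last remaining entry after
it has been unlinked follows its preserved ->next to the head, finds that
the head points to itself, and returns NULL.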

>                 mutex_unlock(&conn_lock);
> 
>                 kfree(conn);
>         }
> 
> 
> >
> >> i.e. usage of the @next pointer is under an RCU critical section.
> >>
> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> >> > macro that makes it more obvious that the whole thing need to be under
> >> > a single RCU read-side critical section?  Such a macro would of course be
> >> > an infinite loop if the list never went empty, so presumably there would
> >> > be a break or return statement in there somewhere.
> >>
> >> The difference is that I do not need a loop, I take the @next conn pointer,
> >> save it for the following IO request and do IO for current IO request.
> >>
> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
> >> of the loop does not look nice, I personally do not like it, i.e.:
> >>
> >>
> >>         static struct conn *get_and_set_next_conn(void)
> >>         {
> >>                 struct conn *conn;
> >>
> >>                 conn = rcu_dereference(rcu_conn);
> >>                 if (unlikely(!conn))
> >>                     return conn;
> >>                 list_for_each_entry_rr_rcu(conn, &conn_list,
> >>                                            entry) {
> >>                         break;
> >>                 }
> >>                 rcu_assign_pointer(rcu_conn, conn);
> >>                 return conn;
> >>         }
> >>
> >>
> >> or maybe I did not fully get your idea?
> >
> > That would not help at all because you are still leaking the pointer out
> > of the RCU read-side critical section.  That is completely and utterly
> > broken unless you are somehow cleaning up rcu_conn when you remove
> > the element.  And getting that cleanup right is -extremely- tricky.
> > Unless you have some sort of proof of correctness, you will get a NACK
> > from me.
> 
> I understand all the consequences of the leaking pointer, and of course
> a loop wrapped in RCU lock/unlock is simpler, but in order to keep
> load-balancing and IO fairness while avoiding any locks on the IO path,
> I've come up with these RCU tricks and the list_next_or_null_rr_rcu()
> macro.

At first glance, it appears that you have handled this correctly.  But I
can make mistakes just as easily as the next guy, so what have you done
to validate your algorithm?

> > More like this:
> >
> >         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
> >                 do_something_with(conn);
> >                 if (done_for_now())
> >                         break;
> >         }
> >
> >> >> + */
> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> >> >> +({ \
> >> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
> >> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
> >> >
> >> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
> >> > Define this within RDMA, where a smaller number of people can more
> >> > easily be kept aware of the restrictions on use.  If it turns out to be
> >> > more generally useful, we can take a look at exactly what makes sense
> >> > more globally.
> >>
> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
> >>
> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
> >>
> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
> >> my list_next_or_null_rr_rcu() variant?
> >
> > Let's start with the basics:  It absolutely does not make sense to leak
> > pointers across rcu_read_unlock() unless you have arranged something else
> > to protect the pointed-to data in the meantime.  There are a number of ways
> > of implementing this protection.  Again, what protection are you using?
> >
> > Your code at the above URL looks plausible to me at first glance: You
> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> > rcu_read_unlock().  But at second glance, it looks like hctx->queue
> > might have the same vulnerability as rcu_conn in your earlier code.
> 
> I am not the author of the code at the URL I specified. I provided the
> link answering the question to show other possible users of round-robin
> semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
> case I can't use a loop, I leak the pointer and indeed have to be very
> careful.  But perhaps we can come up with some generic solution to cover
> both cases: -rr loop and -rr next.

Ah.  Could you please check their update-side code to make sure that it
looks correct to you?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-21 15:16           ` Linus Torvalds
@ 2018-05-21 15:33             ` Paul E. McKenney
  2018-05-22  9:09               ` Roman Penyaev
  0 siblings, 1 reply; 55+ messages in thread
From: Paul E. McKenney @ 2018-05-21 15:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Pen, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, swapnil.ingle, danil.kipnis, Jinpu Wang,
	Linux Kernel Mailing List

On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote:
> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
> roman.penyaev@profitbricks.com> wrote:
> 
> > No, I continue from the pointer, which I assigned on the previous IO
> > in order to send IO fairly and keep load balanced.
> 
> Right. And that's exactly what has both me and Paul nervous. You're no
> longer in the RCU domain. You're using a pointer where the lifetime has
> nothing to do with RCU any more.
> 
> Can it be done? Sure. But you need *other* locking for it (that you haven't
> explained), and it's fragile as hell.

He looks to actually have it right, but I would want to see a big comment
on the read side noting the leak of the pointer and documenting why it
is OK.

							Thanx, Paul

> It's probably best to not use RCU for it at all, but depend on that "other
> locking" that you have to have anyway, to keep the pointer valid over the
> non-RCU region.
> 
>                 Linus
> 

* Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
  2018-05-18 13:04 ` [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation Roman Pen
  2018-05-20 22:14   ` kbuild test robot
  2018-05-21  6:36   ` kbuild test robot
@ 2018-05-22  5:05   ` Leon Romanovsky
  2018-05-22  9:27     ` Roman Penyaev
  2 siblings, 1 reply; 55+ messages in thread
From: Leon Romanovsky @ 2018-05-22  5:05 UTC (permalink / raw)
  To: Roman Pen
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang

On Fri, May 18, 2018 at 03:04:01PM +0200, Roman Pen wrote:
> Add IBTRS Makefile, Kconfig and also corresponding lines into upper
> layer infiniband/ulp files.
>
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
> Cc: Jack Wang <jinpu.wang@profitbricks.com>
> ---
>  drivers/infiniband/Kconfig            |  1 +
>  drivers/infiniband/ulp/Makefile       |  1 +
>  drivers/infiniband/ulp/ibtrs/Kconfig  | 20 ++++++++++++++++++++
>  drivers/infiniband/ulp/ibtrs/Makefile | 15 +++++++++++++++
>  4 files changed, 37 insertions(+)
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>
> diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
> index ee270e065ba9..787bd286fb08 100644
> --- a/drivers/infiniband/Kconfig
> +++ b/drivers/infiniband/Kconfig
> @@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
>
>  source "drivers/infiniband/ulp/iser/Kconfig"
>  source "drivers/infiniband/ulp/isert/Kconfig"
> +source "drivers/infiniband/ulp/ibtrs/Kconfig"
>
>  source "drivers/infiniband/ulp/opa_vnic/Kconfig"
>  source "drivers/infiniband/sw/rdmavt/Kconfig"
> diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
> index 437813c7b481..1c4f10dc8d49 100644
> --- a/drivers/infiniband/ulp/Makefile
> +++ b/drivers/infiniband/ulp/Makefile
> @@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)		+= srpt/
>  obj-$(CONFIG_INFINIBAND_ISER)		+= iser/
>  obj-$(CONFIG_INFINIBAND_ISERT)		+= isert/
>  obj-$(CONFIG_INFINIBAND_OPA_VNIC)	+= opa_vnic/
> +obj-$(CONFIG_INFINIBAND_IBTRS)		+= ibtrs/
> diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
> new file mode 100644
> index 000000000000..eaeb8f3f6b4e
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ibtrs/Kconfig
> @@ -0,0 +1,20 @@
> +config INFINIBAND_IBTRS
> +	tristate
> +	depends on INFINIBAND_ADDR_TRANS
> +
> +config INFINIBAND_IBTRS_CLIENT
> +	tristate "IBTRS client module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_IBTRS
> +	help
> +	  IBTRS client allows for simplified data transfer and connection
> +	  establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
> +	  READ/WRITE semantics and provides multipath capabilities.
> +
> +config INFINIBAND_IBTRS_SERVER
> +	tristate "IBTRS server module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_IBTRS
> +	help
> +	  IBTRS server module processing connection and IO requests received
> +	  from the IBTRS client module.
> diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
> new file mode 100644
> index 000000000000..e6ea858745ad
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ibtrs/Makefile
> @@ -0,0 +1,15 @@
> +ibtrs-client-y := ibtrs-clt.o \
> +		  ibtrs-clt-stats.o \
> +		  ibtrs-clt-sysfs.o
> +
> +ibtrs-server-y := ibtrs-srv.o \
> +		  ibtrs-srv-stats.o \
> +		  ibtrs-srv-sysfs.o
> +
> +ibtrs-core-y := ibtrs.o
> +
> +obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o

Will it build ibtrs-core in case both server and client are disabled in .config?

> +obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
> +obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
> +
> +-include $(src)/compat/compat.mk

What is this?


> --
> 2.13.1

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-21 15:33             ` Paul E. McKenney
@ 2018-05-22  9:09               ` Roman Penyaev
  2018-05-22 16:36                 ` Paul E. McKenney
  2018-05-22 16:38                 ` Linus Torvalds
  0 siblings, 2 replies; 55+ messages in thread
From: Roman Penyaev @ 2018-05-22  9:09 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: Linus Torvalds, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, swapnil.ingle, Danil Kipnis, Jinpu Wang,
	Linux Kernel Mailing List

On Mon, May 21, 2018 at 5:33 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote:
>> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
>> roman.penyaev@profitbricks.com> wrote:
>>
>> > No, I continue from the pointer, which I assigned on the previous IO
>> > in order to send IO fairly and keep load balanced.
>>
>> Right. And that's exactly what has both me and Paul nervous. You're no
>> longer in the RCU domain. You're using a pointer where the lifetime has
>> nothing to do with RCU any more.
>>
>> Can it be done? Sure. But you need *other* locking for it (that you haven't
>> explained), and it's fragile as hell.
>
> He looks to actually have it right, but I would want to see a big comment
> on the read side noting the leak of the pointer and documenting why it
> is OK.

Hi Paul and Linus,

Should I resend the current patch with clearer comments about how careful
the caller must be with a leaking pointer?  I will also update the read side
with a fat comment about "rcu_assign_pointer()", which leaks the pointer
out of the RCU domain, and about what is done to prevent nasty consequences.
Does that sound acceptable?
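
For illustration, such a read-side comment could look roughly like the
sketch below.  It is a userspace model, not the IBTRS code: the names
next_rr() and rr_slot are hypothetical, a single atomic variable stands in
for one per-CPU rcu_conn, and C11 atomics stand in for rcu_dereference()
and rcu_assign_pointer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace model: a circular list with a dedicated head node. */
struct node { struct node *next; };

struct conn {
	struct node entry;	/* must stay first: lets us cast node <-> conn */
	int id;
};

static struct node conn_list = { &conn_list };	/* empty: head points to itself */
static _Atomic(struct conn *) rr_slot;		/* models one per-CPU rcu_conn */

/* Next element after @pos, skipping the head; NULL if the list is empty. */
static struct conn *next_rr(struct node *head, struct node *pos)
{
	struct node *n = pos->next;

	if (n == head)
		n = n->next;
	return n == head ? NULL : (struct conn *)n;
}

/*
 * READ SIDE.  In kernel code the caller would hold rcu_read_lock().
 *
 * NOTE: the store below leaks a pointer out of the read-side critical
 * section: rr_slot still refers to the element after the reader is done.
 * This is only OK because the removal path (not shown here) must
 *   1) unlink the element and wait for a grace period, so that no reader
 *      can store it into rr_slot again,
 *   2) replace every slot that is still equal to the element, and
 *   3) wait for a second grace period before freeing the element.
 */
static struct conn *get_and_set_next_conn(void)
{
	struct conn *conn = atomic_load_explicit(&rr_slot, memory_order_acquire);

	if (!conn)
		return NULL;
	conn = next_rr(&conn_list, &conn->entry);
	atomic_store_explicit(&rr_slot, conn, memory_order_release);
	return conn;
}
```

With two connections on the list and rr_slot pointing at the first one,
successive calls alternate between the two connections.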

--
Roman

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-21 15:31           ` Paul E. McKenney
@ 2018-05-22  9:09             ` Roman Penyaev
  2018-05-22 17:03               ` Paul E. McKenney
  0 siblings, 1 reply; 55+ messages in thread
From: Roman Penyaev @ 2018-05-22  9:09 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang,
	Linux Kernel Mailing List

On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
>> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
>> >> <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> >> Function is going to be used in transport over RDMA module
>> >> >> in subsequent patches.
>> >> >>
>> >> >> Function returns next element in round-robin fashion,
>> >> >> i.e. head will be skipped.  NULL will be returned if list
>> >> >> is observed as empty.
>> >> >>
>> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> >> Cc: linux-kernel@vger.kernel.org
>> >> >> ---
>> >> >>  include/linux/rculist.h | 19 +++++++++++++++++++
>> >> >>  1 file changed, 19 insertions(+)
>> >> >>
>> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> >> index 127f534fec94..b0840d5ab25a 100644
>> >> >> --- a/include/linux/rculist.h
>> >> >> +++ b/include/linux/rculist.h
>> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >> >>  })
>> >> >>
>> >> >>  /**
>> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> >> + * @head:    the head for the list.
>> >> >> + * @ptr:        the list head to take the next element from.
>> >> >> + * @type:       the type of the struct this is embedded in.
>> >> >> + * @memb:       the name of the list_head within the struct.
>> >> >> + *
>> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> >> + * but if list is observed as empty, NULL will be returned.
>> >> >> + *
>> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >> >
>> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> >> > are round-robining a given list must all be under the same RCU read-side
>> >> > critical section.  For example, the following will break badly:
>> >> >
>> >> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> >> >         {
>> >> >                 struct foo *ret;
>> >> >
>> >> >                 rcu_read_lock();
>> >> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> >> >                 rcu_read_unlock();  /* BUG */
>> >> >                 return ret;
>> >> >         }
>> >> >
>> >> > You need a big fat comment stating this, at the very least.  The resulting
>> >> > bug can be very hard to trigger and even harder to debug.
>> >> >
>> >> > And yes, I know that the same restriction applies to list_next_rcu()
>> >> > and friends.  The difference is that if you try to invoke those in an
>> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> >> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
>> >> > might tempt people to do something broken like take_rr_step() above.
>> >>
>> >> Hi Paul,
>> >>
>> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> >> connection from the list in order to send a request, i.e. my code is
>> >> something like the following:
>> >>
>> >>         static struct conn *get_and_set_next_conn(void)
>> >>         {
>> >>                 struct conn *conn;
>> >>
>> >>                 conn = rcu_dereference(rcu_conn);
>> >>                 if (unlikely(!conn))
>> >>                     return conn;
>> >
>> > Wait.  Don't you need to restart from the beginning of the list in
>> > this case?  Or does the list never have anything added to it and is
>> > rcu_conn initially the first element in the list?
>>
>> Hi Paul,
>>
>> No, I continue from the pointer, which I assigned on the previous IO
>> in order to send IO fairly and keep load balanced.
>>
>> Initially @rcu_conn points to the first element, but elements can
>> be deleted from the list and list can become empty.
>>
>> The deletion code is below.
>>
>> >
>> >>                 conn = list_next_or_null_rr_rcu(&conn_list,
>> >>                                                 &conn->entry,
>> >>                                                 typeof(*conn),
>> >>                                                 entry);
>> >>                 rcu_assign_pointer(rcu_conn, conn);
>> >
>> > Linus is correct to doubt this code.  You assign a pointer to the current
>> > element to rcu_conn, which is presumably a per-CPU or global variable.
>> > So far, so good ...
>>
>> I use per-CPU, in the first example I did not show that not to overcomplicate
>> the code.
>>
>> >
>> >>                 return conn;
>> >>         }
>> >>
>> >>         rcu_read_lock();
>> >>         conn = get_and_set_next_conn();
>> >>         if (unlikely(!conn)) {
>> >>                 /* ... */
>> >>         }
>> >>         err = rdma_io(conn, request);
>> >>         rcu_read_unlock();
>> >
>> > ... except that some other CPU might well remove the entry referenced by
>> > rcu_conn at this point.  It would have to wait for a grace period (e.g.,
>> > synchronize_rcu()), but the current CPU has exited its RCU read-side
>> > critical section, and therefore is not blocking the grace period.
>> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
>> > might well be referencing the freelist, or, even worse, some other type
>> > of structure.
>> >
>> > What is your code doing to prevent this from happening?  (There are ways,
>> > but I want to know what you were doing in this case.)
>>
>> Probably I should have shown the way of removal at the very beginning,
>> my fault.  So deletion looks as the following (a bit changed and
>> simplified for the sake of clearness):
>
> Thank you!  Let's see...
>
>>         static void remove_connection(struct conn *conn)
>>         {
>>                 bool need_to_wait = false;
>>                 struct conn *next;
>>                 int cpu;
>>
>>                 /* Do not let RCU list add/delete happen in parallel */
>>                 mutex_lock(&conn_lock);
>>
>>                 list_del_rcu(&conn->entry);
>>
>>                 /* Make sure everybody observes element removal */
>>                 synchronize_rcu();
>
> At this point, any reader who saw the element in the list is done, as your
> comment in fact says.  There might still be a pointer to that element in
> the per-CPU variables; however, from this point forward it cannot be the
> case that one of the per-CPU variables gets set to the newly deleted element.
> Which is your next block of code...
>
>>                 /*
>>                  * At this point nobody sees @conn in the list, but still we
>>                  * have dangling pointer @rcu_conn which _can_ point to @conn.
>>                  * Since nobody can observe @conn in the list, we guarantee
>>                  * that IO path will not assign @conn to @rcu_conn, i.e.
>>                  * @rcu_conn can be equal to @conn, but can never again
>>                  * become @conn.
>>                  */
>>
>>                 /*
>>                  * Get @next connection from current @conn which is going to be
>>                  * removed.
>>                  */
>>                 next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
>>                                                 typeof(*next), entry);
>>
>>                 /*
>>                  * Here @rcu_conn can be changed by reader side, so use @cmpxchg
>>                  * in order to keep fairness in load-balancing and do not touch
>>                  * the pointer which can be already changed by the IO path.
>>                  *
>>                  * Current path can be faster than IO path and the following
>>                  * race exists:
>>                  *
>>                  *   CPU0                         CPU1
>>                  *   ----                         ----
>>                  *   conn = rcu_dereference(rcu_conn);
>>                  *   next = list_next_or_null_rr_rcu(conn)
>>                  *
>>                  *                                conn == cmpxchg(rcu_conn, conn, next);
>>                  *                                synchronize_rcu();
>>                  *
>>                  *   rcu_assign_pointer(rcu_conn, next);
>>                  *   ^^^^^^^^^^^^^^^^^^
>>                  *
>>                  *   Here @rcu_conn is already equal to @next (done by @cmpxchg),
>>                  *   so assignment to the same pointer is harmless.
>>                  *
>>                  */
>>                 for_each_possible_cpu(cpu) {
>>                         struct conn **rcu_conn;
>>
>>                         rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
>>                         if (*rcu_conn != conn)
>>                                 /*
>>                                  * This @cpu will never again pick up @conn,
>>                                  * so it is safe just to choose next CPU.
>>                                  */
>>                                 continue;
>
> ... Someone else might have picked up rcu_conn at this point...
>
>>                         if (conn == cmpxchg(rcu_conn, conn, next))
>>                                 /*
>>                                  * @rcu_conn was successfully replaced with
>>                                  * @next, that means that someone can also
>>                                  * hold a @conn and dereferencing it, so wait
>>                                  * for a grace period is required.
>>                                  */
>>                                 need_to_wait = true;
>
> ... But if there was any possibility of that, need_to_wait is true, and it
> still cannot be the case that a reader finds the newly deleted element
> in the list, so they cannot find that element, so the pcpu_rcu_conn
> variables cannot be set to it.
>
>>                 }
>>                 if (need_to_wait)
>>                         synchronize_rcu();
>
> And at this point, the reader that might have picked up rcu_conn
> just before the cmpxchg must have completed.  (Good show, by the way!
> Many people miss the fact that they need this second synchronize_rcu().)
>
> Hmmm...  What happens if this was the last element in the list, and
> the relevant pcpu_rcu_conn variable references that newly removed
> element?  Taking a look at list_next_or_null_rr_rcu() and thus at
> list_next_or_null_rcu(), it does appear that you get NULL in that
> case, as is right and good.

Thanks for the explicit comments.  What I always lack is a good description.
Indeed it is worth mentioning that @next can become NULL if that was the
last element; I will add comments on that.
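
The NULL case can be checked with a small userspace model of the macro.
This is only a sketch under assumptions: struct node, add_tail(),
del_keep_next() and the next_or_null*() helpers are hypothetical stand-ins
for struct list_head, list_add_tail_rcu(), list_del_rcu() and the
list_next_or_null*_rcu() macros, with no RCU or memory barriers involved.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next, *prev; };

static void add_tail(struct node *head, struct node *n)
{
	n->next = head;
	n->prev = head->prev;
	head->prev->next = n;
	head->prev = n;
}

/* Like list_del_rcu(): unlink @n but keep @n->next so readers can go on. */
static void del_keep_next(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
}

/* Model of list_next_or_null_rcu(): NULL when the successor is the head. */
static struct node *next_or_null(struct node *head, struct node *pos)
{
	return pos->next != head ? pos->next : NULL;
}

/* Model of the -rr variant: on hitting the head, retry once from the head. */
static struct node *next_or_null_rr(struct node *head, struct node *pos)
{
	struct node *n = next_or_null(head, pos);

	return n ? n : next_or_null(head, pos->next);
}
```

In this model, asking for the element after an already removed node still
wraps around to the remaining elements, and yields NULL once the removed
node was the last one on the list.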

>
>>                 mutex_unlock(&conn_lock);
>>
>>                 kfree(conn);
>>         }
>>
>>
>> >
>> >> i.e. usage of the @next pointer is under an RCU critical section.
>> >>
>> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
>> >> > macro that makes it more obvious that the whole thing need to be under
>> >> > a single RCU read-side critical section?  Such a macro would of course be
>> >> > an infinite loop if the list never went empty, so presumably there would
>> >> > be a break or return statement in there somewhere.
>> >>
>> >> The difference is that I do not need a loop, I take the @next conn pointer,
>> >> save it for the following IO request and do IO for current IO request.
>> >>
>> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
>> >> of the loop does not look nice, I personally do not like it, i.e.:
>> >>
>> >>
>> >>         static struct conn *get_and_set_next_conn(void)
>> >>         {
>> >>                 struct conn *conn;
>> >>
>> >>                 conn = rcu_dereference(rcu_conn);
>> >>                 if (unlikely(!conn))
>> >>                     return conn;
>> >>                 list_for_each_entry_rr_rcu(conn, &conn_list,
>> >>                                            entry) {
>> >>                         break;
>> >>                 }
>> >>                 rcu_assign_pointer(rcu_conn, conn);
>> >>                 return conn;
>> >>         }
>> >>
>> >>
>> >> or maybe I did not fully get your idea?
>> >
>> > That would not help at all because you are still leaking the pointer out
>> > of the RCU read-side critical section.  That is completely and utterly
>> > broken unless you are somehow cleaning up rcu_conn when you remove
>> > the element.  And getting that cleanup right is -extremely- tricky.
>> > Unless you have some sort of proof of correctness, you will get a NACK
>> > from me.
>>
>> I understand all the consequences of the leaking pointer, and of course
>> wrapped loop with RCU lock/unlock is simpler, but in order to keep
>> load-balancing and IO fairness avoiding any locks on IO path I've come
>> up with these RCU tricks and list_next_or_null_rr_rcu() macro.
>
> At first glance, it appears that you have handled this correctly.  But I
> can make mistakes just as easily as the next guy, so what have you done
> to validate your algorithm?

All we have is a set of unit tests which run every night.  For this
particular case there is a special stress test which adds/removes RDMA
connections in a loop while IO is in progress.  Unfortunately I did not
write any synthetic test case for testing this algorithm in isolation
(e.g. a module with only these RCU functions can be created and list
modification can be easily simulated; it should not be very difficult).
Do you think it is worth doing?  Unfortunately that does not prove
correctness either; as Linus said, it is fragile as hell, but for sure I can
burn CPUs testing it for a couple of days.
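
A starting point for such an isolated test might be a model of the
update-side fix-up alone.  The sketch below is single-threaded userspace C
under stated assumptions: NR_SLOTS, slots[] and fixup_slots() are
hypothetical names, the array models the per-CPU pcpu_rcu_conn pointers,
and C11 atomic_compare_exchange_strong() stands in for cmpxchg().  It does
not exercise real concurrency or grace periods.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

struct conn { int id; };

static _Atomic(struct conn *) slots[NR_SLOTS];	/* models pcpu_rcu_conn */

/*
 * Replace every slot that still points to @conn with @next (may be NULL).
 * Returns true if any slot was replaced, i.e. the real code would need a
 * second grace period before freeing @conn.
 */
static bool fixup_slots(struct conn *conn, struct conn *next)
{
	bool need_to_wait = false;

	for (int i = 0; i < NR_SLOTS; i++) {
		struct conn *expected = conn;

		/* Leaves the slot alone if a reader already moved it on. */
		if (atomic_compare_exchange_strong(&slots[i], &expected, next))
			need_to_wait = true;
	}
	return need_to_wait;
}
```

A real stress test would run fixup_slots() against concurrent readers that
keep advancing the slots; the model only checks that no slot refers to the
removed connection afterwards.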

>> > More like this:
>> >
>> >         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
>> >                 do_something_with(conn);
>> >                 if (done_for_now())
>> >                         break;
>> >         }
>> >
>> >> >> + */
>> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> >> >> +({ \
>> >> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> >> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>> >> >
>> >> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
>> >> > Define this within RDMA, where a smaller number of people can more
>> >> > easily be kept aware of the restrictions on use.  If it turns out to be
>> >> > more generally useful, we can take a look at exactly what makes sense
>> >> > more globally.
>> >>
>> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
>> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
>> >>
>> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>> >>
>> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
>> >> my list_next_or_null_rr_rcu() variant?
>> >
>> > Let's start with the basics:  It absolutely does not make sense to leak
>> > pointers across rcu_read_unlock() unless you have arranged something else
>> > to protect the pointed-to data in the meantime.  There are a number of ways
>> > of implementing this protection.  Again, what protection are you using?
>> >
>> > Your code at the above URL looks plausible to me at first glance: You
>> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
>> > rcu_read_unlock().  But at second glance, it looks like htcx->queue
>> > might have the same vulnerability as rcu_conn in your earlier code.
>>
>> I am not the author of the code at the URL I specified. I provided the
>> link answering the question to show other possible users of round-robin
>> semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
>> case I can't use a loop, I leak the pointer and indeed have to be very
>> careful.  But perhaps we can come up with some generic solution to cover
>> both cases: -rr loop and -rr next.
>
> Ah.  Could you please check their update-side code to make sure that it
> looks correct to you?

BTW the authors of this particular -rr loop are in CC, but they have been
silent so far :)  According to my shallow understanding @queue can't disappear
from the list on this calling path, i.e. the existence of a @queue should be
guaranteed by the fact that a queue can be removed only once it has no IO
in flight.

--
Roman

* Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
  2018-05-22  5:05   ` Leon Romanovsky
@ 2018-05-22  9:27     ` Roman Penyaev
  2018-05-22 13:18       ` Leon Romanovsky
  0 siblings, 1 reply; 55+ messages in thread
From: Roman Penyaev @ 2018-05-22  9:27 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang

On Tue, May 22, 2018 at 7:05 AM, Leon Romanovsky <leon@kernel.org> wrote:
> On Fri, May 18, 2018 at 03:04:01PM +0200, Roman Pen wrote:
>> Add IBTRS Makefile, Kconfig and also corresponding lines into upper
>> layer infiniband/ulp files.
>>
>> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
>> Cc: Jack Wang <jinpu.wang@profitbricks.com>
>> ---
>>  drivers/infiniband/Kconfig            |  1 +
>>  drivers/infiniband/ulp/Makefile       |  1 +
>>  drivers/infiniband/ulp/ibtrs/Kconfig  | 20 ++++++++++++++++++++
>>  drivers/infiniband/ulp/ibtrs/Makefile | 15 +++++++++++++++
>>  4 files changed, 37 insertions(+)
>>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>>
>> diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
>> index ee270e065ba9..787bd286fb08 100644
>> --- a/drivers/infiniband/Kconfig
>> +++ b/drivers/infiniband/Kconfig
>> @@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
>>
>>  source "drivers/infiniband/ulp/iser/Kconfig"
>>  source "drivers/infiniband/ulp/isert/Kconfig"
>> +source "drivers/infiniband/ulp/ibtrs/Kconfig"
>>
>>  source "drivers/infiniband/ulp/opa_vnic/Kconfig"
>>  source "drivers/infiniband/sw/rdmavt/Kconfig"
>> diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
>> index 437813c7b481..1c4f10dc8d49 100644
>> --- a/drivers/infiniband/ulp/Makefile
>> +++ b/drivers/infiniband/ulp/Makefile
>> @@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)         += srpt/
>>  obj-$(CONFIG_INFINIBAND_ISER)                += iser/
>>  obj-$(CONFIG_INFINIBAND_ISERT)               += isert/
>>  obj-$(CONFIG_INFINIBAND_OPA_VNIC)    += opa_vnic/
>> +obj-$(CONFIG_INFINIBAND_IBTRS)               += ibtrs/
>> diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
>> new file mode 100644
>> index 000000000000..eaeb8f3f6b4e
>> --- /dev/null
>> +++ b/drivers/infiniband/ulp/ibtrs/Kconfig
>> @@ -0,0 +1,20 @@
>> +config INFINIBAND_IBTRS
>> +     tristate
>> +     depends on INFINIBAND_ADDR_TRANS
>> +
>> +config INFINIBAND_IBTRS_CLIENT
>> +     tristate "IBTRS client module"
>> +     depends on INFINIBAND_ADDR_TRANS
>> +     select INFINIBAND_IBTRS
>> +     help
>> +       IBTRS client allows for simplified data transfer and connection
>> +       establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
>> +       READ/WRITE semantics and provides multipath capabilities.
>> +
>> +config INFINIBAND_IBTRS_SERVER
>> +     tristate "IBTRS server module"
>> +     depends on INFINIBAND_ADDR_TRANS
>> +     select INFINIBAND_IBTRS
>> +     help
>> +       IBTRS server module processing connection and IO requests received
>> +       from the IBTRS client module.
>> diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
>> new file mode 100644
>> index 000000000000..e6ea858745ad
>> --- /dev/null
>> +++ b/drivers/infiniband/ulp/ibtrs/Makefile
>> @@ -0,0 +1,15 @@
>> +ibtrs-client-y := ibtrs-clt.o \
>> +               ibtrs-clt-stats.o \
>> +               ibtrs-clt-sysfs.o
>> +
>> +ibtrs-server-y := ibtrs-srv.o \
>> +               ibtrs-srv-stats.o \
>> +               ibtrs-srv-sysfs.o
>> +
>> +ibtrs-core-y := ibtrs.o
>> +
>> +obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o
>
> Will it build ibtrs-core in case both server and client are disabled in .config?

No, CONFIG_INFINIBAND_IBTRS is selected/deselected by
CONFIG_INFINIBAND_IBTRS_CLIENT or CONFIG_INFINIBAND_IBTRS_SERVER,
when you choose them in kconfig.


>> +obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
>> +obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
>> +
>> +-include $(src)/compat/compat.mk
>
> What is this?

Well, in our production we use same source code and in order not to spoil
sources with 'ifdef' macros for different kernel versions I use compat
layer, which obviously will never go upstream.  This line is the only
clean way to keep sources always up-to-date with latest kernel and still
be compatible with what we have on our servers in production.

The '-' prefix at the beginning of the line tells make to ignore the
include if the file does not exist, so it should not raise any error
when compiling against the latest kernel.

Here is an example of the compat layer for IBNBD block device:
https://github.com/profitbricks/ibnbd/tree/master/ibnbd/compat

--
Roman

* Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
  2018-05-22  9:27     ` Roman Penyaev
@ 2018-05-22 13:18       ` Leon Romanovsky
  2018-05-22 16:12         ` Roman Penyaev
  0 siblings, 1 reply; 55+ messages in thread
From: Leon Romanovsky @ 2018-05-22 13:18 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang

On Tue, May 22, 2018 at 11:27:21AM +0200, Roman Penyaev wrote:
> On Tue, May 22, 2018 at 7:05 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > On Fri, May 18, 2018 at 03:04:01PM +0200, Roman Pen wrote:
> >> Add IBTRS Makefile, Kconfig and also corresponding lines into upper
> >> layer infiniband/ulp files.
> >>
> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> >> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
> >> Cc: Jack Wang <jinpu.wang@profitbricks.com>
> >> ---
> >>  drivers/infiniband/Kconfig            |  1 +
> >>  drivers/infiniband/ulp/Makefile       |  1 +
> >>  drivers/infiniband/ulp/ibtrs/Kconfig  | 20 ++++++++++++++++++++
> >>  drivers/infiniband/ulp/ibtrs/Makefile | 15 +++++++++++++++
> >>  4 files changed, 37 insertions(+)
> >>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
> >>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
> >>
> >> diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
> >> index ee270e065ba9..787bd286fb08 100644
> >> --- a/drivers/infiniband/Kconfig
> >> +++ b/drivers/infiniband/Kconfig
> >> @@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
> >>
> >>  source "drivers/infiniband/ulp/iser/Kconfig"
> >>  source "drivers/infiniband/ulp/isert/Kconfig"
> >> +source "drivers/infiniband/ulp/ibtrs/Kconfig"
> >>
> >>  source "drivers/infiniband/ulp/opa_vnic/Kconfig"
> >>  source "drivers/infiniband/sw/rdmavt/Kconfig"
> >> diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
> >> index 437813c7b481..1c4f10dc8d49 100644
> >> --- a/drivers/infiniband/ulp/Makefile
> >> +++ b/drivers/infiniband/ulp/Makefile
> >> @@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)         += srpt/
> >>  obj-$(CONFIG_INFINIBAND_ISER)                += iser/
> >>  obj-$(CONFIG_INFINIBAND_ISERT)               += isert/
> >>  obj-$(CONFIG_INFINIBAND_OPA_VNIC)    += opa_vnic/
> >> +obj-$(CONFIG_INFINIBAND_IBTRS)               += ibtrs/
> >> diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
> >> new file mode 100644
> >> index 000000000000..eaeb8f3f6b4e
> >> --- /dev/null
> >> +++ b/drivers/infiniband/ulp/ibtrs/Kconfig
> >> @@ -0,0 +1,20 @@
> >> +config INFINIBAND_IBTRS
> >> +     tristate
> >> +     depends on INFINIBAND_ADDR_TRANS
> >> +
> >> +config INFINIBAND_IBTRS_CLIENT
> >> +     tristate "IBTRS client module"
> >> +     depends on INFINIBAND_ADDR_TRANS
> >> +     select INFINIBAND_IBTRS
> >> +     help
> >> +       IBTRS client allows for simplified data transfer and connection
> >> +       establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
> >> +       READ/WRITE semantics and provides multipath capabilities.
> >> +
> >> +config INFINIBAND_IBTRS_SERVER
> >> +     tristate "IBTRS server module"
> >> +     depends on INFINIBAND_ADDR_TRANS
> >> +     select INFINIBAND_IBTRS
> >> +     help
> >> +       IBTRS server module processing connection and IO requests received
> >> +       from the IBTRS client module.
> >> diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
> >> new file mode 100644
> >> index 000000000000..e6ea858745ad
> >> --- /dev/null
> >> +++ b/drivers/infiniband/ulp/ibtrs/Makefile
> >> @@ -0,0 +1,15 @@
> >> +ibtrs-client-y := ibtrs-clt.o \
> >> +               ibtrs-clt-stats.o \
> >> +               ibtrs-clt-sysfs.o
> >> +
> >> +ibtrs-server-y := ibtrs-srv.o \
> >> +               ibtrs-srv-stats.o \
> >> +               ibtrs-srv-sysfs.o
> >> +
> >> +ibtrs-core-y := ibtrs.o
> >> +
> >> +obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o
> >
> > Will it build ibtrs-core in case both server and client are disabled in .config?
>
> No, CONFIG_INFINIBAND_IBTRS is selected/deselected by
> CONFIG_INFINIBAND_IBTRS_CLIENT or CONFIG_INFINIBAND_IBTRS_SERVER,
> when you choose them in kconfig.
>

Thanks

>
> >> +obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
> >> +obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
> >> +
> >> +-include $(src)/compat/compat.mk
> >
> > What is this?
>
> Well, in production we use the same source code, and in order not to
> litter the sources with 'ifdef' macros for different kernel versions I use
> a compat layer, which obviously will never go upstream.  This line is the
> only clean way to keep the sources up-to-date with the latest kernel and
> still compatible with what we run on our production servers.
>
> The '-' prefix at the beginning of the line tells make to ignore the
> include if the file does not exist, so it should not raise any error when
> compiling against the latest kernel.
>
> Here is an example of the compat layer for IBNBD block device:
> https://github.com/profitbricks/ibnbd/tree/master/ibnbd/compat

I see it, you will need to remove this line from the upstream kernel
patches.

Thanks

>
> --
> Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
  2018-05-22 13:18       ` Leon Romanovsky
@ 2018-05-22 16:12         ` Roman Penyaev
  0 siblings, 0 replies; 55+ messages in thread
From: Roman Penyaev @ 2018-05-22 16:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang

On Tue, May 22, 2018 at 3:18 PM, Leon Romanovsky <leon@kernel.org> wrote:
> On Tue, May 22, 2018 at 11:27:21AM +0200, Roman Penyaev wrote:
>> On Tue, May 22, 2018 at 7:05 AM, Leon Romanovsky <leon@kernel.org> wrote:
>> > On Fri, May 18, 2018 at 03:04:01PM +0200, Roman Pen wrote:
>> >> Add IBTRS Makefile, Kconfig and also corresponding lines into upper
>> >> layer infiniband/ulp files.
>> >>
>> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> >> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
>> >> Cc: Jack Wang <jinpu.wang@profitbricks.com>
>> >> ---
>> >>  drivers/infiniband/Kconfig            |  1 +
>> >>  drivers/infiniband/ulp/Makefile       |  1 +
>> >>  drivers/infiniband/ulp/ibtrs/Kconfig  | 20 ++++++++++++++++++++
>> >>  drivers/infiniband/ulp/ibtrs/Makefile | 15 +++++++++++++++
>> >>  4 files changed, 37 insertions(+)
>> >>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>> >>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>> >>
>> >> diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
>> >> index ee270e065ba9..787bd286fb08 100644
>> >> --- a/drivers/infiniband/Kconfig
>> >> +++ b/drivers/infiniband/Kconfig
>> >> @@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
>> >>
>> >>  source "drivers/infiniband/ulp/iser/Kconfig"
>> >>  source "drivers/infiniband/ulp/isert/Kconfig"
>> >> +source "drivers/infiniband/ulp/ibtrs/Kconfig"
>> >>
>> >>  source "drivers/infiniband/ulp/opa_vnic/Kconfig"
>> >>  source "drivers/infiniband/sw/rdmavt/Kconfig"
>> >> diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
>> >> index 437813c7b481..1c4f10dc8d49 100644
>> >> --- a/drivers/infiniband/ulp/Makefile
>> >> +++ b/drivers/infiniband/ulp/Makefile
>> >> @@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)         += srpt/
>> >>  obj-$(CONFIG_INFINIBAND_ISER)                += iser/
>> >>  obj-$(CONFIG_INFINIBAND_ISERT)               += isert/
>> >>  obj-$(CONFIG_INFINIBAND_OPA_VNIC)    += opa_vnic/
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS)               += ibtrs/
>> >> diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
>> >> new file mode 100644
>> >> index 000000000000..eaeb8f3f6b4e
>> >> --- /dev/null
>> >> +++ b/drivers/infiniband/ulp/ibtrs/Kconfig
>> >> @@ -0,0 +1,20 @@
>> >> +config INFINIBAND_IBTRS
>> >> +     tristate
>> >> +     depends on INFINIBAND_ADDR_TRANS
>> >> +
>> >> +config INFINIBAND_IBTRS_CLIENT
>> >> +     tristate "IBTRS client module"
>> >> +     depends on INFINIBAND_ADDR_TRANS
>> >> +     select INFINIBAND_IBTRS
>> >> +     help
>> >> +       IBTRS client allows for simplified data transfer and connection
>> >> +       establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
>> >> +       READ/WRITE semantics and provides multipath capabilities.
>> >> +
>> >> +config INFINIBAND_IBTRS_SERVER
>> >> +     tristate "IBTRS server module"
>> >> +     depends on INFINIBAND_ADDR_TRANS
>> >> +     select INFINIBAND_IBTRS
>> >> +     help
>> >> +       IBTRS server module processing connection and IO requests received
>> >> +       from the IBTRS client module.
>> >> diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
>> >> new file mode 100644
>> >> index 000000000000..e6ea858745ad
>> >> --- /dev/null
>> >> +++ b/drivers/infiniband/ulp/ibtrs/Makefile
>> >> @@ -0,0 +1,15 @@
>> >> +ibtrs-client-y := ibtrs-clt.o \
>> >> +               ibtrs-clt-stats.o \
>> >> +               ibtrs-clt-sysfs.o
>> >> +
>> >> +ibtrs-server-y := ibtrs-srv.o \
>> >> +               ibtrs-srv-stats.o \
>> >> +               ibtrs-srv-sysfs.o
>> >> +
>> >> +ibtrs-core-y := ibtrs.o
>> >> +
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o
>> >
>> > Will it build ibtrs-core in case both server and client are disabled in .config?
>>
>> No, CONFIG_INFINIBAND_IBTRS is selected/deselected by
>> CONFIG_INFINIBAND_IBTRS_CLIENT or CONFIG_INFINIBAND_IBTRS_SERVER,
>> when you choose them in kconfig.
>>
>
> Thanks
>
>>
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
>> >> +
>> >> +-include $(src)/compat/compat.mk
>> >
>> > What is this?
>>
>> Well, in production we use the same source code, and in order not to
>> litter the sources with 'ifdef' macros for different kernel versions I use
>> a compat layer, which obviously will never go upstream.  This line is the
>> only clean way to keep the sources up-to-date with the latest kernel and
>> still compatible with what we run on our production servers.
>>
>> The '-' prefix at the beginning of the line tells make to ignore the
>> include if the file does not exist, so it should not raise any error when
>> compiling against the latest kernel.
>>
>> Here is an example of the compat layer for IBNBD block device:
>> https://github.com/profitbricks/ibnbd/tree/master/ibnbd/compat
>
> I see it, you will need to remove this line from the upstream kernel
> patches.

Hi Leon,

Sure.  Thanks.

--
Roman

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-22  9:09               ` Roman Penyaev
@ 2018-05-22 16:36                 ` Paul E. McKenney
  2018-05-22 16:38                 ` Linus Torvalds
  1 sibling, 0 replies; 55+ messages in thread
From: Paul E. McKenney @ 2018-05-22 16:36 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Linus Torvalds, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, swapnil.ingle, Danil Kipnis, Jinpu Wang,
	Linux Kernel Mailing List

On Tue, May 22, 2018 at 11:09:08AM +0200, Roman Penyaev wrote:
> On Mon, May 21, 2018 at 5:33 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote:
> >> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
> >> roman.penyaev@profitbricks.com> wrote:
> >>
> >> > No, I continue from the pointer, which I assigned on the previous IO
> >> > in order to send IO fairly and keep load balanced.
> >>
> >> Right. And that's exactly what has both me and Paul nervous. You're no
> >> longer in the RCU domain. You're using a pointer where the lifetime has
> >> nothing to do with RCU any more.
> >>
> >> Can it be done? Sure. But you need *other* locking for it (that you haven't
> >> explained), and it's fragile as hell.
> >
> > He looks to actually have it right, but I would want to see a big comment
> > on the read side noting the leak of the pointer and documenting why it
> > is OK.
> 
> Hi Paul and Linus,
> 
> Should I resend the current patch with clearer comments about how careful
> the caller must be with a leaking pointer?  I will also update the read
> side with a fat comment about "rcu_assign_pointer()", which leaks the
> pointer out of the RCU domain, and about what is done to prevent nasty
> consequences.  Does that sound acceptable?

That sounds good to me.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-22  9:09               ` Roman Penyaev
  2018-05-22 16:36                 ` Paul E. McKenney
@ 2018-05-22 16:38                 ` Linus Torvalds
  2018-05-22 17:04                   ` Paul E. McKenney
  1 sibling, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2018-05-22 16:38 UTC (permalink / raw)
  To: Roman Pen
  Cc: Paul McKenney, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, swapnil.ingle, danil.kipnis, Jinpu Wang,
	Linux Kernel Mailing List

On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> Should I resend the current patch with clearer comments about how careful
> the caller must be with a leaking pointer?

No. Even if we go your way, there is *one* single user, and that one is
special and needs to take a lot more care.

Just roll your own version, and make it an inline function like I've asked
now many times, and add a shit-ton of explanations of why it's safe to
use in that *one* situation.

I don't want any crazy and unsafe stuff in the generic header file that
absolutely *nobody* should ever use.

                  Linus
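
For readers following the thread, the round-robin step that such a private
inline helper would compute can be sketched in plain userspace C.  This is
only a model under stated assumptions: the list_head stand-in, list_init(),
the local list_add_tail() and conn_next_rr() are hypothetical names written
for this illustration, and none of the RCU machinery (READ_ONCE(),
rcu_read_lock(), grace periods) is modelled.  It shows only the two
properties named in the patch description: the head is skipped, and NULL is
returned when the list is observed as empty.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical connection object with an embedded list node. */
struct conn {
	int id;
	struct list_head entry;
};

static inline void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static inline void list_add_tail(struct list_head *node,
				 struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Hypothetical helper: return the connection that follows @cur in
 * round-robin order.  The list head itself is never returned; if the
 * successor of @cur is the head, wrap around to the first element.
 * Returns NULL when the list is empty.  No RCU protection is modelled.
 */
static inline struct conn *conn_next_rr(struct list_head *head,
					struct list_head *cur)
{
	struct list_head *next;

	assert(head && cur);

	next = cur->next;
	if (next == head)
		next = next->next;	/* skip the head, wrap to the front */
	if (next == head)
		return NULL;		/* head points to itself: empty list */

	return (struct conn *)((char *)next - offsetof(struct conn, entry));
}
```

With two connections a and b on the list, stepping from a yields b and
stepping from b wraps back to a; with a single connection the helper keeps
returning that same connection.  Keeping the helper next to its only caller
is what lets the accompanying comment describe the one situation in which
the result may be stored outside the read-side critical section.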

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (25 preceding siblings ...)
  2018-05-18 13:04 ` [PATCH v2 26/26] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Roman Pen
@ 2018-05-22 16:45 ` Jason Gunthorpe
  26 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2018-05-22 16:45 UTC (permalink / raw)
  To: Roman Pen
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang

On Fri, May 18, 2018 at 03:03:47PM +0200, Roman Pen wrote:
> Hi all,
> 
> This is v2 of the series, which introduces the IBNBD/IBTRS modules.
> 
> This cover letter is split into three parts:
> 
> 1. Introduction, which almost repeats everything from previous cover
>    letters.
> 2. Changelog.
> 3. Performance measurements on linux-4.17.0-rc2 and on two different
>    Mellanox cards: ConnectX-2 and ConnectX-3 and CPUs: Intel and AMD.
> 
> 
>  Introduction
> 
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connection between client and server
> machines via RDMA. It is optimized to transfer (read/write) IO blocks
> in the sense that it follows the BIO semantics of providing the
> possibility to either write data from a scatter-gather list to the
> remote side or to request ("read") data transfer from the remote side
> into a given set of buffers.
> 
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality, i.e. in IBTRS terminology, an IBTRS path is a set of RDMA
> CMs and a particular path is selected according to the load-balancing policy.
> 
> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
> 
> Why?
> 
>    - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>      thus internal protocol is simple.
>    - IBTRS was developed as an independent RDMA transport library, which
>      supports fail-over and load-balancing policies using multipath, thus
>      it can be used for any other IO needs rather than only for block
>      device.
>    - IBNBD/IBTRS is faster than NVME over RDMA.
>      Old comparison results:
>      https://www.spinics.net/lists/linux-rdma/msg48799.html
>      New comparison results: see performance measurements section below.
> 
> Key features of IBTRS transport library and IBNBD block device:
> 
> o High throughput and low latency due to:
>    - Only two RDMA messages per IO.
>    - IMM InfiniBand messages on responses to reduce round trip latency.
>    - Simplified memory management: memory allocation happens once on
>      server side when IBTRS session is established.
> 
> o IO fail-over and load-balancing by using multipath.  According to
>   our test loads, an additional path brings ~20% of additional bandwidth.
> 
> o Simple configuration of IBNBD:
>    - Server side is completely passive: volumes do not need to be
>      explicitly exported.
>    - Only IB port GID and device path needed on client side to map
>      a block device.
>    - A device is remapped automatically, e.g. after a storage reboot.
> 
> Commits for kernel can be found here:
>    https://github.com/profitbricks/ibnbd/commits/linux-4.17-rc2
> 
> The out-of-tree modules are here:
>    https://github.com/profitbricks/ibnbd/
> 
> Vault 2017 presentation:
>    http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

I think from the RDMA side, before we accept something like this, I'd
like to hear from Christoph, Chuck or Sagi that the dataplane
implementation of this is correct, eg it uses the MRs properly and
invalidates at the right time, sequences with dma_ops as required,
etc.

They all have done this work on their ULPs and it was tricky, I don't
want to see another ULP implement this wrong..

I'm skeptical here already due to the performance numbers - they are
not really what I'd expect and we may find that invalidate changes
will bring the performance down further.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-22  9:09             ` Roman Penyaev
@ 2018-05-22 17:03               ` Paul E. McKenney
  0 siblings, 0 replies; 55+ messages in thread
From: Paul E. McKenney @ 2018-05-22 17:03 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford,
	Swapnil Ingle, Danil Kipnis, Jack Wang,
	Linux Kernel Mailing List

On Tue, May 22, 2018 at 11:09:16AM +0200, Roman Penyaev wrote:
> On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> >> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> >> Function is going to be used in transport over RDMA module
> >> >> >> in subsequent patches.
> >> >> >>
> >> >> >> Function returns next element in round-robin fashion,
> >> >> >> i.e. head will be skipped.  NULL will be returned if list
> >> >> >> is observed as empty.
> >> >> >>
> >> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> >> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >> >> Cc: linux-kernel@vger.kernel.org
> >> >> >> ---
> >> >> >>  include/linux/rculist.h | 19 +++++++++++++++++++
> >> >> >>  1 file changed, 19 insertions(+)
> >> >> >>
> >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> >> --- a/include/linux/rculist.h
> >> >> >> +++ b/include/linux/rculist.h
> >> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >> >> >>  })
> >> >> >>
> >> >> >>  /**
> >> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> >> >> + * @head:    the head for the list.
> >> >> >> + * @ptr:        the list head to take the next element from.
> >> >> >> + * @type:       the type of the struct this is embedded in.
> >> >> >> + * @memb:       the name of the list_head within the struct.
> >> >> >> + *
> >> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> >> + *
> >> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >> >> >
> >> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> >> > are round-robining a given list must all be under the same RCU read-side
> >> >> > critical section.  For example, the following will break badly:
> >> >> >
> >> >> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >> >> >         {
> >> >> >                 struct foo *ret;
> >> >> >
> >> >> >                 rcu_read_lock();
> >> >> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >> >> >                 rcu_read_unlock();  /* BUG */
> >> >> >                 return ret;
> >> >> >         }
> >> >> >
> >> >> > You need a big fat comment stating this, at the very least.  The resulting
> >> >> > bug can be very hard to trigger and even harder to debug.
> >> >> >
> >> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> >> > and friends.  The difference is that if you try to invoke those in an
> >> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> >> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> >> >> > might tempt people to do something broken like take_rr_step() above.
> >> >>
> >> >> Hi Paul,
> >> >>
> >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> >> connection from the list in order to send a request, i.e. my code is
> >> >> something like the following:
> >> >>
> >> >>         static struct conn *get_and_set_next_conn(void)
> >> >>         {
> >> >>                 struct conn *conn;
> >> >>
> >> >>                 conn = rcu_dereference(rcu_conn);
> >> >>                 if (unlikely(!conn))
> >> >>                     return conn;
> >> >
> >> > Wait.  Don't you need to restart from the beginning of the list in
> >> > this case?  Or does the list never have anything added to it and is
> >> > rcu_conn initially the first element in the list?
> >>
> >> Hi Paul,
> >>
> >> No, I continue from the pointer, which I assigned on the previous IO
> >> in order to send IO fairly and keep load balanced.
> >>
> >> Initially @rcu_conn points to the first element, but elements can
> >> be deleted from the list and list can become empty.
> >>
> >> The deletion code is below.
> >>
> >> >
> >> >>                 conn = list_next_or_null_rr_rcu(&conn_list,
> >> >>                                                 &conn->entry,
> >> >>                                                 typeof(*conn),
> >> >>                                                 entry);
> >> >>                 rcu_assign_pointer(rcu_conn, conn);
> >> >
> >> > Linus is correct to doubt this code.  You assign a pointer to the current
> >> > element to rcu_conn, which is presumably a per-CPU or global variable.
> >> > So far, so good ...
> >>
> >> It is per-CPU; I did not show that in the first example so as not to
> >> overcomplicate the code.
> >>
> >> >
> >> >>                 return conn;
> >> >>         }
> >> >>
> >> >>         rcu_read_lock();
> >> >>         conn = get_and_set_next_conn();
> >> >>         if (unlikely(!conn)) {
> >> >>                 /* ... */
> >> >>         }
> >> >>         err = rdma_io(conn, request);
> >> >>         rcu_read_unlock();
> >> >
> >> > ... except that some other CPU might well remove the entry referenced by
> >> > rcu_conn at this point.  It would have to wait for a grace period (e.g.,
> >> > synchronize_rcu()), but the current CPU has exited its RCU read-side
> >> > critical section, and therefore is not blocking the grace period.
> >> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> >> > might well be referencing the freelist, or, even worse, some other type
> >> > of structure.
> >> >
> >> > What is your code doing to prevent this from happening?  (There are ways,
> >> > but I want to know what you were doing in this case.)
> >>
> >> Probably I should have shown the way of removal at the very beginning,
> >> my fault.  So deletion looks as follows (slightly changed and
> >> simplified for the sake of clarity):
> >
> > Thank you!  Let's see...
> >
> >>         static void remove_connection(conn)
> >>         {
> >>                 bool need_to_wait = false;
> >>                 int cpu;
> >>
> >>                 /* Do not let RCU list add/delete happen in parallel */
> >>                 mutex_lock(&conn_lock);
> >>
> >>                 list_del_rcu(&conn->entry);
> >>
> >>                 /* Make sure everybody observes element removal */
> >>                 synchronize_rcu();
> >
> > At this point, any reader who saw the element in the list is done, as your
> > comment in fact says.  There might still be a pointer to that element in
> > the per-CPU variables; however, from this point forward it cannot be the
> > case that one of the per-CPU variables gets set to the newly deleted
> > element.  Which is your next block of code...
> >
> >>                 /*
> >>                  * At this point nobody sees @conn in the list, but still
> >>                  * we have dangling pointer @rcu_conn which _can_ point to
> >>                  * @conn.  Since nobody can observe @conn in the list, we
> >>                  * guarantee that IO path will not assign @conn to
> >>                  * @rcu_conn, i.e. @rcu_conn can be equal to @conn, but
> >>                  * can never again become @conn.
> >>                  */
> >>
> >>                 /*
> >>                  * Get @next connection from current @conn which is going to be
> >>                  * removed.
> >>                  */
> >>                 next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
> >>                                                 typeof(*next), entry);
> >>
> >>                 /*
> >>                  * Here @rcu_conn can be changed by reader side, so use @cmpxchg
> >>                  * in order to keep fairness in load-balancing and do not touch
> >>                  * the pointer which can be already changed by the IO path.
> >>                  *
> >>                  * Current path can be faster than IO path and the
> >>                  * following race exists:
> >>                  *
> >>                  *   CPU0                         CPU1
> >>                  *   ----                         ----
> >>                  *   conn = rcu_dereference(rcu_conn);
> >>                  *   next = list_next_or_null_rr_rcu(conn)
> >>                  *
> >>                  *                                conn == cmpxchg(rcu_conn, conn, next);
> >>                  *                                synchronize_rcu();
> >>                  *
> >>                  *   rcu_assign_pointer(rcu_conn, next);
> >>                  *   ^^^^^^^^^^^^^^^^^^
> >>                  *
> >>                  *   Here @rcu_conn is already equal to @next (done by
> >>                  *   @cmpxchg), so assignment to the same pointer is
> >>                  *   harmless.
> >>                  *
> >>                  */
> >>                 for_each_possible_cpu(cpu) {
> >>                         struct conn **rcu_conn;
> >>
> >>                         rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
> >>                         if (*rcu_conn != conn)
> >>                                 /*
> >>                                  * This @cpu will never again pick up @conn,
> >>                                  * so it is safe just to choose next CPU.
> >>                                  */
> >>                                 continue;
> >
> > ... Someone else might have picked up rcu_conn at this point...
> >
> >>                         if (conn == cmpxchg(rcu_conn, conn, next))
> >>                                 /*
> >>                                  * @rcu_conn was successfully replaced
> >>                                  * with @next, that means that someone
> >>                                  * can also hold a @conn and dereferencing
> >>                                  * it, so wait for a grace period is
> >>                                  * required.
> >>                                  */
> >>                                 need_to_wait = true;
> >
> > ... But if there was any possibility of that, need_to_wait is true, and it
> > still cannot be the case that a reader finds the newly deleted element
> > in the list, so they cannot find that element, so the pcpu_rcu_conn
> > variables cannot be set to it.
> >
> >>                 }
> >>                 if (need_to_wait)
> >>                         synchronize_rcu();
> >
> > And at this point, the reader that might have picked up rcu_conn
> > just before the cmpxchg must have completed.  (Good show, by the way!
> > Many people miss the fact that they need this second synchronize_rcu().)
> >
> > Hmmm...  What happens if this was the last element in the list, and
> > the relevant pcpu_rcu_conn variable references that newly removed
> > element?  Taking a look at list_next_or_null_rr_rcu() and thus at
> > list_next_or_null_rcu(), it does appear that you get NULL in that
> > case, as is right and good.
> 
> Thanks for the explicit comments.  What I always lack is a good description.
> Indeed it is worth mentioning that @next can become NULL if that was the
> last element; I will add comments then.

Very good.

And for the record, I agree with Linus that this needs to be private.
I am very glad that you got it right (or appear to have), and there might
come a time when it needs to be generally available, but that time is
certainly not now and might well never come.
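
The cursor fix-up step of the removal path quoted in this thread can be
modelled in userspace with C11 atomics.  This is a hedged, single-threaded
sketch, not the driver code: NR_CPUS_MODEL, the cursor[] array and
retarget_cursors() are hypothetical names standing in for the per-CPU
@rcu_conn variables and the for_each_possible_cpu() loop, and neither
synchronize_rcu() call is modelled.  It shows only the compare-and-swap
rule: a cursor is advanced to @next solely when it still points at the
element being removed, and the caller learns whether a further grace period
would be needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4		/* hypothetical CPU count for this model */

struct conn {
	int id;
};

/* Models the per-CPU @rcu_conn cursors; statically initialised to NULL. */
static _Atomic(struct conn *) cursor[NR_CPUS_MODEL];

/*
 * Replace every cursor that still points at @victim with @next.
 * Cursors that already moved on are left untouched, which preserves
 * load-balancing fairness.  Returns true if at least one cursor was
 * replaced; in the real removal path that is the case in which a
 * reader may still hold @victim, so a second grace period is required
 * before freeing it.  @next may be NULL if @victim was the last element.
 */
static bool retarget_cursors(struct conn *victim, struct conn *next)
{
	bool need_to_wait = false;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		struct conn *expected = victim;

		/* Succeeds only if cursor[cpu] == victim at this instant. */
		if (atomic_compare_exchange_strong(&cursor[cpu], &expected,
						   next))
			need_to_wait = true;
	}

	return need_to_wait;
}
```

In this model a second call with the same @victim returns false, which
mirrors the argument made above: once the element is off the list and the
first grace period has elapsed, no cursor can be set to it again.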

> >>                 mutex_unlock(&conn_lock);
> >>
> >>                 kfree(conn);
> >>         }
> >>
> >>
> >> >
> >> >> i.e. usage of the @next pointer is under an RCU critical section.
> >> >>
> >> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> >> >> > macro that makes it more obvious that the whole thing need to be under
> >> >> > a single RCU read-side critical section?  Such a macro would of course be
> >> >> > an infinite loop if the list never went empty, so presumably there would
> >> >> > be a break or return statement in there somewhere.
> >> >>
> >> >> The difference is that I do not need a loop, I take the @next conn pointer,
> >> >> save it for the following IO request and do IO for current IO request.
> >> >>
> >> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
> >> >> of the loop does not look nice, I personally do not like it, i.e.:
> >> >>
> >> >>
> >> >>         static struct conn *get_and_set_next_conn(void)
> >> >>         {
> >> >>                 struct conn *conn;
> >> >>
> >> >>                 conn = rcu_dereference(rcu_conn);
> >> >>                 if (unlikely(!conn))
> >> >>                     return conn;
> >> >>                 list_for_each_entry_rr_rcu(conn, &conn_list,
> >> >>                                            entry) {
> >> >>                         break;
> >> >>                 }
> >> >>                 rcu_assign_pointer(rcu_conn, conn);
> >> >>                 return conn;
> >> >>         }
> >> >>
> >> >>
> >> >> or maybe I did not fully get your idea?
> >> >
> >> > That would not help at all because you are still leaking the pointer out
> >> > of the RCU read-side critical section.  That is completely and utterly
> >> > broken unless you are somehow cleaning up rcu_conn when you remove
> >> > the element.  And getting that cleanup right is -extremely- tricky.
> >> > Unless you have some sort of proof of correctness, you will get a NACK
> >> > from me.
> >>
> >> I understand all the consequences of the leaking pointer, and of course
> >> a loop wrapped in RCU lock/unlock is simpler, but in order to keep
> >> load-balancing and IO fairness while avoiding any locks on the IO path
> >> I've come up with these RCU tricks and the list_next_or_null_rr_rcu()
> >> macro.
> >
> > At first glance, it appears that you have handled this correctly.  But I
> > can make mistakes just as easily as the next guy, so what have you done
> > to validate your algorithm?
> 
> All we have is a set of unit tests which run every night.  For this
> particular case there is a special stress test which adds/removes RDMA
> connections in a loop while IO is in flight.  Unfortunately I did not
> write any synthetic test case just for testing this isolated algorithm
> (e.g. a module with only these RCU functions can be created and list
> modification can be easily simulated; it should not be very difficult).
> Do you think it is worth doing?  Unfortunately it also does not prove
> correctness; like Linus said, it is fragile as hell, but for sure I can
> burn CPUs testing it for a couple of days.

Yes, this definitely deserves some -serious- focused stress testing.

And formal verification, if that can be arranged.  The Linux-kernel
memory model does handle RCU, and also minimal linked lists, so might
be worth a try.  Unfortunately, I don't know of any other tooling
that handles RCU.  :-(

> >> > More like this:
> >> >
> >> >         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
> >> >                 do_something_with(conn);
> >> >                 if (done_for_now())
> >> >                         break;
> >> >         }
> >> >
> >> >> >> + */
> >> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> >> >> >> +({ \
> >> >> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
> >> >> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
> >> >> >
> >> >> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
> >> >> > Define this within RDMA, where a smaller number of people can more
> >> >> > easily be kept aware of the restrictions on use.  If it turns out to be
> >> >> > more generally useful, we can take a look at exactly what makes sense
> >> >> > more globally.
> >> >>
> >> >> The only list_for_each_entry_rcu()-like macro I am aware of is used in
> >> >> block/blk-mq-sched.c and is called list_for_each_entry_rcu_rr():
> >> >>
> >> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
> >> >>
> >> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
> >> >> my list_next_or_null_rr_rcu() variant?
> >> >
> >> > Let's start with the basics:  It absolutely does not make sense to leak
> >> > pointers across rcu_read_unlock() unless you have arranged something else
> >> > to protect the pointed-to data in the meantime.  There are a number of ways
> >> > of implementing this protection.  Again, what protection are you using?
> >> >
> >> > Your code at the above URL looks plausible to me at first glance: You
> >> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> >> > rcu_read_unlock().  But at second glance, it looks like hctx->queue
> >> > might have the same vulnerability as rcu_conn in your earlier code.
> >>
> >> I am not the author of the code at the URL I specified. I provided the
> >> link answering the question to show other possible users of round-robin
> >> semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
> >> case I can't use a loop, I leak the pointer and indeed have to be very
> >> careful.  But perhaps we can come up with some generic solution to cover
> >> both cases: -rr loop and -rr next.
> >
> > Ah.  Could you please check their update-side code to make sure that it
> > looks correct to you?
> 
> BTW the authors of this particular -rr loop are in CC, but they have been
> silent so far :)  According to my shallow understanding, @queue can't
> disappear from the list on this calling path, i.e. the existence of a
> @queue should be guaranteed by the fact that a queue can be removed only
> once it has no IO in flight.

But something has to enforce that, right?  For example, how does the
code know that there are no longer any I/Os in flight, and how does it
know that more I/Os won't be issued in the near future, possibly based
on old state?
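
One common way to enforce this kind of guarantee is a reference count that can be taken only while the object is still live, similar in spirit to the kernel's kref_get_unless_zero().  The following is a userspace sketch under that assumption, using C11 atomics; struct conn, conn_get_unless_zero() and conn_put() are hypothetical names, and nothing here is claimed to be what IBTRS or blk-mq actually does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical object that a reader wants to keep using after its
 * read-side critical section has ended.
 */
struct conn {
	atomic_int refs;	/* 0 means removal has already begun */
};

/* Take a reference only if the object is still live. */
static bool conn_get_unless_zero(struct conn *c)
{
	int old = atomic_load(&c->refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool conn_put(struct conn *c)
{
	int old = atomic_fetch_sub(&c->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

In this scheme a reader would call conn_get_unless_zero() before leaving the read-side critical section and treat a false return as "element is going away, pick another one"; the updater frees the element only after conn_put() reports that the last reference is gone.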

But yes, the authors are free to respond.

							Thanx, Paul

* Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
  2018-05-22 16:38                 ` Linus Torvalds
@ 2018-05-22 17:04                   ` Paul E. McKenney
  0 siblings, 0 replies; 55+ messages in thread
From: Paul E. McKenney @ 2018-05-22 17:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Pen, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Doug Ledford, swapnil.ingle, danil.kipnis, Jinpu Wang,
	Linux Kernel Mailing List

On Tue, May 22, 2018 at 09:38:13AM -0700, Linus Torvalds wrote:
> On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <
> roman.penyaev@profitbricks.com> wrote:
> 
> > Should I resend the current patch with clearer comments about how careful
> > the caller should be with a leaking pointer?
> 
> No. Even if we go your way, there is *one* single user, and that one is
> special and needs to take a lot more care.
> 
> Just roll your own version, and make it an inline function like I've asked
> now many times, and add a shit-ton of explanations of why it's safe to
> use in that *one* situation.
> 
> I don't want any crazy and unsafe stuff in the generic header file that
> absolutely *nobody* should ever use.

Completely agreed!

I was perhaps foolishly assuming that they would be making that adjustment
based on earlier emails, but yes, I should have explicitly stated this
requirement in my earlier reply.

							Thanx, Paul

Thread overview: 55+ messages
2018-05-18 13:03 [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
2018-05-18 13:03 ` [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Roman Pen
2018-05-18 16:56   ` Linus Torvalds
2018-05-19 20:25     ` Roman Penyaev
2018-05-19 21:04       ` Linus Torvalds
2018-05-19 16:37   ` Paul E. McKenney
2018-05-19 20:20     ` Roman Penyaev
2018-05-19 20:56       ` Linus Torvalds
2018-05-20  0:43       ` Paul E. McKenney
2018-05-21 13:50         ` Roman Penyaev
2018-05-21 15:16           ` Linus Torvalds
2018-05-21 15:33             ` Paul E. McKenney
2018-05-22  9:09               ` Roman Penyaev
2018-05-22 16:36                 ` Paul E. McKenney
2018-05-22 16:38                 ` Linus Torvalds
2018-05-22 17:04                   ` Paul E. McKenney
2018-05-21 15:31           ` Paul E. McKenney
2018-05-22  9:09             ` Roman Penyaev
2018-05-22 17:03               ` Paul E. McKenney
2018-05-18 13:03 ` [PATCH v2 02/26] sysfs: export sysfs_remove_file_self() Roman Pen
2018-05-18 15:08   ` Tejun Heo
2018-05-18 13:03 ` [PATCH v2 03/26] ibtrs: public interface header to establish RDMA connections Roman Pen
2018-05-18 13:03 ` [PATCH v2 04/26] ibtrs: private headers with IBTRS protocol structs and helpers Roman Pen
2018-05-18 13:03 ` [PATCH v2 05/26] ibtrs: core: lib functions shared between client and server modules Roman Pen
2018-05-18 13:03 ` [PATCH v2 06/26] ibtrs: client: private header with client structs and functions Roman Pen
2018-05-18 13:03 ` [PATCH v2 07/26] ibtrs: client: main functionality Roman Pen
2018-05-18 13:03 ` [PATCH v2 08/26] ibtrs: client: statistics functions Roman Pen
2018-05-18 13:03 ` [PATCH v2 09/26] ibtrs: client: sysfs interface functions Roman Pen
2018-05-18 13:03 ` [PATCH v2 10/26] ibtrs: server: private header with server structs and functions Roman Pen
2018-05-18 13:03 ` [PATCH v2 11/26] ibtrs: server: main functionality Roman Pen
2018-05-18 13:03 ` [PATCH v2 12/26] ibtrs: server: statistics functions Roman Pen
2018-05-18 13:04 ` [PATCH v2 13/26] ibtrs: server: sysfs interface functions Roman Pen
2018-05-18 13:04 ` [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation Roman Pen
2018-05-20 22:14   ` kbuild test robot
2018-05-21  6:36   ` kbuild test robot
2018-05-22  5:05   ` Leon Romanovsky
2018-05-22  9:27     ` Roman Penyaev
2018-05-22 13:18       ` Leon Romanovsky
2018-05-22 16:12         ` Roman Penyaev
2018-05-18 13:04 ` [PATCH v2 15/26] ibtrs: a bit of documentation Roman Pen
2018-05-18 13:04 ` [PATCH v2 16/26] ibnbd: private headers with IBNBD protocol structs and helpers Roman Pen
2018-05-18 13:04 ` [PATCH v2 17/26] ibnbd: client: private header with client structs and functions Roman Pen
2018-05-18 13:04 ` [PATCH v2 18/26] ibnbd: client: main functionality Roman Pen
2018-05-18 13:04 ` [PATCH v2 19/26] ibnbd: client: sysfs interface functions Roman Pen
2018-05-18 13:04 ` [PATCH v2 20/26] ibnbd: server: private header with server structs and functions Roman Pen
2018-05-18 13:04 ` [PATCH v2 21/26] ibnbd: server: main functionality Roman Pen
2018-05-18 13:04 ` [PATCH v2 22/26] ibnbd: server: functionality for IO submission to file or block dev Roman Pen
2018-05-18 13:04 ` [PATCH v2 23/26] ibnbd: server: sysfs interface functions Roman Pen
2018-05-18 13:04 ` [PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation Roman Pen
2018-05-20 17:21   ` kbuild test robot
2018-05-20 22:14   ` kbuild test robot
2018-05-21  5:33   ` kbuild test robot
2018-05-18 13:04 ` [PATCH v2 25/26] ibnbd: a bit of documentation Roman Pen
2018-05-18 13:04 ` [PATCH v2 26/26] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Roman Pen
2018-05-22 16:45 ` [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Jason Gunthorpe
