linux-block.vger.kernel.org archive mirror
* [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
@ 2018-02-02 14:08 Roman Pen
  2018-02-02 14:08 ` [PATCH 01/24] ibtrs: public interface header to establish RDMA connections Roman Pen
                   ` (26 more replies)
  0 siblings, 27 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This series introduces IBNBD/IBTRS modules.

IBTRS (InfiniBand Transport) is a reliable high-speed transport library
for establishing connections between client and server machines via
RDMA. It is optimized for transferring (reading/writing) IO blocks and
follows BIO semantics: it can either write data from a scatter-gather
list to the remote side or request ("read") a data transfer from the
remote side into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality.

IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow remote access to a block device on
the server over the IBTRS protocol. After being mapped, the remote
block devices can be accessed on the client side as local block
devices. Internally IBNBD uses IBTRS as its RDMA transport library.

Why?

   - IBNBD/IBTRS was developed in order to map thin-provisioned volumes,
     thus the internal protocol is simple and consists of only a few
     request types, without any awareness of the underlying hardware
     devices.
   - IBTRS was developed as an independent RDMA transport library, which
     supports fail-over and load-balancing policies using multipath, thus
     it can be used for other IO needs beyond block devices.
   - IBNBD/IBTRS is faster than NVMe over RDMA.  Old comparison results:
     https://www.spinics.net/lists/linux-rdma/msg48799.html
     (I retested on the latest 4.14 kernel - there is no significant
     difference, so I post the old link.)

Key features of IBTRS transport library and IBNBD block device:

o High throughput and low latency due to:
   - Only two RDMA messages per IO.
   - IMM InfiniBand messages on responses to reduce round trip latency.
   - Simplified memory management: memory allocation happens once on
     the server side when the IBTRS session is established.

o IO fail-over and load-balancing by using multipath.

o Simple configuration of IBNBD:
   - Server side is completely passive: volumes do not need to be
     explicitly exported.
   - Only the IB port GID and the device path are needed on the client
     side to map a block device.
   - A device is remapped automatically, e.g. after a storage reboot.

This series is a second attempt; the first version was published [1] and
presented at Vault in 2017 [2].

Since the first version the following was changed:

   - Load-balancing and IO fail-over using multipath features were added.
   - Major parts of the code were rewritten and simplified, and the
     overall code size was reduced by a quarter.

Commits for kernel can be found here:
   https://github.com/profitbricks/ibnbd/commits/linux-4.15-rc8

The out-of-tree modules are here:
   https://github.com/profitbricks/ibnbd/

[1] https://lwn.net/Articles/718181/
[2] http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

Roman Pen (24):
  ibtrs: public interface header to establish RDMA connections
  ibtrs: private headers with IBTRS protocol structs and helpers
  ibtrs: core: lib functions shared between client and server modules
  ibtrs: client: private header with client structs and functions
  ibtrs: client: main functionality
  ibtrs: client: statistics functions
  ibtrs: client: sysfs interface functions
  ibtrs: server: private header with server structs and functions
  ibtrs: server: main functionality
  ibtrs: server: statistics functions
  ibtrs: server: sysfs interface functions
  ibtrs: include client and server modules into kernel compilation
  ibtrs: a bit of documentation
  ibnbd: private headers with IBNBD protocol structs and helpers
  ibnbd: client: private header with client structs and functions
  ibnbd: client: main functionality
  ibnbd: client: sysfs interface functions
  ibnbd: server: private header with server structs and functions
  ibnbd: server: main functionality
  ibnbd: server: functionality for IO submission to file or block dev
  ibnbd: server: sysfs interface functions
  ibnbd: include client and server modules into kernel compilation
  ibnbd: a bit of documentation
  MAINTAINERS: Add maintainer for IBNBD/IBTRS modules

 MAINTAINERS                                    |   14 +
 drivers/block/Kconfig                          |    2 +
 drivers/block/Makefile                         |    1 +
 drivers/block/ibnbd/Kconfig                    |   22 +
 drivers/block/ibnbd/Makefile                   |   13 +
 drivers/block/ibnbd/README                     |  272 ++
 drivers/block/ibnbd/ibnbd-clt-sysfs.c          |  723 +++++
 drivers/block/ibnbd/ibnbd-clt.c                | 1959 +++++++++++++
 drivers/block/ibnbd/ibnbd-clt.h                |  193 ++
 drivers/block/ibnbd/ibnbd-log.h                |   71 +
 drivers/block/ibnbd/ibnbd-proto.h              |  360 +++
 drivers/block/ibnbd/ibnbd-srv-dev.c            |  410 +++
 drivers/block/ibnbd/ibnbd-srv-dev.h            |  149 +
 drivers/block/ibnbd/ibnbd-srv-sysfs.c          |  264 ++
 drivers/block/ibnbd/ibnbd-srv.c                |  901 ++++++
 drivers/block/ibnbd/ibnbd-srv.h                |  100 +
 drivers/infiniband/Kconfig                     |    1 +
 drivers/infiniband/ulp/Makefile                |    1 +
 drivers/infiniband/ulp/ibtrs/Kconfig           |   20 +
 drivers/infiniband/ulp/ibtrs/Makefile          |   15 +
 drivers/infiniband/ulp/ibtrs/README            |  238 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c |  455 +++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c |  519 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c       | 3496 ++++++++++++++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h       |  338 +++
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h       |   94 +
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h       |  494 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c |  110 +
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c |  278 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c       | 1811 ++++++++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h       |  169 ++
 drivers/infiniband/ulp/ibtrs/ibtrs.c           |  582 ++++
 drivers/infiniband/ulp/ibtrs/ibtrs.h           |  331 +++
 33 files changed, 14406 insertions(+)
 create mode 100644 drivers/block/ibnbd/Kconfig
 create mode 100644 drivers/block/ibnbd/Makefile
 create mode 100644 drivers/block/ibnbd/README
 create mode 100644 drivers/block/ibnbd/ibnbd-clt-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.c
 create mode 100644 drivers/block/ibnbd/ibnbd-clt.h
 create mode 100644 drivers/block/ibnbd/ibnbd-log.h
 create mode 100644 drivers/block/ibnbd/ibnbd-proto.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-dev.h
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.c
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
 create mode 100644 drivers/infiniband/ulp/ibtrs/README
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
-- 
2.13.1


* [PATCH 01/24] ibtrs: public interface header to establish RDMA connections
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 02/24] ibtrs: private headers with IBTRS protocol structs and helpers Roman Pen
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

Introduce the public header which provides the set of API functions
for establishing RDMA connections from a client to a server machine
using the IBTRS protocol, which manages the RDMA connections of each
session and does multipathing and load balancing.

Main functions for client (active) side:

 ibtrs_clt_open() - Creates a set of RDMA connections encapsulated
                    in an IBTRS session and returns a pointer to the
                    IBTRS session object.
 ibtrs_clt_close() - Closes the RDMA connections associated with the
                     IBTRS session.
 ibtrs_clt_request() - Requests a zero-copy RDMA transfer to/from
                       the server.

Main functions for server (passive) side:

 ibtrs_srv_open() - Starts listening for IBTRS clients on the specified
                    port and invokes IBTRS callbacks for incoming
                    RDMA requests or link events.
 ibtrs_srv_close() - Closes the IBTRS server context.
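
To give a rough idea of the intended usage, below is a hypothetical
client-side sketch against the declarations in this header (the ex_*
names, the paths array and the size variables are placeholders, and
error handling is mostly omitted):

  static void ex_link_ev(void *priv, enum ibtrs_clt_link_ev ev)
  {
          /* react to IBTRS_CLT_LINK_EV_{RECONNECTED,DISCONNECTED} */
  }

  static void ex_io_conf(void *priv, int errno)
  {
          /* IO confirmation: errno == 0 on success */
  }

  ...
          struct kvec vec = { .iov_base = usr_msg, .iov_len = usr_msg_len };
          struct ibtrs_clt *sess;
          struct ibtrs_tag *tag;
          int err;

          sess = ibtrs_clt_open(priv, ex_link_ev, "ex_sess", paths,
                                path_cnt, port, pdu_sz,
                                5 /* reconnect delay, sec */,
                                max_segments, -1 /* retry forever */);
          if (IS_ERR(sess))
                  return PTR_ERR(sess);

          tag = ibtrs_clt_get_tag(sess, IBTRS_IO_CON, IBTRS_TAG_WAIT);
          err = ibtrs_clt_request(WRITE, ex_io_conf, sess, tag, priv,
                                  &vec, 1, data_len, sg, sg_cnt);
          ...
          ibtrs_clt_put_tag(sess, tag);
          ibtrs_clt_close(sess);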

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs.h | 331 +++++++++++++++++++++++++++++++++++
 1 file changed, 331 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.h b/drivers/infiniband/ulp/ibtrs/ibtrs.h
new file mode 100644
index 000000000000..747cdde3d9cf
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.h
@@ -0,0 +1,331 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_H
+#define IBTRS_H
+
+#include <linux/socket.h>
+#include <linux/scatterlist.h>
+
+struct ibtrs_clt;
+struct ibtrs_srv_ctx;
+struct ibtrs_srv;
+struct ibtrs_srv_op;
+
+/*
+ * Here goes IBTRS client API
+ */
+
+/**
+ * enum ibtrs_clt_link_ev - Events about connectivity state of a client
+ * @IBTRS_CLT_LINK_EV_RECONNECTED:	Client was reconnected.
+ * @IBTRS_CLT_LINK_EV_DISCONNECTED:	Client was disconnected.
+ */
+enum ibtrs_clt_link_ev {
+	IBTRS_CLT_LINK_EV_RECONNECTED,
+	IBTRS_CLT_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * Source and destination address of a path to be established
+ */
+struct ibtrs_addr {
+	struct sockaddr *src;
+	struct sockaddr *dst;
+};
+
+typedef void (link_clt_ev_fn)(void *priv, enum ibtrs_clt_link_ev ev);
+/**
+ * ibtrs_clt_open() - Open a session to an IBTRS server
+ * @priv:		User supplied private data.
+ * @link_ev:		Event notification for connection state changes
+ *	@priv:			user supplied data that was passed to
+ *				ibtrs_clt_open()
+ *	@ev:			Occurred event
+ * @sessname: name of the session
+ * @paths: Paths to be established defined by their src and dst addresses
+ * @path_cnt: Number of elements in the @paths array
+ * @port: port to be used by the IBTRS session
+ * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
+ * @max_inflight_msg: Max. number of parallel inflight messages for the session
+ * @max_segments: Max. number of segments per IO request
+ * @reconnect_delay_sec: time between reconnect tries
+ * @max_reconnect_attempts: Number of times to reconnect on error before giving
+ *			    up, 0 for disabled, -1 for forever
+ *
+ * Starts session establishment with the ibtrs_server. The function can block
+ * up to ~2000ms until it returns.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct ibtrs_addr *paths,
+				 size_t path_cnt, short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts);
+
+/**
+ * ibtrs_clt_close() - Close a session
+ * @sess: Session handler, is freed on return
+ */
+void ibtrs_clt_close(struct ibtrs_clt *sess);
+
+enum {
+	IBTRS_TAG_NOWAIT = 0,
+	IBTRS_TAG_WAIT   = 1,
+};
+
+/**
+ * enum ibtrs_clt_con_type - type of IB connection to use with a given tag
+ * @IBTRS_USR_CON: use a connection reserved for "service" messages
+ * @IBTRS_IO_CON: use a connection reserved for IO
+ */
+enum ibtrs_clt_con_type {
+	IBTRS_USR_CON,
+	IBTRS_IO_CON
+};
+
+/**
+ * struct ibtrs_tag - tags the memory allocation for a future RDMA operation
+ */
+struct ibtrs_tag {
+	enum ibtrs_clt_con_type con_type;
+	unsigned int cpu_id;
+	unsigned int mem_id;
+	unsigned int mem_off;
+};
+
+static inline struct ibtrs_tag *ibtrs_tag_from_pdu(void *pdu)
+{
+	return pdu - sizeof(struct ibtrs_tag);
+}
+
+static inline void *ibtrs_tag_to_pdu(struct ibtrs_tag *tag)
+{
+	return tag + 1;
+}
+
+/**
+ * ibtrs_clt_get_tag() - allocates tag for future RDMA operation
+ * @sess:	Current session
+ * @con_type:	Type of connection to use with the tag
+ * @wait:	Wait type
+ *
+ * Description:
+ *    Allocates tag for the following RDMA operation.  Tag is used
+ *    to preallocate all resources and to propagate memory pressure
+ *    up earlier.
+ *
+ * Context:
+ *    Can sleep if @wait == IBTRS_TAG_WAIT
+ */
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *sess,
+				    enum ibtrs_clt_con_type con_type,
+				    int wait);
+
+/**
+ * ibtrs_clt_put_tag() - puts allocated tag
+ * @sess:	Current session
+ * @tag:	Tag to be freed
+ *
+ * Context:
+ *    Does not matter
+ */
+void ibtrs_clt_put_tag(struct ibtrs_clt *sess, struct ibtrs_tag *tag);
+
+typedef void (ibtrs_conf_fn)(void *priv, int errno);
+/**
+ * ibtrs_clt_request() - Request data transfer to/from server via RDMA.
+ *
+ * @dir:	READ/WRITE
+ * @conf:	callback function to be called as confirmation
+ * @sess:	Session
+ * @tag:	Preallocated tag
+ * @priv:	User provided data, passed back with corresponding
+ *		@(conf) confirmation.
+ * @vec:	Message that is sent to the server together with the request.
+ *		Sum of len of all @vec elements limited to <= IO_MSG_SIZE.
+ *		Since the msg is copied internally it can be allocated on the stack.
+ * @nr:		Number of elements in @vec.
+ * @len:	length of data sent to/from the server
+ * @sg:		Pages to be sent/received to/from server.
+ * @sg_cnt:	Number of elements in the @sg
+ *
+ * Return:
+ * 0:		Success
+ * <0:		Error
+ *
+ * On dir=READ the ibtrs client will request a data transfer from the server
+ * to the client. The data the server responds with will be stored in @sg when
+ * the user receives an %IBTRS_CLT_RDMA_EV_RDMA_REQUEST_WRITE_COMPL event.
+ * On dir=WRITE the ibtrs client will RDMA-write the data in @sg to the server.
+ */
+int ibtrs_clt_request(int dir, ibtrs_conf_fn *conf, struct ibtrs_clt *sess,
+		      struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		      size_t nr, size_t len, struct scatterlist *sg,
+		      unsigned int sg_cnt);
+
+/**
+ * ibtrs_attrs - IBTRS session attributes
+ */
+struct ibtrs_attrs {
+	u32	queue_depth;
+	u32	max_io_size;
+	u8	sessname[NAME_MAX];
+};
+
+/**
+ * ibtrs_clt_query() - queries IBTRS session attributes
+ *
+ * Returns:
+ *    0 on success
+ *    -ECOMM		no connection to the server
+ */
+int ibtrs_clt_query(struct ibtrs_clt *sess, struct ibtrs_attrs *attr);
+
+/*
+ * Here goes IBTRS server API
+ */
+
+/**
+ * enum ibtrs_srv_link_ev - Server link events
+ * @IBTRS_SRV_LINK_EV_CONNECTED:	Connection from client established
+ * @IBTRS_SRV_LINK_EV_DISCONNECTED:	Connection was disconnected, all
+ *					connection IBTRS resources were freed.
+ */
+enum ibtrs_srv_link_ev {
+	IBTRS_SRV_LINK_EV_CONNECTED,
+	IBTRS_SRV_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * rdma_ev_fn():	Event notification for RDMA operations
+ *			If the callback returns a value != 0, an error message
+ *			for the data transfer will be sent to the client.
+ *
+ *	@sess:		Session
+ *	@priv:		Private data set by ibtrs_srv_set_sess_priv()
+ *	@id:		internal IBTRS operation id
+ *	@dir:		READ/WRITE
+ *	@data:		Pointer to (bidirectional) rdma memory area:
+ *			- in case of %IBTRS_SRV_RDMA_EV_RECV contains
+ *			data sent by the client
+ *			- in case of %IBTRS_SRV_RDMA_EV_WRITE_REQ points to the
+ *			memory area where the response is to be written to
+ *	@datalen:	Size of the memory area in @data
+ *	@usr:		The extra user message sent by the client (%vec)
+ *	@usrlen:	Size of the user message
+ */
+typedef int (rdma_ev_fn)(struct ibtrs_srv *sess, void *priv,
+			 struct ibtrs_srv_op *id, int dir,
+			 void *data, size_t datalen, const void *usr,
+			 size_t usrlen);
+
+/**
+ * link_ev_fn():	Events about connectivity state changes
+ *			If the callback returns != 0 for the event
+ *			%IBTRS_SRV_LINK_EV_CONNECTED, the corresponding session
+ *			will be destroyed.
+ *	@sess:		Session
+ *	@ev:		event
+ *	@priv:		Private data from user if previously set with
+ *			ibtrs_srv_set_sess_priv()
+ */
+typedef int (link_ev_fn)(struct ibtrs_srv *sess, enum ibtrs_srv_link_ev ev,
+			 void *priv);
+
+/**
+ * ibtrs_srv_open() - open IBTRS server context
+ * @rdma_ev:		Event notification callback for RDMA operations
+ * @link_ev:		Event notification callback for link state changes
+ * @port:		Port the server listens on
+ *
+ * Creates server context with specified callbacks.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port);
+
+/**
+ * ibtrs_srv_close() - close IBTRS server context
+ * @ctx: pointer to server context
+ *
+ * Closes IBTRS server context with all client sessions.
+ */
+void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx);
+
+/**
+ * ibtrs_srv_resp_rdma() - Finish an RDMA request
+ *
+ * @id:		Internal IBTRS operation identifier
+ * @errno:	Response code sent to the other side for this operation;
+ *		0 = success, <0 = error
+ *
+ * Finish an RDMA operation. A message is sent to the client and the
+ * corresponding memory areas will be released.
+ */
+void ibtrs_srv_resp_rdma(struct ibtrs_srv_op *id, int errno);
+
+/**
+ * ibtrs_srv_set_sess_priv() - Set private pointer in ibtrs_srv.
+ * @sess:	Session
+ * @priv:	The private pointer that is associated with the session.
+ */
+void ibtrs_srv_set_sess_priv(struct ibtrs_srv *sess, void *priv);
+
+/**
+ * ibtrs_srv_get_queue_depth() - Get the ibtrs_srv queue depth.
+ * @sess:	Session
+ */
+int ibtrs_srv_get_queue_depth(struct ibtrs_srv *sess);
+
+/**
+ * ibtrs_srv_get_sess_name() - Get ibtrs_srv peer hostname.
+ * @sess:	Session
+ * @sessname:	Sessname buffer
+ * @len:	Length of sessname buffer
+ */
+int ibtrs_srv_get_sess_name(struct ibtrs_srv *sess, char *sessname, size_t len);
+
+/**
+ * ibtrs_addr_to_sockaddr() - convert path string "src,dst" to sockaddrs
+ * @str		string containing source and destination addrs of a path
+ *		separated by a comma, e.g. "ip:1.1.1.1,ip:1.1.1.2". If @str
+ *		contains only one address it is considered to be the destination.
+ * @len		string length
+ * @port	destination port
+ * @addr->dst	will be set to the destination sockaddr.
+ * @addr->src	will be set to the source address or to NULL
+ *		if @str doesn't contain any source address.
+ *
+ * Returns zero if conversion successful. Non-zero otherwise.
+ */
+int ibtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct ibtrs_addr *addr);
+#endif
-- 
2.13.1


* [PATCH 02/24] ibtrs: private headers with IBTRS protocol structs and helpers
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
  2018-02-02 14:08 ` [PATCH 01/24] ibtrs: public interface header to establish RDMA connections Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules Roman Pen
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

These are common private headers with IBTRS protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.
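
As a quick illustration of the immediate-data encoding defined in
ibtrs-pri.h (the concrete values below are only an example): the 32-bit
IMM word carries a 4-bit type and a 28-bit payload, and IO responses
additionally pack a 9-bit errno and a 19-bit message id into that
payload:

  /* IO response for msg_id 7 that failed with -EIO (-5): */
  payload = (abs(-EIO) & 0x1ff) << 19 | (7 & 0x7ffff);    /* 0x00280007 */
  imm = ibtrs_to_imm(IBTRS_IO_RSP_IMM, payload);          /* 0x10280007 */

  /* the receiving side recovers both values: */
  ibtrs_from_imm(imm, &type, &payload);      /* type == IBTRS_IO_RSP_IMM */
  ibtrs_from_io_rsp_imm(payload, &msg_id, &errno); /* msg_id 7, errno -5 */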

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h |  94 ++++++
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h | 494 +++++++++++++++++++++++++++++++
 2 files changed, 588 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-log.h b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
new file mode 100644
index 000000000000..308593785c64
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
@@ -0,0 +1,94 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_LOG_H
+#define IBTRS_LOG_H
+
+#define P1 )
+#define P2 ))
+#define P3 )))
+#define P4 ))))
+#define P(N) P ## N
+
+#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
+#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
+
+#define COUNT_ARGS(...) COUNT_ARGS_(,##__VA_ARGS__,6,5,4,3,2,1,0)
+#define COUNT_ARGS_(z,a,b,c,d,e,f,cnt,...) cnt
+
+#define LIST(...)						\
+	__VA_ARGS__,						\
+	({ unknown_type(); NULL; })				\
+	CAT(P, COUNT_ARGS(__VA_ARGS__))				\
+
+#define EMPTY()
+#define DEFER(id) id EMPTY()
+
+#define _CASE(obj, type, member)				\
+	__builtin_choose_expr(					\
+	__builtin_types_compatible_p(				\
+		typeof(obj), type),				\
+		((type)obj)->member
+#define CASE(o, t, m) DEFER(_CASE)(o,t,m)
+
+/*
+ * Below we define retrieving of sessname from common IBTRS types.
+ * Client or server related types have to be defined by special
+ * TYPES_TO_SESSNAME macro.
+ */
+
+void unknown_type(void);
+
+#ifndef TYPES_TO_SESSNAME
+#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
+#endif
+
+#define ibtrs_prefix(obj)					\
+	_CASE(obj, struct ibtrs_con *,  sess->sessname),	\
+	_CASE(obj, struct ibtrs_sess *, sessname),		\
+	TYPES_TO_SESSNAME(obj)					\
+	))
+
+#define ibtrs_log(fn, obj, fmt, ...)				\
+	fn("<%s>: " fmt, ibtrs_prefix(obj), ##__VA_ARGS__)
+
+#define ibtrs_err(obj, fmt, ...)	\
+	ibtrs_log(pr_err, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_err_rl(obj, fmt, ...)	\
+	ibtrs_log(pr_err_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn(obj, fmt, ...)	\
+	ibtrs_log(pr_warn, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn_rl(obj, fmt, ...) \
+	ibtrs_log(pr_warn_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info(obj, fmt, ...) \
+	ibtrs_log(pr_info, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info_rl(obj, fmt, ...) \
+	ibtrs_log(pr_info_ratelimited, obj, fmt, ##__VA_ARGS__)
+
+#endif /* IBTRS_LOG_H */
diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
new file mode 100644
index 000000000000..b3b51af8607e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
@@ -0,0 +1,494 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_PRI_H
+#define IBTRS_PRI_H
+
+#include <linux/uuid.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib.h>
+
+#include "ibtrs.h"
+
+#define IBTRS_VER_MAJOR 1
+#define IBTRS_VER_MINOR 0
+#define IBTRS_VER_STRING __stringify(IBTRS_VER_MAJOR) "." \
+			 __stringify(IBTRS_VER_MINOR)
+
+enum ibtrs_imm_consts {
+	MAX_IMM_TYPE_BITS = 4,
+	MAX_IMM_TYPE_MASK = ((1 << MAX_IMM_TYPE_BITS) - 1),
+	MAX_IMM_PAYL_BITS = 28,
+	MAX_IMM_PAYL_MASK = ((1 << MAX_IMM_PAYL_BITS) - 1),
+
+	IBTRS_IO_REQ_IMM = 0, /* client to server */
+	IBTRS_IO_RSP_IMM = 1, /* server to client */
+	IBTRS_HB_MSG_IMM = 2,
+	IBTRS_HB_ACK_IMM = 3,
+};
+
+enum {
+	SERVICE_CON_QUEUE_DEPTH = 512,
+
+	MIN_RTR_CNT = 1,
+	MAX_RTR_CNT = 7,
+
+	MAX_PATHS_NUM = 128,
+
+	/*
+	 * With the current size of the tag allocated on the client, 4K
+	 * is the maximum number of tags we can allocate.  This number is
+	 * also used on the client to allocate the IU for the user connection
+	 * to receive the RDMA addresses from the server.
+	 */
+	MAX_SESS_QUEUE_DEPTH = 4096,
+	/*
+	 * Size of user message attached to a request (@vec, @nr) is limited
+	 * by the IO_MSG_SIZE. max_req_size allocated by the server should
+	 * cover both: the user message and the ibtrs message attached
+	 * to an IO. ibtrs_msg_req_rdma_write attached to a read has variable
+	 * size: max number of descriptors we can send is limited by
+	 * max_desc = (max_req_size - IO_MSG_SIZE) / sizeof(desc)
+	 */
+	IO_MSG_SIZE = 512,
+
+	IBTRS_HB_INTERVAL_MS = 5000,
+	IBTRS_HB_MISSED_MAX = 5,
+
+	IBTRS_MAGIC = 0x1BBD,
+	IBTRS_VERSION = (IBTRS_VER_MAJOR << 8) | IBTRS_VER_MINOR,
+};
+
+struct ibtrs_ib_dev {
+	struct list_head	entry;
+	struct kref		ref;
+	struct ib_pd		*pd;
+	struct ib_device	*dev;
+	struct ib_device_attr	attrs;
+	u32			lkey;
+	u32			rkey;
+};
+
+struct ibtrs_con {
+	struct ibtrs_sess	*sess;
+	struct ib_qp		*qp;
+	struct ib_cq		*cq;
+	struct rdma_cm_id	*cm_id;
+	unsigned		cid;
+};
+
+typedef void (ibtrs_hb_handler_t)(struct ibtrs_con *con, int err);
+
+struct ibtrs_sess {
+	struct list_head	entry;
+	struct sockaddr_storage dst_addr;
+	struct sockaddr_storage src_addr;
+	char			sessname[NAME_MAX];
+	uuid_t			uuid;
+	struct ibtrs_con	**con;
+	unsigned int		con_num;
+	unsigned int		recon_cnt;
+	struct ibtrs_ib_dev	*ib_dev;
+	int			ib_dev_ref;
+	struct ib_cqe		*hb_cqe;
+	ibtrs_hb_handler_t	*hb_err_handler;
+	struct workqueue_struct *hb_wq;
+	struct delayed_work	hb_dwork;
+	unsigned		hb_interval_ms;
+	unsigned		hb_missed_cnt;
+	unsigned		hb_missed_max;
+};
+
+struct ibtrs_iu {
+	struct list_head        list;
+	struct ib_cqe           cqe;
+	dma_addr_t              dma_addr;
+	void                    *buf;
+	size_t                  size;
+	enum dma_data_direction direction;
+	u32			tag;
+};
+
+/**
+ * enum ibtrs_msg_types - IBTRS message types.
+ * @IBTRS_MSG_INFO_REQ:		Client additional info request to the server
+ * @IBTRS_MSG_INFO_RSP:		Server additional info response to the client
+ * @IBTRS_MSG_WRITE:		Client writes data per RDMA to server
+ * @IBTRS_MSG_READ:		Client requests data transfer from server
+ * @IBTRS_MSG_USER:		Data transfer per Infiniband message
+ */
+enum ibtrs_msg_types {
+	IBTRS_MSG_INFO_REQ,
+	IBTRS_MSG_INFO_RSP,
+	IBTRS_MSG_WRITE,
+	IBTRS_MSG_READ,
+	IBTRS_MSG_USER,
+};
+
+/**
+ * struct ibtrs_msg_conn_req - Client connection request to the server
+ * @magic:	   IBTRS magic
+ * @version:	   IBTRS protocol version
+ * @cid:	   Current connection id
+ * @cid_num:	   Number of connections per session
+ * @recon_cnt:	   Reconnections counter
+ * @sess_uuid:	   UUID of a session (path)
+ * @paths_uuid:	   UUID of a group of sessions (paths)
+ *
+ * NOTE: max size 56 bytes, see man rdma_connect().
+ */
+struct ibtrs_msg_conn_req {
+	u8		__cma_version; /* Is set to 0 by cma.c in case of
+					* AF_IB, do not touch that. */
+	u8		__ip_version;  /* On sender side that should be
+					* set to 0, or cma_save_ip_info()
+					* extracts garbage and will fail. */
+	__le16		magic;
+	__le16		version;
+	__le16		cid;
+	__le16		cid_num;
+	__le16		recon_cnt;
+	uuid_t		sess_uuid;
+	uuid_t		paths_uuid;
+	u8		reserved[12];
+};
+
+/**
+ * struct ibtrs_msg_conn_rsp - Server connection response to the client
+ * @magic:	   IBTRS magic
+ * @version:	   IBTRS protocol version
+ * @errno:	   If rdma_accept() then 0, if rdma_reject() indicates error
+ * @queue_depth:   max inflight messages (queue-depth) in this session
+ * @rkey:	   remote key to allow client to access buffers
+ * @max_io_size:   max io size server supports
+ * @max_req_size:  max infiniband message size server supports
+ * @uuid:	   Server UUID
+ *
+ * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
+ */
+struct ibtrs_msg_conn_rsp {
+	__le16		magic;
+	__le16		version;
+	__le16		errno;
+	__le16		queue_depth;
+	__le32		rkey;
+	__le32		max_io_size;
+	__le32		max_req_size;
+	uuid_t		uuid;
+	u8		reserved[20];
+};
+
+/**
+ * struct ibtrs_msg_info_req
+ * @type:		@IBTRS_MSG_INFO_REQ
+ * @sessname:		Session name chosen by client
+ */
+struct ibtrs_msg_info_req {
+	__le16		type;
+	u8		sessname[NAME_MAX];
+	u8		reserved[15];
+};
+
+/**
+ * struct ibtrs_msg_info_rsp
+ * @type:		@IBTRS_MSG_INFO_RSP
+ * @addr_num:		Number of rdma addresses
+ * @addr:		RDMA addresses of buffers
+ */
+struct ibtrs_msg_info_rsp {
+	__le16		type;
+	__le16		addr_num;
+	u8		reserved[4];
+	__le64		addr[];
+};
+
+/*
+ *  Data Layout in RDMA-Bufs:
+ *
+ * +---------RDMA-BUF--------+
+ * |         Slice N	     |
+ * | +---------------------+ |
+ * | |      I/O data       | |
+ * | |---------------------| |
+ * | |      IBNBD MSG	   | |
+ * | |---------------------| |
+ * | |	    IBTRS MSG	   | |
+ * | +---------------------+ |
+ * +-------------------------+
+ * |	     Slice N+1	     |
+ * | +---------------------+ |
+ * | |       I/O data	   | |
+ * | |---------------------| |
+ * | |	     IBNBD MSG     | |
+ * | |---------------------| |
+ * | |       IBTRS MSG     | |
+ * | +---------------------+ |
+ * +-------------------------+
+ */
+
+/**
+ * struct ibtrs_msg_user - Data exchanged in an InfiniBand message
+ * @type:		@IBTRS_MSG_USER
+ * @psize:		Payload size
+ * @payl:		Payload data
+ */
+struct ibtrs_msg_user {
+	__le16			type;
+	__le16			psize;
+	u8			payl[];
+};
+
+/**
+ * struct ibtrs_sg_desc - RDMA-Buffer entry description
+ * @addr:	Address of RDMA destination buffer
+ * @key:	Authorization rkey to write to the buffer
+ * @len:	Size of the buffer
+ */
+struct ibtrs_sg_desc {
+	__le64			addr;
+	__le32			key;
+	__le32			len;
+};
+
+/**
+ * struct ibtrs_msg_rdma_read - RDMA data transfer request from client
+ * @type:		always @IBTRS_MSG_READ
+ * @usr_len:		length of user payload
+ * @sg_cnt:		number of @desc entries
+ * @desc:		RDMA buffers where the server can write the result to
+ */
+struct ibtrs_msg_rdma_read {
+	__le16			type;
+	__le16			usr_len;
+	__le32			sg_cnt;
+	struct ibtrs_sg_desc    desc[];
+};
+
+/**
+ * struct ibtrs_msg_rdma_write - Message transferred to server with RDMA-Write
+ * @type:		always @IBTRS_MSG_WRITE
+ * @usr_len:		length of user payload
+ */
+struct ibtrs_msg_rdma_write {
+	__le16			type;
+	__le16			usr_len;
+};
+
+/* ibtrs.c */
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t t,
+				struct ib_device *dev, enum dma_data_direction,
+				void (*done)(struct ib_cq *cq, struct ib_wc *wc));
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+		   struct ib_device *dev);
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu);
+int ibtrs_iu_post_send(struct ibtrs_con *con, struct ibtrs_iu *iu, size_t size);
+int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
+				 struct ib_sge *sge, unsigned int num_sge,
+				 u32 rkey, u64 rdma_addr, u32 imm_data,
+				 enum ib_send_flags flags);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe);
+int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
+				    u32 imm_data, enum ib_send_flags flags);
+
+struct ibtrs_ib_dev *ibtrs_ib_dev_find_get(struct rdma_cm_id *cm_id);
+void ibtrs_ib_dev_put(struct ibtrs_ib_dev *dev);
+
+int ibtrs_cq_qp_create(struct ibtrs_sess *ibtrs_sess, struct ibtrs_con *con,
+		       u32 max_send_sge, int cq_vector, u16 cq_size,
+		       u16 wr_queue_size, enum ib_poll_context poll_ctx);
+void ibtrs_cq_qp_destroy(struct ibtrs_con *con);
+
+void ibtrs_init_hb(struct ibtrs_sess *sess, struct ib_cqe *cqe,
+		   unsigned interval_ms, unsigned missed_max,
+		   ibtrs_hb_handler_t *err_handler,
+		   struct workqueue_struct *wq);
+void ibtrs_start_hb(struct ibtrs_sess *sess);
+void ibtrs_stop_hb(struct ibtrs_sess *sess);
+void ibtrs_send_hb_ack(struct ibtrs_sess *sess);
+
+#define XX(a) case (a): return #a
+static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
+{
+	switch (opcode) {
+	XX(IB_WC_SEND);
+	XX(IB_WC_RDMA_WRITE);
+	XX(IB_WC_RDMA_READ);
+	XX(IB_WC_COMP_SWAP);
+	XX(IB_WC_FETCH_ADD);
+	/* recv-side: inbound completion */
+	XX(IB_WC_RECV);
+	XX(IB_WC_RECV_RDMA_WITH_IMM);
+	default: return "IB_WC_OPCODE_UNKNOWN";
+	}
+}
+
+static inline int sockaddr_cmp(const struct sockaddr *a,
+			       const struct sockaddr *b)
+{
+	switch (a->sa_family) {
+	case AF_IB:
+		return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
+			      &((struct sockaddr_ib *)b)->sib_addr,
+			      sizeof(struct ib_addr));
+	case AF_INET:
+		return memcmp(&((struct sockaddr_in *)a)->sin_addr,
+			      &((struct sockaddr_in *)b)->sin_addr,
+			      sizeof(struct in_addr));
+	case AF_INET6:
+		return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
+			      &((struct sockaddr_in6 *)b)->sin6_addr,
+			      sizeof(struct in6_addr));
+	default:
+		return -ENOENT;
+	}
+}
+
+static inline void sockaddr_to_str(const struct sockaddr *addr,
+				   char *buf, size_t len)
+{
+	switch (addr->sa_family) {
+	case AF_IB:
+		scnprintf(buf, len, "gid:%pI6",
+			  &((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
+		return;
+	case AF_INET:
+		scnprintf(buf, len, "ip:%pI4",
+			  &((struct sockaddr_in *)addr)->sin_addr);
+		return;
+	case AF_INET6:
+		scnprintf(buf, len, "ip:%pI6c",
+			  &((struct sockaddr_in6 *)addr)->sin6_addr);
+		return;
+	}
+	scnprintf(buf, len, "<invalid address family>");
+	pr_err("Invalid address family\n");
+}
+
+/**
+ * kvec_length() - Total number of bytes covered by a kvec.
+ */
+static inline size_t kvec_length(const struct kvec *vec, size_t nr)
+{
+	size_t seg, ret = 0;
+
+	for (seg = 0; seg < nr; seg++)
+		ret += vec[seg].iov_len;
+	return ret;
+}
+
+/**
+ * copy_from_kvec() - Copy kvec to the buffer.
+ */
+static inline void copy_from_kvec(void *data, const struct kvec *vec,
+				  size_t copy)
+{
+	size_t seg, len;
+
+	for (seg = 0; copy; seg++) {
+		len = min(vec[seg].iov_len, copy);
+		memcpy(data, vec[seg].iov_base, len);
+		data += len;
+		copy -= len;
+	}
+}
+
+static inline u32 ibtrs_to_imm(u32 type, u32 payload)
+{
+	BUILD_BUG_ON(32 != MAX_IMM_PAYL_BITS + MAX_IMM_TYPE_BITS);
+	return ((type & MAX_IMM_TYPE_MASK) << MAX_IMM_PAYL_BITS) |
+		(payload & MAX_IMM_PAYL_MASK);
+}
+
+static inline void ibtrs_from_imm(u32 imm, u32 *type, u32 *payload)
+{
+	*payload = (imm & MAX_IMM_PAYL_MASK);
+	*type = (imm >> MAX_IMM_PAYL_BITS);
+}
+
+static inline u32 ibtrs_to_io_req_imm(u32 addr)
+{
+	return ibtrs_to_imm(IBTRS_IO_REQ_IMM, addr);
+}
+
+static inline u32 ibtrs_to_io_rsp_imm(u32 msg_id, int errno)
+{
+	u32 payload;
+
+	/* 9 bits for errno, 19 bits for msg_id */
+	payload = (abs(errno) & 0x1ff) << 19 | (msg_id & 0x7ffff);
+	return ibtrs_to_imm(IBTRS_IO_RSP_IMM, payload);
+}
+
+static inline void ibtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
+{
+	/* 9 bits for errno, 19 bits for msg_id */
+	*msg_id = (payload & 0x7ffff);
+	*errno = -(int)((payload >> 19) & 0x1ff);
+}
+
+#define STAT_STORE_FUNC(type, store, reset)				\
+static ssize_t store##_store(struct kobject *kobj,			\
+			     struct kobj_attribute *attr,		\
+			     const char *buf, size_t count)		\
+{									\
+	int ret = -EINVAL;						\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	if (sysfs_streq(buf, "1"))					\
+		ret = reset(&sess->stats, true);			\
+	else if (sysfs_streq(buf, "0"))					\
+		ret = reset(&sess->stats, false);			\
+	if (ret)							\
+		return ret;						\
+									\
+	return count;							\
+}
+
+#define STAT_SHOW_FUNC(type, show, print)				\
+static ssize_t show##_show(struct kobject *kobj,			\
+			   struct kobj_attribute *attr,			\
+			   char *page)					\
+{									\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	return print(&sess->stats, page, PAGE_SIZE);			\
+}
+
+#define STAT_ATTR(type, stat, print, reset)				\
+STAT_STORE_FUNC(type, stat, reset)					\
+STAT_SHOW_FUNC(type, stat, print)					\
+static struct kobj_attribute stat##_attr =				\
+		__ATTR(stat, 0644,					\
+		       stat##_show,					\
+		       stat##_store)
+
+#endif /* IBTRS_PRI_H */
-- 
2.13.1


* [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
  2018-02-02 14:08 ` [PATCH 01/24] ibtrs: public interface header to establish RDMA connections Roman Pen
  2018-02-02 14:08 ` [PATCH 02/24] ibtrs: private headers with IBTRS protocol structs and helpers Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-05 10:52   ` Sagi Grimberg
  2018-02-02 14:08 ` [PATCH 04/24] ibtrs: client: private header with client structs and functions Roman Pen
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is a set of library functions packaged as the ibtrs-core module
and used by both the client and server modules.

These functions mainly wrap IB and RDMA calls and provide a slightly
higher level of abstraction for implementing the IBTRS protocol on the
client and server sides.
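
For illustration, a rough (hypothetical) sketch of how a ULP is
expected to drive these helpers for one connection; it assumes
con->cm_id has already been resolved, sess->ib_dev was obtained via
ibtrs_ib_dev_find_get(), and the size variables are placeholders:

  static void ex_recv_done(struct ib_cq *cq, struct ib_wc *wc)
  {
          /* handle completion of the posted receive */
  }

  ...
          /* create CQ and QP for one connection of a session */
          err = ibtrs_cq_qp_create(sess, con, max_send_sge, cq_vector,
                                   cq_size, wr_queue_size, IB_POLL_SOFTIRQ);
          if (err)
                  return err;

          /* allocate a DMA-mapped IU and post it for receive */
          iu = ibtrs_iu_alloc(tag, msg_size, GFP_KERNEL, sess->ib_dev->dev,
                              DMA_FROM_DEVICE, ex_recv_done);
          if (!iu)
                  return -ENOMEM;
          err = ibtrs_iu_post_recv(con, iu);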

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs.c | 582 +++++++++++++++++++++++++++++++++++
 1 file changed, 582 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.c b/drivers/infiniband/ulp/ibtrs/ibtrs.c
new file mode 100644
index 000000000000..007380506959
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.c
@@ -0,0 +1,582 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/inet.h>
+
+#include "ibtrs-pri.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Core");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static LIST_HEAD(device_list);
+static DEFINE_MUTEX(device_list_mutex);
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t gfp_mask,
+				struct ib_device *dma_dev,
+				enum dma_data_direction direction,
+				void (*done)(struct ib_cq *cq,
+					     struct ib_wc *wc))
+{
+	struct ibtrs_iu *iu;
+
+	iu = kmalloc(sizeof(*iu), gfp_mask);
+	if (unlikely(!iu))
+		return NULL;
+
+	iu->buf = kzalloc(size, gfp_mask);
+	if (unlikely(!iu->buf))
+		goto err1;
+
+	iu->dma_addr = ib_dma_map_single(dma_dev, iu->buf, size, direction);
+	if (unlikely(ib_dma_mapping_error(dma_dev, iu->dma_addr)))
+		goto err2;
+
+	iu->cqe.done  = done;
+	iu->size      = size;
+	iu->direction = direction;
+	iu->tag       = tag;
+
+	return iu;
+
+err2:
+	kfree(iu->buf);
+err1:
+	kfree(iu);
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_alloc);
+
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+		   struct ib_device *ibdev)
+{
+	if (!iu)
+		return;
+
+	ib_dma_unmap_single(ibdev, iu->dma_addr, iu->size, dir);
+	kfree(iu->buf);
+	kfree(iu);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_free);
+
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu)
+{
+	struct ibtrs_sess *sess = con->sess;
+	struct ib_recv_wr wr, *bad_wr;
+	struct ib_sge list;
+
+	list.addr   = iu->dma_addr;
+	list.length = iu->size;
+	list.lkey   = sess->ib_dev->lkey;
+
+	if (WARN_ON(list.length == 0)) {
+		ibtrs_wrn(con, "Posting receive work request failed,"
+			  " sg list is empty\n");
+		return -EINVAL;
+	}
+
+	wr.next    = NULL;
+	wr.wr_cqe  = &iu->cqe;
+	wr.sg_list = &list;
+	wr.num_sge = 1;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_recv);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+	struct ib_recv_wr wr, *bad_wr;
+
+	wr.next    = NULL;
+	wr.wr_cqe  = cqe;
+	wr.sg_list = NULL;
+	wr.num_sge = 0;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);
+
+int ibtrs_iu_post_send(struct ibtrs_con *con, struct ibtrs_iu *iu, size_t size)
+{
+	struct ibtrs_sess *sess = con->sess;
+	struct ib_send_wr wr, *bad_wr;
+	struct ib_sge list;
+
+	if ((WARN_ON(size == 0)))
+		return -EINVAL;
+
+	list.addr   = iu->dma_addr;
+	list.length = size;
+	list.lkey   = sess->ib_dev->lkey;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.next       = NULL;
+	wr.wr_cqe     = &iu->cqe;
+	wr.sg_list    = &list;
+	wr.num_sge    = 1;
+	wr.opcode     = IB_WR_SEND;
+	wr.send_flags = IB_SEND_SIGNALED;
+
+	return ib_post_send(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_send);
+
+int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
+				 struct ib_sge *sge, unsigned int num_sge,
+				 u32 rkey, u64 rdma_addr, u32 imm_data,
+				 enum ib_send_flags flags)
+{
+	struct ib_send_wr *bad_wr;
+	struct ib_rdma_wr wr;
+	int i;
+
+	wr.wr.next	  = NULL;
+	wr.wr.wr_cqe	  = &iu->cqe;
+	wr.wr.sg_list	  = sge;
+	wr.wr.num_sge	  = num_sge;
+	wr.rkey		  = rkey;
+	wr.remote_addr	  = rdma_addr;
+	wr.wr.opcode	  = IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.wr.ex.imm_data = cpu_to_be32(imm_data);
+	wr.wr.send_flags  = flags;
+
+	/*
+	 * If one of the sges has 0 size, the operation will fail with a
+	 * length error
+	 */
+	for (i = 0; i < num_sge; i++)
+		if (WARN_ON(sge[i].length == 0))
+			return -EINVAL;
+
+	return ib_post_send(con->qp, &wr.wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_rdma_write_imm);
+
+int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
+				    u32 imm_data, enum ib_send_flags flags)
+{
+	struct ib_send_wr wr, *bad_wr;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.wr_cqe	= cqe;
+	wr.send_flags	= flags;
+	wr.opcode	= IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.ex.imm_data	= cpu_to_be32(imm_data);
+
+	return ib_post_send(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_rdma_write_imm_empty);
+
+static void qp_event_handler(struct ib_event *ev, void *ctx)
+{
+	struct ibtrs_con *con = ctx;
+
+	switch (ev->event) {
+	case IB_EVENT_COMM_EST:
+		ibtrs_info(con, "QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		rdma_notify(con->cm_id, IB_EVENT_COMM_EST);
+		break;
+	default:
+		ibtrs_info(con, "Unhandled QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		break;
+	}
+}
+
+static int ibtrs_query_device(struct ibtrs_ib_dev *ib_dev)
+{
+	struct ib_udata uhw = {.outlen = 0, .inlen = 0};
+
+	memset(&ib_dev->attrs, 0, sizeof(ib_dev->attrs));
+
+	return ib_dev->dev->query_device(ib_dev->dev, &ib_dev->attrs, &uhw);
+}
+
+static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device *dev)
+{
+	int err;
+
+	d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
+	if (IS_ERR(d->pd))
+		return PTR_ERR(d->pd);
+	d->dev = dev;
+	d->lkey = d->pd->local_dma_lkey;
+	d->rkey = d->pd->unsafe_global_rkey;
+
+	err = ibtrs_query_device(d);
+	if (unlikely(err))
+		ib_dealloc_pd(d->pd);
+
+	return err;
+}
+
+static void ibtrs_ib_dev_destroy(struct ibtrs_ib_dev *d)
+{
+	if (d->pd) {
+		ib_dealloc_pd(d->pd);
+		d->pd = NULL;
+		d->dev = NULL;
+		d->lkey = 0;
+		d->rkey = 0;
+	}
+}
+
+struct ibtrs_ib_dev *ibtrs_ib_dev_find_get(struct rdma_cm_id *cm_id)
+{
+	struct ibtrs_ib_dev *dev;
+	int err;
+
+	mutex_lock(&device_list_mutex);
+	list_for_each_entry(dev, &device_list, entry) {
+		if (dev->dev->node_guid == cm_id->device->node_guid &&
+		    kref_get_unless_zero(&dev->ref))
+			goto out_unlock;
+	}
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (unlikely(!dev))
+		goto out_err;
+
+	kref_init(&dev->ref);
+	err = ibtrs_ib_dev_init(dev, cm_id->device);
+	if (unlikely(err))
+		goto out_free;
+	list_add(&dev->entry, &device_list);
+out_unlock:
+	mutex_unlock(&device_list_mutex);
+
+	return dev;
+
+out_free:
+	kfree(dev);
+out_err:
+	mutex_unlock(&device_list_mutex);
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(ibtrs_ib_dev_find_get);
+
+static void ibtrs_ib_dev_free(struct kref *ref)
+{
+	struct ibtrs_ib_dev *dev;
+
+	dev = container_of(ref, struct ibtrs_ib_dev, ref);
+
+	mutex_lock(&device_list_mutex);
+	list_del(&dev->entry);
+	mutex_unlock(&device_list_mutex);
+	ibtrs_ib_dev_destroy(dev);
+	kfree(dev);
+}
+
+void ibtrs_ib_dev_put(struct ibtrs_ib_dev *dev)
+{
+	kref_put(&dev->ref, ibtrs_ib_dev_free);
+}
+EXPORT_SYMBOL_GPL(ibtrs_ib_dev_put);
+
+static int create_cq(struct ibtrs_con *con, int cq_vector, u16 cq_size,
+		     enum ib_poll_context poll_ctx)
+{
+	struct rdma_cm_id *cm_id = con->cm_id;
+	struct ib_cq *cq;
+
+	cq = ib_alloc_cq(cm_id->device, con, cq_size * 2 + 1,
+			 cq_vector, poll_ctx);
+	if (unlikely(IS_ERR(cq))) {
+		ibtrs_err(con, "Creating completion queue failed, errno: %ld\n",
+			  PTR_ERR(cq));
+		return PTR_ERR(cq);
+	}
+	con->cq = cq;
+
+	return 0;
+}
+
+static int create_qp(struct ibtrs_con *con, struct ib_pd *pd,
+		     u16 wr_queue_size, u32 max_send_sge)
+{
+	struct ib_qp_init_attr init_attr = {NULL};
+	struct rdma_cm_id *cm_id = con->cm_id;
+	int ret;
+
+	init_attr.cap.max_send_wr = wr_queue_size;
+	init_attr.cap.max_recv_wr = wr_queue_size;
+	init_attr.cap.max_recv_sge = 2;
+	init_attr.event_handler = qp_event_handler;
+	init_attr.qp_context = con;
+	init_attr.cap.max_send_sge = max_send_sge;
+
+	init_attr.qp_type = IB_QPT_RC;
+	init_attr.send_cq = con->cq;
+	init_attr.recv_cq = con->cq;
+	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+
+	ret = rdma_create_qp(cm_id, pd, &init_attr);
+	if (unlikely(ret)) {
+		ibtrs_err(con, "Creating QP failed, err: %d\n", ret);
+		return ret;
+	}
+	con->qp = cm_id->qp;
+
+	return ret;
+}
+
+int ibtrs_cq_qp_create(struct ibtrs_sess *sess, struct ibtrs_con *con,
+		       u32 max_send_sge, int cq_vector, u16 cq_size,
+		       u16 wr_queue_size, enum ib_poll_context poll_ctx)
+{
+	int err;
+
+	err = create_cq(con, cq_vector, cq_size, poll_ctx);
+	if (unlikely(err))
+		return err;
+
+	err = create_qp(con, sess->ib_dev->pd, wr_queue_size, max_send_sge);
+	if (unlikely(err)) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+		return err;
+	}
+	con->sess = sess;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ibtrs_cq_qp_create);
+
+void ibtrs_cq_qp_destroy(struct ibtrs_con *con)
+{
+	if (con->qp) {
+		rdma_destroy_qp(con->cm_id);
+		con->qp = NULL;
+	}
+	if (con->cq) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(ibtrs_cq_qp_destroy);
+
+static void schedule_hb(struct ibtrs_sess *sess)
+{
+	queue_delayed_work(sess->hb_wq, &sess->hb_dwork,
+			   msecs_to_jiffies(sess->hb_interval_ms));
+}
+
+void ibtrs_send_hb_ack(struct ibtrs_sess *sess)
+{
+	struct ibtrs_con *usr_con = sess->con[0];
+	u32 imm;
+	int err;
+
+	imm = ibtrs_to_imm(IBTRS_HB_ACK_IMM, 0);
+	err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe,
+					      imm, IB_SEND_SIGNALED);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con, err);
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(ibtrs_send_hb_ack);
+
+static void hb_work(struct work_struct *work)
+{
+	struct ibtrs_con *usr_con;
+	struct ibtrs_sess *sess;
+	u32 imm;
+	int err;
+
+	sess = container_of(to_delayed_work(work), typeof(*sess), hb_dwork);
+	usr_con = sess->con[0];
+
+	if (sess->hb_missed_cnt > sess->hb_missed_max) {
+		sess->hb_err_handler(usr_con, -ETIMEDOUT);
+		return;
+	}
+	if (sess->hb_missed_cnt++) {
+		/* Reschedule work without sending hb */
+		schedule_hb(sess);
+		return;
+	}
+	imm = ibtrs_to_imm(IBTRS_HB_MSG_IMM, 0);
+	err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe,
+					      imm, IB_SEND_SIGNALED);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con, err);
+		return;
+	}
+
+	schedule_hb(sess);
+}
+
+void ibtrs_init_hb(struct ibtrs_sess *sess, struct ib_cqe *cqe,
+		   unsigned int interval_ms, unsigned int missed_max,
+		   ibtrs_hb_handler_t *err_handler,
+		   struct workqueue_struct *wq)
+{
+	sess->hb_cqe = cqe;
+	sess->hb_interval_ms = interval_ms;
+	sess->hb_err_handler = err_handler;
+	sess->hb_wq = wq;
+	sess->hb_missed_max = missed_max;
+	sess->hb_missed_cnt = 0;
+	INIT_DELAYED_WORK(&sess->hb_dwork, hb_work);
+}
+EXPORT_SYMBOL_GPL(ibtrs_init_hb);
+
+void ibtrs_start_hb(struct ibtrs_sess *sess)
+{
+	schedule_hb(sess);
+}
+EXPORT_SYMBOL_GPL(ibtrs_start_hb);
+
+void ibtrs_stop_hb(struct ibtrs_sess *sess)
+{
+	cancel_delayed_work_sync(&sess->hb_dwork);
+	sess->hb_missed_cnt = 0;
+	sess->hb_missed_max = 0;
+}
+EXPORT_SYMBOL_GPL(ibtrs_stop_hb);
+
+static int ibtrs_str_ipv4_to_sockaddr(const char *addr, size_t len,
+				      short port, struct sockaddr *dst)
+{
+	struct sockaddr_in *dst_sin = (struct sockaddr_in *)dst;
+	int ret;
+
+	ret = in4_pton(addr, len, (u8 *)&dst_sin->sin_addr.s_addr,
+		       '\0', NULL);
+	if (ret == 0)
+		return -EINVAL;
+
+	dst_sin->sin_family = AF_INET;
+	dst_sin->sin_port = htons(port);
+
+	return 0;
+}
+
+static int ibtrs_str_ipv6_to_sockaddr(const char *addr, size_t len,
+				      short port, struct sockaddr *dst)
+{
+	struct sockaddr_in6 *dst_sin6 = (struct sockaddr_in6 *)dst;
+	int ret;
+
+	ret = in6_pton(addr, len, dst_sin6->sin6_addr.s6_addr,
+		       '\0', NULL);
+	if (ret != 1)
+		return -EINVAL;
+
+	dst_sin6->sin6_family = AF_INET6;
+	dst_sin6->sin6_port = htons(port);
+
+	return 0;
+}
+
+static int ibtrs_str_gid_to_sockaddr(const char *addr, size_t len,
+				     short port, struct sockaddr *dst)
+{
+	struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
+	int ret;
+
+	/* We can use some of the IPv6 functions since a GID is a valid
+	 * IPv6 address format
+	 */
+	ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
+	if (ret == 0)
+		return -EINVAL;
+
+	dst_ib->sib_family = AF_IB;
+	/*
+	 * Use the same TCP server port number as the IB service ID
+	 * on the IB port space range
+	 */
+	dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | port);
+	dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+	dst_ib->sib_pkey = cpu_to_be16(0xffff);
+
+	return 0;
+}
+
+/**
+ * ibtrs_str_to_sockaddr() - Convert ibtrs address string to sockaddr
+ * @addr	String representation of an addr (IPv4, IPv6 or IB GID):
+ *              - "ip:192.168.1.1"
+ *              - "ip:fe80::200:5aee:feaa:20a2"
+ *              - "gid:fe80::200:5aee:feaa:20a2"
+ * @len         String address length
+ * @port	Destination port
+ * @dst		Destination sockaddr structure
+ *
+ * Returns 0 if conversion successful. Non-zero on error.
+ */
+static int ibtrs_str_to_sockaddr(const char *addr, size_t len,
+				 short port, struct sockaddr *dst)
+{
+	if (strncmp(addr, "gid:", 4) == 0) {
+		return ibtrs_str_gid_to_sockaddr(addr + 4, len - 4, port, dst);
+	} else if (strncmp(addr, "ip:", 3) == 0) {
+		if (ibtrs_str_ipv4_to_sockaddr(addr + 3, len - 3, port, dst))
+			return ibtrs_str_ipv6_to_sockaddr(addr + 3, len - 3,
+							  port, dst);
+		else
+			return 0;
+	}
+	return -EPROTONOSUPPORT;
+}
+
+int ibtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct ibtrs_addr *addr)
+{
+	const char *d;
+	int ret;
+
+	d = strchr(str, ',');
+	if (d) {
+		if (ibtrs_str_to_sockaddr(str, d - str, 0, addr->src))
+			return -EINVAL;
+		d += 1;
+		len -= d - str;
+		str  = d;
+
+	} else {
+		addr->src = NULL;
+	}
+	ret = ibtrs_str_to_sockaddr(str, len, port, addr->dst);
+
+	return ret;
+}
+EXPORT_SYMBOL(ibtrs_addr_to_sockaddr);
-- 
2.13.1


* [PATCH 04/24] ibtrs: client: private header with client structs and functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (2 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-05 10:59   ` Sagi Grimberg
  2018-02-02 14:08 ` [PATCH 05/24] ibtrs: client: main functionality Roman Pen
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This header describes the main structs and functions used by the
ibtrs-client module, mainly for managing IBTRS sessions,
creating/destroying sysfs entries and accounting statistics on the
client side.
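
For illustration, here is a minimal user-space sketch (not part of the
patch, all names are made up) of the flat tag array layout that the
TAG_SIZE()/GET_TAG() helpers in this header assume: every slot is a
fixed tag header followed by pdu_sz bytes of caller-private payload.

	#include <stdio.h>
	#include <stdlib.h>

	struct tag {			/* stands in for struct ibtrs_tag */
		unsigned int mem_id;
	};

	#define TAG_SIZE(pdu_sz)	(sizeof(struct tag) + (pdu_sz))
	#define GET_TAG(base, pdu_sz, idx) \
		((struct tag *)((char *)(base) + TAG_SIZE(pdu_sz) * (idx)))

	int main(void)
	{
		size_t queue_depth = 4, pdu_sz = 64, i;
		void *tags = calloc(queue_depth, TAG_SIZE(pdu_sz));

		if (!tags)
			return 1;
		for (i = 0; i < queue_depth; i++)
			GET_TAG(tags, pdu_sz, i)->mem_id = i;
		for (i = 0; i < queue_depth; i++)
			printf("tag %u sits at offset %zu\n",
			       GET_TAG(tags, pdu_sz, i)->mem_id,
			       (size_t)((char *)GET_TAG(tags, pdu_sz, i) -
					(char *)tags));
		free(tags);
		return 0;
	}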

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h | 338 +++++++++++++++++++++++++++++++
 1 file changed, 338 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
new file mode 100644
index 000000000000..b57af19ac833
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
@@ -0,0 +1,338 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_CLT_H
+#define IBTRS_CLT_H
+
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_clt_state - Client states.
+ */
+enum ibtrs_clt_state {
+	IBTRS_CLT_CONNECTING,
+	IBTRS_CLT_CONNECTING_ERR,
+	IBTRS_CLT_RECONNECTING,
+	IBTRS_CLT_CONNECTED,
+	IBTRS_CLT_CLOSING,
+	IBTRS_CLT_CLOSED,
+	IBTRS_CLT_DEAD,
+};
+
+static inline const char *ibtrs_clt_state_str(enum ibtrs_clt_state state)
+{
+	switch (state) {
+	case IBTRS_CLT_CONNECTING:
+		return "IBTRS_CLT_CONNECTING";
+	case IBTRS_CLT_CONNECTING_ERR:
+		return "IBTRS_CLT_CONNECTING_ERR";
+	case IBTRS_CLT_RECONNECTING:
+		return "IBTRS_CLT_RECONNECTING";
+	case IBTRS_CLT_CONNECTED:
+		return "IBTRS_CLT_CONNECTED";
+	case IBTRS_CLT_CLOSING:
+		return "IBTRS_CLT_CLOSING";
+	case IBTRS_CLT_CLOSED:
+		return "IBTRS_CLT_CLOSED";
+	case IBTRS_CLT_DEAD:
+		return "IBTRS_CLT_DEAD";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+enum ibtrs_fast_reg {
+	IBTRS_FAST_MEM_NONE,
+	IBTRS_FAST_MEM_FR,
+	IBTRS_FAST_MEM_FMR
+};
+
+enum ibtrs_mp_policy {
+	MP_POLICY_RR,
+	MP_POLICY_MIN_INFLIGHT,
+};
+
+struct ibtrs_clt_stats_reconnects {
+	int successful_cnt;
+	int fail_cnt;
+};
+
+struct ibtrs_clt_stats_wc_comp {
+	u32 cnt;
+	u64 total_cnt;
+};
+
+struct ibtrs_clt_stats_cpu_migr {
+	atomic_t from;
+	int to;
+};
+
+struct ibtrs_clt_stats_rdma {
+	struct {
+		u64 cnt;
+		u64 size_total;
+	} dir[2];
+
+	u64 failover_cnt;
+};
+
+struct ibtrs_clt_stats_rdma_lat {
+	u64 read;
+	u64 write;
+};
+
+#define MIN_LOG_SG 2
+#define MAX_LOG_SG 5
+#define MAX_LIN_SG BIT(MIN_LOG_SG)
+#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)
+
+#define MAX_LOG_LAT 16
+#define MIN_LOG_LAT 0
+#define LOG_LAT_SZ (MAX_LOG_LAT - MIN_LOG_LAT + 2)
+
+struct ibtrs_clt_stats_pcpu {
+	struct ibtrs_clt_stats_cpu_migr		cpu_migr;
+	struct ibtrs_clt_stats_rdma		rdma;
+	u64					sg_list_total;
+	u64					sg_list_distr[SG_DISTR_SZ];
+	struct ibtrs_clt_stats_rdma_lat		rdma_lat_distr[LOG_LAT_SZ];
+	struct ibtrs_clt_stats_rdma_lat		rdma_lat_max;
+	struct ibtrs_clt_stats_wc_comp		wc_comp;
+};
+
+struct ibtrs_clt_stats {
+	bool					enable_rdma_lat;
+	struct ibtrs_clt_stats_pcpu    __percpu	*pcpu_stats;
+	struct ibtrs_clt_stats_reconnects	reconnects;
+	atomic_t				inflight;
+};
+
+struct ibtrs_clt_con {
+	struct ibtrs_con	c;
+	unsigned		cpu;
+	atomic_t		io_cnt;
+	struct ibtrs_fr_pool	*fr_pool;
+	int			cm_err;
+};
+
+struct ibtrs_clt_io_req {
+	struct list_head        list;
+	struct ibtrs_iu		*iu;
+	struct scatterlist	*sglist; /* list holding user data */
+	unsigned int		sg_cnt;
+	unsigned int		sg_size;
+	unsigned int		data_len;
+	unsigned int		usr_len;
+	void			*priv;
+	bool			in_use;
+	struct ibtrs_clt_con	*con;
+	union {
+		struct ib_pool_fmr	**fmr_list;
+		struct ibtrs_fr_desc	**fr_list;
+	};
+	void			*map_page;
+	struct ibtrs_tag	*tag;
+	u16			nmdesc;
+	enum dma_data_direction dir;
+	ibtrs_conf_fn		*conf;
+	unsigned long		start_time;
+};
+
+struct ibtrs_clt_sess {
+	struct ibtrs_sess	s;
+	struct ibtrs_clt	*clt;
+	wait_queue_head_t	state_wq;
+	enum ibtrs_clt_state	state;
+	atomic_t		connected_cnt;
+	struct mutex		init_mutex;
+	struct ibtrs_clt_io_req	*reqs;
+	struct ib_fmr_pool	*fmr_pool;
+	struct delayed_work	reconnect_dwork;
+	struct work_struct	close_work;
+	unsigned		reconnect_attempts;
+	bool			established;
+	u64			*srv_rdma_addr;
+	u32			srv_rdma_buf_rkey;
+	u32			max_io_size;
+	u32			max_req_size;
+	u32			chunk_size;
+	u32			max_desc;
+	size_t			queue_depth;
+	enum ibtrs_fast_reg	fast_reg_mode;
+	u64			mr_page_mask;
+	u32			mr_page_size;
+	u32			mr_max_size;
+	u32			max_pages_per_mr;
+	int			max_sge;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct ibtrs_clt_stats  stats;
+	struct list_head __percpu *mp_skip_entry;
+};
+
+struct ibtrs_clt {
+	struct list_head   /* __rcu */ paths_list;
+	size_t			       paths_num;
+	struct ibtrs_clt_sess __percpu * __rcu *pcpu_path;
+
+	bool			opened;
+	uuid_t			paths_uuid;
+	int			paths_up;
+	struct mutex		paths_mutex;
+	struct mutex		paths_ev_mutex;
+	char			sessname[NAME_MAX];
+	short			port;
+	unsigned		max_reconnect_attempts;
+	unsigned		reconnect_delay_sec;
+	unsigned		max_segments;
+	void			*tags;
+	unsigned long		*tags_map;
+	size_t			queue_depth;
+	size_t			max_io_size;
+	wait_queue_head_t	tags_wait;
+	size_t			pdu_sz;
+	void			*priv;
+	link_clt_ev_fn		*link_ev;
+	struct kobject		kobj;
+	struct kobject		kobj_paths;
+	enum ibtrs_mp_policy	mp_policy;
+};
+
+static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
+{
+	if (unlikely(!c))
+		return NULL;
+
+	return container_of(c, struct ibtrs_clt_con, c);
+}
+
+static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
+{
+	if (unlikely(!s))
+		return NULL;
+
+	return container_of(s, struct ibtrs_clt_sess, s);
+}
+
+/**
+ * list_next_or_null_rcu_rr - get next list element in round-robin fashion.
+ * @pos:     entry, starting cursor.
+ * @head:    head of the list to examine. This list must have at least one
+ *           element, namely @pos.
+ * @member:  name of the list_head structure within typeof(*pos).
+ *
+ * Note that @pos is a list entry which may already have been removed using
+ * list_del_rcu(), so if @head has become empty, NULL is returned. Otherwise
+ * the next element is returned in round-robin fashion.
+ */
+#define list_next_or_null_rcu_rr(pos, head, member) ({			\
+	typeof(pos) ________next = NULL;				\
+									\
+	if (!list_empty(head))						\
+		________next = (pos)->member.next != (head) ?		\
+			list_entry_rcu((pos)->member.next,		\
+				       typeof(*pos), member) :		\
+			list_entry_rcu((pos)->member.next->next,	\
+				       typeof(*pos), member);		\
+	________next;							\
+})
+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj)						\
+	LIST(CASE(obj, struct ibtrs_clt_sess *, s.sessname),		\
+	     CASE(obj, struct ibtrs_clt *, sessname))
+
+#define TAG_SIZE(clt) (sizeof(struct ibtrs_tag) + (clt)->pdu_sz)
+#define GET_TAG(clt, idx) ((clt)->tags + TAG_SIZE(clt) * idx)
+
+int ibtrs_clt_reconnect_from_sysfs(struct ibtrs_clt_sess *sess);
+int ibtrs_clt_disconnect_from_sysfs(struct ibtrs_clt_sess *sess);
+int ibtrs_clt_create_path_from_sysfs(struct ibtrs_clt *clt,
+				     struct ibtrs_addr *addr);
+int ibtrs_clt_remove_path_from_sysfs(struct ibtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self);
+
+void ibtrs_clt_set_max_reconnect_attempts(struct ibtrs_clt *clt, int value);
+int ibtrs_clt_get_max_reconnect_attempts(const struct ibtrs_clt *clt);
+
+/* ibtrs-clt-stats.c */
+
+int ibtrs_clt_init_stats(struct ibtrs_clt_stats *stats);
+void ibtrs_clt_free_stats(struct ibtrs_clt_stats *stats);
+
+void ibtrs_clt_decrease_inflight(struct ibtrs_clt_stats *s);
+void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *s);
+
+void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *s, bool read,
+			       unsigned long ms);
+void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con);
+void ibtrs_clt_update_all_stats(struct ibtrs_clt_io_req *req, int dir);
+
+int ibtrs_clt_reset_sg_list_distr_stats(struct ibtrs_clt_stats *stats,
+					bool enable);
+int ibtrs_clt_stats_sg_list_distr_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len);
+int ibtrs_clt_reset_rdma_lat_distr_stats(struct ibtrs_clt_stats *stats,
+					 bool enable);
+ssize_t ibtrs_clt_stats_rdma_lat_distr_to_str(struct ibtrs_clt_stats *stats,
+					      char *page, size_t len);
+int ibtrs_clt_reset_cpu_migr_stats(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_migration_cnt_to_str(struct ibtrs_clt_stats *stats, char *buf,
+					 size_t len);
+int ibtrs_clt_reset_reconnects_stat(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_reconnects_to_str(struct ibtrs_clt_stats *stats, char *buf,
+				      size_t len);
+int ibtrs_clt_reset_wc_comp_stats(struct ibtrs_clt_stats *stats, bool enable);
+int ibtrs_clt_stats_wc_completion_to_str(struct ibtrs_clt_stats *stats, char *buf,
+					 size_t len);
+int ibtrs_clt_reset_rdma_stats(struct ibtrs_clt_stats *stats, bool enable);
+ssize_t ibtrs_clt_stats_rdma_to_str(struct ibtrs_clt_stats *stats,
+				    char *page, size_t len);
+bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess);
+int ibtrs_clt_reset_all_stats(struct ibtrs_clt_stats *stats, bool enable);
+ssize_t ibtrs_clt_reset_all_help(struct ibtrs_clt_stats *stats,
+				 char *page, size_t len);
+
+/* ibtrs-clt-sysfs.c */
+
+int ibtrs_clt_create_sysfs_module_files(void);
+void ibtrs_clt_destroy_sysfs_module_files(void);
+
+int ibtrs_clt_create_sysfs_root_folders(struct ibtrs_clt *clt);
+int ibtrs_clt_create_sysfs_root_files(struct ibtrs_clt *clt);
+void ibtrs_clt_destroy_sysfs_root_folders(struct ibtrs_clt *clt);
+void ibtrs_clt_destroy_sysfs_root_files(struct ibtrs_clt *clt);
+
+int ibtrs_clt_create_sess_files(struct ibtrs_clt_sess *sess);
+void ibtrs_clt_destroy_sess_files(struct ibtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self);
+
+#endif /* IBTRS_CLT_H */
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 05/24] ibtrs: client: main functionality
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (3 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 04/24] ibtrs: client: private header with client structs and functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 16:54   ` Bart Van Assche
  2018-02-05 11:19   ` Sagi Grimberg
  2018-02-02 14:08 ` [PATCH 06/24] ibtrs: client: statistics functions Roman Pen
                   ` (21 subsequent siblings)
  26 siblings, 2 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the main functionality of the ibtrs-client module: it manages
the set of RDMA connections for each IBTRS session and performs
multipathing, load balancing and failover of RDMA requests.
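
To illustrate the load-balancing part, here is a minimal user-space
sketch (hypothetical names, not kernel code) of the "min-inflight"
policy: among all connected paths, pick the one with the fewest
requests currently in flight.

	#include <stdio.h>
	#include <limits.h>

	struct path {
		const char *name;
		int connected;
		int inflight;
	};

	static struct path *pick_min_inflight(struct path *p, int n)
	{
		struct path *best = NULL;
		int min = INT_MAX, i;

		for (i = 0; i < n; i++) {
			if (!p[i].connected)
				continue;	/* skip failed paths */
			if (p[i].inflight < min) {
				min = p[i].inflight;
				best = &p[i];
			}
		}
		return best;			/* NULL if nothing is usable */
	}

	int main(void)
	{
		struct path paths[] = {
			{ "path0", 1, 12 },
			{ "path1", 0,  0 },
			{ "path2", 1,  3 },
		};
		struct path *p = pick_min_inflight(paths, 3);

		printf("selected: %s\n", p ? p->name : "none");
		return 0;
	}

The in-kernel variant additionally keeps a per-iteration skip list so
that, within one path-selection round, an already chosen path is not
returned again.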

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c | 3496 ++++++++++++++++++++++++++++++
 1 file changed, 3496 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
new file mode 100644
index 000000000000..aa0a17f2a78c
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
@@ -0,0 +1,3496 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <rdma/ib_fmr_pool.h>
+
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+#define RECONNECT_SEED 8
+#define MAX_SEGMENTS 31
+
+#define IBTRS_CONNECT_TIMEOUT_MS 5000
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Client");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static bool use_fr;
+module_param(use_fr, bool, 0444);
+MODULE_PARM_DESC(use_fr, "use FRWR mode for memory registration if possible."
+		 " (default: 0)");
+
+static ushort nr_cons_per_session;
+module_param(nr_cons_per_session, ushort, 0444);
+MODULE_PARM_DESC(nr_cons_per_session, "Number of connections per session."
+		 " (default: nr_cpu_ids)");
+
+static int retry_count = 7;
+
+static int retry_count_set(const char *val, const struct kernel_param *kp)
+{
+	int err, ival;
+
+	err = kstrtoint(val, 0, &ival);
+	if (err)
+		return err;
+
+	if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT)
+		return -EINVAL;
+
+	retry_count = ival;
+
+	return 0;
+}
+
+static const struct kernel_param_ops retry_count_ops = {
+	.set		= retry_count_set,
+	.get		= param_get_int,
+};
+module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
+
+MODULE_PARM_DESC(retry_count, "Number of times to send the message if the"
+		 " remote side didn't respond with Ack or Nack (default: 7,"
+		 " min: " __stringify(MIN_RTR_CNT) ", max: "
+		 __stringify(MAX_RTR_CNT) ")");
+
+static int fmr_sg_cnt = 4;
+module_param_named(fmr_sg_cnt, fmr_sg_cnt, int, 0644);
+MODULE_PARM_DESC(fmr_sg_cnt, "Use FMR when sg_cnt is bigger than fmr_sg_cnt"
+		 " (default: 4)");
+
+static struct workqueue_struct *ibtrs_wq;
+
+static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
+static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+
+static inline void ibtrs_clt_state_lock(void)
+{
+	rcu_read_lock();
+}
+
+static inline void ibtrs_clt_state_unlock(void)
+{
+	rcu_read_unlock();
+}
+
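+/*
+ * Atomically lower @var to min(@var, new) using a cmpxchg loop; a zero
+ * @var is treated as unset and simply takes the new value.
+ */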
+#define cmpxchg_min(var, new) ({					\
+	typeof(var) old;						\
+									\
+	do {								\
+		old = var;						\
+		new = (!old ? new : min_t(typeof(var), old, new));	\
+	} while (cmpxchg(&var, old, new) != old);			\
+})
+
+static void ibtrs_clt_set_min_queue_depth(struct ibtrs_clt *clt, size_t new)
+{
+	/* Can be updated from different sessions (paths), so cmpxchg */
+
+	cmpxchg_min(clt->queue_depth, new);
+}
+
+static void ibtrs_clt_set_min_io_size(struct ibtrs_clt *clt, size_t new)
+{
+	/* Can be updated from different sessions (paths), so cmpxchg */
+
+	cmpxchg_min(clt->max_io_size, new);
+}
+
+bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
+{
+	return sess->state == IBTRS_CLT_CONNECTED;
+}
+
+static inline bool ibtrs_clt_is_connected(const struct ibtrs_clt *clt)
+{
+	struct ibtrs_clt_sess *sess;
+	bool connected = false;
+
+	ibtrs_clt_state_lock();
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry)
+		connected |= ibtrs_clt_sess_is_connected(sess);
+	ibtrs_clt_state_unlock();
+
+	return connected;
+}
+
+/**
+ * struct ibtrs_fr_desc - fast registration work request arguments
+ * @entry: Entry in ibtrs_fr_pool.free_list.
+ * @mr:    Memory region.
+ */
+struct ibtrs_fr_desc {
+	struct list_head		entry;
+	struct ib_mr			*mr;
+};
+
+/**
+ * struct ibtrs_fr_pool - pool of fast registration descriptors
+ *
+ * An entry is available for allocation if and only if it occurs in @free_list.
+ *
+ * @size:      Number of descriptors in this pool.
+ * @max_page_list_len: Maximum fast registration work request page list length.
+ * @lock:      Protects free_list.
+ * @free_list: List of free descriptors.
+ * @desc:      Fast registration descriptor pool.
+ */
+struct ibtrs_fr_pool {
+	int			size;
+	int			max_page_list_len;
+	spinlock_t		lock; /* protects free_list */
+	struct list_head	free_list;
+	struct ibtrs_fr_desc	desc[0];
+};
+
+/**
+ * struct ibtrs_map_state - per-request DMA memory mapping state
+ * @desc:	    Pointer to the element of the buffer descriptor array
+ *		    that is being filled in.
+ * @pages:	    Array with DMA addresses of pages being considered for
+ *		    memory registration.
+ * @base_dma_addr:  DMA address of the first page that has not yet been mapped.
+ * @dma_len:	    Number of bytes that will be registered with the next
+ *		    FMR or FR memory registration call.
+ * @total_len:	    Total number of bytes in the sg-list being mapped.
+ * @npages:	    Number of page addresses in the pages[] array.
+ * @nmdesc:	    Number of FMR or FR memory descriptors used for mapping.
+ * @ndesc:	    Number of buffer descriptors that have been filled in.
+ */
+struct ibtrs_map_state {
+	union {
+		struct ib_pool_fmr	**next_fmr;
+		struct ibtrs_fr_desc	**next_fr;
+	};
+	struct ibtrs_sg_desc	*desc;
+	union {
+		u64			*pages;
+		struct scatterlist      *sg;
+	};
+	dma_addr_t		base_dma_addr;
+	u32			dma_len;
+	u32			total_len;
+	u32			npages;
+	u32			nmdesc;
+	u32			ndesc;
+	enum dma_data_direction dir;
+};
+
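+/*
+ * Claim a free tag by atomically setting the first clear bit in the
+ * tags bitmap; returns NULL if the whole queue depth is in use.
+ */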
+static inline struct ibtrs_tag *
+__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
+{
+	size_t max_depth = clt->queue_depth;
+	struct ibtrs_tag *tag;
+	int cpu, bit;
+
+	cpu = get_cpu();
+	do {
+		bit = find_first_zero_bit(clt->tags_map, max_depth);
+		if (unlikely(bit >= max_depth)) {
+			put_cpu();
+			return NULL;
+		}
+
+	} while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
+	put_cpu();
+
+	tag = GET_TAG(clt, bit);
+	WARN_ON(tag->mem_id != bit);
+	tag->cpu_id = cpu;
+	tag->con_type = con_type;
+
+	return tag;
+}
+
+static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
+				   struct ibtrs_tag *tag)
+{
+	clear_bit_unlock(tag->mem_id, clt->tags_map);
+}
+
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
+				    enum ibtrs_clt_con_type con_type,
+				    int can_wait)
+{
+	struct ibtrs_tag *tag;
+	DEFINE_WAIT(wait);
+
+	tag = __ibtrs_get_tag(clt, con_type);
+	if (likely(tag) || !can_wait)
+		return tag;
+
+	do {
+		prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
+		tag = __ibtrs_get_tag(clt, con_type);
+		if (likely(tag))
+			break;
+
+		io_schedule();
+	} while (1);
+
+	finish_wait(&clt->tags_wait, &wait);
+
+	return tag;
+}
+EXPORT_SYMBOL(ibtrs_clt_get_tag);
+
+void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
+{
+	if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
+		return;
+
+	__ibtrs_put_tag(clt, tag);
+
+	/*
+	 * Putting a tag is a barrier, so we will observe
+	 * new entry in the wait list, no worries.
+	 */
+	if (waitqueue_active(&clt->tags_wait))
+		wake_up(&clt->tags_wait);
+}
+EXPORT_SYMBOL(ibtrs_clt_put_tag);
+
+/**
+ * ibtrs_tag_to_clt_con() - return the RDMA connection for a given tag
+ *
+ * Note:
+ *     IO connections start from 1.
+ *     Connection 0 is reserved for user messages.
+ */
+static struct ibtrs_clt_con *ibtrs_tag_to_clt_con(struct ibtrs_clt_sess *sess,
+						  struct ibtrs_tag *tag)
+{
+	int id = 0;
+
+	if (likely(tag->con_type == IBTRS_IO_CON))
+		id = (tag->cpu_id % (sess->s.con_num - 1)) + 1;
+
+	return to_clt_con(sess->s.con[id]);
+}
+
+/**
+ * ibtrs_destroy_fr_pool() - free the resources owned by a pool
+ * @pool: Fast registration pool to be destroyed.
+ */
+static void ibtrs_destroy_fr_pool(struct ibtrs_fr_pool *pool)
+{
+	struct ibtrs_fr_desc *d;
+	int i, err;
+
+	if (!pool)
+		return;
+
+	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
+		if (d->mr) {
+			err = ib_dereg_mr(d->mr);
+			if (err)
+				pr_err("Failed to deregister memory region,"
+				       " err: %d\n", err);
+		}
+	}
+	kfree(pool);
+}
+
+/**
+ * ibtrs_create_fr_pool() - allocate and initialize a pool for fast registration
+ * @device:            IB device to allocate fast registration descriptors for.
+ * @pd:                Protection domain associated with the FR descriptors.
+ * @pool_size:         Number of descriptors to allocate.
+ * @max_page_list_len: Maximum fast registration work request page list length.
+ */
+static struct ibtrs_fr_pool *ibtrs_create_fr_pool(struct ib_device *device,
+						  struct ib_pd *pd,
+						  int pool_size,
+						  int max_page_list_len)
+{
+	struct ibtrs_fr_pool *pool;
+	struct ibtrs_fr_desc *d;
+	struct ib_mr *mr;
+	int i, ret;
+
+	if (pool_size <= 0) {
+		pr_warn("Creating fr pool failed, invalid pool size %d\n",
+			pool_size);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	pool = kzalloc(sizeof(*pool) + pool_size * sizeof(*d), GFP_KERNEL);
+	if (!pool) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	pool->size = pool_size;
+	pool->max_page_list_len = max_page_list_len;
+	spin_lock_init(&pool->lock);
+	INIT_LIST_HEAD(&pool->free_list);
+
+	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
+		mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, max_page_list_len);
+		if (IS_ERR(mr)) {
+			pr_warn("Failed to allocate fast region memory\n");
+			ret = PTR_ERR(mr);
+			goto destroy_pool;
+		}
+		d->mr = mr;
+		list_add_tail(&d->entry, &pool->free_list);
+	}
+
+	return pool;
+
+destroy_pool:
+	ibtrs_destroy_fr_pool(pool);
+err:
+	return ERR_PTR(ret);
+}
+
+/**
+ * ibtrs_fr_pool_get() - obtain a descriptor suitable for fast registration
+ * @pool: Pool to obtain descriptor from.
+ */
+static struct ibtrs_fr_desc *ibtrs_fr_pool_get(struct ibtrs_fr_pool *pool)
+{
+	struct ibtrs_fr_desc *d = NULL;
+
+	spin_lock_bh(&pool->lock);
+	if (!list_empty(&pool->free_list)) {
+		d = list_first_entry(&pool->free_list, typeof(*d), entry);
+		list_del(&d->entry);
+	}
+	spin_unlock_bh(&pool->lock);
+
+	return d;
+}
+
+/**
+ * ibtrs_fr_pool_put() - put an FR descriptor back in the free list
+ * @pool: Pool the descriptor was allocated from.
+ * @desc: Pointer to an array of fast registration descriptor pointers.
+ * @n:    Number of descriptors to put back.
+ *
+ * Note: The caller must already have queued an invalidation request for
+ * desc->mr->rkey before calling this function.
+ */
+static void ibtrs_fr_pool_put(struct ibtrs_fr_pool *pool,
+			      struct ibtrs_fr_desc **desc, int n)
+{
+	int i;
+
+	spin_lock_bh(&pool->lock);
+	for (i = 0; i < n; i++)
+		list_add(&desc[i]->entry, &pool->free_list);
+	spin_unlock_bh(&pool->lock);
+}
+
+static void ibtrs_map_desc(struct ibtrs_map_state *state, dma_addr_t dma_addr,
+			   u32 dma_len, u32 rkey, u32 max_desc)
+{
+	struct ibtrs_sg_desc *desc = state->desc;
+
+	pr_debug("dma_addr %llu, key %u, dma_len %u\n",
+		 dma_addr, rkey, dma_len);
+	desc->addr = cpu_to_le64(dma_addr);
+	desc->key  = cpu_to_le32(rkey);
+	desc->len  = cpu_to_le32(dma_len);
+
+	state->total_len += dma_len;
+	if (state->ndesc < max_desc) {
+		state->desc++;
+		state->ndesc++;
+	} else {
+		state->ndesc = INT_MIN;
+		pr_err("Could not fit S/G list into buffer descriptor %d.\n",
+		       max_desc);
+	}
+}
+
+static int ibtrs_map_finish_fmr(struct ibtrs_map_state *state,
+				struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ib_pool_fmr *fmr;
+	dma_addr_t dma_addr;
+	u64 io_addr = 0;
+
+	fmr = ib_fmr_pool_map_phys(sess->fmr_pool, state->pages,
+				   state->npages, io_addr);
+	if (IS_ERR(fmr)) {
+		ibtrs_wrn_rl(sess, "Failed to map FMR from FMR pool, "
+			     "err: %ld\n", PTR_ERR(fmr));
+		return PTR_ERR(fmr);
+	}
+
+	*state->next_fmr++ = fmr;
+	state->nmdesc++;
+	dma_addr = state->base_dma_addr & ~sess->mr_page_mask;
+	pr_debug("ndesc = %d, nmdesc = %d, npages = %d\n",
+		 state->ndesc, state->nmdesc, state->npages);
+	if (state->dir == DMA_TO_DEVICE)
+		ibtrs_map_desc(state, dma_addr, state->dma_len, fmr->fmr->lkey,
+			       sess->max_desc);
+	else
+		ibtrs_map_desc(state, dma_addr, state->dma_len, fmr->fmr->rkey,
+			       sess->max_desc);
+
+	return 0;
+}
+
+static void ibtrs_clt_fast_reg_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Failed IB_WR_REG_MR: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_rdma_error_recovery(con);
+	}
+}
+
+static struct ib_cqe fast_reg_cqe = {
+	.done = ibtrs_clt_fast_reg_done
+};
+
+/* TODO */
+static int ibtrs_map_finish_fr(struct ibtrs_map_state *state,
+			       struct ibtrs_clt_con *con, int sg_cnt,
+			       unsigned int *sg_offset_p)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_fr_desc *desc;
+	struct ib_send_wr *bad_wr;
+	struct ib_reg_wr wr;
+	struct ib_pd *pd;
+	u32 rkey;
+	int n;
+
+	pd = sess->s.ib_dev->pd;
+	if (sg_cnt == 1 && (pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY)) {
+		unsigned int sg_offset = sg_offset_p ? *sg_offset_p : 0;
+
+		ibtrs_map_desc(state, sg_dma_address(state->sg) + sg_offset,
+			       sg_dma_len(state->sg) - sg_offset,
+			       pd->unsafe_global_rkey, sess->max_desc);
+		if (sg_offset_p)
+			*sg_offset_p = 0;
+		return 1;
+	}
+
+	desc = ibtrs_fr_pool_get(con->fr_pool);
+	if (!desc) {
+		ibtrs_wrn_rl(sess, "Failed to get descriptor from FR pool\n");
+		return -ENOMEM;
+	}
+
+	rkey = ib_inc_rkey(desc->mr->rkey);
+	ib_update_fast_reg_key(desc->mr, rkey);
+
+	memset(&wr, 0, sizeof(wr));
+	n = ib_map_mr_sg(desc->mr, state->sg, sg_cnt, sg_offset_p,
+			 sess->mr_page_size);
+	if (unlikely(n < 0)) {
+		ibtrs_fr_pool_put(con->fr_pool, &desc, 1);
+		return n;
+	}
+
+	wr.wr.next = NULL;
+	wr.wr.opcode = IB_WR_REG_MR;
+	wr.wr.wr_cqe = &fast_reg_cqe;
+	wr.wr.num_sge = 0;
+	wr.wr.send_flags = 0;
+	wr.mr = desc->mr;
+	wr.key = desc->mr->rkey;
+	wr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE);
+
+	*state->next_fr++ = desc;
+	state->nmdesc++;
+
+	ibtrs_map_desc(state, state->base_dma_addr, state->dma_len,
+		       desc->mr->rkey, sess->max_desc);
+
+	return ib_post_send(con->c.qp, &wr.wr, &bad_wr);
+}
+
+static int ibtrs_finish_fmr_mapping(struct ibtrs_map_state *state,
+				    struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ib_pd *pd = sess->s.ib_dev->pd;
+	int ret = 0;
+
+	if (state->npages == 0)
+		return 0;
+
+	if (state->npages == 1 && (pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY))
+		ibtrs_map_desc(state, state->base_dma_addr, state->dma_len,
+			       pd->unsafe_global_rkey,
+			       sess->max_desc);
+	else
+		ret = ibtrs_map_finish_fmr(state, con);
+
+	if (ret == 0) {
+		state->npages = 0;
+		state->dma_len = 0;
+	}
+
+	return ret;
+}
+
+static int ibtrs_map_sg_entry(struct ibtrs_map_state *state,
+			      struct ibtrs_clt_con *con, struct scatterlist *sg,
+			      int sg_count)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	unsigned int dma_len, len;
+	struct ib_device *ibdev;
+	dma_addr_t dma_addr;
+	int ret;
+
+	ibdev = sess->s.ib_dev->dev;
+	dma_addr = ib_sg_dma_address(ibdev, sg);
+	dma_len = ib_sg_dma_len(ibdev, sg);
+	if (!dma_len)
+		return 0;
+
+	while (dma_len) {
+		unsigned int offset = dma_addr & ~sess->mr_page_mask;
+
+		if (state->npages == sess->max_pages_per_mr ||
+		    offset != 0) {
+			ret = ibtrs_finish_fmr_mapping(state, con);
+			if (ret)
+				return ret;
+		}
+
+		len = min_t(unsigned int, dma_len,
+			    sess->mr_page_size - offset);
+
+		if (!state->npages)
+			state->base_dma_addr = dma_addr;
+		state->pages[state->npages++] =
+			dma_addr & sess->mr_page_mask;
+		state->dma_len += len;
+		dma_addr += len;
+		dma_len -= len;
+	}
+
+	/*
+	 * If the last entry of the MR wasn't a full page, then we need to
+	 * close it out and start a new one -- we can only merge at page
+	 * boundaries.
+	 */
+	ret = 0;
+	if (len != sess->mr_page_size)
+		ret = ibtrs_finish_fmr_mapping(state, con);
+	return ret;
+}
+
+static int ibtrs_map_fr(struct ibtrs_map_state *state,
+			struct ibtrs_clt_con *con,
+			struct scatterlist *sg, int sg_count)
+{
+	unsigned int sg_offset = 0;
+
+	state->sg = sg;
+
+	while (sg_count) {
+		int i, n;
+
+		n = ibtrs_map_finish_fr(state, con, sg_count, &sg_offset);
+		if (unlikely(n < 0))
+			return n;
+
+		sg_count -= n;
+		for (i = 0; i < n; i++)
+			state->sg = sg_next(state->sg);
+	}
+
+	return 0;
+}
+
+static int ibtrs_map_fmr(struct ibtrs_map_state *state,
+			 struct ibtrs_clt_con *con,
+			 struct scatterlist *sg_first_entry,
+			 int sg_first_entry_index, int sg_count)
+{
+	int i, ret;
+	struct scatterlist *sg;
+
+	for (i = sg_first_entry_index, sg = sg_first_entry; i < sg_count;
+	     i++, sg = sg_next(sg)) {
+		ret = ibtrs_map_sg_entry(state, con, sg, sg_count);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int ibtrs_map_sg(struct ibtrs_map_state *state,
+			struct ibtrs_clt_con *con,
+			struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int ret = 0;
+
+	state->pages = req->map_page;
+	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
+		state->next_fr = req->fr_list;
+		ret = ibtrs_map_fr(state, con, req->sglist, req->sg_cnt);
+		if (ret)
+			goto out;
+	} else if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR) {
+		state->next_fmr = req->fmr_list;
+		ret = ibtrs_map_fmr(state, con, req->sglist, 0,
+				    req->sg_cnt);
+		if (ret)
+			goto out;
+		ret = ibtrs_finish_fmr_mapping(state, con);
+		if (ret)
+			goto out;
+	}
+
+out:
+	req->nmdesc = state->nmdesc;
+	return ret;
+}
+
+static void ibtrs_clt_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Failed IB_WR_LOCAL_INV: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_rdma_error_recovery(con);
+	}
+}
+
+static struct ib_cqe local_inv_cqe = {
+	.done = ibtrs_clt_inv_rkey_done
+};
+
+static int ibtrs_inv_rkey(struct ibtrs_clt_con *con, u32 rkey)
+{
+	struct ib_send_wr *bad_wr;
+	struct ib_send_wr wr = {
+		.opcode		    = IB_WR_LOCAL_INV,
+		.wr_cqe		    = &local_inv_cqe,
+		.next		    = NULL,
+		.num_sge	    = 0,
+		.send_flags	    = 0,
+		.ex.invalidate_rkey = rkey,
+	};
+
+	return ib_post_send(con->c.qp, &wr, &bad_wr);
+}
+
+static void ibtrs_unmap_fast_reg_data(struct ibtrs_clt_con *con,
+				      struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int i, ret;
+
+	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
+		struct ibtrs_fr_desc **pfr;
+
+		for (i = req->nmdesc, pfr = req->fr_list; i > 0; i--, pfr++) {
+			ret = ibtrs_inv_rkey(con, (*pfr)->mr->rkey);
+			if (ret < 0) {
+				ibtrs_err(sess,
+					  "Invalidating registered RDMA memory for"
+					  " rkey %#x failed, err: %d\n",
+					  (*pfr)->mr->rkey, ret);
+			}
+		}
+		if (req->nmdesc)
+			ibtrs_fr_pool_put(con->fr_pool, req->fr_list,
+					  req->nmdesc);
+	} else {
+		struct ib_pool_fmr **pfmr;
+
+		for (i = req->nmdesc, pfmr = req->fmr_list; i > 0; i--, pfmr++)
+			ib_fmr_pool_unmap(*pfmr);
+	}
+	req->nmdesc = 0;
+}
+
+/*
+ * We have more scatter/gather entries than fmr_sg_cnt, so use fast
+ * registration and try to merge as many entries as we can.
+ */
+static int ibtrs_fast_reg_map_data(struct ibtrs_clt_con *con,
+				   struct ibtrs_sg_desc *desc,
+				   struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_map_state state;
+	int ret;
+
+	memset(&state, 0, sizeof(state));
+	state.desc	= desc;
+	state.dir	= req->dir;
+	ret = ibtrs_map_sg(&state, con, req);
+
+	if (unlikely(ret))
+		goto unmap;
+
+	if (unlikely(state.ndesc <= 0)) {
+		ibtrs_err(sess,
+			  "Could not fit S/G list into buffer descriptor %d\n",
+			  state.ndesc);
+		ret = -EIO;
+		goto unmap;
+	}
+
+	return state.ndesc;
+unmap:
+	ibtrs_unmap_fast_reg_data(con, req);
+	return ret;
+}
+
+static int ibtrs_post_send_rdma(struct ibtrs_clt_con *con,
+				struct ibtrs_clt_io_req *req,
+				u64 addr, u32 off, u32 imm)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	enum ib_send_flags flags;
+	struct ib_sge list[1];
+
+	if (unlikely(!req->sg_size)) {
+		ibtrs_wrn(sess, "Doing RDMA Write failed, no data supplied\n");
+		return -EINVAL;
+	}
+
+	/* user data and user message in the first list element */
+	list[0].addr   = req->iu->dma_addr;
+	list[0].length = req->sg_size;
+	list[0].lkey   = sess->s.ib_dev->lkey;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or the send queue will fill up and only a QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+	return ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list, 1,
+					    sess->srv_rdma_buf_rkey,
+					    addr + off, imm, flags);
+}
+
+static void ibtrs_set_sge_with_desc(struct ib_sge *list,
+				    struct ibtrs_sg_desc *desc)
+{
+	list->addr   = le64_to_cpu(desc->addr);
+	list->length = le32_to_cpu(desc->len);
+	list->lkey   = le32_to_cpu(desc->key);
+	pr_debug("dma_addr %llu, key %u, dma_len %u\n",
+		 list->addr, list->lkey, list->length);
+}
+
+static void ibtrs_set_rdma_desc_last(struct ibtrs_clt_con *con,
+				     struct ib_sge *list,
+				     struct ibtrs_clt_io_req *req,
+				     struct ib_rdma_wr *wr, int offset,
+				     struct ibtrs_sg_desc *desc, int m,
+				     int n, u64 addr, u32 size, u32 imm)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	enum ib_send_flags flags;
+	int i;
+
+	for (i = m; i < n; i++, desc++)
+		ibtrs_set_sge_with_desc(&list[i], desc);
+
+	list[i].addr   = req->iu->dma_addr;
+	list[i].length = size;
+	list[i].lkey   = sess->s.ib_dev->lkey;
+
+	wr->wr.wr_cqe = &req->iu->cqe;
+	wr->wr.sg_list = &list[m];
+	wr->wr.num_sge = n - m + 1;
+	wr->remote_addr	= addr + offset;
+	wr->rkey = sess->srv_rdma_buf_rkey;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or the send queue will fill up and only a QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	wr->wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
+	wr->wr.send_flags  = flags;
+	wr->wr.ex.imm_data = cpu_to_be32(imm);
+}
+
+static int ibtrs_post_send_rdma_desc_more(struct ibtrs_clt_con *con,
+					  struct ib_sge *list,
+					  struct ibtrs_clt_io_req *req,
+					  struct ibtrs_sg_desc *desc, int n,
+					  u64 addr, u32 size, u32 imm)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	size_t max_sge, num_sge, num_wr;
+	struct ib_send_wr *bad_wr;
+	struct ib_rdma_wr *wrs, *wr;
+	int j = 0, k, offset = 0, len = 0;
+	int m = 0;
+	int ret;
+
+	max_sge = sess->max_sge;
+	num_sge = 1 + n;
+	num_wr = DIV_ROUND_UP(num_sge, max_sge);
+
+	wrs = kcalloc(num_wr, sizeof(*wrs), GFP_ATOMIC);
+	if (!wrs)
+		return -ENOMEM;
+
+	if (num_wr == 1)
+		goto last_one;
+
+	for (; j < num_wr; j++) {
+		wr = &wrs[j];
+		for (k = 0; k < max_sge; k++, desc++) {
+			m = k + j * max_sge;
+			ibtrs_set_sge_with_desc(&list[m], desc);
+			len += le32_to_cpu(desc->len);
+		}
+		wr->wr.wr_cqe = &req->iu->cqe;
+		wr->wr.sg_list = &list[m];
+		wr->wr.num_sge = max_sge;
+		wr->remote_addr	= addr + offset;
+		wr->rkey = sess->srv_rdma_buf_rkey;
+
+		offset += len;
+		wr->wr.next = &wrs[j + 1].wr;
+		wr->wr.opcode = IB_WR_RDMA_WRITE;
+	}
+
+last_one:
+	wr = &wrs[j];
+
+	ibtrs_set_rdma_desc_last(con, list, req, wr, offset,
+				 desc, m, n, addr, size, imm);
+
+	ret = ib_post_send(con->c.qp, &wrs[0].wr, &bad_wr);
+	if (unlikely(ret))
+		ibtrs_err(sess, "Posting write request to QP failed,"
+			  " err: %d\n", ret);
+	kfree(wrs);
+	return ret;
+}
+
+static int ibtrs_post_send_rdma_desc(struct ibtrs_clt_con *con,
+				     struct ibtrs_clt_io_req *req,
+				     struct ibtrs_sg_desc *desc, int n,
+				     u64 addr, u32 size, u32 imm)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	enum ib_send_flags flags;
+	struct ib_sge *list;
+	size_t num_sge;
+	int ret, i;
+
+	num_sge = 1 + n;
+	list = kmalloc_array(num_sge, sizeof(*list), GFP_ATOMIC);
+	if (!list)
+		return -ENOMEM;
+
+	if (num_sge < sess->max_sge) {
+		for (i = 0; i < n; i++, desc++)
+			ibtrs_set_sge_with_desc(&list[i], desc);
+		list[i].addr   = req->iu->dma_addr;
+		list[i].length = size;
+		list[i].lkey   = sess->s.ib_dev->lkey;
+
+		/*
+		 * From time to time we have to post signalled sends,
+		 * or the send queue will fill up and only a QP reset can help.
+		 */
+		flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+				0 : IB_SEND_SIGNALED;
+		ret = ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list,
+						   num_sge,
+						   sess->srv_rdma_buf_rkey,
+						   addr, imm, flags);
+	} else {
+		ret = ibtrs_post_send_rdma_desc_more(con, list, req, desc, n,
+						     addr, size, imm);
+	}
+
+	kfree(list);
+	return ret;
+}
+
+static int ibtrs_post_send_rdma_more(struct ibtrs_clt_con *con,
+				     struct ibtrs_clt_io_req *req,
+				     u64 addr, u32 size, u32 imm)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ib_device *ibdev = sess->s.ib_dev->dev;
+	enum ib_send_flags flags;
+	struct scatterlist *sg;
+	struct ib_sge *list;
+	size_t num_sge;
+	int i, ret;
+
+	num_sge = 1 + req->sg_cnt;
+	list = kmalloc_array(num_sge, sizeof(*list), GFP_ATOMIC);
+	if (!list)
+		return -ENOMEM;
+
+	for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+		list[i].addr   = ib_sg_dma_address(ibdev, sg);
+		list[i].length = ib_sg_dma_len(ibdev, sg);
+		list[i].lkey   = sess->s.ib_dev->lkey;
+	}
+	list[i].addr   = req->iu->dma_addr;
+	list[i].length = size;
+	list[i].lkey   = sess->s.ib_dev->lkey;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or the send queue will fill up and only a QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+	ret = ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list, num_sge,
+					   sess->srv_rdma_buf_rkey,
+					   addr, imm, flags);
+	kfree(list);
+
+	return ret;
+}
+
+static inline unsigned long ibtrs_clt_get_raw_ms(void)
+{
+	struct timespec ts;
+
+	getrawmonotonic(&ts);
+
+	return timespec_to_ns(&ts) / NSEC_PER_MSEC;
+}
+
+static void complete_rdma_req(struct ibtrs_clt_io_req *req,
+			      int errno, bool notify)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess;
+	enum dma_data_direction dir;
+	struct ibtrs_clt *clt;
+	void *priv;
+
+	if (WARN_ON(!req->in_use))
+		return;
+	if (WARN_ON(!req->con))
+		return;
+	sess = to_clt_sess(con->c.sess);
+	clt = sess->clt;
+
+	if (req->sg_cnt > fmr_sg_cnt)
+		ibtrs_unmap_fast_reg_data(req->con, req);
+	if (req->sg_cnt)
+		ib_dma_unmap_sg(sess->s.ib_dev->dev, req->sglist,
+				req->sg_cnt, req->dir);
+	if (sess->stats.enable_rdma_lat)
+		ibtrs_clt_update_rdma_lat(&sess->stats,
+					  req->dir == DMA_FROM_DEVICE,
+					  ibtrs_clt_get_raw_ms() -
+					  req->start_time);
+	ibtrs_clt_decrease_inflight(&sess->stats);
+
+	req->in_use = false;
+	req->con = NULL;
+	priv = req->priv;
+	dir = req->dir;
+
+	if (notify)
+		req->conf(priv, errno);
+}
+
+static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id, s16 errno)
+{
+	if (WARN_ON(msg_id >= sess->queue_depth))
+		return;
+
+	complete_rdma_req(&sess->reqs[msg_id], errno, true);
+}
+
+static struct ib_cqe io_comp_cqe = {
+	.done = ibtrs_clt_rdma_done
+};
+
+static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u32 imm_type, imm_payload;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			ibtrs_err(sess, "RDMA failed: %s\n",
+				  ib_wc_status_msg(wc->status));
+			ibtrs_rdma_error_recovery(con);
+		}
+		return;
+	}
+	ibtrs_clt_update_wc_stats(con);
+
+	switch (wc->opcode) {
+	case IB_WC_RDMA_WRITE:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
+			return;
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "ibtrs_post_recv_empty(): %d\n", err);
+			ibtrs_rdma_error_recovery(con);
+			break;
+		}
+		ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == IBTRS_IO_RSP_IMM)) {
+			u32 msg_id;
+
+			ibtrs_from_io_rsp_imm(imm_payload, &msg_id, &err);
+			process_io_rsp(sess, msg_id, err);
+		} else if (imm_type == IBTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			ibtrs_send_hb_ack(&sess->s);
+		} else if (imm_type == IBTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+		} else {
+			ibtrs_wrn(sess, "Unknown IMM type %u\n", imm_type);
+		}
+		break;
+	default:
+		ibtrs_wrn(sess, "Unexpected WC type: %s\n",
+			  ib_wc_opcode_str(wc->opcode));
+		return;
+	}
+}
+
+static int post_recv_io(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int err, i;
+
+	for (i = 0; i < sess->queue_depth; i++) {
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct ibtrs_clt_sess *sess)
+{
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		err = post_recv_io(to_clt_con(sess->s.con[cid]));
+		if (unlikely(err)) {
+			ibtrs_err(sess, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+struct path_it {
+	int i;
+	struct list_head skip_list;
+	struct ibtrs_clt *clt;
+	struct ibtrs_clt_sess *(*next_path)(struct path_it *);
+};
+
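+/*
+ * Iterate over the paths of @clt under the RCU state lock, asking the
+ * configured multipath policy (it->next_path) for the next candidate
+ * on every step.
+ */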
+#define do_each_path(path, clt, it) {					\
+	path_it_init(it, clt);						\
+	ibtrs_clt_state_lock();						\
+	for ((it)->i = 0; ((path) = ((it)->next_path)(it)) &&		\
+			  (it)->i < (it)->clt->paths_num;		\
+	     (it)->i++)
+
+#define while_each_path(it)						\
+	path_it_deinit(it);						\
+	ibtrs_clt_state_unlock();					\
+	}
+
+/**
+ * get_next_path_rr() - Returns path in round-robin fashion.
+ *
+ * Related to @MP_POLICY_RR
+ *
+ * Locks:
+ *    ibtrs_clt_state_lock() must be held.
+ */
+static struct ibtrs_clt_sess *get_next_path_rr(struct path_it *it)
+{
+	struct ibtrs_clt_sess __percpu * __rcu *ppcpu_path, *path;
+	struct ibtrs_clt *clt = it->clt;
+
+	ppcpu_path = this_cpu_ptr(clt->pcpu_path);
+	path = rcu_dereference(*ppcpu_path);
+	if (unlikely(!path))
+		path = list_first_or_null_rcu(&clt->paths_list,
+					      typeof(*path), s.entry);
+	else
+		path = list_next_or_null_rcu_rr(path, &clt->paths_list,
+						s.entry);
+	rcu_assign_pointer(*ppcpu_path, path);
+
+	return path;
+}
+
+/**
+ * get_next_path_min_inflight() - Returns path with minimal inflight count.
+ *
+ * Related to @MP_POLICY_MIN_INFLIGHT
+ *
+ * Locks:
+ *    ibtrs_clt_state_lock() must be held.
+ */
+static struct ibtrs_clt_sess *get_next_path_min_inflight(struct path_it *it)
+{
+	struct ibtrs_clt_sess *min_path = NULL;
+	struct ibtrs_clt *clt = it->clt;
+	struct ibtrs_clt_sess *sess;
+	int min_inflight = INT_MAX;
+	int inflight;
+
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry) {
+		if (unlikely(!list_empty(raw_cpu_ptr(sess->mp_skip_entry))))
+			continue;
+
+		inflight = atomic_read(&sess->stats.inflight);
+
+		if (inflight < min_inflight) {
+			min_inflight = inflight;
+			min_path = sess;
+		}
+	}
+
+	/*
+	 * add the path to the skip list, so that next time we can get
+	 * a different one
+	 */
+	if (min_path)
+		list_add(raw_cpu_ptr(min_path->mp_skip_entry), &it->skip_list);
+
+	return min_path;
+}
+
+static inline void path_it_init(struct path_it *it, struct ibtrs_clt *clt)
+{
+	INIT_LIST_HEAD(&it->skip_list);
+	it->clt = clt;
+	it->i = 0;
+
+	if (clt->mp_policy == MP_POLICY_RR)
+		it->next_path = get_next_path_rr;
+	else
+		it->next_path = get_next_path_min_inflight;
+}
+
+static inline void path_it_deinit(struct path_it *it)
+{
+	struct list_head *skip, *tmp;
+	/*
+	 * The skip_list is used only for the MIN_INFLIGHT policy.
+	 * We need to remove paths from it, so that next IO can insert
+	 * paths (->mp_skip_entry) into a skip_list again.
+	 */
+	list_for_each_safe(skip, tmp, &it->skip_list)
+		list_del_init(skip);
+}
+
+static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
+				      struct ibtrs_clt_sess *sess,
+				      ibtrs_conf_fn *conf,
+				      struct ibtrs_tag *tag, void *priv,
+				      const struct kvec *vec, size_t usr_len,
+				      struct scatterlist *sg, size_t sg_cnt,
+				      size_t data_len, int dir)
+{
+	req->tag = tag;
+	req->in_use = true;
+	req->usr_len = usr_len;
+	req->data_len = data_len;
+	req->sglist = sg;
+	req->sg_cnt = sg_cnt;
+	req->priv = priv;
+	req->dir = dir;
+	req->con = ibtrs_tag_to_clt_con(sess, tag);
+	req->conf = conf;
+	copy_from_kvec(req->iu->buf, vec, usr_len);
+	if (sess->stats.enable_rdma_lat)
+		req->start_time = ibtrs_clt_get_raw_ms();
+}
+
+static inline struct ibtrs_clt_io_req *
+ibtrs_clt_get_req(struct ibtrs_clt_sess *sess, ibtrs_conf_fn *conf,
+		  struct ibtrs_tag *tag, void *priv,
+		  const struct kvec *vec, size_t usr_len,
+		  struct scatterlist *sg, size_t sg_cnt,
+		  size_t data_len, int dir)
+{
+	struct ibtrs_clt_io_req *req;
+
+	req = &sess->reqs[tag->mem_id];
+	ibtrs_clt_init_req(req, sess, conf, tag, priv, vec, usr_len,
+			   sg, sg_cnt, data_len, dir);
+	return req;
+}
+
+static inline struct ibtrs_clt_io_req *
+ibtrs_clt_get_copy_req(struct ibtrs_clt_sess *alive_sess,
+		       struct ibtrs_clt_io_req *fail_req)
+{
+	struct ibtrs_clt_io_req *req;
+	struct kvec vec = {
+		.iov_base = fail_req->iu->buf,
+		.iov_len  = fail_req->usr_len
+	};
+
+	req = &alive_sess->reqs[fail_req->tag->mem_id];
+	ibtrs_clt_init_req(req, alive_sess, fail_req->conf, fail_req->tag,
+			   fail_req->priv, &vec, fail_req->usr_len,
+			   fail_req->sglist, fail_req->sg_cnt,
+			   fail_req->data_len, fail_req->dir);
+	return req;
+}
+
+static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
+static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);
+
+static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
+				  struct ibtrs_clt_io_req *fail_req)
+{
+	struct ibtrs_clt_sess *alive_sess;
+	struct ibtrs_clt_io_req *req;
+	int err = -ECONNABORTED;
+	struct path_it it;
+
+	do_each_path(alive_sess, clt, &it) {
+		if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
+			continue;
+		req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
+		if (req->dir == DMA_TO_DEVICE)
+			err = ibtrs_clt_write_req(req);
+		else
+			err = ibtrs_clt_read_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+
+static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess,
+				      bool failover)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_clt_io_req *req;
+	int i;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		bool notify;
+		int err = 0;
+
+		req = &sess->reqs[i];
+		if (!req->in_use)
+			continue;
+
+		if (failover)
+			err = ibtrs_clt_failover_req(clt, req);
+
+		notify = (!failover || err);
+		complete_rdma_req(req, -ECONNABORTED, notify);
+	}
+}
+
+static void free_sess_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_io_req *req;
+	int i;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR)
+			kfree(req->fr_list);
+		else if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR)
+			kfree(req->fmr_list);
+		kfree(req->map_page);
+		ibtrs_iu_free(req->iu, DMA_TO_DEVICE,
+			      sess->s.ib_dev->dev);
+	}
+	kfree(sess->reqs);
+	sess->reqs = NULL;
+}
+
+static int alloc_sess_reqs(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_io_req *req;
+	void *mr_list;
+	int i;
+
+	sess->reqs = kcalloc(sess->queue_depth, sizeof(*sess->reqs),
+			     GFP_KERNEL);
+	if (unlikely(!sess->reqs))
+		return -ENOMEM;
+
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		req->iu = ibtrs_iu_alloc(i, sess->max_req_size, GFP_KERNEL,
+					 sess->s.ib_dev->dev, DMA_TO_DEVICE,
+					 ibtrs_clt_rdma_done);
+		if (unlikely(!req->iu))
+			goto out;
+		mr_list = kmalloc_array(sess->max_pages_per_mr,
+					sizeof(void *), GFP_KERNEL);
+		if (unlikely(!mr_list))
+			goto out;
+		if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR)
+			req->fr_list = mr_list;
+		else if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR)
+			req->fmr_list = mr_list;
+
+		req->map_page = kmalloc_array(sess->max_pages_per_mr,
+					      sizeof(void *), GFP_KERNEL);
+		if (unlikely(!req->map_page))
+			goto out;
+	}
+
+	return 0;
+
+out:
+	free_sess_reqs(sess);
+
+	return -ENOMEM;
+}
+
+static int alloc_tags(struct ibtrs_clt *clt)
+{
+	unsigned int chunk_bits;
+	int err, i;
+
+	clt->tags_map = kcalloc(BITS_TO_LONGS(clt->queue_depth), sizeof(long),
+				GFP_KERNEL);
+	if (unlikely(!clt->tags_map)) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+	clt->tags = kcalloc(clt->queue_depth, TAG_SIZE(clt), GFP_KERNEL);
+	if (unlikely(!clt->tags)) {
+		err = -ENOMEM;
+		goto err_map;
+	}
+	chunk_bits = ilog2(clt->queue_depth - 1) + 1;
+	for (i = 0; i < clt->queue_depth; i++) {
+		struct ibtrs_tag *tag;
+
+		tag = GET_TAG(clt, i);
+		tag->mem_id = i;
+		tag->mem_off = i << (MAX_IMM_PAYL_BITS - chunk_bits);
+	}
+
+	return 0;
+
+err_map:
+	kfree(clt->tags_map);
+	clt->tags_map = NULL;
+out_err:
+	return err;
+}
+
+static void free_tags(struct ibtrs_clt *clt)
+{
+	kfree(clt->tags_map);
+	clt->tags_map = NULL;
+	kfree(clt->tags);
+	clt->tags = NULL;
+}
+
+static void query_fast_reg_mode(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_ib_dev *ib_dev;
+	u64 max_pages_per_mr;
+	int mr_page_shift;
+
+	ib_dev = sess->s.ib_dev;
+	if (ib_dev->dev->alloc_fmr && ib_dev->dev->dealloc_fmr &&
+	    ib_dev->dev->map_phys_fmr && ib_dev->dev->unmap_fmr) {
+		sess->fast_reg_mode = IBTRS_FAST_MEM_FMR;
+		ibtrs_info(sess, "Device %s supports FMR\n", ib_dev->dev->name);
+	}
+	if (ib_dev->attrs.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS &&
+	    use_fr) {
+		sess->fast_reg_mode = IBTRS_FAST_MEM_FR;
+		ibtrs_info(sess, "Device %s supports FR\n", ib_dev->dev->name);
+	}
+
+	/*
+	 * Use the smallest page size supported by the HCA, down to a
+	 * minimum of 4096 bytes. We're unlikely to build large sglists
+	 * out of smaller entries.
+	 */
+	mr_page_shift      = max(12, ffs(ib_dev->attrs.page_size_cap) - 1);
+	sess->mr_page_size = 1 << mr_page_shift;
+	sess->max_sge      = ib_dev->attrs.max_sge;
+	sess->mr_page_mask = ~((u64)sess->mr_page_size - 1);
+	max_pages_per_mr   = ib_dev->attrs.max_mr_size;
+	do_div(max_pages_per_mr, sess->mr_page_size);
+	sess->max_pages_per_mr = min_t(u64, sess->max_pages_per_mr,
+				       max_pages_per_mr);
+	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
+		sess->max_pages_per_mr =
+			min_t(u32, sess->max_pages_per_mr,
+			      ib_dev->attrs.max_fast_reg_page_list_len);
+	}
+	sess->mr_max_size = sess->mr_page_size * sess->max_pages_per_mr;
+}
+
+static int alloc_con_fast_pool(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_fr_pool *fr_pool;
+	int err = 0;
+
+	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
+		fr_pool = ibtrs_create_fr_pool(sess->s.ib_dev->dev,
+					       sess->s.ib_dev->pd,
+					       sess->queue_depth,
+					       sess->max_pages_per_mr);
+		if (unlikely(IS_ERR(fr_pool))) {
+			err = PTR_ERR(fr_pool);
+			ibtrs_err(sess, "FR pool allocation failed, err: %d\n",
+				  err);
+			return err;
+		}
+		con->fr_pool = fr_pool;
+	}
+
+	return err;
+}
+
+static void free_con_fast_pool(struct ibtrs_clt_con *con)
+{
+	if (con->fr_pool) {
+		ibtrs_destroy_fr_pool(con->fr_pool);
+		con->fr_pool = NULL;
+	}
+}
+
+static int alloc_sess_fast_pool(struct ibtrs_clt_sess *sess)
+{
+	struct ib_fmr_pool_param fmr_param;
+	struct ib_fmr_pool *fmr_pool;
+	int err = 0;
+
+	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR) {
+		memset(&fmr_param, 0, sizeof(fmr_param));
+		fmr_param.pool_size	    = sess->queue_depth *
+					      sess->max_pages_per_mr;
+		fmr_param.dirty_watermark   = fmr_param.pool_size / 4;
+		fmr_param.cache		    = 0;
+		fmr_param.max_pages_per_fmr = sess->max_pages_per_mr;
+		fmr_param.page_shift	    = ilog2(sess->mr_page_size);
+		fmr_param.access	    = (IB_ACCESS_LOCAL_WRITE |
+					       IB_ACCESS_REMOTE_WRITE);
+
+		fmr_pool = ib_create_fmr_pool(sess->s.ib_dev->pd, &fmr_param);
+		if (unlikely(IS_ERR(fmr_pool))) {
+			err = PTR_ERR(fmr_pool);
+			ibtrs_err(sess, "FMR pool allocation failed, err: %d\n",
+				  err);
+			return err;
+		}
+		sess->fmr_pool = fmr_pool;
+	}
+
+	return err;
+}
+
+static void free_sess_fast_pool(struct ibtrs_clt_sess *sess)
+{
+	if (sess->fmr_pool) {
+		ib_destroy_fmr_pool(sess->fmr_pool);
+		sess->fmr_pool = NULL;
+	}
+}
+
+static int alloc_sess_io_bufs(struct ibtrs_clt_sess *sess)
+{
+	int ret;
+
+	ret = alloc_sess_reqs(sess);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "alloc_sess_reqs(), err: %d\n", ret);
+		return ret;
+	}
+	ret = alloc_sess_fast_pool(sess);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "alloc_sess_fast_pool(), err: %d\n", ret);
+		goto free_reqs;
+	}
+
+	return 0;
+
+free_reqs:
+	free_sess_reqs(sess);
+
+	return ret;
+}
+
+static void free_sess_io_bufs(struct ibtrs_clt_sess *sess)
+{
+	free_sess_reqs(sess);
+	free_sess_fast_pool(sess);
+}
+
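+/*
+ * Perform a session state transition if it is allowed; returns true and
+ * wakes up state_wq waiters only when the state actually changed.
+ * Must be called with the state_wq lock held.
+ */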
+static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
+				     enum ibtrs_clt_state new_state)
+{
+	enum ibtrs_clt_state old_state;
+	bool changed = false;
+
+	old_state = sess->state;
+	switch (new_state) {
+	case IBTRS_CLT_CONNECTING:
+		switch (old_state) {
+		case IBTRS_CLT_RECONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_RECONNECTING:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTED:
+		case IBTRS_CLT_CONNECTING_ERR:
+		case IBTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CONNECTED:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CONNECTING_ERR:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CLOSING:
+		switch (old_state) {
+		case IBTRS_CLT_CONNECTING:
+		case IBTRS_CLT_CONNECTING_ERR:
+		case IBTRS_CLT_RECONNECTING:
+		case IBTRS_CLT_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_CLOSED:
+		switch (old_state) {
+		case IBTRS_CLT_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_CLT_DEAD:
+		switch (old_state) {
+		case IBTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed) {
+		sess->state = new_state;
+		wake_up_locked(&sess->state_wq);
+	}
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state_from_to(struct ibtrs_clt_sess *sess,
+					   enum ibtrs_clt_state old_state,
+					   enum ibtrs_clt_state new_state)
+{
+	bool changed = false;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	if (sess->state == old_state)
+		changed = __ibtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state_get_old(struct ibtrs_clt_sess *sess,
+					   enum ibtrs_clt_state new_state,
+					   enum ibtrs_clt_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	*old_state = sess->state;
+	changed = __ibtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static bool ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
+				   enum ibtrs_clt_state new_state)
+{
+	enum ibtrs_clt_state old_state;
+
+	return ibtrs_clt_change_state_get_old(sess, new_state, &old_state);
+}
+
+static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
+{
+	enum ibtrs_clt_state state;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	state = sess->state;
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return state;
+}
+
+static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
+{
+	struct ibtrs_clt_con *con;
+
+	(void)err;
+	con = container_of(c, typeof(*con), c);
+	ibtrs_rdma_error_recovery(con);
+}
+
+static void ibtrs_clt_init_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_init_hb(&sess->s, &io_comp_cqe,
+		      IBTRS_HB_INTERVAL_MS,
+		      IBTRS_HB_MISSED_MAX,
+		      ibtrs_clt_hb_err_handler,
+		      ibtrs_wq);
+}
+
+static void ibtrs_clt_start_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_start_hb(&sess->s);
+}
+
+static void ibtrs_clt_stop_hb(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_stop_hb(&sess->s);
+}
+
+static void ibtrs_clt_reconnect_work(struct work_struct *work);
+static void ibtrs_clt_close_work(struct work_struct *work);
+
+static struct ibtrs_clt_sess *alloc_sess(struct ibtrs_clt *clt,
+					 const struct ibtrs_addr *path,
+					 size_t con_num, u16 max_segments)
+{
+	struct ibtrs_clt_sess *sess;
+	int err = -ENOMEM;
+	int cpu;
+
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	/* Extra connection for user messages */
+	con_num += 1;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_sess;
+
+	mutex_init(&sess->init_mutex);
+	uuid_gen(&sess->s.uuid);
+	memcpy(&sess->s.dst_addr, path->dst,
+	       rdma_addr_size((struct sockaddr *)path->dst));
+
+	/*
+	 * rdma_resolve_addr() passes src_addr to cma_bind_addr(), which
+	 * checks that sa_family is non-zero. If the user passed src_addr=NULL,
+	 * sess->src_addr contains only zeros, which is fine then.
+	 */
+	if (path->src)
+		memcpy(&sess->s.src_addr, path->src,
+		       rdma_addr_size((struct sockaddr *)path->src));
+	strlcpy(sess->s.sessname, clt->sessname, sizeof(sess->s.sessname));
+	sess->s.con_num = con_num;
+	sess->clt = clt;
+	sess->max_pages_per_mr = max_segments;
+	init_waitqueue_head(&sess->state_wq);
+	sess->state = IBTRS_CLT_CONNECTING;
+	atomic_set(&sess->connected_cnt, 0);
+	INIT_WORK(&sess->close_work, ibtrs_clt_close_work);
+	INIT_DELAYED_WORK(&sess->reconnect_dwork, ibtrs_clt_reconnect_work);
+	ibtrs_clt_init_hb(sess);
+
+	sess->mp_skip_entry = alloc_percpu(typeof(*sess->mp_skip_entry));
+	if (unlikely(!sess->mp_skip_entry))
+		goto err_free_con;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(sess->mp_skip_entry, cpu));
+
+	err = ibtrs_clt_init_stats(&sess->stats);
+	if (unlikely(err))
+		goto err_free_percpu;
+
+	return sess;
+
+err_free_percpu:
+	free_percpu(sess->mp_skip_entry);
+err_free_con:
+	kfree(sess->s.con);
+err_free_sess:
+	kfree(sess);
+err:
+	return ERR_PTR(err);
+}
+
+static void free_sess(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_clt_free_stats(&sess->stats);
+	free_percpu(sess->mp_skip_entry);
+	kfree(sess->s.con);
+	kfree(sess->srv_rdma_addr);
+	kfree(sess);
+}
+
+static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
+{
+	struct ibtrs_clt_con *con;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con))
+		return -ENOMEM;
+
+	/* Map first two connections to the first CPU */
+	con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
+	con->c.cid = cid;
+	con->c.sess = &sess->s;
+	atomic_set(&con->io_cnt, 0);
+
+	sess->s.con[cid] = &con->c;
+
+	return 0;
+}
+
+static void destroy_con(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	sess->s.con[con->c.cid] = NULL;
+	kfree(con);
+}
+
+static int create_con_cq_qp(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u16 cq_size, wr_queue_size;
+	int err, cq_vector;
+
+	/*
+	 * This function can fail, but destroy_con_cq_qp() should still be
+	 * called, because create_con_cq_qp() is called on the CM event path,
+	 * so the caller/waiter never knows whether we failed before or after
+	 * create_con_cq_qp().  To solve this dilemma without introducing any
+	 * additional flags, simply allow destroy_con_cq_qp() to be called
+	 * many times.
+	 */
+
+	if (con->c.cid == 0) {
+		cq_size = SERVICE_CON_QUEUE_DEPTH;
+		/* + 2 for drain and heartbeat */
+		wr_queue_size = SERVICE_CON_QUEUE_DEPTH + 2;
+		/* We must be the first here */
+		if (WARN_ON(sess->s.ib_dev))
+			return -EINVAL;
+
+		/*
+		 * The whole session uses the device from the user connection.
+		 * Be careful not to close the user connection before the
+		 * ib dev is gracefully put.
+		 */
+		sess->s.ib_dev = ibtrs_ib_dev_find_get(con->c.cm_id);
+		if (unlikely(!sess->s.ib_dev)) {
+			ibtrs_wrn(sess, "ibtrs_ib_dev_find_get(): no memory\n");
+			return -ENOMEM;
+		}
+		sess->s.ib_dev_ref = 1;
+		query_fast_reg_mode(sess);
+	} else {
+		int num_wr;
+
+		/*
+		 * Here we assume that session members are correctly set.
+		 * This is always true if user connection (cid == 0) is
+		 * established first.
+		 */
+		if (WARN_ON(!sess->s.ib_dev))
+			return -EINVAL;
+		if (WARN_ON(!sess->queue_depth))
+			return -EINVAL;
+
+		/* Shared between connections */
+		sess->s.ib_dev_ref++;
+		cq_size = sess->queue_depth;
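+		/*
+		 * Each IO may need several WRs (e.g. memory registration and
+		 * the RDMA send itself), cap the WR queue at the device limit.
+		 */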
+		num_wr = DIV_ROUND_UP(sess->max_pages_per_mr, sess->max_sge);
+		wr_queue_size = sess->s.ib_dev->attrs.max_qp_wr;
+		wr_queue_size = min_t(int, wr_queue_size,
+				      sess->queue_depth * num_wr *
+				      (use_fr ? 3 : 2) + 1);
+	}
+	cq_vector = con->cpu % sess->s.ib_dev->dev->num_comp_vectors;
+	err = ibtrs_cq_qp_create(&sess->s, &con->c, sess->max_sge,
+				 cq_vector, cq_size, wr_queue_size,
+				 IB_POLL_SOFTIRQ);
+	/*
+	 * In case of error we do not bother to clean previous allocations,
+	 * since destroy_con_cq_qp() must be called.
+	 */
+
+	if (unlikely(err))
+		return err;
+
+	if (con->c.cid) {
+		err = alloc_con_fast_pool(con);
+		if (unlikely(err))
+			ibtrs_cq_qp_destroy(&con->c);
+	}
+
+	return err;
+}
+
+static void destroy_con_cq_qp(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	/*
+	 * Be careful here: destroy_con_cq_qp() can be called even if
+	 * create_con_cq_qp() failed, see the comments there.
+	 */
+
+	ibtrs_cq_qp_destroy(&con->c);
+	if (con->c.cid != 0)
+		free_con_fast_pool(con);
+	if (sess->s.ib_dev_ref && !--sess->s.ib_dev_ref) {
+		ibtrs_ib_dev_put(sess->s.ib_dev);
+		sess->s.ib_dev = NULL;
+	}
+}
+
+static void stop_cm(struct ibtrs_clt_con *con)
+{
+	rdma_disconnect(con->c.cm_id);
+	if (con->c.qp)
+		ib_drain_qp(con->c.qp);
+}
+
+static void destroy_cm(struct ibtrs_clt_con *con)
+{
+	rdma_destroy_id(con->c.cm_id);
+	con->c.cm_id = NULL;
+}
+
+static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev);
+
+static int create_cm(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rdma_cm_id *cm_id;
+	int err;
+
+	cm_id = rdma_create_id(&init_net, ibtrs_clt_rdma_cm_handler, con,
+			       sess->s.dst_addr.ss_family == AF_IB ?
+			       RDMA_PS_IB : RDMA_PS_TCP, IB_QPT_RC);
+	if (unlikely(IS_ERR(cm_id))) {
+		err = PTR_ERR(cm_id);
+		ibtrs_err(sess, "Failed to create CM ID, err: %d\n", err);
+
+		return err;
+	}
+	con->c.cm_id = cm_id;
+	con->cm_err = 0;
+	/* allow the port to be reused */
+	err = rdma_set_reuseaddr(cm_id, 1);
+	if (err != 0) {
+		ibtrs_err(sess, "Set address reuse failed, err: %d\n", err);
+		goto destroy_cm;
+	}
+	err = rdma_resolve_addr(cm_id, (struct sockaddr *)&sess->s.src_addr,
+				(struct sockaddr *)&sess->s.dst_addr,
+				IBTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "Failed to resolve address, err: %d\n", err);
+		goto destroy_cm;
+	}
+	/*
+	 * Combine connection status and session events. This is needed
+	 * to wait for two possible cases: cm_err has something meaningful,
+	 * or the session state was really changed to an error by device removal.
+	 */
+	err = wait_event_interruptible_timeout(sess->state_wq,
+			con->cm_err || sess->state != IBTRS_CLT_CONNECTING,
+			msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(err == 0 || err == -ERESTARTSYS)) {
+		if (err == 0)
+			err = -ETIMEDOUT;
+		/* Timed out or interrupted */
+		goto errr;
+	}
+	if (unlikely(con->cm_err < 0)) {
+		err = con->cm_err;
+		goto errr;
+	}
+	if (unlikely(sess->state != IBTRS_CLT_CONNECTING)) {
+		/* Device removal */
+		err = -ECONNABORTED;
+		goto errr;
+	}
+
+	return 0;
+
+errr:
+	stop_cm(con);
+	/* It is safe to call destroy even if cq_qp was not initialized */
+	destroy_con_cq_qp(con);
+destroy_cm:
+	destroy_cm(con);
+
+	return err;
+}
+
+static void ibtrs_clt_sess_up(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	int up;
+
+	/*
+	 * We can fire RECONNECTED event only when all paths were
+	 * connected on ibtrs_clt_open(), then each was disconnected
+	 * and the first one connected again.  That's why this nasty
+	 * game with counter value.
+	 */
+
+	mutex_lock(&clt->paths_ev_mutex);
+	up = ++clt->paths_up;
+	/*
+	 * Here it is safe to access paths_num directly, since the up counter
+	 * is greater than MAX_PATHS_NUM only while ibtrs_clt_open() is
+	 * in progress, thus path removals are impossible.
+	 */
+	if (up > MAX_PATHS_NUM && up == MAX_PATHS_NUM + clt->paths_num)
+		clt->paths_up = clt->paths_num;
+	else if (up == 1)
+		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_RECONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+	sess->reconnect_attempts = 0;
+	sess->stats.reconnects.successful_cnt++;
+}
+
+static void ibtrs_clt_sess_down(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&clt->paths_ev_mutex);
+	WARN_ON(!clt->paths_up);
+	if (--clt->paths_up == 0)
+		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_DISCONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+}
+
+static void ibtrs_clt_stop_and_destroy_conns(struct ibtrs_clt_sess *sess,
+					     bool failover)
+{
+	struct ibtrs_clt_con *con;
+	unsigned int cid;
+
+	WARN_ON(sess->state == IBTRS_CLT_CONNECTED);
+
+	/*
+	 * Possible race with ibtrs_clt_open(), when DEVICE_REMOVAL comes
+	 * exactly in between.  Start destroying after it finishes.
+	 */
+	mutex_lock(&sess->init_mutex);
+	mutex_unlock(&sess->init_mutex);
+
+	/*
+	 * All IO paths must observe !CONNECTED state before we
+	 * free everything.
+	 */
+	synchronize_rcu();
+
+	ibtrs_clt_stop_hb(sess);
+
+	/*
+	 * The order is utterly crucial: first disconnect and complete all
+	 * rdma requests with an error (thus setting in_use=false for requests),
+	 * then fail outstanding requests checking in_use for each, and
+	 * eventually notify the upper layer about session disconnection.
+	 */
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		con = to_clt_con(sess->s.con[cid]);
+		if (!con)
+			break;
+
+		stop_cm(con);
+	}
+	fail_all_outstanding_reqs(sess, failover);
+	free_sess_io_bufs(sess);
+	ibtrs_clt_sess_down(sess);
+
+	/*
+	 * Wait for graceful shutdown, namely when the peer side invokes
+	 * rdma_disconnect(). 'connected_cnt' is decremented only on
+	 * CM events, thus if the other side has crashed and hb has detected
+	 * that something is wrong, we will be stuck here for exactly the
+	 * timeout ms, since CM does not fire anything.  That is fine, we
+	 * are not in a hurry.
+	 */
+	wait_event_timeout(sess->state_wq, !atomic_read(&sess->connected_cnt),
+			   msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		con = to_clt_con(sess->s.con[cid]);
+		if (!con)
+			break;
+
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+}
+
+static void ibtrs_clt_remove_path_from_arr(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_clt_sess *next;
+	int cpu;
+
+	mutex_lock(&clt->paths_mutex);
+	list_del_rcu(&sess->s.entry);
+
+	/* Make sure everybody observes path removal. */
+	synchronize_rcu();
+
+	/*
+	 * Decrement the paths number only after the grace period, because
+	 * the caller of do_each_path() must first observe the list without
+	 * the path and only then the decremented paths number.
+	 *
+	 * Otherwise there can be the following situation:
+	 *    o Two paths exist and IO is coming.
+	 *    o One path is removed:
+	 *      CPU#0                          CPU#1
+	 *      do_each_path():                ibtrs_clt_remove_path_from_arr():
+	 *          path = get_next_path()
+	 *          ^^^                            list_del_rcu(path)
+	 *          [!CONNECTED path]              clt->paths_num--
+	 *                                              ^^^^^^^^^
+	 *          load clt->paths_num                 from 2 to 1
+	 *                    ^^^^^^^^^
+	 *                    sees 1
+	 *
+	 *      path is observed as !CONNECTED, but do_each_path() loop
+	 *      ends, because expression i < clt->paths_num is false.
+	 */
+	clt->paths_num--;
+
+	next = list_next_or_null_rcu_rr(sess, &clt->paths_list, s.entry);
+
+	/*
+	 * Pcpu paths can still point to the path which is going to be
+	 * removed, so change the pointer manually.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct ibtrs_clt_sess **ppcpu_path;
+
+		ppcpu_path = per_cpu_ptr(clt->pcpu_path, cpu);
+		if (*ppcpu_path != sess)
+			/*
+			 * synchronize_rcu() was called just after deleting
+			 * the entry from the list, thus the IO code path cannot
+			 * change the pointer back to the one which is going
+			 * to be removed; we are safe here.
+			 */
+			continue;
+
+		/*
+		 * We race with IO code path, which also changes pointer,
+		 * thus we have to be careful not to override it.
+		 */
+		cmpxchg(ppcpu_path, sess, next);
+	}
+	mutex_unlock(&clt->paths_mutex);
+}
+
+static inline bool __ibtrs_clt_path_exists(struct ibtrs_clt *clt,
+					   struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt_sess *sess;
+
+	list_for_each_entry(sess, &clt->paths_list, s.entry)
+		if (!sockaddr_cmp((struct sockaddr *)&sess->s.dst_addr,
+				  addr->dst))
+			return true;
+
+	return false;
+}
+
+static bool ibtrs_clt_path_exists(struct ibtrs_clt *clt,
+				  struct ibtrs_addr *addr)
+{
+	bool res;
+
+	mutex_lock(&clt->paths_mutex);
+	res = __ibtrs_clt_path_exists(clt, addr);
+	mutex_unlock(&clt->paths_mutex);
+
+	return res;
+}
+
+static int ibtrs_clt_add_path_to_arr(struct ibtrs_clt_sess *sess,
+				     struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	int err = 0;
+
+	mutex_lock(&clt->paths_mutex);
+	if (!__ibtrs_clt_path_exists(clt, addr)) {
+		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+		clt->paths_num++;
+	} else
+		err = -EEXIST;
+	mutex_unlock(&clt->paths_mutex);
+
+	return err;
+}
+
+static void ibtrs_clt_close_work(struct work_struct *work)
+{
+	struct ibtrs_clt_sess *sess;
+	/*
+	 * Always try to do a failover; if only a single path remains,
+	 * all requests will be completed with an error.
+	 */
+	bool failover = true;
+
+	sess = container_of(work, struct ibtrs_clt_sess, close_work);
+
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	ibtrs_clt_stop_and_destroy_conns(sess, failover);
+	/*
+	 * Sounds stupid, huh?  No, it is not.  Consider this sequence:
+	 *
+	 *   #CPU0                              #CPU1
+	 *   1.  CONNECTED->RECONNECTING
+	 *   2.                                 RECONNECTING->CLOSING
+	 *   3.  queue_work(&reconnect_dwork)
+	 *   4.                                 queue_work(&close_work);
+	 *   5.  reconnect_work();              close_work();
+	 *
+	 * To avoid that case do cancel twice: before and after.
+	 */
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSED);
+}
+
+static void ibtrs_clt_close_conns(struct ibtrs_clt_sess *sess, bool wait)
+{
+	if (ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSING))
+		queue_work(ibtrs_wq, &sess->close_work);
+	if (wait)
+		flush_work(&sess->close_work);
+}
+
+static int init_conns(struct ibtrs_clt_sess *sess)
+{
+	unsigned int cid;
+	int err;
+
+	/*
+	 * On every new session connection increase the reconnect counter
+	 * to avoid clashes with previous sessions that are not yet closed
+	 * on the server side.
+	 */
+	sess->s.recon_cnt++;
+
+	/* Establish all RDMA connections  */
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		err = create_con(sess, cid);
+		if (unlikely(err))
+			goto destroy;
+
+		err = create_cm(to_clt_con(sess->s.con[cid]));
+		if (unlikely(err)) {
+			destroy_con(to_clt_con(sess->s.con[cid]));
+			goto destroy;
+		}
+	}
+	/* Allocate all session related buffers */
+	err = alloc_sess_io_bufs(sess);
+	if (unlikely(err))
+		goto destroy;
+
+	ibtrs_clt_start_hb(sess);
+
+	return 0;
+
+destroy:
+	while (cid--) {
+		struct ibtrs_clt_con *con = to_clt_con(sess->s.con[cid]);
+
+		stop_cm(con);
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+	/*
+	 * If we have never taken the async path and got an error, say,
+	 * from rdma_resolve_addr(), switch to the CONNECTING_ERR state
+	 * manually to keep reconnecting.
+	 */
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+static int ibtrs_rdma_addr_resolved(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int err;
+
+	err = create_con_cq_qp(con);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "create_con_cq_qp(), err: %d\n", err);
+		return err;
+	}
+	err = rdma_resolve_route(con->c.cm_id, IBTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "Resolving route failed, err: %d\n", err);
+		destroy_con_cq_qp(con);
+	}
+
+	return err;
+}
+
+static int ibtrs_rdma_route_resolved(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt *clt = sess->clt;
+	struct ibtrs_msg_conn_req msg;
+	struct rdma_conn_param param;
+
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.retry_count = retry_count;
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	/*
+	 * These two are part of struct cma_hdr, which is shared with
+	 * private_data in the AF_IB case, so put zeroes there to avoid
+	 * wrong validation inside cma.c on the receiver side.
+	 */
+	msg.__cma_version = 0;
+	msg.__ip_version = 0;
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_VERSION);
+	msg.cid = cpu_to_le16(con->c.cid);
+	msg.cid_num = cpu_to_le16(sess->s.con_num);
+	msg.recon_cnt = cpu_to_le16(sess->s.recon_cnt);
+	uuid_copy(&msg.sess_uuid, &sess->s.uuid);
+	uuid_copy(&msg.paths_uuid, &clt->paths_uuid);
+
+	err = rdma_connect(con->c.cm_id, &param);
+	if (err)
+		ibtrs_err(sess, "rdma_connect(): %d\n", err);
+
+	return err;
+}
+
+static int ibtrs_rdma_conn_established(struct ibtrs_clt_con *con,
+				       struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	const struct ibtrs_msg_conn_rsp *msg;
+	u16 version, queue_depth;
+	int errno;
+	u8 len;
+
+	msg = ev->param.conn.private_data;
+	len = ev->param.conn.private_data_len;
+	if (unlikely(len < sizeof(*msg))) {
+		ibtrs_err(sess, "Invalid IBTRS connection response");
+		return -ECONNRESET;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != IBTRS_MAGIC)) {
+		ibtrs_err(sess, "Invalid IBTRS magic");
+		return -ECONNRESET;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != IBTRS_VER_MAJOR)) {
+		ibtrs_err(sess, "Unsupported major IBTRS version: %d",
+			  version);
+		return -ECONNRESET;
+	}
+	errno = le16_to_cpu(msg->errno);
+	if (unlikely(errno)) {
+		ibtrs_err(sess, "Invalid IBTRS message: errno %d",
+			  errno);
+		return -ECONNRESET;
+	}
+	if (con->c.cid == 0) {
+		queue_depth = le16_to_cpu(msg->queue_depth);
+
+		if (queue_depth > MAX_SESS_QUEUE_DEPTH) {
+			ibtrs_err(sess, "Invalid IBTRS message: queue=%d\n",
+				  queue_depth);
+			return -ECONNRESET;
+		}
+		if (!sess->srv_rdma_addr || sess->queue_depth < queue_depth) {
+			kfree(sess->srv_rdma_addr);
+			sess->srv_rdma_addr =
+				kcalloc(queue_depth,
+					sizeof(*sess->srv_rdma_addr),
+					GFP_KERNEL);
+			if (unlikely(!sess->srv_rdma_addr)) {
+				ibtrs_err(sess, "Failed to allocate "
+					  "queue_depth=%d\n", queue_depth);
+				return -ENOMEM;
+			}
+		}
+		sess->queue_depth = queue_depth;
+		sess->srv_rdma_buf_rkey = le32_to_cpu(msg->rkey);
+		sess->max_req_size = le32_to_cpu(msg->max_req_size);
+		sess->max_io_size = le32_to_cpu(msg->max_io_size);
+		sess->chunk_size = sess->max_io_size + sess->max_req_size;
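+		/*
+		 * max_desc is how many sg descriptors still fit into a request
+		 * buffer after the header fields and the user message.
+		 */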
+		sess->max_desc  = sess->max_req_size;
+		sess->max_desc -= sizeof(u32) + sizeof(u32) + IO_MSG_SIZE;
+		sess->max_desc /= sizeof(struct ibtrs_sg_desc);
+
+		/*
+		 * The global queue depth is always a minimum.  If during a
+		 * reconnection the server sends us a slightly higher value,
+		 * the client does not care and uses the cached minimum.
+		 */
+		ibtrs_clt_set_min_queue_depth(sess->clt, sess->queue_depth);
+		ibtrs_clt_set_min_io_size(sess->clt, sess->max_io_size);
+	}
+
+	return 0;
+}
+
+static int ibtrs_rdma_conn_rejected(struct ibtrs_clt_con *con,
+				    struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	const struct ibtrs_msg_conn_rsp *msg;
+	const char *rej_msg;
+	int status, errno;
+	u8 data_len;
+
+	status = ev->status;
+	rej_msg = rdma_reject_msg(con->c.cm_id, status);
+	msg = rdma_consumer_reject_data(con->c.cm_id, ev, &data_len);
+
+	if (msg && data_len >= sizeof(*msg)) {
+		errno = (int16_t)le16_to_cpu(msg->errno);
+		if (errno == -EBUSY)
+			ibtrs_err(sess,
+				  "Previous session is still exists on the "
+				  "server, please reconnect later\n");
+		else
+			ibtrs_err(sess,
+				  "Connect rejected: status %d (%s), ibtrs "
+				  "errno %d\n", status, rej_msg, errno);
+	} else {
+		ibtrs_err(sess,
+			  "Connect rejected but with malformed message: "
+			  "status %d (%s)\n", status, rej_msg);
+	}
+
+	return -ECONNRESET;
+}
+
+static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (ibtrs_clt_change_state_from_to(sess,
+					   IBTRS_CLT_CONNECTED,
+					   IBTRS_CLT_RECONNECTING)) {
+		/*
+		 * Normal scenario, reconnect if we were successfully connected
+		 */
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
+	} else {
+		/*
+		 * An error can happen only while establishing a new connection,
+		 * so notify the waiter with the error state; the waiter is
+		 * responsible for cleaning up the rest and reconnecting if needed.
+		 */
+		ibtrs_clt_change_state_from_to(sess,
+					       IBTRS_CLT_CONNECTING,
+					       IBTRS_CLT_CONNECTING_ERR);
+	}
+}
+
+static inline void flag_success_on_conn(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	atomic_inc(&sess->connected_cnt);
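+	/* cm_err == 1 marks the connection as successfully established */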
+	con->cm_err = 1;
+}
+
+static inline void flag_error_on_conn(struct ibtrs_clt_con *con, int cm_err)
+{
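+	/* Decrement connected_cnt only if success was flagged for this connection */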
+	if (con->cm_err == 1) {
+		struct ibtrs_clt_sess *sess;
+
+		sess = to_clt_sess(con->c.sess);
+		if (atomic_dec_and_test(&sess->connected_cnt))
+			wake_up(&sess->state_wq);
+	}
+	con->cm_err = cm_err;
+}
+
+static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct ibtrs_clt_con *con = cm_id->context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	int cm_err = 0;
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		cm_err = ibtrs_rdma_addr_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ROUTE_RESOLVED:
+		cm_err = ibtrs_rdma_route_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ESTABLISHED:
+		con->cm_err = ibtrs_rdma_conn_established(con, ev);
+		if (likely(!con->cm_err)) {
+			/*
+			 * Report success and wake up. Here we abuse state_wq,
+			 * i.e. wake up without state change, but we set cm_err.
+			 */
+			flag_success_on_conn(con);
+			wake_up(&sess->state_wq);
+			return 0;
+		}
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+		cm_err = ibtrs_rdma_conn_rejected(con, ev);
+		break;
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		ibtrs_wrn(sess, "CM error event %d\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_ADDR_ERROR:
+	case RDMA_CM_EVENT_ROUTE_ERROR:
+		cm_err = -EHOSTUNREACH;
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		/*
+		 * Device removal is a special case.  Queue close and return 0.
+		 */
+		ibtrs_clt_close_conns(sess, false);
+		return 0;
+	default:
+		ibtrs_err(sess, "Unexpected RDMA CM event (%d)\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	}
+
+	if (cm_err) {
+		/*
+		 * A cm error makes sense only during connection establishment;
+		 * in other cases we rely on the normal reconnect procedure.
+		 */
+		flag_error_on_conn(con, cm_err);
+		ibtrs_rdma_error_recovery(con);
+	}
+
+	return 0;
+}
+
+static void ibtrs_clt_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	ibtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.ib_dev->dev);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info request send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+		return;
+	}
+
+	ibtrs_clt_update_wc_stats(con);
+}
+
+static int process_info_rsp(struct ibtrs_clt_sess *sess,
+			    const struct ibtrs_msg_info_rsp *msg)
+{
+	unsigned int addr_num;
+	int i;
+
+	addr_num = le16_to_cpu(msg->addr_num);
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and
+	 * the offset inside the memory chunk.
+	 */
+	if (unlikely(ilog2(addr_num - 1) + ilog2(sess->chunk_size - 1) >
+		     MAX_IMM_PAYL_BITS)) {
+		ibtrs_err(sess, "RDMA immediate size (%db) not enough to "
+			  "encode %d buffers of size %dB\n",  MAX_IMM_PAYL_BITS,
+			  addr_num, sess->chunk_size);
+		return -EINVAL;
+	}
+	if (unlikely(addr_num > sess->queue_depth)) {
+		ibtrs_err(sess, "Incorrect addr_num=%d\n", addr_num);
+		return -EINVAL;
+	}
+	for (i = 0; i < addr_num; i++)
+		sess->srv_rdma_addr[i] = le64_to_cpu(msg->addr[i]);
+
+	return 0;
+}
+
+static void ibtrs_clt_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_clt_con *con = cq->cq_context;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_info_rsp *msg;
+	enum ibtrs_clt_state state;
+	struct ibtrs_iu *iu;
+	size_t rx_sz;
+	int err;
+
+	state = IBTRS_CLT_CONNECTING_ERR;
+
+	WARN_ON(con->c.cid);
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info response recv failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto out;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != IBTRS_MSG_INFO_RSP)) {
+		ibtrs_err(sess, "Sess info response is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto out;
+	}
+	rx_sz  = sizeof(*msg);
+	rx_sz += sizeof(msg->addr[0]) * le16_to_cpu(msg->addr_num);
+	if (unlikely(wc->byte_len < rx_sz)) {
+		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	err = process_info_rsp(sess, msg);
+	if (unlikely(err))
+		goto out;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err))
+		goto out;
+
+	state = IBTRS_CLT_CONNECTED;
+
+out:
+	ibtrs_clt_update_wc_stats(con);
+	ibtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.ib_dev->dev);
+	ibtrs_clt_change_state(sess, state);
+}
+
+static int ibtrs_send_sess_info(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt_con *usr_con = to_clt_con(sess->s.con[0]);
+	struct ibtrs_msg_info_req *msg;
+	struct ibtrs_iu *tx_iu, *rx_iu;
+	size_t rx_sz;
+	int err;
+
+	rx_sz  = sizeof(struct ibtrs_msg_info_rsp);
+	rx_sz += sizeof(u64) * MAX_SESS_QUEUE_DEPTH;
+
+	tx_iu = ibtrs_iu_alloc(0, sizeof(struct ibtrs_msg_info_req), GFP_KERNEL,
+			       sess->s.ib_dev->dev, DMA_TO_DEVICE,
+			       ibtrs_clt_info_req_done);
+	rx_iu = ibtrs_iu_alloc(0, rx_sz, GFP_KERNEL, sess->s.ib_dev->dev,
+			       DMA_FROM_DEVICE, ibtrs_clt_info_rsp_done);
+	if (unlikely(!tx_iu || !rx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(): no memory\n");
+		err = -ENOMEM;
+		goto out;
+	}
+	/* Prepare for getting info response */
+	err = ibtrs_iu_post_recv(&usr_con->c, rx_iu);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_recv(), err: %d\n", err);
+		goto out;
+	}
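+	/* The posted receive now owns rx_iu, see ibtrs_clt_info_rsp_done() */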
+	rx_iu = NULL;
+
+	msg = tx_iu->buf;
+	msg->type = cpu_to_le16(IBTRS_MSG_INFO_REQ);
+	memcpy(msg->sessname, sess->s.sessname, sizeof(msg->sessname));
+
+	/* Send info request */
+	err = ibtrs_iu_post_send(&usr_con->c, tx_iu, sizeof(*msg));
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_send(), err: %d\n", err);
+		goto out;
+	}
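+	/* The posted send now owns tx_iu, see ibtrs_clt_info_req_done() */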
+	tx_iu = NULL;
+
+	/* Wait for state change */
+	wait_event_interruptible_timeout(sess->state_wq,
+				sess->state != IBTRS_CLT_CONNECTING,
+				msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(sess->state != IBTRS_CLT_CONNECTED)) {
+		if (sess->state == IBTRS_CLT_CONNECTING_ERR)
+			err = -ECONNRESET;
+		else
+			err = -ETIMEDOUT;
+		goto out;
+	}
+
+out:
+	if (tx_iu)
+		ibtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.ib_dev->dev);
+	if (rx_iu)
+		ibtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.ib_dev->dev);
+	if (unlikely(err))
+		/* If we've never taken async path because of malloc problems */
+		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+/**
+ * init_sess() - establishes all session connections and does handshake
+ *
+ * In case of an error, a full close or reconnect procedure should be taken,
+ * because the reconnect or close async works may already have been started.
+ */
+static int init_sess(struct ibtrs_clt_sess *sess)
+{
+	int err;
+
+	mutex_lock(&sess->init_mutex);
+	err = init_conns(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "init_conns(), err: %d\n", err);
+		goto out;
+	}
+	err = ibtrs_send_sess_info(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_send_sess_info(), err: %d\n", err);
+		goto out;
+	}
+	ibtrs_clt_sess_up(sess);
+out:
+	mutex_unlock(&sess->init_mutex);
+
+	return err;
+}
+
+static void ibtrs_clt_reconnect_work(struct work_struct *work)
+{
+	struct ibtrs_clt_sess *sess;
+	struct ibtrs_clt *clt;
+	unsigned int delay_ms;
+	int err;
+
+	sess = container_of(to_delayed_work(work), struct ibtrs_clt_sess,
+			    reconnect_dwork);
+	clt = sess->clt;
+
+	if (ibtrs_clt_state(sess) == IBTRS_CLT_CLOSING)
+		/* User requested closing */
+		return;
+
+	if (sess->reconnect_attempts >= clt->max_reconnect_attempts) {
+		/* Close a session completely if max attempts is reached */
+		ibtrs_clt_close_conns(sess, false);
+		return;
+	}
+	sess->reconnect_attempts++;
+
+	/* Stop everything */
+	ibtrs_clt_stop_and_destroy_conns(sess, true);
+	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING);
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto reconnect_again;
+
+	return;
+
+reconnect_again:
+	if (ibtrs_clt_change_state(sess, IBTRS_CLT_RECONNECTING)) {
+		sess->stats.reconnects.fail_cnt++;
+		delay_ms = clt->reconnect_delay_sec * 1000;
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork,
+				   msecs_to_jiffies(delay_ms));
+	}
+}
+
+static struct ibtrs_clt *alloc_clt(const char *sessname, size_t paths_num,
+				   short port, size_t pdu_sz,
+				   void *priv, link_clt_ev_fn *link_ev,
+				   unsigned int max_segments,
+				   unsigned int reconnect_delay_sec,
+				   unsigned int max_reconnect_attempts)
+{
+	struct ibtrs_clt *clt;
+	int err;
+
+	if (unlikely(!paths_num || paths_num > MAX_PATHS_NUM))
+		return ERR_PTR(-EINVAL);
+
+	if (unlikely(strlen(sessname) >= sizeof(clt->sessname)))
+		return ERR_PTR(-EINVAL);
+
+	clt = kzalloc(sizeof(*clt), GFP_KERNEL);
+	if (unlikely(!clt))
+		return ERR_PTR(-ENOMEM);
+
+	clt->pcpu_path = alloc_percpu(typeof(*clt->pcpu_path));
+	if (unlikely(!clt->pcpu_path)) {
+		kfree(clt);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	uuid_gen(&clt->paths_uuid);
+	INIT_LIST_HEAD_RCU(&clt->paths_list);
+	clt->paths_num = paths_num;
+	clt->paths_up = MAX_PATHS_NUM;
+	clt->port = port;
+	clt->pdu_sz = pdu_sz;
+	clt->max_segments = max_segments;
+	clt->reconnect_delay_sec = reconnect_delay_sec;
+	clt->max_reconnect_attempts = max_reconnect_attempts;
+	clt->priv = priv;
+	clt->link_ev = link_ev;
+	clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	strlcpy(clt->sessname, sessname, sizeof(clt->sessname));
+	init_waitqueue_head(&clt->tags_wait);
+	mutex_init(&clt->paths_ev_mutex);
+	mutex_init(&clt->paths_mutex);
+
+	err = ibtrs_clt_create_sysfs_root_folders(clt);
+	if (unlikely(err)) {
+		free_percpu(clt->pcpu_path);
+		kfree(clt);
+		return ERR_PTR(err);
+	}
+
+	return clt;
+}
+
+static void wait_for_inflight_tags(struct ibtrs_clt *clt)
+{
+	if (clt->tags_map) {
+		size_t sz = clt->queue_depth;
+
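+		/* No set bits left in tags_map means all tags have been released */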
+		wait_event(clt->tags_wait,
+			   find_first_bit(clt->tags_map, sz) >= sz);
+	}
+}
+
+static void free_clt(struct ibtrs_clt *clt)
+{
+	ibtrs_clt_destroy_sysfs_root_folders(clt);
+	wait_for_inflight_tags(clt);
+	free_tags(clt);
+	free_percpu(clt->pcpu_path);
+	kfree(clt);
+}
+
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct ibtrs_addr *paths,
+				 size_t paths_num,
+				 short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts)
+{
+	struct ibtrs_clt_sess *sess, *tmp;
+	struct ibtrs_clt *clt;
+	int err, i;
+
+	clt = alloc_clt(sessname, paths_num, port, pdu_sz, priv, link_ev,
+			max_segments, reconnect_delay_sec,
+			max_reconnect_attempts);
+	if (unlikely(IS_ERR(clt))) {
+		err = PTR_ERR(clt);
+		goto out;
+	}
+	for (i = 0; i < paths_num; i++) {
+		struct ibtrs_clt_sess *sess;
+
+		sess = alloc_sess(clt, &paths[i], nr_cons_per_session,
+				  max_segments);
+		if (unlikely(IS_ERR(sess))) {
+			err = PTR_ERR(sess);
+			ibtrs_err(clt, "alloc_sess(), err: %d\n", err);
+			goto close_all_sess;
+		}
+		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+
+		err = init_sess(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+
+		err = ibtrs_clt_create_sess_files(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+	}
+	err = alloc_tags(clt);
+	if (unlikely(err)) {
+		ibtrs_err(clt, "alloc_tags(), err: %d\n", err);
+		goto close_all_sess;
+	}
+	err = ibtrs_clt_create_sysfs_root_files(clt);
+	if (unlikely(err))
+		goto close_all_sess;
+
+	/*
+	 * There is a race if someone decides to completely remove the just
+	 * created path using the sysfs entry.  To avoid the race we use a
+	 * simple 'opened' flag, see ibtrs_clt_remove_path_from_sysfs().
+	 */
+	clt->opened = true;
+
+	/* Do not let module be unloaded if client is alive */
+	__module_get(THIS_MODULE);
+
+	return clt;
+
+close_all_sess:
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		ibtrs_clt_destroy_sess_files(sess, NULL);
+		ibtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+
+out:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(ibtrs_clt_open);
+
+void ibtrs_clt_close(struct ibtrs_clt *clt)
+{
+	struct ibtrs_clt_sess *sess, *tmp;
+
+	/* Firstly forbid sysfs access */
+	ibtrs_clt_destroy_sysfs_root_files(clt);
+	ibtrs_clt_destroy_sysfs_root_folders(clt);
+
+	/* Now it is safe to iterate over all paths without locks */
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		ibtrs_clt_destroy_sess_files(sess, NULL);
+		ibtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(ibtrs_clt_close);
+
+int ibtrs_clt_reconnect_from_sysfs(struct ibtrs_clt_sess *sess)
+{
+	enum ibtrs_clt_state old_state;
+	int err = -EBUSY;
+	bool changed;
+
+	changed = ibtrs_clt_change_state_get_old(sess, IBTRS_CLT_RECONNECTING,
+						 &old_state);
+	if (changed) {
+		sess->reconnect_attempts = 0;
+		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
+	}
+	if (changed || old_state == IBTRS_CLT_RECONNECTING) {
+		/*
+		 * flush_delayed_work() queues pending work for immediate
+		 * execution, so do the flush if we have queued something
+		 * right now or work is pending.
+		 */
+		flush_delayed_work(&sess->reconnect_dwork);
+		err = ibtrs_clt_sess_is_connected(sess) ? 0 : -ENOTCONN;
+	}
+
+	return err;
+}
+
+int ibtrs_clt_disconnect_from_sysfs(struct ibtrs_clt_sess *sess)
+{
+	ibtrs_clt_close_conns(sess, true);
+
+	return 0;
+}
+
+int ibtrs_clt_remove_path_from_sysfs(struct ibtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	enum ibtrs_clt_state old_state;
+	bool changed;
+
+	/*
+	 * That can happen only when userspace tries to remove the path
+	 * very early, before ibtrs_clt_open() has finished.
+	 */
+	if (unlikely(!clt->opened))
+		return -EBUSY;
+
+	/*
+	 * Keep stopping the path until the state is changed to DEAD or
+	 * is observed as DEAD:
+	 * 1. The state was changed to DEAD - we were fast and nobody
+	 *    invoked ibtrs_clt_reconnect(), which could start
+	 *    reconnecting again.
+	 * 2. The state was observed as DEAD - someone else is removing
+	 *    the path in parallel.
+	 */
+	do {
+		ibtrs_clt_close_conns(sess, true);
+	} while (!(changed = ibtrs_clt_change_state_get_old(sess,
+							    IBTRS_CLT_DEAD,
+							    &old_state)) &&
+		   old_state != IBTRS_CLT_DEAD);
+
+	/*
+	 * If state was successfully changed to DEAD, commit suicide.
+	 */
+	if (likely(changed)) {
+		ibtrs_clt_destroy_sess_files(sess, sysfs_self);
+		ibtrs_clt_remove_path_from_arr(sess);
+		free_sess(sess);
+	}
+
+	return 0;
+}
+
+void ibtrs_clt_set_max_reconnect_attempts(struct ibtrs_clt *clt, int value)
+{
+	clt->max_reconnect_attempts = (unsigned int)value;
+}
+
+int ibtrs_clt_get_max_reconnect_attempts(const struct ibtrs_clt *clt)
+{
+	return (int)clt->max_reconnect_attempts;
+}
+
+static int ibtrs_clt_rdma_write_desc(struct ibtrs_clt_con *con,
+				     struct ibtrs_clt_io_req *req, u64 buf,
+				     size_t u_msg_len, u32 imm,
+				     struct ibtrs_msg_rdma_write *msg)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_sg_desc *desc;
+	int ret;
+
+	desc = kmalloc_array(sess->max_pages_per_mr, sizeof(*desc), GFP_ATOMIC);
+	if (unlikely(!desc))
+		return -ENOMEM;
+
+	ret = ibtrs_fast_reg_map_data(con, desc, req);
+	if (unlikely(ret < 0)) {
+		ibtrs_err_rl(sess,
+			     "Write request failed, fast reg. data mapping"
+			     " failed, err: %d\n", ret);
+		kfree(desc);
+		return ret;
+	}
+	ret = ibtrs_post_send_rdma_desc(con, req, desc, ret, buf,
+					u_msg_len + sizeof(*msg), imm);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "Write request failed, posting work"
+			  " request failed, err: %d\n", ret);
+		ibtrs_unmap_fast_reg_data(con, req);
+	}
+	kfree(desc);
+	return ret;
+}
+
+static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_rdma_write *msg;
+
+	int ret, count = 0;
+	u32 imm, buf_id;
+	u64 buf;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		ibtrs_wrn(sess, "Write request failed, size too big %zu > %d\n",
+			  tsize, sess->chunk_size);
+		return -EMSGSIZE;
+	}
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(sess->s.ib_dev->dev, req->sglist,
+				      req->sg_cnt, req->dir);
+		if (unlikely(!count)) {
+			ibtrs_wrn(sess, "Write request failed, map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put ibtrs msg after sg and user message */
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(IBTRS_MSG_WRITE);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	/* ibtrs message on server side will be after user data and message */
+	imm = req->tag->mem_off + req->data_len + req->usr_len;
+	imm = ibtrs_to_io_req_imm(imm);
+	buf_id = req->tag->mem_id;
+	req->sg_size = tsize;
+	buf = sess->srv_rdma_addr[buf_id];
+
+	/*
+	 * Update stats now; after the request is successfully sent it is
+	 * no longer safe to touch it.
+	 */
+	ibtrs_clt_update_all_stats(req, WRITE);
+
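+	/* sg lists longer than fmr_sg_cnt need fast registration descriptors */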
+	if (count > fmr_sg_cnt)
+		ret = ibtrs_clt_rdma_write_desc(req->con, req, buf,
+						req->usr_len, imm, msg);
+	else
+		ret = ibtrs_post_send_rdma_more(req->con, req, buf,
+						req->usr_len + sizeof(*msg),
+						imm);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "Write request failed: %d\n", ret);
+		ibtrs_clt_decrease_inflight(&sess->stats);
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(sess->s.ib_dev->dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+int ibtrs_clt_write(struct ibtrs_clt *clt, ibtrs_conf_fn *conf,
+		    struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		    size_t nr, size_t data_len, struct scatterlist *sg,
+		    unsigned int sg_cnt)
+{
+	struct ibtrs_clt_io_req *req;
+	struct ibtrs_clt_sess *sess;
+
+	int err = -ECONNABORTED;
+	struct path_it it;
+	size_t usr_len;
+
+	usr_len = kvec_length(vec, nr);
+	do_each_path(sess, clt, &it) {
+		if (unlikely(sess->state != IBTRS_CLT_CONNECTED))
+			continue;
+
+		if (unlikely(usr_len > IO_MSG_SIZE)) {
+			ibtrs_wrn_rl(sess, "Write request failed, user message"
+				     " size is %zu B big, max size is %d B\n",
+				     usr_len, IO_MSG_SIZE);
+			err = -EMSGSIZE;
+			break;
+		}
+		req = ibtrs_clt_get_req(sess, conf, tag, priv, vec, usr_len,
+					sg, sg_cnt, data_len, DMA_TO_DEVICE);
+		err = ibtrs_clt_write_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+
+static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_msg_rdma_read *msg;
+	struct ibtrs_ib_dev *ibdev;
+	struct scatterlist *sg;
+
+	int i, ret, count = 0;
+	u32 imm, buf_id;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	ibdev = sess->s.ib_dev;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		ibtrs_wrn(sess, "Read request failed, message size is"
+			  " %zu, bigger than CHUNK_SIZE %d\n", tsize,
+			  sess->chunk_size);
+		return -EMSGSIZE;
+	}
+
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(ibdev->dev, req->sglist, req->sg_cnt,
+				      req->dir);
+		if (unlikely(!count)) {
+			ibtrs_wrn(sess, "Read request failed, "
+				  "dma map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put our message into req->buf after the user message */
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(IBTRS_MSG_READ);
+	msg->sg_cnt = cpu_to_le32(count);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	if (count > fmr_sg_cnt) {
+		ret = ibtrs_fast_reg_map_data(req->con, msg->desc, req);
+		if (ret < 0) {
+			ibtrs_err_rl(sess,
+				     "Read request failed, failed to map "
+				     " fast reg. data, err: %d\n", ret);
+			ib_dma_unmap_sg(ibdev->dev, req->sglist, req->sg_cnt,
+					req->dir);
+			return ret;
+		}
+		msg->sg_cnt = cpu_to_le32(ret);
+	} else {
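+		/* Few enough segments: reference them directly with ibdev->rkey */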
+		for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+			msg->desc[i].addr =
+				cpu_to_le64(ib_sg_dma_address(ibdev->dev, sg));
+			msg->desc[i].key =
+				cpu_to_le32(ibdev->rkey);
+			msg->desc[i].len =
+				cpu_to_le32(ib_sg_dma_len(ibdev->dev, sg));
+		}
+		req->nmdesc = 0;
+	}
+	/*
+	 * ibtrs message will be after the space reserved for disk data and
+	 * user message
+	 */
+	imm = req->tag->mem_off + req->data_len + req->usr_len;
+	imm = ibtrs_to_io_req_imm(imm);
+	buf_id = req->tag->mem_id;
+
+	req->sg_size  = sizeof(*msg);
+	req->sg_size += le32_to_cpu(msg->sg_cnt) * sizeof(struct ibtrs_sg_desc);
+	req->sg_size += req->usr_len;
+
+	/*
+	 * Update stats now; after the request is successfully sent it is
+	 * no longer safe to touch it.
+	 */
+	ibtrs_clt_update_all_stats(req, READ);
+
+	ret = ibtrs_post_send_rdma(req->con, req, sess->srv_rdma_addr[buf_id],
+				   req->data_len, imm);
+	if (unlikely(ret)) {
+		ibtrs_err(sess, "Read request failed: %d\n", ret);
+		ibtrs_clt_decrease_inflight(&sess->stats);
+		if (unlikely(count > fmr_sg_cnt))
+			ibtrs_unmap_fast_reg_data(req->con, req);
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(ibdev->dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+int ibtrs_clt_read(struct ibtrs_clt *clt, ibtrs_conf_fn *conf,
+		   struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		   size_t nr, size_t data_len, struct scatterlist *sg,
+		   unsigned int sg_cnt)
+{
+	struct ibtrs_clt_io_req *req;
+	struct ibtrs_clt_sess *sess;
+
+	int err = -ECONNABORTED;
+	struct path_it it;
+	size_t usr_len;
+
+	usr_len = kvec_length(vec, nr);
+	do_each_path(sess, clt, &it) {
+		if (unlikely(sess->state != IBTRS_CLT_CONNECTED))
+			continue;
+
+		if (unlikely(usr_len > IO_MSG_SIZE ||
+			     sizeof(struct ibtrs_msg_rdma_read) +
+			     sg_cnt * sizeof(struct ibtrs_sg_desc) >
+			     sess->max_req_size)) {
+			ibtrs_wrn_rl(sess, "Read request failed, user message"
+				     " size is %zu B big, max size is %d B\n",
+				     usr_len, IO_MSG_SIZE);
+			err = -EMSGSIZE;
+			break;
+		}
+		req = ibtrs_clt_get_req(sess, conf, tag, priv, vec, usr_len,
+					sg, sg_cnt, data_len, DMA_FROM_DEVICE);
+		err = ibtrs_clt_read_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+
+int ibtrs_clt_request(int dir, ibtrs_conf_fn *conf, struct ibtrs_clt *clt,
+		      struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
+		      size_t nr, size_t len, struct scatterlist *sg,
+		      unsigned int sg_len)
+{
+	if (dir == READ)
+		return ibtrs_clt_read(clt, conf, tag, priv, vec, nr, len, sg,
+				      sg_len);
+	else
+		return ibtrs_clt_write(clt, conf, tag, priv, vec, nr, len, sg,
+				       sg_len);
+}
+EXPORT_SYMBOL(ibtrs_clt_request);
+
+int ibtrs_clt_query(struct ibtrs_clt *clt, struct ibtrs_attrs *attr)
+{
+	if (unlikely(!ibtrs_clt_is_connected(clt)))
+		return -ECOMM;
+
+	attr->queue_depth      = clt->queue_depth;
+	attr->max_io_size      = clt->max_io_size;
+	strlcpy(attr->sessname, clt->sessname, sizeof(attr->sessname));
+
+	return 0;
+}
+EXPORT_SYMBOL(ibtrs_clt_query);
+
+int ibtrs_clt_create_path_from_sysfs(struct ibtrs_clt *clt,
+				     struct ibtrs_addr *addr)
+{
+	struct ibtrs_clt_sess *sess;
+	int err;
+
+	if (ibtrs_clt_path_exists(clt, addr))
+		return -EEXIST;
+
+	sess = alloc_sess(clt, addr, nr_cons_per_session, clt->max_segments);
+	if (unlikely(IS_ERR(sess)))
+		return PTR_ERR(sess);
+
+	/*
+	 * It is totally safe to add the path in the CONNECTING state: incoming
+	 * IO will never grab it.  Also it is very important to add the
+	 * path before init, since init fires the LINK_CONNECTED event.
+	 */
+	err = ibtrs_clt_add_path_to_arr(sess, addr);
+	if (unlikely(err))
+		goto free_sess;
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	err = ibtrs_clt_create_sess_files(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	return 0;
+
+close_sess:
+	ibtrs_clt_remove_path_from_arr(sess);
+	ibtrs_clt_close_conns(sess, true);
+free_sess:
+	free_sess(sess);
+
+	return err;
+}
+
+static int check_module_params(void)
+{
+	if (fmr_sg_cnt > MAX_SEGMENTS || fmr_sg_cnt < 0) {
+		pr_err("invalid fmr_sg_cnt values\n");
+		return -EINVAL;
+	}
+	if (nr_cons_per_session == 0)
+		nr_cons_per_session = min_t(unsigned int, nr_cpu_ids, U16_MAX);
+
+	return 0;
+}
+
+static int __init ibtrs_client_init(void)
+{
+	int err;
+
+	pr_info("Loading module %s, version: %s "
+		"(use_fr: %d, retry_count: %d, "
+		"fmr_sg_cnt: %d)\n",
+		KBUILD_MODNAME, IBTRS_VER_STRING,
+		use_fr,	retry_count, fmr_sg_cnt);
+	err = check_module_params();
+	if (err) {
+		pr_err("Failed to load module, invalid module parameters,"
+		       " err: %d\n", err);
+		return err;
+	}
+	ibtrs_wq = alloc_workqueue("ibtrs_client_wq", WQ_MEM_RECLAIM, 0);
+	if (!ibtrs_wq) {
+		pr_err("Failed to load module, alloc ibtrs_client_wq failed\n");
+		return -ENOMEM;
+	}
+	err = ibtrs_clt_create_sysfs_module_files();
+	if (err) {
+		pr_err("Failed to load module, can't create sysfs files,"
+		       " err: %d\n", err);
+		goto out_ibtrs_wq;
+	}
+
+	return 0;
+
+out_ibtrs_wq:
+	destroy_workqueue(ibtrs_wq);
+
+	return err;
+}
+
+static void __exit ibtrs_client_exit(void)
+{
+	ibtrs_clt_destroy_sysfs_module_files();
+	destroy_workqueue(ibtrs_wq);
+}
+
+module_init(ibtrs_client_init);
+module_exit(ibtrs_client_exit);
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 06/24] ibtrs: client: statistics functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (4 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 05/24] ibtrs: client: main functionality Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 07/24] ibtrs: client: sysfs interface functions Roman Pen
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This introduces a set of functions used on the client side to account
statistics of RDMA data sent/received, the number of IOs inflight,
latency, cpu migrations, etc.  Almost all statistics are collected
using percpu variables.
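
The counters follow the usual percpu pattern: a writer bumps the slot
of its local CPU without taking any locks, and a reader sums over all
possible CPUs.  A condensed sketch of that pattern (an illustration
only, not an exact excerpt of the file below):

    static void stats_inc_wc(struct ibtrs_clt_stats *stats)
    {
            struct ibtrs_clt_stats_pcpu *s;

            s = this_cpu_ptr(stats->pcpu_stats);    /* slot of the local CPU */
            s->wc_comp.cnt++;                       /* lock-free update */
    }

    static u64 stats_sum_wc(struct ibtrs_clt_stats *stats)
    {
            u64 sum = 0;
            int cpu;

            for_each_possible_cpu(cpu)              /* reader aggregates */
                    sum += per_cpu_ptr(stats->pcpu_stats,
                                       cpu)->wc_comp.total_cnt;
            return sum;
    }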

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c | 455 +++++++++++++++++++++++++
 1 file changed, 455 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
new file mode 100644
index 000000000000..af2ed05d2900
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
@@ -0,0 +1,455 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-clt.h"
+
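+/* Map a latency value in milliseconds to a log2 histogram bucket index */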
+static inline int ibtrs_clt_ms_to_id(unsigned long ms)
+{
+	int id = ms ? ilog2(ms) - MIN_LOG_LAT + 1 : 0;
+
+	return clamp(id, 0, LOG_LAT_SZ - 1);
+}
+
+void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *stats, bool read,
+			       unsigned long ms)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int id;
+
+	id = ibtrs_clt_ms_to_id(ms);
+	s = this_cpu_ptr(stats->pcpu_stats);
+	if (read) {
+		s->rdma_lat_distr[id].read++;
+		if (s->rdma_lat_max.read < ms)
+			s->rdma_lat_max.read = ms;
+	} else {
+		s->rdma_lat_distr[id].write++;
+		if (s->rdma_lat_max.write < ms)
+			s->rdma_lat_max.write = ms;
+	}
+}
+
+void ibtrs_clt_decrease_inflight(struct ibtrs_clt_stats *stats)
+{
+	atomic_dec(&stats->inflight);
+}
+
+void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con)
+{
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt_stats *stats = &sess->stats;
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	cpu = raw_smp_processor_id();
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->wc_comp.cnt++;
+	s->wc_comp.total_cnt++;
+	if (unlikely(con->cpu != cpu)) {
+		s->cpu_migr.to++;
+
+		/* Careful here, override s pointer */
+		s = per_cpu_ptr(stats->pcpu_stats, con->cpu);
+		atomic_inc(&s->cpu_migr.from);
+	}
+}
+
+void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *stats)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.failover_cnt++;
+}
+
+static inline u32 ibtrs_clt_stats_get_avg_wc_cnt(struct ibtrs_clt_stats *stats)
+{
+	u32 cnt = 0;
+	u64 sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct ibtrs_clt_stats_pcpu *s;
+
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		sum += s->wc_comp.total_cnt;
+		cnt += s->wc_comp.cnt;
+	}
+
+	return cnt ? sum / cnt : 0;
+}
+
+int ibtrs_clt_stats_wc_completion_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	return scnprintf(buf, len, "%u\n",
+			 ibtrs_clt_stats_get_avg_wc_cnt(stats));
+}
+
+ssize_t ibtrs_clt_stats_rdma_lat_distr_to_str(struct ibtrs_clt_stats *stats,
+					      char *page, size_t len)
+{
+	struct ibtrs_clt_stats_rdma_lat res[LOG_LAT_SZ];
+	struct ibtrs_clt_stats_rdma_lat max;
+	struct ibtrs_clt_stats_pcpu *s;
+
+	ssize_t cnt = 0;
+	int i, cpu;
+
+	max.write = 0;
+	max.read = 0;
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+		if (max.write < s->rdma_lat_max.write)
+			max.write = s->rdma_lat_max.write;
+		if (max.read < s->rdma_lat_max.read)
+			max.read = s->rdma_lat_max.read;
+	}
+	for (i = 0; i < ARRAY_SIZE(res); i++) {
+		res[i].write = 0;
+		res[i].read = 0;
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+			res[i].write += s->rdma_lat_distr[i].write;
+			res[i].read += s->rdma_lat_distr[i].read;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(res) - 1; i++)
+		cnt += scnprintf(page + cnt, len - cnt,
+				 "< %6d ms: %llu %llu\n",
+				 1 << (i + MIN_LOG_LAT), res[i].read,
+				 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, ">= %5d ms: %llu %llu\n",
+			 1 << (i - 1 + MIN_LOG_LAT), res[i].read,
+			 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, " maximum ms: %llu %llu\n",
+			 max.read, max.write);
+
+	return cnt;
+}
+
+int ibtrs_clt_stats_migration_cnt_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	size_t used;
+	int cpu;
+
+	used = scnprintf(buf, len, "    ");
+	for_each_possible_cpu(cpu)
+		used += scnprintf(buf + used, len - used, " CPU%u", cpu);
+
+	used += scnprintf(buf + used, len - used, "\nfrom:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  atomic_read(&s->cpu_migr.from));
+	}
+
+	used += scnprintf(buf + used, len - used, "\nto  :");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  s->cpu_migr.to);
+	}
+	used += scnprintf(buf + used, len - used, "\n");
+
+	return used;
+}
+
+int ibtrs_clt_stats_reconnects_to_str(struct ibtrs_clt_stats *stats, char *buf,
+				      size_t len)
+{
+	return scnprintf(buf, len, "%d %d\n",
+			 stats->reconnects.successful_cnt,
+			 stats->reconnects.fail_cnt);
+}
+
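+/*
+ * Print a single-line RDMA summary aggregated over all CPUs, columns:
+ * read-count read-total-bytes write-count write-total-bytes inflight
+ * failovers.
+ */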
+ssize_t ibtrs_clt_stats_rdma_to_str(struct ibtrs_clt_stats *stats,
+				    char *page, size_t len)
+{
+	struct ibtrs_clt_stats_rdma sum;
+	struct ibtrs_clt_stats_rdma *r;
+	int cpu;
+
+	memset(&sum, 0, sizeof(sum));
+
+	for_each_possible_cpu(cpu) {
+		r = &per_cpu_ptr(stats->pcpu_stats, cpu)->rdma;
+
+		sum.dir[READ].cnt	  += r->dir[READ].cnt;
+		sum.dir[READ].size_total  += r->dir[READ].size_total;
+		sum.dir[WRITE].cnt	  += r->dir[WRITE].cnt;
+		sum.dir[WRITE].size_total += r->dir[WRITE].size_total;
+		sum.failover_cnt	  += r->failover_cnt;
+	}
+
+	return scnprintf(page, len, "%llu %llu %llu %llu %u %llu\n",
+			 sum.dir[READ].cnt, sum.dir[READ].size_total,
+			 sum.dir[WRITE].cnt, sum.dir[WRITE].size_total,
+			 atomic_read(&stats->inflight), sum.failover_cnt);
+}
+
+int ibtrs_clt_stats_sg_list_distr_to_str(struct ibtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	int i, cpu, cnt;
+
+	cnt = scnprintf(buf, len, "n\\cpu:");
+	for_each_possible_cpu(cpu)
+		cnt += scnprintf(buf + cnt, len - cnt, "%5d", cpu);
+
+	for (i = 0; i < SG_DISTR_SZ; i++) {
+		if (i <= MAX_LIN_SG)
+			cnt += scnprintf(buf + cnt, len - cnt, "\n= %3d:", i);
+		else if (i < SG_DISTR_SZ - 1)
+			cnt += scnprintf(buf + cnt, len - cnt,
+					 "\n< %3d:",
+					 1 << (i + MIN_LOG_SG - MAX_LIN_SG));
+		else
+			cnt += scnprintf(buf + cnt, len - cnt,
+					 "\n>=%3d:",
+					 1 << (i + MIN_LOG_SG - MAX_LIN_SG - 1));
+
+		for_each_possible_cpu(cpu) {
+			unsigned int p, p_i, p_f;
+			u64 total, distr;
+
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			total = s->sg_list_total;
+			distr = s->sg_list_distr[i];
+
+			p = total ? distr * 1000 / total : 0;
+			p_i = p / 10;
+			p_f = p % 10;
+
+			if (distr)
+				cnt += scnprintf(buf + cnt, len - cnt,
+						 " %2u.%01u", p_i, p_f);
+			else
+				cnt += scnprintf(buf + cnt, len - cnt, "    0");
+		}
+	}
+
+	cnt += scnprintf(buf + cnt, len - cnt, "\ntotal:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		cnt += scnprintf(buf + cnt, len - cnt, " %llu",
+				 s->sg_list_total);
+	}
+	cnt += scnprintf(buf + cnt, len - cnt, "\n");
+
+	return cnt;
+}
+
+ssize_t ibtrs_clt_reset_all_help(struct ibtrs_clt_stats *s,
+				 char *page, size_t len)
+{
+	return scnprintf(page, len, "echo 1 to reset all statistics\n");
+}
+
+int ibtrs_clt_reset_rdma_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->rdma, 0, sizeof(s->rdma));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_rdma_lat_distr_stats(struct ibtrs_clt_stats *stats,
+					 bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (enable) {
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			memset(&s->rdma_lat_max, 0, sizeof(s->rdma_lat_max));
+			memset(&s->rdma_lat_distr, 0,
+			       sizeof(s->rdma_lat_distr));
+		}
+	}
+	stats->enable_rdma_lat = enable;
+
+	return 0;
+}
+
+int ibtrs_clt_reset_sg_list_distr_stats(struct ibtrs_clt_stats *stats,
+					bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->sg_list_total, 0, sizeof(s->sg_list_total));
+		memset(&s->sg_list_distr, 0, sizeof(s->sg_list_distr));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_cpu_migr_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->cpu_migr, 0, sizeof(s->cpu_migr));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_reconnects_stat(struct ibtrs_clt_stats *stats, bool enable)
+{
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	memset(&stats->reconnects, 0, sizeof(stats->reconnects));
+
+	return 0;
+}
+
+int ibtrs_clt_reset_wc_comp_stats(struct ibtrs_clt_stats *stats, bool enable)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->wc_comp, 0, sizeof(s->wc_comp));
+	}
+
+	return 0;
+}
+
+int ibtrs_clt_reset_all_stats(struct ibtrs_clt_stats *s, bool enable)
+{
+	if (enable) {
+		ibtrs_clt_reset_rdma_stats(s, enable);
+		ibtrs_clt_reset_rdma_lat_distr_stats(s, enable);
+		ibtrs_clt_reset_sg_list_distr_stats(s, enable);
+		ibtrs_clt_reset_cpu_migr_stats(s, enable);
+		ibtrs_clt_reset_reconnects_stat(s, enable);
+		ibtrs_clt_reset_wc_comp_stats(s, enable);
+		atomic_set(&s->inflight, 0);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
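+/*
+ * Record the scatter-gather list length in a histogram: lengths up to
+ * MAX_LIN_SG get a linear bucket each, longer lists fall into power-of-two
+ * buckets, and the last bucket collects everything beyond that.
+ */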
+static inline void ibtrs_clt_record_sg_distr(u64 stat[SG_DISTR_SZ], u64 *total,
+					     unsigned int cnt)
+{
+	int i;
+
+	i = cnt > MAX_LIN_SG ? ilog2(cnt) + MAX_LIN_SG - MIN_LOG_SG + 1 : cnt;
+	i = i < SG_DISTR_SZ ? i : SG_DISTR_SZ - 1;
+
+	stat[i]++;
+	(*total)++;
+}
+
+static inline void ibtrs_clt_update_rdma_stats(struct ibtrs_clt_stats *stats,
+					       size_t size, int d)
+{
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.dir[d].cnt++;
+	s->rdma.dir[d].size_total += size;
+}
+
+void ibtrs_clt_update_all_stats(struct ibtrs_clt_io_req *req, int dir)
+{
+	struct ibtrs_clt_con *con = req->con;
+	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ibtrs_clt_stats *stats = &sess->stats;
+	unsigned int len;
+
+	struct ibtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	ibtrs_clt_record_sg_distr(s->sg_list_distr, &s->sg_list_total,
+				  req->sg_cnt);
+	len = req->usr_len + req->data_len;
+	ibtrs_clt_update_rdma_stats(stats, len, dir);
+	atomic_inc(&stats->inflight);
+}
+
+int ibtrs_clt_init_stats(struct ibtrs_clt_stats *stats)
+{
+	stats->enable_rdma_lat = false;
+	stats->pcpu_stats = alloc_percpu(typeof(*stats->pcpu_stats));
+	if (unlikely(!stats->pcpu_stats))
+		return -ENOMEM;
+
+	/*
+	 * successful_cnt will be set to 0 after session
+	 * is established for the first time
+	 */
+	stats->reconnects.successful_cnt = -1;
+
+	return 0;
+}
+
+void ibtrs_clt_free_stats(struct ibtrs_clt_stats *stats)
+{
+	free_percpu(stats->pcpu_stats);
+}
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 07/24] ibtrs: client: sysfs interface functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (5 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 06/24] ibtrs: client: statistics functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-05 11:20   ` Sagi Grimberg
  2018-02-02 14:08 ` [PATCH 08/24] ibtrs: server: private header with server structs and functions Roman Pen
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the sysfs interface to IBTRS sessions on the client side; example
usage is shown after the tree below:

  /sys/kernel/ibtrs_client/<SESS-NAME>/
    *** IBTRS session created by ibtrs_clt_open() API call
    |
    |- max_reconnect_attempts
    |  *** number of reconnect attempts for session
    |
    |- add_path
    |  *** adds another connection path into IBTRS session
    |
    |- paths/<DEST-IP>/
       *** established paths to server in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- reconnect
       |  *** reconnect path
       |
       |- remove_path
       |  *** remove current path
       |
       |- state
       |  *** retrieve current path state
       |
       |- stats/
          *** current path statistics
          |
	  |- cpu_migration
	  |- rdma
	  |- rdma_lat
	  |- reconnects
	  |- reset_all
	  |- sg_entries
	  |- wc_completions
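
Example usage (illustrative only; <SESS-NAME> and <DEST-IP> depend on the
actual session, and the address below is made up):

  # change the number of reconnect attempts for the session
  echo 3 > /sys/kernel/ibtrs_client/<SESS-NAME>/max_reconnect_attempts

  # add another path to the session; addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]
  echo ip:10.0.0.2 > /sys/kernel/ibtrs_client/<SESS-NAME>/add_path

  # trigger a reconnect of an established path
  echo 1 > /sys/kernel/ibtrs_client/<SESS-NAME>/paths/<DEST-IP>/reconnect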

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c | 519 +++++++++++++++++++++++++
 1 file changed, 519 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
new file mode 100644
index 000000000000..04949d6d796b
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
@@ -0,0 +1,519 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+static struct kobject *ibtrs_kobj;
+
+#define MIN_MAX_RECONN_ATT -1
+#define MAX_MAX_RECONN_ATT 9999
+
+static struct kobj_type ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static ssize_t ibtrs_clt_max_reconn_attempts_show(struct kobject *kobj,
+						  struct kobj_attribute *attr,
+						  char *page)
+{
+	struct ibtrs_clt *clt;
+
+	clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+	return sprintf(page, "%d\n", ibtrs_clt_get_max_reconnect_attempts(clt));
+}
+
+static ssize_t ibtrs_clt_max_reconn_attempts_store(struct kobject *kobj,
+						   struct kobj_attribute *attr,
+						   const char *buf,
+						   size_t count)
+{
+	struct ibtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (unlikely(ret)) {
+		ibtrs_err(clt, "%s: failed to convert string '%s' to int\n",
+			  attr->attr.name, buf);
+		return ret;
+	}
+	if (unlikely(value > MAX_MAX_RECONN_ATT ||
+		     value < MIN_MAX_RECONN_ATT)) {
+		ibtrs_err(clt,
+			  "%s: invalid range (provided: '%s', accepted: min: %d, max: %d)\n",
+			  attr->attr.name, buf, MIN_MAX_RECONN_ATT,
+			  MAX_MAX_RECONN_ATT);
+		return -EINVAL;
+	}
+	ibtrs_clt_set_max_reconnect_attempts(clt, value);
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_max_reconnect_attempts_attr =
+	__ATTR(max_reconnect_attempts, 0644,
+	       ibtrs_clt_max_reconn_attempts_show,
+	       ibtrs_clt_max_reconn_attempts_store);
+
+static ssize_t ibtrs_clt_mp_policy_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *page)
+{
+	struct ibtrs_clt *clt;
+
+	clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+	switch (clt->mp_policy) {
+	case MP_POLICY_RR:
+		return sprintf(page, "round-robin (RR: %d)\n", clt->mp_policy);
+	case MP_POLICY_MIN_INFLIGHT:
+		return sprintf(page, "min-inflight (MI: %d)\n", clt->mp_policy);
+	default:
+		return sprintf(page, "Unknown (%d)\n", clt->mp_policy);
+	}
+}
+
+static ssize_t ibtrs_clt_mp_policy_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf,
+					 size_t count)
+{
+	struct ibtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (!ret && (value == MP_POLICY_RR || value == MP_POLICY_MIN_INFLIGHT)) {
+		clt->mp_policy = value;
+		return count;
+	}
+
+	if (!strncasecmp(buf, "round-robin", 11) ||
+	    !strncasecmp(buf, "rr", 2))
+		clt->mp_policy = MP_POLICY_RR;
+	else if (!strncasecmp(buf, "min-inflight", 12) ||
+		 !strncasecmp(buf, "mi", 2))
+		clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_mp_policy_attr =
+	__ATTR(mp_policy, 0644,
+	       ibtrs_clt_mp_policy_show,
+	       ibtrs_clt_mp_policy_store);
+
+static ssize_t ibtrs_clt_state_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *page)
+{
+	struct ibtrs_clt_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (ibtrs_clt_sess_is_connected(sess))
+		return sprintf(page, "connected\n");
+
+	return sprintf(page, "disconnected\n");
+}
+
+static struct kobj_attribute ibtrs_clt_state_attr =
+	__ATTR(state, 0444, ibtrs_clt_state_show, NULL);
+
+static ssize_t ibtrs_clt_reconnect_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_reconnect_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_reconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_reconnect_attr =
+	__ATTR(reconnect, 0644, ibtrs_clt_reconnect_show,
+	       ibtrs_clt_reconnect_store);
+
+static ssize_t ibtrs_clt_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_disconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_disconnect_attr =
+	__ATTR(disconnect, 0644, ibtrs_clt_disconnect_show,
+	       ibtrs_clt_disconnect_store);
+
+static ssize_t ibtrs_clt_remove_path_show(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_remove_path_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct ibtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct ibtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = ibtrs_clt_remove_path_from_sysfs(sess, &attr->attr);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_remove_path_attr =
+	__ATTR(remove_path, 0644, ibtrs_clt_remove_path_show,
+	       ibtrs_clt_remove_path_store);
+
+static ssize_t ibtrs_clt_add_path_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo"
+			 " [<source addr>,]<destination addr> > %s\n\n"
+			"*addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_clt_add_path_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct sockaddr_storage srcaddr, dstaddr;
+	struct ibtrs_addr addr = {
+		.src = (struct sockaddr *)&srcaddr,
+		.dst = (struct sockaddr *)&dstaddr
+	};
+	struct ibtrs_clt *clt;
+	const char *nl;
+	size_t len;
+	int err;
+
+	clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+	nl = strchr(buf, '\n');
+	if (nl)
+		len = nl - buf;
+	else
+		len = count;
+	err = ibtrs_addr_to_sockaddr(buf, len, clt->port, &addr);
+	if (unlikely(err))
+		return -EINVAL;
+
+	err = ibtrs_clt_create_path_from_sysfs(clt, &addr);
+	if (unlikely(err))
+		return err;
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_clt_add_path_attr =
+	__ATTR(add_path, 0644, ibtrs_clt_add_path_show,
+	       ibtrs_clt_add_path_store);
+
+STAT_ATTR(struct ibtrs_clt_sess, cpu_migration,
+	  ibtrs_clt_stats_migration_cnt_to_str,
+	  ibtrs_clt_reset_cpu_migr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, sg_entries,
+	  ibtrs_clt_stats_sg_list_distr_to_str,
+	  ibtrs_clt_reset_sg_list_distr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, reconnects,
+	  ibtrs_clt_stats_reconnects_to_str,
+	  ibtrs_clt_reset_reconnects_stat);
+
+STAT_ATTR(struct ibtrs_clt_sess, rdma_lat,
+	  ibtrs_clt_stats_rdma_lat_distr_to_str,
+	  ibtrs_clt_reset_rdma_lat_distr_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, wc_completion,
+	  ibtrs_clt_stats_wc_completion_to_str,
+	  ibtrs_clt_reset_wc_comp_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, rdma,
+	  ibtrs_clt_stats_rdma_to_str,
+	  ibtrs_clt_reset_rdma_stats);
+
+STAT_ATTR(struct ibtrs_clt_sess, reset_all,
+	  ibtrs_clt_reset_all_help,
+	  ibtrs_clt_reset_all_stats);
+
+static struct attribute *ibtrs_clt_stats_attrs[] = {
+	&sg_entries_attr.attr,
+	&cpu_migration_attr.attr,
+	&reconnects_attr.attr,
+	&rdma_lat_attr.attr,
+	&wc_completion_attr.attr,
+	&rdma_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_stats_attr_group = {
+	.attrs = ibtrs_clt_stats_attrs,
+};
+
+static int ibtrs_clt_create_stats_files(struct kobject *kobj,
+					struct kobject *kobj_stats)
+{
+	int ret;
+
+	ret = kobject_init_and_add(kobj_stats, &ktype, kobj, "stats");
+	if (ret) {
+		pr_err("Failed to init and add stats kobject, err: %d\n",
+		       ret);
+		return ret;
+	}
+
+	ret = sysfs_create_group(kobj_stats, &ibtrs_clt_stats_attr_group);
+	if (ret) {
+		pr_err("failed to create stats sysfs group, err: %d\n",
+		       ret);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_del(kobj_stats);
+	kobject_put(kobj_stats);
+
+	return ret;
+}
+
+static struct attribute *ibtrs_clt_sess_attrs[] = {
+	&ibtrs_clt_state_attr.attr,
+	&ibtrs_clt_reconnect_attr.attr,
+	&ibtrs_clt_disconnect_attr.attr,
+	&ibtrs_clt_remove_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_sess_attr_group = {
+	.attrs = ibtrs_clt_sess_attrs,
+};
+
+int ibtrs_clt_create_sess_files(struct ibtrs_clt_sess *sess)
+{
+	struct ibtrs_clt *clt = sess->clt;
+	char str[MAXHOSTNAMELEN];
+	int err;
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &clt->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add: %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj, &ibtrs_clt_sess_attr_group);
+	if (unlikely(err)) {
+		pr_err("sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = ibtrs_clt_create_stats_files(&sess->kobj, &sess->kobj_stats);
+	if (unlikely(err))
+		goto put_kobj;
+
+	return 0;
+
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+
+	return err;
+}
+
+static void ibtrs_sysfs_remove_file_self(struct kobject *kobj,
+					 const struct attribute *attr)
+{
+	struct device_attribute dattr = {
+		.attr.name = attr->name
+	};
+	struct device *device;
+
+	/*
+	 * Unfortunately original sysfs_remove_file_self() is not exported,
+	 * so consider this as a hack to call self removal of a sysfs entry
+	 * just using another "door".
+	 */
+
+	device = container_of(kobj, typeof(*device), kobj);
+	device_remove_file_self(device, &dattr);
+}
+
+void ibtrs_clt_destroy_sess_files(struct ibtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		if (sysfs_self)
+			/* Remove the self-referencing sysfs file first to avoid a deadlock */
+			ibtrs_sysfs_remove_file_self(&sess->kobj, sysfs_self);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+	}
+}
+
+static struct attribute *ibtrs_clt_attrs[] = {
+	&ibtrs_clt_max_reconnect_attempts_attr.attr,
+	&ibtrs_clt_mp_policy_attr.attr,
+	&ibtrs_clt_add_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_clt_attr_group = {
+	.attrs = ibtrs_clt_attrs,
+};
+
+int ibtrs_clt_create_sysfs_root_folders(struct ibtrs_clt *clt)
+{
+	int err;
+
+	err = kobject_init_and_add(&clt->kobj, &ktype, ibtrs_kobj,
+				   "%s", clt->sessname);
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add(): %d\n", err);
+		return err;
+	}
+	err = kobject_init_and_add(&clt->kobj_paths, &ktype,
+				   &clt->kobj, "paths");
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add(): %d\n", err);
+		goto put_kobj;
+	}
+
+	return 0;
+
+put_kobj:
+	kobject_del(&clt->kobj);
+	kobject_put(&clt->kobj);
+
+	return err;
+}
+
+int ibtrs_clt_create_sysfs_root_files(struct ibtrs_clt *clt)
+{
+	return sysfs_create_group(&clt->kobj, &ibtrs_clt_attr_group);
+}
+
+void ibtrs_clt_destroy_sysfs_root_folders(struct ibtrs_clt *clt)
+{
+	if (clt->kobj_paths.state_in_sysfs) {
+		kobject_del(&clt->kobj_paths);
+		kobject_put(&clt->kobj_paths);
+	}
+	if (clt->kobj.state_in_sysfs) {
+		kobject_del(&clt->kobj);
+		kobject_put(&clt->kobj);
+	}
+}
+
+void ibtrs_clt_destroy_sysfs_root_files(struct ibtrs_clt *clt)
+{
+	sysfs_remove_group(&clt->kobj, &ibtrs_clt_attr_group);
+}
+
+int ibtrs_clt_create_sysfs_module_files(void)
+{
+	ibtrs_kobj = kobject_create_and_add(KBUILD_MODNAME, kernel_kobj);
+	if (unlikely(!ibtrs_kobj))
+		return -ENOMEM;
+
+	return 0;
+}
+
+void ibtrs_clt_destroy_sysfs_module_files(void)
+{
+	kobject_del(ibtrs_kobj);
+	kobject_put(ibtrs_kobj);
+}
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 08/24] ibtrs: server: private header with server structs and functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (6 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 07/24] ibtrs: client: sysfs interface functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 09/24] ibtrs: server: main functionality Roman Pen
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This header describes the main structs and functions used by the
ibtrs-server module, mainly for accepting IBTRS sessions, creating and
destroying sysfs entries and accounting statistics on the server side.
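
To illustrate how the statistics structs declared in this header are meant
to be used, below is a minimal sketch (not the actual implementation, which
comes with the server statistics patch) of an update helper accounting one
completed transfer into struct ibtrs_srv_stats_rdma_stats.  It simply
mirrors the way ibtrs_srv_update_rdma_stats() is called from the READ/WRITE
paths of ibtrs-srv.c:

  /* Sketch only: account a completed transfer of 'size' bytes in
   * direction 'd' (READ or WRITE).
   */
  static inline void srv_stats_account_rdma(struct ibtrs_srv_stats *s,
                                            size_t size, int d)
  {
          atomic64_inc(&s->rdma_stats.dir[d].cnt);
          atomic64_add(size, &s->rdma_stats.dir[d].size_total);
  }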

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h | 169 +++++++++++++++++++++++++++++++
 1 file changed, 169 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
new file mode 100644
index 000000000000..f54e159eaf2a
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
@@ -0,0 +1,169 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_SRV_H
+#define IBTRS_SRV_H
+
+#include <linux/refcount.h>
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_srv_state - Server states.
+ */
+enum ibtrs_srv_state {
+	IBTRS_SRV_CONNECTING,
+	IBTRS_SRV_CONNECTED,
+	IBTRS_SRV_CLOSING,
+	IBTRS_SRV_CLOSED,
+};
+
+static inline const char *ibtrs_srv_state_str(enum ibtrs_srv_state state)
+{
+	switch (state) {
+	case IBTRS_SRV_CONNECTING:
+		return "IBTRS_SRV_CONNECTING";
+	case IBTRS_SRV_CONNECTED:
+		return "IBTRS_SRV_CONNECTED";
+	case IBTRS_SRV_CLOSING:
+		return "IBTRS_SRV_CLOSING";
+	case IBTRS_SRV_CLOSED:
+		return "IBTRS_SRV_CLOSED";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+struct ibtrs_stats_wc_comp {
+	atomic64_t	calls;
+	atomic64_t	total_wc_cnt;
+};
+
+struct ibtrs_srv_stats_rdma_stats {
+	struct {
+		atomic64_t	cnt;
+		atomic64_t	size_total;
+	} dir[2];
+};
+
+struct ibtrs_srv_stats {
+	struct ibtrs_srv_stats_rdma_stats	rdma_stats;
+	atomic_t				apm_cnt;
+	struct ibtrs_stats_wc_comp		wc_comp;
+};
+
+struct ibtrs_srv_con {
+	struct ibtrs_con	c;
+	atomic_t		wr_cnt;
+};
+
+struct ibtrs_srv_op {
+	struct ibtrs_srv_con		*con;
+	u32				msg_id;
+	u8				dir;
+	u64				data_dma_addr;
+	struct ibtrs_msg_rdma_read	*msg;
+	struct ib_rdma_wr		*tx_wr;
+	struct ib_sge			*tx_sg;
+};
+
+struct ibtrs_srv_sess {
+	struct ibtrs_sess	s;
+	struct ibtrs_srv	*srv;
+	struct work_struct	close_work;
+	enum ibtrs_srv_state	state;
+	spinlock_t		state_lock;
+	int			cur_cq_vector;
+	struct ibtrs_srv_op	**ops_ids;
+	atomic_t		ids_inflight;
+	wait_queue_head_t	ids_waitq;
+	dma_addr_t		*rdma_addr;
+	bool			established;
+	unsigned int		mem_bits;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct ibtrs_srv_stats	stats;
+};
+
+struct ibtrs_srv {
+	struct list_head	paths_list;
+	int			paths_up;
+	struct mutex		paths_ev_mutex;
+	size_t			paths_num;
+	struct mutex		paths_mutex;
+	uuid_t			paths_uuid;
+	refcount_t		refcount;
+	struct ibtrs_srv_ctx	*ctx;
+	struct list_head	ctx_list;
+	void			*priv;
+	size_t			queue_depth;
+	struct page		**chunks;
+	struct kobject		kobj;
+	struct kobject		kobj_paths;
+};
+
+struct ibtrs_srv_ctx {
+	rdma_ev_fn *rdma_ev;
+	link_ev_fn *link_ev;
+	struct rdma_cm_id *cm_id_ip;
+	struct rdma_cm_id *cm_id_ib;
+	struct mutex srv_mutex;
+	struct list_head srv_list;
+};
+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj)						\
+	LIST(CASE(obj, struct ibtrs_srv_sess *, s.sessname))
+
+void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess);
+
+/* ibtrs-srv-stats.c */
+
+void ibtrs_srv_update_rdma_stats(struct ibtrs_srv_stats *s, size_t size, int d);
+void ibtrs_srv_update_wc_stats(struct ibtrs_srv_stats *s);
+
+int ibtrs_srv_reset_rdma_stats(struct ibtrs_srv_stats *stats, bool enable);
+ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
+				    char *page, size_t len);
+int ibtrs_srv_reset_wc_completion_stats(struct ibtrs_srv_stats *stats,
+					bool enable);
+int ibtrs_srv_stats_wc_completion_to_str(struct ibtrs_srv_stats *stats, char *buf,
+					 size_t len);
+int ibtrs_srv_reset_all_stats(struct ibtrs_srv_stats *stats, bool enable);
+ssize_t ibtrs_srv_reset_all_help(struct ibtrs_srv_stats *stats,
+				 char *page, size_t len);
+
+/* ibtrs-srv-sysfs.c */
+
+int ibtrs_srv_create_sysfs_module_files(void);
+void ibtrs_srv_destroy_sysfs_module_files(void);
+
+int ibtrs_srv_create_sess_files(struct ibtrs_srv_sess *sess);
+void ibtrs_srv_destroy_sess_files(struct ibtrs_srv_sess *sess);
+
+#endif /* IBTRS_SRV_H */
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 09/24] ibtrs: server: main functionality
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (7 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 08/24] ibtrs: server: private header with server structs and functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-05 11:29   ` Sagi Grimberg
  2018-02-02 14:08 ` [PATCH 10/24] ibtrs: server: statistics functions Roman Pen
                   ` (17 subsequent siblings)
  26 siblings, 1 reply; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the main functionality of the ibtrs-server module: it accepts
a set of RDMA connections (a so-called IBTRS session), creates and
destroys the sysfs entries associated with an IBTRS session and notifies
the upper layer (the user of the IBTRS API) about RDMA requests and link
events.
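
To sketch how an upper layer (e.g. a block device server) consumes these
events: the rdma_ev callback registered through struct ibtrs_srv_ctx gets
the request buffers and an operation id, handles the I/O and finally
completes the request with ibtrs_srv_resp_rdma().  The snippet below is
illustrative only: handle_block_io() is a made-up helper and the exact
rdma_ev_fn prototype comes from the public IBTRS API header; the arguments
simply mirror the ctx->rdma_ev() call sites in process_read() and
process_write():

  static int example_rdma_ev(struct ibtrs_srv *srv, void *priv,
                             struct ibtrs_srv_op *id, int dir,
                             void *data, size_t data_len,
                             void *usr, size_t usr_len)
  {
          /* handle_block_io() is hypothetical: serve the I/O request */
          int err = handle_block_io(priv, dir, data, data_len, usr, usr_len);

          /* report the result back to the client over RDMA */
          ibtrs_srv_resp_rdma(id, err);

          return 0;
  }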

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1811 ++++++++++++++++++++++++++++++
 1 file changed, 1811 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
new file mode 100644
index 000000000000..0d1fc08bd821
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
@@ -0,0 +1,1811 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/mempool.h>
+
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Server");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_MAX_IO_SIZE_KB 128
+#define DEFAULT_MAX_IO_SIZE (DEFAULT_MAX_IO_SIZE_KB * 1024)
+#define MAX_REQ_SIZE PAGE_SIZE
+#define MAX_SG_COUNT ((MAX_REQ_SIZE - sizeof(struct ibtrs_msg_rdma_read)) \
+		      / sizeof(struct ibtrs_sg_desc))
+
+static int max_io_size = DEFAULT_MAX_IO_SIZE;
+static int rcv_buf_size = DEFAULT_MAX_IO_SIZE + MAX_REQ_SIZE;
+
+static int max_io_size_set(const char *val, const struct kernel_param *kp)
+{
+	int err, ival;
+
+	err = kstrtoint(val, 0, &ival);
+	if (err)
+		return err;
+
+	if (ival < 4096 || ival + MAX_REQ_SIZE > (4096 * 1024) ||
+	    (ival + MAX_REQ_SIZE) % 512 != 0) {
+		pr_err("Invalid max io size value %d, has to be >= %d and <= %d\n",
+		       ival, 4096, 4194304);
+		return -EINVAL;
+	}
+
+	max_io_size = ival;
+	rcv_buf_size = max_io_size + MAX_REQ_SIZE;
+	pr_info("max io size changed to %d\n", ival);
+
+	return 0;
+}
+
+static const struct kernel_param_ops max_io_size_ops = {
+	.set		= max_io_size_set,
+	.get		= param_get_int,
+};
+module_param_cb(max_io_size, &max_io_size_ops, &max_io_size, 0444);
+MODULE_PARM_DESC(max_io_size,
+		 "Max size for each IO request in bytes"
+		 " (default: " __stringify(DEFAULT_MAX_IO_SIZE_KB) "KB)");
+
+#define DEFAULT_SESS_QUEUE_DEPTH 512
+static int sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
+module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
+MODULE_PARM_DESC(sess_queue_depth,
+		 "Number of buffers for pending I/O requests to allocate"
+		 " per session. Maximum: " __stringify(MAX_SESS_QUEUE_DEPTH)
+		 " (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
+
+/* We guarantee to serve at least 10 paths */
+#define CHUNK_POOL_SIZE (DEFAULT_SESS_QUEUE_DEPTH * 10)
+static mempool_t *chunk_pool;
+
+static int retry_count = 7;
+
+static int retry_count_set(const char *val, const struct kernel_param *kp)
+{
+	int err, ival;
+
+	err = kstrtoint(val, 0, &ival);
+	if (err)
+		return err;
+
+	if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT) {
+		pr_err("Invalid retry count value %d, has to be >= %d and <= %d\n",
+		       ival, MIN_RTR_CNT, MAX_RTR_CNT);
+		return -EINVAL;
+	}
+
+	retry_count = ival;
+	pr_info("QP retry count changed to %d\n", ival);
+
+	return 0;
+}
+
+static const struct kernel_param_ops retry_count_ops = {
+	.set		= retry_count_set,
+	.get		= param_get_int,
+};
+module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
+
+MODULE_PARM_DESC(retry_count, "Number of times to send the message if the"
+		 " remote side didn't respond with Ack or Nack (default: 7,"
+		 " min: " __stringify(MIN_RTR_CNT) ", max: "
+		 __stringify(MAX_RTR_CNT) ")");
+
+static char cq_affinity_list[256] = "";
+static cpumask_t cq_affinity_mask = { CPU_BITS_ALL };
+
+static void init_cq_affinity(void)
+{
+	sprintf(cq_affinity_list, "0-%d", nr_cpu_ids - 1);
+}
+
+static int cq_affinity_list_set(const char *val, const struct kernel_param *kp)
+{
+	int ret = 0, len = strlen(val);
+	cpumask_var_t new_value;
+
+	if (!strlen(cq_affinity_list))
+		init_cq_affinity();
+
+	if (len >= sizeof(cq_affinity_list))
+		return -EINVAL;
+	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(val, new_value);
+	if (ret) {
+		pr_err("Can't set cq_affinity_list \"%s\": %d\n", val,
+		       ret);
+		goto free_cpumask;
+	}
+
+	strlcpy(cq_affinity_list, val, sizeof(cq_affinity_list));
+	*strchrnul(cq_affinity_list, '\n') = '\0';
+	cpumask_copy(&cq_affinity_mask, new_value);
+
+	pr_info("cq_affinity_list changed to %*pbl\n",
+		cpumask_pr_args(&cq_affinity_mask));
+free_cpumask:
+	free_cpumask_var(new_value);
+	return ret;
+}
+
+static struct kparam_string cq_affinity_list_kparam_str = {
+	.maxlen	= sizeof(cq_affinity_list),
+	.string	= cq_affinity_list
+};
+
+static const struct kernel_param_ops cq_affinity_list_ops = {
+	.set	= cq_affinity_list_set,
+	.get	= param_get_string,
+};
+
+module_param_cb(cq_affinity_list, &cq_affinity_list_ops,
+		&cq_affinity_list_kparam_str, 0644);
+MODULE_PARM_DESC(cq_affinity_list, "Sets the list of cpus to use as cq vectors."
+		 " (default: use all possible CPUs)");
+
+static struct workqueue_struct *ibtrs_wq;
+
+static void close_sess(struct ibtrs_srv_sess *sess);
+
+static inline struct ibtrs_srv_con *to_srv_con(struct ibtrs_con *c)
+{
+	if (unlikely(!c))
+		return NULL;
+
+	return container_of(c, struct ibtrs_srv_con, c);
+}
+
+static inline struct ibtrs_srv_sess *to_srv_sess(struct ibtrs_sess *s)
+{
+	if (unlikely(!s))
+		return NULL;
+
+	return container_of(s, struct ibtrs_srv_sess, s);
+}
+
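+/*
+ * Valid state transitions (anything else leaves the state untouched and
+ * returns false):
+ *   IBTRS_SRV_CONNECTING -> IBTRS_SRV_CONNECTED
+ *   IBTRS_SRV_CONNECTING -> IBTRS_SRV_CLOSING
+ *   IBTRS_SRV_CONNECTED  -> IBTRS_SRV_CLOSING
+ *   IBTRS_SRV_CLOSING    -> IBTRS_SRV_CLOSED
+ */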
+static bool __ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
+				     enum ibtrs_srv_state new_state)
+{
+	enum ibtrs_srv_state old_state;
+	bool changed = false;
+
+	old_state = sess->state;
+	switch (new_state) {
+	case IBTRS_SRV_CONNECTED:
+		switch (old_state) {
+		case IBTRS_SRV_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_SRV_CLOSING:
+		switch (old_state) {
+		case IBTRS_SRV_CONNECTING:
+		case IBTRS_SRV_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case IBTRS_SRV_CLOSED:
+		switch (old_state) {
+		case IBTRS_SRV_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed)
+		sess->state = new_state;
+
+	return changed;
+}
+
+static bool ibtrs_srv_change_state_get_old(struct ibtrs_srv_sess *sess,
+					   enum ibtrs_srv_state new_state,
+					   enum ibtrs_srv_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_lock);
+	*old_state = sess->state;
+	changed = __ibtrs_srv_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_lock);
+
+	return changed;
+}
+
+static bool ibtrs_srv_change_state(struct ibtrs_srv_sess *sess,
+				   enum ibtrs_srv_state new_state)
+{
+	enum ibtrs_srv_state old_state;
+
+	return ibtrs_srv_change_state_get_old(sess, new_state, &old_state);
+}
+
+static void free_id(struct ibtrs_srv_op *id)
+{
+	if (!id)
+		return;
+	kfree(id->tx_wr);
+	kfree(id->tx_sg);
+	kfree(id);
+}
+
+static void ibtrs_srv_free_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int i;
+
+	WARN_ON(atomic_read(&sess->ids_inflight));
+	if (sess->ops_ids) {
+		for (i = 0; i < srv->queue_depth; i++)
+			free_id(sess->ops_ids[i]);
+		kfree(sess->ops_ids);
+		sess->ops_ids = NULL;
+	}
+}
+
+static int ibtrs_srv_alloc_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_op *id;
+	int i;
+
+	sess->ops_ids = kcalloc(srv->queue_depth, sizeof(*sess->ops_ids),
+				GFP_KERNEL);
+	if (unlikely(!sess->ops_ids))
+		goto err;
+
+	for (i = 0; i < srv->queue_depth; ++i) {
+		id = kzalloc(sizeof(*id), GFP_KERNEL);
+		if (unlikely(!id))
+			goto err;
+
+		sess->ops_ids[i] = id;
+		id->tx_wr = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_wr),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_wr))
+			goto err;
+
+		id->tx_sg = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_sg),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_sg))
+			goto err;
+	}
+	init_waitqueue_head(&sess->ids_waitq);
+	atomic_set(&sess->ids_inflight, 0);
+
+	return 0;
+
+err:
+	ibtrs_srv_free_ops_ids(sess);
+	return -ENOMEM;
+}
+
+static void ibtrs_srv_get_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	atomic_inc(&sess->ids_inflight);
+}
+
+static void ibtrs_srv_put_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	if (atomic_dec_and_test(&sess->ids_inflight))
+		wake_up(&sess->ids_waitq);
+}
+
+static void ibtrs_srv_wait_ops_ids(struct ibtrs_srv_sess *sess)
+{
+	wait_event(sess->ids_waitq, !atomic_read(&sess->ids_inflight));
+}
+
+static void ibtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+
+static struct ib_cqe io_comp_cqe = {
+	.done = ibtrs_srv_rdma_done
+};
+
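+/*
+ * rdma_write_sg() - respond to an IBTRS_MSG_READ request by posting a chain
+ * of RDMA WRITEs, one per client-provided sg descriptor, pushing the data
+ * into the client's buffers.  The last WR is RDMA_WRITE_WITH_IMM so that
+ * the client gets the msg_id and the status in the immediate data.
+ */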
+static int rdma_write_sg(struct ibtrs_srv_op *id)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(id->con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ib_rdma_wr *wr = NULL;
+	struct ib_send_wr *bad_wr;
+	enum ib_send_flags flags;
+	size_t sg_cnt;
+	int err, i, offset;
+
+	sg_cnt = le32_to_cpu(id->msg->sg_cnt);
+	if (unlikely(!sg_cnt))
+		return -EINVAL;
+
+	offset = 0;
+	for (i = 0; i < sg_cnt; i++) {
+		struct ib_sge *list;
+
+		wr		= &id->tx_wr[i];
+		list		= &id->tx_sg[i];
+		list->addr	= id->data_dma_addr + offset;
+		list->length	= le32_to_cpu(id->msg->desc[i].len);
+
+		/* WR will fail with length error
+		 * if this is 0
+		 */
+		if (unlikely(list->length == 0)) {
+			ibtrs_err(sess, "Invalid RDMA-Write sg list length 0\n");
+			return -EINVAL;
+		}
+
+		list->lkey = sess->s.ib_dev->lkey;
+		offset += list->length;
+
+		wr->wr.wr_cqe	= &io_comp_cqe;
+		wr->wr.sg_list	= list;
+		wr->wr.num_sge	= 1;
+		wr->remote_addr	= le64_to_cpu(id->msg->desc[i].addr);
+		wr->rkey	= le32_to_cpu(id->msg->desc[i].key);
+
+		if (i < (sg_cnt - 1)) {
+			wr->wr.next	= &id->tx_wr[i + 1].wr;
+			wr->wr.opcode	= IB_WR_RDMA_WRITE;
+			wr->wr.ex.imm_data	= 0;
+			wr->wr.send_flags	= 0;
+		}
+	}
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	wr->wr.wr_cqe = &io_comp_cqe;
+	wr->wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
+	wr->wr.next = NULL;
+	wr->wr.send_flags = flags;
+	wr->wr.ex.imm_data = cpu_to_be32(ibtrs_to_io_rsp_imm(id->msg_id, 0));
+
+	err = ib_post_send(id->con->c.qp, &id->tx_wr[0].wr, &bad_wr);
+	if (unlikely(err))
+		ibtrs_err(sess,
+			  "Posting RDMA-Write-Request to QP failed, err: %d\n",
+			  err);
+
+	return err;
+}
+
+static int send_io_resp_imm(struct ibtrs_srv_con *con, int msg_id, s16 errno)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	enum ib_send_flags flags;
+	u32 imm;
+	int err;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+	imm = ibtrs_to_io_rsp_imm(msg_id, errno);
+	err = ibtrs_post_rdma_write_imm_empty(&con->c, &io_comp_cqe,
+					      imm, flags);
+	if (unlikely(err))
+		ibtrs_err_rl(sess, "ib_post_send(), err: %d\n", err);
+
+	return err;
+}
+
+/*
+ * ibtrs_srv_resp_rdma() - sends response to the client.
+ *
+ * Context: any
+ */
+void ibtrs_srv_resp_rdma(struct ibtrs_srv_op *id, int status)
+{
+	struct ibtrs_srv_con *con;
+	struct ibtrs_srv_sess *sess;
+	int err;
+
+	if (WARN_ON(!id))
+		return;
+
+	con = id->con;
+	sess = to_srv_sess(con->c.sess);
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Sending I/O response failed, "
+			     "session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		goto out;
+	}
+	if (status || id->dir == WRITE || !id->msg->sg_cnt)
+		err = send_io_resp_imm(con, id->msg_id, status);
+	else
+		err = rdma_write_sg(id);
+	if (unlikely(err)) {
+		ibtrs_err_rl(sess, "IO response failed: %d\n", err);
+		close_sess(sess);
+	}
+out:
+	ibtrs_srv_put_ops_ids(sess);
+}
+EXPORT_SYMBOL(ibtrs_srv_resp_rdma);
+
+void ibtrs_srv_set_sess_priv(struct ibtrs_srv *srv, void *priv)
+{
+	srv->priv = priv;
+}
+EXPORT_SYMBOL(ibtrs_srv_set_sess_priv);
+
+static void unmap_cont_bufs(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int i;
+
+	for (i = 0; i < srv->queue_depth; i++)
+		ib_dma_unmap_page(sess->s.ib_dev->dev, sess->rdma_addr[i],
+				  rcv_buf_size, DMA_BIDIRECTIONAL);
+}
+
+static int map_cont_bufs(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	unsigned int chunk_bits;
+	dma_addr_t addr;
+	int i, err;
+
+	for (i = 0; i < srv->queue_depth; i++) {
+		addr = ib_dma_map_page(sess->s.ib_dev->dev, srv->chunks[i],
+				       0, rcv_buf_size, DMA_BIDIRECTIONAL);
+		if (unlikely(ib_dma_mapping_error(sess->s.ib_dev->dev, addr))) {
+			ibtrs_err(sess, "ib_dma_map_page() failed\n");
+			err = -EIO;
+			goto err_map;
+		}
+		sess->rdma_addr[i] = addr;
+	}
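+	/*
+	 * The immediate payload is split in two: the high bits carry the
+	 * chunk index (msg_id), the low sess->mem_bits bits carry the
+	 * offset within the chunk (see ibtrs_srv_rdma_done()).
+	 */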
+	chunk_bits = ilog2(srv->queue_depth - 1) + 1;
+	sess->mem_bits = (MAX_IMM_PAYL_BITS - chunk_bits);
+
+	return 0;
+
+err_map:
+	while (i--)
+		ib_dma_unmap_page(sess->s.ib_dev->dev, sess->rdma_addr[i],
+				  rcv_buf_size, DMA_BIDIRECTIONAL);
+
+	return err;
+}
+
+static void ibtrs_srv_hb_err_handler(struct ibtrs_con *c, int err)
+{
+	(void)err;
+	close_sess(to_srv_sess(c->sess));
+}
+
+static void ibtrs_srv_init_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_init_hb(&sess->s, &io_comp_cqe,
+		      IBTRS_HB_INTERVAL_MS,
+		      IBTRS_HB_MISSED_MAX,
+		      ibtrs_srv_hb_err_handler,
+		      ibtrs_wq);
+}
+
+static void ibtrs_srv_start_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_start_hb(&sess->s);
+}
+
+static void ibtrs_srv_stop_hb(struct ibtrs_srv_sess *sess)
+{
+	ibtrs_stop_hb(&sess->s);
+}
+
+static void ibtrs_srv_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	ibtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.ib_dev->dev);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info response send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+		return;
+	}
+	WARN_ON(wc->opcode != IB_WC_SEND);
+	ibtrs_srv_update_wc_stats(&sess->stats);
+}
+
+static void ibtrs_srv_sess_up(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	int up;
+
+	mutex_lock(&srv->paths_ev_mutex);
+	up = ++srv->paths_up;
+	if (up == 1)
+		ctx->link_ev(srv, IBTRS_SRV_LINK_EV_CONNECTED, NULL);
+	mutex_unlock(&srv->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+}
+
+static void ibtrs_srv_sess_down(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&srv->paths_ev_mutex);
+	WARN_ON(!srv->paths_up);
+	if (--srv->paths_up == 0)
+		ctx->link_ev(srv, IBTRS_SRV_LINK_EV_DISCONNECTED, srv->priv);
+	mutex_unlock(&srv->paths_ev_mutex);
+}
+
+static int post_recv_sess(struct ibtrs_srv_sess *sess);
+
+static int process_info_req(struct ibtrs_srv_con *con,
+			    struct ibtrs_msg_info_req *msg)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_msg_info_rsp *rsp;
+	struct ibtrs_iu *tx_iu;
+	size_t tx_sz;
+	int i, err;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "post_recv_sess(), err: %d\n", err);
+		return err;
+	}
+	memcpy(sess->s.sessname, msg->sessname, sizeof(sess->s.sessname));
+
+	tx_sz  = sizeof(struct ibtrs_msg_info_rsp);
+	tx_sz += sizeof(u64) * srv->queue_depth;
+	tx_iu = ibtrs_iu_alloc(0, tx_sz, GFP_KERNEL, sess->s.ib_dev->dev,
+			       DMA_TO_DEVICE, ibtrs_srv_info_rsp_done);
+	if (unlikely(!tx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(), err: %d\n", -ENOMEM);
+		return -ENOMEM;
+	}
+
+	rsp = tx_iu->buf;
+	rsp->type = cpu_to_le16(IBTRS_MSG_INFO_RSP);
+	rsp->addr_num = cpu_to_le16(srv->queue_depth);
+	for (i = 0; i < srv->queue_depth; i++)
+		rsp->addr[i] = cpu_to_le64(sess->rdma_addr[i]);
+
+	err = ibtrs_srv_create_sess_files(sess);
+	if (unlikely(err))
+		goto iu_free;
+
+	ibtrs_srv_change_state(sess, IBTRS_SRV_CONNECTED);
+	ibtrs_srv_start_hb(sess);
+
+	/*
+	 * We do not account number of established connections at the current
+	 * moment, we rely on the client, which should send info request when
+	 * all connections are successfully established.  Thus, simply notify
+	 * listener with a proper event if we are the first path.
+	 */
+	ibtrs_srv_sess_up(sess);
+
+	/* Send info response */
+	err = ibtrs_iu_post_send(&con->c, tx_iu, tx_sz);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_send(), err: %d\n", err);
+iu_free:
+		ibtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.ib_dev->dev);
+	}
+
+	return err;
+}
+
+static void ibtrs_srv_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_msg_info_req *msg;
+	struct ibtrs_iu *iu;
+	int err;
+
+	WARN_ON(con->c.cid);
+
+	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ibtrs_err(sess, "Sess info request receive failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto close;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		ibtrs_err(sess, "Sess info request is malformed: size %d\n",
+			  wc->byte_len);
+		goto close;
+	}
+	msg = iu->buf;
+	if (unlikely(le32_to_cpu(msg->type) != IBTRS_MSG_INFO_REQ)) {
+		ibtrs_err(sess, "Sess info request is malformed: type %d\n",
+			  le32_to_cpu(msg->type));
+		goto close;
+	}
+	err = process_info_req(con, msg);
+	if (unlikely(err))
+		goto close;
+
+out:
+	ibtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.ib_dev->dev);
+	return;
+close:
+	close_sess(sess);
+	goto out;
+}
+
+static int post_recv_info_req(struct ibtrs_srv_con *con)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_iu *rx_iu;
+	int err;
+
+	rx_iu = ibtrs_iu_alloc(0, sizeof(struct ibtrs_msg_info_req),
+			       GFP_KERNEL, sess->s.ib_dev->dev,
+			       DMA_FROM_DEVICE, ibtrs_srv_info_req_done);
+	if (unlikely(!rx_iu)) {
+		ibtrs_err(sess, "ibtrs_iu_alloc(): no memory\n");
+		return -ENOMEM;
+	}
+	/* Prepare for getting info response */
+	err = ibtrs_iu_post_recv(&con->c, rx_iu);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_iu_post_recv(), err: %d\n", err);
+		ibtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.ib_dev->dev);
+		return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_io(struct ibtrs_srv_con *con)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	int i, err;
+
+	for (i = 0; i < srv->queue_depth; i++) {
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct ibtrs_srv_sess *sess)
+{
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		err = post_recv_io(to_srv_con(sess->s.con[cid]));
+		if (unlikely(err)) {
+			ibtrs_err(sess, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static void process_read(struct ibtrs_srv_con *con,
+			 struct ibtrs_msg_rdma_read *msg,
+			 u32 buf_id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	struct ibtrs_srv_op *id;
+
+	size_t usr_len, data_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Processing read request failed, "
+			     "session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		return;
+	}
+	ibtrs_srv_get_ops_ids(sess);
+	ibtrs_srv_update_rdma_stats(&sess->stats, off, READ);
+	id = sess->ops_ids[buf_id];
+	id->con		= con;
+	id->dir		= READ;
+	id->msg_id	= buf_id;
+	id->msg		= msg;
+	usr_len = le16_to_cpu(msg->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	id->data_dma_addr = sess->rdma_addr[buf_id];
+	ret = ctx->rdma_ev(srv, srv->priv, id, READ, data, data_len,
+			   data + data_len, usr_len);
+
+	if (unlikely(ret)) {
+		ibtrs_err_rl(sess, "Processing read request failed, user "
+			     "module cb reported for msg_id %d, err: %d\n",
+			     buf_id, ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, buf_id, ret);
+	if (ret < 0) {
+		ibtrs_err_rl(sess, "Sending err msg for failed RDMA-Write-Req"
+			     " failed, msg_id %d, err: %d\n", buf_id, ret);
+		close_sess(sess);
+	}
+	ibtrs_srv_put_ops_ids(sess);
+}
+
+static void process_write(struct ibtrs_srv_con *con,
+			  struct ibtrs_msg_rdma_write *req,
+			  u32 buf_id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_ctx *ctx = srv->ctx;
+	struct ibtrs_srv_op *id;
+
+	size_t data_len, usr_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != IBTRS_SRV_CONNECTED)) {
+		ibtrs_err_rl(sess, "Processing write request failed, "
+			     "session is disconnected, sess state %s\n",
+			     ibtrs_srv_state_str(sess->state));
+		return;
+	}
+	ibtrs_srv_get_ops_ids(sess);
+	ibtrs_srv_update_rdma_stats(&sess->stats, off, WRITE);
+	id = sess->ops_ids[buf_id];
+	id->con    = con;
+	id->dir    = WRITE;
+	id->msg_id = buf_id;
+
+	usr_len = le16_to_cpu(req->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	ret = ctx->rdma_ev(srv, srv->priv, id, WRITE, data, data_len,
+			   data + data_len, usr_len);
+	if (unlikely(ret)) {
+		ibtrs_err_rl(sess, "Processing write request failed, user"
+			     " module callback reports err: %d\n", ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, buf_id, ret);
+	if (ret < 0) {
+		ibtrs_err_rl(sess, "Processing write request failed, sending"
+			     " I/O response failed, msg_id %d, err: %d\n",
+			     buf_id, ret);
+		close_sess(sess);
+	}
+	ibtrs_srv_put_ops_ids(sess);
+}
+
+static void process_io_req(struct ibtrs_srv_con *con, void *msg,
+			   u32 id, u32 off)
+{
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	unsigned int type;
+
+	type = le16_to_cpu(*(__le16 *)msg);
+
+	switch (type) {
+	case IBTRS_MSG_WRITE:
+		process_write(con, msg, id, off);
+		break;
+	case IBTRS_MSG_READ:
+		process_read(con, msg, id, off);
+		break;
+	default:
+		ibtrs_err(sess, "Processing I/O request failed, "
+			  "unknown message type received: 0x%02x\n", type);
+		goto err;
+	}
+
+	return;
+
+err:
+	close_sess(sess);
+}
+
+static void ibtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct ibtrs_srv_con *con = cq->cq_context;
+	struct ibtrs_srv_sess *sess = to_srv_sess(con->c.sess);
+	struct ibtrs_srv *srv = sess->srv;
+	u32 imm_type, imm_payload;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			ibtrs_err(sess, "%s (wr_cqe: %p,"
+				  " type: %s, vendor_err: 0x%x, len: %u)\n",
+				  ib_wc_status_msg(wc->status), wc->wr_cqe,
+				  ib_wc_opcode_str(wc->opcode),
+				  wc->vendor_err, wc->byte_len);
+			close_sess(sess);
+		}
+		return;
+	}
+	ibtrs_srv_update_wc_stats(&sess->stats);
+
+	switch (wc->opcode) {
+	case IB_WC_RDMA_WRITE:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
+			return;
+		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			ibtrs_err(sess, "ibtrs_post_recv(), err: %d\n", err);
+			close_sess(sess);
+			break;
+		}
+		ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == IBTRS_IO_REQ_IMM)) {
+			u32 msg_id, off;
+			void *data;
+
+			msg_id = imm_payload >> sess->mem_bits;
+			off = imm_payload & ((1 << sess->mem_bits) - 1);
+			if (unlikely(msg_id >= srv->queue_depth ||
+				     off > rcv_buf_size)) {
+				ibtrs_err(sess, "Wrong msg_id %u, off %u\n",
+					  msg_id, off);
+				close_sess(sess);
+				return;
+			}
+			data = page_address(srv->chunks[msg_id]) + off;
+			process_io_req(con, data, msg_id, off);
+		} else if (imm_type == IBTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			ibtrs_send_hb_ack(&sess->s);
+		} else if (imm_type == IBTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+		} else {
+			ibtrs_wrn(sess, "Unknown IMM type %u\n", imm_type);
+		}
+		break;
+	default:
+		ibtrs_wrn(sess, "Unexpected WC type: %s\n",
+			  ib_wc_opcode_str(wc->opcode));
+		return;
+	}
+}
+
+int ibtrs_srv_get_sess_name(struct ibtrs_srv *srv, char *sessname, size_t len)
+{
+	struct ibtrs_srv_sess *sess;
+	int err = -ENOTCONN;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (sess->state != IBTRS_SRV_CONNECTED)
+			continue;
+		memcpy(sessname, sess->s.sessname,
+		       min_t(size_t, sizeof(sess->s.sessname), len));
+		err = 0;
+		break;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL(ibtrs_srv_get_sess_name);
+
+int ibtrs_srv_get_queue_depth(struct ibtrs_srv *srv)
+{
+	return srv->queue_depth;
+}
+EXPORT_SYMBOL(ibtrs_srv_get_queue_depth);
+
+static int find_next_bit_ring(int cur)
+{
+	int v = cpumask_next(cur, &cq_affinity_mask);
+
+	if (v >= nr_cpu_ids)
+		v = cpumask_first(&cq_affinity_mask);
+	return v;
+}
+
+static int ibtrs_srv_get_next_cq_vector(struct ibtrs_srv_sess *sess)
+{
+	sess->cur_cq_vector = find_next_bit_ring(sess->cur_cq_vector);
+
+	return sess->cur_cq_vector;
+}
+
+static struct ibtrs_srv *__alloc_srv(struct ibtrs_srv_ctx *ctx,
+				     const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+	int i;
+
+	srv = kzalloc(sizeof(*srv), GFP_KERNEL);
+	if (unlikely(!srv))
+		return NULL;
+
+	refcount_set(&srv->refcount, 1);
+	INIT_LIST_HEAD(&srv->paths_list);
+	mutex_init(&srv->paths_mutex);
+	mutex_init(&srv->paths_ev_mutex);
+	uuid_copy(&srv->paths_uuid, paths_uuid);
+	srv->queue_depth = sess_queue_depth;
+	srv->ctx = ctx;
+
+	srv->chunks = kcalloc(srv->queue_depth, sizeof(*srv->chunks),
+			      GFP_KERNEL);
+	if (unlikely(!srv->chunks))
+		goto err_free_srv;
+
+	for (i = 0; i < srv->queue_depth; i++) {
+		srv->chunks[i] = mempool_alloc(chunk_pool, GFP_KERNEL);
+		if (unlikely(!srv->chunks[i])) {
+			pr_err("mempool_alloc() failed\n");
+			goto err_free_chunks;
+		}
+	}
+	list_add(&srv->ctx_list, &ctx->srv_list);
+
+	return srv;
+
+err_free_chunks:
+	while (i--)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+
+err_free_srv:
+	kfree(srv);
+
+	return NULL;
+}
+
+static void free_srv(struct ibtrs_srv *srv)
+{
+	int i;
+
+	WARN_ON(refcount_read(&srv->refcount));
+	for (i = 0; i < srv->queue_depth; i++)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+	kfree(srv);
+}
+
+static inline struct ibtrs_srv *__find_srv_and_get(struct ibtrs_srv_ctx *ctx,
+						   const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list) {
+		if (uuid_equal(&srv->paths_uuid, paths_uuid) &&
+		    refcount_inc_not_zero(&srv->refcount))
+			return srv;
+	}
+
+	return NULL;
+}
+
+static struct ibtrs_srv *get_or_create_srv(struct ibtrs_srv_ctx *ctx,
+					   const uuid_t *paths_uuid)
+{
+	struct ibtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	srv = __find_srv_and_get(ctx, paths_uuid);
+	if (!srv)
+		srv = __alloc_srv(ctx, paths_uuid);
+	mutex_unlock(&ctx->srv_mutex);
+
+	return srv;
+}
+
+static void put_srv(struct ibtrs_srv *srv)
+{
+	if (refcount_dec_and_test(&srv->refcount)) {
+		struct ibtrs_srv_ctx *ctx = srv->ctx;
+
+		WARN_ON(srv->kobj.state_in_sysfs);
+		WARN_ON(srv->kobj_paths.state_in_sysfs);
+
+		mutex_lock(&ctx->srv_mutex);
+		list_del(&srv->ctx_list);
+		mutex_unlock(&ctx->srv_mutex);
+		free_srv(srv);
+	}
+}
+
+static void __add_path_to_srv(struct ibtrs_srv *srv,
+			      struct ibtrs_srv_sess *sess)
+{
+	list_add_tail(&sess->s.entry, &srv->paths_list);
+	srv->paths_num++;
+	WARN_ON(srv->paths_num > MAX_PATHS_NUM);
+}
+
+static void del_path_from_srv(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+
+	if (WARN_ON(!srv))
+		return;
+
+	mutex_lock(&srv->paths_mutex);
+	list_del(&sess->s.entry);
+	WARN_ON(!srv->paths_num);
+	srv->paths_num--;
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static void ibtrs_srv_close_work(struct work_struct *work)
+{
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_srv_ctx *ctx;
+	struct ibtrs_srv_con *con;
+	int i;
+
+	sess = container_of(work, typeof(*sess), close_work);
+	ctx = sess->srv->ctx;
+
+	ibtrs_srv_destroy_sess_files(sess);
+	ibtrs_srv_stop_hb(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		con = to_srv_con(sess->s.con[i]);
+		if (!con)
+			continue;
+
+		rdma_disconnect(con->c.cm_id);
+		ib_drain_qp(con->c.qp);
+	}
+	/* Wait for all inflights */
+	ibtrs_srv_wait_ops_ids(sess);
+
+	/* Notify upper layer if we are the last path */
+	ibtrs_srv_sess_down(sess);
+
+	unmap_cont_bufs(sess);
+	ibtrs_srv_free_ops_ids(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		con = to_srv_con(sess->s.con[i]);
+		if (!con)
+			continue;
+
+		ibtrs_cq_qp_destroy(&con->c);
+		rdma_destroy_id(con->c.cm_id);
+		kfree(con);
+	}
+	ibtrs_ib_dev_put(sess->s.ib_dev);
+
+	del_path_from_srv(sess);
+	put_srv(sess->srv);
+	sess->srv = NULL;
+	ibtrs_srv_change_state(sess, IBTRS_SRV_CLOSED);
+
+	kfree(sess->rdma_addr);
+	kfree(sess->s.con);
+	kfree(sess);
+}
+
+static int ibtrs_rdma_do_accept(struct ibtrs_srv_sess *sess,
+				struct rdma_cm_id *cm_id)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_msg_conn_rsp msg;
+	struct rdma_conn_param param;
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.retry_count = retry_count;
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_VERSION);
+	msg.errno = 0;
+	msg.queue_depth = cpu_to_le16(srv->queue_depth);
+	msg.rkey = cpu_to_le32(sess->s.ib_dev->rkey);
+	msg.max_io_size = cpu_to_le32(max_io_size);
+	msg.max_req_size = cpu_to_le32(MAX_REQ_SIZE);
+	uuid_copy(&msg.uuid, &sess->s.uuid);
+
+	err = rdma_accept(cm_id, &param);
+	if (err)
+		pr_err("rdma_accept(), err: %d\n", err);
+
+	return err;
+}
+
+static int ibtrs_rdma_do_reject(struct rdma_cm_id *cm_id, int errno)
+{
+	struct ibtrs_msg_conn_rsp msg;
+	int err;
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(IBTRS_MAGIC);
+	msg.version = cpu_to_le16(IBTRS_VERSION);
+	msg.errno = cpu_to_le16(errno);
+
+	err = rdma_reject(cm_id, &msg, sizeof(msg));
+	if (err)
+		pr_err("rdma_reject(), err: %d\n", err);
+
+	/* Bounce errno back */
+	return errno;
+}
+
+static struct ibtrs_srv_sess *
+__find_sess(struct ibtrs_srv *srv, const uuid_t *sess_uuid)
+{
+	struct ibtrs_srv_sess *sess;
+
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (uuid_equal(&sess->s.uuid, sess_uuid))
+			return sess;
+	}
+
+	return NULL;
+}
+
+static int create_con(struct ibtrs_srv_sess *sess,
+		      struct rdma_cm_id *cm_id,
+		      unsigned int cid)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	struct ibtrs_srv_con *con;
+
+	u16 cq_size, wr_queue_size;
+	int err, cq_vector;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con)) {
+		ibtrs_err(sess, "kzalloc() failed\n");
+		err = -ENOMEM;
+		goto err;
+	}
+
+	con->c.cm_id = cm_id;
+	con->c.sess = &sess->s;
+	con->c.cid = cid;
+	atomic_set(&con->wr_cnt, 0);
+
+	if (con->c.cid == 0) {
+		cq_size       = SERVICE_CON_QUEUE_DEPTH;
+		/* + 2 for drain and heartbeat */
+		wr_queue_size = SERVICE_CON_QUEUE_DEPTH + 2;
+	} else {
+		cq_size       = srv->queue_depth;
+		wr_queue_size = sess->s.ib_dev->attrs.max_qp_wr;
+	}
+
+	cq_vector = ibtrs_srv_get_next_cq_vector(sess);
+
+	/* TODO: SOFTIRQ can be faster, but be careful with softirq context */
+	err = ibtrs_cq_qp_create(&sess->s, &con->c, 1, cq_vector, cq_size,
+				 wr_queue_size, IB_POLL_WORKQUEUE);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "ibtrs_cq_qp_create(), err: %d\n", err);
+		goto free_con;
+	}
+	if (con->c.cid == 0) {
+		err = post_recv_info_req(con);
+		if (unlikely(err))
+			goto free_cqqp;
+	}
+	WARN_ON(sess->s.con[cid]);
+	sess->s.con[cid] = &con->c;
+
+	/*
+	 * Change context from server to current connection.  The other
+	 * way is to use cm_id->qp->qp_context, which does not work on OFED.
+	 */
+	cm_id->context = &con->c;
+
+	return 0;
+
+free_cqqp:
+	ibtrs_cq_qp_destroy(&con->c);
+free_con:
+	kfree(con);
+
+err:
+	return err;
+}
+
+static struct ibtrs_srv_sess *__alloc_sess(struct ibtrs_srv *srv,
+					   struct rdma_cm_id *cm_id,
+					   unsigned int con_num,
+					   unsigned int recon_cnt,
+					   const uuid_t *uuid)
+{
+	struct ibtrs_srv_sess *sess;
+	int err = -ENOMEM;
+
+	if (unlikely(srv->paths_num >= MAX_PATHS_NUM)) {
+		err = -ECONNRESET;
+		goto err;
+	}
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	sess->rdma_addr = kcalloc(srv->queue_depth, sizeof(*sess->rdma_addr),
+				  GFP_KERNEL);
+	if (unlikely(!sess->rdma_addr))
+		goto err_free_sess;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_rdma_addr;
+
+	sess->state = IBTRS_SRV_CONNECTING;
+	sess->srv = srv;
+	sess->cur_cq_vector = -1;
+	sess->s.dst_addr = cm_id->route.addr.dst_addr;
+	sess->s.con_num = con_num;
+	sess->s.recon_cnt = recon_cnt;
+	uuid_copy(&sess->s.uuid, uuid);
+	spin_lock_init(&sess->state_lock);
+	INIT_WORK(&sess->close_work, ibtrs_srv_close_work);
+	ibtrs_srv_init_hb(sess);
+
+	sess->s.ib_dev = ibtrs_ib_dev_find_get(cm_id);
+	if (unlikely(!sess->s.ib_dev)) {
+		err = -ENOMEM;
+		ibtrs_wrn(sess, "Failed to alloc ibtrs_device\n");
+		goto err_free_con;
+	}
+	err = map_cont_bufs(sess);
+	if (unlikely(err))
+		goto err_put_dev;
+
+	err = ibtrs_srv_alloc_ops_ids(sess);
+	if (unlikely(err))
+		goto err_unmap_bufs;
+
+	__add_path_to_srv(srv, sess);
+
+	return sess;
+
+err_unmap_bufs:
+	unmap_cont_bufs(sess);
+err_put_dev:
+	ibtrs_ib_dev_put(sess->s.ib_dev);
+err_free_con:
+	kfree(sess->s.con);
+err_free_rdma_addr:
+	kfree(sess->rdma_addr);
+err_free_sess:
+	kfree(sess);
+
+err:
+	return ERR_PTR(err);
+}
+
+static int ibtrs_rdma_connect(struct rdma_cm_id *cm_id,
+			      const struct ibtrs_msg_conn_req *msg,
+			      size_t len)
+{
+	struct ibtrs_srv_ctx *ctx = cm_id->context;
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_srv *srv;
+
+	u16 version, con_num, cid;
+	u16 recon_cnt;
+	int err;
+
+	if (unlikely(len < sizeof(*msg))) {
+		pr_err("Invalid IBTRS connection request\n");
+		goto reject_w_econnreset;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != IBTRS_MAGIC)) {
+		pr_err("Invalid IBTRS magic\n");
+		goto reject_w_econnreset;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != IBTRS_VER_MAJOR)) {
+		pr_err("Unsupported major IBTRS version: %d (expected %d)\n",
+		       version >> 8, IBTRS_VER_MAJOR);
+		goto reject_w_econnreset;
+	}
+	con_num = le16_to_cpu(msg->cid_num);
+	if (unlikely(con_num > 4096)) {
+		/* Sanity check */
+		pr_err("Too many connections requested: %d\n", con_num);
+		goto reject_w_econnreset;
+	}
+	cid = le16_to_cpu(msg->cid);
+	if (unlikely(cid >= con_num)) {
+		/* Sanity check */
+		pr_err("Incorrect cid: %d >= %d\n", cid, con_num);
+		goto reject_w_econnreset;
+	}
+	recon_cnt = le16_to_cpu(msg->recon_cnt);
+	srv = get_or_create_srv(ctx, &msg->paths_uuid);
+	if (unlikely(!srv)) {
+		err = -ENOMEM;
+		goto reject_w_err;
+	}
+	mutex_lock(&srv->paths_mutex);
+	sess = __find_sess(srv, &msg->sess_uuid);
+	if (sess) {
+		/* Session already holds a reference */
+		put_srv(srv);
+
+		if (unlikely(sess->s.recon_cnt != recon_cnt)) {
+			ibtrs_err(sess, "Reconnect detected %d != %d, but "
+				  "previous session is still alive, reconnect "
+				  "later\n", sess->s.recon_cnt, recon_cnt);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_ebusy;
+		}
+		if (unlikely(sess->state != IBTRS_SRV_CONNECTING)) {
+			ibtrs_err(sess, "Session in wrong state: %s\n",
+				  ibtrs_srv_state_str(sess->state));
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		/*
+		 * Sanity checks
+		 */
+		if (unlikely(con_num != sess->s.con_num ||
+			     cid >= sess->s.con_num)) {
+			ibtrs_err(sess, "Incorrect request: %d, %d\n",
+				  cid, con_num);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		if (unlikely(sess->s.con[cid])) {
+			ibtrs_err(sess, "Connection already exists: %d\n",
+				  cid);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+	} else {
+		sess = __alloc_sess(srv, cm_id, con_num, recon_cnt,
+				    &msg->sess_uuid);
+		if (unlikely(IS_ERR(sess))) {
+			mutex_unlock(&srv->paths_mutex);
+			put_srv(srv);
+			err = PTR_ERR(sess);
+			goto reject_w_err;
+		}
+	}
+	err = create_con(sess, cm_id, cid);
+	if (unlikely(err)) {
+		(void)ibtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since session has other connections we follow normal way
+		 * through workqueue, but still return an error to tell cma.c
+		 * to call rdma_destroy_id() for current connection.
+		 */
+		goto close_and_return_err;
+	}
+	err = ibtrs_rdma_do_accept(sess, cm_id);
+	if (unlikely(err)) {
+		(void)ibtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since current connection was successfully added to the
+		 * session we follow normal way through workqueue to close the
+		 * session, thus return 0 to tell cma.c we call
+		 * rdma_destroy_id() ourselves.
+		 */
+		err = 0;
+		goto close_and_return_err;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return 0;
+
+reject_w_err:
+	return ibtrs_rdma_do_reject(cm_id, err);
+
+reject_w_econnreset:
+	return ibtrs_rdma_do_reject(cm_id, -ECONNRESET);
+
+reject_w_ebusy:
+	return ibtrs_rdma_do_reject(cm_id, -EBUSY);
+
+close_and_return_err:
+	close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+
+static int ibtrs_srv_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct ibtrs_srv_sess *sess = NULL;
+
+	if (ev->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+		struct ibtrs_con *c = cm_id->context;
+
+		sess = to_srv_sess(c->sess);
+	}
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_CONNECT_REQUEST:
+		/*
+		 * In case of error cma.c will destroy cm_id,
+		 * see cma_process_remove()
+		 */
+		return ibtrs_rdma_connect(cm_id, ev->param.conn.private_data,
+					  ev->param.conn.private_data_len);
+	case RDMA_CM_EVENT_ESTABLISHED:
+		/* Nothing here */
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		ibtrs_err(sess, "CM error (CM event: %s, err: %d)\n",
+			  rdma_event_msg(ev->event), ev->status);
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		close_sess(sess);
+		break;
+	default:
+		pr_err("Ignoring unexpected CM event %s, err %d\n",
+		       rdma_event_msg(ev->event), ev->status);
+		break;
+	}
+
+	return 0;
+}
+
+static struct rdma_cm_id *ibtrs_srv_cm_init(struct ibtrs_srv_ctx *ctx,
+					    struct sockaddr *addr,
+					    enum rdma_port_space ps)
+{
+	struct rdma_cm_id *cm_id;
+	int ret;
+
+	cm_id = rdma_create_id(&init_net, ibtrs_srv_rdma_cm_handler,
+			       ctx, ps, IB_QPT_RC);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		pr_err("Creating id for RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_out;
+	}
+	ret = rdma_bind_addr(cm_id, addr);
+	if (ret) {
+		pr_err("Binding RDMA address failed, err: %d\n", ret);
+		goto err_cm;
+	}
+	ret = rdma_listen(cm_id, 64);
+	if (ret) {
+		pr_err("Listening on RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_cm;
+	}
+
+	switch (addr->sa_family) {
+	case AF_INET:
+		pr_debug("listening on port %u\n",
+			 ntohs(((struct sockaddr_in *)addr)->sin_port));
+		break;
+	case AF_INET6:
+		pr_debug("listening on port %u\n",
+			 ntohs(((struct sockaddr_in6 *)addr)->sin6_port));
+		break;
+	case AF_IB:
+		pr_debug("listening on service id 0x%016llx\n",
+			 be64_to_cpu(rdma_get_service_id(cm_id, addr)));
+		break;
+	default:
+		pr_debug("listening on address family %u\n", addr->sa_family);
+	}
+
+	return cm_id;
+
+err_cm:
+	rdma_destroy_id(cm_id);
+err_out:
+
+	return ERR_PTR(ret);
+}
+
+static int ibtrs_srv_rdma_init(struct ibtrs_srv_ctx *ctx, unsigned int port)
+{
+	struct sockaddr_in6 sin = {
+		.sin6_family	= AF_INET6,
+		.sin6_addr	= IN6ADDR_ANY_INIT,
+		.sin6_port	= htons(port),
+	};
+	struct sockaddr_ib sib = {
+		.sib_family			= AF_IB,
+		.sib_addr.sib_subnet_prefix	= 0ULL,
+		.sib_addr.sib_interface_id	= 0ULL,
+		.sib_sid	= cpu_to_be64(RDMA_IB_IP_PS_IB | port),
+		.sib_sid_mask	= cpu_to_be64(0xffffffffffffffffULL),
+		.sib_pkey	= cpu_to_be16(0xffff),
+	};
+	struct rdma_cm_id *cm_ip, *cm_ib;
+	int ret;
+
+	/*
+	 * We accept both IPoIB and IB connections, so we need to keep
+	 * two cm id's, one for each socket type and port space.
+	 * If the cm initialization of one of the id's fails, we abort
+	 * everything.
+	 */
+	cm_ip = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sin, RDMA_PS_TCP);
+	if (unlikely(IS_ERR(cm_ip)))
+		return PTR_ERR(cm_ip);
+
+	cm_ib = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sib, RDMA_PS_IB);
+	if (unlikely(IS_ERR(cm_ib))) {
+		ret = PTR_ERR(cm_ib);
+		goto free_cm_ip;
+	}
+
+	ctx->cm_id_ip = cm_ip;
+	ctx->cm_id_ib = cm_ib;
+
+	return 0;
+
+free_cm_ip:
+	rdma_destroy_id(cm_ip);
+
+	return ret;
+}
+
+static struct ibtrs_srv_ctx *alloc_srv_ctx(rdma_ev_fn *rdma_ev,
+					   link_ev_fn *link_ev)
+{
+	struct ibtrs_srv_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->rdma_ev = rdma_ev;
+	ctx->link_ev = link_ev;
+	mutex_init(&ctx->srv_mutex);
+	INIT_LIST_HEAD(&ctx->srv_list);
+
+	return ctx;
+}
+
+static void free_srv_ctx(struct ibtrs_srv_ctx *ctx)
+{
+	WARN_ON(!list_empty(&ctx->srv_list));
+	kfree(ctx);
+}
+
+struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port)
+{
+	struct ibtrs_srv_ctx *ctx;
+	int err;
+
+	ctx = alloc_srv_ctx(rdma_ev, link_ev);
+	if (unlikely(!ctx))
+		return ERR_PTR(-ENOMEM);
+
+	err = ibtrs_srv_rdma_init(ctx, port);
+	if (unlikely(err)) {
+		free_srv_ctx(ctx);
+		return ERR_PTR(err);
+	}
+	/* Do not let module be unloaded if server context is alive */
+	__module_get(THIS_MODULE);
+
+	return ctx;
+}
+EXPORT_SYMBOL(ibtrs_srv_open);
+
+void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess)
+{
+	close_sess(sess);
+}
+
+static void close_sess(struct ibtrs_srv_sess *sess)
+{
+	enum ibtrs_srv_state old_state;
+
+	if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING,
+					   &old_state))
+		queue_work(ibtrs_wq, &sess->close_work);
+	WARN_ON(sess->state != IBTRS_SRV_CLOSING);
+}
+
+static void close_sessions(struct ibtrs_srv *srv)
+{
+	struct ibtrs_srv_sess *sess;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry)
+		close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static void close_ctx(struct ibtrs_srv_ctx *ctx)
+{
+	struct ibtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list)
+		close_sessions(srv);
+	mutex_unlock(&ctx->srv_mutex);
+	flush_workqueue(ibtrs_wq);
+}
+
+void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx)
+{
+	rdma_destroy_id(ctx->cm_id_ip);
+	rdma_destroy_id(ctx->cm_id_ib);
+	close_ctx(ctx);
+	free_srv_ctx(ctx);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(ibtrs_srv_close);
+
+static int check_module_params(void)
+{
+	if (sess_queue_depth < 1 || sess_queue_depth > MAX_SESS_QUEUE_DEPTH) {
+		pr_err("Invalid sess_queue_depth parameter value\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and
+	 * the offset inside the memory chunk.
+	 */
+	if (ilog2(sess_queue_depth - 1) + ilog2(rcv_buf_size - 1) >
+	    MAX_IMM_PAYL_BITS) {
+		pr_err("RDMA immediate size (%d bits) is not enough to encode "
+		       "%d buffers of size %dB. Reduce 'sess_queue_depth' "
+		       "or 'max_io_size' parameters.\n", MAX_IMM_PAYL_BITS,
+		       sess_queue_depth, rcv_buf_size);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init ibtrs_server_init(void)
+{
+	int err;
+
+	if (!strlen(cq_affinity_list))
+		init_cq_affinity();
+
+	pr_info("Loading module %s, version: %s "
+		"(retry_count: %d, cq_affinity_list: %s, "
+		"max_io_size: %d, sess_queue_depth: %d)\n",
+		KBUILD_MODNAME, IBTRS_VER_STRING, retry_count,
+		cq_affinity_list, max_io_size, sess_queue_depth);
+
+	err = check_module_params();
+	if (err) {
+		pr_err("Failed to load module, invalid module parameters,"
+		       " err: %d\n", err);
+		return err;
+	}
+	chunk_pool = mempool_create_page_pool(CHUNK_POOL_SIZE,
+					      get_order(rcv_buf_size));
+	if (unlikely(!chunk_pool)) {
+		pr_err("Failed to preallocate pool of chunks\n");
+		return -ENOMEM;
+	}
+	ibtrs_wq = alloc_workqueue("ibtrs_server_wq", WQ_MEM_RECLAIM, 0);
+	if (!ibtrs_wq) {
+		err = -ENOMEM;
+		pr_err("Failed to load module, alloc ibtrs_server_wq failed\n");
+		goto out_chunk_pool;
+	}
+	err = ibtrs_srv_create_sysfs_module_files();
+	if (err) {
+		pr_err("Failed to load module, can't create sysfs files,"
+		       " err: %d\n", err);
+		goto out_ibtrs_wq;
+	}
+
+	return 0;
+
+out_ibtrs_wq:
+	destroy_workqueue(ibtrs_wq);
+out_chunk_pool:
+	mempool_destroy(chunk_pool);
+
+	return err;
+}
+
+static void __exit ibtrs_server_exit(void)
+{
+	ibtrs_srv_destroy_sysfs_module_files();
+	destroy_workqueue(ibtrs_wq);
+	mempool_destroy(chunk_pool);
+}
+
+module_init(ibtrs_server_init);
+module_exit(ibtrs_server_exit);
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 10/24] ibtrs: server: statistics functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (8 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 09/24] ibtrs: server: main functionality Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 11/24] ibtrs: server: sysfs interface functions Roman Pen
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This introduces a set of functions used on the server side to collect
statistics of RDMA data sent and received.
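
A minimal sketch of how these helpers are meant to be driven from the
server IO path (the call sites are illustrative; only the helper names
are taken from this patch):

	/* after an RDMA transfer of 'len' bytes in direction 'dir' (READ/WRITE) */
	ibtrs_srv_update_rdma_stats(&sess->stats, len, dir);

	/* once per work completion processed in the CQ handler */
	ibtrs_srv_update_wc_stats(&sess->stats);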

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c | 110 +++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
new file mode 100644
index 000000000000..441b07fdf44a
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
@@ -0,0 +1,110 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-srv.h"
+
+void ibtrs_srv_update_rdma_stats(struct ibtrs_srv_stats *s,
+				 size_t size, int d)
+{
+	atomic64_inc(&s->rdma_stats.dir[d].cnt);
+	atomic64_add(size, &s->rdma_stats.dir[d].size_total);
+}
+
+void ibtrs_srv_update_wc_stats(struct ibtrs_srv_stats *s)
+{
+	atomic64_inc(&s->wc_comp.calls);
+	atomic64_inc(&s->wc_comp.total_wc_cnt);
+}
+
+int ibtrs_srv_reset_rdma_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+
+		memset(r, 0, sizeof(*r));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
+				    char *page, size_t len)
+{
+	struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+	struct ibtrs_srv_sess *sess;
+
+	sess = container_of(stats, typeof(*sess), stats);
+
+	return scnprintf(page, len, "%ld %ld %ld %ld %u\n",
+			 atomic64_read(&r->dir[READ].cnt),
+			 atomic64_read(&r->dir[READ].size_total),
+			 atomic64_read(&r->dir[WRITE].cnt),
+			 atomic64_read(&r->dir[WRITE].size_total),
+			 atomic_read(&sess->ids_inflight));
+}
+
+int ibtrs_srv_reset_wc_completion_stats(struct ibtrs_srv_stats *stats,
+					bool enable)
+{
+	if (enable) {
+		memset(&stats->wc_comp, 0, sizeof(stats->wc_comp));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+int ibtrs_srv_stats_wc_completion_to_str(struct ibtrs_srv_stats *stats,
+					 char *buf, size_t len)
+{
+	return scnprintf(buf, len, "%ld %ld\n",
+			atomic64_read(&stats->wc_comp.total_wc_cnt),
+			atomic64_read(&stats->wc_comp.calls));
+}
+
+ssize_t ibtrs_srv_reset_all_help(struct ibtrs_srv_stats *stats,
+				 char *page, size_t len)
+{
+	return scnprintf(page, PAGE_SIZE, "echo 1 to reset all statistics\n");
+}
+
+int ibtrs_srv_reset_all_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		ibtrs_srv_reset_wc_completion_stats(stats, enable);
+		ibtrs_srv_reset_rdma_stats(stats, enable);
+		return 0;
+	}
+
+	return -EINVAL;
+}
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 11/24] ibtrs: server: sysfs interface functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (9 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 10/24] ibtrs: server: statistics functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 12/24] ibtrs: include client and server modules into kernel compilation Roman Pen
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the sysfs interface to IBTRS sessions on server side:

  /sys/kernel/ibtrs_server/<SESS-NAME>/
    *** IBTRS session accepted from a client peer
    |
    |- paths/<SOURCE-IP>/
       *** established paths from a client in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- hca_name
       |  *** HCA name
       |
       |- hca_port
       |  *** HCA port
       |
       |- stats/
          *** current path statistics
          |
	  |- rdma
	  |- reset_all
	  |- wc_completions
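
The statistics entries can be read like regular sysfs attributes.  A tiny
userspace sketch (the session and path names below are placeholders):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/kernel/ibtrs_server/SESS/paths/1.2.3.4/stats/rdma", "r");
		char line[256];

		if (!f)
			return 1;
		if (fgets(line, sizeof(line), f))
			/* <read-cnt> <read-bytes> <write-cnt> <write-bytes> <inflights> */
			printf("%s", line);
		fclose(f);
		return 0;
	}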

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c | 278 +++++++++++++++++++++++++
 1 file changed, 278 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
new file mode 100644
index 000000000000..ec2c86fe4181
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
@@ -0,0 +1,278 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+static struct kobject *ibtrs_kobj;
+
+static struct kobj_type ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+};
+
+static ssize_t ibtrs_srv_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibtrs_srv_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibtrs_srv_sess *sess;
+	char str[MAXHOSTNAMELEN];
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		ibtrs_err(sess, "%s: invalid value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	ibtrs_info(sess, "disconnect for path %s requested\n", str);
+	ibtrs_srv_queue_close(sess);
+
+	return count;
+}
+
+static struct kobj_attribute ibtrs_srv_disconnect_attr =
+	__ATTR(disconnect, 0644,
+	       ibtrs_srv_disconnect_show, ibtrs_srv_disconnect_store);
+
+static ssize_t ibtrs_srv_hca_port_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+	struct ibtrs_con *usr_con;
+
+	sess = container_of(kobj, typeof(*sess), kobj);
+	usr_con = sess->s.con[0];
+
+	return scnprintf(page, PAGE_SIZE, "%u\n",
+			 usr_con->cm_id->port_num);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_port_attr =
+	__ATTR(hca_port, 0444, ibtrs_srv_hca_port_show, NULL);
+
+static ssize_t ibtrs_srv_hca_name_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct ibtrs_srv_sess *sess;
+
+	sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 sess->s.ib_dev->dev->name);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_name_attr =
+	__ATTR(hca_name, 0444, ibtrs_srv_hca_name_show, NULL);
+
+static struct attribute *ibtrs_srv_sess_attrs[] = {
+	&ibtrs_srv_hca_name_attr.attr,
+	&ibtrs_srv_hca_port_attr.attr,
+	&ibtrs_srv_disconnect_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_srv_sess_attr_group = {
+	.attrs = ibtrs_srv_sess_attrs,
+};
+
+STAT_ATTR(struct ibtrs_srv_sess, rdma,
+	  ibtrs_srv_stats_rdma_to_str,
+	  ibtrs_srv_reset_rdma_stats);
+
+STAT_ATTR(struct ibtrs_srv_sess, wc_completion,
+	  ibtrs_srv_stats_wc_completion_to_str,
+	  ibtrs_srv_reset_wc_completion_stats);
+
+STAT_ATTR(struct ibtrs_srv_sess, reset_all,
+	  ibtrs_srv_reset_all_help,
+	  ibtrs_srv_reset_all_stats);
+
+static struct attribute *ibtrs_srv_stats_attrs[] = {
+	&rdma_attr.attr,
+	&wc_completion_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibtrs_srv_stats_attr_group = {
+	.attrs = ibtrs_srv_stats_attrs,
+};
+
+static int ibtrs_srv_create_once_sysfs_root_folders(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	int err = 0;
+
+	mutex_lock(&srv->paths_mutex);
+	if (srv->kobj.state_in_sysfs) {
+		/* Just increase references if kobjs were already inited */
+		kobject_get(&srv->kobj_paths);
+		kobject_get(&srv->kobj);
+		goto unlock;
+	}
+	err = kobject_init_and_add(&srv->kobj, &ktype, ibtrs_kobj,
+				   "%s", sess->s.sessname);
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add(): %d\n", err);
+		goto unlock;
+	}
+	err = kobject_init_and_add(&srv->kobj_paths, &ktype,
+				   &srv->kobj, "paths");
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add(): %d\n", err);
+		goto put_kobj;
+	}
+unlock:
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+
+put_kobj:
+	kobject_del(&srv->kobj);
+	kobject_put(&srv->kobj);
+	goto unlock;
+}
+
+static void ibtrs_srv_destroy_once_sysfs_root_folders(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+
+	mutex_lock(&srv->paths_mutex);
+	kobject_put(&srv->kobj_paths);
+	kobject_put(&srv->kobj);
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static int ibtrs_srv_create_stats_files(struct ibtrs_srv_sess *sess)
+{
+	int err;
+
+	err = kobject_init_and_add(&sess->kobj_stats, &ktype,
+				   &sess->kobj, "stats");
+	if (unlikely(err)) {
+		ibtrs_err(sess, "kobject_init_and_add(): %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj_stats,
+				 &ibtrs_srv_stats_attr_group);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "sysfs_create_group(): %d\n", err);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_put(&sess->kobj_stats);
+
+	return err;
+}
+
+int ibtrs_srv_create_sess_files(struct ibtrs_srv_sess *sess)
+{
+	struct ibtrs_srv *srv = sess->srv;
+	char str[MAXHOSTNAMELEN];
+	int err;
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	err = ibtrs_srv_create_once_sysfs_root_folders(sess);
+	if (unlikely(err))
+		return err;
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &srv->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "kobject_init_and_add(): %d\n", err);
+		goto destroy_root;
+	}
+	err = sysfs_create_group(&sess->kobj, &ibtrs_srv_sess_attr_group);
+	if (unlikely(err)) {
+		ibtrs_err(sess, "sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = ibtrs_srv_create_stats_files(sess);
+	if (unlikely(err))
+		goto remove_group;
+
+	return 0;
+
+remove_group:
+	sysfs_remove_group(&sess->kobj, &ibtrs_srv_sess_attr_group);
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+destroy_root:
+	ibtrs_srv_destroy_once_sysfs_root_folders(sess);
+
+	return err;
+}
+
+void ibtrs_srv_destroy_sess_files(struct ibtrs_srv_sess *sess)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+
+		ibtrs_srv_destroy_once_sysfs_root_folders(sess);
+	}
+}
+
+int ibtrs_srv_create_sysfs_module_files(void)
+{
+	ibtrs_kobj = kobject_create_and_add(KBUILD_MODNAME, kernel_kobj);
+	if (unlikely(!ibtrs_kobj))
+		return -ENOMEM;
+
+	return 0;
+}
+
+void ibtrs_srv_destroy_sysfs_module_files(void)
+{
+	kobject_del(ibtrs_kobj);
+	kobject_put(ibtrs_kobj);
+}
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 12/24] ibtrs: include client and server modules into kernel compilation
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (10 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 11/24] ibtrs: server: sysfs interface functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 13/24] ibtrs: a bit of documentation Roman Pen
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

Add the IBTRS Makefile and Kconfig, and add the corresponding lines to
the upper layer infiniband/ulp Kconfig and Makefile.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/Kconfig            |  1 +
 drivers/infiniband/ulp/Makefile       |  1 +
 drivers/infiniband/ulp/ibtrs/Kconfig  | 20 ++++++++++++++++++++
 drivers/infiniband/ulp/ibtrs/Makefile | 15 +++++++++++++++
 4 files changed, 37 insertions(+)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index cbf186522016..7adbd0e272c4 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -93,6 +93,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
+source "drivers/infiniband/ulp/ibtrs/Kconfig"
 
 source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index 437813c7b481..1c4f10dc8d49 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)		+= srpt/
 obj-$(CONFIG_INFINIBAND_ISER)		+= iser/
 obj-$(CONFIG_INFINIBAND_ISERT)		+= isert/
 obj-$(CONFIG_INFINIBAND_OPA_VNIC)	+= opa_vnic/
+obj-$(CONFIG_INFINIBAND_IBTRS)		+= ibtrs/
diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
new file mode 100644
index 000000000000..eaeb8f3f6b4e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Kconfig
@@ -0,0 +1,20 @@
+config INFINIBAND_IBTRS
+	tristate
+	depends on INFINIBAND_ADDR_TRANS
+
+config INFINIBAND_IBTRS_CLIENT
+	tristate "IBTRS client module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_IBTRS
+	help
+	  IBTRS client allows for simplified data transfer and connection
+	  establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
+	  READ/WRITE semantics and provides multipath capabilities.
+
+config INFINIBAND_IBTRS_SERVER
+	tristate "IBTRS server module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_IBTRS
+	help
+	  IBTRS server module processing connection and IO requests received
+	  from the IBTRS client module.
diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
new file mode 100644
index 000000000000..e6ea858745ad
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Makefile
@@ -0,0 +1,15 @@
+ibtrs-client-y := ibtrs-clt.o \
+		  ibtrs-clt-stats.o \
+		  ibtrs-clt-sysfs.o
+
+ibtrs-server-y := ibtrs-srv.o \
+		  ibtrs-srv-stats.o \
+		  ibtrs-srv-sysfs.o
+
+ibtrs-core-y := ibtrs.o
+
+obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o
+obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
+obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 13/24] ibtrs: a bit of documentation
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (11 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 12/24] ibtrs: include client and server modules into kernel compilation Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 14/24] ibnbd: private headers with IBNBD protocol structs and helpers Roman Pen
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

README with description of major sysfs entries.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/README | 238 ++++++++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/README b/drivers/infiniband/ulp/ibtrs/README
new file mode 100644
index 000000000000..ed506c7e202d
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/README
@@ -0,0 +1,238 @@
+****************************
+InfiniBand Transport (IBTRS)
+****************************
+
+IBTRS (InfiniBand Transport) is a reliable high speed transport library
+which provides support to establish an optimal number of connections
+between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
+transport. It is optimized to transfer (read/write) IO blocks.
+
+In its core interface it follows the BIO semantics of providing the
+possibility to either write data from an sg list to the remote side
+or to request ("read") data transfer from the remote side into a given
+sg list.
+
+IBTRS provides I/O fail-over and load-balancing capabilities by using
+multipath I/O (see "add_path" and "mp_policy" configuration entries).
+
+IBTRS is used by the IBNBD (InfiniBand Network Block Device) modules.
+
+======================
+Client Sysfs Interface
+======================
+
+This chapter describes only the most important files of sysfs interface
+on client side.
+
+Entries under /sys/kernel/ibtrs_client/
+=======================================
+
+When a user of IBTRS API creates a new session, a directory entry with
+the name of that session is created.
+
+Entries under /sys/kernel/ibtrs_client/<session-name>/
+======================================================
+
+add_path (RW)
+-------------
+
+Adds a new path (connection) to an existing session. Expected format is the
+following:
+
+  <[source addr,]destination addr>
+
+  *addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]
+
+max_reconnect_attempts (RW)
+---------------------------
+
+Maximum number of reconnect attempts the client should make before giving up
+after the connection breaks unexpectedly.
+
+mp_policy (RW)
+--------------
+
+Multipath policy specifies which path should be selected on each IO:
+
+   round-robin (0):
+       select the path in a per-CPU round-robin manner.
+
+   min-inflight (1):
+       select the path with the minimum number of inflight IOs.
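+
+A simplified sketch of the min-inflight selection (the path structure here
+is hypothetical, shown only to illustrate the policy):
+
+   struct path { int connected; int inflight; };
+
+   static struct path *select_min_inflight(struct path *paths, int num)
+   {
+           struct path *best = NULL;
+           int i;
+
+           for (i = 0; i < num; i++)
+                   if (paths[i].connected &&
+                       (!best || paths[i].inflight < best->inflight))
+                           best = &paths[i];
+           return best;    /* NULL if no path is connected */
+   }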
+
+Entries under /sys/kernel/ibtrs_client/<session-name>/paths/
+============================================================
+
+
+Each path belonging to a given session is listed here by its destination
+address. When a new path is added to a session by writing to the "add_path"
+entry, a directory with the corresponding destination address is created.
+
+Entries under /sys/kernel/ibtrs_client/<session-name>/paths/<dest-addr>/
+========================================================================
+
+state (R)
+---------
+
+Contains "connected" if the session is connected to the peer and fully
+functional.  Otherwise the file contains "disconnected".
+
+reconnect (RW)
+--------------
+
+Write "1" to the file in order to reconnect the path.
+Operation is blocking and returns 0 if reconnect was successful.
+
+disconnect (RW)
+---------------
+
+Write "1" to the file in order to disconnect the path.
+Operation blocks until IBTRS path is disconnected.
+
+remove_path (RW)
+----------------
+
+Write "1" to the file in order to disconnect and remove the path
+from the session.  Operation blocks until the path is disconnected
+and removed from the session.
+
+Entries under /sys/kernel/ibtrs_client/<session-name>/paths/<dest-addr>/stats/
+==============================================================================
+
+Write "0" to any file in that directory to reset corresponding statistics.
+
+reset_all (RW)
+--------------
+
+Reading will return usage help, writing 0 will clear all the statistics.
+
+sg_entries (RW)
+---------------
+
+Data to be transferred via RDMA is passed to IBTRS as a scatter-gather
+list. A scatter-gather list can contain multiple entries.
+Scatter-gather lists with fewer entries require less processing power
+and can therefore be transferred faster. The file sg_entries outputs a
+per-CPU distribution table of the number of entries in the
+scatter-gather lists that were passed to the IBTRS API function
+ibtrs_clt_request (READ or WRITE).
+
+cpu_migration (RW)
+------------------
+
+IBTRS expects that each HCA IRQ is pinned to a separate CPU. If that is
+not the case, an I/O response could be processed on a different CPU
+than the one where it was originally submitted.  This file shows
+how many interrupts were generated on an unexpected CPU.
+"from:" is the CPU on which the IRQ was expected, but not generated.
+"to:" is the CPU on which the IRQ was generated, but not expected.
+
+reconnects (RW)
+---------------
+
+Contains two unsigned int values: the first one records the number of
+successful reconnects in the path lifetime, the second one the number of
+failed reconnects in the path lifetime.
+
+rdma_lat (RW)
+-------------
+
+Latency distribution of IBTRS requests.
+The format is:
+   1 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   2 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   4 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   8 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  16 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  ...
+  65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  >= 65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  maximum ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+
+wc_completion (RW)
+------------------
+
+Contains two unsigned int values: the first one records the maximum number
+of work requests processed in work_completion in the session lifetime, the
+second one the average number of work requests processed in work_completion
+in the session lifetime.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 6 values:
+
+<read-count> <read-total-size> <write-count> <write-total-size> \
+<inflights> <failovered>
+
+======================
+Server Sysfs Interface
+======================
+
+Entries under /sys/kernel/ibtrs_server/
+=======================================
+
+When a user of the IBTRS API creates a new session on the client side, a
+directory entry with the name of that session is created here.
+
+Entries under /sys/kernel/ibtrs_server/<session-name>/paths/
+============================================================
+
+When a new path is created by writing to the "add_path" entry on the client
+side, a directory entry with the source address is created on the server.
+
+Entries under /sys/kernel/ibtrs_server/<session-name>/paths/<source-addr>/
+==========================================================================
+
+disconnect (RW)
+---------------
+
+When "1" is written to the file, the IBTRS session is disconnected.
+The operation is non-blocking and returns control immediately to the caller.
+
+hca_name (R)
+------------
+
+Contains the name of the HCA the connection is established on.
+
+hca_port (R)
+------------
+
+Contains the port number of the active port the traffic is going through.
+
+Entries under /sys/kernel/ibtrs_server/<session-name>/paths/<source-addr>/stats/
+================================================================================
+
+When "0" is written to a file in this directory, the corresponding counters
+will be reset.
+
+reset_all (RW)
+--------------
+
+Reading will return usage help, writing 0 will clear all the statistics
+counters.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 5 values:
+
+<read-count> <read-total-size> <write-count> <write-total-size> <inflights>
+
+wc_completion (RW)
+------------------
+
+Contains two values: the first one records the total number of work
+requests processed in work_completion in the session lifetime, the second
+one the total number of calls to the cq completion handler. Division of
+the first number by the second gives the average number of completions
+processed per call of the completion handler.
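+
+For example, if 3000 work requests were processed in 1000 calls to the
+completion handler, the average is 3 completions per call (the numbers are
+illustrative).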
+
+Contact
+-------
+
+Mailing list: "IBNBD/IBTRS Storage Team" <ibnbd@profitbricks.com>
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 14/24] ibnbd: private headers with IBNBD protocol structs and helpers
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (12 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 13/24] ibtrs: a bit of documentation Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 15/24] ibnbd: client: private header with client structs and functions Roman Pen
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

These are common private headers with IBNBD protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.
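
The logging helpers take either a client or a server device and prefix the
message with the path and session names.  A couple of hypothetical call
sites (the device variables are placeholders):

	ibnbd_err(clt_dev, "Failed to map device, err: %d\n", err);
	ibnbd_info_rl(srv_sess_dev, "Device closed\n");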

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-log.h   |  71 ++++++++
 drivers/block/ibnbd/ibnbd-proto.h | 360 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 431 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-log.h b/drivers/block/ibnbd/ibnbd-log.h
new file mode 100644
index 000000000000..489343a61171
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-log.h
@@ -0,0 +1,71 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_LOG_H
+#define IBNBD_LOG_H
+
+#include "ibnbd-clt.h"
+#include "ibnbd-srv.h"
+
+#define ibnbd_diskname(dev) ({						\
+	struct gendisk *gd = ((struct ibnbd_clt_dev *)dev)->gd;		\
+	gd ? gd->disk_name : "<no dev>";				\
+})
+
+void unknown_type(void);
+
+#define ibnbd_log(fn, dev, fmt, ...) ({					\
+	__builtin_choose_expr(						\
+		__builtin_types_compatible_p(				\
+			typeof(dev), struct ibnbd_clt_dev *),		\
+		fn("<%s@%s> %s: " fmt, (dev)->pathname,		\
+		   (dev)->sess->sessname, ibnbd_diskname(dev),		\
+		   ##__VA_ARGS__),					\
+		__builtin_choose_expr(					\
+			__builtin_types_compatible_p(typeof(dev),	\
+					struct ibnbd_srv_sess_dev *),	\
+			fn("<%s@%s>: " fmt, (dev)->pathname,	\
+			   (dev)->sess->sessname, ##__VA_ARGS__),		\
+			unknown_type()));				\
+})
+
+#define ibnbd_err(dev, fmt, ...)	\
+	ibnbd_log(pr_err, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_err_rl(dev, fmt, ...)	\
+	ibnbd_log(pr_err_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_wrn(dev, fmt, ...)	\
+	ibnbd_log(pr_warn, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_wrn_rl(dev, fmt, ...) \
+	ibnbd_log(pr_warn_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_info(dev, fmt, ...) \
+	ibnbd_log(pr_info, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_info_rl(dev, fmt, ...) \
+	ibnbd_log(pr_info_ratelimited, dev, fmt, ##__VA_ARGS__)
+
+#endif /* IBNBD_LOG_H */
diff --git a/drivers/block/ibnbd/ibnbd-proto.h b/drivers/block/ibnbd/ibnbd-proto.h
new file mode 100644
index 000000000000..c809705a2322
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-proto.h
@@ -0,0 +1,360 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_PROTO_H
+#define IBNBD_PROTO_H
+
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/limits.h>
+#include <linux/inet.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <rdma/ib.h>
+
+#define IBNBD_VER_MAJOR 1
+#define IBNBD_VER_MINOR 0
+#define IBNBD_VER_STRING __stringify(IBNBD_VER_MAJOR) "." \
+			 __stringify(IBNBD_VER_MINOR)
+
+/* TODO: should be configurable */
+#define IBTRS_PORT 1234
+
+/**
+ * enum ibnbd_msg_type - IBNBD message types
+ * @IBNBD_MSG_SESS_INFO:	initial session info from client to server
+ * @IBNBD_MSG_SESS_INFO_RSP:	initial session info from server to client
+ * @IBNBD_MSG_OPEN:		open (map) device request
+ * @IBNBD_MSG_OPEN_RSP:		response to an @IBNBD_MSG_OPEN
+ * @IBNBD_MSG_IO:		block IO request operation
+ * @IBNBD_MSG_CLOSE:		close (unmap) device request
+ * @IBNBD_MSG_CLOSE_RSP:	response to an @IBNBD_MSG_CLOSE
+ */
+enum ibnbd_msg_type {
+	IBNBD_MSG_SESS_INFO,
+	IBNBD_MSG_SESS_INFO_RSP,
+	IBNBD_MSG_OPEN,
+	IBNBD_MSG_OPEN_RSP,
+	IBNBD_MSG_IO,
+	IBNBD_MSG_CLOSE,
+	IBNBD_MSG_CLOSE_RSP,
+};
+
+/**
+ * struct ibnbd_msg_hdr - header of IBNBD messages
+ * @type:	Message type, valid values see: enum ibnbd_msg_type
+ */
+struct ibnbd_msg_hdr {
+	__le16		type;
+	__le16		__padding;
+};
+
+enum ibnbd_access_mode {
+	IBNBD_ACCESS_RO,
+	IBNBD_ACCESS_RW,
+	IBNBD_ACCESS_MIGRATION,
+};
+
+#define _IBNBD_FILEIO  0
+#define _IBNBD_BLOCKIO 1
+#define _IBNBD_AUTOIO  2
+
+enum ibnbd_io_mode {
+	IBNBD_FILEIO = _IBNBD_FILEIO,
+	IBNBD_BLOCKIO = _IBNBD_BLOCKIO,
+	IBNBD_AUTOIO = _IBNBD_AUTOIO,
+};
+
+/**
+ * struct ibnbd_msg_sess_info - initial session info from client to server
+ * @hdr:		message header
+ * @ver:		IBNBD protocol version
+ */
+struct ibnbd_msg_sess_info {
+	struct ibnbd_msg_hdr hdr;
+	u8		ver;
+	u8		reserved[31];
+};
+
+/**
+ * struct ibnbd_msg_sess_info_rsp - initial session info from server to client
+ * @hdr:		message header
+ * @ver:		IBNBD protocol version
+ */
+struct ibnbd_msg_sess_info_rsp {
+	struct ibnbd_msg_hdr hdr;
+	u8		ver;
+	u8		reserved[31];
+};
+
+/**
+ * struct ibnbd_msg_open - request to open a remote device.
+ * @hdr:		message header
+ * @access_mode:	the mode to open remote device, valid values see:
+ *			enum ibnbd_access_mode
+ * @io_mode:		Open volume on server as block device or as file
+ * @device_name:	device path on remote side
+ */
+struct ibnbd_msg_open {
+	struct ibnbd_msg_hdr hdr;
+	u8		access_mode;
+	u8		io_mode;
+	s8		dev_name[NAME_MAX];
+	u8		__padding[3];
+};
+
+/**
+ * struct ibnbd_msg_close - request to close a remote device.
+ * @hdr:	message header
+ * @device_id:	device_id on server side to identify the device
+ */
+struct ibnbd_msg_close {
+	struct ibnbd_msg_hdr hdr;
+	__le32		device_id;
+};
+
+/**
+ * struct ibnbd_msg_open_rsp - response message to IBNBD_MSG_OPEN
+ * @hdr:		message header
+ * @nsectors:		number of sectors
+ * @device_id:		device_id on server side to identify the device
+ * @queue_flags:	queue_flags of the device on server side
+ * @max_hw_sectors:	max hardware sectors in the usual 512b unit
+ * @max_write_same_sectors: max sectors for WRITE SAME in the 512b unit
+ * @max_discard_sectors: max. sectors that can be discarded at once
+ * @discard_granularity: size of the internal discard allocation unit
+ * @discard_alignment: offset from internal allocation assignment
+ * @physical_block_size: physical block size device supports
+ * @logical_block_size: logical block size device supports
+ * @max_segments:	max segments hardware support in one transfer
+ * @secure_discard:	supports secure discard
+ * @rotational:		is a rotational disc?
+ * @io_mode:		io_mode device is opened.
+ */
+struct ibnbd_msg_open_rsp {
+	struct ibnbd_msg_hdr	hdr;
+	__le32			device_id;
+	__le64			nsectors;
+	__le32			max_hw_sectors;
+	__le32			max_write_same_sectors;
+	__le32			max_discard_sectors;
+	__le32			discard_granularity;
+	__le32			discard_alignment;
+	__le16			physical_block_size;
+	__le16			logical_block_size;
+	__le16			max_segments;
+	__le16			secure_discard;
+	u8			rotational;
+	u8			io_mode;
+	u8			__padding[10];
+};
+
+/**
+ * struct ibnbd_msg_io - message for I/O read/write
+ * @hdr:	message header
+ * @device_id:	device_id on server side to find the right device
+ * @sector:	bi_sector attribute from struct bio
+ * @rw:		bitmask, valid values are defined in enum ibnbd_io_flags
+ * @bi_size:   number of bytes for I/O read/write
+ */
+struct ibnbd_msg_io {
+	struct ibnbd_msg_hdr hdr;
+	__le32		device_id;
+	__le64		sector;
+	__le32		rw;
+	__le32		bi_size;
+};
+
+#define IBNBD_OP_BITS  8
+#define IBNBD_OP_MASK  ((1 << IBNBD_OP_BITS) - 1)
+
+/**
+ * enum ibnbd_io_flags - IBNBD request types from rq_flag_bits
+ * @IBNBD_OP_READ:	     read sectors from the device
+ * @IBNBD_OP_WRITE:	     write sectors to the device
+ * @IBNBD_OP_FLUSH:	     flush the volatile write cache
+ * @IBNBD_OP_DISCARD:        discard sectors
+ * @IBNBD_OP_SECURE_ERASE:   securely erase sectors
+ * @IBNBD_OP_WRITE_SAME:     write the same sectors many times
+ *
+ * @IBNBD_F_SYNC:	     request is sync (sync write or read)
+ * @IBNBD_F_FUA:             forced unit access
+ */
+enum ibnbd_io_flags {
+
+	/* Operations */
+
+	IBNBD_OP_READ		= 0,
+	IBNBD_OP_WRITE		= 1,
+	IBNBD_OP_FLUSH		= 2,
+	IBNBD_OP_DISCARD	= 3,
+	IBNBD_OP_SECURE_ERASE	= 4,
+	IBNBD_OP_WRITE_SAME	= 5,
+
+	IBNBD_OP_LAST,
+
+	/* Flags */
+
+	IBNBD_F_SYNC  = 1<<(IBNBD_OP_BITS + 0),
+	IBNBD_F_FUA   = 1<<(IBNBD_OP_BITS + 1),
+
+	IBNBD_F_ALL   = (IBNBD_F_SYNC | IBNBD_F_FUA)
+
+};
+
+static inline u32 ibnbd_op(u32 flags)
+{
+	return (flags & IBNBD_OP_MASK);
+}
+
+static inline u32 ibnbd_flags(u32 flags)
+{
+	return (flags & ~IBNBD_OP_MASK);
+}
+
+static inline bool ibnbd_flags_supported(u32 flags)
+{
+	u32 op;
+
+	op = ibnbd_op(flags);
+	flags = ibnbd_flags(flags);
+
+	if (op >= IBNBD_OP_LAST)
+		return false;
+	if (flags & ~IBNBD_F_ALL)
+		return false;
+
+	return true;
+}
+
+static inline u32 ibnbd_to_bio_flags(u32 ibnbd_flags)
+{
+	u32 bio_flags;
+
+	switch (ibnbd_op(ibnbd_flags)) {
+	case IBNBD_OP_READ:
+		bio_flags = REQ_OP_READ;
+		break;
+	case IBNBD_OP_WRITE:
+		bio_flags = REQ_OP_WRITE;
+		break;
+	case IBNBD_OP_FLUSH:
+		bio_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
+		break;
+	case IBNBD_OP_DISCARD:
+		bio_flags = REQ_OP_DISCARD;
+		break;
+	case IBNBD_OP_SECURE_ERASE:
+		bio_flags = REQ_OP_SECURE_ERASE;
+		break;
+	case IBNBD_OP_WRITE_SAME:
+		bio_flags = REQ_OP_WRITE_SAME;
+		break;
+	default:
+		WARN(1, "Unknown IBNBD type: %d (flags %d)\n",
+		     ibnbd_op(ibnbd_flags), ibnbd_flags);
+		bio_flags = 0;
+	}
+
+	if (ibnbd_flags & IBNBD_F_SYNC)
+		bio_flags |= REQ_SYNC;
+
+	if (ibnbd_flags & IBNBD_F_FUA)
+		bio_flags |= REQ_FUA;
+
+	return bio_flags;
+}
+
+static inline u32 rq_to_ibnbd_flags(struct request *rq)
+{
+	u32 ibnbd_flags;
+
+	switch (req_op(rq)) {
+	case REQ_OP_READ:
+		ibnbd_flags = IBNBD_OP_READ;
+		break;
+	case REQ_OP_WRITE:
+		ibnbd_flags = IBNBD_OP_WRITE;
+		break;
+	case REQ_OP_DISCARD:
+		ibnbd_flags = IBNBD_OP_DISCARD;
+		break;
+	case REQ_OP_SECURE_ERASE:
+		ibnbd_flags = IBNBD_OP_SECURE_ERASE;
+		break;
+	case REQ_OP_WRITE_SAME:
+		ibnbd_flags = IBNBD_OP_WRITE_SAME;
+		break;
+	case REQ_OP_FLUSH:
+		ibnbd_flags = IBNBD_OP_FLUSH;
+		break;
+	default:
+		WARN(1, "Unknown request type %d (flags %llu)\n",
+		     req_op(rq), (unsigned long long)rq->cmd_flags);
+		ibnbd_flags = 0;
+	}
+
+	if (op_is_sync(rq->cmd_flags))
+		ibnbd_flags |= IBNBD_F_SYNC;
+
+	if (op_is_flush(rq->cmd_flags))
+		ibnbd_flags |= IBNBD_F_FUA;
+
+	return ibnbd_flags;
+}
+
+static inline const char *ibnbd_io_mode_str(enum ibnbd_io_mode mode)
+{
+	switch (mode) {
+	case IBNBD_FILEIO:
+		return "fileio";
+	case IBNBD_BLOCKIO:
+		return "blockio";
+	case IBNBD_AUTOIO:
+		return "autoio";
+	default:
+		return "unknown";
+	}
+}
+
+static inline const char *ibnbd_access_mode_str(enum ibnbd_access_mode mode)
+{
+	switch (mode) {
+	case IBNBD_ACCESS_RO:
+		return "ro";
+	case IBNBD_ACCESS_RW:
+		return "rw";
+	case IBNBD_ACCESS_MIGRATION:
+		return "migration";
+	default:
+		return "unknown";
+	}
+}
+
+#endif /* IBNBD_PROTO_H */
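
The helpers above pack exactly one operation into the low IBNBD_OP_BITS of
the 32-bit rw word and keep the modifier flags in the bits above it.  A small
usage sketch (illustrative only, assuming ibnbd-proto.h is included) of a
client-side round trip for a synchronous FUA write; on the server side
ibnbd_to_bio_flags() maps the same value back to
REQ_OP_WRITE | REQ_SYNC | REQ_FUA:

static void ibnbd_rw_packing_sketch(struct ibnbd_msg_io *msg)
{
	/* compose: the operation sits in the low bits, flags above them */
	u32 rw = IBNBD_OP_WRITE | IBNBD_F_SYNC | IBNBD_F_FUA;

	/* decompose: the helpers split the word back into op and flags */
	WARN_ON(ibnbd_op(rw) != IBNBD_OP_WRITE);
	WARN_ON(ibnbd_flags(rw) != (IBNBD_F_SYNC | IBNBD_F_FUA));
	WARN_ON(!ibnbd_flags_supported(rw));

	/* the value travels little-endian in ibnbd_msg_io.rw */
	msg->rw = cpu_to_le32(rw);
}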
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 15/24] ibnbd: client: private header with client structs and functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (13 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 14/24] ibnbd: private headers with IBNBD protocol structs and helpers Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 16/24] ibnbd: client: main functionality Roman Pen
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This header describes the main structs and functions used by the
ibnbd-client module, mainly for managing IBNBD sessions and mapped
block devices and for creating and destroying sysfs entries.
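
Roughly, the sysfs code added later in the series is expected to drive the
map/unmap entry points declared below as in the following sketch
(illustrative only; the session name "client1@server1" and the device path
"/dev/ram0" are made-up example values):

static int map_example(struct ibtrs_addr *paths, size_t path_cnt)
{
	struct ibnbd_clt_dev *dev;

	/* map a remote block device over an (existing or new) session */
	dev = ibnbd_clt_map_device("client1@server1", paths, path_cnt,
				   "/dev/ram0", IBNBD_ACCESS_RW,
				   BLK_MQ, IBNBD_AUTOIO);
	if (IS_ERR(dev))
		return PTR_ERR(dev);

	/* ... the device is now accessible as /dev/ibnbd<N> ... */

	/* unmap again; not forced, no sysfs self entry to remove */
	return ibnbd_clt_unmap_device(dev, false, NULL);
}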

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-clt.h | 193 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 193 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-clt.h b/drivers/block/ibnbd/ibnbd-clt.h
new file mode 100644
index 000000000000..b3d72b2962dd
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt.h
@@ -0,0 +1,193 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_CLT_H
+#define IBNBD_CLT_H
+
+#include <linux/wait.h>
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/blk-mq.h>
+#include <linux/refcount.h>
+
+#include "ibtrs.h"
+#include "ibnbd-proto.h"
+#include "ibnbd-log.h"
+
+#define BMAX_SEGMENTS 31
+#define RECONNECT_DELAY 30
+#define MAX_RECONNECTS -1
+
+enum ibnbd_clt_dev_state {
+	DEV_STATE_INIT,
+	DEV_STATE_MAPPED,
+	DEV_STATE_MAPPED_DISCONNECTED,
+	DEV_STATE_UNMAPPED,
+};
+
+enum ibnbd_queue_mode {
+	BLK_MQ,
+	BLK_RQ
+};
+
+struct ibnbd_iu_comp {
+	wait_queue_head_t wait;
+	int errno;
+};
+
+struct ibnbd_iu {
+	union {
+		struct request *rq; /* for block io */
+		void *buf; /* for user messages */
+	};
+	struct ibtrs_tag	*tag;
+	union {
+		/* use to send msg associated with a dev */
+		struct ibnbd_clt_dev *dev;
+		/* use to send msg associated with a sess */
+		struct ibnbd_clt_session *sess;
+	};
+	blk_status_t		status;
+	struct scatterlist	sglist[BMAX_SEGMENTS];
+	struct work_struct	work;
+	int			errno;
+	struct ibnbd_iu_comp	*comp;
+};
+
+struct ibnbd_cpu_qlist {
+	struct list_head	requeue_list;
+	spinlock_t		requeue_lock;
+	unsigned int		cpu;
+};
+
+struct ibnbd_clt_session {
+	struct list_head        list;
+	struct ibtrs_clt        *ibtrs;
+	wait_queue_head_t       ibtrs_waitq;
+	bool                    ibtrs_ready;
+	struct ibnbd_cpu_qlist	__percpu *cpu_queues;
+	DECLARE_BITMAP(cpu_queues_bm, NR_CPUS);
+	int	__percpu	*cpu_rr; /* per-cpu var for CPU round-robin */
+	atomic_t		busy;
+	int			queue_depth;
+	u32			max_io_size;
+	struct blk_mq_tag_set	tag_set;
+	struct mutex		lock; /* protects state and devs_list */
+	struct list_head        devs_list; /* list of struct ibnbd_clt_dev */
+	refcount_t		refcount;
+	char			sessname[NAME_MAX];
+	u8			ver; /* protocol version */
+};
+
+/* Submission queues */
+struct ibnbd_queue {
+	struct list_head	requeue_list;
+	unsigned long		in_list;
+	struct ibnbd_clt_dev	*dev;
+	struct blk_mq_hw_ctx	*hctx;
+};
+
+struct ibnbd_clt_dev {
+	struct ibnbd_clt_session	*sess;
+	struct request_queue	*queue;
+	struct ibnbd_queue	*hw_queues;
+	struct delayed_work	rq_delay_work;
+	u32			device_id;
+	/* local Idr index - used to track minor number allocations. */
+	u32			clt_device_id;
+	struct mutex		lock;
+	enum ibnbd_clt_dev_state	dev_state;
+	enum ibnbd_queue_mode	queue_mode;
+	enum ibnbd_io_mode	io_mode; /* user requested */
+	enum ibnbd_io_mode	remote_io_mode; /* server really used */
+	char			pathname[NAME_MAX];
+	enum ibnbd_access_mode	access_mode;
+	bool			read_only;
+	bool			rotational;
+	u32			max_hw_sectors;
+	u32			max_write_same_sectors;
+	u32			max_discard_sectors;
+	u32			discard_granularity;
+	u32			discard_alignment;
+	u16			secure_discard;
+	u16			physical_block_size;
+	u16			logical_block_size;
+	u16			max_segments;
+	size_t			nsectors;
+	u64			size;		/* device size in bytes */
+	struct list_head        list;
+	struct gendisk		*gd;
+	struct kobject		kobj;
+	char			blk_symlink_name[NAME_MAX];
+	refcount_t		refcount;
+	struct work_struct	unmap_on_rmmod_work;
+};
+
+static inline const char *ibnbd_queue_mode_str(enum ibnbd_queue_mode mode)
+{
+	switch (mode) {
+	case BLK_RQ:
+		return "rq";
+	case BLK_MQ:
+		return "mq";
+	default:
+		return "unknown";
+	}
+}
+
+/* ibnbd-clt.c */
+
+struct ibnbd_clt_dev *ibnbd_clt_map_device(const char *sessname,
+					   struct ibtrs_addr *paths,
+					   size_t path_cnt,
+					   const char *pathname,
+					   enum ibnbd_access_mode access_mode,
+					   enum ibnbd_queue_mode queue_mode,
+					   enum ibnbd_io_mode io_mode);
+int ibnbd_clt_unmap_device(struct ibnbd_clt_dev *dev, bool force,
+			   const struct attribute *sysfs_self);
+
+int ibnbd_clt_remap_device(struct ibnbd_clt_dev *dev);
+int ibnbd_clt_resize_disk(struct ibnbd_clt_dev *dev, size_t newsize);
+
+/* ibnbd-clt-sysfs.c */
+
+int ibnbd_clt_create_sysfs_files(void);
+
+void ibnbd_clt_destroy_sysfs_files(void);
+void ibnbd_clt_destroy_default_group(void);
+
+void ibnbd_clt_remove_dev_symlink(struct ibnbd_clt_dev *dev);
+void ibnbd_sysfs_remove_file_self(struct kobject *kobj,
+				  const struct attribute *attr);
+
+#endif /* IBNBD_CLT_H */
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 16/24] ibnbd: client: main functionality
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (14 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 15/24] ibnbd: client: private header with client structs and functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 15:11   ` Jens Axboe
  2018-02-02 14:08 ` [PATCH 17/24] ibnbd: client: sysfs interface functions Roman Pen
                   ` (10 subsequent siblings)
  26 siblings, 1 reply; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the main functionality of the ibnbd-client module: it provides
an interface to map a remote device as a local block device /dev/ibnbd<N>
and feeds IBTRS with IO requests.
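
Before the full listing, a condensed sketch of the per-IO hot path it
implements (simplified from ibnbd_client_xfer_request() in the diff below):
one small ibnbd_msg_io header is handed to IBTRS together with the request's
scatter-gather list in a single call, and the completion later arrives in
msg_io_conf().  Error handling and the requeue logic are omitted here; see
ibnbd_client_xfer_request() and ibnbd_queue_rq() below.

static int io_xfer_sketch(struct ibnbd_clt_dev *dev, struct request *rq,
			  struct ibnbd_iu *iu)
{
	struct ibnbd_msg_io msg = {
		.hdr.type  = cpu_to_le16(IBNBD_MSG_IO),
		.device_id = cpu_to_le32(dev->device_id),
		.sector    = cpu_to_le64(blk_rq_pos(rq)),
		.bi_size   = cpu_to_le32(blk_rq_bytes(rq)),
		.rw        = cpu_to_le32(rq_to_ibnbd_flags(rq)),
	};
	struct kvec vec = { .iov_base = &msg, .iov_len = sizeof(msg) };
	unsigned int sg_cnt;

	/* describe the data pages of the request with a scatterlist */
	sg_cnt = blk_rq_map_sg(dev->queue, rq, iu->sglist);

	/* one call: header in @vec, payload described by iu->sglist */
	return ibtrs_clt_request(rq_data_dir(rq), msg_io_conf,
				 dev->sess->ibtrs, iu->tag, iu, &vec, 1,
				 blk_rq_bytes(rq), iu->sglist, sg_cnt);
}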

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-clt.c | 1959 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 1959 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-clt.c b/drivers/block/ibnbd/ibnbd-clt.c
new file mode 100644
index 000000000000..b5bc71414778
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt.c
@@ -0,0 +1,1959 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/scatterlist.h>
+#include <linux/idr.h>
+
+#include "ibnbd-clt.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_DESCRIPTION("InfiniBand Network Block Device Client");
+MODULE_VERSION(IBNBD_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static int ibnbd_client_major;
+static DEFINE_IDA(index_ida);
+static DEFINE_MUTEX(ida_lock);
+static DEFINE_MUTEX(sess_lock);
+static LIST_HEAD(sess_list);
+
+static bool softirq_enable;
+module_param(softirq_enable, bool, 0444);
+MODULE_PARM_DESC(softirq_enable, "finish request in softirq_fn."
+		 " (default: 0)");
+/*
+ * Maximum number of partitions an instance can have.
+ * 6 bits = 64 minors = 63 partitions (one minor is used for the device itself)
+ */
+#define IBNBD_PART_BITS		6
+#define KERNEL_SECTOR_SIZE      512
+
+static inline bool ibnbd_clt_get_sess(struct ibnbd_clt_session *sess)
+{
+	return refcount_inc_not_zero(&sess->refcount);
+}
+
+static void free_sess(struct ibnbd_clt_session *sess);
+
+static void ibnbd_clt_put_sess(struct ibnbd_clt_session *sess)
+{
+	might_sleep();
+
+	if (refcount_dec_and_test(&sess->refcount))
+		free_sess(sess);
+}
+
+static inline bool ibnbd_clt_dev_is_mapped(struct ibnbd_clt_dev *dev)
+{
+	return dev->dev_state == DEV_STATE_MAPPED;
+}
+
+static void ibnbd_clt_put_dev(struct ibnbd_clt_dev *dev)
+{
+	might_sleep();
+
+	if (refcount_dec_and_test(&dev->refcount)) {
+		mutex_lock(&ida_lock);
+		ida_simple_remove(&index_ida, dev->clt_device_id);
+		mutex_unlock(&ida_lock);
+		kfree(dev->hw_queues);
+		ibnbd_clt_put_sess(dev->sess);
+		kfree(dev);
+	}
+}
+
+static inline bool ibnbd_clt_get_dev(struct ibnbd_clt_dev *dev)
+{
+	return refcount_inc_not_zero(&dev->refcount);
+}
+
+static void ibnbd_clt_set_dev_attr(struct ibnbd_clt_dev *dev,
+				   const struct ibnbd_msg_open_rsp *rsp)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+
+	dev->device_id		    = le32_to_cpu(rsp->device_id);
+	dev->nsectors		    = le64_to_cpu(rsp->nsectors);
+	dev->logical_block_size	    = le16_to_cpu(rsp->logical_block_size);
+	dev->physical_block_size    = le16_to_cpu(rsp->physical_block_size);
+	dev->max_write_same_sectors = le32_to_cpu(rsp->max_write_same_sectors);
+	dev->max_discard_sectors    = le32_to_cpu(rsp->max_discard_sectors);
+	dev->discard_granularity    = le32_to_cpu(rsp->discard_granularity);
+	dev->discard_alignment	    = le32_to_cpu(rsp->discard_alignment);
+	dev->secure_discard	    = le16_to_cpu(rsp->secure_discard);
+	dev->rotational		    = rsp->rotational;
+	dev->remote_io_mode	    = rsp->io_mode;
+
+	dev->max_hw_sectors = sess->max_io_size / dev->logical_block_size;
+	dev->max_segments = BMAX_SEGMENTS;
+
+	if (dev->remote_io_mode == IBNBD_BLOCKIO) {
+		dev->max_hw_sectors = min_t(u32, dev->max_hw_sectors,
+					    le32_to_cpu(rsp->max_hw_sectors));
+		dev->max_segments = min_t(u16, dev->max_segments,
+					  le16_to_cpu(rsp->max_segments));
+	}
+}
+
+static int ibnbd_clt_revalidate_disk(struct ibnbd_clt_dev *dev,
+				     size_t new_nsectors)
+{
+	int err = 0;
+
+	ibnbd_info(dev, "Device size changed from %zu to %zu sectors\n",
+		   dev->nsectors, new_nsectors);
+	dev->nsectors = new_nsectors;
+	set_capacity(dev->gd,
+		     dev->nsectors * (dev->logical_block_size /
+				      KERNEL_SECTOR_SIZE));
+	err = revalidate_disk(dev->gd);
+	if (err)
+		ibnbd_err(dev, "Failed to change device size from"
+			  " %zu to %zu, err: %d\n", dev->nsectors,
+			  new_nsectors, err);
+	return err;
+}
+
+static int process_msg_open_rsp(struct ibnbd_clt_dev *dev,
+				struct ibnbd_msg_open_rsp *rsp)
+{
+	int err = 0;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state == DEV_STATE_UNMAPPED) {
+		ibnbd_info(dev, "Ignoring Open-Response message from server"
+			   " for unmapped device\n");
+		err = -ENOENT;
+		goto out;
+	}
+	if (dev->dev_state == DEV_STATE_MAPPED_DISCONNECTED) {
+		u64 nsectors = le64_to_cpu(rsp->nsectors);
+
+		/*
+		 * If the device was remapped and the size changed in the
+		 * meantime we need to revalidate it
+		 */
+		if (dev->nsectors != nsectors)
+			ibnbd_clt_revalidate_disk(dev, nsectors);
+		ibnbd_info(dev, "Device online, device remapped successfully\n");
+	}
+	ibnbd_clt_set_dev_attr(dev, rsp);
+	dev->dev_state = DEV_STATE_MAPPED;
+
+out:
+	mutex_unlock(&dev->lock);
+
+	return err;
+}
+
+int ibnbd_clt_resize_disk(struct ibnbd_clt_dev *dev, size_t newsize)
+{
+	int ret = 0;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state != DEV_STATE_MAPPED) {
+		pr_err("Failed to set new size of the device, "
+		       "device is not opened\n");
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = ibnbd_clt_revalidate_disk(dev, newsize);
+
+out:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static void ibnbd_blk_delay_work(struct work_struct *work)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(work, struct ibnbd_clt_dev, rq_delay_work.work);
+	spin_lock_irq(dev->queue->queue_lock);
+	blk_start_queue(dev->queue);
+	spin_unlock_irq(dev->queue->queue_lock);
+}
+
+/*
+ * The difference to the original blk_delay_queue() is that here the
+ * delayed work clears the stopped flag (via blk_start_queue()), so we
+ * behave like the MQ path.
+ */
+static void ibnbd_blk_delay_queue(struct ibnbd_clt_dev *dev,
+				  unsigned long msecs)
+{
+	int cpu = get_cpu();
+
+	kblockd_schedule_delayed_work_on(cpu, &dev->rq_delay_work,
+					 msecs_to_jiffies(msecs));
+	put_cpu();
+}
+
+static inline void ibnbd_clt_dev_requeue(struct ibnbd_queue *q)
+{
+	struct ibnbd_clt_dev *dev = q->dev;
+
+	if (dev->queue_mode == BLK_MQ) {
+		if (WARN_ON(!q->hctx))
+			return;
+		blk_mq_delay_queue(q->hctx, 0);
+	} else if (dev->queue_mode == BLK_RQ) {
+		ibnbd_blk_delay_queue(q->dev, 0);
+	} else {
+		WARN(1, "We support requeueing only for RQ or MQ");
+	}
+}
+
+enum {
+	IBNBD_DELAY_10ms   = 10,
+	IBNBD_DELAY_IFBUSY = -1,
+};
+
+/**
+ * ibnbd_get_cpu_qlist() - finds a list with HW queues to be requeued
+ *
+ * Description:
+ *     Each CPU has a list of HW queues which need to be requeued.  If such
+ *     a list is not empty, its bit is set in the bitmap.  This function
+ *     finds the first set bit in the bitmap and returns the corresponding
+ *     CPU list.
+ */
+static struct ibnbd_cpu_qlist *
+ibnbd_get_cpu_qlist(struct ibnbd_clt_session *sess, int cpu)
+{
+	int bit;
+
+	/* First half */
+	bit = find_next_bit(sess->cpu_queues_bm, nr_cpu_ids, cpu);
+	if (bit < nr_cpu_ids) {
+		return per_cpu_ptr(sess->cpu_queues, bit);
+	} else if (cpu != 0) {
+		/* Second half */
+		bit = find_next_bit(sess->cpu_queues_bm, cpu, 0);
+		if (bit < cpu)
+			return per_cpu_ptr(sess->cpu_queues, bit);
+	}
+
+	return NULL;
+}
+
+static inline int nxt_cpu(int cpu)
+{
+	return (cpu + 1) % nr_cpu_ids;
+}
+
+/**
+ * ibnbd_requeue_if_needed() - requeue if CPU queue is marked as non empty
+ *
+ * Description:
+ *     Each CPU has its own list of HW queues which should be requeued.
+ *     The function finds such a list, takes the list lock, picks the
+ *     first HW queue off the list and requeues it.
+ *
+ * Return:
+ *     True if the queue was requeued, false otherwise.
+ *
+ * Context:
+ *     Does not matter.
+ */
+static inline bool ibnbd_requeue_if_needed(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_queue *q = NULL;
+	struct ibnbd_cpu_qlist *cpu_q;
+	unsigned long flags;
+	int *cpup;
+
+	/*
+	 * To keep fairness and not to let other queues starve we always
+	 * try to wake up someone else in round-robin manner.  That of course
+	 * increases latency but queues always have a chance to be executed.
+	 */
+	cpup = get_cpu_ptr(sess->cpu_rr);
+	for (cpu_q = ibnbd_get_cpu_qlist(sess, nxt_cpu(*cpup)); cpu_q;
+	     cpu_q = ibnbd_get_cpu_qlist(sess, nxt_cpu(cpu_q->cpu))) {
+		if (!spin_trylock_irqsave(&cpu_q->requeue_lock, flags))
+			continue;
+		if (likely(test_bit(cpu_q->cpu, sess->cpu_queues_bm))) {
+			q = list_first_entry_or_null(&cpu_q->requeue_list,
+						     typeof(*q), requeue_list);
+			if (WARN_ON(!q))
+				goto clear_bit;
+			list_del_init(&q->requeue_list);
+			clear_bit_unlock(0, &q->in_list);
+
+			if (list_empty(&cpu_q->requeue_list)) {
+				/* Clear bit if nothing is left */
+clear_bit:
+				clear_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			}
+		}
+		spin_unlock_irqrestore(&cpu_q->requeue_lock, flags);
+
+		if (q)
+			break;
+	}
+
+	/*
+	 * Save the CPU that has just been requeued in the per-cpu var.
+	 * Just incrementing it doesn't work, because ibnbd_get_cpu_qlist()
+	 * will always return the first CPU with something on the queue
+	 * list when the value stored in the var is greater than the last
+	 * CPU with something on the list.
+	 */
+	if (cpu_q)
+		*cpup = cpu_q->cpu;
+	put_cpu_var(sess->cpu_rr);
+
+	if (q)
+		ibnbd_clt_dev_requeue(q);
+
+	return !!q;
+}
+
+/**
+ * ibnbd_requeue_all_if_idle() - requeue all queues left in the list if
+ *     session is idling (there are no requests in-flight).
+ *
+ * Description:
+ *     This function tries to rerun all stopped queues once there are no
+ *     requests in-flight anymore.  It solves an obvious problem which
+ *     arises when the number of tags is smaller than the number of queues
+ *     (hctxs) that are stopped and put to sleep: if the last tag that has
+ *     just been put back does not wake up all remaining queues (hctxs),
+ *     IO requests hang forever.
+ *
+ *     That can happen when all N tags have been exhausted from one CPU
+ *     and there are many block devices per session, say M.  Each block
+ *     device has its own queue (hctx) for each CPU, so eventually up to
+ *     M x nr_cpu_ids queues (hctxs) can be put to sleep.  If the number
+ *     of tags N < M x nr_cpu_ids, we end up with an IO hang.
+ *
+ *     To avoid this hang, the last caller of ibnbd_put_tag() (the one who
+ *     observes sess->busy == 0) must wake up all remaining queues.
+ *
+ * Context:
+ *     Does not matter.
+ */
+static inline void ibnbd_requeue_all_if_idle(struct ibnbd_clt_session *sess)
+{
+	bool requeued;
+
+	do {
+		requeued = ibnbd_requeue_if_needed(sess);
+	} while (atomic_read(&sess->busy) == 0 && requeued);
+}
+
+static struct ibtrs_tag *ibnbd_get_tag(struct ibnbd_clt_session *sess,
+				       enum ibtrs_clt_con_type con_type,
+				       int wait)
+{
+	struct ibtrs_tag *tag;
+
+	tag = ibtrs_clt_get_tag(sess->ibtrs, con_type,
+				wait ? IBTRS_TAG_WAIT : IBTRS_TAG_NOWAIT);
+	if (likely(tag))
+		/* We have a subtle rare case here, when all tags can be
+		 * consumed before busy counter increased.  This is safe,
+		 * consumed before the busy counter is increased.  This is
+		 * safe, because the loser will get NULL as a tag, observe
+		 * a zero busy counter and immediately restart the queue
+		 * itself.
+		atomic_inc(&sess->busy);
+
+	return tag;
+}
+
+static void ibnbd_put_tag(struct ibnbd_clt_session *sess, struct ibtrs_tag *tag)
+{
+	ibtrs_clt_put_tag(sess->ibtrs, tag);
+	atomic_dec(&sess->busy);
+	/* Paired with ibnbd_clt_dev_add_to_requeue().  Decrement first
+	 * and then check queue bits.
+	 */
+	smp_mb__after_atomic();
+	ibnbd_requeue_all_if_idle(sess);
+}
+
+static struct ibnbd_iu *ibnbd_get_iu(struct ibnbd_clt_session *sess,
+				     enum ibtrs_clt_con_type con_type,
+				     int wait)
+{
+	struct ibnbd_iu *iu;
+	struct ibtrs_tag *tag;
+
+	tag = ibnbd_get_tag(sess, con_type,
+			    wait ? IBTRS_TAG_WAIT : IBTRS_TAG_NOWAIT);
+	if (unlikely(!tag))
+		return NULL;
+	iu = ibtrs_tag_to_pdu(tag);
+	iu->tag = tag; /* yes, ibtrs_tag_from_pdu() can be nice here,
+			* but also we have to think about MQ mode
+			*/
+
+	return iu;
+}
+
+static void ibnbd_put_iu(struct ibnbd_clt_session *sess, struct ibnbd_iu *iu)
+{
+	ibnbd_put_tag(sess, iu->tag);
+}
+
+static void ibnbd_softirq_done_fn(struct request *rq)
+{
+	struct ibnbd_clt_dev *dev	= rq->rq_disk->private_data;
+	struct ibnbd_clt_session *sess	= dev->sess;
+	struct ibnbd_iu *iu;
+
+	switch (dev->queue_mode) {
+	case BLK_MQ:
+		iu = blk_mq_rq_to_pdu(rq);
+		ibnbd_put_tag(sess, iu->tag);
+		blk_mq_end_request(rq, iu->status);
+		break;
+	case BLK_RQ:
+		iu = rq->special;
+		blk_end_request_all(rq, iu->status);
+		break;
+	default:
+		WARN(true, "dev->queue_mode contains unexpected"
+		     " value: %d. Memory Corruption? Inflight I/O stalled!\n",
+		     dev->queue_mode);
+		return;
+	}
+}
+
+static void msg_io_conf(void *priv, int errno)
+{
+	struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
+	struct ibnbd_clt_dev *dev = iu->dev;
+	struct request *rq = iu->rq;
+
+	iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
+
+	switch (dev->queue_mode) {
+	case BLK_MQ:
+		if (softirq_enable) {
+			blk_mq_complete_request(rq);
+		} else {
+			ibnbd_put_tag(dev->sess, iu->tag);
+			blk_mq_end_request(rq, iu->status);
+		}
+		break;
+	case BLK_RQ:
+		if (softirq_enable)
+			blk_complete_request(rq);
+		else
+			blk_end_request_all(rq, iu->status);
+		break;
+	default:
+		WARN(true, "dev->queue_mode contains unexpected"
+		     " value: %d. Memory Corruption? Inflight I/O stalled!\n",
+		     dev->queue_mode);
+		return;
+	}
+
+	if (errno)
+		ibnbd_info_rl(dev, "%s I/O failed with err: %d\n",
+			      rq_data_dir(rq) == READ ? "read" : "write",
+			      errno);
+}
+
+static void init_iu_comp(struct ibnbd_iu *iu, struct ibnbd_iu_comp *comp)
+{
+	init_waitqueue_head(&comp->wait);
+	comp->errno = INT_MAX;
+	iu->comp = comp;
+}
+
+static void deinit_iu_comp(struct ibnbd_iu *iu)
+{
+	iu->comp = NULL;
+}
+
+static void wake_up_iu_comp(struct ibnbd_iu *iu, int errno)
+{
+	struct ibnbd_iu_comp *comp = iu->comp;
+
+	if (comp) {
+		comp->errno = errno;
+		wake_up(&comp->wait);
+		deinit_iu_comp(iu);
+	}
+}
+
+static void wait_iu_comp(struct ibnbd_iu_comp *comp)
+{
+	wait_event(comp->wait, comp->errno != INT_MAX);
+}
+
+static void msg_conf(void *priv, int errno)
+{
+	struct ibnbd_iu *iu = (struct ibnbd_iu *)priv;
+
+	iu->errno = errno;
+	schedule_work(&iu->work);
+}
+
+enum {
+	NO_WAIT = 0,
+	WAIT    = 1
+};
+
+static int send_usr_msg(struct ibtrs_clt *ibtrs, int dir,
+			struct ibnbd_iu *iu, struct kvec *vec, size_t nr,
+			size_t len, struct scatterlist *sg, unsigned int sg_len,
+			void (*conf)(struct work_struct *work),
+			int *errno, bool wait)
+{
+	struct ibnbd_iu_comp comp;
+	int err;
+
+	if (wait)
+		init_iu_comp(iu, &comp);
+	INIT_WORK(&iu->work, conf);
+	err = ibtrs_clt_request(dir, msg_conf, ibtrs, iu->tag,
+				iu, vec, nr, len, sg, sg_len);
+	if (unlikely(err)) {
+		deinit_iu_comp(iu);
+	} else if (wait) {
+		wait_iu_comp(&comp);
+		*errno = comp.errno;
+	} else {
+		*errno = 0;
+	}
+
+	return err;
+}
+
+static void msg_close_conf(struct work_struct *work)
+{
+	struct ibnbd_iu *iu = container_of(work, struct ibnbd_iu, work);
+	struct ibnbd_clt_dev *dev = iu->dev;
+
+	wake_up_iu_comp(iu, iu->errno);
+	ibnbd_put_iu(dev->sess, iu);
+	ibnbd_clt_put_dev(dev);
+}
+
+static int send_msg_close(struct ibnbd_clt_dev *dev, u32 device_id, bool wait)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	struct ibnbd_msg_close msg;
+	struct ibnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	iu = ibnbd_get_iu(sess, IBTRS_USR_CON, IBTRS_TAG_WAIT);
+	if (unlikely(!iu))
+		return -ENOMEM;
+
+	iu->buf = NULL;
+	iu->dev = dev;
+
+	sg_mark_end(&iu->sglist[0]);
+
+	msg.hdr.type	= cpu_to_le16(IBNBD_MSG_CLOSE);
+	msg.device_id	= cpu_to_le32(device_id);
+
+	ibnbd_clt_get_dev(dev);
+	err = send_usr_msg(sess->ibtrs, WRITE, iu, &vec, 1, 0, NULL, 0,
+			   msg_close_conf, &errno, wait);
+	if (unlikely(err)) {
+		ibnbd_clt_put_dev(dev);
+		ibnbd_put_iu(sess, iu);
+	} else {
+		err = errno;
+	}
+
+	return err;
+}
+
+static void msg_open_conf(struct work_struct *work)
+{
+	struct ibnbd_iu *iu = container_of(work, struct ibnbd_iu, work);
+	struct ibnbd_msg_open_rsp *rsp = iu->buf;
+	struct ibnbd_clt_dev *dev = iu->dev;
+	int errno = iu->errno;
+
+	if (errno) {
+		ibnbd_err(dev, "Opening failed, server responded: %d\n", errno);
+	} else {
+		errno = process_msg_open_rsp(dev, rsp);
+		if (unlikely(errno)) {
+			u32 device_id = le32_to_cpu(rsp->device_id);
+			/*
+			 * If the server thinks it's fine, but we fail to
+			 * process the response, then be nice and send a
+			 * close to the server.
+			 */
+			(void)send_msg_close(dev, device_id, NO_WAIT);
+		}
+	}
+	kfree(rsp);
+	wake_up_iu_comp(iu, errno);
+	ibnbd_put_iu(dev->sess, iu);
+	ibnbd_clt_put_dev(dev);
+}
+
+static void msg_sess_info_conf(struct work_struct *work)
+{
+	struct ibnbd_iu *iu = container_of(work, struct ibnbd_iu, work);
+	struct ibnbd_msg_sess_info_rsp *rsp = iu->buf;
+	struct ibnbd_clt_session *sess = iu->sess;
+
+	if (likely(!iu->errno))
+		sess->ver = min_t(u8, rsp->ver, IBNBD_VER_MAJOR);
+
+	kfree(rsp);
+	wake_up_iu_comp(iu, iu->errno);
+	ibnbd_put_iu(sess, iu);
+	ibnbd_clt_put_sess(sess);
+}
+
+static int send_msg_open(struct ibnbd_clt_dev *dev, bool wait)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	struct ibnbd_msg_open_rsp *rsp;
+	struct ibnbd_msg_open msg;
+	struct ibnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+	if (unlikely(!rsp))
+		return -ENOMEM;
+
+	iu = ibnbd_get_iu(sess, IBTRS_USR_CON, IBTRS_TAG_WAIT);
+	if (unlikely(!iu)) {
+		kfree(rsp);
+		return -ENOMEM;
+	}
+
+	iu->buf = rsp;
+	iu->dev = dev;
+
+	sg_init_one(iu->sglist, rsp, sizeof(*rsp));
+
+	msg.hdr.type	= cpu_to_le16(IBNBD_MSG_OPEN);
+	msg.access_mode	= dev->access_mode;
+	msg.io_mode	= dev->io_mode;
+	strlcpy(msg.dev_name, dev->pathname, sizeof(msg.dev_name));
+
+	ibnbd_clt_get_dev(dev);
+	err = send_usr_msg(sess->ibtrs, READ, iu,
+			   &vec, 1, sizeof(*rsp), iu->sglist, 1,
+			   msg_open_conf, &errno, wait);
+	if (unlikely(err)) {
+		ibnbd_clt_put_dev(dev);
+		ibnbd_put_iu(sess, iu);
+		kfree(rsp);
+	} else {
+		err = errno;
+	}
+
+	return err;
+}
+
+static int send_msg_sess_info(struct ibnbd_clt_session *sess, bool wait)
+{
+	struct ibnbd_msg_sess_info_rsp *rsp;
+	struct ibnbd_msg_sess_info msg;
+	struct ibnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+	if (unlikely(!rsp))
+		return -ENOMEM;
+
+	iu = ibnbd_get_iu(sess, IBTRS_USR_CON, IBTRS_TAG_WAIT);
+	if (unlikely(!iu)) {
+		kfree(rsp);
+		return -ENOMEM;
+	}
+
+	iu->buf = rsp;
+	iu->sess = sess;
+
+	sg_init_one(iu->sglist, rsp, sizeof(*rsp));
+
+	msg.hdr.type = cpu_to_le16(IBNBD_MSG_SESS_INFO);
+	msg.ver      = IBNBD_VER_MAJOR;
+
+	ibnbd_clt_get_sess(sess);
+	err = send_usr_msg(sess->ibtrs, READ, iu,
+			   &vec, 1, sizeof(*rsp), iu->sglist, 1,
+			   msg_sess_info_conf, &errno, wait);
+	if (unlikely(err)) {
+		ibnbd_clt_put_sess(sess);
+		ibnbd_put_iu(sess, iu);
+		kfree(rsp);
+	} else {
+		err = errno;
+	}
+
+	return err;
+}
+
+static void set_dev_states_to_disconnected(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_clt_dev *dev;
+
+	mutex_lock(&sess->lock);
+	list_for_each_entry(dev, &sess->devs_list, list) {
+		ibnbd_err(dev, "Device disconnected.\n");
+
+		mutex_lock(&dev->lock);
+		if (dev->dev_state == DEV_STATE_MAPPED)
+			dev->dev_state = DEV_STATE_MAPPED_DISCONNECTED;
+		mutex_unlock(&dev->lock);
+	}
+	mutex_unlock(&sess->lock);
+}
+
+static void remap_devs(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_clt_dev *dev;
+	struct ibtrs_attrs attrs;
+	int err;
+
+	/*
+	 * Careful here: we are called directly from the IBTRS link event,
+	 * thus we can't send any IBTRS request and wait for a response, or
+	 * IBTRS will not be able to complete the request with a failure if
+	 * something goes wrong (failing of outstanding requests happens
+	 * exactly from the context where we are blocking now).
+	 *
+	 * So to avoid deadlocks each usr message sent from here must
+	 * be asynchronous.
+	 */
+
+	err = send_msg_sess_info(sess, NO_WAIT);
+	if (unlikely(err)) {
+		pr_err("send_msg_sess_info(\"%s\"): %d\n", sess->sessname, err);
+		return;
+	}
+
+	ibtrs_clt_query(sess->ibtrs, &attrs);
+	mutex_lock(&sess->lock);
+	sess->max_io_size = attrs.max_io_size;
+
+	list_for_each_entry(dev, &sess->devs_list, list) {
+		bool skip;
+
+		mutex_lock(&dev->lock);
+		skip = (dev->dev_state == DEV_STATE_INIT);
+		mutex_unlock(&dev->lock);
+		if (skip)
+			/*
+			 * When the device is establishing a connection for
+			 * the first time, do not remap it; it will be closed
+			 * soon.
+			 */
+			continue;
+
+		ibnbd_info(dev, "session reconnected, remapping device\n");
+		err = send_msg_open(dev, NO_WAIT);
+		if (unlikely(err)) {
+			ibnbd_err(dev, "send_msg_open(): %d\n", err);
+			break;
+		}
+	}
+	mutex_unlock(&sess->lock);
+}
+
+static void ibnbd_clt_link_ev(void *priv, enum ibtrs_clt_link_ev ev)
+{
+	struct ibnbd_clt_session *sess = priv;
+
+	switch (ev) {
+	case IBTRS_CLT_LINK_EV_DISCONNECTED:
+		set_dev_states_to_disconnected(sess);
+		break;
+	case IBTRS_CLT_LINK_EV_RECONNECTED:
+		remap_devs(sess);
+		break;
+	default:
+		pr_err("Unknown session event received (%d), session: %s\n",
+		       ev, sess->sessname);
+	}
+}
+
+static void ibnbd_init_cpu_qlists(struct ibnbd_cpu_qlist __percpu *cpu_queues)
+{
+	unsigned int cpu;
+	struct ibnbd_cpu_qlist *cpu_q;
+
+	for_each_possible_cpu(cpu) {
+		cpu_q = per_cpu_ptr(cpu_queues, cpu);
+
+		cpu_q->cpu = cpu;
+		INIT_LIST_HEAD(&cpu_q->requeue_list);
+		spin_lock_init(&cpu_q->requeue_lock);
+	}
+}
+
+static struct blk_mq_ops ibnbd_mq_ops;
+static int setup_mq_tags(struct ibnbd_clt_session *sess)
+{
+	struct blk_mq_tag_set *tags = &sess->tag_set;
+
+	memset(tags, 0, sizeof(*tags));
+	tags->ops		= &ibnbd_mq_ops;
+	tags->queue_depth	= sess->queue_depth;
+	tags->numa_node		= NUMA_NO_NODE;
+	tags->flags		= BLK_MQ_F_SHOULD_MERGE |
+				  BLK_MQ_F_SG_MERGE     |
+				  BLK_MQ_F_TAG_SHARED;
+	tags->cmd_size		= sizeof(struct ibnbd_iu);
+	tags->nr_hw_queues	= num_online_cpus();
+
+	return blk_mq_alloc_tag_set(tags);
+}
+
+static void destroy_mq_tags(struct ibnbd_clt_session *sess)
+{
+	blk_mq_free_tag_set(&sess->tag_set);
+}
+
+static inline void wake_up_ibtrs_waiters(struct ibnbd_clt_session *sess)
+{
+	/* paired with rmb() in wait_for_ibtrs_connection() */
+	smp_wmb();
+	sess->ibtrs_ready = true;
+	wake_up_all(&sess->ibtrs_waitq);
+}
+
+static void close_ibtrs(struct ibnbd_clt_session *sess)
+{
+	might_sleep();
+
+	if (!IS_ERR_OR_NULL(sess->ibtrs)) {
+		ibtrs_clt_close(sess->ibtrs);
+		sess->ibtrs = NULL;
+		wake_up_ibtrs_waiters(sess);
+	}
+}
+
+static void free_sess(struct ibnbd_clt_session *sess)
+{
+	WARN_ON(!list_empty(&sess->devs_list));
+
+	might_sleep();
+
+	close_ibtrs(sess);
+	destroy_mq_tags(sess);
+	if (!list_empty(&sess->list)) {
+		mutex_lock(&sess_lock);
+		list_del(&sess->list);
+		mutex_unlock(&sess_lock);
+	}
+	free_percpu(sess->cpu_queues);
+	free_percpu(sess->cpu_rr);
+	kfree(sess);
+}
+
+static struct ibnbd_clt_session *alloc_sess(const char *sessname,
+					    const struct ibtrs_addr *paths,
+					    size_t path_cnt)
+{
+	struct ibnbd_clt_session *sess;
+	int err, cpu;
+
+	sess = kzalloc_node(sizeof(*sess), GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!sess)) {
+		pr_err("Failed to create session %s,"
+		       " allocating session struct failed\n", sessname);
+		return ERR_PTR(-ENOMEM);
+	}
+	strlcpy(sess->sessname, sessname, sizeof(sess->sessname));
+	atomic_set(&sess->busy, 0);
+	mutex_init(&sess->lock);
+	INIT_LIST_HEAD(&sess->devs_list);
+	INIT_LIST_HEAD(&sess->list);
+	bitmap_zero(sess->cpu_queues_bm, NR_CPUS);
+	init_waitqueue_head(&sess->ibtrs_waitq);
+	refcount_set(&sess->refcount, 1);
+
+	sess->cpu_queues = alloc_percpu(struct ibnbd_cpu_qlist);
+	if (unlikely(!sess->cpu_queues)) {
+		pr_err("Failed to create session to %s,"
+		       " alloc of percpu var (cpu_queues) failed\n", sessname);
+		err = -ENOMEM;
+		goto err;
+	}
+	ibnbd_init_cpu_qlists(sess->cpu_queues);
+
+	/*
+	 * This is a simple percpu variable which stores CPU indices, which
+	 * are updated on each access.  We need it for the sake of fairness,
+	 * to wake up queues in a round-robin manner.
+	 */
+	sess->cpu_rr = alloc_percpu(int);
+	if (unlikely(!sess->cpu_rr)) {
+		pr_err("Failed to create session %s,"
+		       " alloc of percpu var (cpu_rr) failed\n", sessname);
+		err = -ENOMEM;
+		goto err;
+	}
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(sess->cpu_rr, cpu) = cpu;
+
+	return sess;
+
+err:
+	free_sess(sess);
+
+	return ERR_PTR(err);
+}
+
+static int wait_for_ibtrs_connection(struct ibnbd_clt_session *sess)
+{
+	wait_event(sess->ibtrs_waitq, sess->ibtrs_ready);
+	/* paired with wmb() in wake_up_ibtrs_waiters() */
+	smp_rmb();
+	if (unlikely(IS_ERR_OR_NULL(sess->ibtrs)))
+		return -ECONNRESET;
+
+	return 0;
+}
+
+static void wait_for_ibtrs_disconnection(struct ibnbd_clt_session *sess)
+__releases(&sess_lock)
+__acquires(&sess_lock)
+{
+	DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
+
+	prepare_to_wait(&sess->ibtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
+	if (IS_ERR_OR_NULL(sess->ibtrs)) {
+		finish_wait(&sess->ibtrs_waitq, &wait);
+		return;
+	}
+	mutex_unlock(&sess_lock);
+	/* After unlock session can be freed, so careful */
+	schedule();
+	mutex_lock(&sess_lock);
+}
+
+static struct ibnbd_clt_session *__find_and_get_sess(const char *sessname)
+__releases(&sess_lock)
+__acquires(&sess_lock)
+{
+	struct ibnbd_clt_session *sess;
+	int err;
+
+again:
+	list_for_each_entry(sess, &sess_list, list) {
+		if (strcmp(sessname, sess->sessname))
+			continue;
+
+		if (unlikely(sess->ibtrs_ready && IS_ERR_OR_NULL(sess->ibtrs)))
+			/*
+			 * No IBTRS connection, session is dying.
+			 */
+			continue;
+
+		if (likely(ibnbd_clt_get_sess(sess))) {
+			/*
+			 * Alive session is found, wait for IBTRS connection.
+			 */
+			mutex_unlock(&sess_lock);
+			err = wait_for_ibtrs_connection(sess);
+			if (unlikely(err))
+				ibnbd_clt_put_sess(sess);
+			mutex_lock(&sess_lock);
+
+			if (unlikely(err))
+				/* Session is dying, repeat the loop */
+				goto again;
+
+			return sess;
+		} else {
+			/*
+			 * Ref is 0, session is dying, wait for IBTRS disconnect
+			 * in order to avoid session names clashes.
+			 */
+			wait_for_ibtrs_disconnection(sess);
+			/*
+			 * IBTRS is disconnected and soon session will be freed,
+			 * so repeat a loop.
+			 */
+			goto again;
+		}
+	}
+
+	return NULL;
+}
+
+static struct ibnbd_clt_session *find_and_get_sess(const char *sessname)
+{
+	struct ibnbd_clt_session *sess;
+
+	mutex_lock(&sess_lock);
+	sess = __find_and_get_sess(sessname);
+	mutex_unlock(&sess_lock);
+
+	return sess;
+}
+
+static struct ibnbd_clt_session *
+find_and_get_or_insert_sess(struct ibnbd_clt_session *sess)
+{
+	struct ibnbd_clt_session *found;
+
+	mutex_lock(&sess_lock);
+	found = __find_and_get_sess(sess->sessname);
+	if (!found)
+		list_add(&sess->list, &sess_list);
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static struct ibnbd_clt_session *
+find_and_get_or_create_sess(const char *sessname,
+			    const struct ibtrs_addr *paths,
+			    size_t path_cnt)
+{
+	struct ibnbd_clt_session *sess, *found;
+	struct ibtrs_attrs attrs;
+	int err;
+
+	sess = find_and_get_sess(sessname);
+	if (IS_ERR(sess) || sess)
+		/* Either success or error path */
+		return sess;
+
+	sess = alloc_sess(sessname, paths, path_cnt);
+	if (unlikely(IS_ERR(sess)))
+		return sess;
+
+	found = find_and_get_or_insert_sess(sess);
+	if (IS_ERR(found) || found) {
+		/* Either success or error path */
+		free_sess(sess);
+
+		return found;
+	}
+	/*
+	 * Nothing was found, establish ibtrs connection and proceed further.
+	 */
+	sess->ibtrs = ibtrs_clt_open(sess, ibnbd_clt_link_ev, sessname,
+				     paths, path_cnt, IBTRS_PORT,
+				     sizeof(struct ibnbd_iu),
+				     RECONNECT_DELAY, BMAX_SEGMENTS,
+				     MAX_RECONNECTS);
+	if (unlikely(IS_ERR(sess->ibtrs))) {
+		err = PTR_ERR(sess->ibtrs);
+		goto wake_up_and_put;
+	}
+	ibtrs_clt_query(sess->ibtrs, &attrs);
+	sess->max_io_size = attrs.max_io_size;
+	sess->queue_depth = attrs.queue_depth;
+
+	err = setup_mq_tags(sess);
+	if (unlikely(err))
+		goto close_ibtrs;
+
+	err = send_msg_sess_info(sess, WAIT);
+	if (unlikely(err))
+		goto close_ibtrs;
+
+	wake_up_ibtrs_waiters(sess);
+
+	return sess;
+
+close_ibtrs:
+	close_ibtrs(sess);
+put_sess:
+	ibnbd_clt_put_sess(sess);
+
+	return ERR_PTR(err);
+
+wake_up_and_put:
+	wake_up_ibtrs_waiters(sess);
+	goto put_sess;
+}
+
+static int ibnbd_client_open(struct block_device *block_device, fmode_t mode)
+{
+	struct ibnbd_clt_dev *dev = block_device->bd_disk->private_data;
+
+	if (dev->read_only && (mode & FMODE_WRITE))
+		return -EPERM;
+
+	if (dev->dev_state == DEV_STATE_UNMAPPED ||
+	    !ibnbd_clt_get_dev(dev))
+		return -EIO;
+
+	return 0;
+}
+
+static void ibnbd_client_release(struct gendisk *gen, fmode_t mode)
+{
+	struct ibnbd_clt_dev *dev = gen->private_data;
+
+	ibnbd_clt_put_dev(dev);
+}
+
+static int ibnbd_client_getgeo(struct block_device *block_device,
+			       struct hd_geometry *geo)
+{
+	u64 size;
+	struct ibnbd_clt_dev *dev;
+
+	dev = block_device->bd_disk->private_data;
+	size = dev->size * (dev->logical_block_size / KERNEL_SECTOR_SIZE);
+	geo->cylinders	= (size & ~0x3f) >> 6;	/* size/64 */
+	geo->heads	= 4;
+	geo->sectors	= 16;
+	geo->start	= 0;
+
+	return 0;
+}
+
+static const struct block_device_operations ibnbd_client_ops = {
+	.owner		= THIS_MODULE,
+	.open		= ibnbd_client_open,
+	.release	= ibnbd_client_release,
+	.getgeo		= ibnbd_client_getgeo
+};
+
+static size_t ibnbd_clt_get_sg_size(struct scatterlist *sglist, u32 len)
+{
+	struct scatterlist *sg;
+	size_t tsize = 0;
+	int i;
+
+	for_each_sg(sglist, sg, len, i)
+		tsize += sg->length;
+	return tsize;
+}
+
+static int ibnbd_client_xfer_request(struct ibnbd_clt_dev *dev,
+				     struct request *rq,
+				     struct ibnbd_iu *iu)
+{
+	struct ibtrs_clt *ibtrs = dev->sess->ibtrs;
+	struct ibtrs_tag *tag = iu->tag;
+	struct ibnbd_msg_io msg;
+	unsigned int sg_cnt;
+	struct kvec vec;
+	size_t size;
+	int err;
+
+	iu->rq		= rq;
+	iu->dev		= dev;
+	msg.sector	= cpu_to_le64(blk_rq_pos(rq));
+	msg.bi_size	= cpu_to_le32(blk_rq_bytes(rq));
+	msg.rw		= cpu_to_le32(rq_to_ibnbd_flags(rq));
+
+	sg_cnt = blk_rq_map_sg(dev->queue, rq, iu->sglist);
+	if (sg_cnt == 0)
+		/* Do not forget to mark the end */
+		sg_mark_end(&iu->sglist[0]);
+
+	msg.hdr.type	= cpu_to_le16(IBNBD_MSG_IO);
+	msg.device_id	= cpu_to_le32(dev->device_id);
+
+	vec = (struct kvec) {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+
+	size = ibnbd_clt_get_sg_size(iu->sglist, sg_cnt);
+	err = ibtrs_clt_request(rq_data_dir(rq), msg_io_conf, ibtrs, tag,
+				iu, &vec, 1, size, iu->sglist, sg_cnt);
+	if (unlikely(err)) {
+		ibnbd_err_rl(dev, "IBTRS failed to transfer IO, err: %d\n",
+			     err);
+		return err;
+	}
+
+	return 0;
+}
+
+/**
+ * ibnbd_clt_dev_add_to_requeue() - add device to requeue if session is busy
+ *
+ * Description:
+ *     If the session is busy, that means someone will requeue us when
+ *     resources are freed.  If the session is not doing anything, the
+ *     device is not added to the list and false is returned.
+ */
+static inline bool ibnbd_clt_dev_add_to_requeue(struct ibnbd_clt_dev *dev,
+						struct ibnbd_queue *q)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	struct ibnbd_cpu_qlist *cpu_q;
+	unsigned long flags;
+	bool added = true;
+	bool need_set;
+
+	cpu_q = get_cpu_ptr(sess->cpu_queues);
+	spin_lock_irqsave(&cpu_q->requeue_lock, flags);
+
+	if (likely(!test_and_set_bit_lock(0, &q->in_list))) {
+		if (WARN_ON(!list_empty(&q->requeue_list)))
+			goto unlock;
+
+		need_set = !test_bit(cpu_q->cpu, sess->cpu_queues_bm);
+		if (need_set) {
+			set_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			/* Paired with ibnbd_put_tag().	 Set a bit first
+			 * and then observe the busy counter.
+			 */
+			smp_mb__before_atomic();
+		}
+		if (likely(atomic_read(&sess->busy))) {
+			list_add_tail(&q->requeue_list, &cpu_q->requeue_list);
+		} else {
+			/* Very unlikely, but possible: busy counter was
+			 * observed as zero.  Drop all bits and return
+			 * false to restart the queue by ourselves.
+			 */
+			if (need_set)
+				clear_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			clear_bit_unlock(0, &q->in_list);
+			added = false;
+		}
+	}
+unlock:
+	spin_unlock_irqrestore(&cpu_q->requeue_lock, flags);
+	put_cpu_ptr(sess->cpu_queues);
+
+	return added;
+}
+
+static void ibnbd_clt_dev_kick_mq_queue(struct ibnbd_clt_dev *dev,
+					struct blk_mq_hw_ctx *hctx,
+					int delay)
+{
+	struct ibnbd_queue *q = hctx->driver_data;
+
+	if (WARN_ON(dev->queue_mode != BLK_MQ))
+		return;
+	blk_mq_stop_hw_queue(hctx);
+
+	if (delay != IBNBD_DELAY_IFBUSY)
+		blk_mq_delay_queue(hctx, delay);
+	else if (unlikely(!ibnbd_clt_dev_add_to_requeue(dev, q)))
+		/* If session is not busy we have to restart
+		 * the queue ourselves.
+		 */
+		blk_mq_delay_queue(hctx, IBNBD_DELAY_10ms);
+}
+
+static void ibnbd_clt_dev_kick_queue(struct ibnbd_clt_dev *dev, int delay)
+{
+	if (WARN_ON(dev->queue_mode != BLK_RQ))
+		return;
+	blk_stop_queue(dev->queue);
+
+	if (delay != IBNBD_DELAY_IFBUSY)
+		ibnbd_blk_delay_queue(dev, delay);
+	else if (unlikely(!ibnbd_clt_dev_add_to_requeue(dev, dev->hw_queues)))
+		/* If session is not busy we have to restart
+		 * the queue ourselves.
+		 */
+		ibnbd_blk_delay_queue(dev, IBNBD_DELAY_10ms);
+}
+
+static blk_status_t ibnbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+				   const struct blk_mq_queue_data *bd)
+{
+	struct request *rq = bd->rq;
+	struct ibnbd_clt_dev *dev = rq->rq_disk->private_data;
+	struct ibnbd_iu *iu = blk_mq_rq_to_pdu(rq);
+	int err;
+
+	if (unlikely(!ibnbd_clt_dev_is_mapped(dev)))
+		return BLK_STS_IOERR;
+
+	iu->tag = ibnbd_get_tag(dev->sess, IBTRS_IO_CON, IBTRS_TAG_NOWAIT);
+	if (unlikely(!iu->tag)) {
+		ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_IFBUSY);
+		return BLK_STS_RESOURCE;
+	}
+
+	blk_mq_start_request(rq);
+	err = ibnbd_client_xfer_request(dev, rq, iu);
+	if (likely(err == 0))
+		return BLK_STS_OK;
+	if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
+		ibnbd_clt_dev_kick_mq_queue(dev, hctx, IBNBD_DELAY_10ms);
+		ibnbd_put_tag(dev->sess, iu->tag);
+		return BLK_STS_RESOURCE;
+	}
+
+	ibnbd_put_tag(dev->sess, iu->tag);
+	return BLK_STS_IOERR;
+}
+
+static int ibnbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			      unsigned int hctx_idx, unsigned int numa_node)
+{
+	struct ibnbd_iu *iu = blk_mq_rq_to_pdu(rq);
+
+	sg_init_table(iu->sglist, BMAX_SEGMENTS);
+	return 0;
+}
+
+static inline void ibnbd_init_hw_queue(struct ibnbd_clt_dev *dev,
+				       struct ibnbd_queue *q,
+				       struct blk_mq_hw_ctx *hctx)
+{
+	INIT_LIST_HEAD(&q->requeue_list);
+	q->dev  = dev;
+	q->hctx = hctx;
+}
+
+static void ibnbd_init_mq_hw_queues(struct ibnbd_clt_dev *dev)
+{
+	int i;
+	struct blk_mq_hw_ctx *hctx;
+	struct ibnbd_queue *q;
+
+	queue_for_each_hw_ctx(dev->queue, hctx, i) {
+		q = &dev->hw_queues[i];
+		ibnbd_init_hw_queue(dev, q, hctx);
+		hctx->driver_data = q;
+	}
+}
+
+static struct blk_mq_ops ibnbd_mq_ops = {
+	.queue_rq	= ibnbd_queue_rq,
+	.init_request	= ibnbd_init_request,
+	.complete	= ibnbd_softirq_done_fn,
+};
+
+static int index_to_minor(int index)
+{
+	return index << IBNBD_PART_BITS;
+}
+
+static int minor_to_index(int minor)
+{
+	return minor >> IBNBD_PART_BITS;
+}
+
+static int ibnbd_rq_prep_fn(struct request_queue *q, struct request *rq)
+{
+	struct ibnbd_clt_dev *dev = q->queuedata;
+	struct ibnbd_iu *iu;
+
+	iu = ibnbd_get_iu(dev->sess, IBTRS_IO_CON, IBTRS_TAG_NOWAIT);
+	if (likely(iu)) {
+		rq->special = iu;
+		rq->rq_flags |= RQF_DONTPREP;
+
+		return BLKPREP_OK;
+	}
+
+	ibnbd_clt_dev_kick_queue(dev, IBNBD_DELAY_IFBUSY);
+	return BLKPREP_DEFER;
+}
+
+static void ibnbd_rq_unprep_fn(struct request_queue *q, struct request *rq)
+{
+	struct ibnbd_clt_dev *dev = q->queuedata;
+
+	if (WARN_ON(!rq->special))
+		return;
+	ibnbd_put_iu(dev->sess, rq->special);
+	rq->special = NULL;
+	rq->rq_flags &= ~RQF_DONTPREP;
+}
+
+static void ibnbd_clt_request(struct request_queue *q)
+__must_hold(q->queue_lock)
+{
+	int err;
+	struct request *req;
+	struct ibnbd_iu *iu;
+	struct ibnbd_clt_dev *dev = q->queuedata;
+
+	while ((req = blk_fetch_request(q)) != NULL) {
+		spin_unlock_irq(q->queue_lock);
+
+		if (unlikely(!ibnbd_clt_dev_is_mapped(dev))) {
+			err = -EIO;
+			goto next;
+		}
+
+		iu = req->special;
+		if (WARN_ON(!iu)) {
+			err = -EIO;
+			goto next;
+		}
+
+		sg_init_table(iu->sglist, dev->max_segments);
+		err = ibnbd_client_xfer_request(dev, req, iu);
+next:
+		if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
+			ibnbd_rq_unprep_fn(q, req);
+			spin_lock_irq(q->queue_lock);
+			blk_requeue_request(q, req);
+			ibnbd_clt_dev_kick_queue(dev, IBNBD_DELAY_10ms);
+			break;
+		} else if (err) {
+			blk_end_request_all(req, err);
+		}
+
+		spin_lock_irq(q->queue_lock);
+	}
+}
+
+static int setup_mq_dev(struct ibnbd_clt_dev *dev)
+{
+	dev->queue = blk_mq_init_queue(&dev->sess->tag_set);
+	if (IS_ERR(dev->queue)) {
+		ibnbd_err(dev,
+			  "Initializing multiqueue queue failed, err: %ld\n",
+			  PTR_ERR(dev->queue));
+		return PTR_ERR(dev->queue);
+	}
+	ibnbd_init_mq_hw_queues(dev);
+	return 0;
+}
+
+static int setup_rq_dev(struct ibnbd_clt_dev *dev)
+{
+	dev->queue = blk_init_queue(ibnbd_clt_request, NULL);
+	if (IS_ERR_OR_NULL(dev->queue)) {
+		if (IS_ERR(dev->queue)) {
+			ibnbd_err(dev, "Initializing request queue failed, "
+				  "err: %ld\n", PTR_ERR(dev->queue));
+			return PTR_ERR(dev->queue);
+		}
+		ibnbd_err(dev, "Initializing request queue failed\n");
+		return -ENOMEM;
+	}
+
+	blk_queue_prep_rq(dev->queue, ibnbd_rq_prep_fn);
+	blk_queue_softirq_done(dev->queue, ibnbd_softirq_done_fn);
+	blk_queue_unprep_rq(dev->queue, ibnbd_rq_unprep_fn);
+
+	return 0;
+}
+
+static void setup_request_queue(struct ibnbd_clt_dev *dev)
+{
+	blk_queue_logical_block_size(dev->queue, dev->logical_block_size);
+	blk_queue_physical_block_size(dev->queue, dev->physical_block_size);
+	blk_queue_max_hw_sectors(dev->queue, dev->max_hw_sectors);
+	blk_queue_max_write_same_sectors(dev->queue,
+					 dev->max_write_same_sectors);
+
+	blk_queue_max_discard_sectors(dev->queue, dev->max_discard_sectors);
+	dev->queue->limits.discard_granularity	= dev->discard_granularity;
+	dev->queue->limits.discard_alignment	= dev->discard_alignment;
+	if (dev->max_discard_sectors)
+		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, dev->queue);
+	if (dev->secure_discard)
+		queue_flag_set_unlocked(QUEUE_FLAG_SECERASE, dev->queue);
+
+	queue_flag_set_unlocked(QUEUE_FLAG_SAME_COMP, dev->queue);
+	queue_flag_set_unlocked(QUEUE_FLAG_SAME_FORCE, dev->queue);
+	/* our HCA only supports 32 SG entries, proto uses one, so 31 left */
+	blk_queue_max_segments(dev->queue, dev->max_segments);
+	blk_queue_io_opt(dev->queue, dev->sess->max_io_size);
+	blk_queue_write_cache(dev->queue, true, true);
+	dev->queue->queuedata = dev;
+}
+
+static void ibnbd_clt_setup_gen_disk(struct ibnbd_clt_dev *dev, int idx)
+{
+	dev->gd->major		= ibnbd_client_major;
+	dev->gd->first_minor	= index_to_minor(idx);
+	dev->gd->fops		= &ibnbd_client_ops;
+	dev->gd->queue		= dev->queue;
+	dev->gd->private_data	= dev;
+	snprintf(dev->gd->disk_name, sizeof(dev->gd->disk_name), "ibnbd%d",
+		 idx);
+	pr_debug("disk_name=%s, capacity=%zu, queue_mode=%s\n",
+		 dev->gd->disk_name,
+		 dev->nsectors * (dev->logical_block_size / KERNEL_SECTOR_SIZE),
+		 ibnbd_queue_mode_str(dev->queue_mode));
+
+	set_capacity(dev->gd, dev->nsectors * (dev->logical_block_size /
+					       KERNEL_SECTOR_SIZE));
+
+	if (dev->access_mode == IBNBD_ACCESS_RO) {
+		dev->read_only = true;
+		set_disk_ro(dev->gd, true);
+	} else {
+		dev->read_only = false;
+	}
+
+	if (!dev->rotational)
+		queue_flag_set_unlocked(QUEUE_FLAG_NONROT, dev->queue);
+}
+
+static void ibnbd_clt_add_gen_disk(struct ibnbd_clt_dev *dev)
+{
+	add_disk(dev->gd);
+}
+
+static int ibnbd_client_setup_device(struct ibnbd_clt_session *sess,
+				     struct ibnbd_clt_dev *dev, int idx)
+{
+	int err;
+
+	dev->size = dev->nsectors * dev->logical_block_size;
+
+	switch (dev->queue_mode) {
+	case BLK_MQ:
+		err = setup_mq_dev(dev);
+		break;
+	case BLK_RQ:
+		err = setup_rq_dev(dev);
+		break;
+	default:
+		err = -EINVAL;
+	}
+
+	if (err)
+		return err;
+
+	setup_request_queue(dev);
+
+	dev->gd = alloc_disk_node(1 << IBNBD_PART_BITS,	NUMA_NO_NODE);
+	if (!dev->gd) {
+		ibnbd_err(dev, "Failed to allocate disk node\n");
+		blk_cleanup_queue(dev->queue);
+		return -ENOMEM;
+	}
+
+	ibnbd_clt_setup_gen_disk(dev, idx);
+
+	return 0;
+}
+
+static struct ibnbd_clt_dev *init_dev(struct ibnbd_clt_session *sess,
+				      enum ibnbd_access_mode access_mode,
+				      enum ibnbd_queue_mode queue_mode,
+				      enum ibnbd_io_mode io_mode,
+				      const char *pathname)
+{
+	int ret;
+	struct ibnbd_clt_dev *dev;
+	size_t nr;
+
+	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, NUMA_NO_NODE);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	nr = (queue_mode == BLK_MQ ? nr_cpu_ids : 1);
+	dev->hw_queues = kcalloc(nr, sizeof(*dev->hw_queues), GFP_KERNEL);
+	if (unlikely(!dev->hw_queues)) {
+		pr_err("Failed to initialize device '%s' from session"
+		       " %s, allocating hw_queues failed.\n", pathname,
+		       sess->sessname);
+		ret = -ENOMEM;
+		goto out_alloc;
+	}
+	if (queue_mode == BLK_RQ)
+		ibnbd_init_hw_queue(dev, dev->hw_queues, NULL);
+	mutex_lock(&ida_lock);
+	ret = ida_simple_get(&index_ida, 0, minor_to_index(1 << MINORBITS),
+			     GFP_KERNEL);
+	mutex_unlock(&ida_lock);
+	if (ret < 0) {
+		pr_err("Failed to initialize device '%s' from session %s,"
+		       " allocating idr failed, err: %d\n", pathname,
+		       sess->sessname, ret);
+		goto out_queues;
+	}
+	dev->clt_device_id	= ret;
+	dev->sess		= sess;
+	dev->access_mode	= access_mode;
+	dev->queue_mode		= queue_mode;
+	dev->io_mode		= io_mode;
+	strlcpy(dev->pathname, pathname, sizeof(dev->pathname));
+	INIT_DELAYED_WORK(&dev->rq_delay_work, ibnbd_blk_delay_work);
+	mutex_init(&dev->lock);
+	refcount_set(&dev->refcount, 1);
+	dev->dev_state = DEV_STATE_INIT;
+
+	/*
+	 * We are called from a sysfs entry here, thus clt-sysfs is
+	 * responsible for making sure the session does not disappear.
+	 */
+	ibnbd_clt_get_sess(sess);
+
+	return dev;
+
+out_queues:
+	kfree(dev->hw_queues);
+out_alloc:
+	kfree(dev);
+	return ERR_PTR(ret);
+}
+
+static bool __exists_dev(const char *pathname)
+{
+	struct ibnbd_clt_session *sess;
+	struct ibnbd_clt_dev *dev;
+	bool found = false;
+
+	list_for_each_entry(sess, &sess_list, list) {
+		mutex_lock(&sess->lock);
+		list_for_each_entry(dev, &sess->devs_list, list) {
+			if (!strncmp(dev->pathname, pathname,
+				     sizeof(dev->pathname))) {
+				found = true;
+				break;
+			}
+		}
+		mutex_unlock(&sess->lock);
+		if (found)
+			break;
+	}
+
+	return found;
+}
+
+static bool exists_devpath(const char *pathname)
+{
+	bool found;
+
+	mutex_lock(&sess_lock);
+	found = __exists_dev(pathname);
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static bool insert_dev_if_not_exists_devpath(const char *pathname,
+					     struct ibnbd_clt_session *sess,
+					     struct ibnbd_clt_dev *dev)
+{
+	bool found;
+
+	mutex_lock(&sess_lock);
+	found = __exists_dev(pathname);
+	if (!found) {
+		mutex_lock(&sess->lock);
+		list_add_tail(&dev->list, &sess->devs_list);
+		mutex_unlock(&sess->lock);
+	}
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static void delete_dev(struct ibnbd_clt_dev *dev)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+
+	mutex_lock(&sess->lock);
+	list_del(&dev->list);
+	mutex_unlock(&sess->lock);
+}
+
+struct ibnbd_clt_dev *ibnbd_clt_map_device(const char *sessname,
+					   struct ibtrs_addr *paths,
+					   size_t path_cnt,
+					   const char *pathname,
+					   enum ibnbd_access_mode access_mode,
+					   enum ibnbd_queue_mode queue_mode,
+					   enum ibnbd_io_mode io_mode)
+{
+	struct ibnbd_clt_session *sess;
+	struct ibnbd_clt_dev *dev;
+	int ret;
+
+	if (unlikely(exists_devpath(pathname)))
+		return ERR_PTR(-EEXIST);
+
+	sess = find_and_get_or_create_sess(sessname, paths, path_cnt);
+	if (unlikely(IS_ERR(sess)))
+		return ERR_CAST(sess);
+
+	dev = init_dev(sess, access_mode, queue_mode, io_mode, pathname);
+	if (unlikely(IS_ERR(dev))) {
+		pr_err("map_device: failed to map device '%s' from session %s,"
+		       " can't initialize device, err: %ld\n", pathname,
+		       sess->sessname, PTR_ERR(dev));
+		ret = PTR_ERR(dev);
+		goto put_sess;
+	}
+	if (unlikely(insert_dev_if_not_exists_devpath(pathname, sess, dev))) {
+		ret = -EEXIST;
+		goto put_dev;
+	}
+	ret = send_msg_open(dev, WAIT);
+	if (unlikely(ret)) {
+		ibnbd_err(dev, "map_device: failed, can't open remote device,"
+			  " err: %d\n", ret);
+		goto del_dev;
+	}
+	mutex_lock(&dev->lock);
+	pr_debug("Opened remote device: session=%s, path='%s'\n",
+		 sess->sessname, pathname);
+	ret = ibnbd_client_setup_device(sess, dev, dev->clt_device_id);
+	if (ret) {
+		ibnbd_err(dev, "map_device: Failed to configure device, err: %d\n",
+			  ret);
+		mutex_unlock(&dev->lock);
+		goto del_dev;
+	}
+
+	ibnbd_info(dev, "map_device: Device mapped as %s (nsectors: %zu,"
+		   " logical_block_size: %d, physical_block_size: %d,"
+		   " max_write_same_sectors: %d, max_discard_sectors: %d,"
+		   " discard_granularity: %d, discard_alignment: %d, "
+		   "secure_discard: %d, max_segments: %d, max_hw_sectors: %d, "
+		   "rotational: %d)\n",
+		   dev->gd->disk_name, dev->nsectors, dev->logical_block_size,
+		   dev->physical_block_size, dev->max_write_same_sectors,
+		   dev->max_discard_sectors, dev->discard_granularity,
+		   dev->discard_alignment, dev->secure_discard,
+		   dev->max_segments, dev->max_hw_sectors, dev->rotational);
+
+	mutex_unlock(&dev->lock);
+
+	ibnbd_clt_add_gen_disk(dev);
+	ibnbd_clt_put_sess(sess);
+
+	return dev;
+
+del_dev:
+	delete_dev(dev);
+put_dev:
+	ibnbd_clt_put_dev(dev);
+put_sess:
+	ibnbd_clt_put_sess(sess);
+
+	return ERR_PTR(ret);
+}
+
+static void destroy_gen_disk(struct ibnbd_clt_dev *dev)
+{
+	del_gendisk(dev->gd);
+	/*
+	 * Before marking queue as dying (blk_cleanup_queue() does that)
+	 * we have to be sure that everything in-flight has gone.
+	 * Blink with freeze/unfreeze.
+	 */
+	blk_mq_freeze_queue(dev->queue);
+	blk_mq_unfreeze_queue(dev->queue);
+	blk_cleanup_queue(dev->queue);
+	put_disk(dev->gd);
+}
+
+static void destroy_sysfs(struct ibnbd_clt_dev *dev,
+			  const struct attribute *sysfs_self)
+{
+	ibnbd_clt_remove_dev_symlink(dev);
+	if (dev->kobj.state_initialized) {
+		if (sysfs_self)
+			/* To avoid a deadlock, remove our own sysfs entry first */
+			ibnbd_sysfs_remove_file_self(&dev->kobj, sysfs_self);
+		kobject_del(&dev->kobj);
+		kobject_put(&dev->kobj);
+	}
+}
+
+int ibnbd_clt_unmap_device(struct ibnbd_clt_dev *dev, bool force,
+			   const struct attribute *sysfs_self)
+{
+	struct ibnbd_clt_session *sess = dev->sess;
+	enum ibnbd_clt_dev_state prev_state;
+	int refcount, ret = 0;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state == DEV_STATE_UNMAPPED) {
+		ibnbd_info(dev, "Device is already being unmapped\n");
+		ret = -EALREADY;
+		goto err;
+	}
+	refcount = refcount_read(&dev->refcount);
+	if (!force && refcount > 1) {
+		ibnbd_err(dev, "Closing device failed, device is in use,"
+			  " (%d device users)\n", refcount - 1);
+		ret = -EBUSY;
+		goto err;
+	}
+	prev_state = dev->dev_state;
+	dev->dev_state = DEV_STATE_UNMAPPED;
+	mutex_unlock(&dev->lock);
+
+	delete_dev(dev);
+
+	if (prev_state == DEV_STATE_MAPPED && sess->ibtrs)
+		send_msg_close(dev, dev->device_id, WAIT);
+
+	ibnbd_info(dev, "Device is unmapped\n");
+	destroy_sysfs(dev, sysfs_self);
+	destroy_gen_disk(dev);
+
+	/* Likely last reference put */
+	ibnbd_clt_put_dev(dev);
+
+	/*
+	 * At this point the device and the session may already be gone.
+	 */
+
+	return 0;
+err:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+int ibnbd_clt_remap_device(struct ibnbd_clt_dev *dev)
+{
+	int err;
+
+	mutex_lock(&dev->lock);
+	if (likely(dev->dev_state == DEV_STATE_MAPPED_DISCONNECTED))
+		err = 0;
+	else if (dev->dev_state == DEV_STATE_UNMAPPED)
+		err = -ENODEV;
+	else if (dev->dev_state == DEV_STATE_MAPPED)
+		err = -EALREADY;
+	else
+		err = -EBUSY;
+	mutex_unlock(&dev->lock);
+	if (likely(!err)) {
+		ibnbd_info(dev, "Remapping device.\n");
+		err = send_msg_open(dev, WAIT);
+		if (unlikely(err))
+			ibnbd_err(dev, "remap_device: %d\n", err);
+	}
+
+	return err;
+}
+
+static void unmap_device_work(struct work_struct *work)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(work, typeof(*dev), unmap_on_rmmod_work);
+	ibnbd_clt_unmap_device(dev, true, NULL);
+}
+
+static void ibnbd_destroy_sessions(void)
+{
+	struct ibnbd_clt_session *sess, *sn;
+	struct ibnbd_clt_dev *dev, *tn;
+
+	/* Firstly forbid access through sysfs interface */
+	ibnbd_clt_destroy_default_group();
+	ibnbd_clt_destroy_sysfs_files();
+
+	/*
+	 * At this point there is no concurrent access to the session and
+	 * device lists:
+	 *   1. New sessions or devices can't be created - session sysfs files
+	 *      are removed.
+	 *   2. Devices or sessions can't be removed - the module reference is
+	 *      taken into account in the unmap device sysfs callback.
+	 *   3. No IO requests are in flight - each file open of a block_dev
+	 *      increases the module reference in get_disk().
+	 *
+	 * But there can still be user requests in flight, sent by the
+	 * asynchronous send_msg_*() functions, thus the IBTRS session must be
+	 * explicitly closed before unmapping the devices.
+	 */
+
+	list_for_each_entry_safe(sess, sn, &sess_list, list) {
+		ibnbd_clt_get_sess(sess);
+		close_ibtrs(sess);
+		list_for_each_entry_safe(dev, tn, &sess->devs_list, list) {
+			/*
+			 * Unmapping happens in parallel for only one reason:
+			 * blk_cleanup_queue() takes around half a second, so
+			 * with a huge number of devices the whole module
+			 * unload procedure would otherwise take minutes.
+			 */
+			INIT_WORK(&dev->unmap_on_rmmod_work, unmap_device_work);
+			schedule_work(&dev->unmap_on_rmmod_work);
+		}
+		ibnbd_clt_put_sess(sess);
+	}
+	/* Wait for all scheduled unmap works */
+	flush_scheduled_work();
+	WARN_ON(!list_empty(&sess_list));
+}
+
+static int __init ibnbd_client_init(void)
+{
+	int err;
+
+	pr_info("Loading module %s, version %s: (softirq_enable: %d)\n",
+		KBUILD_MODNAME, IBNBD_VER_STRING, softirq_enable);
+
+	ibnbd_client_major = register_blkdev(ibnbd_client_major, "ibnbd");
+	if (ibnbd_client_major <= 0) {
+		pr_err("Failed to load module,"
+		       " block device registration failed\n");
+		err = -EBUSY;
+		goto out;
+	}
+
+	err = ibnbd_clt_create_sysfs_files();
+	if (err) {
+		pr_err("Failed to load module,"
+		       " creating sysfs device files failed, err: %d\n",
+		       err);
+		goto out_unregister_blk;
+	}
+
+	return 0;
+
+out_unregister_blk:
+	unregister_blkdev(ibnbd_client_major, "ibnbd");
+out:
+	return err;
+}
+
+static void __exit ibnbd_client_exit(void)
+{
+	pr_info("Unloading module\n");
+	ibnbd_destroy_sessions();
+	unregister_blkdev(ibnbd_client_major, "ibnbd");
+	ida_destroy(&index_ida);
+	pr_info("Module unloaded\n");
+}
+
+module_init(ibnbd_client_init);
+module_exit(ibnbd_client_exit);
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 17/24] ibnbd: client: sysfs interface functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (15 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 16/24] ibnbd: client: main functionality Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 18/24] ibnbd: server: private header with server structs and functions Roman Pen
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the sysfs interface to IBNBD block devices on the client side
(an illustrative usage example follows the layout below):

  /sys/kernel/ibnbd_client/
    |- map_device
    |  *** maps remote device
    |
    |- devices/
       *** all mapped devices

  /sys/block/ibnbd<N>/ibnbd_client/
    |- unmap_device
    |  *** unmaps device
    |
    |- state
    |  *** device state
    |
    |- session
    |  *** session name
    |
    |- mapping_path
       *** path of the dev that was mapped on server
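
For illustration only (the session name, address, device path and the
resulting ibnbd0 device name below are made up), mapping, checking and
unmapping a device through this interface is expected to look roughly
like the usage strings printed by map_device and unmap_device:

    echo "sessname=mysess path=ip:192.168.122.190 \
          device_path=/dev/nullb0 access_mode=rw" > \
          /sys/kernel/ibnbd_client/map_device

    cat /sys/block/ibnbd0/ibnbd_client/state

    echo "normal" > /sys/block/ibnbd0/ibnbd_client/unmap_device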

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-clt-sysfs.c | 723 ++++++++++++++++++++++++++++++++++
 1 file changed, 723 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-clt-sysfs.c b/drivers/block/ibnbd/ibnbd-clt-sysfs.c
new file mode 100644
index 000000000000..2770b5c81c23
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt-sysfs.c
@@ -0,0 +1,723 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/parser.h>
+#include <linux/module.h>
+#include <linux/in6.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <rdma/ib.h>
+#include <rdma/rdma_cm.h>
+
+#include "ibnbd-clt.h"
+
+static struct kobject *ibnbd_kobject;
+static struct kobject *ibnbd_devices_kobject;
+
+enum {
+	IBNBD_OPT_ERR		= 0,
+	IBNBD_OPT_PATH		= 1 << 0,
+	IBNBD_OPT_DEV_PATH	= 1 << 1,
+	IBNBD_OPT_ACCESS_MODE	= 1 << 3,
+	IBNBD_OPT_INPUT_MODE	= 1 << 4,
+	IBNBD_OPT_IO_MODE	= 1 << 5,
+	IBNBD_OPT_SESSNAME	= 1 << 6,
+};
+
+static unsigned int ibnbd_opt_mandatory[] = {
+	IBNBD_OPT_PATH,
+	IBNBD_OPT_DEV_PATH,
+	IBNBD_OPT_SESSNAME,
+};
+
+static const match_table_t ibnbd_opt_tokens = {
+	{	IBNBD_OPT_PATH,		"path=%s"		},
+	{	IBNBD_OPT_DEV_PATH,	"device_path=%s"	},
+	{	IBNBD_OPT_ACCESS_MODE,	"access_mode=%s"	},
+	{	IBNBD_OPT_INPUT_MODE,	"input_mode=%s"		},
+	{	IBNBD_OPT_IO_MODE,	"io_mode=%s"		},
+	{	IBNBD_OPT_SESSNAME,	"sessname=%s"		},
+	{	IBNBD_OPT_ERR,		NULL			},
+};
+
+/* remove newline characters from the string */
+static void strip(char *s)
+{
+	char *p = s;
+
+	while (*s != '\0') {
+		if (*s != '\n')
+			*p++ = *s++;
+		else
+			++s;
+	}
+	*p = '\0';
+}
+
+static int ibnbd_clt_parse_map_options(const char *buf,
+				       char *sessname,
+				       struct ibtrs_addr *paths,
+				       size_t *path_cnt,
+				       size_t max_path_cnt,
+				       char *pathname,
+				       enum ibnbd_access_mode *access_mode,
+				       enum ibnbd_queue_mode *queue_mode,
+				       enum ibnbd_io_mode *io_mode)
+{
+	char *options, *sep_opt;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int opt_mask = 0;
+	int token;
+	int ret = -EINVAL;
+	int i;
+	int p_cnt = 0;
+
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	options = strstrip(options);
+	strip(options);
+	sep_opt = options;
+	while ((p = strsep(&sep_opt, " ")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, ibnbd_opt_tokens, args);
+		opt_mask |= token;
+
+		switch (token) {
+		case IBNBD_OPT_SESSNAME:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (strlen(p) > NAME_MAX) {
+				pr_err("map_device: sessname too long\n");
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			strlcpy(sessname, p, NAME_MAX);
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_PATH:
+			p = match_strdup(args);
+			if (!p || p_cnt >= max_path_cnt) {
+				kfree(p);
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			ret = ibtrs_addr_to_sockaddr(p, strlen(p), IBTRS_PORT,
+						     &paths[p_cnt]);
+			if (ret) {
+				pr_err("Can't parse path %s: %d\n", p, ret);
+				kfree(p);
+				goto out;
+			}
+
+			p_cnt++;
+
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_DEV_PATH:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (strlen(p) > NAME_MAX) {
+				pr_err("map_device: Device path too long\n");
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			strlcpy(pathname, p, NAME_MAX);
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_ACCESS_MODE:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			if (!strcmp(p, "ro")) {
+				*access_mode = IBNBD_ACCESS_RO;
+			} else if (!strcmp(p, "rw")) {
+				*access_mode = IBNBD_ACCESS_RW;
+			} else if (!strcmp(p, "migration")) {
+				*access_mode = IBNBD_ACCESS_MIGRATION;
+			} else {
+				pr_err("map_device: Invalid access_mode:"
+				       " '%s'\n", p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_INPUT_MODE:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (!strcmp(p, "mq")) {
+				*queue_mode = BLK_MQ;
+			} else if (!strcmp(p, "rq")) {
+				*queue_mode = BLK_RQ;
+			} else {
+				pr_err("map_device: Invalid input_mode: "
+				       "'%s'.\n", p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			kfree(p);
+			break;
+
+		case IBNBD_OPT_IO_MODE:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (!strcmp(p, "blockio")) {
+				*io_mode = IBNBD_BLOCKIO;
+			} else if (!strcmp(p, "fileio")) {
+				*io_mode = IBNBD_FILEIO;
+			} else {
+				pr_err("map_device: Invalid io_mode: '%s'.\n",
+				       p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			kfree(p);
+			break;
+
+		default:
+			pr_err("map_device: Unknown parameter or missing value"
+			       " '%s'\n", p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(ibnbd_opt_mandatory); i++) {
+		if ((opt_mask & ibnbd_opt_mandatory[i])) {
+			ret = 0;
+		} else {
+			pr_err("map_device: Parameters missing\n");
+			ret = -EINVAL;
+			break;
+		}
+	}
+
+out:
+	*path_cnt = p_cnt;
+	kfree(options);
+	return ret;
+}
+
+static ssize_t ibnbd_clt_state_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	switch (dev->dev_state) {
+	case (DEV_STATE_INIT):
+		return scnprintf(page, PAGE_SIZE, "init\n");
+	case (DEV_STATE_MAPPED):
+		/* TODO fix cli tool before changing to proper state */
+		return scnprintf(page, PAGE_SIZE, "open\n");
+	case (DEV_STATE_MAPPED_DISCONNECTED):
+		/* TODO fix cli tool before changing to proper state */
+		return scnprintf(page, PAGE_SIZE, "closed\n");
+	case (DEV_STATE_UNMAPPED):
+		return scnprintf(page, PAGE_SIZE, "unmapped\n");
+	default:
+		return scnprintf(page, PAGE_SIZE, "unknown\n");
+	}
+}
+
+static struct kobj_attribute ibnbd_clt_state_attr =
+	__ATTR(state, 0444, ibnbd_clt_state_show, NULL);
+
+static ssize_t ibnbd_clt_input_mode_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 ibnbd_queue_mode_str(dev->queue_mode));
+}
+
+static struct kobj_attribute ibnbd_clt_input_mode_attr =
+	__ATTR(input_mode, 0444, ibnbd_clt_input_mode_show, NULL);
+
+static ssize_t ibnbd_clt_mapping_path_show(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", dev->pathname);
+}
+
+static struct kobj_attribute ibnbd_clt_mapping_path_attr =
+	__ATTR(mapping_path, 0444, ibnbd_clt_mapping_path_show, NULL);
+
+static ssize_t ibnbd_clt_io_mode_show(struct kobject *kobj,
+				      struct kobj_attribute *attr, char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 ibnbd_io_mode_str(dev->remote_io_mode));
+}
+
+static struct kobj_attribute ibnbd_clt_io_mode =
+	__ATTR(io_mode, 0444, ibnbd_clt_io_mode_show, NULL);
+
+static ssize_t ibnbd_clt_unmap_dev_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo <normal|force> > %s\n",
+			 attr->attr.name);
+}
+
+void ibnbd_sysfs_remove_file_self(struct kobject *kobj,
+				  const struct attribute *attr)
+{
+	struct device_attribute dattr = {
+		.attr.name = attr->name
+	};
+	struct device *device;
+
+	/*
+	 * Unfortunately the original sysfs_remove_file_self() is not
+	 * exported, so consider this a hack that triggers self removal
+	 * of a sysfs entry through another "door".
+	 */
+
+	device = container_of(kobj, typeof(*device), kobj);
+	device_remove_file_self(device, &dattr);
+}
+
+static ssize_t ibnbd_clt_unmap_dev_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct ibnbd_clt_dev *dev;
+	char *options;
+	bool force;
+	int err;
+
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	options = strstrip(options);
+	strip(options);
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	if (sysfs_streq(options, "normal")) {
+		force = false;
+	} else if (sysfs_streq(options, "force")) {
+		force = true;
+	} else {
+		ibnbd_err(dev, "unmap_device: Invalid value: %s\n", options);
+		err = -EINVAL;
+		goto out;
+	}
+
+	ibnbd_info(dev, "Unmapping device, option: %s.\n",
+		   force ? "force" : "normal");
+
+	/*
+	 * We take an explicit module reference for one reason only: to avoid
+	 * racing with the lockless ibnbd_destroy_sessions().
+	 */
+	if (!try_module_get(THIS_MODULE)) {
+		err = -ENODEV;
+		goto out;
+	}
+	err = ibnbd_clt_unmap_device(dev, force, &attr->attr);
+	if (unlikely(err)) {
+		if (unlikely(err != -EALREADY))
+			ibnbd_err(dev, "unmap_device: %d\n",  err);
+		goto module_put;
+	}
+
+	/*
+	 * At this point the device may already be gone.
+	 */
+
+	err = count;
+
+module_put:
+	module_put(THIS_MODULE);
+out:
+	kfree(options);
+
+	return err;
+}
+
+static struct kobj_attribute ibnbd_clt_unmap_device_attr =
+	__ATTR(unmap_device, 0644, ibnbd_clt_unmap_dev_show,
+	       ibnbd_clt_unmap_dev_store);
+
+static ssize_t ibnbd_clt_resize_dev_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE,
+			 "Usage: echo <new size in sectors> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibnbd_clt_resize_dev_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int ret;
+	unsigned long sectors;
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	ret = kstrtoul(buf, 0, &sectors);
+	if (ret)
+		return ret;
+
+	ret = ibnbd_clt_resize_disk(dev, (size_t)sectors);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute ibnbd_clt_resize_dev_attr =
+	__ATTR(resize, 0644, ibnbd_clt_resize_dev_show,
+	       ibnbd_clt_resize_dev_store);
+
+static ssize_t ibnbd_clt_remap_dev_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo <1> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t ibnbd_clt_remap_dev_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct ibnbd_clt_dev *dev;
+	char *options;
+	int err;
+
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	options = strstrip(options);
+	strip(options);
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+	if (!sysfs_streq(options, "1")) {
+		ibnbd_err(dev, "remap_device: Invalid value: %s\n", options);
+		err = -EINVAL;
+		goto out;
+	}
+	err = ibnbd_clt_remap_device(dev);
+	if (likely(!err))
+		err = count;
+
+out:
+	kfree(options);
+
+	return err;
+}
+
+static struct kobj_attribute ibnbd_clt_remap_device_attr =
+	__ATTR(remap_device, 0644, ibnbd_clt_remap_dev_show,
+	       ibnbd_clt_remap_dev_store);
+
+static ssize_t ibnbd_clt_session_show(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      char *page)
+{
+	struct ibnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct ibnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", dev->sess->sessname);
+}
+
+static struct kobj_attribute ibnbd_clt_session_attr =
+	__ATTR(session, 0444, ibnbd_clt_session_show, NULL);
+
+static struct attribute *ibnbd_dev_attrs[] = {
+	&ibnbd_clt_unmap_device_attr.attr,
+	&ibnbd_clt_resize_dev_attr.attr,
+	&ibnbd_clt_remap_device_attr.attr,
+	&ibnbd_clt_mapping_path_attr.attr,
+	&ibnbd_clt_state_attr.attr,
+	&ibnbd_clt_input_mode_attr.attr,
+	&ibnbd_clt_session_attr.attr,
+	&ibnbd_clt_io_mode.attr,
+	NULL,
+};
+
+void ibnbd_clt_remove_dev_symlink(struct ibnbd_clt_dev *dev)
+{
+	/*
+	 * The module_is_live() check is crucial: it avoids the annoying
+	 * sysfs warning raised by sysfs_remove_link() when the whole sysfs
+	 * path has just been removed, see ibnbd_destroy_sessions().
+	 */
+	if (strlen(dev->blk_symlink_name) && module_is_live(THIS_MODULE))
+		sysfs_remove_link(ibnbd_devices_kobject, dev->blk_symlink_name);
+}
+
+static struct kobj_type ibnbd_dev_ktype = {
+	.sysfs_ops      = &kobj_sysfs_ops,
+	.default_attrs  = ibnbd_dev_attrs,
+};
+
+static int ibnbd_clt_add_dev_kobj(struct ibnbd_clt_dev *dev)
+{
+	int ret;
+	struct kobject *gd_kobj = &disk_to_dev(dev->gd)->kobj;
+
+	ret = kobject_init_and_add(&dev->kobj, &ibnbd_dev_ktype, gd_kobj, "%s",
+				   "ibnbd");
+	if (ret)
+		ibnbd_err(dev, "Failed to create device sysfs dir, err: %d\n",
+			  ret);
+
+	return ret;
+}
+
+static ssize_t ibnbd_clt_map_device_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo \""
+			 "sessname=<name of the ibtrs session>"
+			 " path=<[srcaddr,]dstaddr>"
+			 " [path=<[srcaddr,]dstaddr>]"
+			 " device_path=<full path on remote side>"
+			 " [access_mode=<ro|rw|migration>]"
+			 " [input_mode=<mq|rq>]"
+			 " [io_mode=<fileio|blockio>]\" > %s\n\n"
+			 "addr ::= [ ip:<ipv4> | ip:<ipv6> | gid:<gid> ]\n",
+			 attr->attr.name);
+}
+
+static int ibnbd_clt_get_path_name(struct ibnbd_clt_dev *dev, char *buf,
+				   size_t len)
+{
+	int ret;
+	char pathname[NAME_MAX], *s;
+
+	strlcpy(pathname, dev->pathname, sizeof(pathname));
+	while ((s = strchr(pathname, '/')))
+		s[0] = '!';
+
+	ret = snprintf(buf, len, "%s", pathname);
+	if (ret >= len)
+		return -ENAMETOOLONG;
+
+	return 0;
+}
+
+static int ibnbd_clt_add_dev_symlink(struct ibnbd_clt_dev *dev)
+{
+	struct kobject *gd_kobj = &disk_to_dev(dev->gd)->kobj;
+	int ret;
+
+	ret = ibnbd_clt_get_path_name(dev, dev->blk_symlink_name,
+				      sizeof(dev->blk_symlink_name));
+	if (ret) {
+		ibnbd_err(dev, "Failed to get /sys/block symlink path, err: %d\n",
+			  ret);
+		goto out_err;
+	}
+
+	ret = sysfs_create_link(ibnbd_devices_kobject, gd_kobj,
+				dev->blk_symlink_name);
+	if (ret) {
+		ibnbd_err(dev, "Creating /sys/block symlink failed, err: %d\n",
+			  ret);
+		goto out_err;
+	}
+
+	return 0;
+
+out_err:
+	dev->blk_symlink_name[0] = '\0';
+	return ret;
+}
+
+static ssize_t ibnbd_clt_map_device_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct ibnbd_clt_dev *dev;
+	int ret;
+	char pathname[NAME_MAX];
+	char sessname[NAME_MAX];
+	enum ibnbd_access_mode access_mode = IBNBD_ACCESS_RW;
+	enum ibnbd_queue_mode queue_mode = BLK_MQ;
+	enum ibnbd_io_mode io_mode = IBNBD_AUTOIO;
+
+	size_t path_cnt;
+	struct ibtrs_addr paths[3];
+	struct sockaddr_storage saddr[ARRAY_SIZE(paths)];
+	struct sockaddr_storage daddr[ARRAY_SIZE(paths)];
+
+	for (path_cnt = 0; path_cnt < ARRAY_SIZE(paths); path_cnt++) {
+		paths[path_cnt].src = (struct sockaddr *)&saddr[path_cnt];
+		paths[path_cnt].dst = (struct sockaddr *)&daddr[path_cnt];
+	}
+
+	ret = ibnbd_clt_parse_map_options(buf, sessname, paths,
+					  &path_cnt, ARRAY_SIZE(paths),
+					  pathname, &access_mode,
+					  &queue_mode, &io_mode);
+	if (ret)
+		return ret;
+
+	pr_info("Mapping device %s on session %s,"
+		" (access_mode: %s, input_mode: %s, io_mode: %s)\n",
+		pathname, sessname, ibnbd_access_mode_str(access_mode),
+		ibnbd_queue_mode_str(queue_mode), ibnbd_io_mode_str(io_mode));
+
+	dev = ibnbd_clt_map_device(sessname, paths, path_cnt, pathname,
+				   access_mode, queue_mode, io_mode);
+	if (unlikely(IS_ERR(dev)))
+		return PTR_ERR(dev);
+
+	ret = ibnbd_clt_add_dev_kobj(dev);
+	if (unlikely(ret))
+		goto unmap_dev;
+
+	ret = ibnbd_clt_add_dev_symlink(dev);
+	if (ret)
+		goto unmap_dev;
+
+	return count;
+
+unmap_dev:
+	ibnbd_clt_unmap_device(dev, true, NULL);
+
+	return ret;
+}
+
+static struct kobj_attribute ibnbd_clt_map_device_attr =
+	__ATTR(map_device, 0644,
+	       ibnbd_clt_map_device_show, ibnbd_clt_map_device_store);
+
+static struct attribute *default_attrs[] = {
+	&ibnbd_clt_map_device_attr.attr,
+	NULL,
+};
+
+static struct attribute_group default_attr_group = {
+	.attrs = default_attrs,
+};
+
+int ibnbd_clt_create_sysfs_files(void)
+{
+	int err = 0;
+
+	ibnbd_kobject = kobject_create_and_add(KBUILD_MODNAME, kernel_kobj);
+	if (!ibnbd_kobject) {
+		err = -ENOMEM;
+		goto err1;
+	}
+
+	ibnbd_devices_kobject = kobject_create_and_add("devices",
+						       ibnbd_kobject);
+	if (!ibnbd_devices_kobject) {
+		err = -ENOMEM;
+		goto err2;
+	}
+
+	err = sysfs_create_group(ibnbd_kobject, &default_attr_group);
+	if (err)
+		goto err3;
+
+	return 0;
+
+err3:
+	kobject_put(ibnbd_devices_kobject);
+err2:
+	kobject_put(ibnbd_kobject);
+err1:
+	return err;
+}
+
+void ibnbd_clt_destroy_default_group(void)
+{
+	sysfs_remove_group(ibnbd_kobject, &default_attr_group);
+}
+
+void ibnbd_clt_destroy_sysfs_files(void)
+{
+	kobject_del(ibnbd_devices_kobject);
+	kobject_put(ibnbd_devices_kobject);
+	kobject_del(ibnbd_kobject);
+	kobject_put(ibnbd_kobject);
+}
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 18/24] ibnbd: server: private header with server structs and functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (16 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 17/24] ibnbd: client: sysfs interface functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:08 ` [PATCH 19/24] ibnbd: server: main functionality Roman Pen
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This header describes the main structs and functions used by the
ibnbd-server module, namely the structs for managing sessions from
different clients and the mapped (opened) devices.
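
A device opened by a client session is represented by a struct
ibnbd_srv_sess_dev, which is linked into both the per-session and the
per-device lists. As a minimal sketch of walking the per-session list
(assuming srv_sess points to a valid struct ibnbd_srv_session, and
mirroring how ibnbd-srv.c iterates it in a later patch):

	struct ibnbd_srv_sess_dev *sess_dev;

	/* all devices opened by this client session */
	list_for_each_entry(sess_dev, &srv_sess->sess_dev_list, sess_list)
		pr_info("%s opened %s\n", srv_sess->sessname,
			sess_dev->pathname);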

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-srv.h | 100 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-srv.h b/drivers/block/ibnbd/ibnbd-srv.h
new file mode 100644
index 000000000000..191a1650bc1d
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.h
@@ -0,0 +1,100 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_SRV_H
+#define IBNBD_SRV_H
+
+#include <linux/types.h>
+#include <linux/idr.h>
+#include <linux/kref.h>
+
+#include "ibtrs.h"
+#include "ibnbd-proto.h"
+#include "ibnbd-log.h"
+
+struct ibnbd_srv_session {
+	/* Entry inside global sess_list */
+	struct list_head        list;
+	struct ibtrs_srv	*ibtrs;
+	char			sessname[NAME_MAX];
+	int			queue_depth;
+	struct bio_set		*sess_bio_set;
+
+	rwlock_t                index_lock ____cacheline_aligned;
+	struct idr              index_idr;
+	/* List of struct ibnbd_srv_sess_dev */
+	struct list_head        sess_dev_list;
+	struct mutex		lock;
+	u8			ver;
+};
+
+struct ibnbd_srv_dev {
+	/* Entry inside global dev_list */
+	struct list_head                list;
+	struct kobject                  dev_kobj;
+	struct kobject                  dev_sessions_kobj;
+	struct kref                     kref;
+	char				id[NAME_MAX];
+	/* List of ibnbd_srv_sess_dev structs */
+	struct list_head		sess_dev_list;
+	struct mutex			lock;
+	int				open_write_cnt;
+	enum ibnbd_io_mode		mode;
+};
+
+/* Structure which binds N devices and N sessions */
+struct ibnbd_srv_sess_dev {
+	/* Entry inside ibnbd_srv_dev struct */
+	struct list_head		dev_list;
+	/* Entry inside ibnbd_srv_session struct */
+	struct list_head		sess_list;
+	struct ibnbd_dev		*ibnbd_dev;
+	struct ibnbd_srv_session        *sess;
+	struct ibnbd_srv_dev		*dev;
+	struct kobject                  kobj;
+	struct completion		*sysfs_release_compl;
+	u32                             device_id;
+	fmode_t                         open_flags;
+	struct kref			kref;
+	struct completion               *destroy_comp;
+	char				pathname[NAME_MAX];
+};
+
+/* ibnbd-srv-sysfs.c */
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+			       struct block_device *bdev,
+			       const char *dir_name);
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev);
+int ibnbd_srv_create_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+void ibnbd_srv_destroy_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+int ibnbd_srv_create_sysfs_files(void);
+void ibnbd_srv_destroy_sysfs_files(void);
+
+#endif /* IBNBD_SRV_H */
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 19/24] ibnbd: server: main functionality
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (17 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 18/24] ibnbd: server: private header with server structs and functions Roman Pen
@ 2018-02-02 14:08 ` Roman Pen
  2018-02-02 14:09 ` [PATCH 20/24] ibnbd: server: functionality for IO submission to file or block dev Roman Pen
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:08 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the main functionality of the ibnbd-server module: it handles
IBTRS events and IBNBD protocol requests, such as map (open) or unmap
(close) device.  The server side is also responsible for processing
incoming IBTRS IO requests and forwarding them to the locally mapped
devices.
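
The server registers two callbacks with the IBTRS server library: one
for link (session) events and one for RDMA requests carrying the IBNBD
protocol messages. A minimal sketch of that registration, mirroring
ibnbd_srv_init_module() below:

	struct ibtrs_srv_ctx *ctx;

	ctx = ibtrs_srv_open(ibnbd_srv_rdma_ev, ibnbd_srv_link_ev,
			     IBTRS_PORT);
	if (IS_ERR(ctx))
		return PTR_ERR(ctx);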

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-srv.c | 901 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 901 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-srv.c b/drivers/block/ibnbd/ibnbd-srv.c
new file mode 100644
index 000000000000..a32d22ab67a3
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.c
@@ -0,0 +1,901 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/blkdev.h>
+
+#include "ibnbd-srv.h"
+#include "ibnbd-srv-dev.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_VERSION(IBNBD_VER_STRING);
+MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_DEV_SEARCH_PATH "/"
+
+static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
+
+static int dev_search_path_set(const char *val, const struct kernel_param *kp)
+{
+	char *dup;
+
+	if (strlen(val) >= sizeof(dev_search_path))
+		return -EINVAL;
+
+	dup = kstrdup(val, GFP_KERNEL);
+	if (!dup)
+		return -ENOMEM;
+
+	if (dup[strlen(dup) - 1] == '\n')
+		dup[strlen(dup) - 1] = '\0';
+
+	strlcpy(dev_search_path, dup, sizeof(dev_search_path));
+
+	kfree(dup);
+	pr_info("dev_search_path changed to '%s'\n", dev_search_path);
+
+	return 0;
+}
+
+static struct kparam_string dev_search_path_kparam_str = {
+	.maxlen	= sizeof(dev_search_path),
+	.string	= dev_search_path
+};
+
+static const struct kernel_param_ops dev_search_path_ops = {
+	.set	= dev_search_path_set,
+	.get	= param_get_string,
+};
+
+module_param_cb(dev_search_path, &dev_search_path_ops,
+		&dev_search_path_kparam_str, 0444);
+MODULE_PARM_DESC(dev_search_path, "Sets the device_search_path."
+		 " When a device is mapped this path is prepended to the"
+		 " device_path from the map_device operation."
+		 " (default: " DEFAULT_DEV_SEARCH_PATH ")");
+
+static int def_io_mode = IBNBD_BLOCKIO;
+module_param(def_io_mode, int, 0444);
+MODULE_PARM_DESC(def_io_mode, "By default, export devices in"
+		 " blockio(" __stringify(_IBNBD_BLOCKIO) ") or"
+		 " fileio(" __stringify(_IBNBD_FILEIO) ") mode."
+		 " (default: " __stringify(_IBNBD_BLOCKIO) " (blockio))");
+
+static DEFINE_MUTEX(sess_lock);
+static DEFINE_SPINLOCK(dev_lock);
+
+static LIST_HEAD(sess_list);
+static LIST_HEAD(dev_list);
+
+struct ibnbd_io_private {
+	struct ibtrs_srv_op		*id;
+	struct ibnbd_srv_sess_dev	*sess_dev;
+};
+
+static void ibnbd_sess_dev_release(struct kref *kref)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kref, struct ibnbd_srv_sess_dev, kref);
+	complete(sess_dev->destroy_comp);
+}
+
+static inline void ibnbd_put_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	kref_put(&sess_dev->kref, ibnbd_sess_dev_release);
+}
+
+static void ibnbd_endio(void *priv, int error)
+{
+	struct ibnbd_io_private *ibnbd_priv = priv;
+	struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
+
+	ibnbd_put_sess_dev(sess_dev);
+
+	ibtrs_srv_resp_rdma(ibnbd_priv->id, error);
+
+	kfree(priv);
+}
+
+static struct ibnbd_srv_sess_dev *
+ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+	int ret = 0;
+
+	read_lock(&srv_sess->index_lock);
+	sess_dev = idr_find(&srv_sess->index_idr, dev_id);
+	if (likely(sess_dev))
+		ret = kref_get_unless_zero(&sess_dev->kref);
+	read_unlock(&srv_sess->index_lock);
+
+	if (unlikely(!sess_dev || !ret))
+		return ERR_PTR(-ENXIO);
+
+	return sess_dev;
+}
+
+static int process_rdma(struct ibtrs_srv *sess,
+			struct ibnbd_srv_session *srv_sess,
+			struct ibtrs_srv_op *id, void *data, u32 datalen,
+			const void *usr, size_t usrlen)
+{
+	const struct ibnbd_msg_io *msg = usr;
+	struct ibnbd_io_private *priv;
+	struct ibnbd_srv_sess_dev *sess_dev;
+	u32 dev_id;
+	int err;
+
+	priv = kmalloc(sizeof(*priv), GFP_KERNEL);
+	if (unlikely(!priv))
+		return -ENOMEM;
+
+	dev_id = le32_to_cpu(msg->device_id);
+
+	sess_dev = ibnbd_get_sess_dev(dev_id, srv_sess);
+	if (unlikely(IS_ERR(sess_dev))) {
+		pr_err_ratelimited("Got I/O request on session %s for "
+				   "unknown device id %d\n",
+				   srv_sess->sessname, dev_id);
+		err = -ENOTCONN;
+		goto err;
+	}
+
+	priv->sess_dev = sess_dev;
+	priv->id = id;
+
+	err = ibnbd_dev_submit_io(sess_dev->ibnbd_dev, le64_to_cpu(msg->sector),
+				  data, datalen, le32_to_cpu(msg->bi_size),
+				  le32_to_cpu(msg->rw), priv);
+	if (unlikely(err)) {
+		ibnbd_err(sess_dev,
+			  "Submitting I/O to device failed, err: %d\n", err);
+		goto sess_dev_put;
+	}
+
+	return 0;
+
+sess_dev_put:
+	ibnbd_put_sess_dev(sess_dev);
+err:
+	kfree(priv);
+	return err;
+}
+
+static void destroy_device(struct ibnbd_srv_dev *dev)
+{
+	WARN(!list_empty(&dev->sess_dev_list),
+	     "Device %s is being destroyed but still in use!\n",
+	     dev->id);
+
+	spin_lock(&dev_lock);
+	list_del(&dev->list);
+	spin_unlock(&dev_lock);
+
+	if (dev->dev_kobj.state_in_sysfs)
+		/*
+		 * Destroy the kobj only if it was really created.
+		 * The following call must be synchronous, because
+		 * we free the memory afterwards.
+		 */
+		ibnbd_srv_destroy_dev_sysfs(dev);
+
+	kfree(dev);
+}
+
+static void destroy_device_cb(struct kref *kref)
+{
+	struct ibnbd_srv_dev *dev;
+
+	dev = container_of(kref, struct ibnbd_srv_dev, kref);
+
+	destroy_device(dev);
+}
+
+static void ibnbd_put_srv_dev(struct ibnbd_srv_dev *dev)
+{
+	kref_put(&dev->kref, destroy_device_cb);
+}
+
+static void ibnbd_destroy_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	DECLARE_COMPLETION_ONSTACK(dc);
+
+	write_lock(&sess_dev->sess->index_lock);
+	idr_remove(&sess_dev->sess->index_idr, sess_dev->device_id);
+	write_unlock(&sess_dev->sess->index_lock);
+
+	sess_dev->destroy_comp = &dc;
+	ibnbd_put_sess_dev(sess_dev);
+	wait_for_completion(&dc);
+
+	ibnbd_dev_close(sess_dev->ibnbd_dev);
+	list_del(&sess_dev->sess_list);
+	mutex_lock(&sess_dev->dev->lock);
+	list_del(&sess_dev->dev_list);
+	if (sess_dev->open_flags & FMODE_WRITE)
+		sess_dev->dev->open_write_cnt--;
+	mutex_unlock(&sess_dev->dev->lock);
+
+	ibnbd_put_srv_dev(sess_dev->dev);
+
+	ibnbd_info(sess_dev, "Device closed\n");
+	kfree(sess_dev);
+}
+
+static void destroy_sess(struct ibnbd_srv_session *srv_sess)
+{
+	struct ibnbd_srv_sess_dev *sess_dev, *tmp;
+
+	if (list_empty(&srv_sess->sess_dev_list))
+		goto out;
+
+	mutex_lock(&srv_sess->lock);
+	list_for_each_entry_safe(sess_dev, tmp, &srv_sess->sess_dev_list,
+				 sess_list) {
+		ibnbd_srv_destroy_dev_session_sysfs(sess_dev);
+		ibnbd_destroy_sess_dev(sess_dev);
+	}
+	mutex_unlock(&srv_sess->lock);
+
+out:
+	idr_destroy(&srv_sess->index_idr);
+	bioset_free(srv_sess->sess_bio_set);
+
+	pr_info("IBTRS Session %s disconnected\n", srv_sess->sessname);
+
+	mutex_lock(&sess_lock);
+	list_del(&srv_sess->list);
+	mutex_unlock(&sess_lock);
+
+	kfree(srv_sess);
+}
+
+static int create_sess(struct ibtrs_srv *ibtrs)
+{
+	struct ibnbd_srv_session *srv_sess;
+	char sessname[NAME_MAX];
+	int err;
+
+	err = ibtrs_srv_get_sess_name(ibtrs, sessname, sizeof(sessname));
+	if (unlikely(err)) {
+		pr_err("ibtrs_srv_get_sess_name(%s): %d\n", sessname, err);
+
+		return err;
+	}
+	srv_sess = kzalloc(sizeof(*srv_sess), GFP_KERNEL);
+	if (!srv_sess)
+		return -ENOMEM;
+	srv_sess->queue_depth = ibtrs_srv_get_queue_depth(ibtrs);
+	srv_sess->sess_bio_set = bioset_create(srv_sess->queue_depth, 0,
+					       BIOSET_NEED_BVECS);
+	if (!srv_sess->sess_bio_set) {
+		pr_err("Allocating srv_session for session %s failed\n",
+		       sessname);
+		kfree(srv_sess);
+		return -ENOMEM;
+	}
+
+	idr_init(&srv_sess->index_idr);
+	rwlock_init(&srv_sess->index_lock);
+	INIT_LIST_HEAD(&srv_sess->sess_dev_list);
+	mutex_init(&srv_sess->lock);
+	mutex_lock(&sess_lock);
+	list_add(&srv_sess->list, &sess_list);
+	mutex_unlock(&sess_lock);
+
+	srv_sess->ibtrs = ibtrs;
+	srv_sess->queue_depth = ibtrs_srv_get_queue_depth(ibtrs);
+	strlcpy(srv_sess->sessname, sessname, sizeof(srv_sess->sessname));
+
+	ibtrs_srv_set_sess_priv(ibtrs, srv_sess);
+
+	return 0;
+}
+
+static int ibnbd_srv_link_ev(struct ibtrs_srv *ibtrs,
+			     enum ibtrs_srv_link_ev ev, void *priv)
+{
+	struct ibnbd_srv_session *srv_sess = priv;
+
+	switch (ev) {
+	case IBTRS_SRV_LINK_EV_CONNECTED:
+		return create_sess(ibtrs);
+
+	case IBTRS_SRV_LINK_EV_DISCONNECTED:
+		if (WARN_ON(!srv_sess))
+			return -EINVAL;
+
+		destroy_sess(srv_sess);
+		return 0;
+
+	default:
+		pr_warn("Received unknown IBTRS session event %d from session"
+			" %s\n", ev, srv_sess->sessname);
+		return -EINVAL;
+	}
+}
+
+static int process_msg_close(struct ibtrs_srv *ibtrs,
+			     struct ibnbd_srv_session *srv_sess,
+			     void *data, size_t datalen, const void *usr,
+			     size_t usrlen)
+{
+	const struct ibnbd_msg_close *close_msg = usr;
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = ibnbd_get_sess_dev(le32_to_cpu(close_msg->device_id),
+				      srv_sess);
+	if (unlikely(IS_ERR(sess_dev)))
+		return 0;
+
+	ibnbd_srv_destroy_dev_session_sysfs(sess_dev);
+	ibnbd_put_sess_dev(sess_dev);
+	mutex_lock(&srv_sess->lock);
+	ibnbd_destroy_sess_dev(sess_dev);
+	mutex_unlock(&srv_sess->lock);
+	return 0;
+}
+
+static int process_msg_open(struct ibtrs_srv *ibtrs,
+			    struct ibnbd_srv_session *srv_sess,
+			    const void *msg, size_t len,
+			    void *data, size_t datalen);
+
+static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
+				 struct ibnbd_srv_session *srv_sess,
+				 const void *msg, size_t len,
+				 void *data, size_t datalen);
+
+static int ibnbd_srv_rdma_ev(struct ibtrs_srv *ibtrs, void *priv,
+			     struct ibtrs_srv_op *id, int dir,
+			     void *data, size_t datalen, const void *usr,
+			     size_t usrlen)
+{
+	struct ibnbd_srv_session *srv_sess = priv;
+	const struct ibnbd_msg_hdr *hdr = usr;
+	int ret = 0;
+	u16 type;
+
+	if (unlikely(WARN_ON(!srv_sess)))
+		return -ENODEV;
+
+	type = le16_to_cpu(hdr->type);
+
+	switch (type) {
+	case IBNBD_MSG_IO:
+		return process_rdma(ibtrs, srv_sess, id, data, datalen, usr,
+				    usrlen);
+	case IBNBD_MSG_CLOSE:
+		ret = process_msg_close(ibtrs, srv_sess, data, datalen,
+					usr, usrlen);
+		break;
+	case IBNBD_MSG_OPEN:
+		ret = process_msg_open(ibtrs, srv_sess, usr, usrlen,
+				       data, datalen);
+		break;
+	case IBNBD_MSG_SESS_INFO:
+		ret = process_msg_sess_info(ibtrs, srv_sess, usr, usrlen,
+					    data, datalen);
+		break;
+	default:
+		pr_warn("Received unexpected message type %d with dir %d from"
+			" session %s\n", type, dir, srv_sess->sessname);
+		return -EINVAL;
+	}
+
+	ibtrs_srv_resp_rdma(id, ret);
+	return 0;
+}
+
+static struct ibnbd_srv_sess_dev
+*ibnbd_sess_dev_alloc(struct ibnbd_srv_session *srv_sess)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+	int error;
+
+	sess_dev = kzalloc(sizeof(*sess_dev), GFP_KERNEL);
+	if (!sess_dev)
+		return ERR_PTR(-ENOMEM);
+
+	idr_preload(GFP_KERNEL);
+	write_lock(&srv_sess->index_lock);
+
+	error = idr_alloc(&srv_sess->index_idr, sess_dev, 0, -1, GFP_NOWAIT);
+	if (error < 0) {
+		pr_warn("Allocating idr failed, err: %d\n", error);
+		goto out_unlock;
+	}
+
+	sess_dev->device_id = error;
+	error = 0;
+
+out_unlock:
+	write_unlock(&srv_sess->index_lock);
+	idr_preload_end();
+	if (error) {
+		kfree(sess_dev);
+		return ERR_PTR(error);
+	}
+
+	return sess_dev;
+}
+
+static struct ibnbd_srv_dev *ibnbd_srv_init_srv_dev(const char *id,
+						    enum ibnbd_io_mode mode)
+{
+	struct ibnbd_srv_dev *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	strlcpy(dev->id, id, sizeof(dev->id));
+	dev->mode = mode;
+	kref_init(&dev->kref);
+	INIT_LIST_HEAD(&dev->sess_dev_list);
+	mutex_init(&dev->lock);
+
+	return dev;
+}
+
+static struct ibnbd_srv_dev *
+ibnbd_srv_find_or_add_srv_dev(struct ibnbd_srv_dev *new_dev)
+{
+	struct ibnbd_srv_dev *dev;
+
+	spin_lock(&dev_lock);
+	list_for_each_entry(dev, &dev_list, list) {
+		if (!strncmp(dev->id, new_dev->id, sizeof(dev->id))) {
+			if (!kref_get_unless_zero(&dev->kref))
+				/*
+				 * We lost the race, device is almost dead.
+				 *  Continue traversing to find a valid one.
+				 */
+				continue;
+			spin_unlock(&dev_lock);
+			return dev;
+		}
+	}
+	list_add(&new_dev->list, &dev_list);
+	spin_unlock(&dev_lock);
+
+	return new_dev;
+}
+
+static int ibnbd_srv_check_update_open_perm(struct ibnbd_srv_dev *srv_dev,
+					    struct ibnbd_srv_session *srv_sess,
+					    enum ibnbd_io_mode io_mode,
+					    enum ibnbd_access_mode access_mode)
+{
+	int ret = -EPERM;
+
+	mutex_lock(&srv_dev->lock);
+
+	if (srv_dev->mode != io_mode) {
+		pr_err("Mapping device '%s' for session %s in %s mode forbidden,"
+		       " device is already mapped from other client(s) in"
+		       " %s mode\n", srv_dev->id, srv_sess->sessname,
+		       ibnbd_io_mode_str(io_mode),
+		       ibnbd_io_mode_str(srv_dev->mode));
+		goto out;
+	}
+
+	switch (access_mode) {
+	case IBNBD_ACCESS_RO:
+		ret = 0;
+		break;
+	case IBNBD_ACCESS_RW:
+		if (srv_dev->open_write_cnt == 0)  {
+			srv_dev->open_write_cnt++;
+			ret = 0;
+		} else {
+			pr_err("Mapping device '%s' for session %s with"
+			       " RW permissions failed. Device already opened"
+			       " as 'RW' by %d client(s) in %s mode.\n",
+			       srv_dev->id, srv_sess->sessname,
+			       srv_dev->open_write_cnt,
+			       ibnbd_io_mode_str(srv_dev->mode));
+		}
+		break;
+	case IBNBD_ACCESS_MIGRATION:
+		if (srv_dev->open_write_cnt < 2) {
+			srv_dev->open_write_cnt++;
+			ret = 0;
+		} else {
+			pr_err("Mapping device '%s' for session %s with"
+			       " migration permissions failed. Device already"
+			       " opened as 'RW' by %d client(s) in %s mode.\n",
+			       srv_dev->id, srv_sess->sessname,
+			       srv_dev->open_write_cnt,
+			       ibnbd_io_mode_str(srv_dev->mode));
+		}
+		break;
+	default:
+		pr_err("Received mapping request for device '%s' on session %s"
+		       " with invalid access mode: %d\n", srv_dev->id,
+		       srv_sess->sessname, access_mode);
+		ret = -EINVAL;
+	}
+
+out:
+	mutex_unlock(&srv_dev->lock);
+
+	return ret;
+}
+
+static struct ibnbd_srv_dev *
+ibnbd_srv_get_or_create_srv_dev(struct ibnbd_dev *ibnbd_dev,
+				struct ibnbd_srv_session *srv_sess,
+				enum ibnbd_io_mode io_mode,
+				enum ibnbd_access_mode access_mode)
+{
+	int ret;
+	struct ibnbd_srv_dev *new_dev, *dev;
+
+	new_dev = ibnbd_srv_init_srv_dev(ibnbd_dev->name, io_mode);
+	if (IS_ERR(new_dev))
+		return new_dev;
+
+	dev = ibnbd_srv_find_or_add_srv_dev(new_dev);
+	if (dev != new_dev)
+		kfree(new_dev);
+
+	ret = ibnbd_srv_check_update_open_perm(dev, srv_sess, io_mode,
+					       access_mode);
+	if (ret) {
+		ibnbd_put_srv_dev(dev);
+		return ERR_PTR(ret);
+	}
+
+	return dev;
+}
+
+static void ibnbd_srv_fill_msg_open_rsp(struct ibnbd_msg_open_rsp *rsp,
+					struct ibnbd_srv_sess_dev *sess_dev)
+{
+	struct ibnbd_dev *ibnbd_dev = sess_dev->ibnbd_dev;
+
+	rsp->hdr.type = cpu_to_le16(IBNBD_MSG_OPEN_RSP);
+	rsp->device_id =
+		cpu_to_le32(sess_dev->device_id);
+	rsp->nsectors =
+		cpu_to_le64(get_capacity(ibnbd_dev->bdev->bd_disk));
+	rsp->logical_block_size	=
+		cpu_to_le16(ibnbd_dev_get_logical_bsize(ibnbd_dev));
+	rsp->physical_block_size =
+		cpu_to_le16(ibnbd_dev_get_phys_bsize(ibnbd_dev));
+	rsp->max_segments =
+		cpu_to_le16(ibnbd_dev_get_max_segs(ibnbd_dev));
+	rsp->max_hw_sectors =
+		cpu_to_le32(ibnbd_dev_get_max_hw_sects(ibnbd_dev));
+	rsp->max_write_same_sectors =
+		cpu_to_le32(ibnbd_dev_get_max_write_same_sects(ibnbd_dev));
+	rsp->max_discard_sectors =
+		cpu_to_le32(ibnbd_dev_get_max_discard_sects(ibnbd_dev));
+	rsp->discard_granularity =
+		cpu_to_le32(ibnbd_dev_get_discard_granularity(ibnbd_dev));
+	rsp->discard_alignment =
+		cpu_to_le32(ibnbd_dev_get_discard_alignment(ibnbd_dev));
+	rsp->secure_discard =
+		cpu_to_le16(ibnbd_dev_get_secure_discard(ibnbd_dev));
+	rsp->rotational =
+		!blk_queue_nonrot(bdev_get_queue(ibnbd_dev->bdev));
+	rsp->io_mode =
+		ibnbd_dev->mode;
+}
+
+static struct ibnbd_srv_sess_dev *
+ibnbd_srv_create_set_sess_dev(struct ibnbd_srv_session *srv_sess,
+			      const struct ibnbd_msg_open *open_msg,
+			      struct ibnbd_dev *ibnbd_dev, fmode_t open_flags,
+			      struct ibnbd_srv_dev *srv_dev)
+{
+	struct ibnbd_srv_sess_dev *sdev = ibnbd_sess_dev_alloc(srv_sess);
+
+	if (IS_ERR(sdev))
+		return sdev;
+
+	kref_init(&sdev->kref);
+
+	strlcpy(sdev->pathname, open_msg->dev_name, sizeof(sdev->pathname));
+
+	sdev->ibnbd_dev		= ibnbd_dev;
+	sdev->sess		= srv_sess;
+	sdev->dev		= srv_dev;
+	sdev->open_flags	= open_flags;
+
+	return sdev;
+}
+
+static char *ibnbd_srv_get_full_path(const char *dev_name)
+{
+	char *full_path;
+	char *a, *b;
+
+	full_path = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!full_path)
+		return ERR_PTR(-ENOMEM);
+
+	snprintf(full_path, PATH_MAX, "%s/%s", dev_search_path, dev_name);
+
+	/* eliminate duplicated slashes */
+	a = strchr(full_path, '/');
+	b = a;
+	while (*b != '\0') {
+		if (*b == '/' && *a == '/') {
+			b++;
+		} else {
+			a++;
+			*a = *b;
+			b++;
+		}
+	}
+	a++;
+	*a = '\0';
+
+	return full_path;
+}
+
+static int process_msg_sess_info(struct ibtrs_srv *ibtrs,
+				 struct ibnbd_srv_session *srv_sess,
+				 const void *msg, size_t len,
+				 void *data, size_t datalen)
+{
+	const struct ibnbd_msg_sess_info *sess_info_msg = msg;
+	struct ibnbd_msg_sess_info_rsp *rsp = data;
+
+	srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_VER_MAJOR);
+	pr_debug("Session %s using protocol version %d (client version: %d,"
+		 " server version: %d)\n", srv_sess->sessname,
+		 srv_sess->ver, sess_info_msg->ver, IBNBD_VER_MAJOR);
+
+	rsp->hdr.type = cpu_to_le16(IBNBD_MSG_SESS_INFO_RSP);
+	rsp->ver = srv_sess->ver;
+
+	return 0;
+}
+
+/**
+ * find_srv_sess_dev() - check whether a device with this name is already open
+ *
+ * Return the struct ibnbd_srv_sess_dev if srv_sess has already opened
+ * dev_name, or NULL if the session hasn't opened the device yet.
+ */
+static struct ibnbd_srv_sess_dev *
+find_srv_sess_dev(struct ibnbd_srv_session *srv_sess, const char *dev_name)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	if (list_empty(&srv_sess->sess_dev_list))
+		return NULL;
+
+	list_for_each_entry(sess_dev, &srv_sess->sess_dev_list, sess_list)
+		if (!strcmp(sess_dev->pathname, dev_name))
+			return sess_dev;
+
+	return NULL;
+}
+
+static int process_msg_open(struct ibtrs_srv *ibtrs,
+			    struct ibnbd_srv_session *srv_sess,
+			    const void *msg, size_t len,
+			    void *data, size_t datalen)
+{
+	int ret;
+	struct ibnbd_srv_dev *srv_dev;
+	struct ibnbd_srv_sess_dev *srv_sess_dev;
+	const struct ibnbd_msg_open *open_msg = msg;
+	fmode_t open_flags;
+	char *full_path;
+	struct ibnbd_dev *ibnbd_dev;
+	enum ibnbd_io_mode io_mode;
+	struct ibnbd_msg_open_rsp *rsp = data;
+
+	pr_debug("Open message received: session='%s' path='%s' access_mode=%d"
+		 " io_mode=%d\n", srv_sess->sessname, open_msg->dev_name,
+		 open_msg->access_mode, open_msg->io_mode);
+	open_flags = FMODE_READ;
+	if (open_msg->access_mode != IBNBD_ACCESS_RO)
+		open_flags |= FMODE_WRITE;
+
+	mutex_lock(&srv_sess->lock);
+
+	srv_sess_dev = find_srv_sess_dev(srv_sess, open_msg->dev_name);
+	if (srv_sess_dev)
+		goto fill_response;
+
+	if ((strlen(dev_search_path) + strlen(open_msg->dev_name))
+	    >= PATH_MAX) {
+		pr_err("Opening device for session %s failed, device path too"
+		       " long. '%s/%s' is longer than PATH_MAX (%d)\n",
+		       srv_sess->sessname, dev_search_path, open_msg->dev_name,
+		       PATH_MAX);
+		ret = -EINVAL;
+		goto reject;
+	}
+	full_path = ibnbd_srv_get_full_path(open_msg->dev_name);
+	if (IS_ERR(full_path)) {
+		ret = PTR_ERR(full_path);
+		pr_err("Opening device '%s' for client %s failed,"
+		       " failed to get device full path, err: %d\n",
+		       open_msg->dev_name, srv_sess->sessname, ret);
+		goto reject;
+	}
+
+	if (open_msg->io_mode == IBNBD_BLOCKIO)
+		io_mode = IBNBD_BLOCKIO;
+	else if (open_msg->io_mode == IBNBD_FILEIO)
+		io_mode = IBNBD_FILEIO;
+	else
+		io_mode = def_io_mode;
+
+	ibnbd_dev = ibnbd_dev_open(full_path, open_flags, io_mode,
+				   srv_sess->sess_bio_set, ibnbd_endio);
+	if (IS_ERR(ibnbd_dev)) {
+		pr_err("Opening device '%s' on session %s failed,"
+		       " failed to open the block device, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(ibnbd_dev));
+		ret = PTR_ERR(ibnbd_dev);
+		goto free_path;
+	}
+
+	srv_dev = ibnbd_srv_get_or_create_srv_dev(ibnbd_dev, srv_sess, io_mode,
+						  open_msg->access_mode);
+	if (IS_ERR(srv_dev)) {
+		pr_err("Opening device '%s' on session %s failed,"
+		       " creating srv_dev failed, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(srv_dev));
+		ret = PTR_ERR(srv_dev);
+		goto ibnbd_dev_close;
+	}
+
+	srv_sess_dev = ibnbd_srv_create_set_sess_dev(srv_sess, open_msg,
+						     ibnbd_dev, open_flags,
+						     srv_dev);
+	if (IS_ERR(srv_sess_dev)) {
+		pr_err("Opening device '%s' on session %s failed,"
+		       " creating sess_dev failed, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(srv_sess_dev));
+		ret = PTR_ERR(srv_sess_dev);
+		goto srv_dev_put;
+	}
+
+	/*
+	 * Create the srv_dev sysfs files if they haven't been created yet.
+	 * The creation is delayed so that the sysfs files are not created
+	 * before we are sure the device can actually be opened.
+	 */
+	mutex_lock(&srv_dev->lock);
+	if (!srv_dev->dev_kobj.state_in_sysfs) {
+		ret = ibnbd_srv_create_dev_sysfs(srv_dev, ibnbd_dev->bdev,
+						 ibnbd_dev->name);
+		if (ret) {
+			mutex_unlock(&srv_dev->lock);
+			ibnbd_err(srv_sess_dev, "Opening device failed, failed to"
+				  " create device sysfs files, err: %d\n",
+				  ret);
+			goto free_srv_sess_dev;
+		}
+	}
+
+	ret = ibnbd_srv_create_dev_session_sysfs(srv_sess_dev);
+	if (ret) {
+		mutex_unlock(&srv_dev->lock);
+		ibnbd_err(srv_sess_dev, "Opening device failed, failed to create"
+			  " dev client sysfs files, err: %d\n", ret);
+		goto free_srv_sess_dev;
+	}
+
+	list_add(&srv_sess_dev->dev_list, &srv_dev->sess_dev_list);
+	mutex_unlock(&srv_dev->lock);
+
+	list_add(&srv_sess_dev->sess_list, &srv_sess->sess_dev_list);
+
+	ibnbd_info(srv_sess_dev, "Opened device '%s' in %s mode\n",
+		   srv_dev->id, ibnbd_io_mode_str(io_mode));
+
+	kfree(full_path);
+
+fill_response:
+	ibnbd_srv_fill_msg_open_rsp(rsp, srv_sess_dev);
+	mutex_unlock(&srv_sess->lock);
+	return 0;
+
+free_srv_sess_dev:
+	write_lock(&srv_sess->index_lock);
+	idr_remove(&srv_sess->index_idr, srv_sess_dev->device_id);
+	write_unlock(&srv_sess->index_lock);
+	kfree(srv_sess_dev);
+srv_dev_put:
+	if (open_msg->access_mode != IBNBD_ACCESS_RO) {
+		mutex_lock(&srv_dev->lock);
+		srv_dev->open_write_cnt--;
+		mutex_unlock(&srv_dev->lock);
+	}
+	ibnbd_put_srv_dev(srv_dev);
+ibnbd_dev_close:
+	ibnbd_dev_close(ibnbd_dev);
+free_path:
+	kfree(full_path);
+reject:
+	mutex_unlock(&srv_sess->lock);
+	return ret;
+}
+
+static struct ibtrs_srv_ctx *ibtrs_ctx;
+
+static int __init ibnbd_srv_init_module(void)
+{
+	int err;
+
+	pr_info("Loading module %s, version %s\n",
+		KBUILD_MODNAME, IBNBD_VER_STRING);
+
+	ibtrs_ctx = ibtrs_srv_open(ibnbd_srv_rdma_ev, ibnbd_srv_link_ev,
+				   IBTRS_PORT);
+	if (unlikely(IS_ERR(ibtrs_ctx))) {
+		err = PTR_ERR(ibtrs_ctx);
+		pr_err("ibtrs_srv_open(), err: %d\n", err);
+		goto out;
+	}
+	err = ibnbd_dev_init();
+	if (err) {
+		pr_err("ibnbd_dev_init(), err: %d\n", err);
+		goto srv_close;
+	}
+
+	err = ibnbd_srv_create_sysfs_files();
+	if (err) {
+		pr_err("ibnbd_srv_create_sysfs_files(), err: %d\n", err);
+		goto dev_destroy;
+	}
+
+	return 0;
+
+dev_destroy:
+	ibnbd_dev_destroy();
+srv_close:
+	ibtrs_srv_close(ibtrs_ctx);
+out:
+
+	return err;
+}
+
+static void __exit ibnbd_srv_cleanup_module(void)
+{
+	ibtrs_srv_close(ibtrs_ctx);
+	WARN_ON(!list_empty(&sess_list));
+	ibnbd_srv_destroy_sysfs_files();
+	ibnbd_dev_destroy();
+	pr_info("Module unloaded\n");
+}
+
+module_init(ibnbd_srv_init_module);
+module_exit(ibnbd_srv_cleanup_module);
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 20/24] ibnbd: server: functionality for IO submission to file or block dev
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (18 preceding siblings ...)
  2018-02-02 14:08 ` [PATCH 19/24] ibnbd: server: main functionality Roman Pen
@ 2018-02-02 14:09 ` Roman Pen
  2018-02-02 14:09 ` [PATCH 21/24] ibnbd: server: sysfs interface functions Roman Pen
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:09 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This provides helper functions for IO submission to a file or a block device.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-srv-dev.c | 410 ++++++++++++++++++++++++++++++++++++
 drivers/block/ibnbd/ibnbd-srv-dev.h | 149 +++++++++++++
 2 files changed, 559 insertions(+)
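
As a usage illustration only (not part of the patch; the function names below
are made up), a caller opens a device with ibnbd_dev_open(), feeds it I/O with
ibnbd_dev_submit_io() and calls ibnbd_dev_close() only after all outstanding
I/O has completed:

/* sketch; assumes ibnbd-srv-dev.h and ibnbd-proto.h from this series */
static void example_endio(void *priv, int error)
{
	/* complete the higher-level (transport) request tracked by @priv */
}

static struct ibnbd_dev *example_open(struct bio_set *bs)
{
	/* open in block-IO mode; errors come back as ERR_PTR() */
	return ibnbd_dev_open("/dev/ram0", FMODE_READ | FMODE_WRITE,
			      IBNBD_BLOCKIO, bs, example_endio);
}

static int example_submit(struct ibnbd_dev *dev, void *buf, size_t len,
			  enum ibnbd_io_flags flags, void *priv)
{
	/* submit @len bytes at sector 0; example_endio() runs on completion */
	return ibnbd_dev_submit_io(dev, 0, buf, len, len, flags, priv);
}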

diff --git a/drivers/block/ibnbd/ibnbd-srv-dev.c b/drivers/block/ibnbd/ibnbd-srv-dev.c
new file mode 100644
index 000000000000..a5894849b9d5
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-dev.c
@@ -0,0 +1,410 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibnbd-srv-dev.h"
+#include "ibnbd-log.h"
+
+#define IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS 0
+
+struct ibnbd_dev_file_io_work {
+	struct ibnbd_dev	*dev;
+	void			*priv;
+
+	sector_t		sector;
+	void			*data;
+	size_t			len;
+	size_t			bi_size;
+	enum ibnbd_io_flags	flags;
+
+	struct work_struct	work;
+};
+
+struct ibnbd_dev_blk_io {
+	struct ibnbd_dev *dev;
+	void		 *priv;
+};
+
+static struct workqueue_struct *fileio_wq;
+
+int ibnbd_dev_init(void)
+{
+	fileio_wq = alloc_workqueue("%s", WQ_UNBOUND,
+				    IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS,
+				    "ibnbd_server_fileio_wq");
+	if (!fileio_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void ibnbd_dev_destroy(void)
+{
+	destroy_workqueue(fileio_wq);
+}
+
+static inline struct block_device *ibnbd_dev_open_bdev(const char *path,
+						       fmode_t flags)
+{
+	return blkdev_get_by_path(path, flags, THIS_MODULE);
+}
+
+static int ibnbd_dev_blk_open(struct ibnbd_dev *dev, const char *path,
+			      fmode_t flags)
+{
+	dev->bdev = ibnbd_dev_open_bdev(path, flags);
+	return PTR_ERR_OR_ZERO(dev->bdev);
+}
+
+static int ibnbd_dev_vfs_open(struct ibnbd_dev *dev, const char *path,
+			      fmode_t flags)
+{
+	int oflags = O_DSYNC; /* enable write-through */
+
+	if (flags & FMODE_WRITE)
+		oflags |= O_RDWR;
+	else if (flags & FMODE_READ)
+		oflags |= O_RDONLY;
+	else
+		return -EINVAL;
+
+	dev->file = filp_open(path, oflags, 0);
+	return PTR_ERR_OR_ZERO(dev->file);
+}
+
+struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
+				 enum ibnbd_io_mode mode, struct bio_set *bs,
+				 ibnbd_dev_io_fn io_cb)
+{
+	struct ibnbd_dev *dev;
+	int ret;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	if (mode == IBNBD_BLOCKIO) {
+		dev->blk_open_flags = flags;
+		ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+		if (ret)
+			goto err;
+	} else if (mode == IBNBD_FILEIO) {
+		dev->blk_open_flags = FMODE_READ;
+		ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+		if (ret)
+			goto err;
+
+		ret = ibnbd_dev_vfs_open(dev, path, flags);
+		if (ret)
+			goto blk_put;
+	}
+
+	dev->blk_open_flags	= flags;
+	dev->mode		= mode;
+	dev->io_cb		= io_cb;
+	bdevname(dev->bdev, dev->name);
+	dev->ibd_bio_set	= bs;
+
+	return dev;
+
+blk_put:
+	blkdev_put(dev->bdev, dev->blk_open_flags);
+err:
+	kfree(dev);
+	return ERR_PTR(ret);
+}
+
+void ibnbd_dev_close(struct ibnbd_dev *dev)
+{
+	flush_workqueue(fileio_wq);
+	blkdev_put(dev->bdev, dev->blk_open_flags);
+	if (dev->mode == IBNBD_FILEIO)
+		filp_close(dev->file, NULL);
+	kfree(dev);
+}
+
+static void ibnbd_dev_bi_end_io(struct bio *bio)
+{
+	struct ibnbd_dev_blk_io *io = bio->bi_private;
+
+	io->dev->io_cb(io->priv, blk_status_to_errno(bio->bi_status));
+	bio_put(bio);
+	kfree(io);
+}
+
+static void bio_map_kern_endio(struct bio *bio)
+{
+	bio_put(bio);
+}
+
+/**
+ *	ibnbd_bio_map_kern	-	map kernel address into bio
+ *	@q: the struct request_queue for the bio
+ *	@data: pointer to buffer to map
+ *	@bs: bio_set to use.
+ *	@len: length in bytes
+ *	@gfp_mask: allocation flags for bio allocation
+ *
+ *	Map the kernel address into a bio suitable for io to a block
+ *	device. Returns an error pointer in case of error.
+ */
+static struct bio *ibnbd_bio_map_kern(struct request_queue *q, void *data,
+				      struct bio_set *bs,
+				      unsigned int len, gfp_t gfp_mask)
+{
+	unsigned long kaddr = (unsigned long)data;
+	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	unsigned long start = kaddr >> PAGE_SHIFT;
+	const int nr_pages = end - start;
+	int offset, i;
+	struct bio *bio;
+
+	bio = bio_alloc_bioset(gfp_mask, nr_pages, bs);
+	if (!bio)
+		return ERR_PTR(-ENOMEM);
+
+	offset = offset_in_page(kaddr);
+	for (i = 0; i < nr_pages; i++) {
+		unsigned int bytes = PAGE_SIZE - offset;
+
+		if (len <= 0)
+			break;
+
+		if (bytes > len)
+			bytes = len;
+
+		if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
+				    offset) < bytes) {
+			/* we don't support partial mappings */
+			bio_put(bio);
+			return ERR_PTR(-EINVAL);
+		}
+
+		data += bytes;
+		len -= bytes;
+		offset = 0;
+	}
+
+	bio->bi_end_io = bio_map_kern_endio;
+	return bio;
+}
+
+static int ibnbd_dev_blk_submit_io(struct ibnbd_dev *dev, sector_t sector,
+				   void *data, size_t len, u32 bi_size,
+				   enum ibnbd_io_flags flags, void *priv)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+	struct ibnbd_dev_blk_io *io;
+	struct bio *bio;
+
+	/* check if the buffer is suitable for bdev */
+	if (unlikely(WARN_ON(!blk_rq_aligned(q, (unsigned long)data, len))))
+		return -EINVAL;
+
+	/* Generate bio with pages pointing to the rdma buffer */
+	bio = ibnbd_bio_map_kern(q, data, dev->ibd_bio_set, len, GFP_KERNEL);
+	if (unlikely(IS_ERR(bio)))
+		return PTR_ERR(bio);
+
+	io = kmalloc(sizeof(*io), GFP_KERNEL);
+	if (unlikely(!io)) {
+		bio_put(bio);
+		return -ENOMEM;
+	}
+
+	io->dev		= dev;
+	io->priv	= priv;
+
+	bio->bi_end_io		= ibnbd_dev_bi_end_io;
+	bio->bi_private		= io;
+	bio->bi_opf		= ibnbd_to_bio_flags(flags);
+	bio->bi_iter.bi_sector	= sector;
+	bio->bi_iter.bi_size	= bi_size;
+	bio_set_dev(bio, dev->bdev);
+
+	submit_bio(bio);
+
+	return 0;
+}
+
+static int ibnbd_dev_file_handle_flush(struct ibnbd_dev_file_io_work *w,
+				       loff_t start)
+{
+	int ret;
+	loff_t end;
+	int len = w->bi_size;
+
+	if (len)
+		end = start + len - 1;
+	else
+		end = LLONG_MAX;
+
+	ret = vfs_fsync_range(w->dev->file, start, end, 1);
+	if (unlikely(ret))
+		pr_info_ratelimited("I/O FLUSH failed on %s, vfs_sync err: %d\n",
+				    w->dev->name, ret);
+	return ret;
+}
+
+static int ibnbd_dev_file_handle_fua(struct ibnbd_dev_file_io_work *w,
+				     loff_t start)
+{
+	int ret;
+	loff_t end;
+	int len = w->bi_size;
+
+	if (len)
+		end = start + len - 1;
+	else
+		end = LLONG_MAX;
+
+	ret = vfs_fsync_range(w->dev->file, start, end, 1);
+	if (unlikely(ret))
+		pr_info_ratelimited("I/O FUA failed on %s, vfs_sync err: %d\n",
+				    w->dev->name, ret);
+	return ret;
+}
+
+static int ibnbd_dev_file_handle_write_same(struct ibnbd_dev_file_io_work *w)
+{
+	int i;
+
+	if (unlikely(WARN_ON(w->bi_size % w->len)))
+		return -EINVAL;
+
+	for (i = 1; i < w->bi_size / w->len; i++)
+		memcpy(w->data + i * w->len, w->data, w->len);
+
+	return 0;
+}
+
+static void ibnbd_dev_file_submit_io_worker(struct work_struct *w)
+{
+	struct ibnbd_dev_file_io_work *dev_work;
+	struct file *f;
+	int ret, len;
+	loff_t off;
+
+	dev_work = container_of(w, struct ibnbd_dev_file_io_work, work);
+	off = dev_work->sector * ibnbd_dev_get_logical_bsize(dev_work->dev);
+	f = dev_work->dev->file;
+	len = dev_work->bi_size;
+
+	if (ibnbd_op(dev_work->flags) == IBNBD_OP_FLUSH) {
+		ret = ibnbd_dev_file_handle_flush(dev_work, off);
+		if (unlikely(ret))
+			goto out;
+	}
+
+	if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE_SAME) {
+		ret = ibnbd_dev_file_handle_write_same(dev_work);
+		if (unlikely(ret))
+			goto out;
+	}
+
+	/* TODO Implement support for DIRECT */
+	if (dev_work->bi_size) {
+		loff_t off_tmp = off;
+
+		if (ibnbd_op(dev_work->flags) == IBNBD_OP_WRITE)
+			ret = kernel_write(f, dev_work->data, dev_work->bi_size,
+					   &off_tmp);
+		else
+			ret = kernel_read(f, dev_work->data, dev_work->bi_size,
+					  &off_tmp);
+
+		if (unlikely(ret < 0)) {
+			goto out;
+		} else if (unlikely(ret != dev_work->bi_size)) {
+			/* TODO implement support for partial completions */
+			ret = -EIO;
+			goto out;
+		} else {
+			ret = 0;
+		}
+	}
+
+	if (dev_work->flags & IBNBD_F_FUA)
+		ret = ibnbd_dev_file_handle_fua(dev_work, off);
+out:
+	dev_work->dev->io_cb(dev_work->priv, ret);
+	kfree(dev_work);
+}
+
+static int ibnbd_dev_file_submit_io(struct ibnbd_dev *dev, sector_t sector,
+				    void *data, size_t len, size_t bi_size,
+				    enum ibnbd_io_flags flags, void *priv)
+{
+	struct ibnbd_dev_file_io_work *w;
+
+	if (!ibnbd_flags_supported(flags)) {
+		pr_info_ratelimited("Unsupported I/O flags: 0x%x on device "
+				    "%s\n", flags, dev->name);
+		return -ENOTSUPP;
+	}
+
+	w = kmalloc(sizeof(*w), GFP_KERNEL);
+	if (!w)
+		return -ENOMEM;
+
+	w->dev		= dev;
+	w->priv		= priv;
+	w->sector	= sector;
+	w->data		= data;
+	w->len		= len;
+	w->bi_size	= bi_size;
+	w->flags	= flags;
+	INIT_WORK(&w->work, ibnbd_dev_file_submit_io_worker);
+
+	if (unlikely(!queue_work(fileio_wq, &w->work))) {
+		kfree(w);
+		return -EEXIST;
+	}
+
+	return 0;
+}
+
+int ibnbd_dev_submit_io(struct ibnbd_dev *dev, sector_t sector, void *data,
+			size_t len, u32 bi_size, enum ibnbd_io_flags flags,
+			void *priv)
+{
+	if (dev->mode == IBNBD_FILEIO)
+		return ibnbd_dev_file_submit_io(dev, sector, data, len, bi_size,
+						flags, priv);
+	else if (dev->mode == IBNBD_BLOCKIO)
+		return ibnbd_dev_blk_submit_io(dev, sector, data, len, bi_size,
+					       flags, priv);
+
+	pr_warn("Submitting I/O to %s failed, dev->mode contains invalid "
+		"value: '%d', memory corrupted?\n", dev->name, dev->mode);
+
+	return -EINVAL;
+}
diff --git a/drivers/block/ibnbd/ibnbd-srv-dev.h b/drivers/block/ibnbd/ibnbd-srv-dev.h
new file mode 100644
index 000000000000..2c02038d1f36
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-dev.h
@@ -0,0 +1,149 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_SRV_DEV_H
+#define IBNBD_SRV_DEV_H
+
+#include <linux/fs.h>
+#include "ibnbd-proto.h"
+
+typedef void ibnbd_dev_io_fn(void *priv, int error);
+
+struct ibnbd_dev {
+	struct block_device	*bdev;
+	struct bio_set		*ibd_bio_set;
+	struct file		*file;
+	fmode_t			blk_open_flags;
+	enum ibnbd_io_mode	mode;
+	char			name[BDEVNAME_SIZE];
+	ibnbd_dev_io_fn		*io_cb;
+};
+
+/** ibnbd_dev_init() - Initialize ibnbd_dev
+ *
+ * This function initializes the ibnbd-dev component.
+ * It has to be called once before ibnbd_dev_open() is used.
+ */
+int ibnbd_dev_init(void);
+
+/** ibnbd_dev_destroy() - Destroy ibnbd_dev
+ *
+ * This function destroys the ibnbd-dev component.
+ * It has to be called after the last device was closed.
+ */
+void ibnbd_dev_destroy(void);
+
+/**
+ * ibnbd_dev_open() - Open a device
+ * @flags:	open flags
+ * @mode:	open via VFS or block layer
+ * @bs:		bio_set to use during block I/O
+ * @io_cb:	callback invoked when the I/O has finished
+ */
+struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
+				 enum ibnbd_io_mode mode, struct bio_set *bs,
+				 ibnbd_dev_io_fn io_cb);
+
+/**
+ * ibnbd_dev_close() - Close a device
+ */
+void ibnbd_dev_close(struct ibnbd_dev *dev);
+
+static inline int ibnbd_dev_get_logical_bsize(const struct ibnbd_dev *dev)
+{
+	return bdev_logical_block_size(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_phys_bsize(const struct ibnbd_dev *dev)
+{
+	return bdev_physical_block_size(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_max_segs(const struct ibnbd_dev *dev)
+{
+	return queue_max_segments(bdev_get_queue(dev->bdev));
+}
+
+static inline int ibnbd_dev_get_max_hw_sects(const struct ibnbd_dev *dev)
+{
+	return queue_max_hw_sectors(bdev_get_queue(dev->bdev));
+}
+
+static inline int
+ibnbd_dev_get_max_write_same_sects(const struct ibnbd_dev *dev)
+{
+	return bdev_write_same(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_secure_discard(const struct ibnbd_dev *dev)
+{
+	if (dev->mode == IBNBD_BLOCKIO)
+		return blk_queue_secure_erase(bdev_get_queue(dev->bdev));
+	return 0;
+}
+
+static inline int ibnbd_dev_get_max_discard_sects(const struct ibnbd_dev *dev)
+{
+	if (!blk_queue_discard(bdev_get_queue(dev->bdev)))
+		return 0;
+
+	if (dev->mode == IBNBD_BLOCKIO)
+		return blk_queue_get_max_sectors(bdev_get_queue(dev->bdev),
+						 REQ_OP_DISCARD);
+	return 0;
+}
+
+static inline int ibnbd_dev_get_discard_granularity(const struct ibnbd_dev *dev)
+{
+	if (dev->mode == IBNBD_BLOCKIO)
+		return bdev_get_queue(dev->bdev)->limits.discard_granularity;
+	return 0;
+}
+
+static inline int ibnbd_dev_get_discard_alignment(const struct ibnbd_dev *dev)
+{
+	if (dev->mode == IBNBD_BLOCKIO)
+		return bdev_get_queue(dev->bdev)->limits.discard_alignment;
+	return 0;
+}
+
+/**
+ * ibnbd_dev_submit_io() - Submit an I/O to the disk
+ * @dev:	device to which the I/O is submitted
+ * @sector:	address to read/write data to
+ * @data:	I/O data to write or buffer to read I/O data into
+ * @len:	length of @data
+ * @bi_size:	amount of data that will be read/written
+ * @priv:	private data passed to @io_cb
+ */
+int ibnbd_dev_submit_io(struct ibnbd_dev *dev, sector_t sector, void *data,
+			size_t len, u32 bi_size, enum ibnbd_io_flags flags,
+			void *priv);
+
+#endif /* IBNBD_SRV_DEV_H */
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 21/24] ibnbd: server: sysfs interface functions
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (19 preceding siblings ...)
  2018-02-02 14:09 ` [PATCH 20/24] ibnbd: server: functionality for IO submission to file or block dev Roman Pen
@ 2018-02-02 14:09 ` Roman Pen
  2018-02-02 14:09 ` [PATCH 22/24] ibnbd: include client and server modules into kernel compilation Roman Pen
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:09 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

This is the sysfs interface to IBNBD mapped devices on server side:

  /sys/kernel/ibnbd_server/devices/<device_name>/
    |- block_dev
    |  *** link pointing to the corresponding block device sysfs entry
    |
    |- sessions/<session-name>/
    |  *** sessions directory
       |
       |- read_only
       |  *** whether the device is mapped read-only
       |
       |- mapping_path
          *** relative device path provided by the client during mapping

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/ibnbd-srv-sysfs.c | 264 ++++++++++++++++++++++++++++++++++
 1 file changed, 264 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-srv-sysfs.c b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
new file mode 100644
index 000000000000..a0efd6a2accb
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
@@ -0,0 +1,264 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ *          Jack Wang <jinpu.wang@profitbricks.com>
+ *          Kleber Souza <kleber.souza@profitbricks.com>
+ *          Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *          Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
+ *          Roman Penyaev <roman.penyaev@profitbricks.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <uapi/linux/limits.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/stat.h>
+#include <linux/genhd.h>
+#include <linux/list.h>
+#include <linux/moduleparam.h>
+
+#include "ibnbd-srv.h"
+
+static struct kobject *ibnbd_srv_kobj;
+static struct kobject *ibnbd_srv_devices_kobj;
+
+static struct attribute *ibnbd_srv_default_dev_attrs[] = {
+	NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_attr_group = {
+	.attrs = ibnbd_srv_default_dev_attrs,
+};
+
+static ssize_t ibnbd_srv_attr_show(struct kobject *kobj, struct attribute *attr,
+				   char *page)
+{
+	struct kobj_attribute *kattr;
+	int ret = -EIO;
+
+	kattr = container_of(attr, struct kobj_attribute, attr);
+	if (kattr->show)
+		ret = kattr->show(kobj, kattr, page);
+	return ret;
+}
+
+static ssize_t ibnbd_srv_attr_store(struct kobject *kobj,
+				    struct attribute *attr,
+				    const char *page, size_t length)
+{
+	struct kobj_attribute *kattr;
+	int ret = -EIO;
+
+	kattr = container_of(attr, struct kobj_attribute, attr);
+	if (kattr->store)
+		ret = kattr->store(kobj, kattr, page, length);
+	return ret;
+}
+
+static const struct sysfs_ops ibnbd_srv_sysfs_ops = {
+	.show	= ibnbd_srv_attr_show,
+	.store	= ibnbd_srv_attr_store,
+};
+
+static struct kobj_type ibnbd_srv_dev_ktype = {
+	.sysfs_ops	= &ibnbd_srv_sysfs_ops,
+};
+
+static struct kobj_type ibnbd_srv_dev_sessions_ktype = {
+	.sysfs_ops	= &ibnbd_srv_sysfs_ops,
+};
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+			       struct block_device *bdev,
+			       const char *dir_name)
+{
+	struct kobject *bdev_kobj;
+	int ret;
+
+	ret = kobject_init_and_add(&dev->dev_kobj, &ibnbd_srv_dev_ktype,
+				   ibnbd_srv_devices_kobj, dir_name);
+	if (ret)
+		return ret;
+
+	ret = kobject_init_and_add(&dev->dev_sessions_kobj,
+				   &ibnbd_srv_dev_sessions_ktype,
+				   &dev->dev_kobj, "sessions");
+	if (ret)
+		goto err;
+
+	ret = sysfs_create_group(&dev->dev_kobj,
+				 &ibnbd_srv_default_dev_attr_group);
+	if (ret)
+		goto err2;
+
+	bdev_kobj = &disk_to_dev(bdev->bd_disk)->kobj;
+	ret = sysfs_create_link(&dev->dev_kobj, bdev_kobj, "block_dev");
+	if (ret)
+		goto err3;
+
+	return 0;
+
+err3:
+	sysfs_remove_group(&dev->dev_kobj,
+			   &ibnbd_srv_default_dev_attr_group);
+err2:
+	kobject_del(&dev->dev_sessions_kobj);
+	kobject_put(&dev->dev_sessions_kobj);
+err:
+	kobject_del(&dev->dev_kobj);
+	kobject_put(&dev->dev_kobj);
+	return ret;
+}
+
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev)
+{
+	sysfs_remove_link(&dev->dev_kobj, "block_dev");
+	sysfs_remove_group(&dev->dev_kobj, &ibnbd_srv_default_dev_attr_group);
+	kobject_del(&dev->dev_sessions_kobj);
+	kobject_put(&dev->dev_sessions_kobj);
+	kobject_del(&dev->dev_kobj);
+	kobject_put(&dev->dev_kobj);
+}
+
+static ssize_t ibnbd_srv_dev_session_ro_show(struct kobject *kobj,
+					     struct kobj_attribute *attr,
+					     char *page)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 (sess_dev->open_flags & FMODE_WRITE) ? "0" : "1");
+}
+
+static struct kobj_attribute ibnbd_srv_dev_session_ro_attr =
+	__ATTR(read_only, 0444,
+	       ibnbd_srv_dev_session_ro_show,
+	       NULL);
+
+static ssize_t
+ibnbd_srv_dev_session_mapping_path_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", sess_dev->pathname);
+}
+
+static struct kobj_attribute ibnbd_srv_dev_session_mapping_path_attr =
+	__ATTR(mapping_path, 0444,
+	       ibnbd_srv_dev_session_mapping_path_show,
+	       NULL);
+
+static struct attribute *ibnbd_srv_default_dev_sessions_attrs[] = {
+	&ibnbd_srv_dev_session_ro_attr.attr,
+	&ibnbd_srv_dev_session_mapping_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_session_attr_group = {
+	.attrs = ibnbd_srv_default_dev_sessions_attrs,
+};
+
+void ibnbd_srv_destroy_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	DECLARE_COMPLETION_ONSTACK(sysfs_compl);
+
+	sysfs_remove_group(&sess_dev->kobj,
+			   &ibnbd_srv_default_dev_session_attr_group);
+
+	sess_dev->sysfs_release_compl = &sysfs_compl;
+	kobject_del(&sess_dev->kobj);
+	kobject_put(&sess_dev->kobj);
+	wait_for_completion(&sysfs_compl);
+}
+
+static void ibnbd_srv_sess_dev_release(struct kobject *kobj)
+{
+	struct ibnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+	if (sess_dev->sysfs_release_compl)
+		complete_all(sess_dev->sysfs_release_compl);
+}
+
+static struct kobj_type ibnbd_srv_sess_dev_ktype = {
+	.sysfs_ops	= &ibnbd_srv_sysfs_ops,
+	.release	= ibnbd_srv_sess_dev_release,
+};
+
+int ibnbd_srv_create_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev)
+{
+	int ret;
+
+	ret = kobject_init_and_add(&sess_dev->kobj, &ibnbd_srv_sess_dev_ktype,
+				   &sess_dev->dev->dev_sessions_kobj, "%s",
+				   sess_dev->sess->sessname);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_group(&sess_dev->kobj,
+				 &ibnbd_srv_default_dev_session_attr_group);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	kobject_del(&sess_dev->kobj);
+	kobject_put(&sess_dev->kobj);
+
+	return ret;
+}
+
+int ibnbd_srv_create_sysfs_files(void)
+{
+	int err;
+
+	ibnbd_srv_kobj = kobject_create_and_add(KBUILD_MODNAME, kernel_kobj);
+	if (!ibnbd_srv_kobj)
+		return -ENOMEM;
+
+	ibnbd_srv_devices_kobj = kobject_create_and_add("devices",
+							ibnbd_srv_kobj);
+	if (!ibnbd_srv_devices_kobj) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_put(ibnbd_srv_kobj);
+	return err;
+}
+
+void ibnbd_srv_destroy_sysfs_files(void)
+{
+	kobject_put(ibnbd_srv_devices_kobj);
+	kobject_put(ibnbd_srv_kobj);
+}
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 22/24] ibnbd: include client and server modules into kernel compilation
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (20 preceding siblings ...)
  2018-02-02 14:09 ` [PATCH 21/24] ibnbd: server: sysfs interface functions Roman Pen
@ 2018-02-02 14:09 ` Roman Pen
  2018-02-02 14:09 ` [PATCH 23/24] ibnbd: a bit of documentation Roman Pen
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:09 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

Add the IBNBD Makefile and Kconfig, and add the corresponding lines to
the upper block layer Kconfig and Makefile.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/Kconfig        |  2 ++
 drivers/block/Makefile       |  1 +
 drivers/block/ibnbd/Kconfig  | 22 ++++++++++++++++++++++
 drivers/block/ibnbd/Makefile | 13 +++++++++++++
 4 files changed, 38 insertions(+)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 40579d0cb3d1..483aae5d391e 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -477,4 +477,6 @@ config BLK_DEV_RSXX
 	  To compile this driver as a module, choose M here: the
 	  module will be called rsxx.
 
+source "drivers/block/ibnbd/Kconfig"
+
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dc061158b403..65346a1d0b1a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_BLK_DEV_NULL_BLK)	+= null_blk.o
 obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_BLK_DEV_IBNBD)	+= ibnbd/
 
 skd-y		:= skd_main.o
 swim_mod-y	:= swim.o swim_asm.o
diff --git a/drivers/block/ibnbd/Kconfig b/drivers/block/ibnbd/Kconfig
new file mode 100644
index 000000000000..c5cc7d111c7a
--- /dev/null
+++ b/drivers/block/ibnbd/Kconfig
@@ -0,0 +1,22 @@
+config BLK_DEV_IBNBD
+	boolean
+
+config BLK_DEV_IBNBD_CLIENT
+	tristate "Network block device driver on top of IBTRS transport"
+	depends on INFINIBAND_IBTRS_CLIENT
+	select BLK_DEV_IBNBD
+	help
+	  IBNBD client allows for mapping of remote block devices over
+	  IBTRS protocol from a target system where IBNBD server is running.
+
+	  If unsure, say N.
+
+config BLK_DEV_IBNBD_SERVER
+	tristate "Network block device over RDMA Infiniband server support"
+	depends on INFINIBAND_IBTRS_SERVER
+	select BLK_DEV_IBNBD
+	help
+	  IBNBD server allows for exporting local block devices to a remote client
+	  over IBTRS protocol.
+
+	  If unsure, say N.
diff --git a/drivers/block/ibnbd/Makefile b/drivers/block/ibnbd/Makefile
new file mode 100644
index 000000000000..5f20e72e0633
--- /dev/null
+++ b/drivers/block/ibnbd/Makefile
@@ -0,0 +1,13 @@
+ccflags-y := -Idrivers/infiniband/ulp/ibtrs
+
+ibnbd-client-y := ibnbd-clt.o \
+		  ibnbd-clt-sysfs.o
+
+ibnbd-server-y := ibnbd-srv.o \
+		  ibnbd-srv-dev.o \
+		  ibnbd-srv-sysfs.o
+
+obj-$(CONFIG_BLK_DEV_IBNBD_CLIENT) += ibnbd-client.o
+obj-$(CONFIG_BLK_DEV_IBNBD_SERVER) += ibnbd-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 23/24] ibnbd: a bit of documentation
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (21 preceding siblings ...)
  2018-02-02 14:09 ` [PATCH 22/24] ibnbd: include client and server modules into kernel compilation Roman Pen
@ 2018-02-02 14:09 ` Roman Pen
  2018-02-02 15:55   ` Bart Van Assche
  2018-02-02 14:09 ` [PATCH 24/24] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Roman Pen
                   ` (3 subsequent siblings)
  26 siblings, 1 reply; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:09 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

README with description of major sysfs entries.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/block/ibnbd/README | 272 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 272 insertions(+)

diff --git a/drivers/block/ibnbd/README b/drivers/block/ibnbd/README
new file mode 100644
index 000000000000..e0feb39fad14
--- /dev/null
+++ b/drivers/block/ibnbd/README
@@ -0,0 +1,272 @@
+***************************************
+Infiniband Network Block Device (IBNBD)
+***************************************
+
+Introduction
+------------
+
+IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
+(client and server) that allow for remote access of a block device on
+the server over IBTRS protocol using the RDMA (InfiniBand, RoCE, iWarp)
+transport. After being mapped, the remote block devices can be accessed
+on the client side as local block devices.
+
+I/O is transferred between client and server by the IBTRS transport
+modules. The administration of IBNBD and IBTRS modules is done via
+sysfs entries.
+
+Requirements
+------------
+
+  IBTRS kernel modules
+
+Quick Start
+-----------
+
+Server side:
+  # modprobe ibnbd_server
+
+Client side:
+  # modprobe ibnbd_client
+  # echo "sessname=blya path=ip:10.50.100.66 device_path=/dev/ram0" > \
+            /sys/kernel/ibnbd_client/map_device
+
+  Where "sessname=" is a session name, a string to identify the session
+  on client and on server sides; "path=" is a destination IP address or
+  a pair of source and destination IPs, separated by a comma.  Multiple
+  "path=" options can be specified in order to use multipath (see IBTRS
+  description for details); "device_path=" is the block device to be
+  mapped from the server side. After the session to the server machine is
+  established, the mapped device will appear on the client side under
+  /dev/ibnbd<N>.
+
+
+======================
+Client Sysfs Interface
+======================
+
+All sysfs files that are not read-only provide the usage information on read:
+
+Example:
+  # cat /sys/kernel/ibnbd_client/map_device
+
+  > Usage: echo "sessname=<name of the ibtrs session> path=<[srcaddr,]dstaddr>
+  > [path=<[srcaddr,]dstaddr>] device_path=<full path on remote side>
+  > [access_mode=<ro|rw|migration>] [input_mode=<mq|rq>]
+  > [io_mode=<fileio|blockio>]" > map_device
+  >
+  > addr ::= [ ip:<ipv4> | ip:<ipv6> | gid:<gid> ]
+
+Entries under /sys/kernel/ibnbd_client/
+=======================================
+
+map_device (RW)
+---------------
+
+Expected format is the following:
+
+    sessname=<name of the ibtrs session>
+    path=<[srcaddr,]dstaddr> [path=<[srcaddr,]dstaddr> ...]
+    device_path=<full path on remote side>
+    [access_mode=<ro|rw|migration>]
+    [input_mode=<mq|rq>]
+    [io_mode=<fileio|blockio>]
+
+Where:
+
+sessname: accepts a string of up to 256 characters, which identifies
+          a given session on the client and on the server.
+	  E.g. "clt_hostname-srv_hostname" could be a natural choice.
+
+path:     describes a connection between the client and the server by
+	  specifying destination and, when required, the source address.
+	  The addresses are to be provided in the following format:
+
+            ip:<IPv6>
+            ip:<IPv4>
+            gid:<GID>
+
+          for example:
+
+          path=ip:10.0.0.66
+                         The single addr is treated as the destination.
+                         The connection will be established to this
+                         server from any client IP address.
+
+          path=ip:10.0.0.66,ip:10.0.1.66
+                         First addr is the source address and the second
+                         is the destination.
+
+          If multiple "path=" options are specified, multiple connections
+          will be established and data will be sent according to
+          the selected multipath policy (see IBTRS mp_policy sysfs entry
+          description).
+
+device_path: Path to the block device on the server side. Path is specified
+	     relative to the directory on server side configured in the
+             'dev_search_path' module parameter of the ibnbd_server.
+             The ibnbd_server prepends the <device_path> received from client
+	     with <dev_search_path> and tries to open the
+	     <dev_search_path>/<device_path> block device.  On success,
+	     a /dev/ibnbd<N> device file, a /sys/block/ibnbd<N>/ibnbd_client/
+	     directory and an entry in /sys/kernel/ibnbd_client/devices will be
+             created.
+
+access_mode: the access_mode parameter specifies if the device is to be
+             mapped as "ro" read-only or "rw" read-write. The server allows
+	     a device to be exported in rw mode only once. The "migration"
+             access mode has to be specified if a second mapping in read-write
+	     mode is desired.
+
+             By default "rw" is used.
+
+input_mode: the input_mode parameter specifies the internal I/O
+            processing mode of the block device on the client.  Accepts
+            "mq" and "rq".
+
+            By default "mq" mode is used.
+
+io_mode:  the io_mode parameter specifies if the device on the server
+          will be opened as block device "blockio" or as file "fileio".
+          When the device is opened as a file, the VFS page cache is used
+          for read I/O operations; write I/O operations bypass the page
+          cache and go directly to disk (except metadata updates, such as
+          the file access time).
+
+          By default "blockio" mode is used.
+
+Exit Codes:
+
+If the device is already mapped it will fail with EEXIST. If the input
+has an invalid format it will return EINVAL. If the device path cannot
+be found on the server, it will fail with ENOENT.
+
+Finding device file after mapping
+---------------------------------
+
+After mapping, the device file can be found by:
+ o  The symlink /sys/kernel/ibnbd_client/devices/<device_id> points to
+    /sys/block/<dev-name>. The last part of the symlink destination is
+    the same as the device name.  By extracting the last part of the
+    path, the path to the device /dev/<dev-name> can be built.
+
+ o /dev/block/$(cat /sys/kernel/ibnbd_client/devices/<device_id>/dev)
+
+How to find the <device_id> of the device is described in the next
+section.
+
+Entries under /sys/kernel/ibnbd_client/devices/
+===============================================
+
+For each device mapped on the client a new symbolic link is created as
+/sys/kernel/ibnbd_client/devices/<device_id>, which points to the block
+device created by ibnbd (/sys/block/ibnbd<N>/). The <device_id> of each
+device is created as follows:
+
+- If the 'device_path' provided during mapping contains slashes ("/"),
+  they are replaced by an exclamation mark ("!") and used as the
+  <device_id>. Otherwise, the <device_id> will be the same as the
+  "device_path" provided.
+
+Entries under /sys/block/ibnbd<N>/ibnbd_client/
+===============================================
+
+unmap_device (RW)
+-----------------
+
+To unmap a volume, "normal" or "force" has to be written to:
+  /sys/block/ibnbd<N>/ibnbd_client/unmap_device
+
+When "normal" is used, the operation will fail with EBUSY if any process
+is using the device.  When "force" is used, the device is also unmapped
+when the device is in use.  All I/Os that are in progress will fail.
+
+Example:
+
+   # echo "normal" > /sys/block/ibnbd0/ibnbd_client/unmap_device
+
+state (RO)
+----------
+
+The file contains the current state of the block device. The state file
+returns "open" when the device is successfully mapped from the server
+and accepting I/O requests. When the connection to the server gets
+disconnected in case of an error (e.g. link failure), the state file
+returns "closed" and all I/O requests submitted to it will fail with -EIO.
+
+session (RO)
+------------
+
+IBNBD uses an IBTRS session to transport the data between client and
+server.  The entry "session" contains the name of the session that
+was used to establish the IBTRS session.  It's the same name that
+was passed as the "sessname" parameter to the map_device entry.
+
+mapping_path (RO)
+-----------------
+
+Contains the path that was passed as "device_path" to the map_device
+operation.
+
+======================
+Server Sysfs Interface
+======================
+
+Entries under /sys/kernel/ibnbd_server/
+=======================================
+
+When a client maps a device, a directory entry with the name of the
+block device is created under /sys/kernel/ibnbd_server/devices/.
+
+Entries under /sys/kernel/ibnbd_server/devices/<device_name>/
+=============================================================
+
+block_dev (link)
+----------------
+
+Is a symlink to the sysfs entry of the exported device.
+
+Example:
+
+  block_dev -> ../../../../devices/virtual/block/ram0
+
+Entries under /sys/kernel/ibnbd_server/devices/<device_name>/sessions/
+======================================================================
+
+For each client a particular device is exported to, the following directory
+will be created:
+
+/sys/kernel/ibnbd_server/devices/<device_name>/sessions/<session-name>/
+
+When the device is unmapped by that client, the directory will be removed.
+
+Entries under /sys/kernel/ibnbd_server/devices/<device_name>/sessions/<session-name>
+====================================================================================
+
+read_only (RO)
+--------------
+
+Contains '1' if the device is mapped read-only, otherwise '0'.
+
+mapping_path (RO)
+-----------------
+
+Contains the relative device path provided by the user during mapping.
+
+==============================
+IBNBD-Server Module Parameters
+==============================
+
+dev_search_path
+---------------
+
+When a device is mapped from the client, the server generates the path
+to the block device on the server side by concatenating dev_search_path
+and the "device_path" that was specified in the map_device operation.
+
+The default dev_search_path is: "/".
+
+Contact
+-------
+
+Mailing list: "IBNBD/IBTRS Storage Team" <ibnbd@profitbricks.com>
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 24/24] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (22 preceding siblings ...)
  2018-02-02 14:09 ` [PATCH 23/24] ibnbd: a bit of documentation Roman Pen
@ 2018-02-02 14:09 ` Roman Pen
  2018-02-02 16:07 ` [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Bart Van Assche
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 79+ messages in thread
From: Roman Pen @ 2018-02-02 14:09 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Or Gerlitz, Roman Pen, Danil Kipnis, Jack Wang

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 MAINTAINERS | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 18994806e441..fad9c2529f8a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6714,6 +6714,20 @@ IBM ServeRAID RAID DRIVER
 S:	Orphan
 F:	drivers/scsi/ips.*
 
+IBNBD BLOCK DRIVERS
+M:	IBNBD/IBTRS Storage Team <ibnbd@profitbricks.com>
+L:	linux-block@vger.kernel.org
+S:	Maintained
+T:	git git://github.com/profitbricks/ibnbd.git
+F:	drivers/block/ibnbd/
+
+IBTRS TRANSPORT DRIVERS
+M:	IBNBD/IBTRS Storage Team <ibnbd@profitbricks.com>
+L:	linux-rdma@vger.kernel.org
+S:	Maintained
+T:	git git://github.com/profitbricks/ibnbd.git
+F:	drivers/infiniband/ulp/ibtrs/
+
 ICH LPC AND GPIO DRIVER
 M:	Peter Tyser <ptyser@xes-inc.com>
 S:	Maintained
-- 
2.13.1

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/24] ibnbd: client: main functionality
  2018-02-02 14:08 ` [PATCH 16/24] ibnbd: client: main functionality Roman Pen
@ 2018-02-02 15:11   ` Jens Axboe
  2018-02-05 12:54     ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Jens Axboe @ 2018-02-02 15:11 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

On 2/2/18 7:08 AM, Roman Pen wrote:
> This is main functionality of ibnbd-client module, which provides
> interface to map remote device as local block device /dev/ibnbd<N>
> and feeds IBTRS with IO requests.

Kill the legacy IO path for this, the driver should only support
blk-mq. Hence kill off your BLK_RQ part, that will eliminate
the dual path you have too.

-- 
Jens Axboe
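
For illustration, a blk-mq-only setup along these lines could look as follows
(a sketch only; "ibnbd_clt_dev", "ibnbd_iu" and the function names are
placeholders, not the driver's actual symbols):

#include <linux/blk-mq.h>
#include <linux/err.h>

struct ibnbd_iu {		/* per-request driver context (placeholder) */
	void *priv;
};

struct ibnbd_clt_dev {
	struct blk_mq_tag_set	tag_set;
	struct request_queue	*queue;
};

static blk_status_t ibnbd_clt_queue_rq(struct blk_mq_hw_ctx *hctx,
				       const struct blk_mq_queue_data *bd)
{
	blk_mq_start_request(bd->rq);
	/* hand the request over to the transport here, complete it later */
	return BLK_STS_OK;
}

static const struct blk_mq_ops ibnbd_mq_ops = {
	.queue_rq	= ibnbd_clt_queue_rq,
};

static int ibnbd_clt_setup_mq(struct ibnbd_clt_dev *dev)
{
	struct blk_mq_tag_set *set = &dev->tag_set;
	int err;

	memset(set, 0, sizeof(*set));
	set->ops		= &ibnbd_mq_ops;
	set->queue_depth	= 128;
	set->numa_node		= NUMA_NO_NODE;
	set->flags		= BLK_MQ_F_SHOULD_MERGE;
	set->cmd_size		= sizeof(struct ibnbd_iu);
	set->nr_hw_queues	= num_online_cpus();

	err = blk_mq_alloc_tag_set(set);
	if (err)
		return err;

	dev->queue = blk_mq_init_queue(set);
	if (IS_ERR(dev->queue)) {
		blk_mq_free_tag_set(set);
		return PTR_ERR(dev->queue);
	}
	return 0;
}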

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 23/24] ibnbd: a bit of documentation
  2018-02-02 14:09 ` [PATCH 23/24] ibnbd: a bit of documentation Roman Pen
@ 2018-02-02 15:55   ` Bart Van Assche
  2018-02-05 13:03     ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-02 15:55 UTC (permalink / raw)
  To: roman.penyaev, linux-block, linux-rdma
  Cc: danil.kipnis, hch, ogerlitz, jinpu.wang, axboe, sagi

On Fri, 2018-02-02 at 15:09 +0100, Roman Pen wrote:
> +Entries under /sys/kernel/ibnbd_client/
> +=======================================
> [ ... ]

You will need Greg KH's permission to add new entries directly under /sys/kernel.
Since I think that it is unlikely that he will give that permission: have you
considered to add the new client entries under /sys/class/block for the client and
/sys/kernel/configfs/ibnbd for the target, similar to what the NVMeOF drivers do
today?

Bart.
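
To make the suggestion concrete, a configfs subsystem for the target side
could be registered roughly like this (a sketch only; the names are
placeholders, not proposed code):

#include <linux/configfs.h>
#include <linux/module.h>
#include <linux/mutex.h>

static struct config_item_type ibnbd_srv_subsys_type = {
	.ct_owner = THIS_MODULE,
};

static struct configfs_subsystem ibnbd_srv_subsys = {
	.su_group = {
		.cg_item = {
			.ci_namebuf	= "ibnbd",
			.ci_type	= &ibnbd_srv_subsys_type,
		},
	},
};

static int __init ibnbd_srv_configfs_init(void)
{
	/* shows up as /sys/kernel/config/ibnbd once configfs is mounted */
	config_group_init(&ibnbd_srv_subsys.su_group);
	mutex_init(&ibnbd_srv_subsys.su_mutex);
	return configfs_register_subsystem(&ibnbd_srv_subsys);
}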

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (23 preceding siblings ...)
  2018-02-02 14:09 ` [PATCH 24/24] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Roman Pen
@ 2018-02-02 16:07 ` Bart Van Assche
  2018-02-02 16:40   ` Doug Ledford
  2018-02-02 17:05 ` Bart Van Assche
  2018-02-05 12:16 ` Sagi Grimberg
  26 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-02 16:07 UTC (permalink / raw)
  To: roman.penyaev, linux-block, linux-rdma
  Cc: danil.kipnis, hch, ogerlitz, jinpu.wang, axboe, sagi

On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> Since the first version the following was changed:
> 
>    - Load-balancing and IO fail-over using multipath features were added.
>    - Major parts of the code were rewritten, simplified and overall code
>      size was reduced by a quarter.

That is interesting to know, but what happened to the feedback that Sagi and
I provided on v1? Has that feedback been addressed? See also
https://www.spinics.net/lists/linux-rdma/msg47819.html and
https://www.spinics.net/lists/linux-rdma/msg47879.html.

Regarding multipath support: there are already two multipath implementations
upstream (dm-mpath and the multipath implementation in the NVMe initiator).
I'm not sure we want a third multipath implementation in the Linux kernel.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-02 16:07 ` [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Bart Van Assche
@ 2018-02-02 16:40   ` Doug Ledford
  2018-02-05  8:45     ` Jinpu Wang
  2018-06-04 12:14     ` Danil Kipnis
  0 siblings, 2 replies; 79+ messages in thread
From: Doug Ledford @ 2018-02-02 16:40 UTC (permalink / raw)
  To: Bart Van Assche, roman.penyaev, linux-block, linux-rdma
  Cc: danil.kipnis, hch, ogerlitz, jinpu.wang, axboe, sagi

[-- Attachment #1: Type: text/plain, Size: 2226 bytes --]

On Fri, 2018-02-02 at 16:07 +0000, Bart Van Assche wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> > Since the first version the following was changed:
> > 
> >    - Load-balancing and IO fail-over using multipath features were added.
> >    - Major parts of the code were rewritten, simplified and overall code
> >      size was reduced by a quarter.
> 
> That is interesting to know, but what happened to the feedback that Sagi and
> I provided on v1? Has that feedback been addressed? See also
> https://www.spinics.net/lists/linux-rdma/msg47819.html and
> https://www.spinics.net/lists/linux-rdma/msg47879.html.
> 
> Regarding multipath support: there are already two multipath implementations
> upstream (dm-mpath and the multipath implementation in the NVMe initiator).
> I'm not sure we want a third multipath implementation in the Linux kernel.

There's more than that.  There was also md-multipath, and smc-r includes
another version of multipath, plus I assume we support mptcp as well.

But, to be fair, the different multipaths in this list serve different
purposes and I'm not sure they could all be generalized out and served
by a single multipath code.  Although, fortunately, md-multipath is
deprecated, so no need to worry about it, and it is only dm-multipath
and nvme multipath that deal directly with block devices and assume
block semantics.  If I read the cover letter right (and I haven't dug
into the code to confirm this), the ibtrs multipath has much more in
common with smc-r multipath, where it doesn't really assume a block
layer device sits on top of it, it's more of a pure network multipath,
which the implementation of smc-r is and mptcp would be too.  I would
like to see a core RDMA multipath implementation soon that would
abstract out some of these multipath tasks, at least across RDMA links,
and that didn't have the current limitations (smc-r only supports RoCE
links, and it sounds like ibtrs only supports IB like links, but maybe
I'm wrong there, I haven't looked at the patches yet).

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/24] ibtrs: client: main functionality
  2018-02-02 14:08 ` [PATCH 05/24] ibtrs: client: main functionality Roman Pen
@ 2018-02-02 16:54   ` Bart Van Assche
  2018-02-05 13:27     ` Roman Penyaev
  2018-02-05 11:19   ` Sagi Grimberg
  1 sibling, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-02 16:54 UTC (permalink / raw)
  To: roman.penyaev, linux-block, linux-rdma
  Cc: danil.kipnis, hch, ogerlitz, jinpu.wang, axboe, sagi

On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> +	size_t max_depth = clt->queue_depth;
> +	struct ibtrs_tag *tag;
> +	int cpu, bit;
> +
> +	cpu = get_cpu();
> +	do {
> +		bit = find_first_zero_bit(clt->tags_map, max_depth);
> +		if (unlikely(bit >= max_depth)) {
> +			put_cpu();
> +			return NULL;
> +		}
> +
> +	} while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> +	put_cpu();
> +
> +	tag = GET_TAG(clt, bit);
> +	WARN_ON(tag->mem_id != bit);
> +	tag->cpu_id = cpu;
> +	tag->con_type = con_type;
> +
> +	return tag;
> +}
> +
> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
> +				   struct ibtrs_tag *tag)
> +{
> +	clear_bit_unlock(tag->mem_id, clt->tags_map);
> +}
> +
> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
> +				    enum ibtrs_clt_con_type con_type,
> +				    int can_wait)
> +{
> +	struct ibtrs_tag *tag;
> +	DEFINE_WAIT(wait);
> +
> +	tag = __ibtrs_get_tag(clt, con_type);
> +	if (likely(tag) || !can_wait)
> +		return tag;
> +
> +	do {
> +		prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
> +		tag = __ibtrs_get_tag(clt, con_type);
> +		if (likely(tag))
> +			break;
> +
> +		io_schedule();
> +	} while (1);
> +
> +	finish_wait(&clt->tags_wait, &wait);
> +
> +	return tag;
> +}
> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
> +
> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
> +{
> +	if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
> +		return;
> +
> +	__ibtrs_put_tag(clt, tag);
> +
> +	/*
> +	 * Putting a tag is a barrier, so we will observe
> +	 * new entry in the wait list, no worries.
> +	 */
> +	if (waitqueue_active(&clt->tags_wait))
> +		wake_up(&clt->tags_wait);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_put_tag);

Do these functions have any advantage over the code in lib/sbitmap.c? If not,
please call the sbitmap functions instead of adding an additional tag allocator.

Thanks,

Bart.
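
For comparison, a minimal sketch of the same get/put pair on top of
lib/sbitmap.c (assuming a "struct sbitmap tags_bitmap" member initialized
with sbitmap_init_node(&clt->tags_bitmap, clt->queue_depth, -1, GFP_KERNEL,
NUMA_NO_NODE); the wait-queue handling could stay as it is):

#include <linux/sbitmap.h>

static inline struct ibtrs_tag *__ibtrs_get_tag(struct ibtrs_clt *clt,
						enum ibtrs_clt_con_type con_type)
{
	struct ibtrs_tag *tag;
	int bit;

	/* sbitmap_get() returns -1 when no free bit is available */
	bit = sbitmap_get(&clt->tags_bitmap, 0, false);
	if (bit < 0)
		return NULL;

	tag = GET_TAG(clt, bit);
	tag->cpu_id = raw_smp_processor_id();
	tag->con_type = con_type;

	return tag;
}

static inline void __ibtrs_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
{
	sbitmap_clear_bit(&clt->tags_bitmap, tag->mem_id);
}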

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (24 preceding siblings ...)
  2018-02-02 16:07 ` [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Bart Van Assche
@ 2018-02-02 17:05 ` Bart Van Assche
  2018-02-05  8:56   ` Jinpu Wang
  2018-02-05 12:16 ` Sagi Grimberg
  26 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-02 17:05 UTC (permalink / raw)
  To: roman.penyaev, linux-block, linux-rdma
  Cc: danil.kipnis, hch, ogerlitz, jinpu.wang, axboe, sagi

On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> o Simple configuration of IBNBD:
>    - Server side is completely passive: volumes do not need to be
>      explicitly exported.

That sounds like a security hole? I think the ability to configure whether or
not an initiator is allowed to log in is essential and also which volumes an
initiator has access to.

>    - Only IB port GID and device path needed on client side to map
>      a block device.

I think IP addressing is preferred over GID addressing in RoCE networks.
Additionally, have you noticed that GUID configuration support has been added
to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
namely that at least in IB networks the prefix will change if the subnet
manager is reconfigured. Additionally, in IB networks it may happen that the
target driver is loaded and configured before the GID has been assigned to
all RDMA ports.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-02 16:40   ` Doug Ledford
@ 2018-02-05  8:45     ` Jinpu Wang
  2018-06-04 12:14     ` Danil Kipnis
  1 sibling, 0 replies; 79+ messages in thread
From: Jinpu Wang @ 2018-02-05  8:45 UTC (permalink / raw)
  To: Doug Ledford, Bart Van Assche
  Cc: roman.penyaev, linux-block, linux-rdma, danil.kipnis, hch,
	ogerlitz, axboe, sagi

On Fri, Feb 2, 2018 at 5:40 PM, Doug Ledford <dledford@redhat.com> wrote:
> On Fri, 2018-02-02 at 16:07 +0000, Bart Van Assche wrote:
>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > Since the first version the following was changed:
>> >
>> >    - Load-balancing and IO fail-over using multipath features were added.
>> >    - Major parts of the code were rewritten, simplified and overall code
>> >      size was reduced by a quarter.
>>
>> That is interesting to know, but what happened to the feedback that Sagi and
>> I provided on v1? Has that feedback been addressed? See also
>> https://www.spinics.net/lists/linux-rdma/msg47819.html and
>> https://www.spinics.net/lists/linux-rdma/msg47879.html.
>>
>> Regarding multipath support: there are already two multipath implementations
>> upstream (dm-mpath and the multipath implementation in the NVMe initiator).
>> I'm not sure we want a third multipath implementation in the Linux kernel.
>
> There's more than that.  There was also md-multipath, and smc-r includes
> another version of multipath, plus I assume we support mptcp as well.
>
> But, to be fair, the different multipaths in this list serve different
> purposes and I'm not sure they could all be generalized out and served
> by a single multipath code.  Although, fortunately, md-multipath is
> deprecated, so no need to worry about it, and it is only dm-multipath
> and nvme multipath that deal directly with block devices and assume
> block semantics.  If I read the cover letter right (and I haven't dug
> into the code to confirm this), the ibtrs multipath has much more in
> common with smc-r multipath, where it doesn't really assume a block
> layer device sits on top of it, it's more of a pure network multipath,
> which the implementation of smc-r is and mptcp would be too.  I would
> like to see a core RDMA multipath implementation soon that would
> abstract out some of these multipath tasks, at least across RDMA links,
> and that didn't have the current limitations (smc-r only supports RoCE
> links, and it sounds like ibtrs only supports IB like links, but maybe
> I'm wrong there, I haven't looked at the patches yet).
Hi Doug, hi Bart,

Thanks for your valuable input; here are my 2 cents:

IBTRS multipath is indeed a network multipath, with a sysfs interface to
add/remove paths dynamically.
IBTRS is built on rdma-cm, so it is expected to support RoCE and iWARP as
well, but we mainly tested in an IB environment and did some light testing
on RXE.


Regards,
-- 
Jack Wang
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-02 17:05 ` Bart Van Assche
@ 2018-02-05  8:56   ` Jinpu Wang
  2018-02-05 11:36     ` Sagi Grimberg
  2018-02-05 16:16     ` Bart Van Assche
  0 siblings, 2 replies; 79+ messages in thread
From: Jinpu Wang @ 2018-02-05  8:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: roman.penyaev, linux-block, linux-rdma, danil.kipnis, hch,
	ogerlitz, axboe, sagi

Hi Bart,

Another 2 cents from me :)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> o Simple configuration of IBNBD:
>>    - Server side is completely passive: volumes do not need to be
>>      explicitly exported.
>
> That sounds like a security hole? I think the ability to configure whether or
> not an initiator is allowed to log in is essential and also which volumes an
> initiator has access to.
Our design targets a well-controlled production environment, so security is
handled in another layer.
On the server side, the admin can set the dev_search_path module parameter
to a parent directory; it is concatenated with the path the client sends in
the open message to open a block device.
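
Roughly like this (illustrative sketch only: dev_search_path is the real
parameter name, but the helper and the exact concatenation are assumptions,
not the actual ibnbd-server code):

#include <linux/module.h>
#include <linux/limits.h>
#include <linux/kernel.h>
#include <linux/errno.h>

static char dev_search_path[PATH_MAX] = "/";
module_param_string(dev_search_path, dev_search_path,
		    sizeof(dev_search_path), 0444);
MODULE_PARM_DESC(dev_search_path,
		 "Parent directory prepended to the device path from the client");

/* Build the absolute path for the device named in the client's open message. */
static int example_resolve_dev(const char *client_path, char *abs_path,
			       size_t len)
{
	int n = snprintf(abs_path, len, "%s/%s", dev_search_path, client_path);

	return (n < 0 || n >= (int)len) ? -ENAMETOOLONG : 0;
}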


>
>>    - Only IB port GID and device path needed on client side to map
>>      a block device.
>
> I think IP addressing is preferred over GID addressing in RoCE networks.
> Additionally, have you noticed that GUID configuration support has been added
> to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
> namely that at least in IB networks the prefix will change if the subnet
> manager is reconfigured. Additionally, in IB networks it may happen that the
> target driver is loaded and configured before the GID has been assigned to
> all RDMA ports.
>
> Thanks,
>
> Bart.

Sorry, the above description is not accurate: IBNBD/IBTRS supports
GID/IPv4/IPv6 addressing.
We will adjust the description in the next post.

Thanks,
-- 
Jack Wang
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules
  2018-02-02 14:08 ` [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules Roman Pen
@ 2018-02-05 10:52   ` Sagi Grimberg
  2018-02-06 12:01     ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 10:52 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

Hi Roman,

Here are some comments below.

> +int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
> +{
> +	struct ib_recv_wr wr, *bad_wr;
> +
> +	wr.next    = NULL;
> +	wr.wr_cqe  = cqe;
> +	wr.sg_list = NULL;
> +	wr.num_sge = 0;
> +
> +	return ib_post_recv(con->qp, &wr, &bad_wr);
> +}
> +EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);

What is this designed to do?

> +int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
> +				 struct ib_sge *sge, unsigned int num_sge,
> +				 u32 rkey, u64 rdma_addr, u32 imm_data,
> +				 enum ib_send_flags flags)
> +{
> +	struct ib_send_wr *bad_wr;
> +	struct ib_rdma_wr wr;
> +	int i;
> +
> +	wr.wr.next	  = NULL;
> +	wr.wr.wr_cqe	  = &iu->cqe;
> +	wr.wr.sg_list	  = sge;
> +	wr.wr.num_sge	  = num_sge;
> +	wr.rkey		  = rkey;
> +	wr.remote_addr	  = rdma_addr;
> +	wr.wr.opcode	  = IB_WR_RDMA_WRITE_WITH_IMM;
> +	wr.wr.ex.imm_data = cpu_to_be32(imm_data);
> +	wr.wr.send_flags  = flags;
> +
> +	/*
> +	 * If one of the sges has 0 size, the operation will fail with an
> +	 * length error
> +	 */
> +	for (i = 0; i < num_sge; i++)
> +		if (WARN_ON(sge[i].length == 0))
> +			return -EINVAL;
> +
> +	return ib_post_send(con->qp, &wr.wr, &bad_wr);
> +}
> +EXPORT_SYMBOL_GPL(ibtrs_iu_post_rdma_write_imm);
> +
> +int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
> +				    u32 imm_data, enum ib_send_flags flags)
> +{
> +	struct ib_send_wr wr, *bad_wr;
> +
> +	memset(&wr, 0, sizeof(wr));
> +	wr.wr_cqe	= cqe;
> +	wr.send_flags	= flags;
> +	wr.opcode	= IB_WR_RDMA_WRITE_WITH_IMM;
> +	wr.ex.imm_data	= cpu_to_be32(imm_data);
> +
> +	return ib_post_send(con->qp, &wr, &bad_wr);
> +}
> +EXPORT_SYMBOL_GPL(ibtrs_post_rdma_write_imm_empty);

Christoph did a great job adding a generic rdma rw API; please reuse it.
If you rely on functionality that does not exist there, please enhance it
instead of open-coding a new RDMA engine library.
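
For reference, a rough, untested sketch of what the generic rw API
(drivers/infiniband/core/rw.c) provides today; immediate data would still
need an extension of that API:

#include <rdma/rw.h>
#include <linux/scatterlist.h>
#include <linux/dma-direction.h>

/*
 * ctx is assumed to be embedded in the per-request structure and must stay
 * alive until the completion handler runs rdma_rw_ctx_destroy().
 */
static int example_rdma_write(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
			      u8 port_num, struct scatterlist *sg, u32 sg_cnt,
			      u64 remote_addr, u32 rkey, struct ib_cqe *cqe)
{
	int ret;

	/* Maps the sg list and, where required, registers MRs internally. */
	ret = rdma_rw_ctx_init(ctx, qp, port_num, sg, sg_cnt, 0,
			       remote_addr, rkey, DMA_TO_DEVICE);
	if (ret < 0)
		return ret;

	/* Completion is reported through cqe->done. */
	return rdma_rw_ctx_post(ctx, qp, port_num, cqe, NULL);
}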

> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device *dev)
> +{
> +	int err;
> +
> +	d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
> +	if (IS_ERR(d->pd))
> +		return PTR_ERR(d->pd);
> +	d->dev = dev;
> +	d->lkey = d->pd->local_dma_lkey;
> +	d->rkey = d->pd->unsafe_global_rkey;
> +
> +	err = ibtrs_query_device(d);
> +	if (unlikely(err))
> +		ib_dealloc_pd(d->pd);
> +
> +	return err;
> +}

I must say that this makes me frustrated. We stopped doing this sort of
thing a long time ago. There is no way we can even consider accepting the
unsafe use of the global rkey, which exposes the entire memory space for
remote access.

Sorry for being blunt, but a protocol design that makes a conscious decision
to expose memory unconditionally is broken by definition.
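
That is, something along these lines (untested sketch), with remote access
going only through explicitly registered, per-I/O memory regions:

#include <rdma/ib_verbs.h>

static struct ib_pd *example_alloc_pd(struct ib_device *dev)
{
	/* No IB_PD_UNSAFE_GLOBAL_RKEY: nothing is remotely accessible
	 * until an MR is registered for it.
	 */
	return ib_alloc_pd(dev, 0);
}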

> +struct ibtrs_ib_dev *ibtrs_ib_dev_find_get(struct rdma_cm_id *cm_id)
> +{
> +	struct ibtrs_ib_dev *dev;
> +	int err;
> +
> +	mutex_lock(&device_list_mutex);
> +	list_for_each_entry(dev, &device_list, entry) {
> +		if (dev->dev->node_guid == cm_id->device->node_guid &&
> +		    kref_get_unless_zero(&dev->ref))
> +			goto out_unlock;
> +	}
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +	if (unlikely(!dev))
> +		goto out_err;
> +
> +	kref_init(&dev->ref);
> +	err = ibtrs_ib_dev_init(dev, cm_id->device);
> +	if (unlikely(err))
> +		goto out_free;
> +	list_add(&dev->entry, &device_list);
> +out_unlock:
> +	mutex_unlock(&device_list_mutex);
> +
> +	return dev;
> +
> +out_free:
> +	kfree(dev);
> +out_err:
> +	mutex_unlock(&device_list_mutex);
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(ibtrs_ib_dev_find_get);

Is it time to make this a common helper in rdma_cm?

...

> +static void schedule_hb(struct ibtrs_sess *sess)
> +{
> +	queue_delayed_work(sess->hb_wq, &sess->hb_dwork,
> +			   msecs_to_jiffies(sess->hb_interval_ms));
> +}

What does hb stand for?

> +void ibtrs_send_hb_ack(struct ibtrs_sess *sess)
> +{
> +	struct ibtrs_con *usr_con = sess->con[0];
> +	u32 imm;
> +	int err;
> +
> +	imm = ibtrs_to_imm(IBTRS_HB_ACK_IMM, 0);
> +	err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe,
> +					      imm, IB_SEND_SIGNALED);
> +	if (unlikely(err)) {
> +		sess->hb_err_handler(usr_con, err);
> +		return;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(ibtrs_send_hb_ack);

What is this?

What is all this hb stuff?

> +
> +static int ibtrs_str_ipv4_to_sockaddr(const char *addr, size_t len,
> +				      short port, struct sockaddr *dst)
> +{
> +	struct sockaddr_in *dst_sin = (struct sockaddr_in *)dst;
> +	int ret;
> +
> +	ret = in4_pton(addr, len, (u8 *)&dst_sin->sin_addr.s_addr,
> +		       '\0', NULL);
> +	if (ret == 0)
> +		return -EINVAL;
> +
> +	dst_sin->sin_family = AF_INET;
> +	dst_sin->sin_port = htons(port);
> +
> +	return 0;
> +}
> +
> +static int ibtrs_str_ipv6_to_sockaddr(const char *addr, size_t len,
> +				      short port, struct sockaddr *dst)
> +{
> +	struct sockaddr_in6 *dst_sin6 = (struct sockaddr_in6 *)dst;
> +	int ret;
> +
> +	ret = in6_pton(addr, len, dst_sin6->sin6_addr.s6_addr,
> +		       '\0', NULL);
> +	if (ret != 1)
> +		return -EINVAL;
> +
> +	dst_sin6->sin6_family = AF_INET6;
> +	dst_sin6->sin6_port = htons(port);
> +
> +	return 0;
> +}

We already added helpers for this in net utils; you don't need to
code it again.
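
For reference, an untested sketch using the shared helper from net/utils.c
(the same one nvme-rdma uses):

#include <linux/inet.h>
#include <linux/socket.h>
#include <net/net_namespace.h>

/* Parses IPv4 and IPv6 strings (including a scope id) plus an optional port. */
static int example_parse_addr(const char *addr, const char *port,
			      struct sockaddr_storage *ss)
{
	return inet_pton_with_scope(&init_net, AF_UNSPEC, addr, port, ss);
}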

> +
> +static int ibtrs_str_gid_to_sockaddr(const char *addr, size_t len,
> +				     short port, struct sockaddr *dst)
> +{
> +	struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
> +	int ret;
> +
> +	/* We can use some of the I6 functions since GID is a valid
> +	 * IPv6 address format
> +	 */
> +	ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
> +	if (ret == 0)
> +		return -EINVAL;
> +
> +	dst_ib->sib_family = AF_IB;
> +	/*
> +	 * Use the same TCP server port number as the IB service ID
> +	 * on the IB port space range
> +	 */
> +	dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | port);
> +	dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
> +	dst_ib->sib_pkey = cpu_to_be16(0xffff);
> +
> +	return 0;
> +}

Would be a nice addition to net utils.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 04/24] ibtrs: client: private header with client structs and functions
  2018-02-02 14:08 ` [PATCH 04/24] ibtrs: client: private header with client structs and functions Roman Pen
@ 2018-02-05 10:59   ` Sagi Grimberg
  2018-02-06 12:23     ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 10:59 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

Hi Roman,


> +struct ibtrs_clt_io_req {
> +	struct list_head        list;
> +	struct ibtrs_iu		*iu;
> +	struct scatterlist	*sglist; /* list holding user data */
> +	unsigned int		sg_cnt;
> +	unsigned int		sg_size;
> +	unsigned int		data_len;
> +	unsigned int		usr_len;
> +	void			*priv;
> +	bool			in_use;
> +	struct ibtrs_clt_con	*con;
> +	union {
> +		struct ib_pool_fmr	**fmr_list;
> +		struct ibtrs_fr_desc	**fr_list;
> +	};

We are pretty much stuck with FMRs for legacy devices; there are no plans
to support them going forward, so please don't add new dependencies on
them. It's already hard enough to get rid of them.

> +	void			*map_page;
> +	struct ibtrs_tag	*tag;

Can I ask why you need another tag that is not the request
tag?

> +	u16			nmdesc;
> +	enum dma_data_direction dir;
> +	ibtrs_conf_fn		*conf;
> +	unsigned long		start_time;
> +};
> +

> +static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
> +{
> +	if (unlikely(!c))
> +		return NULL;
> +
> +	return container_of(c, struct ibtrs_clt_con, c);
> +}
> +
> +static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
> +{
> +	if (unlikely(!s))
> +		return NULL;
> +
> +	return container_of(s, struct ibtrs_clt_sess, s);
> +}

Seems a bit awkward that container_of wrappers check pointer validity...
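
The usual form would simply be (sketch):

/* No NULL check: callers are expected to pass a valid pointer. */
static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
{
	return container_of(c, struct ibtrs_clt_con, c);
}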

> +/**
> + * list_next_or_null_rr - get next list element in round-robin fashion.
> + * @pos:     entry, starting cursor.
> + * @head:    head of the list to examine. This list must have at least one
> + *           element, namely @pos.
> + * @member:  name of the list_head structure within typeof(*pos).
> + *
> + * Important to understand that @pos is a list entry, which can be already
> + * removed using list_del_rcu(), so if @head has become empty NULL will be
> + * returned. Otherwise next element is returned in round-robin fashion.
> + */
> +#define list_next_or_null_rcu_rr(pos, head, member) ({			\
> +	typeof(pos) ________next = NULL;				\
> +									\
> +	if (!list_empty(head))						\
> +		________next = (pos)->member.next != (head) ?		\
> +			list_entry_rcu((pos)->member.next,		\
> +				       typeof(*pos), member) :		\
> +			list_entry_rcu((pos)->member.next->next,	\
> +				       typeof(*pos), member);		\
> +	________next;							\
> +})

Why is this local to your driver?

> +
> +/* See ibtrs-log.h */
> +#define TYPES_TO_SESSNAME(obj)						\
> +	LIST(CASE(obj, struct ibtrs_clt_sess *, s.sessname),		\
> +	     CASE(obj, struct ibtrs_clt *, sessname))
> +
> +#define TAG_SIZE(clt) (sizeof(struct ibtrs_tag) + (clt)->pdu_sz)
> +#define GET_TAG(clt, idx) ((clt)->tags + TAG_SIZE(clt) * idx)

Still don't understand why this is even needed..

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/24] ibtrs: client: main functionality
  2018-02-02 14:08 ` [PATCH 05/24] ibtrs: client: main functionality Roman Pen
  2018-02-02 16:54   ` Bart Van Assche
@ 2018-02-05 11:19   ` Sagi Grimberg
  2018-02-05 14:19     ` Roman Penyaev
  1 sibling, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 11:19 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

Hi Roman,

> +static inline void ibtrs_clt_state_lock(void)
> +{
> +	rcu_read_lock();
> +}
> +
> +static inline void ibtrs_clt_state_unlock(void)
> +{
> +	rcu_read_unlock();
> +}

This looks rather pointless...

> +
> +#define cmpxchg_min(var, new) ({					\
> +	typeof(var) old;						\
> +									\
> +	do {								\
> +		old = var;						\
> +		new = (!old ? new : min_t(typeof(var), old, new));	\
> +	} while (cmpxchg(&var, old, new) != old);			\
> +})

Why is this sort of thing local to your driver?

> +/**
> + * struct ibtrs_fr_pool - pool of fast registration descriptors
> + *
> + * An entry is available for allocation if and only if it occurs in @free_list.
> + *
> + * @size:      Number of descriptors in this pool.
> + * @max_page_list_len: Maximum fast registration work request page list length.
> + * @lock:      Protects free_list.
> + * @free_list: List of free descriptors.
> + * @desc:      Fast registration descriptor pool.
> + */
> +struct ibtrs_fr_pool {
> +	int			size;
> +	int			max_page_list_len;
> +	spinlock_t		lock; /* protects free_list */
> +	struct list_head	free_list;
> +	struct ibtrs_fr_desc	desc[0];
> +};

We already have a per-QP FR list implementation; any specific reason to
implement it again?
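
For reference, a rough, untested sketch of the existing per-QP MR pool
(drivers/infiniband/core/mr_pool.c), which the rw API already uses
internally:

#include <rdma/mr_pool.h>

static int example_init_fr_pool(struct ib_qp *qp, int nr_mrs, u32 max_sg)
{
	return ib_mr_pool_init(qp, &qp->rdma_mrs, nr_mrs,
			       IB_MR_TYPE_MEM_REG, max_sg);
}

static struct ib_mr *example_get_mr(struct ib_qp *qp)
{
	return ib_mr_pool_get(qp, &qp->rdma_mrs);	/* NULL if pool is empty */
}

static void example_put_mr(struct ib_qp *qp, struct ib_mr *mr)
{
	ib_mr_pool_put(qp, &qp->rdma_mrs, mr);
}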

> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> +	size_t max_depth = clt->queue_depth;
> +	struct ibtrs_tag *tag;
> +	int cpu, bit;
> +
> +	cpu = get_cpu();
> +	do {
> +		bit = find_first_zero_bit(clt->tags_map, max_depth);
> +		if (unlikely(bit >= max_depth)) {
> +			put_cpu();
> +			return NULL;
> +		}
> +
> +	} while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> +	put_cpu();
> +
> +	tag = GET_TAG(clt, bit);
> +	WARN_ON(tag->mem_id != bit);
> +	tag->cpu_id = cpu;
> +	tag->con_type = con_type;
> +
> +	return tag;
> +}
> +
> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
> +				   struct ibtrs_tag *tag)
> +{
> +	clear_bit_unlock(tag->mem_id, clt->tags_map);
> +}
> +
> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
> +				    enum ibtrs_clt_con_type con_type,
> +				    int can_wait)
> +{
> +	struct ibtrs_tag *tag;
> +	DEFINE_WAIT(wait);
> +
> +	tag = __ibtrs_get_tag(clt, con_type);
> +	if (likely(tag) || !can_wait)
> +		return tag;
> +
> +	do {
> +		prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
> +		tag = __ibtrs_get_tag(clt, con_type);
> +		if (likely(tag))
> +			break;
> +
> +		io_schedule();
> +	} while (1);
> +
> +	finish_wait(&clt->tags_wait, &wait);
> +
> +	return tag;
> +}
> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
> +
> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
> +{
> +	if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
> +		return;
> +
> +	__ibtrs_put_tag(clt, tag);
> +
> +	/*
> +	 * Putting a tag is a barrier, so we will observe
> +	 * new entry in the wait list, no worries.
> +	 */
> +	if (waitqueue_active(&clt->tags_wait))
> +		wake_up(&clt->tags_wait);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_put_tag);

Again, it is not clear why the tags are needed...

> +/**
> + * ibtrs_destroy_fr_pool() - free the resources owned by a pool
> + * @pool: Fast registration pool to be destroyed.
> + */
> +static void ibtrs_destroy_fr_pool(struct ibtrs_fr_pool *pool)
> +{
> +	struct ibtrs_fr_desc *d;
> +	int i, err;
> +
> +	if (!pool)
> +		return;
> +
> +	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> +		if (d->mr) {
> +			err = ib_dereg_mr(d->mr);
> +			if (err)
> +				pr_err("Failed to deregister memory region,"
> +				       " err: %d\n", err);
> +		}
> +	}
> +	kfree(pool);
> +}
> +
> +/**
> + * ibtrs_create_fr_pool() - allocate and initialize a pool for fast registration
> + * @device:            IB device to allocate fast registration descriptors for.
> + * @pd:                Protection domain associated with the FR descriptors.
> + * @pool_size:         Number of descriptors to allocate.
> + * @max_page_list_len: Maximum fast registration work request page list length.
> + */
> +static struct ibtrs_fr_pool *ibtrs_create_fr_pool(struct ib_device *device,
> +						  struct ib_pd *pd,
> +						  int pool_size,
> +						  int max_page_list_len)
> +{
> +	struct ibtrs_fr_pool *pool;
> +	struct ibtrs_fr_desc *d;
> +	struct ib_mr *mr;
> +	int i, ret;
> +
> +	if (pool_size <= 0) {
> +		pr_warn("Creating fr pool failed, invalid pool size %d\n",
> +			pool_size);
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	pool = kzalloc(sizeof(*pool) + pool_size * sizeof(*d), GFP_KERNEL);
> +	if (!pool) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	pool->size = pool_size;
> +	pool->max_page_list_len = max_page_list_len;
> +	spin_lock_init(&pool->lock);
> +	INIT_LIST_HEAD(&pool->free_list);
> +
> +	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> +		mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, max_page_list_len);
> +		if (IS_ERR(mr)) {
> +			pr_warn("Failed to allocate fast region memory\n");
> +			ret = PTR_ERR(mr);
> +			goto destroy_pool;
> +		}
> +		d->mr = mr;
> +		list_add_tail(&d->entry, &pool->free_list);
> +	}
> +
> +	return pool;
> +
> +destroy_pool:
> +	ibtrs_destroy_fr_pool(pool);
> +err:
> +	return ERR_PTR(ret);
> +}
> +
> +/**
> + * ibtrs_fr_pool_get() - obtain a descriptor suitable for fast registration
> + * @pool: Pool to obtain descriptor from.
> + */
> +static struct ibtrs_fr_desc *ibtrs_fr_pool_get(struct ibtrs_fr_pool *pool)
> +{
> +	struct ibtrs_fr_desc *d = NULL;
> +
> +	spin_lock_bh(&pool->lock);
> +	if (!list_empty(&pool->free_list)) {
> +		d = list_first_entry(&pool->free_list, typeof(*d), entry);
> +		list_del(&d->entry);
> +	}
> +	spin_unlock_bh(&pool->lock);
> +
> +	return d;
> +}
> +
> +/**
> + * ibtrs_fr_pool_put() - put an FR descriptor back in the free list
> + * @pool: Pool the descriptor was allocated from.
> + * @desc: Pointer to an array of fast registration descriptor pointers.
> + * @n:    Number of descriptors to put back.
> + *
> + * Note: The caller must already have queued an invalidation request for
> + * desc->mr->rkey before calling this function.
> + */
> +static void ibtrs_fr_pool_put(struct ibtrs_fr_pool *pool,
> +			      struct ibtrs_fr_desc **desc, int n)
> +{
> +	int i;
> +
> +	spin_lock_bh(&pool->lock);
> +	for (i = 0; i < n; i++)
> +		list_add(&desc[i]->entry, &pool->free_list);
> +	spin_unlock_bh(&pool->lock);
> +}
> +
> +static void ibtrs_map_desc(struct ibtrs_map_state *state, dma_addr_t dma_addr,
> +			   u32 dma_len, u32 rkey, u32 max_desc)
> +{
> +	struct ibtrs_sg_desc *desc = state->desc;
> +
> +	pr_debug("dma_addr %llu, key %u, dma_len %u\n",
> +		 dma_addr, rkey, dma_len);
> +	desc->addr = cpu_to_le64(dma_addr);
> +	desc->key  = cpu_to_le32(rkey);
> +	desc->len  = cpu_to_le32(dma_len);
> +
> +	state->total_len += dma_len;
> +	if (state->ndesc < max_desc) {
> +		state->desc++;
> +		state->ndesc++;
> +	} else {
> +		state->ndesc = INT_MIN;
> +		pr_err("Could not fit S/G list into buffer descriptor %d.\n",
> +		       max_desc);
> +	}
> +}
> +
> +static int ibtrs_map_finish_fmr(struct ibtrs_map_state *state,
> +				struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ib_pool_fmr *fmr;
> +	dma_addr_t dma_addr;
> +	u64 io_addr = 0;
> +
> +	fmr = ib_fmr_pool_map_phys(sess->fmr_pool, state->pages,
> +				   state->npages, io_addr);
> +	if (IS_ERR(fmr)) {
> +		ibtrs_wrn_rl(sess, "Failed to map FMR from FMR pool, "
> +			     "err: %ld\n", PTR_ERR(fmr));
> +		return PTR_ERR(fmr);
> +	}
> +
> +	*state->next_fmr++ = fmr;
> +	state->nmdesc++;
> +	dma_addr = state->base_dma_addr & ~sess->mr_page_mask;
> +	pr_debug("ndesc = %d, nmdesc = %d, npages = %d\n",
> +		 state->ndesc, state->nmdesc, state->npages);
> +	if (state->dir == DMA_TO_DEVICE)
> +		ibtrs_map_desc(state, dma_addr, state->dma_len, fmr->fmr->lkey,
> +			       sess->max_desc);
> +	else
> +		ibtrs_map_desc(state, dma_addr, state->dma_len, fmr->fmr->rkey,
> +			       sess->max_desc);
> +
> +	return 0;
> +}
> +
> +static void ibtrs_clt_fast_reg_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ibtrs_clt_con *con = cq->cq_context;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +
> +	if (unlikely(wc->status != IB_WC_SUCCESS)) {
> +		ibtrs_err(sess, "Failed IB_WR_REG_MR: %s\n",
> +			  ib_wc_status_msg(wc->status));
> +		ibtrs_rdma_error_recovery(con);
> +	}
> +}
> +
> +static struct ib_cqe fast_reg_cqe = {
> +	.done = ibtrs_clt_fast_reg_done
> +};
> +
> +/* TODO */
> +static int ibtrs_map_finish_fr(struct ibtrs_map_state *state,
> +			       struct ibtrs_clt_con *con, int sg_cnt,
> +			       unsigned int *sg_offset_p)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_fr_desc *desc;
> +	struct ib_send_wr *bad_wr;
> +	struct ib_reg_wr wr;
> +	struct ib_pd *pd;
> +	u32 rkey;
> +	int n;
> +
> +	pd = sess->s.ib_dev->pd;
> +	if (sg_cnt == 1 && (pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY)) {
> +		unsigned int sg_offset = sg_offset_p ? *sg_offset_p : 0;
> +
> +		ibtrs_map_desc(state, sg_dma_address(state->sg) + sg_offset,
> +			       sg_dma_len(state->sg) - sg_offset,
> +			       pd->unsafe_global_rkey, sess->max_desc);
> +		if (sg_offset_p)
> +			*sg_offset_p = 0;
> +		return 1;
> +	}
> +
> +	desc = ibtrs_fr_pool_get(con->fr_pool);
> +	if (!desc) {
> +		ibtrs_wrn_rl(sess, "Failed to get descriptor from FR pool\n");
> +		return -ENOMEM;
> +	}
> +
> +	rkey = ib_inc_rkey(desc->mr->rkey);
> +	ib_update_fast_reg_key(desc->mr, rkey);
> +
> +	memset(&wr, 0, sizeof(wr));
> +	n = ib_map_mr_sg(desc->mr, state->sg, sg_cnt, sg_offset_p,
> +			 sess->mr_page_size);
> +	if (unlikely(n < 0)) {
> +		ibtrs_fr_pool_put(con->fr_pool, &desc, 1);
> +		return n;
> +	}
> +
> +	wr.wr.next = NULL;
> +	wr.wr.opcode = IB_WR_REG_MR;
> +	wr.wr.wr_cqe = &fast_reg_cqe;
> +	wr.wr.num_sge = 0;
> +	wr.wr.send_flags = 0;
> +	wr.mr = desc->mr;
> +	wr.key = desc->mr->rkey;
> +	wr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE);

Do you actually ever have remote write access in your protocol?

> +static void ibtrs_clt_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ibtrs_clt_con *con = cq->cq_context;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +
> +	if (unlikely(wc->status != IB_WC_SUCCESS)) {
> +		ibtrs_err(sess, "Failed IB_WR_LOCAL_INV: %s\n",
> +			  ib_wc_status_msg(wc->status));
> +		ibtrs_rdma_error_recovery(con);
> +	}
> +}
> +
> +static struct ib_cqe local_inv_cqe = {
> +	.done = ibtrs_clt_inv_rkey_done
> +};
> +
> +static int ibtrs_inv_rkey(struct ibtrs_clt_con *con, u32 rkey)
> +{
> +	struct ib_send_wr *bad_wr;
> +	struct ib_send_wr wr = {
> +		.opcode		    = IB_WR_LOCAL_INV,
> +		.wr_cqe		    = &local_inv_cqe,
> +		.next		    = NULL,
> +		.num_sge	    = 0,
> +		.send_flags	    = 0,
> +		.ex.invalidate_rkey = rkey,
> +	};
> +
> +	return ib_post_send(con->c.qp, &wr, &bad_wr);
> +}

Is it safe not to signal the local invalidate? A recent report
suggested that this is not safe in the presence of ack drops.
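
A rough, untested sketch of the signalled variant (names illustrative); the
request would then only be completed from the invalidation's completion
handler:

#include <rdma/ib_verbs.h>

static int example_inv_rkey_signalled(struct ib_qp *qp, struct ib_cqe *cqe,
				      u32 rkey)
{
	struct ib_send_wr *bad_wr;
	struct ib_send_wr wr = {
		.opcode		    = IB_WR_LOCAL_INV,
		.wr_cqe		    = cqe,	/* ->done completes the request */
		.send_flags	    = IB_SEND_SIGNALED,
		.ex.invalidate_rkey = rkey,
	};

	return ib_post_send(qp, &wr, &bad_wr);
}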

> +static int ibtrs_post_send_rdma(struct ibtrs_clt_con *con,
> +				struct ibtrs_clt_io_req *req,
> +				u64 addr, u32 off, u32 imm)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	enum ib_send_flags flags;
> +	struct ib_sge list[1];
> +
> +	if (unlikely(!req->sg_size)) {
> +		ibtrs_wrn(sess, "Doing RDMA Write failed, no data supplied\n");
> +		return -EINVAL;
> +	}
> +
> +	/* user data and user message in the first list element */
> +	list[0].addr   = req->iu->dma_addr;
> +	list[0].length = req->sg_size;
> +	list[0].lkey   = sess->s.ib_dev->lkey;
> +
> +	/*
> +	 * From time to time we have to post signalled sends,
> +	 * or send queue will fill up and only QP reset can help.
> +	 */
> +	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
> +			0 : IB_SEND_SIGNALED;
> +	return ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list, 1,
> +					    sess->srv_rdma_buf_rkey,
> +					    addr + off, imm, flags);
> +}
> +
> +static void ibtrs_set_sge_with_desc(struct ib_sge *list,
> +				    struct ibtrs_sg_desc *desc)
> +{
> +	list->addr   = le64_to_cpu(desc->addr);
> +	list->length = le32_to_cpu(desc->len);
> +	list->lkey   = le32_to_cpu(desc->key);
> +	pr_debug("dma_addr %llu, key %u, dma_len %u\n",
> +		 list->addr, list->lkey, list->length);
> +}
> +
> +static void ibtrs_set_rdma_desc_last(struct ibtrs_clt_con *con,
> +				     struct ib_sge *list,
> +				     struct ibtrs_clt_io_req *req,
> +				     struct ib_rdma_wr *wr, int offset,
> +				     struct ibtrs_sg_desc *desc, int m,
> +				     int n, u64 addr, u32 size, u32 imm)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	enum ib_send_flags flags;
> +	int i;
> +
> +	for (i = m; i < n; i++, desc++)
> +		ibtrs_set_sge_with_desc(&list[i], desc);
> +
> +	list[i].addr   = req->iu->dma_addr;
> +	list[i].length = size;
> +	list[i].lkey   = sess->s.ib_dev->lkey;
> +
> +	wr->wr.wr_cqe = &req->iu->cqe;
> +	wr->wr.sg_list = &list[m];
> +	wr->wr.num_sge = n - m + 1;
> +	wr->remote_addr	= addr + offset;
> +	wr->rkey = sess->srv_rdma_buf_rkey;
> +
> +	/*
> +	 * From time to time we have to post signalled sends,
> +	 * or send queue will fill up and only QP reset can help.
> +	 */
> +	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
> +			0 : IB_SEND_SIGNALED;
> +
> +	wr->wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
> +	wr->wr.send_flags  = flags;
> +	wr->wr.ex.imm_data = cpu_to_be32(imm);
> +}
> +
> +static int ibtrs_post_send_rdma_desc_more(struct ibtrs_clt_con *con,
> +					  struct ib_sge *list,
> +					  struct ibtrs_clt_io_req *req,
> +					  struct ibtrs_sg_desc *desc, int n,
> +					  u64 addr, u32 size, u32 imm)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	size_t max_sge, num_sge, num_wr;
> +	struct ib_send_wr *bad_wr;
> +	struct ib_rdma_wr *wrs, *wr;
> +	int j = 0, k, offset = 0, len = 0;
> +	int m = 0;
> +	int ret;
> +
> +	max_sge = sess->max_sge;
> +	num_sge = 1 + n;
> +	num_wr = DIV_ROUND_UP(num_sge, max_sge);
> +
> +	wrs = kcalloc(num_wr, sizeof(*wrs), GFP_ATOMIC);
> +	if (!wrs)
> +		return -ENOMEM;
> +
> +	if (num_wr == 1)
> +		goto last_one;
> +
> +	for (; j < num_wr; j++) {
> +		wr = &wrs[j];
> +		for (k = 0; k < max_sge; k++, desc++) {
> +			m = k + j * max_sge;
> +			ibtrs_set_sge_with_desc(&list[m], desc);
> +			len += le32_to_cpu(desc->len);
> +		}
> +		wr->wr.wr_cqe = &req->iu->cqe;
> +		wr->wr.sg_list = &list[m];
> +		wr->wr.num_sge = max_sge;
> +		wr->remote_addr	= addr + offset;
> +		wr->rkey = sess->srv_rdma_buf_rkey;
> +
> +		offset += len;
> +		wr->wr.next = &wrs[j + 1].wr;
> +		wr->wr.opcode = IB_WR_RDMA_WRITE;
> +	}
> +
> +last_one:
> +	wr = &wrs[j];
> +
> +	ibtrs_set_rdma_desc_last(con, list, req, wr, offset,
> +				 desc, m, n, addr, size, imm);
> +
> +	ret = ib_post_send(con->c.qp, &wrs[0].wr, &bad_wr);
> +	if (unlikely(ret))
> +		ibtrs_err(sess, "Posting write request to QP failed,"
> +			  " err: %d\n", ret);
> +	kfree(wrs);
> +	return ret;
> +}
> +
> +static int ibtrs_post_send_rdma_desc(struct ibtrs_clt_con *con,
> +				     struct ibtrs_clt_io_req *req,
> +				     struct ibtrs_sg_desc *desc, int n,
> +				     u64 addr, u32 size, u32 imm)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	enum ib_send_flags flags;
> +	struct ib_sge *list;
> +	size_t num_sge;
> +	int ret, i;
> +
> +	num_sge = 1 + n;
> +	list = kmalloc_array(num_sge, sizeof(*list), GFP_ATOMIC);
> +	if (!list)
> +		return -ENOMEM;
> +
> +	if (num_sge < sess->max_sge) {
> +		for (i = 0; i < n; i++, desc++)
> +			ibtrs_set_sge_with_desc(&list[i], desc);
> +		list[i].addr   = req->iu->dma_addr;
> +		list[i].length = size;
> +		list[i].lkey   = sess->s.ib_dev->lkey;
> +
> +		/*
> +		 * From time to time we have to post signalled sends,
> +		 * or send queue will fill up and only QP reset can help.
> +		 */
> +		flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
> +				0 : IB_SEND_SIGNALED;
> +		ret = ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list,
> +						   num_sge,
> +						   sess->srv_rdma_buf_rkey,
> +						   addr, imm, flags);
> +	} else {
> +		ret = ibtrs_post_send_rdma_desc_more(con, list, req, desc, n,
> +						     addr, size, imm);
> +	}
> +
> +	kfree(list);
> +	return ret;
> +}
> +
> +static int ibtrs_post_send_rdma_more(struct ibtrs_clt_con *con,
> +				     struct ibtrs_clt_io_req *req,
> +				     u64 addr, u32 size, u32 imm)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ib_device *ibdev = sess->s.ib_dev->dev;
> +	enum ib_send_flags flags;
> +	struct scatterlist *sg;
> +	struct ib_sge *list;
> +	size_t num_sge;
> +	int i, ret;
> +
> +	num_sge = 1 + req->sg_cnt;
> +	list = kmalloc_array(num_sge, sizeof(*list), GFP_ATOMIC);
> +	if (!list)
> +		return -ENOMEM;
> +
> +	for_each_sg(req->sglist, sg, req->sg_cnt, i) {
> +		list[i].addr   = ib_sg_dma_address(ibdev, sg);
> +		list[i].length = ib_sg_dma_len(ibdev, sg);
> +		list[i].lkey   = sess->s.ib_dev->lkey;
> +	}
> +	list[i].addr   = req->iu->dma_addr;
> +	list[i].length = size;
> +	list[i].lkey   = sess->s.ib_dev->lkey;
> +
> +	/*
> +	 * From time to time we have to post signalled sends,
> +	 * or send queue will fill up and only QP reset can help.
> +	 */
> +	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
> +			0 : IB_SEND_SIGNALED;
> +	ret = ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list, num_sge,
> +					   sess->srv_rdma_buf_rkey,
> +					   addr, imm, flags);
> +	kfree(list);
> +
> +	return ret;
> +}

All these RDMA helpers look like they could be replaced by the rdma rw
API if it were enhanced with immediate-data capabilities.

> +static inline unsigned long ibtrs_clt_get_raw_ms(void)
> +{
> +	struct timespec ts;
> +
> +	getrawmonotonic(&ts);
> +
> +	return timespec_to_ns(&ts) / NSEC_PER_MSEC;
> +}

Why is this local to your driver?
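
For reference, an untested sketch of the same value using existing kernel
helpers:

#include <linux/ktime.h>

static inline s64 example_raw_ms(void)
{
	return ktime_to_ms(ktime_get_raw());
}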

> +
> +static void complete_rdma_req(struct ibtrs_clt_io_req *req,
> +			      int errno, bool notify)
> +{
> +	struct ibtrs_clt_con *con = req->con;
> +	struct ibtrs_clt_sess *sess;
> +	enum dma_data_direction dir;
> +	struct ibtrs_clt *clt;
> +	void *priv;
> +
> +	if (WARN_ON(!req->in_use))
> +		return;
> +	if (WARN_ON(!req->con))
> +		return;
> +	sess = to_clt_sess(con->c.sess);
> +	clt = sess->clt;
> +
> +	if (req->sg_cnt > fmr_sg_cnt)
> +		ibtrs_unmap_fast_reg_data(req->con, req);
> +	if (req->sg_cnt)
> +		ib_dma_unmap_sg(sess->s.ib_dev->dev, req->sglist,
> +				req->sg_cnt, req->dir);
> +	if (sess->stats.enable_rdma_lat)
> +		ibtrs_clt_update_rdma_lat(&sess->stats,
> +					  req->dir == DMA_FROM_DEVICE,
> +					  ibtrs_clt_get_raw_ms() -
> +					  req->start_time);
> +	ibtrs_clt_decrease_inflight(&sess->stats);
> +
> +	req->in_use = false;
> +	req->con = NULL;
> +	priv = req->priv;
> +	dir = req->dir;
> +
> +	if (notify)
> +		req->conf(priv, errno);
> +}



> +
> +static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id, s16 errno)
> +{
> +	if (WARN_ON(msg_id >= sess->queue_depth))
> +		return;
> +
> +	complete_rdma_req(&sess->reqs[msg_id], errno, true);
> +}
> +
> +static struct ib_cqe io_comp_cqe = {
> +	.done = ibtrs_clt_rdma_done
> +};
> +
> +static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ibtrs_clt_con *con = cq->cq_context;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	u32 imm_type, imm_payload;
> +	int err;
> +
> +	if (unlikely(wc->status != IB_WC_SUCCESS)) {
> +		if (wc->status != IB_WC_WR_FLUSH_ERR) {
> +			ibtrs_err(sess, "RDMA failed: %s\n",
> +				  ib_wc_status_msg(wc->status));
> +			ibtrs_rdma_error_recovery(con);
> +		}
> +		return;
> +	}
> +	ibtrs_clt_update_wc_stats(con);
> +
> +	switch (wc->opcode) {
> +	case IB_WC_RDMA_WRITE:
> +		/*
> +		 * post_send() RDMA write completions of IO reqs (read/write)
> +		 * and hb
> +		 */
> +		break;
> +	case IB_WC_RECV_RDMA_WITH_IMM:
> +		/*
> +		 * post_recv() RDMA write completions of IO reqs (read/write)
> +		 * and hb
> +		 */
> +		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
> +			return;
> +		err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
> +		if (unlikely(err)) {
> +			ibtrs_err(sess, "ibtrs_post_recv_empty(): %d\n", err);
> +			ibtrs_rdma_error_recovery(con);
> +			break;
> +		}
> +		ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
> +			       &imm_type, &imm_payload);
> +		if (likely(imm_type == IBTRS_IO_RSP_IMM)) {
> +			u32 msg_id;
> +
> +			ibtrs_from_io_rsp_imm(imm_payload, &msg_id, &err);
> +			process_io_rsp(sess, msg_id, err);
> +		} else if (imm_type == IBTRS_HB_MSG_IMM) {
> +			WARN_ON(con->c.cid);
> +			ibtrs_send_hb_ack(&sess->s);
> +		} else if (imm_type == IBTRS_HB_ACK_IMM) {
> +			WARN_ON(con->c.cid);
> +			sess->s.hb_missed_cnt = 0;
> +		} else {
> +			ibtrs_wrn(sess, "Unknown IMM type %u\n", imm_type);
> +		}
> +		break;
> +	default:
> +		ibtrs_wrn(sess, "Unexpected WC type: %s\n",
> +			  ib_wc_opcode_str(wc->opcode));
> +		return;
> +	}

Is there a spec somewhere with the protocol information that explains
how this all works?

> +struct path_it {
> +	int i;
> +	struct list_head skip_list;
> +	struct ibtrs_clt *clt;
> +	struct ibtrs_clt_sess *(*next_path)(struct path_it *);
> +};
> +
> +#define do_each_path(path, clt, it) {					\
> +	path_it_init(it, clt);						\
> +	ibtrs_clt_state_lock();						\
> +	for ((it)->i = 0; ((path) = ((it)->next_path)(it)) &&		\
> +			  (it)->i < (it)->clt->paths_num;		\
> +	     (it)->i++)
> +
> +#define while_each_path(it)						\
> +	path_it_deinit(it);						\
> +	ibtrs_clt_state_unlock();					\
> +	}
> +
> +/**
> + * get_next_path_rr() - Returns path in round-robin fashion.
> + *
> + * Related to @MP_POLICY_RR
> + *
> + * Locks:
> + *    ibtrs_clt_state_lock() must be hold.
> + */
> +static struct ibtrs_clt_sess *get_next_path_rr(struct path_it *it)
> +{
> +	struct ibtrs_clt_sess __percpu * __rcu *ppcpu_path, *path;
> +	struct ibtrs_clt *clt = it->clt;
> +
> +	ppcpu_path = this_cpu_ptr(clt->pcpu_path);
> +	path = rcu_dereference(*ppcpu_path);
> +	if (unlikely(!path))
> +		path = list_first_or_null_rcu(&clt->paths_list,
> +					      typeof(*path), s.entry);
> +	else
> +		path = list_next_or_null_rcu_rr(path, &clt->paths_list,
> +						s.entry);
> +	rcu_assign_pointer(*ppcpu_path, path);
> +
> +	return path;
> +}
> +
> +/**
> + * get_next_path_min_inflight() - Returns path with minimal inflight count.
> + *
> + * Related to @MP_POLICY_MIN_INFLIGHT
> + *
> + * Locks:
> + *    ibtrs_clt_state_lock() must be hold.
> + */
> +static struct ibtrs_clt_sess *get_next_path_min_inflight(struct path_it *it)
> +{
> +	struct ibtrs_clt_sess *min_path = NULL;
> +	struct ibtrs_clt *clt = it->clt;
> +	struct ibtrs_clt_sess *sess;
> +	int min_inflight = INT_MAX;
> +	int inflight;
> +
> +	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry) {
> +		if (unlikely(!list_empty(raw_cpu_ptr(sess->mp_skip_entry))))
> +			continue;
> +
> +		inflight = atomic_read(&sess->stats.inflight);
> +
> +		if (inflight < min_inflight) {
> +			min_inflight = inflight;
> +			min_path = sess;
> +		}
> +	}
> +
> +	/*
> +	 * add the path to the skip list, so that next time we can get
> +	 * a different one
> +	 */
> +	if (min_path)
> +		list_add(raw_cpu_ptr(min_path->mp_skip_entry), &it->skip_list);
> +
> +	return min_path;
> +}
> +
> +static inline void path_it_init(struct path_it *it, struct ibtrs_clt *clt)
> +{
> +	INIT_LIST_HEAD(&it->skip_list);
> +	it->clt = clt;
> +	it->i = 0;
> +
> +	if (clt->mp_policy == MP_POLICY_RR)
> +		it->next_path = get_next_path_rr;
> +	else
> +		it->next_path = get_next_path_min_inflight;
> +}
> +
> +static inline void path_it_deinit(struct path_it *it)
> +{
> +	struct list_head *skip, *tmp;
> +	/*
> +	 * The skip_list is used only for the MIN_INFLIGHT policy.
> +	 * We need to remove paths from it, so that next IO can insert
> +	 * paths (->mp_skip_entry) into a skip_list again.
> +	 */
> +	list_for_each_safe(skip, tmp, &it->skip_list)
> +		list_del_init(skip);
> +}
> +
> +static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
> +				      struct ibtrs_clt_sess *sess,
> +				      ibtrs_conf_fn *conf,
> +				      struct ibtrs_tag *tag, void *priv,
> +				      const struct kvec *vec, size_t usr_len,
> +				      struct scatterlist *sg, size_t sg_cnt,
> +				      size_t data_len, int dir)
> +{
> +	req->tag = tag;
> +	req->in_use = true;
> +	req->usr_len = usr_len;
> +	req->data_len = data_len;
> +	req->sglist = sg;
> +	req->sg_cnt = sg_cnt;
> +	req->priv = priv;
> +	req->dir = dir;
> +	req->con = ibtrs_tag_to_clt_con(sess, tag);
> +	req->conf = conf;
> +	copy_from_kvec(req->iu->buf, vec, usr_len);
> +	if (sess->stats.enable_rdma_lat)
> +		req->start_time = ibtrs_clt_get_raw_ms();
> +}
> +
> +static inline struct ibtrs_clt_io_req *
> +ibtrs_clt_get_req(struct ibtrs_clt_sess *sess, ibtrs_conf_fn *conf,
> +		  struct ibtrs_tag *tag, void *priv,
> +		  const struct kvec *vec, size_t usr_len,
> +		  struct scatterlist *sg, size_t sg_cnt,
> +		  size_t data_len, int dir)
> +{
> +	struct ibtrs_clt_io_req *req;
> +
> +	req = &sess->reqs[tag->mem_id];
> +	ibtrs_clt_init_req(req, sess, conf, tag, priv, vec, usr_len,
> +			   sg, sg_cnt, data_len, dir);
> +	return req;
> +}
> +
> +static inline struct ibtrs_clt_io_req *
> +ibtrs_clt_get_copy_req(struct ibtrs_clt_sess *alive_sess,
> +		       struct ibtrs_clt_io_req *fail_req)
> +{
> +	struct ibtrs_clt_io_req *req;
> +	struct kvec vec = {
> +		.iov_base = fail_req->iu->buf,
> +		.iov_len  = fail_req->usr_len
> +	};
> +
> +	req = &alive_sess->reqs[fail_req->tag->mem_id];
> +	ibtrs_clt_init_req(req, alive_sess, fail_req->conf, fail_req->tag,
> +			   fail_req->priv, &vec, fail_req->usr_len,
> +			   fail_req->sglist, fail_req->sg_cnt,
> +			   fail_req->data_len, fail_req->dir);
> +	return req;
> +}
> +
> +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
> +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);
> +
> +static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
> +				  struct ibtrs_clt_io_req *fail_req)
> +{
> +	struct ibtrs_clt_sess *alive_sess;
> +	struct ibtrs_clt_io_req *req;
> +	int err = -ECONNABORTED;
> +	struct path_it it;
> +
> +	do_each_path(alive_sess, clt, &it) {
> +		if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
> +			continue;
> +		req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
> +		if (req->dir == DMA_TO_DEVICE)
> +			err = ibtrs_clt_write_req(req);
> +		else
> +			err = ibtrs_clt_read_req(req);
> +		if (unlikely(err)) {
> +			req->in_use = false;
> +			continue;
> +		}
> +		/* Success path */
> +		ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
> +		break;
> +	} while_each_path(&it);
> +
> +	return err;
> +}
> +
> +static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess,
> +				      bool failover)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +	struct ibtrs_clt_io_req *req;
> +	int i;
> +
> +	if (!sess->reqs)
> +		return;
> +	for (i = 0; i < sess->queue_depth; ++i) {
> +		bool notify;
> +		int err = 0;
> +
> +		req = &sess->reqs[i];
> +		if (!req->in_use)
> +			continue;
> +
> +		if (failover)
> +			err = ibtrs_clt_failover_req(clt, req);
> +
> +		notify = (!failover || err);
> +		complete_rdma_req(req, -ECONNABORTED, notify);
> +	}
> +}
> +
> +static void free_sess_reqs(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt_io_req *req;
> +	int i;
> +
> +	if (!sess->reqs)
> +		return;
> +	for (i = 0; i < sess->queue_depth; ++i) {
> +		req = &sess->reqs[i];
> +		if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR)
> +			kfree(req->fr_list);
> +		else if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR)
> +			kfree(req->fmr_list);
> +		kfree(req->map_page);
> +		ibtrs_iu_free(req->iu, DMA_TO_DEVICE,
> +			      sess->s.ib_dev->dev);
> +	}
> +	kfree(sess->reqs);
> +	sess->reqs = NULL;
> +}
> +
> +static int alloc_sess_reqs(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt_io_req *req;
> +	void *mr_list;
> +	int i;
> +
> +	sess->reqs = kcalloc(sess->queue_depth, sizeof(*sess->reqs),
> +			     GFP_KERNEL);
> +	if (unlikely(!sess->reqs))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < sess->queue_depth; ++i) {
> +		req = &sess->reqs[i];
> +		req->iu = ibtrs_iu_alloc(i, sess->max_req_size, GFP_KERNEL,
> +					 sess->s.ib_dev->dev, DMA_TO_DEVICE,
> +					 ibtrs_clt_rdma_done);
> +		if (unlikely(!req->iu))
> +			goto out;
> +		mr_list = kmalloc_array(sess->max_pages_per_mr,
> +					sizeof(void *), GFP_KERNEL);
> +		if (unlikely(!mr_list))
> +			goto out;
> +		if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR)
> +			req->fr_list = mr_list;
> +		else if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR)
> +			req->fmr_list = mr_list;
> +
> +		req->map_page = kmalloc_array(sess->max_pages_per_mr,
> +					      sizeof(void *), GFP_KERNEL);
> +		if (unlikely(!req->map_page))
> +			goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	free_sess_reqs(sess);
> +
> +	return -ENOMEM;
> +}
> +
> +static int alloc_tags(struct ibtrs_clt *clt)
> +{
> +	unsigned int chunk_bits;
> +	int err, i;
> +
> +	clt->tags_map = kcalloc(BITS_TO_LONGS(clt->queue_depth), sizeof(long),
> +				GFP_KERNEL);
> +	if (unlikely(!clt->tags_map)) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +	clt->tags = kcalloc(clt->queue_depth, TAG_SIZE(clt), GFP_KERNEL);
> +	if (unlikely(!clt->tags)) {
> +		err = -ENOMEM;
> +		goto err_map;
> +	}
> +	chunk_bits = ilog2(clt->queue_depth - 1) + 1;
> +	for (i = 0; i < clt->queue_depth; i++) {
> +		struct ibtrs_tag *tag;
> +
> +		tag = GET_TAG(clt, i);
> +		tag->mem_id = i;
> +		tag->mem_off = i << (MAX_IMM_PAYL_BITS - chunk_bits);
> +	}
> +
> +	return 0;
> +
> +err_map:
> +	kfree(clt->tags_map);
> +	clt->tags_map = NULL;
> +out_err:
> +	return err;
> +}
> +
> +static void free_tags(struct ibtrs_clt *clt)
> +{
> +	kfree(clt->tags_map);
> +	clt->tags_map = NULL;
> +	kfree(clt->tags);
> +	clt->tags = NULL;
> +}
> +
> +static void query_fast_reg_mode(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_ib_dev *ib_dev;
> +	u64 max_pages_per_mr;
> +	int mr_page_shift;
> +
> +	ib_dev = sess->s.ib_dev;
> +	if (ib_dev->dev->alloc_fmr && ib_dev->dev->dealloc_fmr &&
> +	    ib_dev->dev->map_phys_fmr && ib_dev->dev->unmap_fmr) {
> +		sess->fast_reg_mode = IBTRS_FAST_MEM_FMR;
> +		ibtrs_info(sess, "Device %s supports FMR\n", ib_dev->dev->name);
> +	}
> +	if (ib_dev->attrs.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS &&
> +	    use_fr) {
> +		sess->fast_reg_mode = IBTRS_FAST_MEM_FR;
> +		ibtrs_info(sess, "Device %s supports FR\n", ib_dev->dev->name);
> +	}
> +
> +	/*
> +	 * Use the smallest page size supported by the HCA, down to a
> +	 * minimum of 4096 bytes. We're unlikely to build large sglists
> +	 * out of smaller entries.
> +	 */
> +	mr_page_shift      = max(12, ffs(ib_dev->attrs.page_size_cap) - 1);
> +	sess->mr_page_size = 1 << mr_page_shift;
> +	sess->max_sge      = ib_dev->attrs.max_sge;
> +	sess->mr_page_mask = ~((u64)sess->mr_page_size - 1);
> +	max_pages_per_mr   = ib_dev->attrs.max_mr_size;
> +	do_div(max_pages_per_mr, sess->mr_page_size);
> +	sess->max_pages_per_mr = min_t(u64, sess->max_pages_per_mr,
> +				       max_pages_per_mr);
> +	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
> +		sess->max_pages_per_mr =
> +			min_t(u32, sess->max_pages_per_mr,
> +			      ib_dev->attrs.max_fast_reg_page_list_len);
> +	}
> +	sess->mr_max_size = sess->mr_page_size * sess->max_pages_per_mr;
> +}
> +
> +static int alloc_con_fast_pool(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_fr_pool *fr_pool;
> +	int err = 0;
> +
> +	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
> +		fr_pool = ibtrs_create_fr_pool(sess->s.ib_dev->dev,
> +					       sess->s.ib_dev->pd,
> +					       sess->queue_depth,
> +					       sess->max_pages_per_mr);
> +		if (unlikely(IS_ERR(fr_pool))) {
> +			err = PTR_ERR(fr_pool);
> +			ibtrs_err(sess, "FR pool allocation failed, err: %d\n",
> +				  err);
> +			return err;
> +		}
> +		con->fr_pool = fr_pool;
> +	}
> +
> +	return err;
> +}
> +
> +static void free_con_fast_pool(struct ibtrs_clt_con *con)
> +{
> +	if (con->fr_pool) {
> +		ibtrs_destroy_fr_pool(con->fr_pool);
> +		con->fr_pool = NULL;
> +	}
> +}
> +
> +static int alloc_sess_fast_pool(struct ibtrs_clt_sess *sess)
> +{
> +	struct ib_fmr_pool_param fmr_param;
> +	struct ib_fmr_pool *fmr_pool;
> +	int err = 0;
> +
> +	if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR) {
> +		memset(&fmr_param, 0, sizeof(fmr_param));
> +		fmr_param.pool_size	    = sess->queue_depth *
> +					      sess->max_pages_per_mr;
> +		fmr_param.dirty_watermark   = fmr_param.pool_size / 4;
> +		fmr_param.cache		    = 0;
> +		fmr_param.max_pages_per_fmr = sess->max_pages_per_mr;
> +		fmr_param.page_shift	    = ilog2(sess->mr_page_size);
> +		fmr_param.access	    = (IB_ACCESS_LOCAL_WRITE |
> +					       IB_ACCESS_REMOTE_WRITE);
> +
> +		fmr_pool = ib_create_fmr_pool(sess->s.ib_dev->pd, &fmr_param);
> +		if (unlikely(IS_ERR(fmr_pool))) {
> +			err = PTR_ERR(fmr_pool);
> +			ibtrs_err(sess, "FMR pool allocation failed, err: %d\n",
> +				  err);
> +			return err;
> +		}
> +		sess->fmr_pool = fmr_pool;
> +	}
> +
> +	return err;
> +}
> +
> +static void free_sess_fast_pool(struct ibtrs_clt_sess *sess)
> +{
> +	if (sess->fmr_pool) {
> +		ib_destroy_fmr_pool(sess->fmr_pool);
> +		sess->fmr_pool = NULL;
> +	}
> +}
> +
> +static int alloc_sess_io_bufs(struct ibtrs_clt_sess *sess)
> +{
> +	int ret;
> +
> +	ret = alloc_sess_reqs(sess);
> +	if (unlikely(ret)) {
> +		ibtrs_err(sess, "alloc_sess_reqs(), err: %d\n", ret);
> +		return ret;
> +	}
> +	ret = alloc_sess_fast_pool(sess);
> +	if (unlikely(ret)) {
> +		ibtrs_err(sess, "alloc_sess_fast_pool(), err: %d\n", ret);
> +		goto free_reqs;
> +	}
> +
> +	return 0;
> +
> +free_reqs:
> +	free_sess_reqs(sess);
> +
> +	return ret;
> +}
> +
> +static void free_sess_io_bufs(struct ibtrs_clt_sess *sess)
> +{
> +	free_sess_reqs(sess);
> +	free_sess_fast_pool(sess);
> +}
> +
> +static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
> +				     enum ibtrs_clt_state new_state)
> +{
> +	enum ibtrs_clt_state old_state;
> +	bool changed = false;
> +
> +	old_state = sess->state;
> +	switch (new_state) {
> +	case IBTRS_CLT_CONNECTING:
> +		switch (old_state) {
> +		case IBTRS_CLT_RECONNECTING:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
> +	case IBTRS_CLT_RECONNECTING:
> +		switch (old_state) {
> +		case IBTRS_CLT_CONNECTED:
> +		case IBTRS_CLT_CONNECTING_ERR:
> +		case IBTRS_CLT_CLOSED:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
> +	case IBTRS_CLT_CONNECTED:
> +		switch (old_state) {
> +		case IBTRS_CLT_CONNECTING:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
> +	case IBTRS_CLT_CONNECTING_ERR:
> +		switch (old_state) {
> +		case IBTRS_CLT_CONNECTING:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
> +	case IBTRS_CLT_CLOSING:
> +		switch (old_state) {
> +		case IBTRS_CLT_CONNECTING:
> +		case IBTRS_CLT_CONNECTING_ERR:
> +		case IBTRS_CLT_RECONNECTING:
> +		case IBTRS_CLT_CONNECTED:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
> +	case IBTRS_CLT_CLOSED:
> +		switch (old_state) {
> +		case IBTRS_CLT_CLOSING:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
> +	case IBTRS_CLT_DEAD:
> +		switch (old_state) {
> +		case IBTRS_CLT_CLOSED:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
> +	default:
> +		break;
> +	}
> +	if (changed) {
> +		sess->state = new_state;
> +		wake_up_locked(&sess->state_wq);
> +	}
> +
> +	return changed;
> +}
> +
> +static bool ibtrs_clt_change_state_from_to(struct ibtrs_clt_sess *sess,
> +					   enum ibtrs_clt_state old_state,
> +					   enum ibtrs_clt_state new_state)
> +{
> +	bool changed = false;
> +
> +	spin_lock_irq(&sess->state_wq.lock);
> +	if (sess->state == old_state)
> +		changed = __ibtrs_clt_change_state(sess, new_state);
> +	spin_unlock_irq(&sess->state_wq.lock);
> +
> +	return changed;
> +}
> +
> +static bool ibtrs_clt_change_state_get_old(struct ibtrs_clt_sess *sess,
> +					   enum ibtrs_clt_state new_state,
> +					   enum ibtrs_clt_state *old_state)
> +{
> +	bool changed;
> +
> +	spin_lock_irq(&sess->state_wq.lock);
> +	*old_state = sess->state;
> +	changed = __ibtrs_clt_change_state(sess, new_state);
> +	spin_unlock_irq(&sess->state_wq.lock);
> +
> +	return changed;
> +}
> +
> +static bool ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
> +				   enum ibtrs_clt_state new_state)
> +{
> +	enum ibtrs_clt_state old_state;
> +
> +	return ibtrs_clt_change_state_get_old(sess, new_state, &old_state);
> +}
> +
> +static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
> +{
> +	enum ibtrs_clt_state state;
> +
> +	spin_lock_irq(&sess->state_wq.lock);
> +	state = sess->state;
> +	spin_unlock_irq(&sess->state_wq.lock);
> +
> +	return state;
> +}
> +
> +static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
> +{
> +	struct ibtrs_clt_con *con;
> +
> +	(void)err;
> +	con = container_of(c, typeof(*con), c);
> +	ibtrs_rdma_error_recovery(con);
> +}
> +
> +static void ibtrs_clt_init_hb(struct ibtrs_clt_sess *sess)
> +{
> +	ibtrs_init_hb(&sess->s, &io_comp_cqe,
> +		      IBTRS_HB_INTERVAL_MS,
> +		      IBTRS_HB_MISSED_MAX,
> +		      ibtrs_clt_hb_err_handler,
> +		      ibtrs_wq);
> +}
> +
> +static void ibtrs_clt_start_hb(struct ibtrs_clt_sess *sess)
> +{
> +	ibtrs_start_hb(&sess->s);
> +}
> +
> +static void ibtrs_clt_stop_hb(struct ibtrs_clt_sess *sess)
> +{
> +	ibtrs_stop_hb(&sess->s);
> +}
> +
> +static void ibtrs_clt_reconnect_work(struct work_struct *work);
> +static void ibtrs_clt_close_work(struct work_struct *work);
> +
> +static struct ibtrs_clt_sess *alloc_sess(struct ibtrs_clt *clt,
> +					 const struct ibtrs_addr *path,
> +					 size_t con_num, u16 max_segments)
> +{
> +	struct ibtrs_clt_sess *sess;
> +	int err = -ENOMEM;
> +	int cpu;
> +
> +	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
> +	if (unlikely(!sess))
> +		goto err;
> +
> +	/* Extra connection for user messages */
> +	con_num += 1;
> +
> +	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
> +	if (unlikely(!sess->s.con))
> +		goto err_free_sess;
> +
> +	mutex_init(&sess->init_mutex);
> +	uuid_gen(&sess->s.uuid);
> +	memcpy(&sess->s.dst_addr, path->dst,
> +	       rdma_addr_size((struct sockaddr *)path->dst));
> +
> +	/*
> +	 * rdma_resolve_addr() passes src_addr to cma_bind_addr, which
> +	 * checks the sa_family to be non-zero. If user passed src_addr=NULL
> +	 * the sess->src_addr will contain only zeros, which is then fine.
> +	 */
> +	if (path->src)
> +		memcpy(&sess->s.src_addr, path->src,
> +		       rdma_addr_size((struct sockaddr *)path->src));
> +	strlcpy(sess->s.sessname, clt->sessname, sizeof(sess->s.sessname));
> +	sess->s.con_num = con_num;
> +	sess->clt = clt;
> +	sess->max_pages_per_mr = max_segments;
> +	init_waitqueue_head(&sess->state_wq);
> +	sess->state = IBTRS_CLT_CONNECTING;
> +	atomic_set(&sess->connected_cnt, 0);
> +	INIT_WORK(&sess->close_work, ibtrs_clt_close_work);
> +	INIT_DELAYED_WORK(&sess->reconnect_dwork, ibtrs_clt_reconnect_work);
> +	ibtrs_clt_init_hb(sess);
> +
> +	sess->mp_skip_entry = alloc_percpu(typeof(*sess->mp_skip_entry));
> +	if (unlikely(!sess->mp_skip_entry))
> +		goto err_free_con;
> +
> +	for_each_possible_cpu(cpu)
> +		INIT_LIST_HEAD(per_cpu_ptr(sess->mp_skip_entry, cpu));
> +
> +	err = ibtrs_clt_init_stats(&sess->stats);
> +	if (unlikely(err))
> +		goto err_free_percpu;
> +
> +	return sess;
> +
> +err_free_percpu:
> +	free_percpu(sess->mp_skip_entry);
> +err_free_con:
> +	kfree(sess->s.con);
> +err_free_sess:
> +	kfree(sess);
> +err:
> +	return ERR_PTR(err);
> +}
> +
> +static void free_sess(struct ibtrs_clt_sess *sess)
> +{
> +	ibtrs_clt_free_stats(&sess->stats);
> +	free_percpu(sess->mp_skip_entry);
> +	kfree(sess->s.con);
> +	kfree(sess->srv_rdma_addr);
> +	kfree(sess);
> +}
> +
> +static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
> +{
> +	struct ibtrs_clt_con *con;
> +
> +	con = kzalloc(sizeof(*con), GFP_KERNEL);
> +	if (unlikely(!con))
> +		return -ENOMEM;
> +
> +	/* Map first two connections to the first CPU */
> +	con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
> +	con->c.cid = cid;
> +	con->c.sess = &sess->s;
> +	atomic_set(&con->io_cnt, 0);
> +
> +	sess->s.con[cid] = &con->c;
> +
> +	return 0;
> +}
> +
> +static void destroy_con(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +
> +	sess->s.con[con->c.cid] = NULL;
> +	kfree(con);
> +}
> +
> +static int create_con_cq_qp(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	u16 cq_size, wr_queue_size;
> +	int err, cq_vector;
> +
> +	/*
> +	 * This function can fail, but still destroy_con_cq_qp() should
> +	 * be called, this is because create_con_cq_qp() is called on cm
> +	 * event path, thus caller/waiter never knows: have we failed before
> +	 * create_con_cq_qp() or after.  To solve this dilemma without
> +	 * creating any additional flags just allow destroy_con_cq_qp() be
> +	 * called many times.
> +	 */
> +
> +	if (con->c.cid == 0) {
> +		cq_size = SERVICE_CON_QUEUE_DEPTH;
> +		/* + 2 for drain and heartbeat */
> +		wr_queue_size = SERVICE_CON_QUEUE_DEPTH + 2;
> +		/* We must be the first here */
> +		if (WARN_ON(sess->s.ib_dev))
> +			return -EINVAL;
> +
> +		/*
> +		 * The whole session uses device from user connection.
> +		 * Be careful not to close user connection before ib dev
> +		 * is gracefully put.
> +		 */
> +		sess->s.ib_dev = ibtrs_ib_dev_find_get(con->c.cm_id);
> +		if (unlikely(!sess->s.ib_dev)) {
> +			ibtrs_wrn(sess, "ibtrs_ib_dev_find_get(): no memory\n");
> +			return -ENOMEM;
> +		}
> +		sess->s.ib_dev_ref = 1;
> +		query_fast_reg_mode(sess);
> +	} else {
> +		int num_wr;
> +
> +		/*
> +		 * Here we assume that session members are correctly set.
> +		 * This is always true if user connection (cid == 0) is
> +		 * established first.
> +		 */
> +		if (WARN_ON(!sess->s.ib_dev))
> +			return -EINVAL;
> +		if (WARN_ON(!sess->queue_depth))
> +			return -EINVAL;
> +
> +		/* Shared between connections */
> +		sess->s.ib_dev_ref++;
> +		cq_size = sess->queue_depth;
> +		num_wr = DIV_ROUND_UP(sess->max_pages_per_mr, sess->max_sge);
> +		wr_queue_size = sess->s.ib_dev->attrs.max_qp_wr;
> +		wr_queue_size = min_t(int, wr_queue_size,
> +				      sess->queue_depth * num_wr *
> +				      (use_fr ? 3 : 2) + 1);
> +	}
> +	cq_vector = con->cpu % sess->s.ib_dev->dev->num_comp_vectors;
> +	err = ibtrs_cq_qp_create(&sess->s, &con->c, sess->max_sge,
> +				 cq_vector, cq_size, wr_queue_size,
> +				 IB_POLL_SOFTIRQ);
> +	/*
> +	 * In case of error we do not bother to clean previous allocations,
> +	 * since destroy_con_cq_qp() must be called.
> +	 */
> +
> +	if (unlikely(err))
> +		return err;
> +
> +	if (con->c.cid) {
> +		err = alloc_con_fast_pool(con);
> +		if (unlikely(err))
> +			ibtrs_cq_qp_destroy(&con->c);
> +	}
> +
> +	return err;
> +}
> +
> +static void destroy_con_cq_qp(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +
> +	/*
> +	 * Be careful here: destroy_con_cq_qp() can be called even
> +	 * create_con_cq_qp() failed, see comments there.
> +	 */
> +
> +	ibtrs_cq_qp_destroy(&con->c);
> +	if (con->c.cid != 0)
> +		free_con_fast_pool(con);
> +	if (sess->s.ib_dev_ref && !--sess->s.ib_dev_ref) {
> +		ibtrs_ib_dev_put(sess->s.ib_dev);
> +		sess->s.ib_dev = NULL;
> +	}
> +}
> +
> +static void stop_cm(struct ibtrs_clt_con *con)
> +{
> +	rdma_disconnect(con->c.cm_id);
> +	if (con->c.qp)
> +		ib_drain_qp(con->c.qp);
> +}
> +
> +static void destroy_cm(struct ibtrs_clt_con *con)
> +{
> +	rdma_destroy_id(con->c.cm_id);
> +	con->c.cm_id = NULL;
> +}
> +
> +static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
> +				     struct rdma_cm_event *ev);
> +
> +static int create_cm(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct rdma_cm_id *cm_id;
> +	int err;
> +
> +	cm_id = rdma_create_id(&init_net, ibtrs_clt_rdma_cm_handler, con,
> +			       sess->s.dst_addr.ss_family == AF_IB ?
> +			       RDMA_PS_IB : RDMA_PS_TCP, IB_QPT_RC);
> +	if (unlikely(IS_ERR(cm_id))) {
> +		err = PTR_ERR(cm_id);
> +		ibtrs_err(sess, "Failed to create CM ID, err: %d\n", err);
> +
> +		return err;
> +	}
> +	con->c.cm_id = cm_id;
> +	con->cm_err = 0;
> +	/* allow the port to be reused */
> +	err = rdma_set_reuseaddr(cm_id, 1);
> +	if (err != 0) {
> +		ibtrs_err(sess, "Set address reuse failed, err: %d\n", err);
> +		goto destroy_cm;
> +	}
> +	err = rdma_resolve_addr(cm_id, (struct sockaddr *)&sess->s.src_addr,
> +				(struct sockaddr *)&sess->s.dst_addr,
> +				IBTRS_CONNECT_TIMEOUT_MS);
> +	if (unlikely(err)) {
> +		ibtrs_err(sess, "Failed to resolve address, err: %d\n", err);
> +		goto destroy_cm;
> +	}
> +	/*
> +	 * Combine connection status and session events. This is needed
> +	 * for waiting two possible cases: cm_err has something meaningful
> +	 * or session state was really changed to error by device removal.
> +	 */
> +	err = wait_event_interruptible_timeout(sess->state_wq,
> +			con->cm_err || sess->state != IBTRS_CLT_CONNECTING,
> +			msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
> +	if (unlikely(err == 0 || err == -ERESTARTSYS)) {
> +		if (err == 0)
> +			err = -ETIMEDOUT;
> +		/* Timed out or interrupted */
> +		goto errr;
> +	}
> +	if (unlikely(con->cm_err < 0)) {
> +		err = con->cm_err;
> +		goto errr;
> +	}
> +	if (unlikely(sess->state != IBTRS_CLT_CONNECTING)) {
> +		/* Device removal */
> +		err = -ECONNABORTED;
> +		goto errr;
> +	}
> +
> +	return 0;
> +
> +errr:
> +	stop_cm(con);
> +	/* It is safe to call destroy even if cq_qp was not initialized */
> +	destroy_con_cq_qp(con);
> +destroy_cm:
> +	destroy_cm(con);
> +
> +	return err;
> +}
> +
> +static void ibtrs_clt_sess_up(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +	int up;
> +
> +	/*
> +	 * We can fire the RECONNECTED event only when all paths were
> +	 * connected on ibtrs_clt_open(), then each was disconnected
> +	 * and the first one connected again.  That's why we play this
> +	 * nasty game with the counter value.
> +	 */
> +
> +	mutex_lock(&clt->paths_ev_mutex);
> +	up = ++clt->paths_up;
> +	/*
> +	 * Here it is safe to access paths_num directly, since the up counter
> +	 * is greater than MAX_PATHS_NUM only while ibtrs_clt_open() is
> +	 * in progress, thus path removals are impossible.
> +	 */
> +	if (up > MAX_PATHS_NUM && up == MAX_PATHS_NUM + clt->paths_num)
> +		clt->paths_up = clt->paths_num;
> +	else if (up == 1)
> +		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_RECONNECTED);
> +	mutex_unlock(&clt->paths_ev_mutex);
> +
> +	/* Mark session as established */
> +	sess->established = true;
> +	sess->reconnect_attempts = 0;
> +	sess->stats.reconnects.successful_cnt++;
> +}
> +
> +static void ibtrs_clt_sess_down(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +
> +	if (!sess->established)
> +		return;
> +
> +	sess->established = false;
> +	mutex_lock(&clt->paths_ev_mutex);
> +	WARN_ON(!clt->paths_up);
> +	if (--clt->paths_up == 0)
> +		clt->link_ev(clt->priv, IBTRS_CLT_LINK_EV_DISCONNECTED);
> +	mutex_unlock(&clt->paths_ev_mutex);
> +}
> +
> +static void ibtrs_clt_stop_and_destroy_conns(struct ibtrs_clt_sess *sess,
> +					     bool failover)
> +{
> +	struct ibtrs_clt_con *con;
> +	unsigned int cid;
> +
> +	WARN_ON(sess->state == IBTRS_CLT_CONNECTED);
> +
> +	/*
> +	 * Possible race with ibtrs_clt_open(), when DEVICE_REMOVAL comes
> +	 * exactly in between.  Start destroying after it finishes.
> +	 */
> +	mutex_lock(&sess->init_mutex);
> +	mutex_unlock(&sess->init_mutex);
> +
> +	/*
> +	 * All IO paths must observe !CONNECTED state before we
> +	 * free everything.
> +	 */
> +	synchronize_rcu();
> +
> +	ibtrs_clt_stop_hb(sess);
> +
> +	/*
> +	 * The order is utterly crucial: first disconnect and complete all
> +	 * rdma requests with error (thus setting in_use=false for requests),
> +	 * then fail outstanding requests checking in_use for each, and
> +	 * eventually notify the upper layer about the session disconnection.
> +	 */
> +
> +	for (cid = 0; cid < sess->s.con_num; cid++) {
> +		con = to_clt_con(sess->s.con[cid]);
> +		if (!con)
> +			break;
> +
> +		stop_cm(con);
> +	}
> +	fail_all_outstanding_reqs(sess, failover);
> +	free_sess_io_bufs(sess);
> +	ibtrs_clt_sess_down(sess);
> +
> +	/*
> +	 * Wait for a graceful shutdown, namely when the peer side invokes
> +	 * rdma_disconnect(). 'connected_cnt' is decremented only on
> +	 * CM events, thus if the other side has crashed and hb has detected
> +	 * that something is wrong, we will be stuck here for exactly the
> +	 * timeout ms, since CM does not fire anything.  That is fine, we
> +	 * are not in a hurry.
> +	 */
> +	wait_event_timeout(sess->state_wq, !atomic_read(&sess->connected_cnt),
> +			   msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
> +
> +	for (cid = 0; cid < sess->s.con_num; cid++) {
> +		con = to_clt_con(sess->s.con[cid]);
> +		if (!con)
> +			break;
> +
> +		destroy_con_cq_qp(con);
> +		destroy_cm(con);
> +		destroy_con(con);
> +	}
> +}
> +
> +static void ibtrs_clt_remove_path_from_arr(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +	struct ibtrs_clt_sess *next;
> +	int cpu;
> +
> +	mutex_lock(&clt->paths_mutex);
> +	list_del_rcu(&sess->s.entry);
> +
> +	/* Make sure everybody observes path removal. */
> +	synchronize_rcu();
> +
> +	/*
> +	 * Decrement the paths number only after the grace period, because
> +	 * the caller of do_each_path() must first observe the list without
> +	 * the path and only then the decremented paths number.
> +	 *
> +	 * Otherwise there can be the following situation:
> +	 *    o Two paths exist and IO is coming.
> +	 *    o One path is removed:
> +	 *      CPU#0                          CPU#1
> +	 *      do_each_path():                ibtrs_clt_remove_path_from_arr():
> +	 *          path = get_next_path()
> +	 *          ^^^                            list_del_rcu(path)
> +	 *          [!CONNECTED path]              clt->paths_num--
> +	 *                                              ^^^^^^^^^
> +	 *          load clt->paths_num                 from 2 to 1
> +	 *                    ^^^^^^^^^
> +	 *                    sees 1
> +	 *
> +	 *      path is observed as !CONNECTED, but do_each_path() loop
> +	 *      ends, because expression i < clt->paths_num is false.
> +	 */
> +	clt->paths_num--;
> +
> +	next = list_next_or_null_rcu_rr(sess, &clt->paths_list, s.entry);
> +
> +	/*
> +	 * Pcpu paths can still point to the path which is going to be
> +	 * removed, so change the pointer manually.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		struct ibtrs_clt_sess **ppcpu_path;
> +
> +		ppcpu_path = per_cpu_ptr(clt->pcpu_path, cpu);
> +		if (*ppcpu_path != sess)
> +			/*
> +			 * synchronize_rcu() was called just after deleting
> +			 * the entry from the list, thus the IO code path cannot
> +			 * change the pointer back to the one which is going
> +			 * to be removed, so we are safe here.
> +			 */
> +			continue;
> +
> +		/*
> +		 * We race with the IO code path, which also changes the
> +		 * pointer, thus we have to be careful not to overwrite it.
> +		 */
> +		cmpxchg(ppcpu_path, sess, next);
> +	}
> +	mutex_unlock(&clt->paths_mutex);
> +}
> +
> +static inline bool __ibtrs_clt_path_exists(struct ibtrs_clt *clt,
> +					   struct ibtrs_addr *addr)
> +{
> +	struct ibtrs_clt_sess *sess;
> +
> +	list_for_each_entry(sess, &clt->paths_list, s.entry)
> +		if (!sockaddr_cmp((struct sockaddr *)&sess->s.dst_addr,
> +				  addr->dst))
> +			return true;
> +
> +	return false;
> +}
> +
> +static bool ibtrs_clt_path_exists(struct ibtrs_clt *clt,
> +				  struct ibtrs_addr *addr)
> +{
> +	bool res;
> +
> +	mutex_lock(&clt->paths_mutex);
> +	res = __ibtrs_clt_path_exists(clt, addr);
> +	mutex_unlock(&clt->paths_mutex);
> +
> +	return res;
> +}
> +
> +static int ibtrs_clt_add_path_to_arr(struct ibtrs_clt_sess *sess,
> +				     struct ibtrs_addr *addr)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +	int err = 0;
> +
> +	mutex_lock(&clt->paths_mutex);
> +	if (!__ibtrs_clt_path_exists(clt, addr)) {
> +		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
> +		clt->paths_num++;
> +	} else {
> +		err = -EEXIST;
> +	}
> +	mutex_unlock(&clt->paths_mutex);
> +
> +	return err;
> +}
> +
> +static void ibtrs_clt_close_work(struct work_struct *work)
> +{
> +	struct ibtrs_clt_sess *sess;
> +	/*
> +	 * Always try to do a failover; if only a single path remains,
> +	 * all requests will be completed with error.
> +	 */
> +	bool failover = true;
> +
> +	sess = container_of(work, struct ibtrs_clt_sess, close_work);
> +
> +	cancel_delayed_work_sync(&sess->reconnect_dwork);
> +	ibtrs_clt_stop_and_destroy_conns(sess, failover);
> +	/*
> +	 * Sounds stupid, huh?  No, it is not.  Consider this sequence:
> +	 *
> +	 *   #CPU0                              #CPU1
> +	 *   1.  CONNECTED->RECONNECTING
> +	 *   2.                                 RECONNECTING->CLOSING
> +	 *   3.  queue_work(&reconnect_dwork)
> +	 *   4.                                 queue_work(&close_work);
> +	 *   5.  reconnect_work();              close_work();
> +	 *
> +	 * To avoid that case, cancel twice: before and after.
> +	 */
> +	cancel_delayed_work_sync(&sess->reconnect_dwork);
> +	ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSED);
> +}
> +
> +static void ibtrs_clt_close_conns(struct ibtrs_clt_sess *sess, bool wait)
> +{
> +	if (ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSING))
> +		queue_work(ibtrs_wq, &sess->close_work);
> +	if (wait)
> +		flush_work(&sess->close_work);
> +}
> +
> +static int init_conns(struct ibtrs_clt_sess *sess)
> +{
> +	unsigned int cid;
> +	int err;
> +
> +	/*
> +	 * On every new set of session connections increase the reconnect
> +	 * counter to avoid clashes with previous sessions that are not
> +	 * yet closed on the server side.
> +	 */
> +	sess->s.recon_cnt++;
> +
> +	/* Establish all RDMA connections  */
> +	for (cid = 0; cid < sess->s.con_num; cid++) {
> +		err = create_con(sess, cid);
> +		if (unlikely(err))
> +			goto destroy;
> +
> +		err = create_cm(to_clt_con(sess->s.con[cid]));
> +		if (unlikely(err)) {
> +			destroy_con(to_clt_con(sess->s.con[cid]));
> +			goto destroy;
> +		}
> +	}
> +	/* Allocate all session related buffers */
> +	err = alloc_sess_io_bufs(sess);
> +	if (unlikely(err))
> +		goto destroy;
> +
> +	ibtrs_clt_start_hb(sess);
> +
> +	return 0;
> +
> +destroy:
> +	while (cid--) {
> +		struct ibtrs_clt_con *con = to_clt_con(sess->s.con[cid]);
> +
> +		stop_cm(con);
> +		destroy_con_cq_qp(con);
> +		destroy_cm(con);
> +		destroy_con(con);
> +	}
> +	/*
> +	 * If we have never taken the async path and got an error, say
> +	 * in rdma_resolve_addr(), switch to the CONNECTING_ERR state
> +	 * manually to keep reconnecting.
> +	 */
> +	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
> +
> +	return err;
> +}
> +
> +static int ibtrs_rdma_addr_resolved(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	int err;
> +
> +	err = create_con_cq_qp(con);
> +	if (unlikely(err)) {
> +		ibtrs_err(sess, "create_con_cq_qp(), err: %d\n", err);
> +		return err;
> +	}
> +	err = rdma_resolve_route(con->c.cm_id, IBTRS_CONNECT_TIMEOUT_MS);
> +	if (unlikely(err)) {
> +		ibtrs_err(sess, "Resolving route failed, err: %d\n", err);
> +		destroy_con_cq_qp(con);
> +	}
> +
> +	return err;
> +}
> +
> +static int ibtrs_rdma_route_resolved(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_clt *clt = sess->clt;
> +	struct ibtrs_msg_conn_req msg;
> +	struct rdma_conn_param param;
> +
> +	int err;
> +
> +	memset(&param, 0, sizeof(param));
> +	param.retry_count = retry_count;
> +	param.rnr_retry_count = 7;
> +	param.private_data = &msg;
> +	param.private_data_len = sizeof(msg);
> +
> +	/*
> +	 * Those two are part of struct cma_hdr, which is shared with
> +	 * private_data in the AF_IB case, so put zeroes there to avoid
> +	 * wrong validation inside cma.c on the receiver side.
> +	 */
> +	msg.__cma_version = 0;
> +	msg.__ip_version = 0;
> +	msg.magic = cpu_to_le16(IBTRS_MAGIC);
> +	msg.version = cpu_to_le16(IBTRS_VERSION);
> +	msg.cid = cpu_to_le16(con->c.cid);
> +	msg.cid_num = cpu_to_le16(sess->s.con_num);
> +	msg.recon_cnt = cpu_to_le16(sess->s.recon_cnt);
> +	uuid_copy(&msg.sess_uuid, &sess->s.uuid);
> +	uuid_copy(&msg.paths_uuid, &clt->paths_uuid);
> +
> +	err = rdma_connect(con->c.cm_id, &param);
> +	if (err)
> +		ibtrs_err(sess, "rdma_connect(): %d\n", err);
> +
> +	return err;
> +}
> +
> +static int ibtrs_rdma_conn_established(struct ibtrs_clt_con *con,
> +				       struct rdma_cm_event *ev)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	const struct ibtrs_msg_conn_rsp *msg;
> +	u16 version, queue_depth;
> +	int errno;
> +	u8 len;
> +
> +	msg = ev->param.conn.private_data;
> +	len = ev->param.conn.private_data_len;
> +	if (unlikely(len < sizeof(*msg))) {
> +		ibtrs_err(sess, "Invalid IBTRS connection response");
> +		return -ECONNRESET;
> +	}
> +	if (unlikely(le16_to_cpu(msg->magic) != IBTRS_MAGIC)) {
> +		ibtrs_err(sess, "Invalid IBTRS magic");
> +		return -ECONNRESET;
> +	}
> +	version = le16_to_cpu(msg->version);
> +	if (unlikely(version >> 8 != IBTRS_VER_MAJOR)) {
> +		ibtrs_err(sess, "Unsupported major IBTRS version: %d",
> +			  version);
> +		return -ECONNRESET;
> +	}
> +	errno = le16_to_cpu(msg->errno);
> +	if (unlikely(errno)) {
> +		ibtrs_err(sess, "Invalid IBTRS message: errno %d",
> +			  errno);
> +		return -ECONNRESET;
> +	}
> +	if (con->c.cid == 0) {
> +		queue_depth = le16_to_cpu(msg->queue_depth);
> +
> +		if (queue_depth > MAX_SESS_QUEUE_DEPTH) {
> +			ibtrs_err(sess, "Invalid IBTRS message: queue=%d\n",
> +				  queue_depth);
> +			return -ECONNRESET;
> +		}
> +		if (!sess->srv_rdma_addr || sess->queue_depth < queue_depth) {
> +			kfree(sess->srv_rdma_addr);
> +			sess->srv_rdma_addr =
> +				kcalloc(queue_depth,
> +					sizeof(*sess->srv_rdma_addr),
> +					GFP_KERNEL);
> +			if (unlikely(!sess->srv_rdma_addr)) {
> +				ibtrs_err(sess, "Failed to allocate "
> +					  "queue_depth=%d\n", queue_depth);
> +				return -ENOMEM;
> +			}
> +		}
> +		sess->queue_depth = queue_depth;
> +		sess->srv_rdma_buf_rkey = le32_to_cpu(msg->rkey);
> +		sess->max_req_size = le32_to_cpu(msg->max_req_size);
> +		sess->max_io_size = le32_to_cpu(msg->max_io_size);
> +		sess->chunk_size = sess->max_io_size + sess->max_req_size;
> +		sess->max_desc  = sess->max_req_size;
> +		sess->max_desc -= sizeof(u32) + sizeof(u32) + IO_MSG_SIZE;
> +		sess->max_desc /= sizeof(struct ibtrs_sg_desc);
> +
> +		/*
> +		 * The global queue depth is always a minimum.  If during a
> +		 * reconnection the server sends us a slightly higher value,
> +		 * the client does not care and keeps using the cached minimum.
> +		 */
> +		ibtrs_clt_set_min_queue_depth(sess->clt, sess->queue_depth);
> +		ibtrs_clt_set_min_io_size(sess->clt, sess->max_io_size);
> +	}
> +
> +	return 0;
> +}
> +
> +static int ibtrs_rdma_conn_rejected(struct ibtrs_clt_con *con,
> +				    struct rdma_cm_event *ev)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	const struct ibtrs_msg_conn_rsp *msg;
> +	const char *rej_msg;
> +	int status, errno;
> +	u8 data_len;
> +
> +	status = ev->status;
> +	rej_msg = rdma_reject_msg(con->c.cm_id, status);
> +	msg = rdma_consumer_reject_data(con->c.cm_id, ev, &data_len);
> +
> +	if (msg && data_len >= sizeof(*msg)) {
> +		errno = (int16_t)le16_to_cpu(msg->errno);
> +		if (errno == -EBUSY)
> +			ibtrs_err(sess,
> +				  "Previous session still exists on the "
> +				  "server, please reconnect later\n");
> +		else
> +			ibtrs_err(sess,
> +				  "Connect rejected: status %d (%s), ibtrs "
> +				  "errno %d\n", status, rej_msg, errno);
> +	} else {
> +		ibtrs_err(sess,
> +			  "Connect rejected but with malformed message: "
> +			  "status %d (%s)\n", status, rej_msg);
> +	}
> +
> +	return -ECONNRESET;
> +}
> +
> +static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +
> +	if (ibtrs_clt_change_state_from_to(sess,
> +					   IBTRS_CLT_CONNECTED,
> +					   IBTRS_CLT_RECONNECTING)) {
> +		/*
> +		 * Normal scenario, reconnect if we were successfully connected
> +		 */
> +		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
> +	} else {
> +		/*
> +		 * An error can happen only while establishing a new connection,
> +		 * so notify the waiter with the error state; the waiter is
> +		 * responsible for cleaning up the rest and reconnecting if needed.
> +		 */
> +		ibtrs_clt_change_state_from_to(sess,
> +					       IBTRS_CLT_CONNECTING,
> +					       IBTRS_CLT_CONNECTING_ERR);
> +	}
> +}
> +
> +static inline void flag_success_on_conn(struct ibtrs_clt_con *con)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +
> +	atomic_inc(&sess->connected_cnt);
> +	con->cm_err = 1;
> +}
> +
> +static inline void flag_error_on_conn(struct ibtrs_clt_con *con, int cm_err)
> +{
> +	if (con->cm_err == 1) {
> +		struct ibtrs_clt_sess *sess;
> +
> +		sess = to_clt_sess(con->c.sess);
> +		if (atomic_dec_and_test(&sess->connected_cnt))
> +			wake_up(&sess->state_wq);
> +	}
> +	con->cm_err = cm_err;
> +}
> +
> +static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
> +				     struct rdma_cm_event *ev)
> +{
> +	struct ibtrs_clt_con *con = cm_id->context;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	int cm_err = 0;
> +
> +	switch (ev->event) {
> +	case RDMA_CM_EVENT_ADDR_RESOLVED:
> +		cm_err = ibtrs_rdma_addr_resolved(con);
> +		break;
> +	case RDMA_CM_EVENT_ROUTE_RESOLVED:
> +		cm_err = ibtrs_rdma_route_resolved(con);
> +		break;
> +	case RDMA_CM_EVENT_ESTABLISHED:
> +		con->cm_err = ibtrs_rdma_conn_established(con, ev);
> +		if (likely(!con->cm_err)) {
> +			/*
> +			 * Report success and wake up. Here we abuse state_wq,
> +			 * i.e. wake up without state change, but we set cm_err.
> +			 */
> +			flag_success_on_conn(con);
> +			wake_up(&sess->state_wq);
> +			return 0;
> +		}
> +		break;
> +	case RDMA_CM_EVENT_REJECTED:
> +		cm_err = ibtrs_rdma_conn_rejected(con, ev);
> +		break;
> +	case RDMA_CM_EVENT_CONNECT_ERROR:
> +	case RDMA_CM_EVENT_UNREACHABLE:
> +		ibtrs_wrn(sess, "CM error event %d\n", ev->event);
> +		cm_err = -ECONNRESET;
> +		break;
> +	case RDMA_CM_EVENT_ADDR_ERROR:
> +	case RDMA_CM_EVENT_ROUTE_ERROR:
> +		cm_err = -EHOSTUNREACH;
> +		break;
> +	case RDMA_CM_EVENT_DISCONNECTED:
> +	case RDMA_CM_EVENT_ADDR_CHANGE:
> +	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> +		cm_err = -ECONNRESET;
> +		break;
> +	case RDMA_CM_EVENT_DEVICE_REMOVAL:
> +		/*
> +		 * Device removal is a special case.  Queue close and return 0.
> +		 */
> +		ibtrs_clt_close_conns(sess, false);
> +		return 0;
> +	default:
> +		ibtrs_err(sess, "Unexpected RDMA CM event (%d)\n", ev->event);
> +		cm_err = -ECONNRESET;
> +		break;
> +	}
> +
> +	if (cm_err) {
> +		/*
> +		 * A CM error makes sense only while a connection is being
> +		 * established; in other cases we rely on the normal reconnect
> +		 * procedure.
> +		 */
> +		flag_error_on_conn(con, cm_err);
> +		ibtrs_rdma_error_recovery(con);
> +	}
> +
> +	return 0;
> +}
> +
> +static void ibtrs_clt_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ibtrs_clt_con *con = cq->cq_context;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_iu *iu;
> +
> +	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
> +	ibtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.ib_dev->dev);
> +
> +	if (unlikely(wc->status != IB_WC_SUCCESS)) {
> +		ibtrs_err(sess, "Sess info request send failed: %s\n",
> +			  ib_wc_status_msg(wc->status));
> +		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
> +		return;
> +	}
> +
> +	ibtrs_clt_update_wc_stats(con);
> +}
> +
> +static int process_info_rsp(struct ibtrs_clt_sess *sess,
> +			    const struct ibtrs_msg_info_rsp *msg)
> +{
> +	unsigned int addr_num;
> +	int i;
> +
> +	addr_num = le16_to_cpu(msg->addr_num);
> +	/*
> +	 * Check if IB immediate data size is enough to hold the mem_id and
> +	 * the offset inside the memory chunk.
> +	 */
> +	if (unlikely(ilog2(addr_num - 1) + ilog2(sess->chunk_size - 1) >
> +		     MAX_IMM_PAYL_BITS)) {
> +		ibtrs_err(sess, "RDMA immediate size (%d bits) not enough to "
> +			  "encode %d buffers of size %dB\n", MAX_IMM_PAYL_BITS,
> +			  addr_num, sess->chunk_size);
> +		return -EINVAL;
> +	}
> +	if (unlikely(addr_num > sess->queue_depth)) {
> +		ibtrs_err(sess, "Incorrect addr_num=%d\n", addr_num);
> +		return -EINVAL;
> +	}
> +	for (i = 0; i < addr_num; i++)
> +		sess->srv_rdma_addr[i] = le64_to_cpu(msg->addr[i]);
> +
> +	return 0;
> +}
> +
> +static void ibtrs_clt_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
> +{
> +	struct ibtrs_clt_con *con = cq->cq_context;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_msg_info_rsp *msg;
> +	enum ibtrs_clt_state state;
> +	struct ibtrs_iu *iu;
> +	size_t rx_sz;
> +	int err;
> +
> +	state = IBTRS_CLT_CONNECTING_ERR;
> +
> +	WARN_ON(con->c.cid);
> +	iu = container_of(wc->wr_cqe, struct ibtrs_iu, cqe);
> +	if (unlikely(wc->status != IB_WC_SUCCESS)) {
> +		ibtrs_err(sess, "Sess info response recv failed: %s\n",
> +			  ib_wc_status_msg(wc->status));
> +		goto out;
> +	}
> +	WARN_ON(wc->opcode != IB_WC_RECV);
> +
> +	if (unlikely(wc->byte_len < sizeof(*msg))) {
> +		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
> +			  wc->byte_len);
> +		goto out;
> +	}
> +	msg = iu->buf;
> +	if (unlikely(le16_to_cpu(msg->type) != IBTRS_MSG_INFO_RSP)) {
> +		ibtrs_err(sess, "Sess info response is malformed: type %d\n",
> +			  le16_to_cpu(msg->type));
> +		goto out;
> +	}
> +	rx_sz  = sizeof(*msg);
> +	rx_sz += sizeof(msg->addr[0]) * le16_to_cpu(msg->addr_num);
> +	if (unlikely(wc->byte_len < rx_sz)) {
> +		ibtrs_err(sess, "Sess info response is malformed: size %d\n",
> +			  wc->byte_len);
> +		goto out;
> +	}
> +	err = process_info_rsp(sess, msg);
> +	if (unlikely(err))
> +		goto out;
> +
> +	err = post_recv_sess(sess);
> +	if (unlikely(err))
> +		goto out;
> +
> +	state = IBTRS_CLT_CONNECTED;
> +
> +out:
> +	ibtrs_clt_update_wc_stats(con);
> +	ibtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.ib_dev->dev);
> +	ibtrs_clt_change_state(sess, state);
> +}
> +
> +static int ibtrs_send_sess_info(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt_con *usr_con = to_clt_con(sess->s.con[0]);
> +	struct ibtrs_msg_info_req *msg;
> +	struct ibtrs_iu *tx_iu, *rx_iu;
> +	size_t rx_sz;
> +	int err;
> +
> +	rx_sz  = sizeof(struct ibtrs_msg_info_rsp);
> +	rx_sz += sizeof(u64) * MAX_SESS_QUEUE_DEPTH;
> +
> +	tx_iu = ibtrs_iu_alloc(0, sizeof(struct ibtrs_msg_info_req), GFP_KERNEL,
> +			       sess->s.ib_dev->dev, DMA_TO_DEVICE,
> +			       ibtrs_clt_info_req_done);
> +	rx_iu = ibtrs_iu_alloc(0, rx_sz, GFP_KERNEL, sess->s.ib_dev->dev,
> +			       DMA_FROM_DEVICE, ibtrs_clt_info_rsp_done);
> +	if (unlikely(!tx_iu || !rx_iu)) {
> +		ibtrs_err(sess, "ibtrs_iu_alloc(): no memory\n");
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +	/* Prepare for getting info response */
> +	err = ibtrs_iu_post_recv(&usr_con->c, rx_iu);
> +	if (unlikely(err)) {
> +		ibtrs_err(sess, "ibtrs_iu_post_recv(), err: %d\n", err);
> +		goto out;
> +	}
> +	rx_iu = NULL;
> +
> +	msg = tx_iu->buf;
> +	msg->type = cpu_to_le16(IBTRS_MSG_INFO_REQ);
> +	memcpy(msg->sessname, sess->s.sessname, sizeof(msg->sessname));
> +
> +	/* Send info request */
> +	err = ibtrs_iu_post_send(&usr_con->c, tx_iu, sizeof(*msg));
> +	if (unlikely(err)) {
> +		ibtrs_err(sess, "ibtrs_iu_post_send(), err: %d\n", err);
> +		goto out;
> +	}
> +	tx_iu = NULL;
> +
> +	/* Wait for state change */
> +	wait_event_interruptible_timeout(sess->state_wq,
> +				sess->state != IBTRS_CLT_CONNECTING,
> +				msecs_to_jiffies(IBTRS_CONNECT_TIMEOUT_MS));
> +	if (unlikely(sess->state != IBTRS_CLT_CONNECTED)) {
> +		if (sess->state == IBTRS_CLT_CONNECTING_ERR)
> +			err = -ECONNRESET;
> +		else
> +			err = -ETIMEDOUT;
> +		goto out;
> +	}
> +
> +out:
> +	if (tx_iu)
> +		ibtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.ib_dev->dev);
> +	if (rx_iu)
> +		ibtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.ib_dev->dev);
> +	if (unlikely(err))
> +		/* If we have never taken the async path, e.g. due to an allocation failure */
> +		ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING_ERR);
> +
> +	return err;
> +}
> +
> +/**
> + * init_sess() - establishes all session connections and does handshake
> + *
> + * In case of error the full close or reconnect procedure should be taken,
> + * because asynchronous reconnect or close works may already be started.
> + */
> +static int init_sess(struct ibtrs_clt_sess *sess)
> +{
> +	int err;
> +
> +	mutex_lock(&sess->init_mutex);
> +	err = init_conns(sess);
> +	if (unlikely(err)) {
> +		ibtrs_err(sess, "init_conns(), err: %d\n", err);
> +		goto out;
> +	}
> +	err = ibtrs_send_sess_info(sess);
> +	if (unlikely(err)) {
> +		ibtrs_err(sess, "ibtrs_send_sess_info(), err: %d\n", err);
> +		goto out;
> +	}
> +	ibtrs_clt_sess_up(sess);
> +out:
> +	mutex_unlock(&sess->init_mutex);
> +
> +	return err;
> +}
> +
> +static void ibtrs_clt_reconnect_work(struct work_struct *work)
> +{
> +	struct ibtrs_clt_sess *sess;
> +	struct ibtrs_clt *clt;
> +	unsigned int delay_ms;
> +	int err;
> +
> +	sess = container_of(to_delayed_work(work), struct ibtrs_clt_sess,
> +			    reconnect_dwork);
> +	clt = sess->clt;
> +
> +	if (ibtrs_clt_state(sess) == IBTRS_CLT_CLOSING)
> +		/* User requested closing */
> +		return;
> +
> +	if (sess->reconnect_attempts >= clt->max_reconnect_attempts) {
> +		/* Close a session completely if max attempts is reached */
> +		ibtrs_clt_close_conns(sess, false);
> +		return;
> +	}
> +	sess->reconnect_attempts++;
> +
> +	/* Stop everything */
> +	ibtrs_clt_stop_and_destroy_conns(sess, true);
> +	ibtrs_clt_change_state(sess, IBTRS_CLT_CONNECTING);
> +
> +	err = init_sess(sess);
> +	if (unlikely(err))
> +		goto reconnect_again;
> +
> +	return;
> +
> +reconnect_again:
> +	if (ibtrs_clt_change_state(sess, IBTRS_CLT_RECONNECTING)) {
> +		sess->stats.reconnects.fail_cnt++;
> +		delay_ms = clt->reconnect_delay_sec * 1000;
> +		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork,
> +				   msecs_to_jiffies(delay_ms));
> +	}
> +}
> +
> +static struct ibtrs_clt *alloc_clt(const char *sessname, size_t paths_num,
> +				   short port, size_t pdu_sz,
> +				   void *priv, link_clt_ev_fn *link_ev,
> +				   unsigned int max_segments,
> +				   unsigned int reconnect_delay_sec,
> +				   unsigned int max_reconnect_attempts)
> +{
> +	struct ibtrs_clt *clt;
> +	int err;
> +
> +	if (unlikely(!paths_num || paths_num > MAX_PATHS_NUM))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (unlikely(strlen(sessname) >= sizeof(clt->sessname)))
> +		return ERR_PTR(-EINVAL);
> +
> +	clt = kzalloc(sizeof(*clt), GFP_KERNEL);
> +	if (unlikely(!clt))
> +		return ERR_PTR(-ENOMEM);
> +
> +	clt->pcpu_path = alloc_percpu(typeof(*clt->pcpu_path));
> +	if (unlikely(!clt->pcpu_path)) {
> +		kfree(clt);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	uuid_gen(&clt->paths_uuid);
> +	INIT_LIST_HEAD_RCU(&clt->paths_list);
> +	clt->paths_num = paths_num;
> +	clt->paths_up = MAX_PATHS_NUM;
> +	clt->port = port;
> +	clt->pdu_sz = pdu_sz;
> +	clt->max_segments = max_segments;
> +	clt->reconnect_delay_sec = reconnect_delay_sec;
> +	clt->max_reconnect_attempts = max_reconnect_attempts;
> +	clt->priv = priv;
> +	clt->link_ev = link_ev;
> +	clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
> +	strlcpy(clt->sessname, sessname, sizeof(clt->sessname));
> +	init_waitqueue_head(&clt->tags_wait);
> +	mutex_init(&clt->paths_ev_mutex);
> +	mutex_init(&clt->paths_mutex);
> +
> +	err = ibtrs_clt_create_sysfs_root_folders(clt);
> +	if (unlikely(err)) {
> +		free_percpu(clt->pcpu_path);
> +		kfree(clt);
> +		return ERR_PTR(err);
> +	}
> +
> +	return clt;
> +}
> +
> +static void wait_for_inflight_tags(struct ibtrs_clt *clt)
> +{
> +	if (clt->tags_map) {
> +		size_t sz = clt->queue_depth;
> +
> +		wait_event(clt->tags_wait,
> +			   find_first_bit(clt->tags_map, sz) >= sz);
> +	}
> +}
> +
> +static void free_clt(struct ibtrs_clt *clt)
> +{
> +	ibtrs_clt_destroy_sysfs_root_folders(clt);
> +	wait_for_inflight_tags(clt);
> +	free_tags(clt);
> +	free_percpu(clt->pcpu_path);
> +	kfree(clt);
> +}
> +
> +struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
> +				 const char *sessname,
> +				 const struct ibtrs_addr *paths,
> +				 size_t paths_num,
> +				 short port,
> +				 size_t pdu_sz, u8 reconnect_delay_sec,
> +				 u16 max_segments,
> +				 s16 max_reconnect_attempts)
> +{
> +	struct ibtrs_clt_sess *sess, *tmp;
> +	struct ibtrs_clt *clt;
> +	int err, i;
> +
> +	clt = alloc_clt(sessname, paths_num, port, pdu_sz, priv, link_ev,
> +			max_segments, reconnect_delay_sec,
> +			max_reconnect_attempts);
> +	if (unlikely(IS_ERR(clt))) {
> +		err = PTR_ERR(clt);
> +		goto out;
> +	}
> +	for (i = 0; i < paths_num; i++) {
> +		struct ibtrs_clt_sess *sess;
> +
> +		sess = alloc_sess(clt, &paths[i], nr_cons_per_session,
> +				  max_segments);
> +		if (unlikely(IS_ERR(sess))) {
> +			err = PTR_ERR(sess);
> +			ibtrs_err(clt, "alloc_sess(), err: %d\n", err);
> +			goto close_all_sess;
> +		}
> +		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
> +
> +		err = init_sess(sess);
> +		if (unlikely(err))
> +			goto close_all_sess;
> +
> +		err = ibtrs_clt_create_sess_files(sess);
> +		if (unlikely(err))
> +			goto close_all_sess;
> +	}
> +	err = alloc_tags(clt);
> +	if (unlikely(err)) {
> +		ibtrs_err(clt, "alloc_tags(), err: %d\n", err);
> +		goto close_all_sess;
> +	}
> +	err = ibtrs_clt_create_sysfs_root_files(clt);
> +	if (unlikely(err))
> +		goto close_all_sess;
> +
> +	/*
> +	 * There is a race if someone decides to completely remove a just
> +	 * created path using the sysfs entry.  To avoid the race we use a
> +	 * simple 'opened' flag, see ibtrs_clt_remove_path_from_sysfs().
> +	 */
> +	clt->opened = true;
> +
> +	/* Do not let module be unloaded if client is alive */
> +	__module_get(THIS_MODULE);
> +
> +	return clt;
> +
> +close_all_sess:
> +	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
> +		ibtrs_clt_destroy_sess_files(sess, NULL);
> +		ibtrs_clt_close_conns(sess, true);
> +		free_sess(sess);
> +	}
> +	free_clt(clt);
> +
> +out:
> +	return ERR_PTR(err);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_open);
> +
> +void ibtrs_clt_close(struct ibtrs_clt *clt)
> +{
> +	struct ibtrs_clt_sess *sess, *tmp;
> +
> +	/* Firstly forbid sysfs access */
> +	ibtrs_clt_destroy_sysfs_root_files(clt);
> +	ibtrs_clt_destroy_sysfs_root_folders(clt);
> +
> +	/* Now it is safe to iterate over all paths without locks */
> +	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
> +		ibtrs_clt_destroy_sess_files(sess, NULL);
> +		ibtrs_clt_close_conns(sess, true);
> +		free_sess(sess);
> +	}
> +	free_clt(clt);
> +	module_put(THIS_MODULE);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_close);
> +
> +int ibtrs_clt_reconnect_from_sysfs(struct ibtrs_clt_sess *sess)
> +{
> +	enum ibtrs_clt_state old_state;
> +	int err = -EBUSY;
> +	bool changed;
> +
> +	changed = ibtrs_clt_change_state_get_old(sess, IBTRS_CLT_RECONNECTING,
> +						 &old_state);
> +	if (changed) {
> +		sess->reconnect_attempts = 0;
> +		queue_delayed_work(ibtrs_wq, &sess->reconnect_dwork, 0);
> +	}
> +	if (changed || old_state == IBTRS_CLT_RECONNECTING) {
> +		/*
> +		 * flush_delayed_work() queues pending work for immediate
> +		 * execution, so do the flush if we have queued something
> +		 * right now or work is pending.
> +		 */
> +		flush_delayed_work(&sess->reconnect_dwork);
> +		err = ibtrs_clt_sess_is_connected(sess) ? 0 : -ENOTCONN;
> +	}
> +
> +	return err;
> +}
> +
> +int ibtrs_clt_disconnect_from_sysfs(struct ibtrs_clt_sess *sess)
> +{
> +	ibtrs_clt_close_conns(sess, true);
> +
> +	return 0;
> +}
> +
> +int ibtrs_clt_remove_path_from_sysfs(struct ibtrs_clt_sess *sess,
> +				     const struct attribute *sysfs_self)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +	enum ibtrs_clt_state old_state;
> +	bool changed;
> +
> +	/*
> +	 * This can happen only when userspace tries to remove a path
> +	 * very early, before ibtrs_clt_open() has finished.
> +	 */
> +	if (unlikely(!clt->opened))
> +		return -EBUSY;
> +
> +	/*
> +	 * Continue stopping the path until its state is changed to DEAD or
> +	 * observed as DEAD:
> +	 * 1. State was changed to DEAD - we were fast and nobody had
> +	 *    invoked ibtrs_clt_reconnect(), which could start the
> +	 *    reconnecting again.
> +	 * 2. State was observed as DEAD - someone else is removing the
> +	 *    path in parallel.
> +	 */
> +	do {
> +		ibtrs_clt_close_conns(sess, true);
> +	} while (!(changed = ibtrs_clt_change_state_get_old(sess,
> +							    IBTRS_CLT_DEAD,
> +							    &old_state)) &&
> +		   old_state != IBTRS_CLT_DEAD);
> +
> +	/*
> +	 * If state was successfully changed to DEAD, commit suicide.
> +	 */
> +	if (likely(changed)) {
> +		ibtrs_clt_destroy_sess_files(sess, sysfs_self);
> +		ibtrs_clt_remove_path_from_arr(sess);
> +		free_sess(sess);
> +	}
> +
> +	return 0;
> +}
> +
> +void ibtrs_clt_set_max_reconnect_attempts(struct ibtrs_clt *clt, int value)
> +{
> +	clt->max_reconnect_attempts = (unsigned int)value;
> +}
> +
> +int ibtrs_clt_get_max_reconnect_attempts(const struct ibtrs_clt *clt)
> +{
> +	return (int)clt->max_reconnect_attempts;
> +}
> +
> +static int ibtrs_clt_rdma_write_desc(struct ibtrs_clt_con *con,
> +				     struct ibtrs_clt_io_req *req, u64 buf,
> +				     size_t u_msg_len, u32 imm,
> +				     struct ibtrs_msg_rdma_write *msg)
> +{
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_sg_desc *desc;
> +	int ret;
> +
> +	desc = kmalloc_array(sess->max_pages_per_mr, sizeof(*desc), GFP_ATOMIC);
> +	if (unlikely(!desc))
> +		return -ENOMEM;
> +
> +	ret = ibtrs_fast_reg_map_data(con, desc, req);
> +	if (unlikely(ret < 0)) {
> +		ibtrs_err_rl(sess,
> +			     "Write request failed, fast reg. data mapping"
> +			     " failed, err: %d\n", ret);
> +		kfree(desc);
> +		return ret;
> +	}
> +	ret = ibtrs_post_send_rdma_desc(con, req, desc, ret, buf,
> +					u_msg_len + sizeof(*msg), imm);
> +	if (unlikely(ret)) {
> +		ibtrs_err(sess, "Write request failed, posting work"
> +			  " request failed, err: %d\n", ret);
> +		ibtrs_unmap_fast_reg_data(con, req);
> +	}
> +	kfree(desc);
> +	return ret;
> +}
> +
> +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req)
> +{
> +	struct ibtrs_clt_con *con = req->con;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_msg_rdma_write *msg;
> +
> +	int ret, count = 0;
> +	u32 imm, buf_id;
> +	u64 buf;
> +
> +	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
> +
> +	if (unlikely(tsize > sess->chunk_size)) {
> +		ibtrs_wrn(sess, "Write request failed, size too big %zu > %d\n",
> +			  tsize, sess->chunk_size);
> +		return -EMSGSIZE;
> +	}
> +	if (req->sg_cnt) {
> +		count = ib_dma_map_sg(sess->s.ib_dev->dev, req->sglist,
> +				      req->sg_cnt, req->dir);
> +		if (unlikely(!count)) {
> +			ibtrs_wrn(sess, "Write request failed, map failed\n");
> +			return -EINVAL;
> +		}
> +	}
> +	/* put ibtrs msg after sg and user message */
> +	msg = req->iu->buf + req->usr_len;
> +	msg->type = cpu_to_le16(IBTRS_MSG_WRITE);
> +	msg->usr_len = cpu_to_le16(req->usr_len);
> +
> +	/* ibtrs message on server side will be after user data and message */
> +	imm = req->tag->mem_off + req->data_len + req->usr_len;
> +	imm = ibtrs_to_io_req_imm(imm);
> +	buf_id = req->tag->mem_id;
> +	req->sg_size = tsize;
> +	buf = sess->srv_rdma_addr[buf_id];
> +
> +	/*
> +	 * Update stats now; after the request is successfully sent it is
> +	 * no longer safe to touch it.
> +	 */
> +	ibtrs_clt_update_all_stats(req, WRITE);
> +
> +	if (count > fmr_sg_cnt)
> +		ret = ibtrs_clt_rdma_write_desc(req->con, req, buf,
> +						req->usr_len, imm, msg);
> +	else
> +		ret = ibtrs_post_send_rdma_more(req->con, req, buf,
> +						req->usr_len + sizeof(*msg),
> +						imm);
> +	if (unlikely(ret)) {
> +		ibtrs_err(sess, "Write request failed: %d\n", ret);
> +		ibtrs_clt_decrease_inflight(&sess->stats);
> +		if (req->sg_cnt)
> +			ib_dma_unmap_sg(sess->s.ib_dev->dev, req->sglist,
> +					req->sg_cnt, req->dir);
> +	}
> +
> +	return ret;
> +}
> +
> +int ibtrs_clt_write(struct ibtrs_clt *clt, ibtrs_conf_fn *conf,
> +		    struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
> +		    size_t nr, size_t data_len, struct scatterlist *sg,
> +		    unsigned int sg_cnt)
> +{
> +	struct ibtrs_clt_io_req *req;
> +	struct ibtrs_clt_sess *sess;
> +
> +	int err = -ECONNABORTED;
> +	struct path_it it;
> +	size_t usr_len;
> +
> +	usr_len = kvec_length(vec, nr);
> +	do_each_path(sess, clt, &it) {
> +		if (unlikely(sess->state != IBTRS_CLT_CONNECTED))
> +			continue;
> +
> +		if (unlikely(usr_len > IO_MSG_SIZE)) {
> +			ibtrs_wrn_rl(sess, "Write request failed, user message"
> +				     " size is %zu B, max size is %d B\n",
> +				     usr_len, IO_MSG_SIZE);
> +			err = -EMSGSIZE;
> +			break;
> +		}
> +		req = ibtrs_clt_get_req(sess, conf, tag, priv, vec, usr_len,
> +					sg, sg_cnt, data_len, DMA_TO_DEVICE);
> +		err = ibtrs_clt_write_req(req);
> +		if (unlikely(err)) {
> +			req->in_use = false;
> +			continue;
> +		}
> +		/* Success path */
> +		break;
> +	} while_each_path(&it);
> +
> +	return err;
> +}
> +
> +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req)
> +{
> +	struct ibtrs_clt_con *con = req->con;
> +	struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
> +	struct ibtrs_msg_rdma_read *msg;
> +	struct ibtrs_ib_dev *ibdev;
> +	struct scatterlist *sg;
> +
> +	int i, ret, count = 0;
> +	u32 imm, buf_id;
> +
> +	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
> +
> +	ibdev = sess->s.ib_dev;
> +
> +	if (unlikely(tsize > sess->chunk_size)) {
> +		ibtrs_wrn(sess, "Read request failed, message size is"
> +			  " %zu, bigger than CHUNK_SIZE %d\n", tsize,
> +			  sess->chunk_size);
> +		return -EMSGSIZE;
> +	}
> +
> +	if (req->sg_cnt) {
> +		count = ib_dma_map_sg(ibdev->dev, req->sglist, req->sg_cnt,
> +				      req->dir);
> +		if (unlikely(!count)) {
> +			ibtrs_wrn(sess, "Read request failed, "
> +				  "dma map failed\n");
> +			return -EINVAL;
> +		}
> +	}
> +	/* Put our message into req->buf after the user message */
> +	msg = req->iu->buf + req->usr_len;
> +	msg->type = cpu_to_le16(IBTRS_MSG_READ);
> +	msg->sg_cnt = cpu_to_le32(count);
> +	msg->usr_len = cpu_to_le16(req->usr_len);
> +
> +	if (count > fmr_sg_cnt) {
> +		ret = ibtrs_fast_reg_map_data(req->con, msg->desc, req);
> +		if (ret < 0) {
> +			ibtrs_err_rl(sess,
> +				     "Read request failed, failed to map "
> +				     "fast reg. data, err: %d\n", ret);
> +			ib_dma_unmap_sg(ibdev->dev, req->sglist, req->sg_cnt,
> +					req->dir);
> +			return ret;
> +		}
> +		msg->sg_cnt = cpu_to_le32(ret);
> +	} else {
> +		for_each_sg(req->sglist, sg, req->sg_cnt, i) {
> +			msg->desc[i].addr =
> +				cpu_to_le64(ib_sg_dma_address(ibdev->dev, sg));
> +			msg->desc[i].key =
> +				cpu_to_le32(ibdev->rkey);
> +			msg->desc[i].len =
> +				cpu_to_le32(ib_sg_dma_len(ibdev->dev, sg));
> +		}
> +		req->nmdesc = 0;
> +	}
> +	/*
> +	 * ibtrs message will be after the space reserved for disk data and
> +	 * user message
> +	 */
> +	imm = req->tag->mem_off + req->data_len + req->usr_len;
> +	imm = ibtrs_to_io_req_imm(imm);
> +	buf_id = req->tag->mem_id;
> +
> +	req->sg_size  = sizeof(*msg);
> +	req->sg_size += le32_to_cpu(msg->sg_cnt) * sizeof(struct ibtrs_sg_desc);
> +	req->sg_size += req->usr_len;
> +
> +	/*
> +	 * Update stats now; after the request is successfully sent it is
> +	 * no longer safe to touch it.
> +	 */
> +	ibtrs_clt_update_all_stats(req, READ);
> +
> +	ret = ibtrs_post_send_rdma(req->con, req, sess->srv_rdma_addr[buf_id],
> +				   req->data_len, imm);
> +	if (unlikely(ret)) {
> +		ibtrs_err(sess, "Read request failed: %d\n", ret);
> +		ibtrs_clt_decrease_inflight(&sess->stats);
> +		if (unlikely(count > fmr_sg_cnt))
> +			ibtrs_unmap_fast_reg_data(req->con, req);
> +		if (req->sg_cnt)
> +			ib_dma_unmap_sg(ibdev->dev, req->sglist,
> +					req->sg_cnt, req->dir);
> +	}
> +
> +	return ret;
> +}
> +
> +int ibtrs_clt_read(struct ibtrs_clt *clt, ibtrs_conf_fn *conf,
> +		   struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
> +		   size_t nr, size_t data_len, struct scatterlist *sg,
> +		   unsigned int sg_cnt)
> +{
> +	struct ibtrs_clt_io_req *req;
> +	struct ibtrs_clt_sess *sess;
> +
> +	int err = -ECONNABORTED;
> +	struct path_it it;
> +	size_t usr_len;
> +
> +	usr_len = kvec_length(vec, nr);
> +	do_each_path(sess, clt, &it) {
> +		if (unlikely(sess->state != IBTRS_CLT_CONNECTED))
> +			continue;
> +
> +		if (unlikely(usr_len > IO_MSG_SIZE ||
> +			     sizeof(struct ibtrs_msg_rdma_read) +
> +			     sg_cnt * sizeof(struct ibtrs_sg_desc) >
> +			     sess->max_req_size)) {
> +			ibtrs_wrn_rl(sess, "Read request failed, user message"
> +				     " size is %zu B, max size is %d B\n",
> +				     usr_len, IO_MSG_SIZE);
> +			err = -EMSGSIZE;
> +			break;
> +		}
> +		req = ibtrs_clt_get_req(sess, conf, tag, priv, vec, usr_len,
> +					sg, sg_cnt, data_len, DMA_FROM_DEVICE);
> +		err = ibtrs_clt_read_req(req);
> +		if (unlikely(err)) {
> +			req->in_use = false;
> +			continue;
> +		}
> +		/* Success path */
> +		break;
> +	} while_each_path(&it);
> +
> +	return err;
> +}
> +
> +int ibtrs_clt_request(int dir, ibtrs_conf_fn *conf, struct ibtrs_clt *clt,
> +		      struct ibtrs_tag *tag, void *priv, const struct kvec *vec,
> +		      size_t nr, size_t len, struct scatterlist *sg,
> +		      unsigned int sg_len)
> +{
> +	if (dir == READ)
> +		return ibtrs_clt_read(clt, conf, tag, priv, vec, nr, len, sg,
> +				      sg_len);
> +	else
> +		return ibtrs_clt_write(clt, conf, tag, priv, vec, nr, len, sg,
> +				       sg_len);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_request);
> +
> +int ibtrs_clt_query(struct ibtrs_clt *clt, struct ibtrs_attrs *attr)
> +{
> +	if (unlikely(!ibtrs_clt_is_connected(clt)))
> +		return -ECOMM;
> +
> +	attr->queue_depth      = clt->queue_depth;
> +	attr->max_io_size      = clt->max_io_size;
> +	strlcpy(attr->sessname, clt->sessname, sizeof(attr->sessname));
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(ibtrs_clt_query);
> +
> +int ibtrs_clt_create_path_from_sysfs(struct ibtrs_clt *clt,
> +				     struct ibtrs_addr *addr)
> +{
> +	struct ibtrs_clt_sess *sess;
> +	int err;
> +
> +	if (ibtrs_clt_path_exists(clt, addr))
> +		return -EEXIST;
> +
> +	sess = alloc_sess(clt, addr, nr_cons_per_session, clt->max_segments);
> +	if (unlikely(IS_ERR(sess)))
> +		return PTR_ERR(sess);
> +
> +	/*
> +	 * It is totally safe to add a path in the CONNECTING state: incoming
> +	 * IO will never grab it.  It is also very important to add the
> +	 * path before init, since init fires the LINK_CONNECTED event.
> +	 */
> +	err = ibtrs_clt_add_path_to_arr(sess, addr);
> +	if (unlikely(err))
> +		goto free_sess;
> +
> +	err = init_sess(sess);
> +	if (unlikely(err))
> +		goto close_sess;
> +
> +	err = ibtrs_clt_create_sess_files(sess);
> +	if (unlikely(err))
> +		goto close_sess;
> +
> +	return 0;
> +
> +close_sess:
> +	ibtrs_clt_remove_path_from_arr(sess);
> +	ibtrs_clt_close_conns(sess, true);
> +free_sess:
> +	free_sess(sess);
> +
> +	return err;
> +}
> +
> +static int check_module_params(void)
> +{
> +	if (fmr_sg_cnt > MAX_SEGMENTS || fmr_sg_cnt < 0) {
> +		pr_err("Invalid fmr_sg_cnt value\n");
> +		return -EINVAL;
> +	}
> +	if (nr_cons_per_session == 0)
> +		nr_cons_per_session = min_t(unsigned int, nr_cpu_ids, U16_MAX);
> +
> +	return 0;
> +}
> +
> +static int __init ibtrs_client_init(void)
> +{
> +	int err;
> +
> +	pr_info("Loading module %s, version: %s "
> +		"(use_fr: %d, retry_count: %d, "
> +		"fmr_sg_cnt: %d)\n",
> +		KBUILD_MODNAME, IBTRS_VER_STRING,
> +		use_fr, retry_count, fmr_sg_cnt);
> +	err = check_module_params();
> +	if (err) {
> +		pr_err("Failed to load module, invalid module parameters,"
> +		       " err: %d\n", err);
> +		return err;
> +	}
> +	ibtrs_wq = alloc_workqueue("ibtrs_client_wq", WQ_MEM_RECLAIM, 0);
> +	if (!ibtrs_wq) {
> +		pr_err("Failed to load module, alloc ibtrs_client_wq failed\n");
> +		return -ENOMEM;
> +	}
> +	err = ibtrs_clt_create_sysfs_module_files();
> +	if (err) {
> +		pr_err("Failed to load module, can't create sysfs files,"
> +		       " err: %d\n", err);
> +		goto out_ibtrs_wq;
> +	}
> +
> +	return 0;
> +
> +out_ibtrs_wq:
> +	destroy_workqueue(ibtrs_wq);
> +
> +	return err;
> +}
> +
> +static void __exit ibtrs_client_exit(void)
> +{
> +	ibtrs_clt_destroy_sysfs_module_files();
> +	destroy_workqueue(ibtrs_wq);
> +}
> +
> +module_init(ibtrs_client_init);
> +module_exit(ibtrs_client_exit);
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/24] ibtrs: client: sysfs interface functions
  2018-02-02 14:08 ` [PATCH 07/24] ibtrs: client: sysfs interface functions Roman Pen
@ 2018-02-05 11:20   ` Sagi Grimberg
  2018-02-06 12:28     ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 11:20 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

Hi Roman,

> This is the sysfs interface to IBTRS sessions on client side:
> 
>    /sys/kernel/ibtrs_client/<SESS-NAME>/
>      *** IBTRS session created by ibtrs_clt_open() API call
>      |
>      |- max_reconnect_attempts
>      |  *** number of reconnect attempts for session
>      |
>      |- add_path
>      |  *** adds another connection path into IBTRS session
>      |
>      |- paths/<DEST-IP>/
>         *** established paths to server in a session
>         |
>         |- disconnect
>         |  *** disconnect path
>         |
>         |- reconnect
>         |  *** reconnect path
>         |
>         |- remove_path
>         |  *** remove current path
>         |
>         |- state
>         |  *** retrieve current path state
>         |
>         |- stats/
>            *** current path statistics
>            |
> 	  |- cpu_migration
> 	  |- rdma
> 	  |- rdma_lat
> 	  |- reconnects
> 	  |- reset_all
> 	  |- sg_entries
> 	  |- wc_completions
> 
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
> Cc: Jack Wang <jinpu.wang@profitbricks.com>

I think stats usually belong in debugfs.
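
For reference, a minimal sketch (function and file names below are hypothetical,
not taken from the patch) of what a debugfs-based stats hierarchy could look
like, assuming the per-path counters stay plain u64 values:

	#include <linux/debugfs.h>

	static struct dentry *ibtrs_dbg_root;

	static void ibtrs_clt_debugfs_init(void)
	{
		/* creates /sys/kernel/debug/ibtrs_client/ */
		ibtrs_dbg_root = debugfs_create_dir("ibtrs_client", NULL);
	}

	static void ibtrs_clt_debugfs_add_path(const char *pathname,
					       u64 *reconnects)
	{
		struct dentry *dir;

		/* one directory per path, e.g. .../<DEST-IP>/ */
		dir = debugfs_create_dir(pathname, ibtrs_dbg_root);
		debugfs_create_u64("reconnects", 0444, dir, reconnects);
	}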

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/24] ibtrs: server: main functionality
  2018-02-02 14:08 ` [PATCH 09/24] ibtrs: server: main functionality Roman Pen
@ 2018-02-05 11:29   ` Sagi Grimberg
  2018-02-06 12:46     ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 11:29 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

Hi Roman,

Some comments below.

On 02/02/2018 04:08 PM, Roman Pen wrote:
> This is main functionality of ibtrs-server module, which accepts
> set of RDMA connections (so called IBTRS session), creates/destroys
> sysfs entries associated with IBTRS session and notifies upper layer
> (user of IBTRS API) about RDMA requests or link events.
> 
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
> Cc: Jack Wang <jinpu.wang@profitbricks.com>
> ---
>   drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1811 ++++++++++++++++++++++++++++++
>   1 file changed, 1811 insertions(+)
> 
> diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
> new file mode 100644
> index 000000000000..0d1fc08bd821
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
> @@ -0,0 +1,1811 @@
> +/*
> + * InfiniBand Transport Layer
> + *
> + * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
> + * Authors: Fabian Holler <mail@fholler.de>
> + *          Jack Wang <jinpu.wang@profitbricks.com>
> + *          Kleber Souza <kleber.souza@profitbricks.com>
> + *          Danil Kipnis <danil.kipnis@profitbricks.com>
> + *          Roman Penyaev <roman.penyaev@profitbricks.com>
> + *          Milind Dumbare <Milind.dumbare@gmail.com>
> + *
> + * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
> + * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
> + *          Roman Penyaev <roman.penyaev@profitbricks.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/mempool.h>
> +
> +#include "ibtrs-srv.h"
> +#include "ibtrs-log.h"
> +
> +MODULE_AUTHOR("ibnbd@profitbricks.com");
> +MODULE_DESCRIPTION("IBTRS Server");
> +MODULE_VERSION(IBTRS_VER_STRING);
> +MODULE_LICENSE("GPL");
> +
> +#define DEFAULT_MAX_IO_SIZE_KB 128
> +#define DEFAULT_MAX_IO_SIZE (DEFAULT_MAX_IO_SIZE_KB * 1024)
> +#define MAX_REQ_SIZE PAGE_SIZE
> +#define MAX_SG_COUNT ((MAX_REQ_SIZE - sizeof(struct ibtrs_msg_rdma_read)) \
> +		      / sizeof(struct ibtrs_sg_desc))
> +
> +static int max_io_size = DEFAULT_MAX_IO_SIZE;
> +static int rcv_buf_size = DEFAULT_MAX_IO_SIZE + MAX_REQ_SIZE;
> +
> +static int max_io_size_set(const char *val, const struct kernel_param *kp)
> +{
> +	int err, ival;
> +
> +	err = kstrtoint(val, 0, &ival);
> +	if (err)
> +		return err;
> +
> +	if (ival < 4096 || ival + MAX_REQ_SIZE > (4096 * 1024) ||
> +	    (ival + MAX_REQ_SIZE) % 512 != 0) {
> +		pr_err("Invalid max io size value %d, has to be"
> +		       " > %d, < %d\n", ival, 4096, 4194304);
> +		return -EINVAL;
> +	}
> +
> +	max_io_size = ival;
> +	rcv_buf_size = max_io_size + MAX_REQ_SIZE;
> +	pr_info("max io size changed to %d\n", ival);
> +
> +	return 0;
> +}
> +
> +static const struct kernel_param_ops max_io_size_ops = {
> +	.set		= max_io_size_set,
> +	.get		= param_get_int,
> +};
> +module_param_cb(max_io_size, &max_io_size_ops, &max_io_size, 0444);
> +MODULE_PARM_DESC(max_io_size,
> +		 "Max size for each IO request, in bytes"
> +		 " (default: " __stringify(DEFAULT_MAX_IO_SIZE_KB) "KB)");
> +
> +#define DEFAULT_SESS_QUEUE_DEPTH 512
> +static int sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
> +module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
> +MODULE_PARM_DESC(sess_queue_depth,
> +		 "Number of buffers for pending I/O requests to allocate"
> +		 " per session. Maximum: " __stringify(MAX_SESS_QUEUE_DEPTH)
> +		 " (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
> +
> +/* We guarantee to serve at least 10 paths */
> +#define CHUNK_POOL_SIZE (DEFAULT_SESS_QUEUE_DEPTH * 10)
> +static mempool_t *chunk_pool;
> +
> +static int retry_count = 7;
> +
> +static int retry_count_set(const char *val, const struct kernel_param *kp)
> +{
> +	int err, ival;
> +
> +	err = kstrtoint(val, 0, &ival);
> +	if (err)
> +		return err;
> +
> +	if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT) {
> +		pr_err("Invalid retry count value %d, has to be"
> +		       " > %d, < %d\n", ival, MIN_RTR_CNT, MAX_RTR_CNT);
> +		return -EINVAL;
> +	}
> +
> +	retry_count = ival;
> +	pr_info("QP retry count changed to %d\n", ival);
> +
> +	return 0;
> +}
> +
> +static const struct kernel_param_ops retry_count_ops = {
> +	.set		= retry_count_set,
> +	.get		= param_get_int,
> +};
> +module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
> +
> +MODULE_PARM_DESC(retry_count, "Number of times to send the message if the"
> +		 " remote side didn't respond with Ack or Nack (default: 7,"
> +		 " min: " __stringify(MIN_RTR_CNT) ", max: "
> +		 __stringify(MAX_RTR_CNT) ")");
> +
> +static char cq_affinity_list[256] = "";
> +static cpumask_t cq_affinity_mask = { CPU_BITS_ALL };
> +
> +static void init_cq_affinity(void)
> +{
> +	sprintf(cq_affinity_list, "0-%d", nr_cpu_ids - 1);
> +}
> +
> +static int cq_affinity_list_set(const char *val, const struct kernel_param *kp)
> +{
> +	int ret = 0, len = strlen(val);
> +	cpumask_var_t new_value;
> +
> +	if (!strlen(cq_affinity_list))
> +		init_cq_affinity();
> +
> +	if (len >= sizeof(cq_affinity_list))
> +		return -EINVAL;
> +	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	ret = cpulist_parse(val, new_value);
> +	if (ret) {
> +		pr_err("Can't set cq_affinity_list \"%s\": %d\n", val,
> +		       ret);
> +		goto free_cpumask;
> +	}
> +
> +	strlcpy(cq_affinity_list, val, sizeof(cq_affinity_list));
> +	*strchrnul(cq_affinity_list, '\n') = '\0';
> +	cpumask_copy(&cq_affinity_mask, new_value);
> +
> +	pr_info("cq_affinity_list changed to %*pbl\n",
> +		cpumask_pr_args(&cq_affinity_mask));
> +free_cpumask:
> +	free_cpumask_var(new_value);
> +	return ret;
> +}
> +
> +static struct kparam_string cq_affinity_list_kparam_str = {
> +	.maxlen	= sizeof(cq_affinity_list),
> +	.string	= cq_affinity_list
> +};
> +
> +static const struct kernel_param_ops cq_affinity_list_ops = {
> +	.set	= cq_affinity_list_set,
> +	.get	= param_get_string,
> +};
> +
> +module_param_cb(cq_affinity_list, &cq_affinity_list_ops,
> +		&cq_affinity_list_kparam_str, 0644);
> +MODULE_PARM_DESC(cq_affinity_list, "Sets the list of cpus to use as cq vectors"
> +		 " (default: use all possible CPUs)");
> +

Can you explain why not using configfs?
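
For comparison, a minimal configfs sketch (the "ibtrs_server" subsystem name
and the single read-only attribute are hypothetical, only meant to show the
shape of such an interface, not a drop-in replacement for the module
parameters quoted above):

	#include <linux/configfs.h>
	#include <linux/module.h>

	static ssize_t ibtrs_srv_max_io_size_show(struct config_item *item,
						  char *page)
	{
		/* hypothetical: would print the current max_io_size */
		return sprintf(page, "%d\n", 0);
	}
	CONFIGFS_ATTR_RO(ibtrs_srv_, max_io_size);

	static struct configfs_attribute *ibtrs_srv_attrs[] = {
		&ibtrs_srv_attr_max_io_size,
		NULL,
	};

	static struct config_item_type ibtrs_srv_type = {
		.ct_attrs = ibtrs_srv_attrs,
		.ct_owner = THIS_MODULE,
	};

	static struct configfs_subsystem ibtrs_srv_subsys = {
		.su_group = {
			.cg_item = {
				.ci_namebuf = "ibtrs_server",
				.ci_type    = &ibtrs_srv_type,
			},
		},
	};

	static int ibtrs_srv_configfs_init(void)
	{
		config_group_init(&ibtrs_srv_subsys.su_group);
		mutex_init(&ibtrs_srv_subsys.su_mutex);
		return configfs_register_subsystem(&ibtrs_srv_subsys);
	}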

> +static void ibtrs_srv_close_work(struct work_struct *work)
> +{
> +	struct ibtrs_srv_sess *sess;
> +	struct ibtrs_srv_ctx *ctx;
> +	struct ibtrs_srv_con *con;
> +	int i;
> +
> +	sess = container_of(work, typeof(*sess), close_work);
> +	ctx = sess->srv->ctx;
> +
> +	ibtrs_srv_destroy_sess_files(sess);
> +	ibtrs_srv_stop_hb(sess);
> +
> +	for (i = 0; i < sess->s.con_num; i++) {
> +		con = to_srv_con(sess->s.con[i]);
> +		if (!con)
> +			continue;
> +
> +		rdma_disconnect(con->c.cm_id);
> +		ib_drain_qp(con->c.qp);
> +	}
> +	/* Wait for all inflights */
> +	ibtrs_srv_wait_ops_ids(sess);
> +
> +	/* Notify upper layer if we are the last path */
> +	ibtrs_srv_sess_down(sess);
> +
> +	unmap_cont_bufs(sess);
> +	ibtrs_srv_free_ops_ids(sess);
> +
> +	for (i = 0; i < sess->s.con_num; i++) {
> +		con = to_srv_con(sess->s.con[i]);
> +		if (!con)
> +			continue;
> +
> +		ibtrs_cq_qp_destroy(&con->c);
> +		rdma_destroy_id(con->c.cm_id);
> +		kfree(con);
> +	}
> +	ibtrs_ib_dev_put(sess->s.ib_dev);
> +
> +	del_path_from_srv(sess);
> +	put_srv(sess->srv);
> +	sess->srv = NULL;
> +	ibtrs_srv_change_state(sess, IBTRS_SRV_CLOSED);
> +
> +	kfree(sess->rdma_addr);
> +	kfree(sess->s.con);
> +	kfree(sess);
> +}
> +
> +static int ibtrs_rdma_do_accept(struct ibtrs_srv_sess *sess,
> +				struct rdma_cm_id *cm_id)
> +{
> +	struct ibtrs_srv *srv = sess->srv;
> +	struct ibtrs_msg_conn_rsp msg;
> +	struct rdma_conn_param param;
> +	int err;
> +
> +	memset(&param, 0, sizeof(param));
> +	param.retry_count = retry_count;
> +	param.rnr_retry_count = 7;
> +	param.private_data = &msg;
> +	param.private_data_len = sizeof(msg);
> +
> +	memset(&msg, 0, sizeof(msg));
> +	msg.magic = cpu_to_le16(IBTRS_MAGIC);
> +	msg.version = cpu_to_le16(IBTRS_VERSION);
> +	msg.errno = 0;
> +	msg.queue_depth = cpu_to_le16(srv->queue_depth);
> +	msg.rkey = cpu_to_le32(sess->s.ib_dev->rkey);

As said, this cannot happen anymore...

> +static struct rdma_cm_id *ibtrs_srv_cm_init(struct ibtrs_srv_ctx *ctx,
> +					    struct sockaddr *addr,
> +					    enum rdma_port_space ps)
> +{
> +	struct rdma_cm_id *cm_id;
> +	int ret;
> +
> +	cm_id = rdma_create_id(&init_net, ibtrs_srv_rdma_cm_handler,
> +			       ctx, ps, IB_QPT_RC);
> +	if (IS_ERR(cm_id)) {
> +		ret = PTR_ERR(cm_id);
> +		pr_err("Creating id for RDMA connection failed, err: %d\n",
> +		       ret);
> +		goto err_out;
> +	}
> +	ret = rdma_bind_addr(cm_id, addr);
> +	if (ret) {
> +		pr_err("Binding RDMA address failed, err: %d\n", ret);
> +		goto err_cm;
> +	}
> +	ret = rdma_listen(cm_id, 64);
> +	if (ret) {
> +		pr_err("Listening on RDMA connection failed, err: %d\n",
> +		       ret);
> +		goto err_cm;
> +	}
> +
> +	switch (addr->sa_family) {
> +	case AF_INET:
> +		pr_debug("listening on port %u\n",
> +			 ntohs(((struct sockaddr_in *)addr)->sin_port));
> +		break;
> +	case AF_INET6:
> +		pr_debug("listening on port %u\n",
> +			 ntohs(((struct sockaddr_in6 *)addr)->sin6_port));
> +		break;
> +	case AF_IB:
> +		pr_debug("listening on service id 0x%016llx\n",
> +			 be64_to_cpu(rdma_get_service_id(cm_id, addr)));
> +		break;
> +	default:
> +		pr_debug("listening on address family %u\n", addr->sa_family);
> +	}

We already have printk format specifiers that accept socket addresses (%pIS and friends)...
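
For the IPv4/IPv6 cases something along these lines (an untested sketch,
reusing the addr and cm_id variables from the code above) could replace most
of the switch, since %pISp prints both IPv4 and IPv6 sockaddrs together with
the port; only the AF_IB case would still need special handling:

	if (addr->sa_family == AF_IB)
		pr_debug("listening on service id 0x%016llx\n",
			 be64_to_cpu(rdma_get_service_id(cm_id, addr)));
	else
		pr_debug("listening on %pISp\n", addr);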

> +
> +	return cm_id;
> +
> +err_cm:
> +	rdma_destroy_id(cm_id);
> +err_out:
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static int ibtrs_srv_rdma_init(struct ibtrs_srv_ctx *ctx, unsigned int port)
> +{
> +	struct sockaddr_in6 sin = {
> +		.sin6_family	= AF_INET6,
> +		.sin6_addr	= IN6ADDR_ANY_INIT,
> +		.sin6_port	= htons(port),
> +	};
> +	struct sockaddr_ib sib = {
> +		.sib_family			= AF_IB,
> +		.sib_addr.sib_subnet_prefix	= 0ULL,
> +		.sib_addr.sib_interface_id	= 0ULL,
> +		.sib_sid	= cpu_to_be64(RDMA_IB_IP_PS_IB | port),
> +		.sib_sid_mask	= cpu_to_be64(0xffffffffffffffffULL),
> +		.sib_pkey	= cpu_to_be16(0xffff),
> +	};

ipv4?

> +	struct rdma_cm_id *cm_ip, *cm_ib;
> +	int ret;
> +
> +	/*
> +	 * We accept both IPoIB and IB connections, so we need to keep
> +	 * two cm id's, one for each socket type and port space.
> +	 * If the cm initialization of one of the id's fails, we abort
> +	 * everything.
> +	 */
> +	cm_ip = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sin, RDMA_PS_TCP);
> +	if (unlikely(IS_ERR(cm_ip)))
> +		return PTR_ERR(cm_ip);
> +
> +	cm_ib = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sib, RDMA_PS_IB);
> +	if (unlikely(IS_ERR(cm_ib))) {
> +		ret = PTR_ERR(cm_ib);
> +		goto free_cm_ip;
> +	}
> +
> +	ctx->cm_id_ip = cm_ip;
> +	ctx->cm_id_ib = cm_ib;
> +
> +	return 0;
> +
> +free_cm_ip:
> +	rdma_destroy_id(cm_ip);
> +
> +	return ret;
> +}
> +
> +static struct ibtrs_srv_ctx *alloc_srv_ctx(rdma_ev_fn *rdma_ev,
> +					   link_ev_fn *link_ev)
> +{
> +	struct ibtrs_srv_ctx *ctx;
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return NULL;
> +
> +	ctx->rdma_ev = rdma_ev;
> +	ctx->link_ev = link_ev;
> +	mutex_init(&ctx->srv_mutex);
> +	INIT_LIST_HEAD(&ctx->srv_list);
> +
> +	return ctx;
> +}
> +
> +static void free_srv_ctx(struct ibtrs_srv_ctx *ctx)
> +{
> +	WARN_ON(!list_empty(&ctx->srv_list));
> +	kfree(ctx);
> +}
> +
> +struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
> +				     unsigned int port)
> +{
> +	struct ibtrs_srv_ctx *ctx;
> +	int err;
> +
> +	ctx = alloc_srv_ctx(rdma_ev, link_ev);
> +	if (unlikely(!ctx))
> +		return ERR_PTR(-ENOMEM);
> +
> +	err = ibtrs_srv_rdma_init(ctx, port);
> +	if (unlikely(err)) {
> +		free_srv_ctx(ctx);
> +		return ERR_PTR(err);
> +	}
> +	/* Do not let module be unloaded if server context is alive */
> +	__module_get(THIS_MODULE);
> +
> +	return ctx;
> +}
> +EXPORT_SYMBOL(ibtrs_srv_open);
> +
> +void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess)
> +{
> +	close_sess(sess);
> +}
> +
> +static void close_sess(struct ibtrs_srv_sess *sess)
> +{
> +	enum ibtrs_srv_state old_state;
> +
> +	if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING,
> +					   &old_state))
> +		queue_work(ibtrs_wq, &sess->close_work);
> +	WARN_ON(sess->state != IBTRS_SRV_CLOSING);
> +}
> +
> +static void close_sessions(struct ibtrs_srv *srv)
> +{
> +	struct ibtrs_srv_sess *sess;
> +
> +	mutex_lock(&srv->paths_mutex);
> +	list_for_each_entry(sess, &srv->paths_list, s.entry)
> +		close_sess(sess);
> +	mutex_unlock(&srv->paths_mutex);
> +}
> +
> +static void close_ctx(struct ibtrs_srv_ctx *ctx)
> +{
> +	struct ibtrs_srv *srv;
> +
> +	mutex_lock(&ctx->srv_mutex);
> +	list_for_each_entry(srv, &ctx->srv_list, ctx_list)
> +		close_sessions(srv);
> +	mutex_unlock(&ctx->srv_mutex);
> +	flush_workqueue(ibtrs_wq);
> +}
> +
> +void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx)
> +{
> +	rdma_destroy_id(ctx->cm_id_ip);
> +	rdma_destroy_id(ctx->cm_id_ib);
> +	close_ctx(ctx);
> +	free_srv_ctx(ctx);
> +	module_put(THIS_MODULE);
> +}
> +EXPORT_SYMBOL(ibtrs_srv_close);
> +
> +static int check_module_params(void)
> +{
> +	if (sess_queue_depth < 1 || sess_queue_depth > MAX_SESS_QUEUE_DEPTH) {
> +		pr_err("Invalid sess_queue_depth parameter value\n");
> +		return -EINVAL;
> +	}
> +
> +	/* check if IB immediate data size is enough to hold the mem_id and the
> +	 * offset inside the memory chunk
> +	 */
> +	if (ilog2(sess_queue_depth - 1) + ilog2(rcv_buf_size - 1) >
> +	    MAX_IMM_PAYL_BITS) {
> +		pr_err("RDMA immediate size (%db) not enough to encode "
> +		       "%d buffers of size %dB. Reduce 'sess_queue_depth' "
> +		       "or 'max_io_size' parameters.\n", MAX_IMM_PAYL_BITS,
> +		       sess_queue_depth, rcv_buf_size);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int __init ibtrs_server_init(void)
> +{
> +	int err;
> +
> +	if (!strlen(cq_affinity_list))
> +		init_cq_affinity();
> +
> +	pr_info("Loading module %s, version: %s "
> +		"(retry_count: %d, cq_affinity_list: %s, "
> +		"max_io_size: %d, sess_queue_depth: %d)\n",
> +		KBUILD_MODNAME, IBTRS_VER_STRING, retry_count,
> +		cq_affinity_list, max_io_size, sess_queue_depth);
> +
> +	err = check_module_params();
> +	if (err) {
> +		pr_err("Failed to load module, invalid module parameters,"
> +		       " err: %d\n", err);
> +		return err;
> +	}
> +	chunk_pool = mempool_create_page_pool(CHUNK_POOL_SIZE,
> +					      get_order(rcv_buf_size));
> +	if (unlikely(!chunk_pool)) {
> +		pr_err("Failed preallocate pool of chunks\n");
> +		return -ENOMEM;
> +	}
> +	ibtrs_wq = alloc_workqueue("ibtrs_server_wq", WQ_MEM_RECLAIM, 0);
> +	if (!ibtrs_wq) {
> +		pr_err("Failed to load module, alloc ibtrs_server_wq failed\n");
> +		err = -ENOMEM;
> +		goto out_chunk_pool;
> +	}
> +	err = ibtrs_srv_create_sysfs_module_files();
> +	if (err) {
> +		pr_err("Failed to load module, can't create sysfs files,"
> +		       " err: %d\n", err);
> +		goto out_ibtrs_wq;
> +	}
> +
> +	return 0;
> +
> +out_ibtrs_wq:
> +	destroy_workqueue(ibtrs_wq);
> +out_chunk_pool:
> +	mempool_destroy(chunk_pool);
> +
> +	return err;
> +}
> +
> +static void __exit ibtrs_server_exit(void)
> +{
> +	ibtrs_srv_destroy_sysfs_module_files();
> +	destroy_workqueue(ibtrs_wq);
> +	mempool_destroy(chunk_pool);
> +}
> +
> +module_init(ibtrs_server_init);
> +module_exit(ibtrs_server_exit);
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05  8:56   ` Jinpu Wang
@ 2018-02-05 11:36     ` Sagi Grimberg
  2018-02-05 13:38       ` Danil Kipnis
  2018-02-05 16:16     ` Bart Van Assche
  1 sibling, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 11:36 UTC (permalink / raw)
  To: Jinpu Wang, Bart Van Assche
  Cc: roman.penyaev, linux-block, linux-rdma, danil.kipnis, hch,
	ogerlitz, axboe


> Hi Bart,
> 
> Another two cents from me :)
> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>> o Simple configuration of IBNBD:
>>>     - Server side is completely passive: volumes do not need to be
>>>       explicitly exported.
>>
>> That sounds like a security hole? I think the ability to configure whether or
>> not an initiator is allowed to log in is essential and also which volumes an
>> initiator has access to.
> Our design targets a well controlled production environment, so
> security is handled in another layer.

What will happen to a new adopter of the code you are contributing?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
                   ` (25 preceding siblings ...)
  2018-02-02 17:05 ` Bart Van Assche
@ 2018-02-05 12:16 ` Sagi Grimberg
  2018-02-05 12:30   ` Sagi Grimberg
                     ` (2 more replies)
  26 siblings, 3 replies; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 12:16 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

Hi Roman and the team,

On 02/02/2018 04:08 PM, Roman Pen wrote:
> This series introduces IBNBD/IBTRS modules.
> 
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connection between client and server
> machines via RDMA.

So it's not strictly InfiniBand then, correct?

> It is optimized to transfer (read/write) IO blocks
> in the sense that it follows the BIO semantics of providing the
> possibility to either write data from a scatter-gather list to the
> remote side or to request ("read") data transfer from the remote side
> into a given set of buffers.
> 
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality.

A couple of questions on your multipath implementation:
1. What was your main objective over dm-multipath?
2. What was the consideration behind this implementation over
creating a stand-alone bio-based device node to reinject the
bio into the original block device?

> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
> 
> Why?
> 
>     - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>       thus internal protocol is simple and consists of several request
> 	 types only without awareness of underlaying hardware devices.

Can you explain how the protocol is developed for thin-p? What is the
essence of how it's suited for it?

>     - IBTRS was developed as an independent RDMA transport library, which
>       supports fail-over and load-balancing policies using multipath, thus
> 	 it can be used for any other IO needs rather than only for block
> 	 device.

What do you mean by "any other IO"?

>     - IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
>       https://www.spinics.net/lists/linux-rdma/msg48799.html
>       (I retested on latest 4.14 kernel - there is no any significant
> 	  difference, thus I post the old link).

That is interesting to learn.

Reading your reference brings up a couple of questions though:
- It's unclear to me how ibnbd performs reads without performing memory
   registration. Is it using the global dma rkey?

- It's unclear to me how there is a difference in noreg for writes,
   because for small writes nvme-rdma never registers memory (it uses
   inline data).

- Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, which
   seems considerably low against other reports. Can you try and explain
   what the bottleneck was? This can be a potential bug and I (and the
   rest of the community) am interested in knowing more details.

- The srp/scst comparison is really not fair with it in legacy request
   mode. Can you please repeat it and report a bug to either linux-rdma
   or to the scst mailing list?

- Your latency measurements are surprisingly high for a null target
   device (even for a low-end nvme device actually) regardless of the
   transport implementation.

For example:
- QD=1 read latency is 648.95 for ibnbd (I assume usecs, right?), which is
   fairly high. On nvme-rdma it's 1058 us, which means over 1 millisecond,
   and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
   latency I got ~14 us. So something does not add up here. If this is
   not some configuration issue, then we have serious bugs to handle..

- At QD=16 the read latencies are > 10ms for null devices?! I'm having
   trouble understanding how you were able to get such high latencies
   (> 100 ms for QD>=100).

Can you share more information about your setup? It would really help
us understand more.

>     - Major parts of the code were rewritten, simplified and overall code
>       size was reduced by a quarter.

That is good to know.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 12:16 ` Sagi Grimberg
@ 2018-02-05 12:30   ` Sagi Grimberg
  2018-02-07 13:06     ` Roman Penyaev
  2018-02-05 16:58   ` Bart Van Assche
  2018-02-06 13:12   ` Roman Penyaev
  2 siblings, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 12:30 UTC (permalink / raw)
  To: Roman Pen, linux-block, linux-rdma
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, Or Gerlitz,
	Danil Kipnis, Jack Wang

Hi Roman and the team (again), replying to my own email :)

I forgot to mention that first of all thank you for upstreaming
your work! I fully support your goal to have your production driver
upstream to minimize your maintenance efforts. I hope that my
feedback didn't come across with a different impression; that was
certainly not my intent.

It would be great if you can address and/or reply to my feedback
(as well as others) and re-spin it again.

Cheers,
Sagi.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/24] ibnbd: client: main functionality
  2018-02-02 15:11   ` Jens Axboe
@ 2018-02-05 12:54     ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-05 12:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-rdma, Christoph Hellwig, Sagi Grimberg,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

On Fri, Feb 2, 2018 at 4:11 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2/2/18 7:08 AM, Roman Pen wrote:
>> This is main functionality of ibnbd-client module, which provides
>> interface to map remote device as local block device /dev/ibnbd<N>
>> and feeds IBTRS with IO requests.
>
> Kill the legacy IO path for this, the driver should only support
> blk-mq. Hence kill off your BLK_RQ part, that will eliminate
> the dual path you have too.

ok, sounds good. Frankly, we have not used the rq path for a long time.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 23/24] ibnbd: a bit of documentation
  2018-02-02 15:55   ` Bart Van Assche
@ 2018-02-05 13:03     ` Roman Penyaev
  2018-02-05 14:16       ` Sagi Grimberg
  0 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-05 13:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-rdma, danil.kipnis, hch, ogerlitz, jinpu.wang,
	axboe, sagi

On Fri, Feb 2, 2018 at 4:55 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Fri, 2018-02-02 at 15:09 +0100, Roman Pen wrote:
>> +Entries under /sys/kernel/ibnbd_client/
>> +=======================================
>> [ ... ]
>
> You will need Greg KH's permission to add new entries directly under /sys/kernel.
> Since I think that it is unlikely that he will give that permission: have you
> considered to add the new client entries under /sys/class/block for the client and
> /sys/kernel/configfs/ibnbd for the target, similar to what the NVMeOF drivers do
> today?

/sys/kernel was chosen ages ago and I completely forgot to move it to configfs.
IBTRS is not a block device, so for some read-only entries (statistics or
states) something else should probably be used, not configfs.  Or is it fine
to read the state of the connection from configfs?  To me that sounds a bit
weird.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/24] ibtrs: client: main functionality
  2018-02-02 16:54   ` Bart Van Assche
@ 2018-02-05 13:27     ` Roman Penyaev
  2018-02-05 14:14       ` Sagi Grimberg
  0 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-05 13:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-rdma, danil.kipnis, hch, ogerlitz, jinpu.wang,
	axboe, sagi

On Fri, Feb 2, 2018 at 5:54 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> +     size_t max_depth = clt->queue_depth;
>> +     struct ibtrs_tag *tag;
>> +     int cpu, bit;
>> +
>> +     cpu = get_cpu();
>> +     do {
>> +             bit = find_first_zero_bit(clt->tags_map, max_depth);
>> +             if (unlikely(bit >= max_depth)) {
>> +                     put_cpu();
>> +                     return NULL;
>> +             }
>> +
>> +     } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> +     put_cpu();
>> +
>> +     tag = GET_TAG(clt, bit);
>> +     WARN_ON(tag->mem_id != bit);
>> +     tag->cpu_id = cpu;
>> +     tag->con_type = con_type;
>> +
>> +     return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +                                struct ibtrs_tag *tag)
>> +{
>> +     clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> +                                 enum ibtrs_clt_con_type con_type,
>> +                                 int can_wait)
>> +{
>> +     struct ibtrs_tag *tag;
>> +     DEFINE_WAIT(wait);
>> +
>> +     tag = __ibtrs_get_tag(clt, con_type);
>> +     if (likely(tag) || !can_wait)
>> +             return tag;
>> +
>> +     do {
>> +             prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
>> +             tag = __ibtrs_get_tag(clt, con_type);
>> +             if (likely(tag))
>> +                     break;
>> +
>> +             io_schedule();
>> +     } while (1);
>> +
>> +     finish_wait(&clt->tags_wait, &wait);
>> +
>> +     return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> +     if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> +             return;
>> +
>> +     __ibtrs_put_tag(clt, tag);
>> +
>> +     /*
>> +      * Putting a tag is a barrier, so we will observe
>> +      * new entry in the wait list, no worries.
>> +      */
>> +     if (waitqueue_active(&clt->tags_wait))
>> +             wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
> Do these functions have any advantage over the code in lib/sbitmap.c? If not,
> please call the sbitmap functions instead of adding an additional tag allocator.

Indeed, it seems sbitmap can be reused.

But tags are a part of IBTRS and are not related to the block device at all.
One IBTRS connection (session) handles many block devices (or any IO
producers).  With a tag you get a free buffer slot where you can read/write,
so once you've allocated a tag you won't sleep on the IO path inside the
library.  A tag also helps a lot on IO fail-over to another connection (the
multipath implementation, which is also a part of the transport library, not
of the block device), where you simply reuse the same buffer slot (with the
tag in your hands) while forwarding IO to another RDMA connection.
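
Just to sketch the idea (untested, all helper names made up), the open-coded
tags_map above could probably sit directly on top of lib/sbitmap:

	#include <linux/gfp.h>
	#include <linux/numa.h>
	#include <linux/sbitmap.h>

	static struct sbitmap_queue tags_map;

	/* once per session, depth == clt->queue_depth */
	static int init_tags(unsigned int queue_depth)
	{
		return sbitmap_queue_init_node(&tags_map, queue_depth, -1,
					       false, GFP_KERNEL, NUMA_NO_NODE);
	}

	/* replaces the find_first_zero_bit()/test_and_set_bit_lock() loop */
	static int get_tag_bit(void)
	{
		/* returns the allocated bit or -1 if all tags are in use */
		return __sbitmap_queue_get(&tags_map);
	}

	static void put_tag_bit(int bit, unsigned int cpu)
	{
		/* clears the bit and wakes up waiters */
		sbitmap_queue_clear(&tags_map, bit, cpu);
	}

The can_wait path would still need its own wait loop, but the wake-up on a
freed slot comes from sbitmap_queue_clear().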

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 11:36     ` Sagi Grimberg
@ 2018-02-05 13:38       ` Danil Kipnis
  2018-02-05 14:17         ` Sagi Grimberg
  0 siblings, 1 reply; 79+ messages in thread
From: Danil Kipnis @ 2018-02-05 13:38 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jinpu Wang, Bart Van Assche, roman.penyaev, linux-block,
	linux-rdma, hch, ogerlitz, axboe

>
>> Hi Bart,
>>
>> Another two cents from me :)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com>
>> wrote:
>>>
>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>>
>>>> o Simple configuration of IBNBD:
>>>>     - Server side is completely passive: volumes do not need to be
>>>>       explicitly exported.
>>>
>>>
>>> That sounds like a security hole? I think the ability to configure
>>> whether or
>>> not an initiator is allowed to log in is essential and also which volumes
>>> an
>>> initiator has access to.
>>
>> Our design targets a well controlled production environment, so
>> security is handled in another layer.
>
>
> What will happen to a new adopter of the code you are contributing?

Hi Sagi, Hi Bart,
thanks for your feedback.
We considered the "storage cluster" setup, where each ibnbd client has
access to each ibnbd server. Each ibnbd server manages devices under
his "dev_search_path" and can provide access to them to any ibnbd
client in the network. On top of that Ibnbd server has an additional
"artificial" restriction, that a device can be mapped in writable-mode
by only one client at once.

-- 
Danil

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/24] ibtrs: client: main functionality
  2018-02-05 13:27     ` Roman Penyaev
@ 2018-02-05 14:14       ` Sagi Grimberg
  2018-02-05 17:05         ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 14:14 UTC (permalink / raw)
  To: Roman Penyaev, Bart Van Assche
  Cc: linux-block, linux-rdma, danil.kipnis, hch, ogerlitz, jinpu.wang, axboe


> Indeed, seems sbitmap can be reused.
> 
> But tags are a part of IBTRS and are not related to the block device at all.  One
> IBTRS connection (session) handles many block devices

we use host shared tag sets for the case of multiple block devices.

> (or any IO producers).

Let's wait until we actually have these theoretical non-block IO
producers..

> With a tag you get a free slot of a buffer where you can read/write, so once
> you've allocated a tag you won't sleep on IO path inside a library.

Same for block tags (given that you don't set the request queue
otherwise)

> Also tag
> helps a lot on IO fail-over to another connection (multipath implementation,
> which is also a part of the transport library, not a block device), where you
> simply reuse the same buffer slot (with a tag in your hands) forwarding IO to
> another RDMA connection.

What is the benefit of this detached architecture? IMO, one reason why
you ended up not reusing a lot of the infrastructure stems from the
attempt to support a theoretical different consumer that is not ibnbd.
Did you actually have plans for any other consumers?

Personally, I think you will be much better off with a unified approach
for your block device implementation.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 23/24] ibnbd: a bit of documentation
  2018-02-05 13:03     ` Roman Penyaev
@ 2018-02-05 14:16       ` Sagi Grimberg
  0 siblings, 0 replies; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 14:16 UTC (permalink / raw)
  To: Roman Penyaev, Bart Van Assche
  Cc: linux-block, linux-rdma, danil.kipnis, hch, ogerlitz, jinpu.wang, axboe


> /sys/kernel was chosen ages ago and I completely forgot to move it to configfs.
> IBTRS is not a block device, so for some read-only entries (statistics or
> states) something else should probably be used, not configfs.  Or is it fine
> to read the state of the connection from configfs?  To me that sounds a bit
> weird.

Configuration goes via configfs; states and the like go to sysfs (but you
need your own class device).
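
To illustrate, a rough, untested sketch of such a class device with a single
read-only "state" attribute (all names here are hypothetical, not the actual
IBNBD/IBTRS sysfs layout):

	#include <linux/module.h>
	#include <linux/device.h>

	static ssize_t state_show(struct device *dev,
				  struct device_attribute *attr, char *buf)
	{
		/* hypothetical: report the state of the session bound to dev */
		return sprintf(buf, "connected\n");
	}
	static DEVICE_ATTR_RO(state);

	static struct class *ibtrs_clt_class;

	static int ibtrs_clt_sysfs_example(void)
	{
		struct device *dev;

		ibtrs_clt_class = class_create(THIS_MODULE, "ibtrs_client");
		if (IS_ERR(ibtrs_clt_class))
			return PTR_ERR(ibtrs_clt_class);

		/* one device per session, read-only state lives in sysfs */
		dev = device_create(ibtrs_clt_class, NULL, MKDEV(0, 0), NULL,
				    "ibtrs_clt_sess0");
		if (IS_ERR(dev)) {
			class_destroy(ibtrs_clt_class);
			return PTR_ERR(dev);
		}
		return device_create_file(dev, &dev_attr_state);
	}

Configuration knobs (adding a path, mapping a device) would then live under
configfs, similar to what the NVMeOF target does.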

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 13:38       ` Danil Kipnis
@ 2018-02-05 14:17         ` Sagi Grimberg
  2018-02-05 16:40           ` Danil Kipnis
  0 siblings, 1 reply; 79+ messages in thread
From: Sagi Grimberg @ 2018-02-05 14:17 UTC (permalink / raw)
  To: Danil Kipnis
  Cc: Jinpu Wang, Bart Van Assche, roman.penyaev, linux-block,
	linux-rdma, hch, ogerlitz, axboe


>>> Hi Bart,
>>>
>>> Another two cents from me :)
>>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com>
>>> wrote:
>>>>
>>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>>>
>>>>> o Simple configuration of IBNBD:
>>>>>      - Server side is completely passive: volumes do not need to be
>>>>>        explicitly exported.
>>>>
>>>>
>>>> That sounds like a security hole? I think the ability to configure
>>>> whether or
>>>> not an initiator is allowed to log in is essential and also which volumes
>>>> an
>>>> initiator has access to.
>>>
>>> Our design targets a well controlled production environment, so
>>> security is handled in another layer.
>>
>>
>> What will happen to a new adopter of the code you are contributing?
> 
> Hi Sagi, Hi Bart,
> thanks for your feedback.
> We considered the "storage cluster" setup, where each ibnbd client has
> access to each ibnbd server. Each ibnbd server manages devices under
> his "dev_search_path" and can provide access to them to any ibnbd
> client in the network.

I don't understand how that helps?

> On top of that, the ibnbd server has an additional
> "artificial" restriction that a device can be mapped in writable mode
> by only one client at a time.

I think one would still need the option to disallow readable export as
well.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/24] ibtrs: client: main functionality
  2018-02-05 11:19   ` Sagi Grimberg
@ 2018-02-05 14:19     ` Roman Penyaev
  2018-02-05 16:24       ` Bart Van Assche
  0 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-05 14:19 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

Hi Sagi,

On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Roman,
>
>> +static inline void ibtrs_clt_state_lock(void)
>> +{
>> +       rcu_read_lock();
>> +}
>> +
>> +static inline void ibtrs_clt_state_unlock(void)
>> +{
>> +       rcu_read_unlock();
>> +}
>
>
> This looks rather pointless...

Yeah, old scraps.  At some point those were not just wrappers
around rcu.  Now rcu can be called explicitly, that is true.
Thanks.

>
>> +
>> +#define cmpxchg_min(var, new) ({                                       \
>> +       typeof(var) old;                                                \
>> +                                                                       \
>> +       do {                                                            \
>> +               old = var;                                              \
>> +               new = (!old ? new : min_t(typeof(var), old, new));      \
>> +       } while (cmpxchg(&var, old, new) != old);                       \
>> +})
>
>
> Why is this sort of thing local to your driver?

Good question :)  Most likely because personally I do not know
what a good generic place for this kind of stuff would be.

Probably I share the same feeling as the author of these lines
in nvme/host/rdma.c: put_unaligned_le24() :)
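
For reference, an untested variant of the same macro which reads @var through
READ_ONCE() and does not modify the caller's @new expression (the __old,
__cand and __new temporaries are local to the macro):

	#define cmpxchg_min(var, new) ({					\
		typeof(var) __old, __cand, __new = (new);			\
		do {								\
			__old = READ_ONCE(var);					\
			__cand = !__old ? __new					\
					: min_t(typeof(var), __old, __new);	\
		} while (cmpxchg(&(var), __old, __cand) != __old);		\
	})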

>> +/**
>> + * struct ibtrs_fr_pool - pool of fast registration descriptors
>> + *
>> + * An entry is available for allocation if and only if it occurs in
>> @free_list.
>> + *
>> + * @size:      Number of descriptors in this pool.
>> + * @max_page_list_len: Maximum fast registration work request page list
>> length.
>> + * @lock:      Protects free_list.
>> + * @free_list: List of free descriptors.
>> + * @desc:      Fast registration descriptor pool.
>> + */
>> +struct ibtrs_fr_pool {
>> +       int                     size;
>> +       int                     max_page_list_len;
>> +       spinlock_t              lock; /* protects free_list */
>> +       struct list_head        free_list;
>> +       struct ibtrs_fr_desc    desc[0];
>> +};
>
>
> We already have a per-qp fr list implementation, any specific reason to
> implement it again?

No, fr is a part of the code which we are not using; fmr is faster
in our setup.  So we will need to iterate on fr mode again, thanks.


>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> +       size_t max_depth = clt->queue_depth;
>> +       struct ibtrs_tag *tag;
>> +       int cpu, bit;
>> +
>> +       cpu = get_cpu();
>> +       do {
>> +               bit = find_first_zero_bit(clt->tags_map, max_depth);
>> +               if (unlikely(bit >= max_depth)) {
>> +                       put_cpu();
>> +                       return NULL;
>> +               }
>> +
>> +       } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> +       put_cpu();
>> +
>> +       tag = GET_TAG(clt, bit);
>> +       WARN_ON(tag->mem_id != bit);
>> +       tag->cpu_id = cpu;
>> +       tag->con_type = con_type;
>> +
>> +       return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +                                  struct ibtrs_tag *tag)
>> +{
>> +       clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> +                                   enum ibtrs_clt_con_type con_type,
>> +                                   int can_wait)
>> +{
>> +       struct ibtrs_tag *tag;
>> +       DEFINE_WAIT(wait);
>> +
>> +       tag = __ibtrs_get_tag(clt, con_type);
>> +       if (likely(tag) || !can_wait)
>> +               return tag;
>> +
>> +       do {
>> +               prepare_to_wait(&clt->tags_wait, &wait,
>> TASK_UNINTERRUPTIBLE);
>> +               tag = __ibtrs_get_tag(clt, con_type);
>> +               if (likely(tag))
>> +                       break;
>> +
>> +               io_schedule();
>> +       } while (1);
>> +
>> +       finish_wait(&clt->tags_wait, &wait);
>> +
>> +       return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> +       if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> +               return;
>> +
>> +       __ibtrs_put_tag(clt, tag);
>> +
>> +       /*
>> +        * Putting a tag is a barrier, so we will observe
>> +        * new entry in the wait list, no worries.
>> +        */
>> +       if (waitqueue_active(&clt->tags_wait))
>> +               wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
>
> Again, it is not clear why the tags are needed...

We have two separate entities: a block device (IBNBD) and a transport
library (IBTRS).  Many block devices share the same IBTRS session
with a fixed-size queue depth.  Tags are a part of IBTRS, so with an
allocated tag you get a free buffer slot where you can read/write, and once
you've allocated a tag you won't sleep on the IO path inside the library.
A tag also helps a lot on IO fail-over to another connection (the multipath
implementation, which is also a part of the transport library, not of the
block device), where you simply reuse the same buffer slot (with the tag
in your hands) while forwarding IO to another RDMA connection.

>> +/**
>> + * ibtrs_destroy_fr_pool() - free the resources owned by a pool
>> + * @pool: Fast registration pool to be destroyed.
>> + */
>> +static void ibtrs_destroy_fr_pool(struct ibtrs_fr_pool *pool)
>> +{
>> +       struct ibtrs_fr_desc *d;
>> +       int i, err;
>> +
>> +       if (!pool)
>> +               return;
>> +
>> +       for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
>> +               if (d->mr) {
>> +                       err = ib_dereg_mr(d->mr);
>> +                       if (err)
>> +                               pr_err("Failed to deregister memory
>> region,"
>> +                                      " err: %d\n", err);
>> +               }
>> +       }
>> +       kfree(pool);
>> +}
>> +
>> +/**
>> + * ibtrs_create_fr_pool() - allocate and initialize a pool for fast
>> registration
>> + * @device:            IB device to allocate fast registration
>> descriptors for.
>> + * @pd:                Protection domain associated with the FR
>> descriptors.
>> + * @pool_size:         Number of descriptors to allocate.
>> + * @max_page_list_len: Maximum fast registration work request page list
>> length.
>> + */
>> +static struct ibtrs_fr_pool *ibtrs_create_fr_pool(struct ib_device
>> *device,
>> +                                                 struct ib_pd *pd,
>> +                                                 int pool_size,
>> +                                                 int max_page_list_len)
>> +{
>> +       struct ibtrs_fr_pool *pool;
>> +       struct ibtrs_fr_desc *d;
>> +       struct ib_mr *mr;
>> +       int i, ret;
>> +
>> +       if (pool_size <= 0) {
>> +               pr_warn("Creating fr pool failed, invalid pool size %d\n",
>> +                       pool_size);
>> +               ret = -EINVAL;
>> +               goto err;
>> +       }
>> +
>> +       pool = kzalloc(sizeof(*pool) + pool_size * sizeof(*d),
>> GFP_KERNEL);
>> +       if (!pool) {
>> +               ret = -ENOMEM;
>> +               goto err;
>> +       }
>> +
>> +       pool->size = pool_size;
>> +       pool->max_page_list_len = max_page_list_len;
>> +       spin_lock_init(&pool->lock);
>> +       INIT_LIST_HEAD(&pool->free_list);
>> +
>> +       for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
>> +               mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
>> max_page_list_len);
>> +               if (IS_ERR(mr)) {
>> +                       pr_warn("Failed to allocate fast region
>> memory\n");
>> +                       ret = PTR_ERR(mr);
>> +                       goto destroy_pool;
>> +               }
>> +               d->mr = mr;
>> +               list_add_tail(&d->entry, &pool->free_list);
>> +       }
>> +
>> +       return pool;
>> +
>> +destroy_pool:
>> +       ibtrs_destroy_fr_pool(pool);
>> +err:
>> +       return ERR_PTR(ret);
>> +}
>> +
>> +/**
>> + * ibtrs_fr_pool_get() - obtain a descriptor suitable for fast
>> registration
>> + * @pool: Pool to obtain descriptor from.
>> + */
>> +static struct ibtrs_fr_desc *ibtrs_fr_pool_get(struct ibtrs_fr_pool
>> *pool)
>> +{
>> +       struct ibtrs_fr_desc *d = NULL;
>> +
>> +       spin_lock_bh(&pool->lock);
>> +       if (!list_empty(&pool->free_list)) {
>> +               d = list_first_entry(&pool->free_list, typeof(*d), entry);
>> +               list_del(&d->entry);
>> +       }
>> +       spin_unlock_bh(&pool->lock);
>> +
>> +       return d;
>> +}
>> +
>> +/**
>> + * ibtrs_fr_pool_put() - put an FR descriptor back in the free list
>> + * @pool: Pool the descriptor was allocated from.
>> + * @desc: Pointer to an array of fast registration descriptor pointers.
>> + * @n:    Number of descriptors to put back.
>> + *
>> + * Note: The caller must already have queued an invalidation request for
>> + * desc->mr->rkey before calling this function.
>> + */
>> +static void ibtrs_fr_pool_put(struct ibtrs_fr_pool *pool,
>> +                             struct ibtrs_fr_desc **desc, int n)
>> +{
>> +       int i;
>> +
>> +       spin_lock_bh(&pool->lock);
>> +       for (i = 0; i < n; i++)
>> +               list_add(&desc[i]->entry, &pool->free_list);
>> +       spin_unlock_bh(&pool->lock);
>> +}
>> +
>> +static void ibtrs_map_desc(struct ibtrs_map_state *state, dma_addr_t
>> dma_addr,
>> +                          u32 dma_len, u32 rkey, u32 max_desc)
>> +{
>> +       struct ibtrs_sg_desc *desc = state->desc;
>> +
>> +       pr_debug("dma_addr %llu, key %u, dma_len %u\n",
>> +                dma_addr, rkey, dma_len);
>> +       desc->addr = cpu_to_le64(dma_addr);
>> +       desc->key  = cpu_to_le32(rkey);
>> +       desc->len  = cpu_to_le32(dma_len);
>> +
>> +       state->total_len += dma_len;
>> +       if (state->ndesc < max_desc) {
>> +               state->desc++;
>> +               state->ndesc++;
>> +       } else {
>> +               state->ndesc = INT_MIN;
>> +               pr_err("Could not fit S/G list into buffer descriptor
>> %d.\n",
>> +                      max_desc);
>> +       }
>> +}
>> +
>> +static int ibtrs_map_finish_fmr(struct ibtrs_map_state *state,
>> +                               struct ibtrs_clt_con *con)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       struct ib_pool_fmr *fmr;
>> +       dma_addr_t dma_addr;
>> +       u64 io_addr = 0;
>> +
>> +       fmr = ib_fmr_pool_map_phys(sess->fmr_pool, state->pages,
>> +                                  state->npages, io_addr);
>> +       if (IS_ERR(fmr)) {
>> +               ibtrs_wrn_rl(sess, "Failed to map FMR from FMR pool, "
>> +                            "err: %ld\n", PTR_ERR(fmr));
>> +               return PTR_ERR(fmr);
>> +       }
>> +
>> +       *state->next_fmr++ = fmr;
>> +       state->nmdesc++;
>> +       dma_addr = state->base_dma_addr & ~sess->mr_page_mask;
>> +       pr_debug("ndesc = %d, nmdesc = %d, npages = %d\n",
>> +                state->ndesc, state->nmdesc, state->npages);
>> +       if (state->dir == DMA_TO_DEVICE)
>> +               ibtrs_map_desc(state, dma_addr, state->dma_len,
>> fmr->fmr->lkey,
>> +                              sess->max_desc);
>> +       else
>> +               ibtrs_map_desc(state, dma_addr, state->dma_len,
>> fmr->fmr->rkey,
>> +                              sess->max_desc);
>> +
>> +       return 0;
>> +}
>> +
>> +static void ibtrs_clt_fast_reg_done(struct ib_cq *cq, struct ib_wc *wc)
>> +{
>> +       struct ibtrs_clt_con *con = cq->cq_context;
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +
>> +       if (unlikely(wc->status != IB_WC_SUCCESS)) {
>> +               ibtrs_err(sess, "Failed IB_WR_REG_MR: %s\n",
>> +                         ib_wc_status_msg(wc->status));
>> +               ibtrs_rdma_error_recovery(con);
>> +       }
>> +}
>> +
>> +static struct ib_cqe fast_reg_cqe = {
>> +       .done = ibtrs_clt_fast_reg_done
>> +};
>> +
>> +/* TODO */
>> +static int ibtrs_map_finish_fr(struct ibtrs_map_state *state,
>> +                              struct ibtrs_clt_con *con, int sg_cnt,
>> +                              unsigned int *sg_offset_p)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       struct ibtrs_fr_desc *desc;
>> +       struct ib_send_wr *bad_wr;
>> +       struct ib_reg_wr wr;
>> +       struct ib_pd *pd;
>> +       u32 rkey;
>> +       int n;
>> +
>> +       pd = sess->s.ib_dev->pd;
>> +       if (sg_cnt == 1 && (pd->flags & IB_PD_UNSAFE_GLOBAL_RKEY)) {
>> +               unsigned int sg_offset = sg_offset_p ? *sg_offset_p : 0;
>> +
>> +               ibtrs_map_desc(state, sg_dma_address(state->sg) +
>> sg_offset,
>> +                              sg_dma_len(state->sg) - sg_offset,
>> +                              pd->unsafe_global_rkey, sess->max_desc);
>> +               if (sg_offset_p)
>> +                       *sg_offset_p = 0;
>> +               return 1;
>> +       }
>> +
>> +       desc = ibtrs_fr_pool_get(con->fr_pool);
>> +       if (!desc) {
>> +               ibtrs_wrn_rl(sess, "Failed to get descriptor from FR
>> pool\n");
>> +               return -ENOMEM;
>> +       }
>> +
>> +       rkey = ib_inc_rkey(desc->mr->rkey);
>> +       ib_update_fast_reg_key(desc->mr, rkey);
>> +
>> +       memset(&wr, 0, sizeof(wr));
>> +       n = ib_map_mr_sg(desc->mr, state->sg, sg_cnt, sg_offset_p,
>> +                        sess->mr_page_size);
>> +       if (unlikely(n < 0)) {
>> +               ibtrs_fr_pool_put(con->fr_pool, &desc, 1);
>> +               return n;
>> +       }
>> +
>> +       wr.wr.next = NULL;
>> +       wr.wr.opcode = IB_WR_REG_MR;
>> +       wr.wr.wr_cqe = &fast_reg_cqe;
>> +       wr.wr.num_sge = 0;
>> +       wr.wr.send_flags = 0;
>> +       wr.mr = desc->mr;
>> +       wr.key = desc->mr->rkey;
>> +       wr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE);
>
>
> Do you actually ever have remote write access in your protocol?

We do not have RDMA reads; instead, the client writes on a write and the
server writes on a read (a write-only storage solution :).

>
>> +static void ibtrs_clt_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
>> +{
>> +       struct ibtrs_clt_con *con = cq->cq_context;
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +
>> +       if (unlikely(wc->status != IB_WC_SUCCESS)) {
>> +               ibtrs_err(sess, "Failed IB_WR_LOCAL_INV: %s\n",
>> +                         ib_wc_status_msg(wc->status));
>> +               ibtrs_rdma_error_recovery(con);
>> +       }
>> +}
>> +
>> +static struct ib_cqe local_inv_cqe = {
>> +       .done = ibtrs_clt_inv_rkey_done
>> +};
>> +
>> +static int ibtrs_inv_rkey(struct ibtrs_clt_con *con, u32 rkey)
>> +{
>> +       struct ib_send_wr *bad_wr;
>> +       struct ib_send_wr wr = {
>> +               .opcode             = IB_WR_LOCAL_INV,
>> +               .wr_cqe             = &local_inv_cqe,
>> +               .next               = NULL,
>> +               .num_sge            = 0,
>> +               .send_flags         = 0,
>> +               .ex.invalidate_rkey = rkey,
>> +       };
>> +
>> +       return ib_post_send(con->c.qp, &wr, &bad_wr);
>> +}
>
>
> Is it safe to not signal the local invalidate? A recent report
> suggested that this is not safe in the presence of ack drops.

For our setup we use fmr, so frankly I do not follow any fr discussions.
Could you please provide the link?

>> +static int ibtrs_post_send_rdma(struct ibtrs_clt_con *con,
>> +                               struct ibtrs_clt_io_req *req,
>> +                               u64 addr, u32 off, u32 imm)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       enum ib_send_flags flags;
>> +       struct ib_sge list[1];
>> +
>> +       if (unlikely(!req->sg_size)) {
>> +               ibtrs_wrn(sess, "Doing RDMA Write failed, no data
>> supplied\n");
>> +               return -EINVAL;
>> +       }
>> +
>> +       /* user data and user message in the first list element */
>> +       list[0].addr   = req->iu->dma_addr;
>> +       list[0].length = req->sg_size;
>> +       list[0].lkey   = sess->s.ib_dev->lkey;
>> +
>> +       /*
>> +        * From time to time we have to post signalled sends,
>> +        * or send queue will fill up and only QP reset can help.
>> +        */
>> +       flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
>> +                       0 : IB_SEND_SIGNALED;
>> +       return ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list, 1,
>> +                                           sess->srv_rdma_buf_rkey,
>> +                                           addr + off, imm, flags);
>> +}
>> +
>> +static void ibtrs_set_sge_with_desc(struct ib_sge *list,
>> +                                   struct ibtrs_sg_desc *desc)
>> +{
>> +       list->addr   = le64_to_cpu(desc->addr);
>> +       list->length = le32_to_cpu(desc->len);
>> +       list->lkey   = le32_to_cpu(desc->key);
>> +       pr_debug("dma_addr %llu, key %u, dma_len %u\n",
>> +                list->addr, list->lkey, list->length);
>> +}
>> +
>> +static void ibtrs_set_rdma_desc_last(struct ibtrs_clt_con *con,
>> +                                    struct ib_sge *list,
>> +                                    struct ibtrs_clt_io_req *req,
>> +                                    struct ib_rdma_wr *wr, int offset,
>> +                                    struct ibtrs_sg_desc *desc, int m,
>> +                                    int n, u64 addr, u32 size, u32 imm)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       enum ib_send_flags flags;
>> +       int i;
>> +
>> +       for (i = m; i < n; i++, desc++)
>> +               ibtrs_set_sge_with_desc(&list[i], desc);
>> +
>> +       list[i].addr   = req->iu->dma_addr;
>> +       list[i].length = size;
>> +       list[i].lkey   = sess->s.ib_dev->lkey;
>> +
>> +       wr->wr.wr_cqe = &req->iu->cqe;
>> +       wr->wr.sg_list = &list[m];
>> +       wr->wr.num_sge = n - m + 1;
>> +       wr->remote_addr = addr + offset;
>> +       wr->rkey = sess->srv_rdma_buf_rkey;
>> +
>> +       /*
>> +        * From time to time we have to post signalled sends,
>> +        * or send queue will fill up and only QP reset can help.
>> +        */
>> +       flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
>> +                       0 : IB_SEND_SIGNALED;
>> +
>> +       wr->wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
>> +       wr->wr.send_flags  = flags;
>> +       wr->wr.ex.imm_data = cpu_to_be32(imm);
>> +}
>> +
>> +static int ibtrs_post_send_rdma_desc_more(struct ibtrs_clt_con *con,
>> +                                         struct ib_sge *list,
>> +                                         struct ibtrs_clt_io_req *req,
>> +                                         struct ibtrs_sg_desc *desc, int
>> n,
>> +                                         u64 addr, u32 size, u32 imm)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       size_t max_sge, num_sge, num_wr;
>> +       struct ib_send_wr *bad_wr;
>> +       struct ib_rdma_wr *wrs, *wr;
>> +       int j = 0, k, offset = 0, len = 0;
>> +       int m = 0;
>> +       int ret;
>> +
>> +       max_sge = sess->max_sge;
>> +       num_sge = 1 + n;
>> +       num_wr = DIV_ROUND_UP(num_sge, max_sge);
>> +
>> +       wrs = kcalloc(num_wr, sizeof(*wrs), GFP_ATOMIC);
>> +       if (!wrs)
>> +               return -ENOMEM;
>> +
>> +       if (num_wr == 1)
>> +               goto last_one;
>> +
>> +       for (; j < num_wr; j++) {
>> +               wr = &wrs[j];
>> +               for (k = 0; k < max_sge; k++, desc++) {
>> +                       m = k + j * max_sge;
>> +                       ibtrs_set_sge_with_desc(&list[m], desc);
>> +                       len += le32_to_cpu(desc->len);
>> +               }
>> +               wr->wr.wr_cqe = &req->iu->cqe;
>> +               wr->wr.sg_list = &list[m];
>> +               wr->wr.num_sge = max_sge;
>> +               wr->remote_addr = addr + offset;
>> +               wr->rkey = sess->srv_rdma_buf_rkey;
>> +
>> +               offset += len;
>> +               wr->wr.next = &wrs[j + 1].wr;
>> +               wr->wr.opcode = IB_WR_RDMA_WRITE;
>> +       }
>> +
>> +last_one:
>> +       wr = &wrs[j];
>> +
>> +       ibtrs_set_rdma_desc_last(con, list, req, wr, offset,
>> +                                desc, m, n, addr, size, imm);
>> +
>> +       ret = ib_post_send(con->c.qp, &wrs[0].wr, &bad_wr);
>> +       if (unlikely(ret))
>> +               ibtrs_err(sess, "Posting write request to QP failed,"
>> +                         " err: %d\n", ret);
>> +       kfree(wrs);
>> +       return ret;
>> +}
>> +
>> +static int ibtrs_post_send_rdma_desc(struct ibtrs_clt_con *con,
>> +                                    struct ibtrs_clt_io_req *req,
>> +                                    struct ibtrs_sg_desc *desc, int n,
>> +                                    u64 addr, u32 size, u32 imm)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       enum ib_send_flags flags;
>> +       struct ib_sge *list;
>> +       size_t num_sge;
>> +       int ret, i;
>> +
>> +       num_sge = 1 + n;
>> +       list = kmalloc_array(num_sge, sizeof(*list), GFP_ATOMIC);
>> +       if (!list)
>> +               return -ENOMEM;
>> +
>> +       if (num_sge < sess->max_sge) {
>> +               for (i = 0; i < n; i++, desc++)
>> +                       ibtrs_set_sge_with_desc(&list[i], desc);
>> +               list[i].addr   = req->iu->dma_addr;
>> +               list[i].length = size;
>> +               list[i].lkey   = sess->s.ib_dev->lkey;
>> +
>> +               /*
>> +                * From time to time we have to post signalled sends,
>> +                * or send queue will fill up and only QP reset can help.
>> +                */
>> +               flags = atomic_inc_return(&con->io_cnt) %
>> sess->queue_depth ?
>> +                               0 : IB_SEND_SIGNALED;
>> +               ret = ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list,
>> +                                                  num_sge,
>> +
>> sess->srv_rdma_buf_rkey,
>> +                                                  addr, imm, flags);
>> +       } else {
>> +               ret = ibtrs_post_send_rdma_desc_more(con, list, req, desc,
>> n,
>> +                                                    addr, size, imm);
>> +       }
>> +
>> +       kfree(list);
>> +       return ret;
>> +}
>> +
>> +static int ibtrs_post_send_rdma_more(struct ibtrs_clt_con *con,
>> +                                    struct ibtrs_clt_io_req *req,
>> +                                    u64 addr, u32 size, u32 imm)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       struct ib_device *ibdev = sess->s.ib_dev->dev;
>> +       enum ib_send_flags flags;
>> +       struct scatterlist *sg;
>> +       struct ib_sge *list;
>> +       size_t num_sge;
>> +       int i, ret;
>> +
>> +       num_sge = 1 + req->sg_cnt;
>> +       list = kmalloc_array(num_sge, sizeof(*list), GFP_ATOMIC);
>> +       if (!list)
>> +               return -ENOMEM;
>> +
>> +       for_each_sg(req->sglist, sg, req->sg_cnt, i) {
>> +               list[i].addr   = ib_sg_dma_address(ibdev, sg);
>> +               list[i].length = ib_sg_dma_len(ibdev, sg);
>> +               list[i].lkey   = sess->s.ib_dev->lkey;
>> +       }
>> +       list[i].addr   = req->iu->dma_addr;
>> +       list[i].length = size;
>> +       list[i].lkey   = sess->s.ib_dev->lkey;
>> +
>> +       /*
>> +        * From time to time we have to post signalled sends,
>> +        * or send queue will fill up and only QP reset can help.
>> +        */
>> +       flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
>> +                       0 : IB_SEND_SIGNALED;
>> +       ret = ibtrs_iu_post_rdma_write_imm(&con->c, req->iu, list,
>> num_sge,
>> +                                          sess->srv_rdma_buf_rkey,
>> +                                          addr, imm, flags);
>> +       kfree(list);
>> +
>> +       return ret;
>> +}
>
>
> All these rdma helpers look like they could be reused from the rdma rw
> API if it was enhanced with immediate capabilities.

True.

>> +static inline unsigned long ibtrs_clt_get_raw_ms(void)
>> +{
>> +       struct timespec ts;
>> +
>> +       getrawmonotonic(&ts);
>> +
>> +       return timespec_to_ns(&ts) / NSEC_PER_MSEC;
>> +}
>
>
> Why is this local to your driver?
>
>> +
>> +static void complete_rdma_req(struct ibtrs_clt_io_req *req,
>> +                             int errno, bool notify)
>> +{
>> +       struct ibtrs_clt_con *con = req->con;
>> +       struct ibtrs_clt_sess *sess;
>> +       enum dma_data_direction dir;
>> +       struct ibtrs_clt *clt;
>> +       void *priv;
>> +
>> +       if (WARN_ON(!req->in_use))
>> +               return;
>> +       if (WARN_ON(!req->con))
>> +               return;
>> +       sess = to_clt_sess(con->c.sess);
>> +       clt = sess->clt;
>> +
>> +       if (req->sg_cnt > fmr_sg_cnt)
>> +               ibtrs_unmap_fast_reg_data(req->con, req);
>> +       if (req->sg_cnt)
>> +               ib_dma_unmap_sg(sess->s.ib_dev->dev, req->sglist,
>> +                               req->sg_cnt, req->dir);
>> +       if (sess->stats.enable_rdma_lat)
>> +               ibtrs_clt_update_rdma_lat(&sess->stats,
>> +                                         req->dir == DMA_FROM_DEVICE,
>> +                                         ibtrs_clt_get_raw_ms() -
>> +                                         req->start_time);
>> +       ibtrs_clt_decrease_inflight(&sess->stats);
>> +
>> +       req->in_use = false;
>> +       req->con = NULL;
>> +       priv = req->priv;
>> +       dir = req->dir;
>> +
>> +       if (notify)
>> +               req->conf(priv, errno);
>> +}
>
>
>
>
>> +
>> +static void process_io_rsp(struct ibtrs_clt_sess *sess, u32 msg_id, s16
>> errno)
>> +{
>> +       if (WARN_ON(msg_id >= sess->queue_depth))
>> +               return;
>> +
>> +       complete_rdma_req(&sess->reqs[msg_id], errno, true);
>> +}
>> +
>> +static struct ib_cqe io_comp_cqe = {
>> +       .done = ibtrs_clt_rdma_done
>> +};
>> +
>> +static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
>> +{
>> +       struct ibtrs_clt_con *con = cq->cq_context;
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       u32 imm_type, imm_payload;
>> +       int err;
>> +
>> +       if (unlikely(wc->status != IB_WC_SUCCESS)) {
>> +               if (wc->status != IB_WC_WR_FLUSH_ERR) {
>> +                       ibtrs_err(sess, "RDMA failed: %s\n",
>> +                                 ib_wc_status_msg(wc->status));
>> +                       ibtrs_rdma_error_recovery(con);
>> +               }
>> +               return;
>> +       }
>> +       ibtrs_clt_update_wc_stats(con);
>> +
>> +       switch (wc->opcode) {
>> +       case IB_WC_RDMA_WRITE:
>> +               /*
>> +                * post_send() RDMA write completions of IO reqs
>> (read/write)
>> +                * and hb
>> +                */
>> +               break;
>> +       case IB_WC_RECV_RDMA_WITH_IMM:
>> +               /*
>> +                * post_recv() RDMA write completions of IO reqs
>> (read/write)
>> +                * and hb
>> +                */
>> +               if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
>> +                       return;
>> +               err = ibtrs_post_recv_empty(&con->c, &io_comp_cqe);
>> +               if (unlikely(err)) {
>> +                       ibtrs_err(sess, "ibtrs_post_recv_empty(): %d\n",
>> err);
>> +                       ibtrs_rdma_error_recovery(con);
>> +                       break;
>> +               }
>> +               ibtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
>> +                              &imm_type, &imm_payload);
>> +               if (likely(imm_type == IBTRS_IO_RSP_IMM)) {
>> +                       u32 msg_id;
>> +
>> +                       ibtrs_from_io_rsp_imm(imm_payload, &msg_id, &err);
>> +                       process_io_rsp(sess, msg_id, err);
>> +               } else if (imm_type == IBTRS_HB_MSG_IMM) {
>> +                       WARN_ON(con->c.cid);
>> +                       ibtrs_send_hb_ack(&sess->s);
>> +               } else if (imm_type == IBTRS_HB_ACK_IMM) {
>> +                       WARN_ON(con->c.cid);
>> +                       sess->s.hb_missed_cnt = 0;
>> +               } else {
>> +                       ibtrs_wrn(sess, "Unknown IMM type %u\n",
>> imm_type);
>> +               }
>> +               break;
>> +       default:
>> +               ibtrs_wrn(sess, "Unexpected WC type: %s\n",
>> +                         ib_wc_opcode_str(wc->opcode));
>> +               return;
>> +       }
>
>
> Is there a spec somewhere with the protocol information that explains
> how this all works?

Not yet.  The transfer procedure is described in the Vault presentation.
Is a README a good place for such stuff?  I mean some low-level
protocol spec.

>> +struct path_it {
>> +       int i;
>> +       struct list_head skip_list;
>> +       struct ibtrs_clt *clt;
>> +       struct ibtrs_clt_sess *(*next_path)(struct path_it *);
>> +};
>> +
>> +#define do_each_path(path, clt, it) {                                  \
>> +       path_it_init(it, clt);                                          \
>> +       ibtrs_clt_state_lock();                                         \
>> +       for ((it)->i = 0; ((path) = ((it)->next_path)(it)) &&           \
>> +                         (it)->i < (it)->clt->paths_num;               \
>> +            (it)->i++)
>> +
>> +#define while_each_path(it)                                            \
>> +       path_it_deinit(it);                                             \
>> +       ibtrs_clt_state_unlock();                                       \
>> +       }
>> +
>> +/**
>> + * get_next_path_rr() - Returns path in round-robin fashion.
>> + *
>> + * Related to @MP_POLICY_RR
>> + *
>> + * Locks:
>> + *    ibtrs_clt_state_lock() must be hold.
>> + */
>> +static struct ibtrs_clt_sess *get_next_path_rr(struct path_it *it)
>> +{
>> +       struct ibtrs_clt_sess __percpu * __rcu *ppcpu_path, *path;
>> +       struct ibtrs_clt *clt = it->clt;
>> +
>> +       ppcpu_path = this_cpu_ptr(clt->pcpu_path);
>> +       path = rcu_dereference(*ppcpu_path);
>> +       if (unlikely(!path))
>> +               path = list_first_or_null_rcu(&clt->paths_list,
>> +                                             typeof(*path), s.entry);
>> +       else
>> +               path = list_next_or_null_rcu_rr(path, &clt->paths_list,
>> +                                               s.entry);
>> +       rcu_assign_pointer(*ppcpu_path, path);
>> +
>> +       return path;
>> +}
>> +
>> +/**
>> + * get_next_path_min_inflight() - Returns path with minimal inflight count.
>> + *
>> + * Related to @MP_POLICY_MIN_INFLIGHT
>> + *
>> + * Locks:
>> + *    ibtrs_clt_state_lock() must be hold.
>> + */
>> +static struct ibtrs_clt_sess *get_next_path_min_inflight(struct path_it *it)
>> +{
>> +       struct ibtrs_clt_sess *min_path = NULL;
>> +       struct ibtrs_clt *clt = it->clt;
>> +       struct ibtrs_clt_sess *sess;
>> +       int min_inflight = INT_MAX;
>> +       int inflight;
>> +
>> +       list_for_each_entry_rcu(sess, &clt->paths_list, s.entry) {
>> +               if (unlikely(!list_empty(raw_cpu_ptr(sess->mp_skip_entry))))
>> +                       continue;
>> +
>> +               inflight = atomic_read(&sess->stats.inflight);
>> +
>> +               if (inflight < min_inflight) {
>> +                       min_inflight = inflight;
>> +                       min_path = sess;
>> +               }
>> +       }
>> +
>> +       /*
>> +        * add the path to the skip list, so that next time we can get
>> +        * a different one
>> +        */
>> +       if (min_path)
>> +               list_add(raw_cpu_ptr(min_path->mp_skip_entry), &it->skip_list);
>> +
>> +       return min_path;
>> +}
>> +
>> +static inline void path_it_init(struct path_it *it, struct ibtrs_clt *clt)
>> +{
>> +       INIT_LIST_HEAD(&it->skip_list);
>> +       it->clt = clt;
>> +       it->i = 0;
>> +
>> +       if (clt->mp_policy == MP_POLICY_RR)
>> +               it->next_path = get_next_path_rr;
>> +       else
>> +               it->next_path = get_next_path_min_inflight;
>> +}
>> +
>> +static inline void path_it_deinit(struct path_it *it)
>> +{
>> +       struct list_head *skip, *tmp;
>> +       /*
>> +        * The skip_list is used only for the MIN_INFLIGHT policy.
>> +        * We need to remove paths from it, so that next IO can insert
>> +        * paths (->mp_skip_entry) into a skip_list again.
>> +        */
>> +       list_for_each_safe(skip, tmp, &it->skip_list)
>> +               list_del_init(skip);
>> +}
>> +
>> +static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
>> +                                     struct ibtrs_clt_sess *sess,
>> +                                     ibtrs_conf_fn *conf,
>> +                                     struct ibtrs_tag *tag, void *priv,
>> +                                     const struct kvec *vec, size_t usr_len,
>> +                                     struct scatterlist *sg, size_t sg_cnt,
>> +                                     size_t data_len, int dir)
>> +{
>> +       req->tag = tag;
>> +       req->in_use = true;
>> +       req->usr_len = usr_len;
>> +       req->data_len = data_len;
>> +       req->sglist = sg;
>> +       req->sg_cnt = sg_cnt;
>> +       req->priv = priv;
>> +       req->dir = dir;
>> +       req->con = ibtrs_tag_to_clt_con(sess, tag);
>> +       req->conf = conf;
>> +       copy_from_kvec(req->iu->buf, vec, usr_len);
>> +       if (sess->stats.enable_rdma_lat)
>> +               req->start_time = ibtrs_clt_get_raw_ms();
>> +}
>> +
>> +static inline struct ibtrs_clt_io_req *
>> +ibtrs_clt_get_req(struct ibtrs_clt_sess *sess, ibtrs_conf_fn *conf,
>> +                 struct ibtrs_tag *tag, void *priv,
>> +                 const struct kvec *vec, size_t usr_len,
>> +                 struct scatterlist *sg, size_t sg_cnt,
>> +                 size_t data_len, int dir)
>> +{
>> +       struct ibtrs_clt_io_req *req;
>> +
>> +       req = &sess->reqs[tag->mem_id];
>> +       ibtrs_clt_init_req(req, sess, conf, tag, priv, vec, usr_len,
>> +                          sg, sg_cnt, data_len, dir);
>> +       return req;
>> +}
>> +
>> +static inline struct ibtrs_clt_io_req *
>> +ibtrs_clt_get_copy_req(struct ibtrs_clt_sess *alive_sess,
>> +                      struct ibtrs_clt_io_req *fail_req)
>> +{
>> +       struct ibtrs_clt_io_req *req;
>> +       struct kvec vec = {
>> +               .iov_base = fail_req->iu->buf,
>> +               .iov_len  = fail_req->usr_len
>> +       };
>> +
>> +       req = &alive_sess->reqs[fail_req->tag->mem_id];
>> +       ibtrs_clt_init_req(req, alive_sess, fail_req->conf, fail_req->tag,
>> +                          fail_req->priv, &vec, fail_req->usr_len,
>> +                          fail_req->sglist, fail_req->sg_cnt,
>> +                          fail_req->data_len, fail_req->dir);
>> +       return req;
>> +}
>> +
>> +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
>> +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);
>> +
>> +static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
>> +                                 struct ibtrs_clt_io_req *fail_req)
>> +{
>> +       struct ibtrs_clt_sess *alive_sess;
>> +       struct ibtrs_clt_io_req *req;
>> +       int err = -ECONNABORTED;
>> +       struct path_it it;
>> +
>> +       do_each_path(alive_sess, clt, &it) {
>> +               if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
>> +                       continue;
>> +               req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
>> +               if (req->dir == DMA_TO_DEVICE)
>> +                       err = ibtrs_clt_write_req(req);
>> +               else
>> +                       err = ibtrs_clt_read_req(req);
>> +               if (unlikely(err)) {
>> +                       req->in_use = false;
>> +                       continue;
>> +               }
>> +               /* Success path */
>> +               ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
>> +               break;
>> +       } while_each_path(&it);
>> +
>> +       return err;
>> +}
>> +
>> +static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess,
>> +                                     bool failover)
>> +{
>> +       struct ibtrs_clt *clt = sess->clt;
>> +       struct ibtrs_clt_io_req *req;
>> +       int i;
>> +
>> +       if (!sess->reqs)
>> +               return;
>> +       for (i = 0; i < sess->queue_depth; ++i) {
>> +               bool notify;
>> +               int err = 0;
>> +
>> +               req = &sess->reqs[i];
>> +               if (!req->in_use)
>> +                       continue;
>> +
>> +               if (failover)
>> +                       err = ibtrs_clt_failover_req(clt, req);
>> +
>> +               notify = (!failover || err);
>> +               complete_rdma_req(req, -ECONNABORTED, notify);
>> +       }
>> +}
>> +
>> +static void free_sess_reqs(struct ibtrs_clt_sess *sess)
>> +{
>> +       struct ibtrs_clt_io_req *req;
>> +       int i;
>> +
>> +       if (!sess->reqs)
>> +               return;
>> +       for (i = 0; i < sess->queue_depth; ++i) {
>> +               req = &sess->reqs[i];
>> +               if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR)
>> +                       kfree(req->fr_list);
>> +               else if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR)
>> +                       kfree(req->fmr_list);
>> +               kfree(req->map_page);
>> +               ibtrs_iu_free(req->iu, DMA_TO_DEVICE,
>> +                             sess->s.ib_dev->dev);
>> +       }
>> +       kfree(sess->reqs);
>> +       sess->reqs = NULL;
>> +}
>> +
>> +static int alloc_sess_reqs(struct ibtrs_clt_sess *sess)
>> +{
>> +       struct ibtrs_clt_io_req *req;
>> +       void *mr_list;
>> +       int i;
>> +
>> +       sess->reqs = kcalloc(sess->queue_depth, sizeof(*sess->reqs),
>> +                            GFP_KERNEL);
>> +       if (unlikely(!sess->reqs))
>> +               return -ENOMEM;
>> +
>> +       for (i = 0; i < sess->queue_depth; ++i) {
>> +               req = &sess->reqs[i];
>> +               req->iu = ibtrs_iu_alloc(i, sess->max_req_size, GFP_KERNEL,
>> +                                        sess->s.ib_dev->dev, DMA_TO_DEVICE,
>> +                                        ibtrs_clt_rdma_done);
>> +               if (unlikely(!req->iu))
>> +                       goto out;
>> +               mr_list = kmalloc_array(sess->max_pages_per_mr,
>> +                                       sizeof(void *), GFP_KERNEL);
>> +               if (unlikely(!mr_list))
>> +                       goto out;
>> +               if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR)
>> +                       req->fr_list = mr_list;
>> +               else if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR)
>> +                       req->fmr_list = mr_list;
>> +
>> +               req->map_page = kmalloc_array(sess->max_pages_per_mr,
>> +                                             sizeof(void *), GFP_KERNEL);
>> +               if (unlikely(!req->map_page))
>> +                       goto out;
>> +       }
>> +
>> +       return 0;
>> +
>> +out:
>> +       free_sess_reqs(sess);
>> +
>> +       return -ENOMEM;
>> +}
>> +
>> +static int alloc_tags(struct ibtrs_clt *clt)
>> +{
>> +       unsigned int chunk_bits;
>> +       int err, i;
>> +
>> +       clt->tags_map = kcalloc(BITS_TO_LONGS(clt->queue_depth), sizeof(long),
>> +                               GFP_KERNEL);
>> +       if (unlikely(!clt->tags_map)) {
>> +               err = -ENOMEM;
>> +               goto out_err;
>> +       }
>> +       clt->tags = kcalloc(clt->queue_depth, TAG_SIZE(clt), GFP_KERNEL);
>> +       if (unlikely(!clt->tags)) {
>> +               err = -ENOMEM;
>> +               goto err_map;
>> +       }
>> +       chunk_bits = ilog2(clt->queue_depth - 1) + 1;
>> +       for (i = 0; i < clt->queue_depth; i++) {
>> +               struct ibtrs_tag *tag;
>> +
>> +               tag = GET_TAG(clt, i);
>> +               tag->mem_id = i;
>> +               tag->mem_off = i << (MAX_IMM_PAYL_BITS - chunk_bits);
>> +       }
>> +
>> +       return 0;
>> +
>> +err_map:
>> +       kfree(clt->tags_map);
>> +       clt->tags_map = NULL;
>> +out_err:
>> +       return err;
>> +}
>> +
>> +static void free_tags(struct ibtrs_clt *clt)
>> +{
>> +       kfree(clt->tags_map);
>> +       clt->tags_map = NULL;
>> +       kfree(clt->tags);
>> +       clt->tags = NULL;
>> +}
>> +
>> +static void query_fast_reg_mode(struct ibtrs_clt_sess *sess)
>> +{
>> +       struct ibtrs_ib_dev *ib_dev;
>> +       u64 max_pages_per_mr;
>> +       int mr_page_shift;
>> +
>> +       ib_dev = sess->s.ib_dev;
>> +       if (ib_dev->dev->alloc_fmr && ib_dev->dev->dealloc_fmr &&
>> +           ib_dev->dev->map_phys_fmr && ib_dev->dev->unmap_fmr) {
>> +               sess->fast_reg_mode = IBTRS_FAST_MEM_FMR;
>> +               ibtrs_info(sess, "Device %s supports FMR\n", ib_dev->dev->name);
>> +       }
>> +       if (ib_dev->attrs.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS &&
>> +           use_fr) {
>> +               sess->fast_reg_mode = IBTRS_FAST_MEM_FR;
>> +               ibtrs_info(sess, "Device %s supports FR\n", ib_dev->dev->name);
>> +       }
>> +
>> +       /*
>> +        * Use the smallest page size supported by the HCA, down to a
>> +        * minimum of 4096 bytes. We're unlikely to build large sglists
>> +        * out of smaller entries.
>> +        */
>> +       mr_page_shift      = max(12, ffs(ib_dev->attrs.page_size_cap) - 1);
>> +       sess->mr_page_size = 1 << mr_page_shift;
>> +       sess->max_sge      = ib_dev->attrs.max_sge;
>> +       sess->mr_page_mask = ~((u64)sess->mr_page_size - 1);
>> +       max_pages_per_mr   = ib_dev->attrs.max_mr_size;
>> +       do_div(max_pages_per_mr, sess->mr_page_size);
>> +       sess->max_pages_per_mr = min_t(u64, sess->max_pages_per_mr,
>> +                                      max_pages_per_mr);
>> +       if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
>> +               sess->max_pages_per_mr =
>> +                       min_t(u32, sess->max_pages_per_mr,
>> +                             ib_dev->attrs.max_fast_reg_page_list_len);
>> +       }
>> +       sess->mr_max_size = sess->mr_page_size * sess->max_pages_per_mr;
>> +}
>> +
>> +static int alloc_con_fast_pool(struct ibtrs_clt_con *con)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       struct ibtrs_fr_pool *fr_pool;
>> +       int err = 0;
>> +
>> +       if (sess->fast_reg_mode == IBTRS_FAST_MEM_FR) {
>> +               fr_pool = ibtrs_create_fr_pool(sess->s.ib_dev->dev,
>> +                                              sess->s.ib_dev->pd,
>> +                                              sess->queue_depth,
>> +                                              sess->max_pages_per_mr);
>> +               if (unlikely(IS_ERR(fr_pool))) {
>> +                       err = PTR_ERR(fr_pool);
>> +                       ibtrs_err(sess, "FR pool allocation failed, err: %d\n",
>> +                                 err);
>> +                       return err;
>> +               }
>> +               con->fr_pool = fr_pool;
>> +       }
>> +
>> +       return err;
>> +}
>> +
>> +static void free_con_fast_pool(struct ibtrs_clt_con *con)
>> +{
>> +       if (con->fr_pool) {
>> +               ibtrs_destroy_fr_pool(con->fr_pool);
>> +               con->fr_pool = NULL;
>> +       }
>> +}
>> +
>> +static int alloc_sess_fast_pool(struct ibtrs_clt_sess *sess)
>> +{
>> +       struct ib_fmr_pool_param fmr_param;
>> +       struct ib_fmr_pool *fmr_pool;
>> +       int err = 0;
>> +
>> +       if (sess->fast_reg_mode == IBTRS_FAST_MEM_FMR) {
>> +               memset(&fmr_param, 0, sizeof(fmr_param));
>> +               fmr_param.pool_size         = sess->queue_depth *
>> +                                             sess->max_pages_per_mr;
>> +               fmr_param.dirty_watermark   = fmr_param.pool_size / 4;
>> +               fmr_param.cache             = 0;
>> +               fmr_param.max_pages_per_fmr = sess->max_pages_per_mr;
>> +               fmr_param.page_shift        = ilog2(sess->mr_page_size);
>> +               fmr_param.access            = (IB_ACCESS_LOCAL_WRITE |
>> +                                              IB_ACCESS_REMOTE_WRITE);
>> +
>> +               fmr_pool = ib_create_fmr_pool(sess->s.ib_dev->pd, &fmr_param);
>> +               if (unlikely(IS_ERR(fmr_pool))) {
>> +                       err = PTR_ERR(fmr_pool);
>> +                       ibtrs_err(sess, "FMR pool allocation failed, err: %d\n",
>> +                                 err);
>> +                       return err;
>> +               }
>> +               sess->fmr_pool = fmr_pool;
>> +       }
>> +
>> +       return err;
>> +}
>> +
>> +static void free_sess_fast_pool(struct ibtrs_clt_sess *sess)
>> +{
>> +       if (sess->fmr_pool) {
>> +               ib_destroy_fmr_pool(sess->fmr_pool);
>> +               sess->fmr_pool = NULL;
>> +       }
>> +}
>> +
>> +static int alloc_sess_io_bufs(struct ibtrs_clt_sess *sess)
>> +{
>> +       int ret;
>> +
>> +       ret = alloc_sess_reqs(sess);
>> +       if (unlikely(ret)) {
>> +               ibtrs_err(sess, "alloc_sess_reqs(), err: %d\n", ret);
>> +               return ret;
>> +       }
>> +       ret = alloc_sess_fast_pool(sess);
>> +       if (unlikely(ret)) {
>> +               ibtrs_err(sess, "alloc_sess_fast_pool(), err: %d\n", ret);
>> +               goto free_reqs;
>> +       }
>> +
>> +       return 0;
>> +
>> +free_reqs:
>> +       free_sess_reqs(sess);
>> +
>> +       return ret;
>> +}
>> +
>> +static void free_sess_io_bufs(struct ibtrs_clt_sess *sess)
>> +{
>> +       free_sess_reqs(sess);
>> +       free_sess_fast_pool(sess);
>> +}
>> +
>> +static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
>> +                                    enum ibtrs_clt_state new_state)
>> +{
>> +       enum ibtrs_clt_state old_state;
>> +       bool changed = false;
>> +
>> +       old_state = sess->state;
>> +       switch (new_state) {
>> +       case IBTRS_CLT_CONNECTING:
>> +               switch (old_state) {
>> +               case IBTRS_CLT_RECONNECTING:
>> +                       changed = true;
>> +                       /* FALLTHRU */
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       case IBTRS_CLT_RECONNECTING:
>> +               switch (old_state) {
>> +               case IBTRS_CLT_CONNECTED:
>> +               case IBTRS_CLT_CONNECTING_ERR:
>> +               case IBTRS_CLT_CLOSED:
>> +                       changed = true;
>> +                       /* FALLTHRU */
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       case IBTRS_CLT_CONNECTED:
>> +               switch (old_state) {
>> +               case IBTRS_CLT_CONNECTING:
>> +                       changed = true;
>> +                       /* FALLTHRU */
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       case IBTRS_CLT_CONNECTING_ERR:
>> +               switch (old_state) {
>> +               case IBTRS_CLT_CONNECTING:
>> +                       changed = true;
>> +                       /* FALLTHRU */
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       case IBTRS_CLT_CLOSING:
>> +               switch (old_state) {
>> +               case IBTRS_CLT_CONNECTING:
>> +               case IBTRS_CLT_CONNECTING_ERR:
>> +               case IBTRS_CLT_RECONNECTING:
>> +               case IBTRS_CLT_CONNECTED:
>> +                       changed = true;
>> +                       /* FALLTHRU */
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       case IBTRS_CLT_CLOSED:
>> +               switch (old_state) {
>> +               case IBTRS_CLT_CLOSING:
>> +                       changed = true;
>> +                       /* FALLTHRU */
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       case IBTRS_CLT_DEAD:
>> +               switch (old_state) {
>> +               case IBTRS_CLT_CLOSED:
>> +                       changed = true;
>> +                       /* FALLTHRU */
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +       if (changed) {
>> +               sess->state = new_state;
>> +               wake_up_locked(&sess->state_wq);
>> +       }
>> +
>> +       return changed;
>> +}
>> +
>> +static bool ibtrs_clt_change_state_from_to(struct ibtrs_clt_sess *sess,
>> +                                          enum ibtrs_clt_state old_state,
>> +                                          enum ibtrs_clt_state new_state)
>> +{
>> +       bool changed = false;
>> +
>> +       spin_lock_irq(&sess->state_wq.lock);
>> +       if (sess->state == old_state)
>> +               changed = __ibtrs_clt_change_state(sess, new_state);
>> +       spin_unlock_irq(&sess->state_wq.lock);
>> +
>> +       return changed;
>> +}
>> +
>> +static bool ibtrs_clt_change_state_get_old(struct ibtrs_clt_sess *sess,
>> +                                          enum ibtrs_clt_state new_state,
>> +                                          enum ibtrs_clt_state *old_state)
>> +{
>> +       bool changed;
>> +
>> +       spin_lock_irq(&sess->state_wq.lock);
>> +       *old_state = sess->state;
>> +       changed = __ibtrs_clt_change_state(sess, new_state);
>> +       spin_unlock_irq(&sess->state_wq.lock);
>> +
>> +       return changed;
>> +}
>> +
>> +static bool ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
>> +                                  enum ibtrs_clt_state new_state)
>> +{
>> +       enum ibtrs_clt_state old_state;
>> +
>> +       return ibtrs_clt_change_state_get_old(sess, new_state, &old_state);
>> +}
>> +
>> +static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
>> +{
>> +       enum ibtrs_clt_state state;
>> +
>> +       spin_lock_irq(&sess->state_wq.lock);
>> +       state = sess->state;
>> +       spin_unlock_irq(&sess->state_wq.lock);
>> +
>> +       return state;
>> +}
>> +
>> +static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
>> +{
>> +       struct ibtrs_clt_con *con;
>> +
>> +       (void)err;
>> +       con = container_of(c, typeof(*con), c);
>> +       ibtrs_rdma_error_recovery(con);
>> +}
>> +
>> +static void ibtrs_clt_init_hb(struct ibtrs_clt_sess *sess)
>> +{
>> +       ibtrs_init_hb(&sess->s, &io_comp_cqe,
>> +                     IBTRS_HB_INTERVAL_MS,
>> +                     IBTRS_HB_MISSED_MAX,
>> +                     ibtrs_clt_hb_err_handler,
>> +                     ibtrs_wq);
>> +}
>> +
>> +static void ibtrs_clt_start_hb(struct ibtrs_clt_sess *sess)
>> +{
>> +       ibtrs_start_hb(&sess->s);
>> +}
>> +
>> +static void ibtrs_clt_stop_hb(struct ibtrs_clt_sess *sess)
>> +{
>> +       ibtrs_stop_hb(&sess->s);
>> +}
>> +
>> +static void ibtrs_clt_reconnect_work(struct work_struct *work);
>> +static void ibtrs_clt_close_work(struct work_struct *work);
>> +
>> +static struct ibtrs_clt_sess *alloc_sess(struct ibtrs_clt *clt,
>> +                                        const struct ibtrs_addr *path,
>> +                                        size_t con_num, u16 max_segments)
>> +{
>> +       struct ibtrs_clt_sess *sess;
>> +       int err = -ENOMEM;
>> +       int cpu;
>> +
>> +       sess = kzalloc(sizeof(*sess), GFP_KERNEL);
>> +       if (unlikely(!sess))
>> +               goto err;
>> +
>> +       /* Extra connection for user messages */
>> +       con_num += 1;
>> +
>> +       sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
>> +       if (unlikely(!sess->s.con))
>> +               goto err_free_sess;
>> +
>> +       mutex_init(&sess->init_mutex);
>> +       uuid_gen(&sess->s.uuid);
>> +       memcpy(&sess->s.dst_addr, path->dst,
>> +              rdma_addr_size((struct sockaddr *)path->dst));
>> +
>> +       /*
>> +        * rdma_resolve_addr() passes src_addr to cma_bind_addr, which
>> +        * checks the sa_family to be non-zero. If user passed src_addr=NULL
>> +        * the sess->src_addr will contain only zeros, which is then fine.
>> +        */
>> +       if (path->src)
>> +               memcpy(&sess->s.src_addr, path->src,
>> +                      rdma_addr_size((struct sockaddr *)path->src));
>> +       strlcpy(sess->s.sessname, clt->sessname, sizeof(sess->s.sessname));
>> +       sess->s.con_num = con_num;
>> +       sess->clt = clt;
>> +       sess->max_pages_per_mr = max_segments;
>> +       init_waitqueue_head(&sess->state_wq);
>> +       sess->state = IBTRS_CLT_CONNECTING;
>> +       atomic_set(&sess->connected_cnt, 0);
>> +       INIT_WORK(&sess->close_work, ibtrs_clt_close_work);
>> +       INIT_DELAYED_WORK(&sess->reconnect_dwork, ibtrs_clt_reconnect_work);
>> +       ibtrs_clt_init_hb(sess);
>> +
>> +       sess->mp_skip_entry = alloc_percpu(typeof(*sess->mp_skip_entry));
>> +       if (unlikely(!sess->mp_skip_entry))
>> +               goto err_free_con;
>> +
>> +       for_each_possible_cpu(cpu)
>> +               INIT_LIST_HEAD(per_cpu_ptr(sess->mp_skip_entry, cpu));
>> +
>> +       err = ibtrs_clt_init_stats(&sess->stats);
>> +       if (unlikely(err))
>> +               goto err_free_percpu;
>> +
>> +       return sess;
>> +
>> +err_free_percpu:
>> +       free_percpu(sess->mp_skip_entry);
>> +err_free_con:
>> +       kfree(sess->s.con);
>> +err_free_sess:
>> +       kfree(sess);
>> +err:
>> +       return ERR_PTR(err);
>> +}
>> +
>> +static void free_sess(struct ibtrs_clt_sess *sess)
>> +{
>> +       ibtrs_clt_free_stats(&sess->stats);
>> +       free_percpu(sess->mp_skip_entry);
>> +       kfree(sess->s.con);
>> +       kfree(sess->srv_rdma_addr);
>> +       kfree(sess);
>> +}
>> +
>> +static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
>> +{
>> +       struct ibtrs_clt_con *con;
>> +
>> +       con = kzalloc(sizeof(*con), GFP_KERNEL);
>> +       if (unlikely(!con))
>> +               return -ENOMEM;
>> +
>> +       /* Map first two connections to the first CPU */
>> +       con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
>> +       con->c.cid = cid;
>> +       con->c.sess = &sess->s;
>> +       atomic_set(&con->io_cnt, 0);
>> +
>> +       sess->s.con[cid] = &con->c;
>> +
>> +       return 0;
>> +}
>> +
>> +static void destroy_con(struct ibtrs_clt_con *con)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +
>> +       sess->s.con[con->c.cid] = NULL;
>> +       kfree(con);
>> +}
>> +
>> +static int create_con_cq_qp(struct ibtrs_clt_con *con)
>> +{
>> +       struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
>> +       u16 cq_size, wr_queue_size;
>> +       int err, cq_vector;
>> +
>> +       /*
>> +        * This function can fail, but still destroy_con_cq_qp() should
>> +        * be called, this is because create_con_cq_qp() is called on cm
>> +        * event path, thus caller/waiter never knows: have we failed before
>> +        * create_con_cq_qp() or after.  To solve this dilemma without
>> +        * creating any additional flags just allow destroy_con_cq_qp() be
>> +        * called many times.
>> +        */
>> +
>> +       if (con->c.cid == 0) {
>> +               cq_size = SERVICE_CON_QUEUE_DEPTH;
>> +               /* + 2 for drain and heartbeat */
>> +               wr_queue_size = SERVICE_CON_QUEUE_DEPTH + 2;
>> +               /* We must be the first here */
>> +               if (WARN_ON(sess->s.ib_dev))
>> +                       return -EINVAL;
>> +
>> +               /*
>> +                * The whole session uses device from user connection.
>> +                * Be careful not to close user connection before ib dev
>> +                * is gracefully put.
>> +                */
>> +               sess->s.ib_dev = ibtrs_ib_dev_find_get(con->c.cm_id);
>> +               if (unlikely(!sess->s.ib_dev)) {
>> +                       ibtrs_wrn(sess, "ibtrs_ib_dev_find_get(): no memory\n");
>> +                       return -ENOMEM;
>> +               }
>> +               sess->s.ib_dev_ref = 1;
>> +               query_fast_reg_mode(sess);
>> +       } else {
>> +               int num_wr;
>> +
>> +               /*
>> +                * Here we assume that session members are correctly set.
>> +                * This is always true if user connection (cid == 0) is
>> +                * established first.
>> +                */
>> +               if (WARN_ON(!sess->s.ib_dev))
>> +                       return -EINVAL;
>> +               if (WARN_ON(!sess->queue_depth))
>> +                       return -EINVAL;
>> +
>> +               /* Shared between connections */
>> +               sess->s.ib_dev_ref++;
>> +               cq_size = sess->queue_depth;
>> +               num_wr = DIV_ROUND_UP(sess->max_pages_per_mr, sess->max_sge);
>> +               wr_queue_size = sess->s.ib_dev->attrs.max_qp_wr;
>> +               wr_queue_size = min_t(int, wr_queue_size,
>> +                                     sess->queue_depth * num_wr *
>> +                                     (use_fr ? 3 : 2) + 1);
>> +       }
>> +       cq_vector = con->cpu % sess->s.ib_dev->dev->num_comp_vectors;
>> +       err = ibtrs_cq_qp_create(&sess->s, &con->c, sess->max_sge,
>> +                                cq_vector, cq_size, wr_queue_size,
>> +                                IB_POLL_SOFTIRQ);
>> +       /*
>> +        * In case of error we do not bother to clean previous allocations,
>> +        * since destroy_con_cq_qp() must be called.
>> +        */

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05  8:56   ` Jinpu Wang
  2018-02-05 11:36     ` Sagi Grimberg
@ 2018-02-05 16:16     ` Bart Van Assche
  2018-02-05 16:36       ` Jinpu Wang
  2018-02-07 16:35       ` Christopher Lameter
  1 sibling, 2 replies; 79+ messages in thread
From: Bart Van Assche @ 2018-02-05 16:16 UTC (permalink / raw)
  To: jinpu.wang
  Cc: linux-block, hch, linux-rdma, roman.penyaev, sagi, ogerlitz,
	axboe, danil.kipnis

On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
> Hi Bart,
>
> My another 2 cents:)
> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> > > o Simple configuration of IBNBD:
> > >    - Server side is completely passive: volumes do not need to be
> > >      explicitly exported.
> >
> > That sounds like a security hole? I think the ability to configure whether or
> > not an initiator is allowed to log in is essential and also which volumes an
> > initiator has access to.
>
> Our design target for well controlled production environment, so security is
> handle in other layer. On server side, admin can set the dev_search_path in
> module parameter to set parent directory, this will concatenate with the path
> client send in open message to open a block device.

Hello Jack,

That approach may work well for your employer but sorry I don't think this is
sufficient for an upstream driver. I think that most users who configure a
network storage target expect full control over which storage devices are exported
and also over which clients do have and do not have access.

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/24] ibtrs: client: main functionality
  2018-02-05 14:19     ` Roman Penyaev
@ 2018-02-05 16:24       ` Bart Van Assche
  0 siblings, 0 replies; 79+ messages in thread
From: Bart Van Assche @ 2018-02-05 16:24 UTC (permalink / raw)
  To: roman.penyaev, sagi
  Cc: danil.kipnis, hch, linux-block, linux-rdma, jinpu.wang, axboe, ogerlitz

On Mon, 2018-02-05 at 15:19 +0100, Roman Penyaev wrote:
> On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> > Do you actually ever have remote write access in your protocol?
>
> We do not have reads, instead client writes on write and server writes
> on read. (write only storage solution :)

So there are no restrictions on which clients can log in and any client can
send RDMA writes to the target system? I think this needs to be improved ...

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 16:16     ` Bart Van Assche
@ 2018-02-05 16:36       ` Jinpu Wang
  2018-02-07 16:35       ` Christopher Lameter
  1 sibling, 0 replies; 79+ messages in thread
From: Jinpu Wang @ 2018-02-05 16:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, linux-rdma, roman.penyaev, sagi, ogerlitz,
	axboe, danil.kipnis

On Mon, Feb 5, 2018 at 5:16 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
>> Hi Bart,
>>
>> My another 2 cents:)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
>> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > > o Simple configuration of IBNBD:
>> > >    - Server side is completely passive: volumes do not need to be
>> > >      explicitly exported.
>> >
>> > That sounds like a security hole? I think the ability to configure whether or
>> > not an initiator is allowed to log in is essential and also which volumes an
>> > initiator has access to.
>>
>> Our design target for well controlled production environment, so security is
>> handle in other layer. On server side, admin can set the dev_search_path in
>> module parameter to set parent directory, this will concatenate with the path
>> client send in open message to open a block device.
>
> Hello Jack,
>
> That approach may work well for your employer but sorry I don't think this is
> sufficient for an upstream driver. I think that most users who configure a
> network storage target expect full control over which storage devices are exported
> and also over which clients do have and do not have access.
>
> Bart.
Hello Bart,

I agree that for general-purpose use it may be good to have better access control.

Thanks,
-- 
Jack Wang
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 14:17         ` Sagi Grimberg
@ 2018-02-05 16:40           ` Danil Kipnis
  2018-02-05 18:38             ` Bart Van Assche
  0 siblings, 1 reply; 79+ messages in thread
From: Danil Kipnis @ 2018-02-05 16:40 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jinpu Wang, Bart Van Assche, roman.penyaev, linux-block,
	linux-rdma, hch, ogerlitz, axboe

On Mon, Feb 5, 2018 at 3:17 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>>> Hi Bart,
>>>>
>>>> My another 2 cents:)
>>>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>>>>>
>>>>>>
>>>>>> o Simple configuration of IBNBD:
>>>>>>      - Server side is completely passive: volumes do not need to be
>>>>>>        explicitly exported.
>>>>>
>>>>>
>>>>>
>>>>> That sounds like a security hole? I think the ability to configure
>>>>> whether or
>>>>> not an initiator is allowed to log in is essential and also which
>>>>> volumes
>>>>> an
>>>>> initiator has access to.
>>>>
>>>>
>>>> Our design target for well controlled production environment, so
>>>> security is handle in other layer.
>>>
>>>
>>>
>>> What will happen to a new adopter of the code you are contributing?
>>
>>
>> Hi Sagi, Hi Bart,
>> thanks for your feedback.
>> We considered the "storage cluster" setup, where each ibnbd client has
>> access to each ibnbd server. Each ibnbd server manages devices under
>> his "dev_search_path" and can provide access to them to any ibnbd
>> client in the network.
>
>
> I don't understand how that helps?
>
>> On top of that Ibnbd server has an additional
>> "artificial" restriction, that a device can be mapped in writable-mode
>> by only one client at once.
>
>
> I think one would still need the option to disallow readable export as
> well.

It just occurred to me, that we could easily extend the interface in
such a way that each client (i.e. each session) would have on server
side her own directory with the devices it can access. I.e. instead of
just "dev_search_path" per server, any client would be able to only
access devices under <dev_search_path>/session_name. (session name
must already be generated by each client in a unique way). This way
one could have an explicit control over which devices can be accessed
by which clients. Do you think that would do it?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 12:16 ` Sagi Grimberg
  2018-02-05 12:30   ` Sagi Grimberg
@ 2018-02-05 16:58   ` Bart Van Assche
  2018-02-05 17:16     ` Roman Penyaev
  2018-02-06 13:12   ` Roman Penyaev
  2 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-05 16:58 UTC (permalink / raw)
  To: roman.penyaev, linux-block, linux-rdma, sagi
  Cc: hch, danil.kipnis, jinpu.wang, axboe, ogerlitz

On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote:
> - Your latency measurements are surprisingly high for a null target
>    device (even for low end nvme device actually) regardless of the
>    transport implementation.
>
> For example:
> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>    fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>    and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>    latency I got ~14 us. So something does not add up here. If this is
>    not some configuration issue, then we have serious bugs to handle..
>
> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>    troubles understanding how you were able to get such high latencies
>    (> 100 ms for QD>=100)
>
> Can you share more information about your setup? It would really help
> us understand more.

I would also appreciate it if more information could be provided about the
measurement results. In addition to answering Sagi's questions, would it
be possible to share the fio job that was used for measuring latency? In
https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf
I found the following:

iodepth=128
iodepth_batch_submit=128

If you want to keep the pipeline full I think that you need to set the
iodepth_batch_submit parameter to a value that is much lower than iodepth.
I think that setting iodepth_batch_submit equal to iodepth will yield
suboptimal IOPS results. Jens, please correct me if I got this wrong.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/24] ibtrs: client: main functionality
  2018-02-05 14:14       ` Sagi Grimberg
@ 2018-02-05 17:05         ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-05 17:05 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, linux-block, linux-rdma, danil.kipnis, hch,
	ogerlitz, jinpu.wang, axboe

On Mon, Feb 5, 2018 at 3:14 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Indeed, seems sbitmap can be reused.
>>
>> But tags is a part of IBTRS, and is not related to block device at all.
>> One
>> IBTRS connection (session) handles many block devices
>
>
> we use host shared tag sets for the case of multiple block devices.

Unfortunately (or fortunately, depending on the intended mq design) tags are
not shared between hw_queues.  So in our case (1 session queue, N devices)
you always have to specify tags->nr_hw_queues = 1, or the magic will not
happen and you will always have more tags than your session supports.  But
nr_hw_queues = 1 kills performance dramatically.  What scales well is
the following: nr_hw_queues == num_online_cpus() == number of QPs in
one session.
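
Just to illustrate what scales for us, a minimal, hypothetical sketch of the
tag-set sizing (not the actual ibnbd code; the function name is made up):

#include <linux/blk-mq.h>

/*
 * Sketch: size the blk-mq tag set so that each hw queue maps to one QP
 * of the session.  The transport-level tags still cap the total number
 * of inflight requests per session.
 */
static int example_setup_tag_set(struct blk_mq_tag_set *set,
                                 const struct blk_mq_ops *ops,
                                 unsigned int sess_queue_depth)
{
        memset(set, 0, sizeof(*set));
        set->ops          = ops;
        set->nr_hw_queues = num_online_cpus();  /* one hw queue per QP */
        set->queue_depth  = sess_queue_depth;
        set->numa_node    = NUMA_NO_NODE;
        set->flags        = BLK_MQ_F_SHOULD_MERGE;

        return blk_mq_alloc_tag_set(set);
}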

>> (or any IO producers).
>
>
> Lets wait until we actually have this theoretical non-block IO
> producers..
>
>> With a tag you get a free slot of a buffer where you can read/write, so
>> once
>> you've allocated a tag you won't sleep on IO path inside a library.
>
>
> Same for block tags (given that you don't set the request queue
> otherwise)
>
>> Also tag
>> helps a lot on IO fail-over to another connection (multipath
>> implementation,
>> which is also a part of the transport library, not a block device), where
>> you
>> simply reuse the same buffer slot (with a tag in your hands) forwarding IO
>> to
>> another RDMA connection.
>
>
> What is the benefit of this detached architecture?

That gives us a separate RDMA IO library, where ibnbd is one of the players.

> IMO, one reason why you ended up not reusing a lot of the infrastructure
> is yielded from the  attempt to support a theoretical different consumer
> that is not ibnbd.

Well, not quite.  Not using the rdma api helpers (we will use them) and not
using tags from the block layer (we need tags inside the transport) is not
"a lot of the infrastructure" :)

I would say that we are not fast enough to follow all kernel trends.
That is the major reason, not some other potential user of ibtrs.

> Did you actually had plans for any other consumers?

Yep, the major target is replicated block storage; that is why the transport
is separate.

> Personally, I think you will be much better off with a unified approach
> for your block device implementation.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 16:58   ` Bart Van Assche
@ 2018-02-05 17:16     ` Roman Penyaev
  2018-02-05 17:20       ` Bart Van Assche
  0 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-05 17:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-rdma, sagi, hch, danil.kipnis, jinpu.wang,
	axboe, ogerlitz

Hi Bart,

On Mon, Feb 5, 2018 at 5:58 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote:
>> - Your latency measurements are surprisingly high for a null target
>>    device (even for low end nvme device actually) regardless of the
>>    transport implementation.
>>
>> For example:
>> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>>    fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>>    and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>>    latency I got ~14 us. So something does not add up here. If this is
>>    not some configuration issue, then we have serious bugs to handle..
>>
>> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>>    troubles understanding how you were able to get such high latencies
>>    (> 100 ms for QD>=100)
>>
>> Can you share more information about your setup? It would really help
>> us understand more.
>
> I would also appreciate it if more information could be provided about the
> measurement results. In addition to answering Sagi's questions, would it
> be possible to share the fio job that was used for measuring latency? In
> https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf
> I found the following:
>
> iodepth=128
> iodepth_batch_submit=128
>
> If you want to keep the pipeline full I think that you need to set the
> iodepth_batch_submit parameter to a value that is much lower than iodepth.
> I think that setting iodepth_batch_submit equal to iodepth will yield
> suboptimal IOPS results. Jens, please correct me if I got this wrong.

Sorry, Bart, I will answer here in a few words (I would like to answer
in detail tomorrow on Sagi's mail).

Everything (fio jobs, setup, etc) is given in the same link:

https://www.spinics.net/lists/linux-rdma/msg48799.html

at the bottom you will find links on google docs with many pages
and archived fio jobs and scripts. (I do not remember exactly,
one year passed, but there should be everything).

Regarding smaller iodepth_batch_submit - that decreases performance.
Once I played with that, even introduced new iodepth_batch_complete_max
option for fio, but then I decided to stop and simply chose this
configuration, which provides me fastest results.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 17:16     ` Roman Penyaev
@ 2018-02-05 17:20       ` Bart Van Assche
  2018-02-06 11:47         ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-05 17:20 UTC (permalink / raw)
  To: roman.penyaev
  Cc: linux-block, hch, linux-rdma, jinpu.wang, sagi, ogerlitz, axboe,
	danil.kipnis

On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote:
> Everything (fio jobs, setup, etc) is given in the same link:
>
> https://www.spinics.net/lists/linux-rdma/msg48799.html
>
> at the bottom you will find links on google docs with many pages
> and archived fio jobs and scripts. (I do not remember exactly,
> one year passed, but there should be everything).
>
> Regarding smaller iodepth_batch_submit - that decreases performance.
> Once I played with that, even introduced new iodepth_batch_complete_max
> option for fio, but then I decided to stop and simply chose this
> configuration, which provides me fastest results.

Hello Roman,

That's weird. For which protocols did reducing iodepth_batch_submit lead
to lower performance: all the tested protocols or only some of them?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 16:40           ` Danil Kipnis
@ 2018-02-05 18:38             ` Bart Van Assche
  2018-02-06  9:44               ` Danil Kipnis
  0 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-05 18:38 UTC (permalink / raw)
  To: Danil Kipnis, Sagi Grimberg
  Cc: Jinpu Wang, roman.penyaev, linux-block, linux-rdma, hch, ogerlitz, axboe

On 02/05/18 08:40, Danil Kipnis wrote:
> It just occurred to me, that we could easily extend the interface in
> such a way that each client (i.e. each session) would have on server
> side her own directory with the devices it can access. I.e. instead of
> just "dev_search_path" per server, any client would be able to only
> access devices under <dev_search_path>/session_name. (session name
> must already be generated by each client in a unique way). This way
> one could have an explicit control over which devices can be accessed
> by which clients. Do you think that would do it?

Hello Danil,

That sounds interesting to me. However, I think that approach requires 
to configure client access completely before the kernel target side 
module is loaded. It does not allow to configure permissions dynamically 
after the kernel target module has been loaded. Additionally, I don't 
see how to support attributes per (initiator, block device) pair with 
that approach. LIO e.g. supports the 
/sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. 
You may want to implement similar functionality if you want to convince 
more users to use IBNBD.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 18:38             ` Bart Van Assche
@ 2018-02-06  9:44               ` Danil Kipnis
  2018-02-06 15:35                 ` Bart Van Assche
  0 siblings, 1 reply; 79+ messages in thread
From: Danil Kipnis @ 2018-02-06  9:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Sagi Grimberg, Jinpu Wang, roman.penyaev, linux-block,
	linux-rdma, hch, ogerlitz, axboe

On Mon, Feb 5, 2018 at 7:38 PM, Bart Van Assche <bart.vanassche@wdc.com> wrote:
> On 02/05/18 08:40, Danil Kipnis wrote:
>>
>> It just occurred to me, that we could easily extend the interface in
>> such a way that each client (i.e. each session) would have on server
>> side her own directory with the devices it can access. I.e. instead of
>> just "dev_search_path" per server, any client would be able to only
>> access devices under <dev_search_path>/session_name. (session name
>> must already be generated by each client in a unique way). This way
>> one could have an explicit control over which devices can be accessed
>> by which clients. Do you think that would do it?
>
>
> Hello Danil,
>
> That sounds interesting to me. However, I think that approach requires to
> configure client access completely before the kernel target side module is
> loaded. It does not allow to configure permissions dynamically after the
> kernel target module has been loaded. Additionally, I don't see how to
> support attributes per (initiator, block device) pair with that approach.
> LIO e.g. supports the
> /sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. You
> may want to implement similar functionality if you want to convince more
> users to use IBNBD.
>
> Thanks,
>
> Bart.

Hello Bart,

the configuration (which devices can be accessed by a particular
client) can happen also after the kernel target module is loaded. The
directory in <dev_search_path> is a module parameter and is fixed. It
contains for example "/ibnbd_devices/". But a particular client X
would be able to only access the devices located in the subdirectory
"/ibnbd_devices/client_x/". (The sessionname here is client_x) One can
add or remove the devices from that directory (those are just symlinks
to /dev/xxx) at any time - before or after the server module is
loaded. But you are right, we need something additional in order to be
able to specify which devices a client can access writable and which
readonly. May be another subdirectories "wr" and "ro" for each client:
those under /ibnbd_devices/client_x/ro/ can only be read by client_x
and those in /ibnbd_devices/client_x/wr/ can also be written to?
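
To illustrate the idea, a rough sketch of how the server could resolve a
device name against such a per-session directory (function and parameter
names are made up, this is not the actual ibnbd-server code):

#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/errno.h>

/*
 * Build "<dev_search_path>/<session_name>/<ro|wr>/<device>", e.g.
 * "/ibnbd_devices/client_x/ro/sda".  The symlinks below that directory
 * decide what client_x may open and in which mode.
 */
static int example_resolve_dev_path(const char *dev_search_path,
                                    const char *sessname, const char *devname,
                                    bool writable, char *buf, size_t len)
{
        int ret;

        ret = snprintf(buf, len, "%s/%s/%s/%s", dev_search_path, sessname,
                       writable ? "wr" : "ro", devname);
        if (ret < 0 || ret >= (int)len)
                return -ENAMETOOLONG;

        return 0;
}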

Thanks,

Danil.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 17:20       ` Bart Van Assche
@ 2018-02-06 11:47         ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-06 11:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, linux-rdma, jinpu.wang, sagi, ogerlitz, axboe,
	danil.kipnis

On Mon, Feb 5, 2018 at 6:20 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote:
>> Everything (fio jobs, setup, etc) is given in the same link:
>>
>> https://www.spinics.net/lists/linux-rdma/msg48799.html
>>
>> at the bottom you will find links on google docs with many pages
>> and archived fio jobs and scripts. (I do not remember exactly,
>> one year passed, but there should be everything).
>>
>> Regarding smaller iodepth_batch_submit - that decreases performance.
>> Once I played with that, even introduced new iodepth_batch_complete_max
>> option for fio, but then I decided to stop and simply chose this
>> configuration, which provides me fastest results.
>
> Hello Roman,
>
> That's weird. For which protocols did reducing iodepth_batch_submit lead
> to lower performance: all the tested protocols or only some of them?

Hi Bart,

It seems that it does not depend on the protocol (when I tested, it was true
for nvme and ibnbd).  It depends on the load.  Under high load (one or a few
fio jobs dedicated to each cpu, and we have 64 cpus) it turns out to be faster
to wait for the completions of the whole queue of that particular block dev,
instead of switching from kernel to userspace for each completed IO.

But I can assure you that the performance difference is very minor: it exists,
but it does not change the whole picture of what you see in this google
sheet. So what I tried to achieve was to squeeze out everything I could,
nothing more.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules
  2018-02-05 10:52   ` Sagi Grimberg
@ 2018-02-06 12:01     ` Roman Penyaev
  2018-02-06 16:10       ` Jason Gunthorpe
  0 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-06 12:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

Hi Sagi,

On Mon, Feb 5, 2018 at 11:52 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Roman,
>
> Here are some comments below.
>
>> +int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
>> +{
>> +       struct ib_recv_wr wr, *bad_wr;
>> +
>> +       wr.next    = NULL;
>> +       wr.wr_cqe  = cqe;
>> +       wr.sg_list = NULL;
>> +       wr.num_sge = 0;
>> +
>> +       return ib_post_recv(con->qp, &wr, &bad_wr);
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);
>
>
> What is this designed to do?

Each IO completion (response from server to client) is an immediate
message with no data inside.  Using the IMM field we are able to find the IO
in the inflight table and complete it with an error code, if any.
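
Roughly, the IMM packing works like this (illustrative sketch only; the real
split is defined by MAX_IMM_PAYL_BITS in the series, the 24 payload bits
below are just an assumption):

#include <linux/types.h>

/* Pack a small message type and a payload (e.g. msg_id + error) into
 * the 32-bit immediate data of an RDMA write.
 */
enum { EXAMPLE_IMM_PAYL_BITS = 24 };    /* assumed width, see MAX_IMM_PAYL_BITS */

static inline u32 example_to_imm(u32 type, u32 payload)
{
        return (type << EXAMPLE_IMM_PAYL_BITS) |
               (payload & ((1U << EXAMPLE_IMM_PAYL_BITS) - 1));
}

static inline void example_from_imm(u32 imm, u32 *type, u32 *payload)
{
        *type    = imm >> EXAMPLE_IMM_PAYL_BITS;
        *payload = imm & ((1U << EXAMPLE_IMM_PAYL_BITS) - 1);
}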

>> +int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
>> +                                struct ib_sge *sge, unsigned int num_sge,
>> +                                u32 rkey, u64 rdma_addr, u32 imm_data,
>> +                                enum ib_send_flags flags)
>> +{
>> +       struct ib_send_wr *bad_wr;
>> +       struct ib_rdma_wr wr;
>> +       int i;
>> +
>> +       wr.wr.next        = NULL;
>> +       wr.wr.wr_cqe      = &iu->cqe;
>> +       wr.wr.sg_list     = sge;
>> +       wr.wr.num_sge     = num_sge;
>> +       wr.rkey           = rkey;
>> +       wr.remote_addr    = rdma_addr;
>> +       wr.wr.opcode      = IB_WR_RDMA_WRITE_WITH_IMM;
>> +       wr.wr.ex.imm_data = cpu_to_be32(imm_data);
>> +       wr.wr.send_flags  = flags;
>> +
>> +       /*
>> +        * If one of the sges has 0 size, the operation will fail with an
>> +        * length error
>> +        */
>> +       for (i = 0; i < num_sge; i++)
>> +               if (WARN_ON(sge[i].length == 0))
>> +                       return -EINVAL;
>> +
>> +       return ib_post_send(con->qp, &wr.wr, &bad_wr);
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_iu_post_rdma_write_imm);
>> +
>> +int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
>> +                                   u32 imm_data, enum ib_send_flags flags)
>> +{
>> +       struct ib_send_wr wr, *bad_wr;
>> +
>> +       memset(&wr, 0, sizeof(wr));
>> +       wr.wr_cqe       = cqe;
>> +       wr.send_flags   = flags;
>> +       wr.opcode       = IB_WR_RDMA_WRITE_WITH_IMM;
>> +       wr.ex.imm_data  = cpu_to_be32(imm_data);
>> +
>> +       return ib_post_send(con->qp, &wr, &bad_wr);
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_post_rdma_write_imm_empty);
>
>
> Christoph did a great job adding a generic rdma rw API, please
> reuse it, if you rely on needed functionality that does not exist
> there, please enhance it instead of open-coding a new rdma engine
> library.

Good to know, thanks.

>> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device *dev)
>> +{
>> +       int err;
>> +
>> +       d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
>> +       if (IS_ERR(d->pd))
>> +               return PTR_ERR(d->pd);
>> +       d->dev = dev;
>> +       d->lkey = d->pd->local_dma_lkey;
>> +       d->rkey = d->pd->unsafe_global_rkey;
>> +
>> +       err = ibtrs_query_device(d);
>> +       if (unlikely(err))
>> +               ib_dealloc_pd(d->pd);
>> +
>> +       return err;
>> +}
>
>
> I must say that this makes me frustrated.. We stopped doing these
> sort of things long time ago. No way we can even consider accepting
> the unsafe use of the global rkey exposing the entire memory space for
> remote access permissions.
>
> Sorry for being blunt, but this protocol design which makes a concious
> decision to expose unconditionally is broken by definition.

I suppose we can also afford the same trick which nvme does: provide a
register_always module argument, can't we?  It would also be interesting
to measure the performance difference.

When I did nvme testing with register_always=true/false I saw a
difference.  It would be nice to measure ibtrs with register_always=true
once ibtrs supports that.
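
Something as simple as this would probably do for the knob itself
(hypothetical, not part of the posted series; it just mirrors the kind of
parameter nvme-rdma exposes):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* When true, register memory per I/O instead of relying on the
 * unsafe global rkey.  The wiring into the data path is omitted here.
 */
static bool register_always;
module_param(register_always, bool, 0444);
MODULE_PARM_DESC(register_always,
                 "Use memory registration for every I/O instead of the global rkey");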

>> +struct ibtrs_ib_dev *ibtrs_ib_dev_find_get(struct rdma_cm_id *cm_id)
>> +{
>> +       struct ibtrs_ib_dev *dev;
>> +       int err;
>> +
>> +       mutex_lock(&device_list_mutex);
>> +       list_for_each_entry(dev, &device_list, entry) {
>> +               if (dev->dev->node_guid == cm_id->device->node_guid &&
>> +                   kref_get_unless_zero(&dev->ref))
>> +                       goto out_unlock;
>> +       }
>> +       dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>> +       if (unlikely(!dev))
>> +               goto out_err;
>> +
>> +       kref_init(&dev->ref);
>> +       err = ibtrs_ib_dev_init(dev, cm_id->device);
>> +       if (unlikely(err))
>> +               goto out_free;
>> +       list_add(&dev->entry, &device_list);
>> +out_unlock:
>> +       mutex_unlock(&device_list_mutex);
>> +
>> +       return dev;
>> +
>> +out_free:
>> +       kfree(dev);
>> +out_err:
>> +       mutex_unlock(&device_list_mutex);
>> +
>> +       return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_ib_dev_find_get);
>
>
> Is it time to make this a common helper in rdma_cm?

True, and that could also become a patch for nvme.

>> +static void schedule_hb(struct ibtrs_sess *sess)
>> +{
>> +       queue_delayed_work(sess->hb_wq, &sess->hb_dwork,
>> +                          msecs_to_jiffies(sess->hb_interval_ms));
>> +}
>
>
> What does hb stand for?

Just heartbeats :)

>
>> +void ibtrs_send_hb_ack(struct ibtrs_sess *sess)
>> +{
>> +       struct ibtrs_con *usr_con = sess->con[0];
>> +       u32 imm;
>> +       int err;
>> +
>> +       imm = ibtrs_to_imm(IBTRS_HB_ACK_IMM, 0);
>> +       err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe,
>> +                                             imm, IB_SEND_SIGNALED);
>> +       if (unlikely(err)) {
>> +               sess->hb_err_handler(usr_con, err);
>> +               return;
>> +       }
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_send_hb_ack);
>
>
> What is this?
>
> What is all this hb stuff?

Heartbeat acknowledgements.

>> +static int ibtrs_str_ipv4_to_sockaddr(const char *addr, size_t len,
>> +                                     short port, struct sockaddr *dst)
>> +{
>> +       struct sockaddr_in *dst_sin = (struct sockaddr_in *)dst;
>> +       int ret;
>> +
>> +       ret = in4_pton(addr, len, (u8 *)&dst_sin->sin_addr.s_addr,
>> +                      '\0', NULL);
>> +       if (ret == 0)
>> +               return -EINVAL;
>> +
>> +       dst_sin->sin_family = AF_INET;
>> +       dst_sin->sin_port = htons(port);
>> +
>> +       return 0;
>> +}
>> +
>> +static int ibtrs_str_ipv6_to_sockaddr(const char *addr, size_t len,
>> +                                     short port, struct sockaddr *dst)
>> +{
>> +       struct sockaddr_in6 *dst_sin6 = (struct sockaddr_in6 *)dst;
>> +       int ret;
>> +
>> +       ret = in6_pton(addr, len, dst_sin6->sin6_addr.s6_addr,
>> +                      '\0', NULL);
>> +       if (ret != 1)
>> +               return -EINVAL;
>> +
>> +       dst_sin6->sin6_family = AF_INET6;
>> +       dst_sin6->sin6_port = htons(port);
>> +
>> +       return 0;
>> +}
>
>
> We already added helpers for this in net utils, you don't need to
> code it again.

Nice.  Will reuse then.
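
Presumably the helper meant here is inet_pton_with_scope() from
net/core/utils.c.  Assuming the port is available as a string, the two
parsers above could collapse into something like the sketch below
(ibtrs_str_to_sockaddr() is an illustrative name; the AF_IB/GID case
would still need its own parsing):

#include <linux/inet.h>
#include <net/net_namespace.h>

static int ibtrs_str_to_sockaddr(const char *addr, const char *port,
                                 struct sockaddr_storage *dst)
{
        /* AF_UNSPEC lets the helper try IPv4 first and fall back to IPv6 */
        return inet_pton_with_scope(&init_net, AF_UNSPEC, addr, port, dst);
}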

>> +
>> +static int ibtrs_str_gid_to_sockaddr(const char *addr, size_t len,
>> +                                    short port, struct sockaddr *dst)
>> +{
>> +       struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
>> +       int ret;
>> +
>> +       /* We can use some of the I6 functions since GID is a valid
>> +        * IPv6 address format
>> +        */
>> +       ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
>> +       if (ret == 0)
>> +               return -EINVAL;
>> +
>> +       dst_ib->sib_family = AF_IB;
>> +       /*
>> +        * Use the same TCP server port number as the IB service ID
>> +        * on the IB port space range
>> +        */
>> +       dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | port);
>> +       dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
>> +       dst_ib->sib_pkey = cpu_to_be16(0xffff);
>> +
>> +       return 0;
>> +}
>
>
> Would be a nice addition to net utils.

Got it.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 04/24] ibtrs: client: private header with client structs and functions
  2018-02-05 10:59   ` Sagi Grimberg
@ 2018-02-06 12:23     ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-06 12:23 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

Hi Sagi,

On Mon, Feb 5, 2018 at 11:59 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Roman,
>
>
>> +struct ibtrs_clt_io_req {
>> +       struct list_head        list;
>> +       struct ibtrs_iu         *iu;
>> +       struct scatterlist      *sglist; /* list holding user data */
>> +       unsigned int            sg_cnt;
>> +       unsigned int            sg_size;
>> +       unsigned int            data_len;
>> +       unsigned int            usr_len;
>> +       void                    *priv;
>> +       bool                    in_use;
>> +       struct ibtrs_clt_con    *con;
>> +       union {
>> +               struct ib_pool_fmr      **fmr_list;
>> +               struct ibtrs_fr_desc    **fr_list;
>> +       };
>
>
> We are pretty much stuck with fmrs for legacy devices, it has
> no future support plans, please don't add new dependencies
> on it. Its already hard enough to get rid of it.

Got it, we have a plan to get rid of FMR.  But as far as I remember our
internal tests, FR is slower.  The question is: why could that be, in
your experience?  I will retest, but it is still interesting to know.

>> +       void                    *map_page;
>> +       struct ibtrs_tag        *tag;
>
>
> Can I ask why do you need another tag that is not the request
> tag?

I already responded to this once; the summary is the following:

1. Indeed, blk-mq supports tag sharing, but only between hw queues, not
globally, so for us that would mean tags->nr_hw_queues = 1, which kills
performance.

2. We need tag sharing in the transport library, which should not be
tightly coupled with the block device (a sketch follows below).
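
Roughly, it is a single tag space owned by the ibtrs session and shared
by all its users, independent of how many blk-mq hardware queues sit on
top; a simplified sketch (names, fields and locking are illustrative,
not the actual ibtrs code):

#include <linux/bitops.h>
#include <linux/wait.h>

struct ibtrs_tag_pool {
        unsigned long           *bitmap;        /* one bit per in-flight tag */
        unsigned int            depth;          /* session queue depth */
        wait_queue_head_t       wait;
};

static int ibtrs_tag_get(struct ibtrs_tag_pool *pool, bool can_wait)
{
        unsigned int tag;

        for (;;) {
                tag = find_first_zero_bit(pool->bitmap, pool->depth);
                if (tag < pool->depth &&
                    !test_and_set_bit_lock(tag, pool->bitmap))
                        return tag;
                if (!can_wait)
                        return -EBUSY;
                /* Sleep until any user of the session releases a tag */
                wait_event(pool->wait,
                           find_first_zero_bit(pool->bitmap, pool->depth) <
                           pool->depth);
        }
}

static void ibtrs_tag_put(struct ibtrs_tag_pool *pool, unsigned int tag)
{
        clear_bit_unlock(tag, pool->bitmap);
        wake_up(&pool->wait);
}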


>> +       u16                     nmdesc;
>> +       enum dma_data_direction dir;
>> +       ibtrs_conf_fn           *conf;
>> +       unsigned long           start_time;
>> +};
>> +
>
>
>> +static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
>> +{
>> +       if (unlikely(!c))
>> +               return NULL;
>> +
>> +       return container_of(c, struct ibtrs_clt_con, c);
>> +}
>> +
>> +static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
>> +{
>> +       if (unlikely(!s))
>> +               return NULL;
>> +
>> +       return container_of(s, struct ibtrs_clt_sess, s);
>> +}
>
>
> Seems a bit awkward that container_of wrappers check pointer validity...

That can be fixed.  Frankly, I don't remember any code path where I
implicitly rely on the returned NULL: session and connection are always
expected to be valid pointers.
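
In other words, the wrappers can simply become:

static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
{
        return container_of(c, struct ibtrs_clt_con, c);
}

static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
{
        return container_of(s, struct ibtrs_clt_sess, s);
}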

>> +/**
>> + * list_next_or_null_rr - get next list element in round-robin fashion.
>> + * @pos:     entry, starting cursor.
>> + * @head:    head of the list to examine. This list must have at least
>> one
>> + *           element, namely @pos.
>> + * @member:  name of the list_head structure within typeof(*pos).
>> + *
>> + * Important to understand that @pos is a list entry, which can be
>> already
>> + * removed using list_del_rcu(), so if @head has become empty NULL will
>> be
>> + * returned. Otherwise next element is returned in round-robin fashion.
>> + */
>> +#define list_next_or_null_rcu_rr(pos, head, member) ({                 \
>> +       typeof(pos) ________next = NULL;                                \
>> +                                                                       \
>> +       if (!list_empty(head))                                          \
>> +               ________next = (pos)->member.next != (head) ?           \
>> +                       list_entry_rcu((pos)->member.next,              \
>> +                                      typeof(*pos), member) :          \
>> +                       list_entry_rcu((pos)->member.next->next,        \
>> +                                      typeof(*pos), member);           \
>> +       ________next;                                                   \
>> +})
>
>
> Why is this local to your driver?

Yeah, of course I can try to extend list.h
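
For context, the macro is used to pick the next path under RCU for the
round-robin multipath policy, roughly as below.  The surrounding names
(clt, paths_list) are illustrative, and a real caller would take a
reference on the returned path before leaving the RCU read-side section:

static struct ibtrs_clt_sess *get_next_path_rr(struct ibtrs_clt *clt,
                                               struct ibtrs_clt_sess *cur)
{
        struct ibtrs_clt_sess *next;

        rcu_read_lock();
        /* @cur may already be list_del_rcu()'ed, hence the _or_null variant */
        next = list_next_or_null_rcu_rr(cur, &clt->paths_list, s.entry);
        rcu_read_unlock();

        return next;
}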

>> +
>> +/* See ibtrs-log.h */
>> +#define TYPES_TO_SESSNAME(obj)                                         \
>> +       LIST(CASE(obj, struct ibtrs_clt_sess *, s.sessname),            \
>> +            CASE(obj, struct ibtrs_clt *, sessname))
>> +
>> +#define TAG_SIZE(clt) (sizeof(struct ibtrs_tag) + (clt)->pdu_sz)
>> +#define GET_TAG(clt, idx) ((clt)->tags + TAG_SIZE(clt) * idx)
>
>
> Still don't understand why this is even needed..

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/24] ibtrs: client: sysfs interface functions
  2018-02-05 11:20   ` Sagi Grimberg
@ 2018-02-06 12:28     ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-06 12:28 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

On Mon, Feb 5, 2018 at 12:20 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Roman,
>
>
>> This is the sysfs interface to IBTRS sessions on client side:
>>
>>    /sys/kernel/ibtrs_client/<SESS-NAME>/
>>      *** IBTRS session created by ibtrs_clt_open() API call
>>      |
>>      |- max_reconnect_attempts
>>      |  *** number of reconnect attempts for session
>>      |
>>      |- add_path
>>      |  *** adds another connection path into IBTRS session
>>      |
>>      |- paths/<DEST-IP>/
>>         *** established paths to server in a session
>>         |
>>         |- disconnect
>>         |  *** disconnect path
>>         |
>>         |- reconnect
>>         |  *** reconnect path
>>         |
>>         |- remove_path
>>         |  *** remove current path
>>         |
>>         |- state
>>         |  *** retrieve current path state
>>         |
>>         |- stats/
>>            *** current path statistics
>>            |
>>           |- cpu_migration
>>           |- rdma
>>           |- rdma_lat
>>           |- reconnects
>>           |- reset_all
>>           |- sg_entries
>>           |- wc_completions
>>
>> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
>> Cc: Jack Wang <jinpu.wang@profitbricks.com>
>
>
> I think stats usually belong in debugfs.

I will change that.
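
A minimal sketch of the move, assuming one debugfs directory per path
with plain u64 counters (the ibtrs_dbg_root dentry and the stats fields
below are illustrative, not the current sysfs layout):

#include <linux/debugfs.h>

static struct dentry *ibtrs_dbg_root;   /* created once in module init */

static void ibtrs_clt_create_debugfs(struct ibtrs_clt_sess *sess)
{
        struct dentry *dir;

        /* e.g. /sys/kernel/debug/ibtrs_client/<SESS-NAME>/ */
        dir = debugfs_create_dir(sess->s.sessname, ibtrs_dbg_root);
        if (IS_ERR_OR_NULL(dir))
                return;

        debugfs_create_u64("reconnects", 0444, dir, &sess->stats.reconnects);
        debugfs_create_u64("rdma_lat_max", 0444, dir, &sess->stats.rdma_lat_max);
        debugfs_create_u64("wc_completions", 0444, dir, &sess->stats.wc_comp);
}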

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/24] ibtrs: server: main functionality
  2018-02-05 11:29   ` Sagi Grimberg
@ 2018-02-06 12:46     ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-06 12:46 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

On Mon, Feb 5, 2018 at 12:29 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Roman,
>
> Some comments below.
>
>
> On 02/02/2018 04:08 PM, Roman Pen wrote:
>>
>> This is main functionality of ibtrs-server module, which accepts
>> set of RDMA connections (so called IBTRS session), creates/destroys
>> sysfs entries associated with IBTRS session and notifies upper layer
>> (user of IBTRS API) about RDMA requests or link events.
>>
>> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
>> Cc: Jack Wang <jinpu.wang@profitbricks.com>
>> ---
>>   drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1811
>> ++++++++++++++++++++++++++++++
>>   1 file changed, 1811 insertions(+)
>>
>> diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>> b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>> new file mode 100644
>> index 000000000000..0d1fc08bd821
>> --- /dev/null
>> +++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>> @@ -0,0 +1,1811 @@
>> +/*
>> + * InfiniBand Transport Layer
>> + *
>> + * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
>> + * Authors: Fabian Holler <mail@fholler.de>
>> + *          Jack Wang <jinpu.wang@profitbricks.com>
>> + *          Kleber Souza <kleber.souza@profitbricks.com>
>> + *          Danil Kipnis <danil.kipnis@profitbricks.com>
>> + *          Roman Penyaev <roman.penyaev@profitbricks.com>
>> + *          Milind Dumbare <Milind.dumbare@gmail.com>
>> + *
>> + * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
>> + * Authors: Danil Kipnis <danil.kipnis@profitbricks.com>
>> + *          Roman Penyaev <roman.penyaev@profitbricks.com>
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version 2
>> + * of the License, or (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#undef pr_fmt
>> +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
>> +
>> +#include <linux/module.h>
>> +#include <linux/mempool.h>
>> +
>> +#include "ibtrs-srv.h"
>> +#include "ibtrs-log.h"
>> +
>> +MODULE_AUTHOR("ibnbd@profitbricks.com");
>> +MODULE_DESCRIPTION("IBTRS Server");
>> +MODULE_VERSION(IBTRS_VER_STRING);
>> +MODULE_LICENSE("GPL");
>> +
>> +#define DEFAULT_MAX_IO_SIZE_KB 128
>> +#define DEFAULT_MAX_IO_SIZE (DEFAULT_MAX_IO_SIZE_KB * 1024)
>> +#define MAX_REQ_SIZE PAGE_SIZE
>> +#define MAX_SG_COUNT ((MAX_REQ_SIZE - sizeof(struct ibtrs_msg_rdma_read))
>> \
>> +                     / sizeof(struct ibtrs_sg_desc))
>> +
>> +static int max_io_size = DEFAULT_MAX_IO_SIZE;
>> +static int rcv_buf_size = DEFAULT_MAX_IO_SIZE + MAX_REQ_SIZE;
>> +
>> +static int max_io_size_set(const char *val, const struct kernel_param
>> *kp)
>> +{
>> +       int err, ival;
>> +
>> +       err = kstrtoint(val, 0, &ival);
>> +       if (err)
>> +               return err;
>> +
>> +       if (ival < 4096 || ival + MAX_REQ_SIZE > (4096 * 1024) ||
>> +           (ival + MAX_REQ_SIZE) % 512 != 0) {
>> +               pr_err("Invalid max io size value %d, has to be"
>> +                      " > %d, < %d\n", ival, 4096, 4194304);
>> +               return -EINVAL;
>> +       }
>> +
>> +       max_io_size = ival;
>> +       rcv_buf_size = max_io_size + MAX_REQ_SIZE;
>> +       pr_info("max io size changed to %d\n", ival);
>> +
>> +       return 0;
>> +}
>> +
>> +static const struct kernel_param_ops max_io_size_ops = {
>> +       .set            = max_io_size_set,
>> +       .get            = param_get_int,
>> +};
>> +module_param_cb(max_io_size, &max_io_size_ops, &max_io_size, 0444);
>> +MODULE_PARM_DESC(max_io_size,
>> +                "Max size for each IO request, when change the unit is in
>> byte"
>> +                " (default: " __stringify(DEFAULT_MAX_IO_SIZE_KB) "KB)");
>> +
>> +#define DEFAULT_SESS_QUEUE_DEPTH 512
>> +static int sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
>> +module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
>> +MODULE_PARM_DESC(sess_queue_depth,
>> +                "Number of buffers for pending I/O requests to allocate"
>> +                " per session. Maximum: "
>> __stringify(MAX_SESS_QUEUE_DEPTH)
>> +                " (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
>> +
>> +/* We guarantee to serve 10 paths at least */
>> +#define CHUNK_POOL_SIZE (DEFAULT_SESS_QUEUE_DEPTH * 10)
>> +static mempool_t *chunk_pool;
>> +
>> +static int retry_count = 7;
>> +
>> +static int retry_count_set(const char *val, const struct kernel_param
>> *kp)
>> +{
>> +       int err, ival;
>> +
>> +       err = kstrtoint(val, 0, &ival);
>> +       if (err)
>> +               return err;
>> +
>> +       if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT) {
>> +               pr_err("Invalid retry count value %d, has to be"
>> +                      " > %d, < %d\n", ival, MIN_RTR_CNT, MAX_RTR_CNT);
>> +               return -EINVAL;
>> +       }
>> +
>> +       retry_count = ival;
>> +       pr_info("QP retry count changed to %d\n", ival);
>> +
>> +       return 0;
>> +}
>> +
>> +static const struct kernel_param_ops retry_count_ops = {
>> +       .set            = retry_count_set,
>> +       .get            = param_get_int,
>> +};
>> +module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
>> +
>> +MODULE_PARM_DESC(retry_count, "Number of times to send the message if
>> the"
>> +                " remote side didn't respond with Ack or Nack (default:
>> 3,"
>> +                " min: " __stringify(MIN_RTR_CNT) ", max: "
>> +                __stringify(MAX_RTR_CNT) ")");
>> +
>> +static char cq_affinity_list[256] = "";
>> +static cpumask_t cq_affinity_mask = { CPU_BITS_ALL };
>> +
>> +static void init_cq_affinity(void)
>> +{
>> +       sprintf(cq_affinity_list, "0-%d", nr_cpu_ids - 1);
>> +}
>> +
>> +static int cq_affinity_list_set(const char *val, const struct
>> kernel_param *kp)
>> +{
>> +       int ret = 0, len = strlen(val);
>> +       cpumask_var_t new_value;
>> +
>> +       if (!strlen(cq_affinity_list))
>> +               init_cq_affinity();
>> +
>> +       if (len >= sizeof(cq_affinity_list))
>> +               return -EINVAL;
>> +       if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
>> +               return -ENOMEM;
>> +
>> +       ret = cpulist_parse(val, new_value);
>> +       if (ret) {
>> +               pr_err("Can't set cq_affinity_list \"%s\": %d\n", val,
>> +                      ret);
>> +               goto free_cpumask;
>> +       }
>> +
>> +       strlcpy(cq_affinity_list, val, sizeof(cq_affinity_list));
>> +       *strchrnul(cq_affinity_list, '\n') = '\0';
>> +       cpumask_copy(&cq_affinity_mask, new_value);
>> +
>> +       pr_info("cq_affinity_list changed to %*pbl\n",
>> +               cpumask_pr_args(&cq_affinity_mask));
>> +free_cpumask:
>> +       free_cpumask_var(new_value);
>> +       return ret;
>> +}
>> +
>> +static struct kparam_string cq_affinity_list_kparam_str = {
>> +       .maxlen = sizeof(cq_affinity_list),
>> +       .string = cq_affinity_list
>> +};
>> +
>> +static const struct kernel_param_ops cq_affinity_list_ops = {
>> +       .set    = cq_affinity_list_set,
>> +       .get    = param_get_string,
>> +};
>> +
>> +module_param_cb(cq_affinity_list, &cq_affinity_list_ops,
>> +               &cq_affinity_list_kparam_str, 0644);
>> +MODULE_PARM_DESC(cq_affinity_list, "Sets the list of cpus to use as cq
>> vectors."
>> +                "(default: use all possible CPUs)");
>> +
>
>
> Can you explain why not using configfs?

No reason, will switch.
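
A bare-bones sketch of registering a configfs subsystem for the server
module (the subsystem name is a placeholder and the item types that
would expose the actual attributes are omitted):

#include <linux/configfs.h>

static struct config_item_type ibtrs_srv_subsys_type = {
        .ct_owner       = THIS_MODULE,
};

/* Appears as /sys/kernel/config/ibtrs_server/ after registration */
static struct configfs_subsystem ibtrs_srv_subsys = {
        .su_group = {
                .cg_item = {
                        .ci_namebuf     = "ibtrs_server",
                        .ci_type        = &ibtrs_srv_subsys_type,
                },
        },
};

static int ibtrs_srv_register_configfs(void)
{
        config_group_init(&ibtrs_srv_subsys.su_group);
        mutex_init(&ibtrs_srv_subsys.su_mutex);

        return configfs_register_subsystem(&ibtrs_srv_subsys);
}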

>> +static void ibtrs_srv_close_work(struct work_struct *work)
>> +{
>> +       struct ibtrs_srv_sess *sess;
>> +       struct ibtrs_srv_ctx *ctx;
>> +       struct ibtrs_srv_con *con;
>> +       int i;
>> +
>> +       sess = container_of(work, typeof(*sess), close_work);
>> +       ctx = sess->srv->ctx;
>> +
>> +       ibtrs_srv_destroy_sess_files(sess);
>> +       ibtrs_srv_stop_hb(sess);
>> +
>> +       for (i = 0; i < sess->s.con_num; i++) {
>> +               con = to_srv_con(sess->s.con[i]);
>> +               if (!con)
>> +                       continue;
>> +
>> +               rdma_disconnect(con->c.cm_id);
>> +               ib_drain_qp(con->c.qp);
>> +       }
>> +       /* Wait for all inflights */
>> +       ibtrs_srv_wait_ops_ids(sess);
>> +
>> +       /* Notify upper layer if we are the last path */
>> +       ibtrs_srv_sess_down(sess);
>> +
>> +       unmap_cont_bufs(sess);
>> +       ibtrs_srv_free_ops_ids(sess);
>> +
>> +       for (i = 0; i < sess->s.con_num; i++) {
>> +               con = to_srv_con(sess->s.con[i]);
>> +               if (!con)
>> +                       continue;
>> +
>> +               ibtrs_cq_qp_destroy(&con->c);
>> +               rdma_destroy_id(con->c.cm_id);
>> +               kfree(con);
>> +       }
>> +       ibtrs_ib_dev_put(sess->s.ib_dev);
>> +
>> +       del_path_from_srv(sess);
>> +       put_srv(sess->srv);
>> +       sess->srv = NULL;
>> +       ibtrs_srv_change_state(sess, IBTRS_SRV_CLOSED);
>> +
>> +       kfree(sess->rdma_addr);
>> +       kfree(sess->s.con);
>> +       kfree(sess);
>> +}
>> +
>> +static int ibtrs_rdma_do_accept(struct ibtrs_srv_sess *sess,
>> +                               struct rdma_cm_id *cm_id)
>> +{
>> +       struct ibtrs_srv *srv = sess->srv;
>> +       struct ibtrs_msg_conn_rsp msg;
>> +       struct rdma_conn_param param;
>> +       int err;
>> +
>> +       memset(&param, 0, sizeof(param));
>> +       param.retry_count = retry_count;
>> +       param.rnr_retry_count = 7;
>> +       param.private_data = &msg;
>> +       param.private_data_len = sizeof(msg);
>> +
>> +       memset(&msg, 0, sizeof(msg));
>> +       msg.magic = cpu_to_le16(IBTRS_MAGIC);
>> +       msg.version = cpu_to_le16(IBTRS_VERSION);
>> +       msg.errno = 0;
>> +       msg.queue_depth = cpu_to_le16(srv->queue_depth);
>> +       msg.rkey = cpu_to_le32(sess->s.ib_dev->rkey);
>
>
> As said, this cannot happen anymore...

We've already planned to change that.  Would be interesting to see
the performance results.

>> +static struct rdma_cm_id *ibtrs_srv_cm_init(struct ibtrs_srv_ctx *ctx,
>> +                                           struct sockaddr *addr,
>> +                                           enum rdma_port_space ps)
>> +{
>> +       struct rdma_cm_id *cm_id;
>> +       int ret;
>> +
>> +       cm_id = rdma_create_id(&init_net, ibtrs_srv_rdma_cm_handler,
>> +                              ctx, ps, IB_QPT_RC);
>> +       if (IS_ERR(cm_id)) {
>> +               ret = PTR_ERR(cm_id);
>> +               pr_err("Creating id for RDMA connection failed, err:
>> %d\n",
>> +                      ret);
>> +               goto err_out;
>> +       }
>> +       ret = rdma_bind_addr(cm_id, addr);
>> +       if (ret) {
>> +               pr_err("Binding RDMA address failed, err: %d\n", ret);
>> +               goto err_cm;
>> +       }
>> +       ret = rdma_listen(cm_id, 64);
>> +       if (ret) {
>> +               pr_err("Listening on RDMA connection failed, err: %d\n",
>> +                      ret);
>> +               goto err_cm;
>> +       }
>> +
>> +       switch (addr->sa_family) {
>> +       case AF_INET:
>> +               pr_debug("listening on port %u\n",
>> +                        ntohs(((struct sockaddr_in *)addr)->sin_port));
>> +               break;
>> +       case AF_INET6:
>> +               pr_debug("listening on port %u\n",
>> +                        ntohs(((struct sockaddr_in6 *)addr)->sin6_port));
>> +               break;
>> +       case AF_IB:
>> +               pr_debug("listening on service id 0x%016llx\n",
>> +                        be64_to_cpu(rdma_get_service_id(cm_id, addr)));
>> +               break;
>> +       default:
>> +               pr_debug("listening on address family %u\n",
>> addr->sa_family);
>> +       }
>
>
> We already have printk that accepts address format...

Nice, thanks.
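
Right - for the AF_INET/AF_INET6 cases the whole switch statement above
collapses into a single line with the %pIS printk extension (the AF_IB
branch would keep its service-id message):

        pr_debug("listening on %pISp\n", addr);   /* address + port */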

>> +
>> +       return cm_id;
>> +
>> +err_cm:
>> +       rdma_destroy_id(cm_id);
>> +err_out:
>> +
>> +       return ERR_PTR(ret);
>> +}
>> +
>> +static int ibtrs_srv_rdma_init(struct ibtrs_srv_ctx *ctx, unsigned int
>> port)
>> +{
>> +       struct sockaddr_in6 sin = {
>> +               .sin6_family    = AF_INET6,
>> +               .sin6_addr      = IN6ADDR_ANY_INIT,
>> +               .sin6_port      = htons(port),
>> +       };
>> +       struct sockaddr_ib sib = {
>> +               .sib_family                     = AF_IB,
>> +               .sib_addr.sib_subnet_prefix     = 0ULL,
>> +               .sib_addr.sib_interface_id      = 0ULL,
>> +               .sib_sid        = cpu_to_be64(RDMA_IB_IP_PS_IB | port),
>> +               .sib_sid_mask   = cpu_to_be64(0xffffffffffffffffULL),
>> +               .sib_pkey       = cpu_to_be16(0xffff),
>> +       };
>
>
> ipv4?

Binding sockaddr_in6 to the IPv6 any-address also accepts IPv4
connections, so a single listener covers both.

>> +       struct rdma_cm_id *cm_ip, *cm_ib;
>> +       int ret;
>> +
>> +       /*
>> +        * We accept both IPoIB and IB connections, so we need to keep
>> +        * two cm id's, one for each socket type and port space.
>> +        * If the cm initialization of one of the id's fails, we abort
>> +        * everything.
>> +        */
>> +       cm_ip = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sin,
>> RDMA_PS_TCP);
>> +       if (unlikely(IS_ERR(cm_ip)))
>> +               return PTR_ERR(cm_ip);
>> +
>> +       cm_ib = ibtrs_srv_cm_init(ctx, (struct sockaddr *)&sib,
>> RDMA_PS_IB);
>> +       if (unlikely(IS_ERR(cm_ib))) {
>> +               ret = PTR_ERR(cm_ib);
>> +               goto free_cm_ip;
>> +       }
>> +
>> +       ctx->cm_id_ip = cm_ip;
>> +       ctx->cm_id_ib = cm_ib;
>> +
>> +       return 0;
>> +
>> +free_cm_ip:
>> +       rdma_destroy_id(cm_ip);
>> +
>> +       return ret;
>> +}
>> +
>> +static struct ibtrs_srv_ctx *alloc_srv_ctx(rdma_ev_fn *rdma_ev,
>> +                                          link_ev_fn *link_ev)
>> +{
>> +       struct ibtrs_srv_ctx *ctx;
>> +
>> +       ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +       if (!ctx)
>> +               return NULL;
>> +
>> +       ctx->rdma_ev = rdma_ev;
>> +       ctx->link_ev = link_ev;
>> +       mutex_init(&ctx->srv_mutex);
>> +       INIT_LIST_HEAD(&ctx->srv_list);
>> +
>> +       return ctx;
>> +}
>> +
>> +static void free_srv_ctx(struct ibtrs_srv_ctx *ctx)
>> +{
>> +       WARN_ON(!list_empty(&ctx->srv_list));
>> +       kfree(ctx);
>> +}
>> +
>> +struct ibtrs_srv_ctx *ibtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn
>> *link_ev,
>> +                                    unsigned int port)
>> +{
>> +       struct ibtrs_srv_ctx *ctx;
>> +       int err;
>> +
>> +       ctx = alloc_srv_ctx(rdma_ev, link_ev);
>> +       if (unlikely(!ctx))
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       err = ibtrs_srv_rdma_init(ctx, port);
>> +       if (unlikely(err)) {
>> +               free_srv_ctx(ctx);
>> +               return ERR_PTR(err);
>> +       }
>> +       /* Do not let module be unloaded if server context is alive */
>> +       __module_get(THIS_MODULE);
>> +
>> +       return ctx;
>> +}
>> +EXPORT_SYMBOL(ibtrs_srv_open);
>> +
>> +void ibtrs_srv_queue_close(struct ibtrs_srv_sess *sess)
>> +{
>> +       close_sess(sess);
>> +}
>> +
>> +static void close_sess(struct ibtrs_srv_sess *sess)
>> +{
>> +       enum ibtrs_srv_state old_state;
>> +
>> +       if (ibtrs_srv_change_state_get_old(sess, IBTRS_SRV_CLOSING,
>> +                                          &old_state))
>> +               queue_work(ibtrs_wq, &sess->close_work);
>> +       WARN_ON(sess->state != IBTRS_SRV_CLOSING);
>> +}
>> +
>> +static void close_sessions(struct ibtrs_srv *srv)
>> +{
>> +       struct ibtrs_srv_sess *sess;
>> +
>> +       mutex_lock(&srv->paths_mutex);
>> +       list_for_each_entry(sess, &srv->paths_list, s.entry)
>> +               close_sess(sess);
>> +       mutex_unlock(&srv->paths_mutex);
>> +}
>> +
>> +static void close_ctx(struct ibtrs_srv_ctx *ctx)
>> +{
>> +       struct ibtrs_srv *srv;
>> +
>> +       mutex_lock(&ctx->srv_mutex);
>> +       list_for_each_entry(srv, &ctx->srv_list, ctx_list)
>> +               close_sessions(srv);
>> +       mutex_unlock(&ctx->srv_mutex);
>> +       flush_workqueue(ibtrs_wq);
>> +}
>> +
>> +void ibtrs_srv_close(struct ibtrs_srv_ctx *ctx)
>> +{
>> +       rdma_destroy_id(ctx->cm_id_ip);
>> +       rdma_destroy_id(ctx->cm_id_ib);
>> +       close_ctx(ctx);
>> +       free_srv_ctx(ctx);
>> +       module_put(THIS_MODULE);
>> +}
>> +EXPORT_SYMBOL(ibtrs_srv_close);
>> +
>> +static int check_module_params(void)
>> +{
>> +       if (sess_queue_depth < 1 || sess_queue_depth >
>> MAX_SESS_QUEUE_DEPTH) {
>> +               pr_err("Invalid sess_queue_depth parameter value\n");
>> +               return -EINVAL;
>> +       }
>> +
>> +       /* check if IB immediate data size is enough to hold the mem_id
>> and the
>> +        * offset inside the memory chunk
>> +        */
>> +       if (ilog2(sess_queue_depth - 1) + ilog2(rcv_buf_size - 1) >
>> +           MAX_IMM_PAYL_BITS) {
>> +               pr_err("RDMA immediate size (%db) not enough to encode "
>> +                      "%d buffers of size %dB. Reduce 'sess_queue_depth'
>> "
>> +                      "or 'max_io_size' parameters.\n",
>> MAX_IMM_PAYL_BITS,
>> +                      sess_queue_depth, rcv_buf_size);
>> +               return -EINVAL;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int __init ibtrs_server_init(void)
>> +{
>> +       int err;
>> +
>> +       if (!strlen(cq_affinity_list))
>> +               init_cq_affinity();
>> +
>> +       pr_info("Loading module %s, version: %s "
>> +               "(retry_count: %d, cq_affinity_list: %s, "
>> +               "max_io_size: %d, sess_queue_depth: %d)\n",
>> +               KBUILD_MODNAME, IBTRS_VER_STRING, retry_count,
>> +               cq_affinity_list, max_io_size, sess_queue_depth);
>> +
>> +       err = check_module_params();
>> +       if (err) {
>> +               pr_err("Failed to load module, invalid module parameters,"
>> +                      " err: %d\n", err);
>> +               return err;
>> +       }
>> +       chunk_pool = mempool_create_page_pool(CHUNK_POOL_SIZE,
>> +                                             get_order(rcv_buf_size));
>> +       if (unlikely(!chunk_pool)) {
>> +               pr_err("Failed preallocate pool of chunks\n");
>> +               return -ENOMEM;
>> +       }
>> +       ibtrs_wq = alloc_workqueue("ibtrs_server_wq", WQ_MEM_RECLAIM, 0);
>> +       if (!ibtrs_wq) {
>> +               pr_err("Failed to load module, alloc ibtrs_server_wq
>> failed\n");
>> +               goto out_chunk_pool;
>> +       }
>> +       err = ibtrs_srv_create_sysfs_module_files();
>> +       if (err) {
>> +               pr_err("Failed to load module, can't create sysfs files,"
>> +                      " err: %d\n", err);
>> +               goto out_ibtrs_wq;
>> +       }
>> +
>> +       return 0;
>> +
>> +out_ibtrs_wq:
>> +       destroy_workqueue(ibtrs_wq);
>> +out_chunk_pool:
>> +       mempool_destroy(chunk_pool);
>> +
>> +       return err;
>> +}
>> +
>> +static void __exit ibtrs_server_exit(void)
>> +{
>> +       ibtrs_srv_destroy_sysfs_module_files();
>> +       destroy_workqueue(ibtrs_wq);
>> +       mempool_destroy(chunk_pool);
>> +}
>> +
>> +module_init(ibtrs_server_init);
>> +module_exit(ibtrs_server_exit);
>>
>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 12:16 ` Sagi Grimberg
  2018-02-05 12:30   ` Sagi Grimberg
  2018-02-05 16:58   ` Bart Van Assche
@ 2018-02-06 13:12   ` Roman Penyaev
  2018-02-06 16:01     ` Bart Van Assche
  2 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-06 13:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

Hi Sagi,

On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Roman and the team,
>
> On 02/02/2018 04:08 PM, Roman Pen wrote:
>>
>> This series introduces IBNBD/IBTRS modules.
>>
>> IBTRS (InfiniBand Transport) is a reliable high speed transport library
>> which allows for establishing connection between client and server
>> machines via RDMA.
>
>
> So its not strictly infiniband correct?

This is RDMA.  The original IB prefix is a bit confusing, that's true.

>  It is optimized to transfer (read/write) IO blocks
>>
>> in the sense that it follows the BIO semantics of providing the
>> possibility to either write data from a scatter-gather list to the
>> remote side or to request ("read") data transfer from the remote side
>> into a given set of buffers.
>>
>> IBTRS is multipath capable and provides I/O fail-over and load-balancing
>> functionality.
>
>
> Couple of questions on your multipath implementation?
> 1. What was your main objective over dm-multipath?

No objections, mpath is a part of the transport ibtrs library.

> 2. What was the consideration of this implementation over
> creating a stand-alone bio based device node to reinject the
> bio to the original block device?

ibnbd and ibtrs are separate; on fail-over or load-balancing we work
with IO requests inside the library.

>> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
>> (client and server) that allow for remote access of a block device on
>> the server over IBTRS protocol. After being mapped, the remote block
>> devices can be accessed on the client side as local block devices.
>> Internally IBNBD uses IBTRS as an RDMA transport library.
>>
>> Why?
>>
>>     - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>>       thus internal protocol is simple and consists of several request
>>          types only without awareness of underlaying hardware devices.
>
>
> Can you explain how the protocol is developed for thin-p? What are the
> essence of how its suited for it?

Here I wanted to emphasize that we do not support any HW commands, as
NVMe does; the internal protocol consists of only a few request types.
So, answering your question "how the protocol is developed for thin-p",
I would put it the other way around: the protocol does nothing to
support real devices, because all we need is to map thin-provisioned
volumes.  It is just simpler.

>>     - IBTRS was developed as an independent RDMA transport library, which
>>       supports fail-over and load-balancing policies using multipath, thus
>>          it can be used for any other IO needs rather than only for block
>>          device.
>
>
> What do you mean by "any other IO"?

I mean other IO producers, not only ibnbd, since this is just a transport
library.

>
>>     - IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
>>       https://www.spinics.net/lists/linux-rdma/msg48799.html
>>       (I retested on latest 4.14 kernel - there is no any significant
>>           difference, thus I post the old link).
>
>
> That is interesting to learn.
>
> Reading your reference brings a couple of questions though,
> - Its unclear to me how ibnbd performs reads without performing memory
>   registration. Is it using the global dma rkey?

Yes, the global rkey.

WRITE: the client RDMA-writes the data to the server.
READ:  the server RDMA-writes the data back to the client.

> - Its unclear to me how there is a difference in noreg in writes,
>   because for small writes nvme-rdma never register memory (it uses
>   inline data).

There is no support for inline data in IBTRS.

> - Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, that
>   seems considerably low against other reports. Can you try and explain
>   what was the bottleneck? This can be a potential bug and I (and the
>   rest of the community is interesting in knowing more details).

Sure, I can try.  BTW, what are the other reports and numbers?

> - srp/scst comparison is really not fair having it in legacy request
>   mode. Can you please repeat it and report a bug to either linux-rdma
>   or to the scst mailing list?

Yep, I can retest with mq.

> - Your latency measurements are surprisingly high for a null target
>   device (even for low end nvme device actually) regardless of the
>   transport implementation.

Hm, network configuration?  These are results on machines dedicated
to our team for testing in one of our datacenters. Nothing special
in configuration.

> For example:
> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>   fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>   and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>   latency I got ~14 us. So something does not add up here. If this is
>   not some configuration issue, then we have serious bugs to handle..
>
> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>   troubles understanding how you were able to get such high latencies
>   (> 100 ms for QD>=100)

What does QD stand for?  Queue depth?  This is not a queue depth; it is
the number of dedicated fio jobs.

And regarding latencies: the only thing I can suspect is the network
configuration.


> Can you share more information about your setup? It would really help
> us understand more.

Everything is specified in the Google sheet.  You can also download the
fio files; the links are provided at the bottom.

https://www.spinics.net/lists/linux-rdma/msg48799.html


[1] FIO runner and results extractor script:
    https://drive.google.com/open?id=0B8_SivzwHdgSS2RKcmc4bWg0YjA

[2] Archive with FIO configurations and results:
    https://drive.google.com/open?id=0B8_SivzwHdgSaDlhMXV6THhoRXc

[3] Google sheet with performance measurements:
    https://drive.google.com/open?id=1sCTBKLA5gbhhkgd2USZXY43VL3zLidzdqDeObZn9Edc

[4] NVMEoF configuration:
    https://drive.google.com/open?id=0B8_SivzwHdgSTzRjbGtmaVR6LWM

[5] SCST configuration:
    https://drive.google.com/open?id=0B8_SivzwHdgSM1B5eGpKWmFJMFk



--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-06  9:44               ` Danil Kipnis
@ 2018-02-06 15:35                 ` Bart Van Assche
  0 siblings, 0 replies; 79+ messages in thread
From: Bart Van Assche @ 2018-02-06 15:35 UTC (permalink / raw)
  To: danil.kipnis
  Cc: linux-block, hch, linux-rdma, roman.penyaev, jinpu.wang, sagi,
	ogerlitz, axboe

On Tue, 2018-02-06 at 10:44 +0100, Danil Kipnis wrote:
> the configuration (which devices can be accessed by a particular
> client) can happen also after the kernel target module is loaded. The
> directory in <dev_search_path> is a module parameter and is fixed. It
> contains for example "/ibnbd_devices/". But a particular client X
> would be able to only access the devices located in the subdirectory
> "/ibnbd_devices/client_x/". (The sessionname here is client_x) One can
> add or remove the devices from that directory (those are just symlinks
> to /dev/xxx) at any time - before or after the server module is
> loaded. But you are right, we need something additional in order to be
> able to specify which devices a client can access writable and which
> readonly. May be another subdirectories "wr" and "ro" for each client:
> those under /ibnbd_devices/client_x/ro/ can only be read by client_x
> and those in /ibnbd_devices/client_x/wr/ can also be written to?

Please use a standard kernel filesystem (sysfs or configfs) instead of
reinventing it.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-06 13:12   ` Roman Penyaev
@ 2018-02-06 16:01     ` Bart Van Assche
  2018-02-07 12:57       ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-06 16:01 UTC (permalink / raw)
  To: roman.penyaev, sagi
  Cc: linux-block, hch, linux-rdma, jinpu.wang, ogerlitz,
	Bart Van Assche, axboe, danil.kipnis

On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
> On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> > [ ... ]
> > - srp/scst comparison is really not fair having it in legacy request
> >   mode. Can you please repeat it and report a bug to either linux-rdma
> >   or to the scst mailing list?
> 
> Yep, I can retest with mq.
> 
> > - Your latency measurements are surprisingly high for a null target
> >   device (even for low end nvme device actually) regardless of the
> >   transport implementation.
> 
> Hm, network configuration?  These are results on machines dedicated
> to our team for testing in one of our datacenters. Nothing special
> in configuration.

Hello Roman,

I agree that the latency numbers are way too high for a null target device.
Last time I measured latency for the SRP protocol against an SCST target
+ null block driver at the target side and ConnectX-3 adapters I measured a
latency of about 14 microseconds. That's almost 100 times less than the
measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html.

Something else I would like to understand better is how much of the latency
gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
protocol. Was e.g. support for immediate data present in the NVMeOF and/or
SRP drivers used on your test setup? Are you aware that the NVMeOF target
driver calls page_alloc() from the hot path but that there are plans to
avoid these calls in the hot path by using a caching mechanism similar to
the SGV cache in SCST? Are you aware that a significant latency reduction
can be achieved by changing the SCST SGV cache from a global into a per-CPU
cache? Regarding the SRP measurements: have you tried to set the
never_register kernel module parameter to true? I'm asking this because I
think that mode is most similar to how the IBNBD initiator driver works.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules
  2018-02-06 12:01     ` Roman Penyaev
@ 2018-02-06 16:10       ` Jason Gunthorpe
  2018-02-07 10:34         ` Roman Penyaev
  0 siblings, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2018-02-06 16:10 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Sagi Grimberg, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Bart Van Assche, Or Gerlitz, Danil Kipnis,
	Jack Wang

On Tue, Feb 06, 2018 at 01:01:23PM +0100, Roman Penyaev wrote:

> >> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device
> >> *dev)
> >> +{
> >> +       int err;
> >> +
> >> +       d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
> >> +       if (IS_ERR(d->pd))
> >> +               return PTR_ERR(d->pd);
> >> +       d->dev = dev;
> >> +       d->lkey = d->pd->local_dma_lkey;
> >> +       d->rkey = d->pd->unsafe_global_rkey;
> >> +
> >> +       err = ibtrs_query_device(d);
> >> +       if (unlikely(err))
> >> +               ib_dealloc_pd(d->pd);
> >> +
> >> +       return err;
> >> +}
> >
> >
> > I must say that this makes me frustrated.. We stopped doing these
> > sort of things long time ago. No way we can even consider accepting
> > the unsafe use of the global rkey exposing the entire memory space for
> > remote access permissions.
> >
> > Sorry for being blunt, but this protocol design which makes a concious
> > decision to expose unconditionally is broken by definition.
> 
> I suppose we can also afford the same trick which nvme does: provide
> register_always module argument, can we?  That can be also interesting
> to measure the performance difference.

I can be firmer than Sagi - new code that has IB_PD_UNSAFE_GLOBAL_RKEY
at can not be accepted upstream.

Once you get that fixed, you should go and read my past comments on
how to properly order dma mapping and completions and fix that
too. Then redo your performance..

Jason

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules
  2018-02-06 16:10       ` Jason Gunthorpe
@ 2018-02-07 10:34         ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-07 10:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Bart Van Assche, Or Gerlitz, Danil Kipnis,
	Jack Wang

On Tue, Feb 6, 2018 at 5:10 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Feb 06, 2018 at 01:01:23PM +0100, Roman Penyaev wrote:
>
>> >> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device
>> >> *dev)
>> >> +{
>> >> +       int err;
>> >> +
>> >> +       d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
>> >> +       if (IS_ERR(d->pd))
>> >> +               return PTR_ERR(d->pd);
>> >> +       d->dev = dev;
>> >> +       d->lkey = d->pd->local_dma_lkey;
>> >> +       d->rkey = d->pd->unsafe_global_rkey;
>> >> +
>> >> +       err = ibtrs_query_device(d);
>> >> +       if (unlikely(err))
>> >> +               ib_dealloc_pd(d->pd);
>> >> +
>> >> +       return err;
>> >> +}
>> >
>> >
>> > I must say that this makes me frustrated.. We stopped doing these
>> > sort of things long time ago. No way we can even consider accepting
>> > the unsafe use of the global rkey exposing the entire memory space for
>> > remote access permissions.
>> >
>> > Sorry for being blunt, but this protocol design which makes a concious
>> > decision to expose unconditionally is broken by definition.
>>
>> I suppose we can also afford the same trick which nvme does: provide
>> register_always module argument, can we?  That can be also interesting
>> to measure the performance difference.
>
> I can be firmer than Sagi - new code that has IB_PD_UNSAFE_GLOBAL_RKEY
> at can not be accepted upstream.
>
> Once you get that fixed, you should go and read my past comments on
> how to properly order dma mapping and completions and fix that
> too. Then redo your performance..

Clear.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-06 16:01     ` Bart Van Assche
@ 2018-02-07 12:57       ` Roman Penyaev
  2018-02-07 16:35         ` Bart Van Assche
  0 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-07 12:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: sagi, linux-block, hch, linux-rdma, jinpu.wang, ogerlitz, axboe,
	danil.kipnis

On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
>> On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> > [ ... ]
>> > - srp/scst comparison is really not fair having it in legacy request
>> >   mode. Can you please repeat it and report a bug to either linux-rdma
>> >   or to the scst mailing list?
>>
>> Yep, I can retest with mq.
>>
>> > - Your latency measurements are surprisingly high for a null target
>> >   device (even for low end nvme device actually) regardless of the
>> >   transport implementation.
>>
>> Hm, network configuration?  These are results on machines dedicated
>> to our team for testing in one of our datacenters. Nothing special
>> in configuration.
>

Hello Bart,

> I agree that the latency numbers are way too high for a null target device.
> Last time I measured latency for the SRP protocol against an SCST target
> + null block driver at the target side and ConnectX-3 adapters I measured a
> latency of about 14 microseconds. That's almost 100 times less than the
> measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html.

Here is the configuration of the setup:

Initiator and target HW configuration:
    AMD Opteron 6386 SE, 64CPU, 128Gb
    InfiniBand: Mellanox Technologies MT26428
                [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]

Also, I remember that between initiator and target there were two IB switches.
Unfortunately, I can't repeat the same configuration, but will retest as
soon as we get new HW.

> Something else I would like to understand better is how much of the latency
> gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
> protocol. Was e.g. support for immediate data present in the NVMeOF and/or
> SRP drivers used on your test setup?

I did not get the question. IBTRS uses empty messages with only imm_data
field set to respond on IO. This is a part of the IBTRS protocol.  I do
not understand how can immediate data be present in other drivers, if
those do not use it in their protocols.  I am lost here.

> Are you aware that the NVMeOF target driver calls page_alloc() from the hot path but that there are plans to
> avoid these calls in the hot path by using a caching mechanism similar to
> the SGV cache in SCST? Are you aware that a significant latency reduction
> can be achieved by changing the SCST SGV cache from a global into a per-CPU
> cache?

No, I am not aware. That is nice, that there is a lot of room for performance
tweaks. I will definitely retest on fresh kernel once everything is done on
nvme, scst or ibtrs (especially when we get rid of fmrs and UNSAFE rkeys).
Maybe there are some other parameters which can be also tweaked?

> Regarding the SRP measurements: have you tried to set the
> never_register kernel module parameter to true? I'm asking this because I
> think that mode is most similar to how the IBNBD initiator driver works.

yes, according to my notes from that link (frankly, I do not remember,
but that is what I wrote 1 year ago):

    * Where suffixes mean:

     _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded
              with 'register_always=N' param

That what you are asking, right?

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 12:30   ` Sagi Grimberg
@ 2018-02-07 13:06     ` Roman Penyaev
  0 siblings, 0 replies; 79+ messages in thread
From: Roman Penyaev @ 2018-02-07 13:06 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Or Gerlitz, Danil Kipnis, Jack Wang

Hi Sagi and all,

On Mon, Feb 5, 2018 at 1:30 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Roman and the team (again), replying to my own email :)
>
> I forgot to mention that first of all thank you for upstreaming
> your work! I fully support your goal to have your production driver
> upstream to minimize your maintenance efforts. I hope that my
> feedback didn't came across with a different impression, that was
> certainly not my intent.

Well, I've just recovered from two heart attacks, which I got
while reading your replies, but now I am fine, thanks :)

> It would be great if you can address and/or reply to my feedback
> (as well as others) and re-spin it again.

Jokes aside, we would like to thank you all for the valuable feedback.
I got a lot of useful remarks from you, Sagi, and from you, Bart.  We
will try to cover them in the next version and will provide up-to-date
comparison results.

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-05 16:16     ` Bart Van Assche
  2018-02-05 16:36       ` Jinpu Wang
@ 2018-02-07 16:35       ` Christopher Lameter
  2018-02-07 17:18         ` Roman Penyaev
  1 sibling, 1 reply; 79+ messages in thread
From: Christopher Lameter @ 2018-02-07 16:35 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jinpu.wang, linux-block, hch, linux-rdma, roman.penyaev, sagi,
	ogerlitz, axboe, danil.kipnis

On Mon, 5 Feb 2018, Bart Van Assche wrote:

> That approach may work well for your employer but sorry I don't think this is
> sufficient for an upstream driver. I think that most users who configure a
> network storage target expect full control over which storage devices are exported
> and also over which clients do have and do not have access.

Well is that actually true for IPoIB? It seems that I can arbitrarily
attach to any partition I want without access control. In many ways some
of the RDMA layers and modules are loose with security since performance
is what matters mostly and deployments occur in separate production
environments.

We have had security issues (that are not fully resolved yet) with the
RDMA RPC API for years.  So maybe let's relax the security requirements
a bit?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-07 12:57       ` Roman Penyaev
@ 2018-02-07 16:35         ` Bart Van Assche
  0 siblings, 0 replies; 79+ messages in thread
From: Bart Van Assche @ 2018-02-07 16:35 UTC (permalink / raw)
  To: roman.penyaev
  Cc: linux-block, hch, linux-rdma, jinpu.wang, sagi, ogerlitz, axboe,
	danil.kipnis

T24gV2VkLCAyMDE4LTAyLTA3IGF0IDEzOjU3ICswMTAwLCBSb21hbiBQZW55YWV2IHdyb3RlOg0K
PiBPbiBUdWUsIEZlYiA2LCAyMDE4IGF0IDU6MDEgUE0sIEJhcnQgVmFuIEFzc2NoZSA8QmFydC5W
YW5Bc3NjaGVAd2RjLmNvbT4gd3JvdGU6DQo+ID4gT24gVHVlLCAyMDE4LTAyLTA2IGF0IDE0OjEy
ICswMTAwLCBSb21hbiBQZW55YWV2IHdyb3RlOg0KPiA+IFNvbWV0aGluZyBlbHNlIEkgd291bGQg
bGlrZSB0byB1bmRlcnN0YW5kIGJldHRlciBpcyBob3cgbXVjaCBvZiB0aGUgbGF0ZW5jeQ0KPiA+
IGdhcCBiZXR3ZWVuIE5WTWVPRi9TUlAgYW5kIElCTkJEIGNhbiBiZSBjbG9zZWQgd2l0aG91dCBj
aGFuZ2luZyB0aGUgd2lyZQ0KPiA+IHByb3RvY29sLiBXYXMgZS5nLiBzdXBwb3J0IGZvciBpbW1l
> > ... immediate data present in the NVMeOF and/or
> > SRP drivers used on your test setup?
>
> I did not get the question. IBTRS uses empty messages with only imm_data
> field set to respond on IO. This is a part of the IBTRS protocol.  I do
> not understand how can immediate data be present in other drivers, if
> those do not use it in their protocols.  I am lost here.

With "immediate data" I was referring to including the entire write buffer
in the write PDU itself. See e.g. the enable_imm_data kernel module parameter
of the ib_srp-backport driver. See also the use of SRP_DATA_DESC_IMM in the
SCST ib_srpt target driver. Neither the upstream SRP initiator nor the upstream
SRP target support immediate data today. However, sending that code upstream
is on my to-do list.

For the upstream NVMeOF initiator and target drivers, see also the call of
nvme_rdma_map_sg_inline() in nvme_rdma_map_data().

> > Are you aware that the NVMeOF target driver calls page_alloc() from the hot path but that there are plans to
> > avoid these calls in the hot path by using a caching mechanism similar to
> > the SGV cache in SCST? Are you aware that a significant latency reduction
> > can be achieved by changing the SCST SGV cache from a global into a per-CPU
> > cache?
>
> No, I am not aware. That is nice, that there is a lot of room for performance
> tweaks. I will definitely retest on fresh kernel once everything is done on
> nvme, scst or ibtrs (especially when we get rid of fmrs and UNSAFE rkeys).

Recently the functions sgl_alloc() and sgl_free() were introduced in the upstream
kernel (these will be included in kernel v4.16). The NVMe target driver, LIO and
several other drivers have been modified to use these functions instead of their
own copy of that function. The next step is to replace these function calls by
calls to functions that perform cached allocations.

> > Regarding the SRP measurements: have you tried to set the
> > never_register kernel module parameter to true? I'm asking this because I
> > think that mode is most similar to how the IBNBD initiator driver works.
>
> yes, according to my notes from that link (frankly, I do not remember,
> but that is what I wrote 1 year ago):
>
>     * Where suffixes mean:
>
>      _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded
>               with 'register_always=N' param
>
> That what you are asking, right?

Not really. With register_always=Y memory registration is always used by the
SRP initiator, even if the data can be coalesced into a single sg entry. With
register_always=N memory registration is only performed if multiple sg entries
are needed to describe the data. And with never_register=Y memory registration
is not used even if multiple sg entries are needed to describe the data buffer.

Thanks,

Bart.
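For readers not familiar with the sgl_alloc()/sgl_free() helpers mentioned
earlier in this message, a minimal sketch of how a target driver might use
them (the wrapper function names are illustrative only, not actual kernel
code; the cached-allocation variant discussed above does not exist yet at
the time of this thread):

#include <linux/gfp.h>
#include <linux/scatterlist.h>

/*
 * Sketch only: allocate a scatterlist covering 'length' bytes for an
 * incoming IO, as the NVMe target and LIO do since the sgl_alloc()
 * helpers were merged for v4.16.
 */
static struct scatterlist *example_alloc_io_sgl(unsigned long long length,
						unsigned int *nents)
{
	/* One call allocates both the backing pages and the scatterlist. */
	return sgl_alloc(length, GFP_KERNEL, nents);
}

static void example_free_io_sgl(struct scatterlist *sgl)
{
	/* Frees the pages referenced by the entries and the scatterlist itself. */
	sgl_free(sgl);
}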

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-07 16:35       ` Christopher Lameter
@ 2018-02-07 17:18         ` Roman Penyaev
  2018-02-07 17:32           ` Bart Van Assche
  0 siblings, 1 reply; 79+ messages in thread
From: Roman Penyaev @ 2018-02-07 17:18 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Bart Van Assche, jinpu.wang, linux-block, hch, linux-rdma, sagi,
	ogerlitz, axboe, danil.kipnis

On Wed, Feb 7, 2018 at 5:35 PM, Christopher Lameter <cl@linux.com> wrote:
> On Mon, 5 Feb 2018, Bart Van Assche wrote:
>
>> That approach may work well for your employer but sorry I don't think this is
>> sufficient for an upstream driver. I think that most users who configure a
>> network storage target expect full control over which storage devices are exported
>> and also over which clients do have and do not have access.
>
> Well is that actually true for IPoIB? It seems that I can arbitrarily
> attach to any partition I want without access control. In many ways some
> of the RDMA layers and modules are loose with security since performance
> is what matters mostly and deployments occur in separate production
> environments.
>
> We have had security issues (that not fully resolved yet) with the RDMA
> RPC API for years.. So maybe lets relax on the security requirements a
> bit?
>

Frankly speaking, I do not understand the "security" concerns around this kind
of block device and RDMA in particular.  I admit that I personally do not see
the whole picture, so can someone provide a real use case/scenario?
What we have in our datacenters is a trusted environment (do others exist?).
You need a volume, you create it.  You need to map a volume remotely -
you map it.  Of course there are provisioning checks, rw/ro checks, etc.
But in general any IP/key checks (is that client really a "good" guy or not?)
are simply useless.  So the question is: are there real life setups where
some of the local IB network members can be untrusted?

--
Roman

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-07 17:18         ` Roman Penyaev
@ 2018-02-07 17:32           ` Bart Van Assche
  2018-02-08 17:38             ` Danil Kipnis
  0 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-07 17:32 UTC (permalink / raw)
  To: roman.penyaev, cl
  Cc: linux-block, hch, linux-rdma, jinpu.wang, sagi, ogerlitz, axboe,
	danil.kipnis

On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote:
> So the question is: are there real life setups where
> some of the local IB network members can be untrusted?

Hello Roman,

You may want to read more about the latest evolutions with regard to network
security. An article that I can recommend is the following: "Google reveals
own security regime policy trusts no network, anywhere, ever"
(https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/).

If data-centers would start deploying RDMA among their entire data centers
(maybe they are already doing this) then I think they will want to restrict
access to block devices to only those initiator systems that need it.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-07 17:32           ` Bart Van Assche
@ 2018-02-08 17:38             ` Danil Kipnis
  2018-02-08 18:09               ` Bart Van Assche
  0 siblings, 1 reply; 79+ messages in thread
From: Danil Kipnis @ 2018-02-08 17:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: roman.penyaev, cl, linux-block, hch, linux-rdma, jinpu.wang,
	sagi, ogerlitz, axboe

On Wed, Feb 7, 2018 at 6:32 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote:
>> So the question is: are there real life setups where
>> some of the local IB network members can be untrusted?
>
> Hello Roman,
>
> You may want to read more about the latest evolutions with regard to network
> security. An article that I can recommend is the following: "Google reveals
> own security regime policy trusts no network, anywhere, ever"
> (https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/).
>
> If data-centers would start deploying RDMA among their entire data centers
> (maybe they are already doing this) then I think they will want to restrict
> access to block devices to only those initiator systems that need it.
>
> Thanks,
>
> Bart.
>
>

Hi Bart,

thanks for the link to the article. To the best of my understanding,
the guys suggest to authenticate the devices first and only then
authenticate the users who use the devices in order to get access to a
corporate service. They also mention in the presentation the current
trend of moving corporate services into the cloud. But I think this is
not about the devices from which that cloud is build of. Isn't a cloud
first build out of devices connected via IB and then users (and their
devices) are provided access to the services of that cloud as a whole?
If a malicious user already plugged his device into an IB switch of a
cloud internal infrastructure, isn't it game over anyway? Can't he
just take the hard drives instead of mapping them?

Thanks,

Danil.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-08 17:38             ` Danil Kipnis
@ 2018-02-08 18:09               ` Bart Van Assche
  2018-06-04 12:27                 ` Danil Kipnis
  0 siblings, 1 reply; 79+ messages in thread
From: Bart Van Assche @ 2018-02-08 18:09 UTC (permalink / raw)
  To: danil.kipnis
  Cc: linux-block, hch, linux-rdma, roman.penyaev, jinpu.wang,
	ogerlitz, sagi, axboe, cl

On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote:
> thanks for the link to the article. To the best of my understanding,
> the guys suggest to authenticate the devices first and only then
> authenticate the users who use the devices in order to get access to a
> corporate service. They also mention in the presentation the current
> trend of moving corporate services into the cloud. But I think this is
> not about the devices from which that cloud is build of. Isn't a cloud
> first build out of devices connected via IB and then users (and their
> devices) are provided access to the services of that cloud as a whole?
> If a malicious user already plugged his device into an IB switch of a
> cloud internal infrastructure, isn't it game over anyway? Can't he
> just take the hard drives instead of mapping them?

Hello Danil,

It seems like we each have been focussing on different aspects of the article.
The reason I referred to that article is because I read the following in
that article: "Unlike the conventional perimeter security model, BeyondCorp
doesn’t gate access to services and tools based on a user’s physical location
or the originating network [ ... ] The zero trust architecture spells trouble
for traditional attacks that rely on penetrating a tough perimeter to waltz
freely within an open internal network." Suppose e.g. that an organization
decides to use RoCE or iWARP for connectivity between block storage initiator
systems and block storage target systems and that it has a single company-
wide Ethernet network. If the target system does not restrict access based
on initiator IP address then any penetrator would be able to access all the
block devices exported by the target after a SoftRoCE or SoftiWARP initiator
driver has been loaded. If the target system however restricts access based
on the initiator IP address then that would make it harder for a penetrator
to access the exported block storage devices. Instead of just penetrating the
network access, IP address spoofing would have to be used or access would
have to be obtained to a system that has been granted access to the target
system.

Thanks,

Bart.
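What Bart describes amounts to the target consulting an allow-list keyed by
the initiator's address before honouring a mapping request. A purely
hypothetical sketch of that idea in C (nothing like this exists in the
IBNBD/IBTRS patches; the table, names and addresses are invented, and an
address check alone would still have to be combined with authentication,
since addresses can be spoofed, as noted above):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical target-side allow-list: which initiator may map which device. */
struct acl_entry {
	const char *initiator_addr;	/* e.g. an IP address or IB GID string */
	const char *device;		/* exported block device path */
};

static const struct acl_entry acl[] = {
	{ "10.0.0.21", "/dev/ram0" },
	{ "10.0.0.22", "/dev/nullb0" },
};

/* Accept a mapping request only if (initiator, device) is allow-listed. */
static bool access_allowed(const char *initiator_addr, const char *device)
{
	for (size_t i = 0; i < sizeof(acl) / sizeof(acl[0]); i++) {
		if (strcmp(acl[i].initiator_addr, initiator_addr) == 0 &&
		    strcmp(acl[i].device, device) == 0)
			return true;
	}
	return false;
}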

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-02 16:40   ` Doug Ledford
  2018-02-05  8:45     ` Jinpu Wang
@ 2018-06-04 12:14     ` Danil Kipnis
  1 sibling, 0 replies; 79+ messages in thread
From: Danil Kipnis @ 2018-06-04 12:14 UTC (permalink / raw)
  To: dledford
  Cc: Bart Van Assche, Roman Penyaev, linux-block, linux-rdma,
	Christoph Hellwig, ogerlitz, Jack Wang, axboe, Sagi Grimberg

[-- Attachment #1: Type: text/plain, Size: 3181 bytes --]

Hi Doug,

thanks for the feedback. You read the cover letter correctly: our transport
library implements multipath (load balancing and failover) on top of the RDMA
API. Its name "IBTRS" is slightly misleading in that regard: it can sit on
top of RoCE as well. The library allows for "bundling" multiple RDMA "paths"
(source address/destination address pairs) into one "session". So our
session consists of one or more paths, and each path under the hood consists
of as many QPs (each connecting source with destination) as there are CPUs
on the client system. The user load (in our case IBNBD is a block device
and generates block requests) is load-balanced on a per-CPU basis.
I understand this is something very different from what smc-r is doing. Am I
right? Do you know what stage MP-RDMA development is currently at?
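A rough sketch of the hierarchy described above (all names are illustrative
and simplified; they are not the actual ibtrs structures or fields):

#include <linux/smp.h>
#include <linux/socket.h>
#include <rdma/ib_verbs.h>

struct ex_con {				/* one QP, used by one client CPU */
	struct ib_qp		*qp;
};

struct ex_path {			/* one (source, destination) address pair */
	struct sockaddr_storage	src;
	struct sockaddr_storage	dst;
	unsigned int		con_num;	/* == number of CPUs on the client */
	struct ex_con		*cons;
};

struct ex_sess {			/* what the block device (IBNBD) maps over */
	unsigned int		path_num;	/* one or more paths: failover/balancing */
	struct ex_path		**paths;
};

/* Per-CPU load balancing: an IO submitted on CPU n goes out on connection n. */
static struct ex_con *ex_get_con(struct ex_path *path)
{
	return &path->cons[raw_smp_processor_id() % path->con_num];
}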

Best,

Danil Kipnis.

On Fri, Feb 2, 2018 at 5:40 PM Doug Ledford <dledford@redhat.com> wrote:

> On Fri, 2018-02-02 at 16:07 +0000, Bart Van Assche wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> > > Since the first version the following was changed:
> > >
> > >    - Load-balancing and IO fail-over using multipath features were added.
> > >    - Major parts of the code were rewritten, simplified and overall code
> > >      size was reduced by a quarter.
> >
> > That is interesting to know, but what happened to the feedback that Sagi and
> > I provided on v1? Has that feedback been addressed? See also
> > https://www.spinics.net/lists/linux-rdma/msg47819.html and
> > https://www.spinics.net/lists/linux-rdma/msg47879.html.
> >
> > Regarding multipath support: there are already two multipath implementations
> > upstream (dm-mpath and the multipath implementation in the NVMe initiator).
> > I'm not sure we want a third multipath implementation in the Linux kernel.
>
> There's more than that.  There was also md-multipath, and smc-r includes
> another version of multipath, plus I assume we support mptcp as well.
>
> But, to be fair, the different multipaths in this list serve different
> purposes and I'm not sure they could all be generalized out and served
> by a single multipath code.  Although, fortunately, md-multipath is
> deprecated, so no need to worry about it, and it is only dm-multipath
> and nvme multipath that deal directly with block devices and assume
> block semantics.  If I read the cover letter right (and I haven't dug
> into the code to confirm this), the ibtrs multipath has much more in
> common with smc-r multipath, where it doesn't really assume a block
> layer device sits on top of it, it's more of a pure network multipath,
> which the implementation of smc-r is and mptcp would be too.  I would
> like to see a core RDMA multipath implementation soon that would
> abstract out some of these multipath tasks, at least across RDMA links,
> and that didn't have the current limitations (smc-r only supports RoCE
> links, and it sounds like ibtrs only supports IB like links, but maybe
> I'm wrong there, I haven't looked at the patches yet).
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG KeyID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: Type: text/html, Size: 4250 bytes --]

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
  2018-02-08 18:09               ` Bart Van Assche
@ 2018-06-04 12:27                 ` Danil Kipnis
  0 siblings, 0 replies; 79+ messages in thread
From: Danil Kipnis @ 2018-06-04 12:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, linux-rdma, Roman Penyaev,
	Jack Wang, ogerlitz, Sagi Grimberg, axboe, cl

Hi Doug,

thanks for the feedback. You read the cover letter correctly: our
transport library implements multipath (load balancing and failover)
on top of the RDMA API. Its name "IBTRS" is slightly misleading in that
regard: it can sit on top of RoCE as well. The library allows for
"bundling" multiple RDMA "paths" (source address/destination address
pairs) into one "session". So our session consists of one or more paths,
and each path under the hood consists of as many QPs (each connecting
source with destination) as there are CPUs on the client system. The
user load (in our case IBNBD is a block device and generates block
requests) is load-balanced on a per-CPU basis.
I understand this is something very different from what smc-r is doing.
Am I right? Do you know what stage MP-RDMA development is currently at?

Best,

Danil Kipnis.

P.S. Sorry for the duplicate if any, the first mail was returned because of HTML.

On Thu, Feb 8, 2018 at 7:10 PM Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
>
> On Thu, 2018-02-08 at 18:38 +0100, Danil Kipnis wrote:
> > thanks for the link to the article. To the best of my understanding,
> > the guys suggest to authenticate the devices first and only then
> > authenticate the users who use the devices in order to get access to a
> > corporate service. They also mention in the presentation the current
> > trend of moving corporate services into the cloud. But I think this is
> > not about the devices from which that cloud is build of. Isn't a cloud
> > first build out of devices connected via IB and then users (and their
> > devices) are provided access to the services of that cloud as a whole?
> > If a malicious user already plugged his device into an IB switch of a
> > cloud internal infrastructure, isn't it game over anyway? Can't he
> > just take the hard drives instead of mapping them?
>
> Hello Danil,
>
> It seems like we each have been focussing on different aspects of the article.
> The reason I referred to that article is because I read the following in
> that article: "Unlike the conventional perimeter security model, BeyondCorp
> doesn’t gate access to services and tools based on a user’s physical location
> or the originating network [ ... ] The zero trust architecture spells trouble
> for traditional attacks that rely on penetrating a tough perimeter to waltz
> freely within an open internal network." Suppose e.g. that an organization
> decides to use RoCE or iWARP for connectivity between block storage initiator
> systems and block storage target systems and that it has a single company-
> wide Ethernet network. If the target system does not restrict access based
> on initiator IP address then any penetrator would be able to access all the
> block devices exported by the target after a SoftRoCE or SoftiWARP initiator
> driver has been loaded. If the target system however restricts access based
> on the initiator IP address then that would make it harder for a penetrator
> to access the exported block storage devices. Instead of just penetrating the
> network access, IP address spoofing would have to be used or access would
> have to be obtained to a system that has been granted access to the target
> system.
>
> Thanks,
>
> Bart.
>
>


-- 
Danil Kipnis
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2018-06-04 12:27 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-02 14:08 [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Roman Pen
2018-02-02 14:08 ` [PATCH 01/24] ibtrs: public interface header to establish RDMA connections Roman Pen
2018-02-02 14:08 ` [PATCH 02/24] ibtrs: private headers with IBTRS protocol structs and helpers Roman Pen
2018-02-02 14:08 ` [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules Roman Pen
2018-02-05 10:52   ` Sagi Grimberg
2018-02-06 12:01     ` Roman Penyaev
2018-02-06 16:10       ` Jason Gunthorpe
2018-02-07 10:34         ` Roman Penyaev
2018-02-02 14:08 ` [PATCH 04/24] ibtrs: client: private header with client structs and functions Roman Pen
2018-02-05 10:59   ` Sagi Grimberg
2018-02-06 12:23     ` Roman Penyaev
2018-02-02 14:08 ` [PATCH 05/24] ibtrs: client: main functionality Roman Pen
2018-02-02 16:54   ` Bart Van Assche
2018-02-05 13:27     ` Roman Penyaev
2018-02-05 14:14       ` Sagi Grimberg
2018-02-05 17:05         ` Roman Penyaev
2018-02-05 11:19   ` Sagi Grimberg
2018-02-05 14:19     ` Roman Penyaev
2018-02-05 16:24       ` Bart Van Assche
2018-02-02 14:08 ` [PATCH 06/24] ibtrs: client: statistics functions Roman Pen
2018-02-02 14:08 ` [PATCH 07/24] ibtrs: client: sysfs interface functions Roman Pen
2018-02-05 11:20   ` Sagi Grimberg
2018-02-06 12:28     ` Roman Penyaev
2018-02-02 14:08 ` [PATCH 08/24] ibtrs: server: private header with server structs and functions Roman Pen
2018-02-02 14:08 ` [PATCH 09/24] ibtrs: server: main functionality Roman Pen
2018-02-05 11:29   ` Sagi Grimberg
2018-02-06 12:46     ` Roman Penyaev
2018-02-02 14:08 ` [PATCH 10/24] ibtrs: server: statistics functions Roman Pen
2018-02-02 14:08 ` [PATCH 11/24] ibtrs: server: sysfs interface functions Roman Pen
2018-02-02 14:08 ` [PATCH 12/24] ibtrs: include client and server modules into kernel compilation Roman Pen
2018-02-02 14:08 ` [PATCH 13/24] ibtrs: a bit of documentation Roman Pen
2018-02-02 14:08 ` [PATCH 14/24] ibnbd: private headers with IBNBD protocol structs and helpers Roman Pen
2018-02-02 14:08 ` [PATCH 15/24] ibnbd: client: private header with client structs and functions Roman Pen
2018-02-02 14:08 ` [PATCH 16/24] ibnbd: client: main functionality Roman Pen
2018-02-02 15:11   ` Jens Axboe
2018-02-05 12:54     ` Roman Penyaev
2018-02-02 14:08 ` [PATCH 17/24] ibnbd: client: sysfs interface functions Roman Pen
2018-02-02 14:08 ` [PATCH 18/24] ibnbd: server: private header with server structs and functions Roman Pen
2018-02-02 14:08 ` [PATCH 19/24] ibnbd: server: main functionality Roman Pen
2018-02-02 14:09 ` [PATCH 20/24] ibnbd: server: functionality for IO submission to file or block dev Roman Pen
2018-02-02 14:09 ` [PATCH 21/24] ibnbd: server: sysfs interface functions Roman Pen
2018-02-02 14:09 ` [PATCH 22/24] ibnbd: include client and server modules into kernel compilation Roman Pen
2018-02-02 14:09 ` [PATCH 23/24] ibnbd: a bit of documentation Roman Pen
2018-02-02 15:55   ` Bart Van Assche
2018-02-05 13:03     ` Roman Penyaev
2018-02-05 14:16       ` Sagi Grimberg
2018-02-02 14:09 ` [PATCH 24/24] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules Roman Pen
2018-02-02 16:07 ` [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD) Bart Van Assche
2018-02-02 16:40   ` Doug Ledford
2018-02-05  8:45     ` Jinpu Wang
2018-06-04 12:14     ` Danil Kipnis
2018-02-02 17:05 ` Bart Van Assche
2018-02-05  8:56   ` Jinpu Wang
2018-02-05 11:36     ` Sagi Grimberg
2018-02-05 13:38       ` Danil Kipnis
2018-02-05 14:17         ` Sagi Grimberg
2018-02-05 16:40           ` Danil Kipnis
2018-02-05 18:38             ` Bart Van Assche
2018-02-06  9:44               ` Danil Kipnis
2018-02-06 15:35                 ` Bart Van Assche
2018-02-05 16:16     ` Bart Van Assche
2018-02-05 16:36       ` Jinpu Wang
2018-02-07 16:35       ` Christopher Lameter
2018-02-07 17:18         ` Roman Penyaev
2018-02-07 17:32           ` Bart Van Assche
2018-02-08 17:38             ` Danil Kipnis
2018-02-08 18:09               ` Bart Van Assche
2018-06-04 12:27                 ` Danil Kipnis
2018-02-05 12:16 ` Sagi Grimberg
2018-02-05 12:30   ` Sagi Grimberg
2018-02-07 13:06     ` Roman Penyaev
2018-02-05 16:58   ` Bart Van Assche
2018-02-05 17:16     ` Roman Penyaev
2018-02-05 17:20       ` Bart Van Assche
2018-02-06 11:47         ` Roman Penyaev
2018-02-06 13:12   ` Roman Penyaev
2018-02-06 16:01     ` Bart Van Assche
2018-02-07 12:57       ` Roman Penyaev
2018-02-07 16:35         ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).