Linux-RDMA Archive on lore.kernel.org
* [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device
@ 2019-12-30 10:29 Jack Wang
  2019-12-30 10:29 ` [PATCH v6 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
                   ` (26 more replies)
  0 siblings, 27 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

Hi all,

here is V6 of the RTRS (former IBTRS) rdma transport library and the
corresponding RNBD (former IBNBD) rdma network block device.

Changelog since v5:
1 rebased to linux-5.5-rc4
2 fix typo in my email address in first patch
3 cleanup copyright as suggested by Leon Romanovsky
4 remove 2 redundant kobject_del() calls in error path as suggested by Leon Romanovsky
5 add MAINTAINERS entries in alphabetical order as Gal Pressman suggested


Introduction
-------------

RTRS (RDMA Transport) is a reliable high-speed transport library
that establishes connections between client and server machines via
RDMA. It is based on RDMA-CM, so it is expected to support RoCE and
iWARP as well, but we have mainly tested it in an InfiniBand
environment. It is optimized to transfer (read/write) IO blocks in the
sense that it follows the BIO semantics of providing the possibility to
either write data from a scatter-gather list to the remote side or to
request ("read") data transfer from the remote side into a given set of
buffers.

RTRS is multipath capable and provides I/O fail-over and load-balancing
functionality. In RTRS terminology, a path is a set of RDMA connections,
and a particular path is selected according to the load-balancing policy.
RTRS is not bound to RNBD and can be used by other components as well.
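
To illustrate the fail-over and load-balancing idea above, here is a
minimal userspace sketch of one possible policy: plain round-robin that
skips failed paths. It is not the RTRS implementation. struct demo_path,
struct demo_session and demo_pick_path() are hypothetical names made up
for this illustration, and the real policies are defined by the rtrs
client patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a multipath session; not the RTRS structs. */
struct demo_path {
	bool connected;		/* false once the path has failed */
};

struct demo_session {
	struct demo_path *paths;
	size_t path_cnt;
	size_t next;		/* round-robin cursor */
};

/*
 * Pick the next connected path in round-robin order.
 * Returns the path index, or -1 if every path is down.
 */
static int demo_pick_path(struct demo_session *s)
{
	assert(s && s->paths && s->path_cnt > 0);

	for (size_t i = 0; i < s->path_cnt; i++) {
		size_t idx = (s->next + i) % s->path_cnt;

		if (s->paths[idx].connected) {
			/* Advance the cursor past the path we just used. */
			s->next = (idx + 1) % s->path_cnt;
			return (int)idx;
		}
	}
	return -1;
}
```

With three paths of which the second is down, successive calls return
0, 2, 0 and so on. Once every path is down the function returns -1.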

RNBD (RDMA Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access to a block device on
the server over the RTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally RNBD uses RTRS as an RDMA transport library.

Commits for kernel can be found here:
   https://github.com/ionos-enterprise/ibnbd/commits/linux-5.5-rc4-ibnbd-v6
The out-of-tree modules are here:
   https://github.com/ionos-enterprise/ibnbd

As always, please review and share your comments.

Thanks.

Jack Wang (25):
  sysfs: export sysfs_remove_file_self()
  rtrs: public interface header to establish RDMA connections
  rtrs: private headers with rtrs protocol structs and helpers
  rtrs: core: lib functions shared between client and server modules
  rtrs: client: private header with client structs and functions
  rtrs: client: main functionality
  rtrs: client: statistics functions
  rtrs: client: sysfs interface functions
  rtrs: server: private header with server structs and functions
  rtrs: server: main functionality
  rtrs: server: statistics functions
  rtrs: server: sysfs interface functions
  rtrs: include client and server modules into kernel compilation
  rtrs: a bit of documentation
  rnbd: private headers with rnbd protocol structs and helpers
  rnbd: client: private header with client structs and functions
  rnbd: client: main functionality
  rnbd: client: sysfs interface functions
  rnbd: server: private header with server structs and functions
  rnbd: server: main functionality
  rnbd: server: functionality for IO submission to file or block dev
  rnbd: server: sysfs interface functions
  rnbd: include client and server modules into kernel compilation
  rnbd: a bit of documentation
  MAINTAINERS: Add maintainers for RNBD/RTRS modules

 Documentation/ABI/testing/sysfs-block-rnbd    |   51 +
 .../ABI/testing/sysfs-class-rnbd-client       |  117 +
 .../ABI/testing/sysfs-class-rnbd-server       |   57 +
 .../ABI/testing/sysfs-class-rtrs-client       |  190 ++
 .../ABI/testing/sysfs-class-rtrs-server       |   81 +
 MAINTAINERS                                   |   14 +
 drivers/block/Kconfig                         |    2 +
 drivers/block/Makefile                        |    1 +
 drivers/block/rnbd/Kconfig                    |   28 +
 drivers/block/rnbd/Makefile                   |   17 +
 drivers/block/rnbd/README                     |   92 +
 drivers/block/rnbd/rnbd-clt-sysfs.c           |  641 ++++
 drivers/block/rnbd/rnbd-clt.c                 | 1743 ++++++++++
 drivers/block/rnbd/rnbd-clt.h                 |  151 +
 drivers/block/rnbd/rnbd-common.c              |   25 +
 drivers/block/rnbd/rnbd-log.h                 |   43 +
 drivers/block/rnbd/rnbd-proto.h               |  307 ++
 drivers/block/rnbd/rnbd-srv-dev.c             |  144 +
 drivers/block/rnbd/rnbd-srv-dev.h             |  112 +
 drivers/block/rnbd/rnbd-srv-sysfs.c           |  213 ++
 drivers/block/rnbd/rnbd-srv.c                 |  864 +++++
 drivers/block/rnbd/rnbd-srv.h                 |   81 +
 drivers/infiniband/Kconfig                    |    1 +
 drivers/infiniband/ulp/Makefile               |    1 +
 drivers/infiniband/ulp/rtrs/Kconfig           |   27 +
 drivers/infiniband/ulp/rtrs/Makefile          |   17 +
 drivers/infiniband/ulp/rtrs/README            |  149 +
 drivers/infiniband/ulp/rtrs/rtrs-clt-stats.c  |  435 +++
 drivers/infiniband/ulp/rtrs/rtrs-clt-sysfs.c  |  501 +++
 drivers/infiniband/ulp/rtrs/rtrs-clt.c        | 2934 +++++++++++++++++
 drivers/infiniband/ulp/rtrs/rtrs-clt.h        |  296 ++
 drivers/infiniband/ulp/rtrs/rtrs-log.h        |   32 +
 drivers/infiniband/ulp/rtrs/rtrs-pri.h        |  408 +++
 drivers/infiniband/ulp/rtrs/rtrs-srv-stats.c  |   91 +
 drivers/infiniband/ulp/rtrs/rtrs-srv-sysfs.c  |  297 ++
 drivers/infiniband/ulp/rtrs/rtrs-srv.c        | 2169 ++++++++++++
 drivers/infiniband/ulp/rtrs/rtrs-srv.h        |  141 +
 drivers/infiniband/ulp/rtrs/rtrs.c            |  628 ++++
 drivers/infiniband/ulp/rtrs/rtrs.h            |  316 ++
 fs/sysfs/file.c                               |    1 +
 40 files changed, 13418 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-block-rnbd
 create mode 100644 Documentation/ABI/testing/sysfs-class-rnbd-client
 create mode 100644 Documentation/ABI/testing/sysfs-class-rnbd-server
 create mode 100644 Documentation/ABI/testing/sysfs-class-rtrs-client
 create mode 100644 Documentation/ABI/testing/sysfs-class-rtrs-server
 create mode 100644 drivers/block/rnbd/Kconfig
 create mode 100644 drivers/block/rnbd/Makefile
 create mode 100644 drivers/block/rnbd/README
 create mode 100644 drivers/block/rnbd/rnbd-clt-sysfs.c
 create mode 100644 drivers/block/rnbd/rnbd-clt.c
 create mode 100644 drivers/block/rnbd/rnbd-clt.h
 create mode 100644 drivers/block/rnbd/rnbd-common.c
 create mode 100644 drivers/block/rnbd/rnbd-log.h
 create mode 100644 drivers/block/rnbd/rnbd-proto.h
 create mode 100644 drivers/block/rnbd/rnbd-srv-dev.c
 create mode 100644 drivers/block/rnbd/rnbd-srv-dev.h
 create mode 100644 drivers/block/rnbd/rnbd-srv-sysfs.c
 create mode 100644 drivers/block/rnbd/rnbd-srv.c
 create mode 100644 drivers/block/rnbd/rnbd-srv.h
 create mode 100644 drivers/infiniband/ulp/rtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/rtrs/Makefile
 create mode 100644 drivers/infiniband/ulp/rtrs/README
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt-stats.c
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt-sysfs.c
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt.c
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt.h
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-log.h
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-pri.h
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv-stats.c
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv-sysfs.c
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv.c
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv.h
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs.c
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 01/25] sysfs: export sysfs_remove_file_self()
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 10:29 ` [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections Jack Wang
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev, Roman Pen, linux-kernel

From: Jack Wang <jinpu.wang@cloud.ionos.com>

The function is going to be used by the RDMA transport module
in subsequent patches, so export it to GPL modules.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
[jwang: extend the commit message]
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 fs/sysfs/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 130fc6fbcc03..1ff4672d7746 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -492,6 +492,7 @@ bool sysfs_remove_file_self(struct kobject *kobj, const struct attribute *attr)
 	kernfs_put(kn);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(sysfs_remove_file_self);
 
 void sysfs_remove_files(struct kobject *kobj, const struct attribute * const *ptr)
 {
-- 
2.17.1



* [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
  2019-12-30 10:29 ` [PATCH v6 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 19:25   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers Jack Wang
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

Introduce a public header that provides a set of API functions to
establish RDMA connections from a client to a server machine using
the RTRS protocol, which manages RDMA connections for each session
and does multipathing and load balancing.

Main functions for client (active) side:

 rtrs_clt_open() - Creates a set of RDMA connections encapsulated
                   in an RTRS session and returns a pointer to the
                   RTRS session object.
 rtrs_clt_close() - Closes RDMA connections associated with RTRS
                     session.
 rtrs_clt_request() - Requests zero-copy RDMA transfer to/from
                       server.

Main functions for server (passive) side:

 rtrs_srv_open() - Starts listening for RTRS clients on the specified
                   port and invokes RTRS callbacks for incoming
                   RDMA requests or link events.
 rtrs_srv_close() - Closes RTRS server context.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs.h | 316 +++++++++++++++++++++++++++++
 1 file changed, 316 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs.h
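
As an illustration of the "src,dst" path-string convention documented for
rtrs_addr_to_sockaddr() in this header, the following userspace sketch
only splits such a string into its source and destination parts. It does
not convert anything to a sockaddr and is not the code from rtrs.c.
struct demo_path_str and demo_split_path() are hypothetical names, and
rejecting empty addresses is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: slices into the input string, no copies. */
struct demo_path_str {
	const char *src;	/* NULL when no source address was given */
	size_t src_len;
	const char *dst;
	size_t dst_len;
};

/*
 * Split "src,dst" or a lone "dst" into its parts.
 * Returns 0 on success, -1 if an address part is empty.
 */
static int demo_split_path(const char *str, size_t len,
			   struct demo_path_str *out)
{
	const char *comma;

	assert(str && out);

	comma = memchr(str, ',', len);
	if (comma) {
		out->src = str;
		out->src_len = (size_t)(comma - str);
		out->dst = comma + 1;
		out->dst_len = len - out->src_len - 1;
	} else {
		/* A single address is treated as the destination. */
		out->src = NULL;
		out->src_len = 0;
		out->dst = str;
		out->dst_len = len;
	}
	if (out->dst_len == 0 || (comma && out->src_len == 0))
		return -1;
	return 0;
}
```

For "ip:1.1.1.1,ip:1.1.1.2" the sketch yields both parts. For a lone
"ip:1.1.1.2" the source stays NULL, matching the documented behaviour
that a single address is considered to be the destination.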

diff --git a/drivers/infiniband/ulp/rtrs/rtrs.h b/drivers/infiniband/ulp/rtrs/rtrs.h
new file mode 100644
index 000000000000..2a513672a99b
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs.h
@@ -0,0 +1,316 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#ifndef RTRS_H
+#define RTRS_H
+
+#include <linux/socket.h>
+#include <linux/scatterlist.h>
+
+struct rtrs_permit;
+struct rtrs_clt;
+struct rtrs_srv_ctx;
+struct rtrs_srv;
+struct rtrs_srv_op;
+
+/*
+ * Here goes RTRS client API
+ */
+
+/**
+ * enum rtrs_clt_link_ev - Events about connectivity state of a client
+ * @RTRS_CLT_LINK_EV_RECONNECTED:	Client was reconnected.
+ * @RTRS_CLT_LINK_EV_DISCONNECTED:	Client was disconnected.
+ */
+enum rtrs_clt_link_ev {
+	RTRS_CLT_LINK_EV_RECONNECTED,
+	RTRS_CLT_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * Source and destination address of a path to be established
+ */
+struct rtrs_addr {
+	struct sockaddr_storage *src;
+	struct sockaddr_storage *dst;
+};
+
+typedef void (link_clt_ev_fn)(void *priv, enum rtrs_clt_link_ev ev);
+/**
+ * rtrs_clt_open() - Open a session to an RTRS server
+ * @priv: User supplied private data.
+ * @link_ev: Event notification for connection state changes
+ *	@priv: User supplied data that was passed to rtrs_clt_open()
+ *	@ev: Occurred event
+ * @sessname: name of the session
+ * @paths: Paths to be established defined by their src and dst addresses
+ * @path_cnt: Number of elements in the @paths array
+ * @port: port to be used by the RTRS session
+ * @pdu_sz: Size of extra payload which can be accessed after permit allocation.
+ * @max_inflight_msg: Max. number of parallel inflight messages for the session
+ * @max_segments: Max. number of segments per IO request
+ * @reconnect_delay_sec: time between reconnect tries
+ * @max_reconnect_attempts: Number of times to reconnect on error before giving
+ *			    up, 0 for disabled, -1 for forever
+ *
+ * Starts session establishment with the rtrs_server. The function can block
+ * up to ~2000ms until it returns.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct rtrs_clt *rtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct rtrs_addr *paths,
+				 size_t path_cnt, short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts);
+
+/**
+ * rtrs_clt_close() - Close a session
+ * @sess: Session handle. Session is freed on return.
+ */
+void rtrs_clt_close(struct rtrs_clt *sess);
+
+/**
+ * rtrs_permit_from_pdu() - converts opaque pdu pointer to rtrs_permit
+ * @pdu: opaque pointer
+ */
+struct rtrs_permit *rtrs_permit_from_pdu(void *pdu);
+
+/**
+ * rtrs_permit_to_pdu() - converts rtrs_permit to opaque pdu pointer
+ * @permit: RTRS permit pointer
+ */
+void *rtrs_permit_to_pdu(struct rtrs_permit *permit);
+
+enum {
+	RTRS_PERMIT_NOWAIT = 0,
+	RTRS_PERMIT_WAIT   = 1,
+};
+
+/**
+ * enum rtrs_clt_con_type - type of IB connection to use with a given permit
+ * @RTRS_USR_CON: use a connection reserved for "service" messages
+ * @RTRS_IO_CON: use a connection reserved for IO
+ */
+enum rtrs_clt_con_type {
+	RTRS_USR_CON,
+	RTRS_IO_CON
+};
+
+/**
+ * rtrs_clt_get_permit() - allocates permit for future RDMA operation
+ * @sess:	Current session
+ * @con_type:	Type of connection to use with the permit
+ * @wait:	Wait type
+ *
+ * Description:
+ *    Allocates permit for the following RDMA operation.  Permit is used
+ *    to preallocate all resources and to propagate memory pressure
+ *    up earlier.
+ *
+ * Context:
+ *    Can sleep if @wait == RTRS_PERMIT_WAIT
+ */
+struct rtrs_permit *rtrs_clt_get_permit(struct rtrs_clt *sess,
+				    enum rtrs_clt_con_type con_type,
+				    int wait);
+
+/**
+ * rtrs_clt_put_permit() - puts allocated permit
+ * @sess:	Current session
+ * @permit:	Permit to be freed
+ *
+ * Context:
+ *    Does not matter
+ */
+void rtrs_clt_put_permit(struct rtrs_clt *sess, struct rtrs_permit *permit);
+
+typedef void (rtrs_conf_fn)(void *priv, int errno);
+/**
+ * rtrs_clt_request() - Request data transfer to/from server via RDMA.
+ *
+ * @dir:	READ/WRITE
+ * @conf:	callback function to be called as confirmation
+ * @sess:	Session
+ * @permit:	Preallocated permit
+ * @priv:	User provided data, passed back with corresponding
+ *		@(conf) confirmation.
+ * @vec:	Message that is sent to server together with the request.
+ *		Sum of len of all @vec elements limited to <= IO_MSG_SIZE.
+ *		Since the msg is copied internally it can be allocated on stack.
+ * @nr:		Number of elements in @vec.
+ * @len:	length of data sent to/from server
+ * @sg:		Pages to be sent/received to/from server.
+ * @sg_cnt:	Number of elements in the @sg
+ *
+ * Return:
+ * 0:		Success
+ * <0:		Error
+ *
+ * On dir=READ the rtrs client requests a data transfer from server to client.
+ * The data that the server will respond with will be stored in @sg when
+ * the user receives an %RTRS_CLT_RDMA_EV_RDMA_REQUEST_WRITE_COMPL event.
+ * On dir=WRITE the rtrs client RDMA-writes the data in @sg to the server side.
+ */
+int rtrs_clt_request(int dir, rtrs_conf_fn *conf, struct rtrs_clt *sess,
+		      struct rtrs_permit *permit, void *priv,
+		      const struct kvec *vec, size_t nr, size_t len,
+		      struct scatterlist *sg, unsigned int sg_cnt);
+
+/**
+ * rtrs_attrs - RTRS session attributes
+ */
+struct rtrs_attrs {
+	u32	queue_depth;
+	u32	max_io_size;
+	u8	sessname[NAME_MAX];
+	struct kobject *sess_kobj;
+};
+
+/**
+ * rtrs_clt_query() - queries RTRS session attributes
+ *
+ * Returns:
+ *    0 on success
+ *    -ECOMM		no connection to the server
+ */
+int rtrs_clt_query(struct rtrs_clt *sess, struct rtrs_attrs *attr);
+
+/*
+ * Here goes RTRS server API
+ */
+
+/**
+ * enum rtrs_srv_link_ev - Server link events
+ * @RTRS_SRV_LINK_EV_CONNECTED:	Connection from client established
+ * @RTRS_SRV_LINK_EV_DISCONNECTED:	Connection was disconnected, all
+ *					connection RTRS resources were freed.
+ */
+enum rtrs_srv_link_ev {
+	RTRS_SRV_LINK_EV_CONNECTED,
+	RTRS_SRV_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * rdma_ev_fn():	Event notification for RDMA operations
+ *			If the callback returns a value != 0, an error message
+ *			for the data transfer will be sent to the client.
+ *
+ *	@sess:		Session
+ *	@priv:		Private data set by rtrs_srv_set_sess_priv()
+ *	@id:		internal RTRS operation id
+ *	@dir:		READ/WRITE
+ *	@data:		Pointer to (bidirectional) rdma memory area:
+ *			- in case of %RTRS_SRV_RDMA_EV_RECV contains
+ *			data sent by the client
+ *			- in case of %RTRS_SRV_RDMA_EV_WRITE_REQ points to the
+ *			memory area where the response is to be written to
+ *	@datalen:	Size of the memory area in @data
+ *	@usr:		The extra user message sent by the client (%vec)
+ *	@usrlen:	Size of the user message
+ */
+typedef int (rdma_ev_fn)(struct rtrs_srv *sess, void *priv,
+			 struct rtrs_srv_op *id, int dir,
+			 void *data, size_t datalen, const void *usr,
+			 size_t usrlen);
+
+/**
+ * link_ev_fn():	Events about connectivity state changes
+ *			If the callback returns != 0 and the event is
+ *			%RTRS_SRV_LINK_EV_CONNECTED, the corresponding session
+ *			will be destroyed.
+ *	@sess:		Session
+ *	@ev:		event
+ *	@priv:		Private data from user if previously set with
+ *			rtrs_srv_set_sess_priv()
+ */
+typedef int (link_ev_fn)(struct rtrs_srv *sess, enum rtrs_srv_link_ev ev,
+			 void *priv);
+
+/**
+ * rtrs_srv_open() - open RTRS server context
+ * @ops:		callback functions
+ *
+ * Creates server context with specified callbacks.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct rtrs_srv_ctx *rtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port);
+
+/**
+ * rtrs_srv_close() - close RTRS server context
+ * @ctx: pointer to server context
+ *
+ * Closes RTRS server context with all client sessions.
+ */
+void rtrs_srv_close(struct rtrs_srv_ctx *ctx);
+
+/**
+ * rtrs_srv_resp_rdma() - Finish an RDMA request
+ *
+ * @id:		Internal RTRS operation identifier
+ * @errno:	Response code sent to the other side for this operation.
+ *		0 = success, <0 = error
+ *
+ * Finish an RDMA operation. A message is sent to the client and the
+ * corresponding memory areas will be released.
+ */
+void rtrs_srv_resp_rdma(struct rtrs_srv_op *id, int errno);
+
+/**
+ * rtrs_srv_set_sess_priv() - Set private pointer in rtrs_srv.
+ * @sess:	Session
+ * @priv:	The private pointer that is associated with the session.
+ */
+void rtrs_srv_set_sess_priv(struct rtrs_srv *sess, void *priv);
+
+/**
+ * rtrs_srv_get_queue_depth() - Get rtrs_srv queue depth.
+ * @sess:	Session
+ */
+int rtrs_srv_get_queue_depth(struct rtrs_srv *sess);
+
+/**
+ * rtrs_srv_get_sess_name() - Get rtrs_srv peer hostname.
+ * @sess:	Session
+ * @sessname:	Sessname buffer
+ * @len:	Length of sessname buffer
+ */
+int rtrs_srv_get_sess_name(struct rtrs_srv *sess, char *sessname, size_t len);
+
+/**
+ * rtrs_addr_to_sockaddr() - convert path string "src,dst" to sockaddrs
+ * @str:		string containing source and destination addr of a path
+ *		separated by comma. E.g. "ip:1.1.1.1,ip:1.1.1.2". If str
+ *		contains only one address it's considered to be destination.
+ * @len:		string length
+ * @addr->dst	will be set to the destination sockaddr.
+ * @addr->src	will be set to the source address or to NULL
+ *		if str doesn't contain any source address.
+ *
+ * Returns zero if conversion successful. Non-zero otherwise.
+ */
+int rtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct rtrs_addr *addr);
+
+/**
+ * sockaddr_to_str() - convert sockaddr to a string.
+ * @addr	the sockaddr structure to be converted.
+ * @buf		string containing socket addr.
+ * @len		string length.
+ *
+ * The return value is the number of characters written into buf not
+ * including the trailing '\0'. If len == 0 the function returns 0.
+ */
+int sockaddr_to_str(const struct sockaddr *addr, char *buf, size_t len);
+#endif
-- 
2.17.1



* [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
  2019-12-30 10:29 ` [PATCH v6 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
  2019-12-30 10:29 ` [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 19:48   ` Bart Van Assche
  2019-12-31  0:07   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 04/25] rtrs: core: lib functions shared between client and server modules Jack Wang
                   ` (23 subsequent siblings)
  26 siblings, 2 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

These are common private headers with rtrs protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-log.h |  32 ++
 drivers/infiniband/ulp/rtrs/rtrs-pri.h | 408 +++++++++++++++++++++++++
 2 files changed, 440 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-log.h
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-pri.h
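
For readers new to the protocol, here is a small userspace sketch of how a
message type and a payload could share the 32-bit RDMA immediate value,
given the 4 type bits and 28 payload bits defined by enum rtrs_imm_const in
rtrs-pri.h. Placing the type in the upper bits is an assumption of this
sketch, and demo_to_imm() and demo_from_imm() are hypothetical names. The
helpers in the patch itself are authoritative.

```c
#include <assert.h>
#include <stdint.h>

/* Bit widths copied from enum rtrs_imm_const in rtrs-pri.h. */
enum {
	DEMO_IMM_TYPE_BITS = 4,
	DEMO_IMM_TYPE_MASK = (1u << DEMO_IMM_TYPE_BITS) - 1,
	DEMO_IMM_PAYL_BITS = 28,
	DEMO_IMM_PAYL_MASK = (1u << DEMO_IMM_PAYL_BITS) - 1,
};

/* Assumed layout: type in the top 4 bits, payload in the low 28 bits. */
static uint32_t demo_to_imm(uint32_t type, uint32_t payload)
{
	assert(type <= (uint32_t)DEMO_IMM_TYPE_MASK);
	assert(payload <= (uint32_t)DEMO_IMM_PAYL_MASK);

	return (type << DEMO_IMM_PAYL_BITS) | payload;
}

/* Inverse of demo_to_imm(): recover type and payload from the immediate. */
static void demo_from_imm(uint32_t imm, uint32_t *type, uint32_t *payload)
{
	assert(type && payload);

	*type = imm >> DEMO_IMM_PAYL_BITS;
	*payload = imm & (uint32_t)DEMO_IMM_PAYL_MASK;
}
```

Under this assumed layout, packing type 8 (the value of RTRS_HB_MSG_IMM)
with payload 0x123 gives 0x80000123, and unpacking returns the same pair.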

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-log.h b/drivers/infiniband/ulp/rtrs/rtrs-log.h
new file mode 100644
index 000000000000..570329a73ee4
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-log.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#ifndef RTRS_LOG_H
+#define RTRS_LOG_H
+
+#define rtrs_prefix(obj) (obj->sessname)
+
+#define rtrs_log(fn, obj, fmt, ...)				\
+	fn("<%s>: " fmt, rtrs_prefix(obj), ##__VA_ARGS__)
+
+#define rtrs_err(obj, fmt, ...)	\
+	rtrs_log(pr_err, obj, fmt, ##__VA_ARGS__)
+#define rtrs_err_rl(obj, fmt, ...)	\
+	rtrs_log(pr_err_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define rtrs_wrn(obj, fmt, ...)	\
+	rtrs_log(pr_warn, obj, fmt, ##__VA_ARGS__)
+#define rtrs_wrn_rl(obj, fmt, ...) \
+	rtrs_log(pr_warn_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define rtrs_info(obj, fmt, ...) \
+	rtrs_log(pr_info, obj, fmt, ##__VA_ARGS__)
+#define rtrs_info_rl(obj, fmt, ...) \
+	rtrs_log(pr_info_ratelimited, obj, fmt, ##__VA_ARGS__)
+
+#endif /* RTRS_LOG_H */
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-pri.h b/drivers/infiniband/ulp/rtrs/rtrs-pri.h
new file mode 100644
index 000000000000..f215e6c0ce73
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-pri.h
@@ -0,0 +1,408 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#ifndef RTRS_PRI_H
+#define RTRS_PRI_H
+
+#include <linux/uuid.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib.h>
+
+#include "rtrs.h"
+
+#define RTRS_PROTO_VER_MAJOR 2
+#define RTRS_PROTO_VER_MINOR 0
+
+#define RTRS_PROTO_VER_STRING __stringify(RTRS_PROTO_VER_MAJOR) "." \
+			       __stringify(RTRS_PROTO_VER_MINOR)
+
+enum rtrs_imm_const {
+	MAX_IMM_TYPE_BITS = 4,
+	MAX_IMM_TYPE_MASK = ((1 << MAX_IMM_TYPE_BITS) - 1),
+	MAX_IMM_PAYL_BITS = 28,
+	MAX_IMM_PAYL_MASK = ((1 << MAX_IMM_PAYL_BITS) - 1),
+};
+
+enum rtrs_imm_type {
+	RTRS_IO_REQ_IMM       = 0, /* client to server */
+	RTRS_IO_RSP_IMM       = 1, /* server to client */
+	RTRS_IO_RSP_W_INV_IMM = 2, /* server to client */
+
+	RTRS_HB_MSG_IMM = 8,
+	RTRS_HB_ACK_IMM = 9,
+
+	RTRS_LAST_IMM,
+};
+
+enum {
+	SERVICE_CON_QUEUE_DEPTH = 512,
+
+	MIN_RTR_CNT = 1,
+	MAX_RTR_CNT = 7,
+
+	MAX_PATHS_NUM = 128,
+
+	/*
+	 * With the current size of the tag allocated on the client, 4K
+	 * is the maximum number of tags we can allocate.  This number is
+	 * also used on the client to allocate the IU for the user connection
+	 * to receive the RDMA addresses from the server.
+	 */
+	MAX_SESS_QUEUE_DEPTH = 4096,
+
+	RTRS_HB_INTERVAL_MS = 5000,
+	RTRS_HB_MISSED_MAX = 5,
+
+	RTRS_MAGIC = 0x1BBD,
+	RTRS_PROTO_VER = (RTRS_PROTO_VER_MAJOR << 8) | RTRS_PROTO_VER_MINOR,
+};
+
+struct rtrs_ib_dev;
+
+struct rtrs_ib_dev_pool_ops {
+	struct rtrs_ib_dev *(*alloc)(void);
+	void (*free)(struct rtrs_ib_dev *dev);
+	int (*init)(struct rtrs_ib_dev *dev);
+	void (*deinit)(struct rtrs_ib_dev *dev);
+};
+
+struct rtrs_ib_dev_pool {
+	struct mutex		mutex;
+	struct list_head	list;
+	enum ib_pd_flags	pd_flags;
+	const struct rtrs_ib_dev_pool_ops *ops;
+};
+
+struct rtrs_ib_dev {
+	struct ib_device	 *ib_dev;
+	struct ib_pd		 *ib_pd;
+	struct kref		 ref;
+	struct list_head	 entry;
+	struct rtrs_ib_dev_pool *pool;
+};
+
+struct rtrs_con {
+	struct rtrs_sess	*sess;
+	struct ib_qp		*qp;
+	struct ib_cq		*cq;
+	struct rdma_cm_id	*cm_id;
+	unsigned int		cid;
+};
+
+typedef void (rtrs_hb_handler_t)(struct rtrs_con *con);
+
+struct rtrs_sess {
+	struct list_head	entry;
+	struct sockaddr_storage dst_addr;
+	struct sockaddr_storage src_addr;
+	char			sessname[NAME_MAX];
+	uuid_t			uuid;
+	struct rtrs_con	**con;
+	unsigned int		con_num;
+	unsigned int		recon_cnt;
+	struct rtrs_ib_dev	*dev;
+	int			dev_ref;
+	struct ib_cqe		*hb_cqe;
+	rtrs_hb_handler_t	*hb_err_handler;
+	struct workqueue_struct *hb_wq;
+	struct delayed_work	hb_dwork;
+	unsigned int		hb_interval_ms;
+	unsigned int		hb_missed_cnt;
+	unsigned int		hb_missed_max;
+};
+
+struct rtrs_iu {
+	struct list_head        list;
+	struct ib_cqe           cqe;
+	dma_addr_t              dma_addr;
+	void                    *buf;
+	size_t                  size;
+	enum dma_data_direction direction;
+};
+
+/**
+ * enum rtrs_msg_types - RTRS message types.
+ * @RTRS_MSG_INFO_REQ:		Client additional info request to the server
+ * @RTRS_MSG_INFO_RSP:		Server additional info response to the client
+ * @RTRS_MSG_WRITE:		Client writes data per RDMA to server
+ * @RTRS_MSG_READ:		Client requests data transfer from server
+ * @RTRS_MSG_RKEY_RSP:		Server refreshed rkey for rbuf
+ */
+enum rtrs_msg_types {
+	RTRS_MSG_INFO_REQ,
+	RTRS_MSG_INFO_RSP,
+	RTRS_MSG_WRITE,
+	RTRS_MSG_READ,
+	RTRS_MSG_RKEY_RSP,
+};
+
+/**
+ * enum rtrs_msg_flags - RTRS message flags.
+ * @RTRS_MSG_NEED_INVAL_F: Send invalidation in response.
+ * @RTRS_MSG_NEW_RKEY_F: Send refreshed rkey in response.
+ */
+enum rtrs_msg_flags {
+	RTRS_MSG_NEED_INVAL_F = 1 << 0,
+	RTRS_MSG_NEW_RKEY_F = 1 << 1,
+};
+
+/**
+ * struct rtrs_sg_desc - RDMA-Buffer entry description
+ * @addr:	Address of RDMA destination buffer
+ * @key:	Authorization rkey to write to the buffer
+ * @len:	Size of the buffer
+ */
+struct rtrs_sg_desc {
+	__le64			addr;
+	__le32			key;
+	__le32			len;
+};
+
+/**
+ * struct rtrs_msg_conn_req - Client connection request to the server
+ * @magic:	   RTRS magic
+ * @version:	   RTRS protocol version
+ * @cid:	   Current connection id
+ * @cid_num:	   Number of connections per session
+ * @recon_cnt:	   Reconnections counter
+ * @sess_uuid:	   UUID of a session (path)
+ * @paths_uuid:	   UUID of a group of sessions (paths)
+ *
+ * NOTE: max size 56 bytes, see man rdma_connect().
+ */
+struct rtrs_msg_conn_req {
+	u8		__cma_version; /* Is set to 0 by cma.c in case of
+					* AF_IB, do not touch that.
+					*/
+	u8		__ip_version;  /* On sender side that should be
+					* set to 0, or cma_save_ip_info()
+					* extracts garbage and will fail.
+					*/
+	__le16		magic;
+	__le16		version;
+	__le16		cid;
+	__le16		cid_num;
+	__le16		recon_cnt;
+	uuid_t		sess_uuid;
+	uuid_t		paths_uuid;
+	u8		reserved[12];
+};
+
+/**
+ * struct rtrs_msg_conn_rsp - Server connection response to the client
+ * @magic:	   RTRS magic
+ * @version:	   RTRS protocol version
+ * @errno:	   If rdma_accept() then 0, if rdma_reject() indicates error
+ * @queue_depth:   max inflight messages (queue-depth) in this session
+ * @max_io_size:   max io size server supports
+ * @max_hdr_size:  max msg header size server supports
+ *
+ * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
+ */
+struct rtrs_msg_conn_rsp {
+	__le16		magic;
+	__le16		version;
+	__le16		errno;
+	__le16		queue_depth;
+	__le32		max_io_size;
+	__le32		max_hdr_size;
+	__le32		flags;
+	u8		reserved[36];
+};
+
+/**
+ * struct rtrs_msg_info_req
+ * @type:		@RTRS_MSG_INFO_REQ
+ * @sessname:		Session name chosen by client
+ */
+struct rtrs_msg_info_req {
+	__le16		type;
+	u8		sessname[NAME_MAX];
+	u8		reserved[15];
+};
+
+/**
+ * struct rtrs_msg_info_rsp
+ * @type:		@RTRS_MSG_INFO_RSP
+ * @sg_cnt:		Number of @desc entries
+ * @desc:		RDMA buffers where the client can write to server
+ */
+struct rtrs_msg_info_rsp {
+	__le16		type;
+	__le16          sg_cnt;
+	u8              reserved[4];
+	struct rtrs_sg_desc desc[];
+};
+
+/**
+ * struct rtrs_msg_rkey_rsp
+ * @type:		@RTRS_MSG_RKEY_RSP
+ * @buf_id:		RDMA buf_id of the new rkey
+ * @rkey:		new remote key for RDMA buffers id from server
+ */
+struct rtrs_msg_rkey_rsp {
+	__le16		type;
+	__le16          buf_id;
+	__le32		rkey;
+};
+
+/**
+ * struct rtrs_msg_rdma_read - RDMA data transfer request from client
+ * @type:		always @RTRS_MSG_READ
+ * @usr_len:		length of user payload
+ * @sg_cnt:		number of @desc entries
+ * @desc:		RDMA buffers where the server can write the result to
+ */
+struct rtrs_msg_rdma_read {
+	__le16			type;
+	__le16			usr_len;
+	__le16			flags;
+	__le16			sg_cnt;
+	struct rtrs_sg_desc    desc[];
+};
+
+/**
+ * struct rtrs_msg_rdma_write - Message transferred to server with RDMA-Write
+ * @type:		always @RTRS_MSG_WRITE
+ * @usr_len:		length of user payload
+ */
+struct rtrs_msg_rdma_write {
+	__le16			type;
+	__le16			usr_len;
+};
+
+/**
+ * struct rtrs_msg_rdma_hdr - header for read or write request
+ * @type:		@RTRS_MSG_WRITE | @RTRS_MSG_READ
+ */
+struct rtrs_msg_rdma_hdr {
+	__le16			type;
+};
+
+/* rtrs.c */
+
+struct rtrs_iu *rtrs_iu_alloc(u32 queue_size, size_t size, gfp_t t,
+			      struct ib_device *dev, enum dma_data_direction,
+			      void (*done)(struct ib_cq *cq, struct ib_wc *wc));
+void rtrs_iu_free(struct rtrs_iu *iu, enum dma_data_direction dir,
+		  struct ib_device *dev, u32 queue_size);
+int rtrs_iu_post_recv(struct rtrs_con *con, struct rtrs_iu *iu);
+int rtrs_iu_post_send(struct rtrs_con *con, struct rtrs_iu *iu, size_t size,
+		      struct ib_send_wr *head);
+int rtrs_iu_post_rdma_write_imm(struct rtrs_con *con, struct rtrs_iu *iu,
+				struct ib_sge *sge, unsigned int num_sge,
+				u32 rkey, u64 rdma_addr, u32 imm_data,
+				enum ib_send_flags flags,
+				struct ib_send_wr *head);
+
+int rtrs_post_recv_empty(struct rtrs_con *con, struct ib_cqe *cqe);
+int rtrs_post_recv_empty_x2(struct rtrs_con *con, struct ib_cqe *cqe);
+int rtrs_post_rdma_write_imm_empty(struct rtrs_con *con, struct ib_cqe *cqe,
+				   u32 imm_data, enum ib_send_flags flags,
+				   struct ib_send_wr *head);
+
+int rtrs_cq_qp_create(struct rtrs_sess *rtrs_sess, struct rtrs_con *con,
+		      u32 max_send_sge, int cq_vector, u16 cq_size,
+		      u16 wr_queue_size, enum ib_poll_context poll_ctx);
+void rtrs_cq_qp_destroy(struct rtrs_con *con);
+
+void rtrs_init_hb(struct rtrs_sess *sess, struct ib_cqe *cqe,
+		  unsigned int interval_ms, unsigned int missed_max,
+		  rtrs_hb_handler_t *err_handler,
+		  struct workqueue_struct *wq);
+void rtrs_start_hb(struct rtrs_sess *sess);
+void rtrs_stop_hb(struct rtrs_sess *sess);
+void rtrs_send_hb_ack(struct rtrs_sess *sess);
+
+void rtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
+			   struct rtrs_ib_dev_pool *pool);
+void rtrs_ib_dev_pool_deinit(struct rtrs_ib_dev_pool *pool);
+
+struct rtrs_ib_dev *rtrs_ib_dev_find_or_add(struct ib_device *ib_dev,
+					    struct rtrs_ib_dev_pool *pool);
+int rtrs_ib_dev_put(struct rtrs_ib_dev *dev);
+
+static inline u32 rtrs_to_imm(u32 type, u32 payload)
+{
+	BUILD_BUG_ON(MAX_IMM_PAYL_BITS + MAX_IMM_TYPE_BITS != 32);
+	BUILD_BUG_ON(RTRS_LAST_IMM > (1<<MAX_IMM_TYPE_BITS));
+	return ((type & MAX_IMM_TYPE_MASK) << MAX_IMM_PAYL_BITS) |
+		(payload & MAX_IMM_PAYL_MASK);
+}
+
+static inline void rtrs_from_imm(u32 imm, u32 *type, u32 *payload)
+{
+	*payload = (imm & MAX_IMM_PAYL_MASK);
+	*type = (imm >> MAX_IMM_PAYL_BITS);
+}
+
+static inline u32 rtrs_to_io_req_imm(u32 addr)
+{
+	return rtrs_to_imm(RTRS_IO_REQ_IMM, addr);
+}
+
+static inline u32 rtrs_to_io_rsp_imm(u32 msg_id, int errno, bool w_inval)
+{
+	enum rtrs_imm_type type;
+	u32 payload;
+
+	/* 9 bits for errno, 19 bits for msg_id */
+	payload = (abs(errno) & 0x1ff) << 19 | (msg_id & 0x7ffff);
+	type = (w_inval ? RTRS_IO_RSP_W_INV_IMM : RTRS_IO_RSP_IMM);
+
+	return rtrs_to_imm(type, payload);
+}
+
+static inline void rtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
+{
+	/* 9 bits for errno, 19 bits for msg_id */
+	*msg_id = (payload & 0x7ffff);
+	*errno = -(int)((payload >> 19) & 0x1ff);
+}
+
+#define STAT_STORE_FUNC(type, set_value, reset)				\
+static ssize_t set_value##_store(struct kobject *kobj,			\
+			     struct kobj_attribute *attr,		\
+			     const char *buf, size_t count)		\
+{									\
+	int ret = -EINVAL;						\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	if (sysfs_streq(buf, "1"))					\
+		ret = reset(&sess->stats, true);			\
+	else if (sysfs_streq(buf, "0"))					\
+		ret = reset(&sess->stats, false);			\
+	if (ret)							\
+		return ret;						\
+									\
+	return count;							\
+}
+
+#define STAT_SHOW_FUNC(type, get_value, print)				\
+static ssize_t get_value##_show(struct kobject *kobj,			\
+			   struct kobj_attribute *attr,			\
+			   char *page)					\
+{									\
+	type *sess = container_of(kobj, type, kobj_stats);		\
+									\
+	return print(&sess->stats, page, PAGE_SIZE);			\
+}
+
+#define STAT_ATTR(type, stat, print, reset)				\
+STAT_STORE_FUNC(type, stat, reset)					\
+STAT_SHOW_FUNC(type, stat, print)					\
+static struct kobj_attribute stat##_attr =				\
+		__ATTR(stat, 0644,					\
+		       stat##_show,					\
+		       stat##_store)
+
+#endif /* RTRS_PRI_H */
-- 
2.17.1
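
The immediate-data helpers in rtrs-pri.h above can be tried out in user
space. What follows is a minimal sketch, not the header's code: it assumes
MAX_IMM_TYPE_BITS is 4 and MAX_IMM_PAYL_BITS is 28 (the 9 + 19 bit split in
the comments implies a 28-bit payload, and the BUILD_BUG_ON requires both
widths to sum to 32). All demo_ names and the assert() are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed widths: 4 type bits + 28 payload bits = one 32-bit immediate. */
#define DEMO_IMM_TYPE_BITS 4
#define DEMO_IMM_PAYL_BITS 28
#define DEMO_IMM_TYPE_MASK ((1u << DEMO_IMM_TYPE_BITS) - 1)
#define DEMO_IMM_PAYL_MASK ((1u << DEMO_IMM_PAYL_BITS) - 1)

/* Pack a message type and a payload into one 32-bit immediate value. */
static uint32_t demo_to_imm(uint32_t type, uint32_t payload)
{
	/* Sanity check added by this sketch; the header masks silently. */
	assert(type <= DEMO_IMM_TYPE_MASK);
	return (type << DEMO_IMM_PAYL_BITS) | (payload & DEMO_IMM_PAYL_MASK);
}

/* Split an immediate value back into type and payload. */
static void demo_from_imm(uint32_t imm, uint32_t *type, uint32_t *payload)
{
	*payload = imm & DEMO_IMM_PAYL_MASK;
	*type = imm >> DEMO_IMM_PAYL_BITS;
}

/* IO response payload: 9 bits of |errno| above 19 bits of msg_id. */
static uint32_t demo_io_rsp_payload(uint32_t msg_id, int err)
{
	return ((uint32_t)abs(err) & 0x1ff) << 19 | (msg_id & 0x7ffff);
}

/* Recover msg_id and the (negative) errno from an IO response payload. */
static void demo_from_io_rsp_payload(uint32_t payload, uint32_t *msg_id,
				     int *err)
{
	*msg_id = payload & 0x7ffff;
	*err = -(int)((payload >> 19) & 0x1ff);
}
```

With this split a msg_id above 0x7ffff or an errno magnitude above 511 is
silently truncated, so the sender has to keep both within range.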



* [PATCH v6 04/25] rtrs: core: lib functions shared between client and server modules
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (2 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 22:25   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 05/25] rtrs: client: private header with client structs and functions Jack Wang
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is a set of library functions, built as the rtrs-core module, that
is used by both the client and server modules.

These functions mainly wrap IB and RDMA calls and provide a slightly
higher level of abstraction for implementing the RTRS protocol on the
client and server sides.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
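The address strings accepted by rtrs_addr_to_sockaddr() in this patch take
the form "[src,]dst" or "[src@]dst", where each part is "ip:<addr>" or
"gid:<addr>". A rough user-space sketch of that splitting follows. It
assumes POSIX inet_pton() in place of the kernel helpers, ignores the port,
and uses hypothetical demo_ names throughout.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

enum demo_kind { DEMO_BAD = -1, DEMO_IPV4, DEMO_IPV6, DEMO_GID };

/* Classify one "ip:<addr>" or "gid:<addr>" token of length len. */
static enum demo_kind demo_classify(const char *tok, size_t len)
{
	unsigned char raw[sizeof(struct in6_addr)];
	char buf[INET6_ADDRSTRLEN];
	int is_gid;

	if (len > 4 && strncmp(tok, "gid:", 4) == 0) {
		is_gid = 1;
		tok += 4;
		len -= 4;
	} else if (len > 3 && strncmp(tok, "ip:", 3) == 0) {
		is_gid = 0;
		tok += 3;
		len -= 3;
	} else {
		return DEMO_BAD;
	}
	if (len >= sizeof(buf))
		return DEMO_BAD;
	memcpy(buf, tok, len);
	buf[len] = '\0';

	/* A GID is written in IPv6 notation, so one parser covers both. */
	if (inet_pton(AF_INET6, buf, raw) == 1)
		return is_gid ? DEMO_GID : DEMO_IPV6;
	if (!is_gid && inet_pton(AF_INET, buf, raw) == 1)
		return DEMO_IPV4;
	return DEMO_BAD;
}

/* Split "[src,]dst" or "[src@]dst"; returns 0 on success, -1 on error. */
static int demo_parse_pair(const char *str, enum demo_kind *src,
			   enum demo_kind *dst, int *has_src)
{
	const char *d;

	assert(str && src && dst && has_src);
	d = strchr(str, ',');
	if (!d)
		d = strchr(str, '@');
	*has_src = d != NULL;
	if (d) {
		*src = demo_classify(str, (size_t)(d - str));
		if (*src == DEMO_BAD)
			return -1;
		str = d + 1;
	}
	*dst = demo_classify(str, strlen(str));
	return *dst == DEMO_BAD ? -1 : 0;
}
```

The source part is optional. When it is missing, only the destination is
parsed, mirroring the addr->src = NULL branch in the patch.
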
 drivers/infiniband/ulp/rtrs/rtrs.c | 628 +++++++++++++++++++++++++++++
 1 file changed, 628 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs.c
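
The heartbeat accounting in hb_work() can be modelled without any RDMA
plumbing. The sketch below is a simplified model with hypothetical demo_
names. It assumes the missed counter is reset to zero whenever a heartbeat
ack arrives, which happens outside this file.

```c
#include <assert.h>
#include <stddef.h>

enum demo_hb_action { DEMO_HB_SEND, DEMO_HB_SKIP, DEMO_HB_ERROR };

struct demo_hb {
	unsigned int missed_cnt;	/* ticks since the last ack */
	unsigned int missed_max;	/* tolerated ticks without an ack */
};

/* One timer tick: decide whether to send, wait, or report an error. */
static enum demo_hb_action demo_hb_tick(struct demo_hb *hb)
{
	assert(hb != NULL);
	if (hb->missed_cnt > hb->missed_max)
		return DEMO_HB_ERROR;	/* peer considered dead */
	if (hb->missed_cnt++)
		return DEMO_HB_SKIP;	/* previous heartbeat still unacked */
	return DEMO_HB_SEND;		/* counter was 0: send a heartbeat */
}

/* Assumed behaviour on receiving an ack: start counting from zero. */
static void demo_hb_ack(struct demo_hb *hb)
{
	assert(hb != NULL);
	hb->missed_cnt = 0;
}
```

With missed_max = 2 and no acks, the model sends once, skips twice and
reports an error on the fourth tick.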

diff --git a/drivers/infiniband/ulp/rtrs/rtrs.c b/drivers/infiniband/ulp/rtrs/rtrs.c
new file mode 100644
index 000000000000..8498e3a4d4e3
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs.c
@@ -0,0 +1,628 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/inet.h>
+
+#include "rtrs-pri.h"
+#include "rtrs-log.h"
+
+MODULE_DESCRIPTION("RTRS Core");
+MODULE_LICENSE("GPL");
+
+struct rtrs_iu *rtrs_iu_alloc(u32 queue_size, size_t size, gfp_t gfp_mask,
+				struct ib_device *dma_dev,
+				enum dma_data_direction dir,
+				void (*done)(struct ib_cq *cq,
+					     struct ib_wc *wc))
+{
+	struct rtrs_iu *ius, *iu;
+	int i;
+
+	WARN_ON(!queue_size);
+	ius = kcalloc(queue_size, sizeof(*ius), gfp_mask);
+
+	if (unlikely(!ius))
+		return NULL;
+	for (i = 0; i < queue_size; i++) {
+		iu = &ius[i];
+		iu->buf = kzalloc(size, gfp_mask);
+		if (unlikely(!iu->buf))
+			goto err;
+
+		iu->dma_addr = ib_dma_map_single(dma_dev, iu->buf, size, dir);
+		if (unlikely(ib_dma_mapping_error(dma_dev, iu->dma_addr)))
+			goto err;
+
+		iu->cqe.done  = done;
+		iu->size      = size;
+		iu->direction = dir;
+	}
+
+	return ius;
+
+err:
+	rtrs_iu_free(ius, dir, dma_dev, i);
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(rtrs_iu_alloc);
+
+void rtrs_iu_free(struct rtrs_iu *ius, enum dma_data_direction dir,
+		   struct ib_device *ibdev, u32 queue_size)
+{
+	struct rtrs_iu *iu;
+	int i;
+
+	if (!ius)
+		return;
+
+	for (i = 0; i < queue_size; i++) {
+		iu = &ius[i];
+		ib_dma_unmap_single(ibdev, iu->dma_addr, iu->size, dir);
+		kfree(iu->buf);
+	}
+	kfree(ius);
+}
+EXPORT_SYMBOL_GPL(rtrs_iu_free);
+
+int rtrs_iu_post_recv(struct rtrs_con *con, struct rtrs_iu *iu)
+{
+	struct rtrs_sess *sess = con->sess;
+	struct ib_recv_wr wr;
+	const struct ib_recv_wr *bad_wr;
+	struct ib_sge list;
+
+	list.addr   = iu->dma_addr;
+	list.length = iu->size;
+	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
+
+	if (WARN_ON(list.length == 0)) {
+		rtrs_wrn(con->sess,
+			  "Posting receive work request failed, sg list is empty\n");
+		return -EINVAL;
+	}
+
+	wr.next    = NULL;
+	wr.wr_cqe  = &iu->cqe;
+	wr.sg_list = &list;
+	wr.num_sge = 1;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(rtrs_iu_post_recv);
+
+int rtrs_post_recv_empty(struct rtrs_con *con, struct ib_cqe *cqe)
+{
+	struct ib_recv_wr wr;
+	const struct ib_recv_wr *bad_wr;
+
+	wr.next    = NULL;
+	wr.wr_cqe  = cqe;
+	wr.sg_list = NULL;
+	wr.num_sge = 0;
+
+	return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(rtrs_post_recv_empty);
+
+int rtrs_post_recv_empty_x2(struct rtrs_con *con, struct ib_cqe *cqe)
+{
+	struct ib_recv_wr wr_arr[2], *wr;
+	const struct ib_recv_wr *bad_wr;
+	int i;
+
+	memset(wr_arr, 0, sizeof(wr_arr));
+	for (i = 0; i < ARRAY_SIZE(wr_arr); i++) {
+		wr = &wr_arr[i];
+		wr->wr_cqe  = cqe;
+		if (i)
+			/* Chain backwards */
+			wr->next = &wr_arr[i - 1];
+	}
+
+	return ib_post_recv(con->qp, wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(rtrs_post_recv_empty_x2);
+
+int rtrs_iu_post_send(struct rtrs_con *con, struct rtrs_iu *iu, size_t size,
+		       struct ib_send_wr *head)
+{
+	struct rtrs_sess *sess = con->sess;
+	struct ib_send_wr wr;
+	const struct ib_send_wr *bad_wr;
+	struct ib_sge list;
+
+	if ((WARN_ON(size == 0)))
+		return -EINVAL;
+
+	list.addr   = iu->dma_addr;
+	list.length = size;
+	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.next       = NULL;
+	wr.wr_cqe     = &iu->cqe;
+	wr.sg_list    = &list;
+	wr.num_sge    = 1;
+	wr.opcode     = IB_WR_SEND;
+	wr.send_flags = IB_SEND_SIGNALED;
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr;
+	} else {
+		head = &wr;
+	}
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(rtrs_iu_post_send);
+
+int rtrs_iu_post_rdma_write_imm(struct rtrs_con *con, struct rtrs_iu *iu,
+				 struct ib_sge *sge, unsigned int num_sge,
+				 u32 rkey, u64 rdma_addr, u32 imm_data,
+				 enum ib_send_flags flags,
+				 struct ib_send_wr *head)
+{
+	const struct ib_send_wr *bad_wr;
+	struct ib_rdma_wr wr;
+	int i;
+
+	wr.wr.next	  = NULL;
+	wr.wr.wr_cqe	  = &iu->cqe;
+	wr.wr.sg_list	  = sge;
+	wr.wr.num_sge	  = num_sge;
+	wr.rkey		  = rkey;
+	wr.remote_addr	  = rdma_addr;
+	wr.wr.opcode	  = IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.wr.ex.imm_data = cpu_to_be32(imm_data);
+	wr.wr.send_flags  = flags;
+
+	/*
+	 * If one of the sges has 0 size, the operation will fail with a
+	 * length error
+	 */
+	for (i = 0; i < num_sge; i++)
+		if (WARN_ON(sge[i].length == 0))
+			return -EINVAL;
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr.wr;
+	} else {
+		head = &wr.wr;
+	}
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(rtrs_iu_post_rdma_write_imm);
+
+int rtrs_post_rdma_write_imm_empty(struct rtrs_con *con, struct ib_cqe *cqe,
+				    u32 imm_data, enum ib_send_flags flags,
+				    struct ib_send_wr *head)
+{
+	struct ib_send_wr wr;
+	const struct ib_send_wr *bad_wr;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.wr_cqe	= cqe;
+	wr.send_flags	= flags;
+	wr.opcode	= IB_WR_RDMA_WRITE_WITH_IMM;
+	wr.ex.imm_data	= cpu_to_be32(imm_data);
+
+	if (head) {
+		struct ib_send_wr *tail = head;
+
+		while (tail->next)
+			tail = tail->next;
+		tail->next = &wr;
+	} else {
+		head = &wr;
+	}
+
+	return ib_post_send(con->qp, head, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(rtrs_post_rdma_write_imm_empty);
+
+static void qp_event_handler(struct ib_event *ev, void *ctx)
+{
+	struct rtrs_con *con = ctx;
+
+	switch (ev->event) {
+	case IB_EVENT_COMM_EST:
+		rtrs_info(con->sess, "QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		rdma_notify(con->cm_id, IB_EVENT_COMM_EST);
+		break;
+	default:
+		rtrs_info(con->sess, "Unhandled QP event %s (%d) received\n",
+			   ib_event_msg(ev->event), ev->event);
+		break;
+	}
+}
+
+static int create_cq(struct rtrs_con *con, int cq_vector, u16 cq_size,
+		     enum ib_poll_context poll_ctx)
+{
+	struct rdma_cm_id *cm_id = con->cm_id;
+	struct ib_cq *cq;
+
+	cq = ib_alloc_cq(cm_id->device, con, cq_size,
+			 cq_vector, poll_ctx);
+	if (IS_ERR(cq)) {
+		rtrs_err(con->sess, "Creating completion queue failed, errno: %ld\n",
+			  PTR_ERR(cq));
+		return PTR_ERR(cq);
+	}
+	con->cq = cq;
+
+	return 0;
+}
+
+static int create_qp(struct rtrs_con *con, struct ib_pd *pd,
+		     u16 wr_queue_size, u32 max_sge)
+{
+	struct ib_qp_init_attr init_attr = {NULL};
+	struct rdma_cm_id *cm_id = con->cm_id;
+	int ret;
+
+	init_attr.cap.max_send_wr = wr_queue_size;
+	init_attr.cap.max_recv_wr = wr_queue_size;
+	init_attr.cap.max_recv_sge = 1;
+	init_attr.event_handler = qp_event_handler;
+	init_attr.qp_context = con;
+#undef max_send_sge
+	init_attr.cap.max_send_sge = max_sge;
+
+	init_attr.qp_type = IB_QPT_RC;
+	init_attr.send_cq = con->cq;
+	init_attr.recv_cq = con->cq;
+	init_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+
+	ret = rdma_create_qp(cm_id, pd, &init_attr);
+	if (unlikely(ret)) {
+		rtrs_err(con->sess, "Creating QP failed, err: %d\n", ret);
+		return ret;
+	}
+	con->qp = cm_id->qp;
+
+	return ret;
+}
+
+int rtrs_cq_qp_create(struct rtrs_sess *sess, struct rtrs_con *con,
+		       u32 max_send_sge, int cq_vector, u16 cq_size,
+		       u16 wr_queue_size, enum ib_poll_context poll_ctx)
+{
+	int err;
+
+	err = create_cq(con, cq_vector, cq_size, poll_ctx);
+	if (unlikely(err))
+		return err;
+
+	err = create_qp(con, sess->dev->ib_pd, wr_queue_size, max_send_sge);
+	if (unlikely(err)) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+		return err;
+	}
+	con->sess = sess;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(rtrs_cq_qp_create);
+
+void rtrs_cq_qp_destroy(struct rtrs_con *con)
+{
+	if (con->qp) {
+		rdma_destroy_qp(con->cm_id);
+		con->qp = NULL;
+	}
+	if (con->cq) {
+		ib_free_cq(con->cq);
+		con->cq = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(rtrs_cq_qp_destroy);
+
+static void schedule_hb(struct rtrs_sess *sess)
+{
+	queue_delayed_work(sess->hb_wq, &sess->hb_dwork,
+			   msecs_to_jiffies(sess->hb_interval_ms));
+}
+
+void rtrs_send_hb_ack(struct rtrs_sess *sess)
+{
+	struct rtrs_con *usr_con = sess->con[0];
+	u32 imm;
+	int err;
+
+	imm = rtrs_to_imm(RTRS_HB_ACK_IMM, 0);
+	err = rtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe, imm,
+					      IB_SEND_SIGNALED, NULL);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con);
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(rtrs_send_hb_ack);
+
+static void hb_work(struct work_struct *work)
+{
+	struct rtrs_con *usr_con;
+	struct rtrs_sess *sess;
+	u32 imm;
+	int err;
+
+	sess = container_of(to_delayed_work(work), typeof(*sess), hb_dwork);
+	usr_con = sess->con[0];
+
+	if (sess->hb_missed_cnt > sess->hb_missed_max) {
+		sess->hb_err_handler(usr_con);
+		return;
+	}
+	if (sess->hb_missed_cnt++) {
+		/* Reschedule work without sending hb */
+		schedule_hb(sess);
+		return;
+	}
+	imm = rtrs_to_imm(RTRS_HB_MSG_IMM, 0);
+	err = rtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe, imm,
+					      IB_SEND_SIGNALED, NULL);
+	if (unlikely(err)) {
+		sess->hb_err_handler(usr_con);
+		return;
+	}
+
+	schedule_hb(sess);
+}
+
+void rtrs_init_hb(struct rtrs_sess *sess, struct ib_cqe *cqe,
+		   unsigned int interval_ms, unsigned int missed_max,
+		   rtrs_hb_handler_t *err_handler,
+		   struct workqueue_struct *wq)
+{
+	sess->hb_cqe = cqe;
+	sess->hb_interval_ms = interval_ms;
+	sess->hb_err_handler = err_handler;
+	sess->hb_wq = wq;
+	sess->hb_missed_max = missed_max;
+	sess->hb_missed_cnt = 0;
+	INIT_DELAYED_WORK(&sess->hb_dwork, hb_work);
+}
+EXPORT_SYMBOL_GPL(rtrs_init_hb);
+
+void rtrs_start_hb(struct rtrs_sess *sess)
+{
+	schedule_hb(sess);
+}
+EXPORT_SYMBOL_GPL(rtrs_start_hb);
+
+void rtrs_stop_hb(struct rtrs_sess *sess)
+{
+	cancel_delayed_work_sync(&sess->hb_dwork);
+	sess->hb_missed_cnt = 0;
+	sess->hb_missed_max = 0;
+}
+EXPORT_SYMBOL_GPL(rtrs_stop_hb);
+
+static int rtrs_str_gid_to_sockaddr(const char *addr, size_t len,
+				     short port, struct sockaddr_storage *dst)
+{
+	struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
+	int ret;
+
+	/*
+	 * We can use some of the IPv6 functions since GID is a valid
+	 * IPv6 address format
+	 */
+	ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
+	if (ret == 0)
+		return -EINVAL;
+
+	dst_ib->sib_family = AF_IB;
+	/*
+	 * Use the same TCP server port number as the IB service ID
+	 * on the IB port space range
+	 */
+	dst_ib->sib_sid = cpu_to_be64(RDMA_IB_IP_PS_IB | port);
+	dst_ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+	dst_ib->sib_pkey = cpu_to_be16(0xffff);
+
+	return 0;
+}
+
+/**
+ * rtrs_str_to_sockaddr() - Convert rtrs address string to sockaddr
+ * @addr:	String representation of an addr (IPv4, IPv6 or IB GID):
+ *              - "ip:192.168.1.1"
+ *              - "ip:fe80::200:5aee:feaa:20a2"
+ *              - "gid:fe80::200:5aee:feaa:20a2"
+ * @len:        String address length
+ * @port:	Destination port
+ * @dst:	Destination sockaddr structure
+ *
+ * Returns 0 if conversion successful. Non-zero on error.
+ */
+static int rtrs_str_to_sockaddr(const char *addr, size_t len,
+				 short port, struct sockaddr_storage *dst)
+{
+	if (strncmp(addr, "gid:", 4) == 0) {
+		return rtrs_str_gid_to_sockaddr(addr + 4, len - 4, port, dst);
+	} else if (strncmp(addr, "ip:", 3) == 0) {
+		char port_str[8];
+		char *cpy;
+		int err;
+
+		snprintf(port_str, sizeof(port_str), "%u", port);
+		cpy = kstrndup(addr + 3, len - 3, GFP_KERNEL);
+		err = cpy ? inet_pton_with_scope(&init_net, AF_UNSPEC,
+						 cpy, port_str, dst) : -ENOMEM;
+		kfree(cpy);
+
+		return err;
+	}
+	return -EPROTONOSUPPORT;
+}
+
+int sockaddr_to_str(const struct sockaddr *addr, char *buf, size_t len)
+{
+	int cnt;
+
+	switch (addr->sa_family) {
+	case AF_IB:
+		cnt = scnprintf(buf, len, "gid:%pI6",
+			&((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
+		return cnt;
+	case AF_INET:
+		cnt = scnprintf(buf, len, "ip:%pI4",
+			&((struct sockaddr_in *)addr)->sin_addr);
+		return cnt;
+	case AF_INET6:
+		cnt = scnprintf(buf, len, "ip:%pI6c",
+			  &((struct sockaddr_in6 *)addr)->sin6_addr);
+		return cnt;
+	}
+	cnt = scnprintf(buf, len, "<invalid address family>");
+	pr_err("Invalid address family\n");
+	return cnt;
+}
+EXPORT_SYMBOL(sockaddr_to_str);
+
+int rtrs_addr_to_sockaddr(const char *str, size_t len, short port,
+			   struct rtrs_addr *addr)
+{
+	const char *d;
+	int ret;
+
+	d = strchr(str, ',');
+	if (!d)
+		d = strchr(str, '@');
+	if (d) {
+		if (rtrs_str_to_sockaddr(str, d - str, 0, addr->src))
+			return -EINVAL;
+		d += 1;
+		len -= d - str;
+		str  = d;
+
+	} else {
+		addr->src = NULL;
+	}
+	ret = rtrs_str_to_sockaddr(str, len, port, addr->dst);
+
+	return ret;
+}
+EXPORT_SYMBOL(rtrs_addr_to_sockaddr);
+
+void rtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
+			    struct rtrs_ib_dev_pool *pool)
+{
+	WARN_ON(pool->ops && (!pool->ops->alloc ^ !pool->ops->free));
+	INIT_LIST_HEAD(&pool->list);
+	mutex_init(&pool->mutex);
+	pool->pd_flags = pd_flags;
+}
+EXPORT_SYMBOL(rtrs_ib_dev_pool_init);
+
+void rtrs_ib_dev_pool_deinit(struct rtrs_ib_dev_pool *pool)
+{
+	WARN_ON(!list_empty(&pool->list));
+}
+EXPORT_SYMBOL(rtrs_ib_dev_pool_deinit);
+
+static void dev_free(struct kref *ref)
+{
+	struct rtrs_ib_dev_pool *pool;
+	struct rtrs_ib_dev *dev;
+
+	dev = container_of(ref, typeof(*dev), ref);
+	pool = dev->pool;
+
+	mutex_lock(&pool->mutex);
+	list_del(&dev->entry);
+	mutex_unlock(&pool->mutex);
+
+	if (pool->ops && pool->ops->deinit)
+		pool->ops->deinit(dev);
+
+	ib_dealloc_pd(dev->ib_pd);
+
+	if (pool->ops && pool->ops->free)
+		pool->ops->free(dev);
+	else
+		kfree(dev);
+}
+
+int rtrs_ib_dev_put(struct rtrs_ib_dev *dev)
+{
+	return kref_put(&dev->ref, dev_free);
+}
+EXPORT_SYMBOL(rtrs_ib_dev_put);
+
+static int rtrs_ib_dev_get(struct rtrs_ib_dev *dev)
+{
+	return kref_get_unless_zero(&dev->ref);
+}
+
+struct rtrs_ib_dev *
+rtrs_ib_dev_find_or_add(struct ib_device *ib_dev,
+			 struct rtrs_ib_dev_pool *pool)
+{
+	struct rtrs_ib_dev *dev;
+
+	mutex_lock(&pool->mutex);
+	list_for_each_entry(dev, &pool->list, entry) {
+		if (dev->ib_dev->node_guid == ib_dev->node_guid &&
+		    rtrs_ib_dev_get(dev))
+			goto out_unlock;
+	}
+	if (pool->ops && pool->ops->alloc)
+		dev = pool->ops->alloc();
+	else
+		dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (IS_ERR_OR_NULL(dev))
+		goto out_err;
+
+	kref_init(&dev->ref);
+	dev->pool = pool;
+	dev->ib_dev = ib_dev;
+	dev->ib_pd = ib_alloc_pd(ib_dev, pool->pd_flags);
+	if (IS_ERR(dev->ib_pd))
+		goto out_free_dev;
+
+	if (pool->ops && pool->ops->init && pool->ops->init(dev))
+		goto out_free_pd;
+
+	list_add(&dev->entry, &pool->list);
+out_unlock:
+	mutex_unlock(&pool->mutex);
+	return dev;
+
+out_free_pd:
+	ib_dealloc_pd(dev->ib_pd);
+out_free_dev:
+	if (pool->ops && pool->ops->free)
+		pool->ops->free(dev);
+	else
+		kfree(dev);
+out_err:
+	mutex_unlock(&pool->mutex);
+	return NULL;
+}
+EXPORT_SYMBOL(rtrs_ib_dev_find_or_add);
-- 
2.17.1



* [PATCH v6 05/25] rtrs: client: private header with client structs and functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (3 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 04/25] rtrs: core: lib functions shared between client and server modules Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 22:51   ` Bart Van Assche
  2019-12-30 23:03   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 06/25] rtrs: client: main functionality Jack Wang
                   ` (21 subsequent siblings)
  26 siblings, 2 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This header describes the main structs and functions used by the
rtrs-client module, mainly for managing rtrs sessions, creating and
destroying sysfs entries, and accounting statistics on the client side.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
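The PERMIT_SIZE() and GET_PERMIT() macros in this header suggest that the
permits live in one flat allocation, each permit header followed by pdu_sz
bytes of user data. The user-space sketch below illustrates that layout
under those assumptions. The demo_ names are hypothetical, the con_type
field is left out, and pdu_sz is assumed to be a multiple of the header's
alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Trimmed-down stand-in for struct rtrs_permit. */
struct demo_permit {
	unsigned int cpu_id;
	unsigned int mem_id;
	unsigned int mem_off;
};

struct demo_clt {
	void *permits;		/* queue_depth slots, back to back */
	size_t queue_depth;
	size_t pdu_sz;		/* per-permit private data size */
};

#define DEMO_PERMIT_SIZE(clt) (sizeof(struct demo_permit) + (clt)->pdu_sz)

/* Allocate all permit slots in one zeroed block; 0 on success. */
static int demo_alloc_permits(struct demo_clt *clt, size_t depth,
			      size_t pdu_sz)
{
	clt->queue_depth = depth;
	clt->pdu_sz = pdu_sz;
	clt->permits = calloc(depth, DEMO_PERMIT_SIZE(clt));
	return clt->permits ? 0 : -1;
}

/* Slot idx starts idx * (header + pdu_sz) bytes into the block. */
static struct demo_permit *demo_get_permit(const struct demo_clt *clt,
					   size_t idx)
{
	assert(idx < clt->queue_depth);
	return (struct demo_permit *)((char *)clt->permits +
				      DEMO_PERMIT_SIZE(clt) * idx);
}

/* The private data area sits directly behind the permit header. */
static void *demo_permit_pdu(struct demo_permit *permit)
{
	return permit + 1;
}
```

Because the stride depends on pdu_sz at run time, the slots cannot be
indexed as a plain C array, hence the byte arithmetic.
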
 drivers/infiniband/ulp/rtrs/rtrs-clt.h | 296 +++++++++++++++++++++++++
 1 file changed, 296 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt.h

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.h b/drivers/infiniband/ulp/rtrs/rtrs-clt.h
new file mode 100644
index 000000000000..99e8cf53c5d1
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.h
@@ -0,0 +1,296 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#ifndef RTRS_CLT_H
+#define RTRS_CLT_H
+
+#include <linux/device.h>
+#include "rtrs-pri.h"
+
+/**
+ * enum rtrs_clt_state - Client states.
+ */
+enum rtrs_clt_state {
+	RTRS_CLT_CONNECTING,
+	RTRS_CLT_CONNECTING_ERR,
+	RTRS_CLT_RECONNECTING,
+	RTRS_CLT_CONNECTED,
+	RTRS_CLT_CLOSING,
+	RTRS_CLT_CLOSED,
+	RTRS_CLT_DEAD,
+};
+
+static inline const char *rtrs_clt_state_str(enum rtrs_clt_state state)
+{
+	switch (state) {
+	case RTRS_CLT_CONNECTING:
+		return "RTRS_CLT_CONNECTING";
+	case RTRS_CLT_CONNECTING_ERR:
+		return "RTRS_CLT_CONNECTING_ERR";
+	case RTRS_CLT_RECONNECTING:
+		return "RTRS_CLT_RECONNECTING";
+	case RTRS_CLT_CONNECTED:
+		return "RTRS_CLT_CONNECTED";
+	case RTRS_CLT_CLOSING:
+		return "RTRS_CLT_CLOSING";
+	case RTRS_CLT_CLOSED:
+		return "RTRS_CLT_CLOSED";
+	case RTRS_CLT_DEAD:
+		return "RTRS_CLT_DEAD";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+enum rtrs_mp_policy {
+	MP_POLICY_RR,
+	MP_POLICY_MIN_INFLIGHT,
+};
+
+struct rtrs_clt_stats_reconnects {
+	int successful_cnt;
+	int fail_cnt;
+};
+
+struct rtrs_clt_stats_wc_comp {
+	u32 cnt;
+	u64 total_cnt;
+};
+
+struct rtrs_clt_stats_cpu_migr {
+	atomic_t from;
+	int to;
+};
+
+struct rtrs_clt_stats_rdma {
+	struct {
+		u64 cnt;
+		u64 size_total;
+	} dir[2];
+
+	u64 failover_cnt;
+};
+
+struct rtrs_clt_stats_rdma_lat {
+	u64 read;
+	u64 write;
+};
+
+#define MIN_LOG_SG 2
+#define MAX_LOG_SG 5
+#define MAX_LIN_SG BIT(MIN_LOG_SG)
+#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)
+
+#define MAX_LOG_LAT 16
+#define MIN_LOG_LAT 0
+#define LOG_LAT_SZ (MAX_LOG_LAT - MIN_LOG_LAT + 2)
+
+struct rtrs_clt_stats_pcpu {
+	struct rtrs_clt_stats_cpu_migr		cpu_migr;
+	struct rtrs_clt_stats_rdma		rdma;
+	u64					sg_list_total;
+	u64					sg_list_distr[SG_DISTR_SZ];
+	struct rtrs_clt_stats_rdma_lat		rdma_lat_distr[LOG_LAT_SZ];
+	struct rtrs_clt_stats_rdma_lat		rdma_lat_max;
+	struct rtrs_clt_stats_wc_comp		wc_comp;
+};
+
+struct rtrs_clt_stats {
+	bool					enable_rdma_lat;
+	struct rtrs_clt_stats_pcpu    __percpu	*pcpu_stats;
+	struct rtrs_clt_stats_reconnects	reconnects;
+	atomic_t				inflight;
+};
+
+struct rtrs_clt_con {
+	struct rtrs_con	c;
+	struct rtrs_iu		*rsp_ius;
+	u32			queue_size;
+	unsigned int		cpu;
+	atomic_t		io_cnt;
+	int			cm_err;
+};
+
+/**
+ * struct rtrs_permit - permits the memory allocation for future RDMA operation
+ */
+struct rtrs_permit {
+	enum rtrs_clt_con_type con_type;
+	unsigned int cpu_id;
+	unsigned int mem_id;
+	unsigned int mem_off;
+};
+
+/**
+ * struct rtrs_clt_io_req - describes one inflight IO request
+ */
+struct rtrs_clt_io_req {
+	struct list_head        list;
+	struct rtrs_iu		*iu;
+	struct scatterlist	*sglist; /* list holding user data */
+	unsigned int		sg_cnt;
+	unsigned int		sg_size;
+	unsigned int		data_len;
+	unsigned int		usr_len;
+	void			*priv;
+	bool			in_use;
+	struct rtrs_clt_con	*con;
+	struct rtrs_sg_desc	*desc;
+	struct ib_sge		*sge;
+	struct rtrs_permit	*permit;
+	enum dma_data_direction dir;
+	rtrs_conf_fn		*conf;
+	unsigned long		start_jiffies;
+
+	struct ib_mr		*mr;
+	struct ib_cqe		inv_cqe;
+	struct completion	inv_comp;
+	int			inv_errno;
+	bool			need_inv_comp;
+	bool			need_inv;
+};
+
+struct rtrs_rbuf {
+	u64 addr;
+	u32 rkey;
+};
+
+struct rtrs_clt_sess {
+	struct rtrs_sess	s;
+	struct rtrs_clt	*clt;
+	wait_queue_head_t	state_wq;
+	enum rtrs_clt_state	state;
+	atomic_t		connected_cnt;
+	struct mutex		init_mutex;
+	struct rtrs_clt_io_req	*reqs;
+	struct delayed_work	reconnect_dwork;
+	struct work_struct	close_work;
+	unsigned int		reconnect_attempts;
+	bool			established;
+	struct rtrs_rbuf	*rbufs;
+	size_t			max_io_size;
+	u32			max_hdr_size;
+	u32			chunk_size;
+	size_t			queue_depth;
+	u32			max_pages_per_mr;
+	int			max_send_sge;
+	u32			flags;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct rtrs_clt_stats  stats;
+	/* cache hca_port and hca_name to display in sysfs */
+	u8			hca_port;
+	char                    hca_name[IB_DEVICE_NAME_MAX];
+	struct list_head __percpu
+				*mp_skip_entry;
+};
+
+struct rtrs_clt {
+	struct list_head   /* __rcu */ paths_list;
+	size_t			       paths_num;
+	struct rtrs_clt_sess
+		      __rcu * __percpu *pcpu_path;
+
+	bool			opened;
+	uuid_t			paths_uuid;
+	int			paths_up;
+	struct mutex		paths_mutex;
+	struct mutex		paths_ev_mutex;
+	char			sessname[NAME_MAX];
+	short			port;
+	unsigned int		max_reconnect_attempts;
+	unsigned int		reconnect_delay_sec;
+	unsigned int		max_segments;
+	void			*permits;
+	unsigned long		*permits_map;
+	size_t			queue_depth;
+	size_t			max_io_size;
+	wait_queue_head_t	permits_wait;
+	size_t			pdu_sz;
+	void			*priv;
+	link_clt_ev_fn		*link_ev;
+	struct device		dev;
+	struct kobject		kobj_paths;
+	enum rtrs_mp_policy	mp_policy;
+};
+
+static inline struct rtrs_clt_con *to_clt_con(struct rtrs_con *c)
+{
+	return container_of(c, struct rtrs_clt_con, c);
+}
+
+static inline struct rtrs_clt_sess *to_clt_sess(struct rtrs_sess *s)
+{
+	return container_of(s, struct rtrs_clt_sess, s);
+}
+
+#define PERMIT_SIZE(clt) (sizeof(struct rtrs_permit) + (clt)->pdu_sz)
+#define GET_PERMIT(clt, idx) ((clt)->permits + PERMIT_SIZE(clt) * (idx))
+
+int rtrs_clt_reconnect_from_sysfs(struct rtrs_clt_sess *sess);
+int rtrs_clt_disconnect_from_sysfs(struct rtrs_clt_sess *sess);
+int rtrs_clt_create_path_from_sysfs(struct rtrs_clt *clt,
+				     struct rtrs_addr *addr);
+int rtrs_clt_remove_path_from_sysfs(struct rtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self);
+
+void rtrs_clt_set_max_reconnect_attempts(struct rtrs_clt *clt, int value);
+int rtrs_clt_get_max_reconnect_attempts(const struct rtrs_clt *clt);
+
+/* rtrs-clt-stats.c */
+
+int rtrs_clt_init_stats(struct rtrs_clt_stats *stats);
+void rtrs_clt_free_stats(struct rtrs_clt_stats *stats);
+
+void rtrs_clt_decrease_inflight(struct rtrs_clt_stats *s);
+void rtrs_clt_inc_failover_cnt(struct rtrs_clt_stats *s);
+
+void rtrs_clt_update_rdma_lat(struct rtrs_clt_stats *s, bool read,
+			       unsigned long ms);
+void rtrs_clt_update_wc_stats(struct rtrs_clt_con *con);
+void rtrs_clt_update_all_stats(struct rtrs_clt_io_req *req, int dir);
+
+int rtrs_clt_reset_sg_list_distr_stats(struct rtrs_clt_stats *stats,
+					bool enable);
+int rtrs_clt_stats_sg_list_distr_to_str(struct rtrs_clt_stats *stats,
+					 char *buf, size_t len);
+int rtrs_clt_reset_rdma_lat_distr_stats(struct rtrs_clt_stats *stats,
+					 bool enable);
+ssize_t rtrs_clt_stats_rdma_lat_distr_to_str(struct rtrs_clt_stats *stats,
+					      char *page, size_t len);
+int rtrs_clt_reset_cpu_migr_stats(struct rtrs_clt_stats *stats, bool enable);
+int rtrs_clt_stats_migration_cnt_to_str(struct rtrs_clt_stats *stats, char *buf,
+					 size_t len);
+int rtrs_clt_reset_reconnects_stat(struct rtrs_clt_stats *stats, bool enable);
+int rtrs_clt_stats_reconnects_to_str(struct rtrs_clt_stats *stats, char *buf,
+				      size_t len);
+int rtrs_clt_reset_wc_comp_stats(struct rtrs_clt_stats *stats, bool enable);
+int rtrs_clt_stats_wc_completion_to_str(struct rtrs_clt_stats *stats, char *buf,
+					 size_t len);
+int rtrs_clt_reset_rdma_stats(struct rtrs_clt_stats *stats, bool enable);
+ssize_t rtrs_clt_stats_rdma_to_str(struct rtrs_clt_stats *stats,
+				    char *page, size_t len);
+int rtrs_clt_reset_all_stats(struct rtrs_clt_stats *stats, bool enable);
+ssize_t rtrs_clt_reset_all_help(struct rtrs_clt_stats *stats,
+				 char *page, size_t len);
+
+/* rtrs-clt-sysfs.c */
+
+int rtrs_clt_create_sysfs_root_folders(struct rtrs_clt *clt);
+int rtrs_clt_create_sysfs_root_files(struct rtrs_clt *clt);
+void rtrs_clt_destroy_sysfs_root_folders(struct rtrs_clt *clt);
+void rtrs_clt_destroy_sysfs_root_files(struct rtrs_clt *clt);
+
+int rtrs_clt_create_sess_files(struct rtrs_clt_sess *sess);
+void rtrs_clt_destroy_sess_files(struct rtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self);
+
+#endif /* RTRS_CLT_H */
-- 
2.17.1



* [PATCH v6 06/25] rtrs: client: main functionality
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (4 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 05/25] rtrs: client: private header with client structs and functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 23:53   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 07/25] rtrs: client: statistics functions Jack Wang
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the main functionality of the rtrs-client module, which manages
a set of RDMA connections for each rtrs session and performs
multipathing, load balancing and failover of RDMA requests.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
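To illustrate the two multipath policies named in rtrs-clt.h (MP_POLICY_RR
and MP_POLICY_MIN_INFLIGHT), here is a deliberately simplified user-space
sketch of path selection with failover. It is not the code in this patch:
the flat array of paths, the demo_ names and the rr_next cursor are all
assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum demo_policy { DEMO_MP_RR, DEMO_MP_MIN_INFLIGHT };

struct demo_path {
	int connected;	/* non-zero if the path can carry IO */
	int inflight;	/* requests currently outstanding */
};

/* Return the index of the path to use, or -1 if none is connected. */
static int demo_pick_path(const struct demo_path *paths, size_t n,
			  enum demo_policy policy, size_t *rr_next)
{
	int best = -1, best_inflight = INT_MAX;
	size_t i;

	assert(paths && rr_next);
	if (policy == DEMO_MP_RR) {
		/* Start at the cursor and skip paths that are down. */
		for (i = 0; i < n; i++) {
			size_t idx = (*rr_next + i) % n;

			if (paths[idx].connected) {
				*rr_next = (idx + 1) % n;
				return (int)idx;
			}
		}
		return -1;
	}
	/* Min-inflight: the connected path with the fewest requests. */
	for (i = 0; i < n; i++) {
		if (paths[i].connected && paths[i].inflight < best_inflight) {
			best = (int)i;
			best_inflight = paths[i].inflight;
		}
	}
	return best;
}
```

In both policies a path that is not connected is never chosen, which is
the failover behaviour in its simplest form.
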
 drivers/infiniband/ulp/rtrs/rtrs-clt.c | 2934 ++++++++++++++++++++++++
 1 file changed, 2934 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt.c
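
Note for readers, not part of the commit: permits are handed out from a
bitmap, and a permit is later mapped to a connection by the CPU it was
taken on. The sketch below is a simplified, single-threaded user-space
model of that idea. QUEUE_DEPTH, get_permit, put_permit and
permit_to_con_id are hypothetical names, and the plain bit operations
only stand in for find_first_zero_bit(), test_and_set_bit_lock() and
clear_bit_unlock(); they are not atomic.

```c
#include <assert.h>

#define QUEUE_DEPTH 8	/* hypothetical; the patch uses clt->queue_depth */

static unsigned long permits_map;	/* one bit per permit, 0 means free */

/* Returns the index of the first free permit, or -1 if all are in use. */
static int get_permit(void)
{
	for (int bit = 0; bit < QUEUE_DEPTH; bit++) {
		if (!(permits_map & (1UL << bit))) {
			permits_map |= 1UL << bit;
			return bit;
		}
	}
	return -1;
}

/* Releases a permit that must currently be in use. */
static void put_permit(int bit)
{
	assert(bit >= 0 && bit < QUEUE_DEPTH);
	assert(permits_map & (1UL << bit));
	permits_map &= ~(1UL << bit);
}

/* IO connections start at 1; connection 0 is kept for user messages. */
static int permit_to_con_id(int cpu_id, int con_num, int is_io)
{
	assert(con_num >= 2);
	return is_io ? (cpu_id % (con_num - 1)) + 1 : 0;
}
```

In the patch itself a caller that may sleep waits on clt->permits_wait
when no bit is free, instead of getting a failure back.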

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
new file mode 100644
index 000000000000..3036952121d5
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -0,0 +1,2934 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/rculist.h>
+#include <linux/blkdev.h> /* for BLK_MAX_SEGMENT_SIZE */
+
+#include "rtrs-clt.h"
+#include "rtrs-log.h"
+
+#define RTRS_CONNECT_TIMEOUT_MS 30000
+
+MODULE_DESCRIPTION("RTRS Client");
+MODULE_LICENSE("GPL");
+
+static ushort nr_cons_per_session;
+module_param(nr_cons_per_session, ushort, 0444);
+MODULE_PARM_DESC(nr_cons_per_session,
+		 "Number of connections per session. (default: nr_cpu_ids)");
+
+static int retry_cnt = 7;
+module_param_named(retry_cnt, retry_cnt, int, 0644);
+MODULE_PARM_DESC(retry_cnt,
+		 "Number of times to send the message if the remote side didn't respond with Ack or Nack (default: 7, min: "
+		 __stringify(MIN_RTR_CNT) ", max: "
+		 __stringify(MAX_RTR_CNT) ")");
+
+static int __read_mostly noreg_cnt;
+module_param_named(noreg_cnt, noreg_cnt, int, 0444);
+MODULE_PARM_DESC(noreg_cnt,
+		 "Max number of SG entries when MR registration does not happen (default: 0)");
+
+static const struct rtrs_ib_dev_pool_ops dev_pool_ops;
+static struct rtrs_ib_dev_pool dev_pool = {
+	.ops = &dev_pool_ops
+};
+
+static struct workqueue_struct *rtrs_wq;
+static struct class *rtrs_dev_class;
+
+static inline bool rtrs_clt_is_connected(const struct rtrs_clt *clt)
+{
+	struct rtrs_clt_sess *sess;
+	bool connected = false;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry)
+		connected |= (READ_ONCE(sess->state) == RTRS_CLT_CONNECTED);
+	rcu_read_unlock();
+
+	return connected;
+}
+
+static inline struct rtrs_permit *
+__rtrs_get_permit(struct rtrs_clt *clt, enum rtrs_clt_con_type con_type)
+{
+	size_t max_depth = clt->queue_depth;
+	struct rtrs_permit *permit;
+	int cpu, bit;
+
+	cpu = get_cpu();
+	do {
+		bit = find_first_zero_bit(clt->permits_map, max_depth);
+		if (unlikely(bit >= max_depth)) {
+			put_cpu();
+			return NULL;
+		}
+
+	} while (unlikely(test_and_set_bit_lock(bit, clt->permits_map)));
+	put_cpu();
+
+	permit = GET_PERMIT(clt, bit);
+	WARN_ON(permit->mem_id != bit);
+	permit->cpu_id = cpu;
+	permit->con_type = con_type;
+
+	return permit;
+}
+
+static inline void __rtrs_put_permit(struct rtrs_clt *clt,
+				      struct rtrs_permit *permit)
+{
+	clear_bit_unlock(permit->mem_id, clt->permits_map);
+}
+
+struct rtrs_permit *rtrs_clt_get_permit(struct rtrs_clt *clt,
+					  enum rtrs_clt_con_type con_type,
+					  int can_wait)
+{
+	struct rtrs_permit *permit;
+	DEFINE_WAIT(wait);
+
+	permit = __rtrs_get_permit(clt, con_type);
+	if (likely(permit) || !can_wait)
+		return permit;
+
+	do {
+		prepare_to_wait(&clt->permits_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+		permit = __rtrs_get_permit(clt, con_type);
+		if (likely(permit))
+			break;
+
+		io_schedule();
+	} while (1);
+
+	finish_wait(&clt->permits_wait, &wait);
+
+	return permit;
+}
+EXPORT_SYMBOL(rtrs_clt_get_permit);
+
+void rtrs_clt_put_permit(struct rtrs_clt *clt, struct rtrs_permit *permit)
+{
+	if (WARN_ON(!test_bit(permit->mem_id, clt->permits_map)))
+		return;
+
+	__rtrs_put_permit(clt, permit);
+
+	/*
+	 * Putting a permit is a barrier, so we will observe
+	 * new entry in the wait list, no worries.
+	 */
+	if (waitqueue_active(&clt->permits_wait))
+		wake_up(&clt->permits_wait);
+}
+EXPORT_SYMBOL(rtrs_clt_put_permit);
+
+struct rtrs_permit *rtrs_permit_from_pdu(void *pdu)
+{
+	return pdu - sizeof(struct rtrs_permit);
+}
+EXPORT_SYMBOL(rtrs_permit_from_pdu);
+
+void *rtrs_permit_to_pdu(struct rtrs_permit *permit)
+{
+	return permit + 1;
+}
+EXPORT_SYMBOL(rtrs_permit_to_pdu);
+
+/**
+ * rtrs_permit_to_clt_con() - returns the RDMA connection for the permit
+ *
+ * Note:
+ *     IO connection starts from 1.
+ *     0 connection is for user messages.
+ */
+static
+struct rtrs_clt_con *rtrs_permit_to_clt_con(struct rtrs_clt_sess *sess,
+					      struct rtrs_permit *permit)
+{
+	int id = 0;
+
+	if (likely(permit->con_type == RTRS_IO_CON))
+		id = (permit->cpu_id % (sess->s.con_num - 1)) + 1;
+
+	return to_clt_con(sess->s.con[id]);
+}
+
+static bool __rtrs_clt_change_state(struct rtrs_clt_sess *sess,
+				     enum rtrs_clt_state new_state)
+{
+	enum rtrs_clt_state old_state;
+	bool changed = false;
+
+	lockdep_assert_held(&sess->state_wq.lock);
+
+	old_state = sess->state;
+	switch (new_state) {
+	case RTRS_CLT_CONNECTING:
+		switch (old_state) {
+		case RTRS_CLT_RECONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_CLT_RECONNECTING:
+		switch (old_state) {
+		case RTRS_CLT_CONNECTED:
+		case RTRS_CLT_CONNECTING_ERR:
+		case RTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_CLT_CONNECTED:
+		switch (old_state) {
+		case RTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_CLT_CONNECTING_ERR:
+		switch (old_state) {
+		case RTRS_CLT_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_CLT_CLOSING:
+		switch (old_state) {
+		case RTRS_CLT_CONNECTING:
+		case RTRS_CLT_CONNECTING_ERR:
+		case RTRS_CLT_RECONNECTING:
+		case RTRS_CLT_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_CLT_CLOSED:
+		switch (old_state) {
+		case RTRS_CLT_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_CLT_DEAD:
+		switch (old_state) {
+		case RTRS_CLT_CLOSED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed) {
+		sess->state = new_state;
+		wake_up_locked(&sess->state_wq);
+	}
+
+	return changed;
+}
+
+static bool rtrs_clt_change_state_from_to(struct rtrs_clt_sess *sess,
+					   enum rtrs_clt_state old_state,
+					   enum rtrs_clt_state new_state)
+{
+	bool changed = false;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	if (sess->state == old_state)
+		changed = __rtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static void rtrs_rdma_error_recovery(struct rtrs_clt_con *con)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	if (rtrs_clt_change_state_from_to(sess,
+					   RTRS_CLT_CONNECTED,
+					   RTRS_CLT_RECONNECTING)) {
+		/*
+		 * Normal scenario, reconnect if we were successfully connected
+		 */
+		queue_delayed_work(rtrs_wq, &sess->reconnect_dwork, 0);
+	} else {
+		/*
+		 * The error can also happen while a new connection is being
+		 * established, so notify the waiter with an error state; the
+		 * waiter is responsible for cleanup and reconnect if needed.
+		 */
+		rtrs_clt_change_state_from_to(sess,
+					       RTRS_CLT_CONNECTING,
+					       RTRS_CLT_CONNECTING_ERR);
+	}
+}
+
+static void rtrs_clt_fast_reg_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_clt_con *con = cq->cq_context;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(con->c.sess, "Failed IB_WR_REG_MR: %s\n",
+			  ib_wc_status_msg(wc->status));
+		rtrs_rdma_error_recovery(con);
+	}
+}
+
+static struct ib_cqe fast_reg_cqe = {
+	.done = rtrs_clt_fast_reg_done
+};
+
+static void complete_rdma_req(struct rtrs_clt_io_req *req, int errno,
+			      bool notify, bool can_wait);
+
+static void rtrs_clt_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_clt_io_req *req =
+		container_of(wc->wr_cqe, typeof(*req), inv_cqe);
+	struct rtrs_clt_con *con = cq->cq_context;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(con->c.sess, "Failed IB_WR_LOCAL_INV: %s\n",
+			  ib_wc_status_msg(wc->status));
+		rtrs_rdma_error_recovery(con);
+	}
+	req->need_inv = false;
+	if (likely(req->need_inv_comp))
+		complete(&req->inv_comp);
+	else
+		/* Complete request from INV callback */
+		complete_rdma_req(req, req->inv_errno, true, false);
+}
+
+static int rtrs_inv_rkey(struct rtrs_clt_io_req *req)
+{
+	struct rtrs_clt_con *con = req->con;
+	const struct ib_send_wr *bad_wr;
+	struct ib_send_wr wr = {
+		.opcode		    = IB_WR_LOCAL_INV,
+		.wr_cqe		    = &req->inv_cqe,
+		.next		    = NULL,
+		.num_sge	    = 0,
+		.send_flags	    = IB_SEND_SIGNALED,
+		.ex.invalidate_rkey = req->mr->rkey,
+	};
+	req->inv_cqe.done = rtrs_clt_inv_rkey_done;
+
+	return ib_post_send(con->c.qp, &wr, &bad_wr);
+}
+
+static void complete_rdma_req(struct rtrs_clt_io_req *req, int errno,
+			      bool notify, bool can_wait)
+{
+	struct rtrs_clt_con *con = req->con;
+	struct rtrs_clt_sess *sess;
+	int err;
+
+	if (WARN_ON(!req->in_use))
+		return;
+	if (WARN_ON(!req->con))
+		return;
+	sess = to_clt_sess(con->c.sess);
+
+	if (req->sg_cnt) {
+		if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
+			/*
+			 * We are here to invalidate RDMA read requests
+			 * ourselves.  In normal scenario server should
+			 * send INV for all requested RDMA reads, but
+			 * we are here, thus two things could happen:
+			 *
+			 *    1.  this is failover, when errno != 0
+			 *        and can_wait == 1,
+			 *
+			 *    2.  something totally bad happened and
+			 *        server forgot to send INV, so we
+			 *        should do that ourselves.
+			 */
+
+			if (likely(can_wait)) {
+				req->need_inv_comp = true;
+			} else {
+				/* This should be IO path, so always notify */
+				WARN_ON(!notify);
+				/* Save errno for INV callback */
+				req->inv_errno = errno;
+			}
+
+			err = rtrs_inv_rkey(req);
+			if (unlikely(err)) {
+				rtrs_err(con->c.sess, "Send INV WR key=%#x: %d\n",
+					  req->mr->rkey, err);
+			} else if (likely(can_wait)) {
+				wait_for_completion(&req->inv_comp);
+			} else {
+				/*
+				 * Something went wrong, so request will be
+				 * completed from INV callback.
+				 */
+				WARN_ON_ONCE(1);
+
+				return;
+			}
+		}
+		ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
+				req->sg_cnt, req->dir);
+	}
+	if (sess->stats.enable_rdma_lat)
+		rtrs_clt_update_rdma_lat(&sess->stats,
+					  req->dir == DMA_FROM_DEVICE,
+					  jiffies_to_msecs(jiffies -
+							   req->start_jiffies));
+	rtrs_clt_decrease_inflight(&sess->stats);
+
+	req->in_use = false;
+	req->con = NULL;
+
+	if (notify)
+		req->conf(req->priv, errno);
+}
+
+static int rtrs_post_send_rdma(struct rtrs_clt_con *con,
+				struct rtrs_clt_io_req *req,
+				struct rtrs_rbuf *rbuf, u32 off,
+				u32 imm, struct ib_send_wr *wr)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	enum ib_send_flags flags;
+	struct ib_sge sge;
+
+	if (unlikely(!req->sg_size)) {
+		rtrs_wrn(con->c.sess,
+			 "Doing RDMA Write failed, no data supplied\n");
+		return -EINVAL;
+	}
+
+	/* user data and user message in the first list element */
+	sge.addr   = req->iu->dma_addr;
+	sge.length = req->sg_size;
+	sge.lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
+				      req->sg_size, DMA_TO_DEVICE);
+
+	return rtrs_iu_post_rdma_write_imm(&con->c, req->iu, &sge, 1,
+					    rbuf->rkey, rbuf->addr + off,
+					    imm, flags, wr);
+}
+
+static void process_io_rsp(struct rtrs_clt_sess *sess, u32 msg_id,
+			   s16 errno, bool w_inval)
+{
+	struct rtrs_clt_io_req *req;
+
+	if (WARN_ON(msg_id >= sess->queue_depth))
+		return;
+
+	req = &sess->reqs[msg_id];
+	/* Drop need_inv if server responded with invalidation */
+	req->need_inv &= !w_inval;
+	complete_rdma_req(req, errno, true, false);
+}
+
+static void rtrs_clt_recv_done(struct rtrs_clt_con *con, struct ib_wc *wc)
+{
+	struct rtrs_iu *iu;
+	int err;
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	WARN_ON(sess->flags != RTRS_MSG_NEW_RKEY_F);
+	iu = container_of(wc->wr_cqe, struct rtrs_iu,
+			  cqe);
+	err = rtrs_iu_post_recv(&con->c, iu);
+	if (unlikely(err)) {
+		rtrs_err(con->c.sess, "post iu failed %d\n", err);
+		rtrs_rdma_error_recovery(con);
+	}
+}
+
+static void rtrs_clt_rkey_rsp_done(struct rtrs_clt_con *con, struct ib_wc *wc)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rtrs_msg_rkey_rsp *msg;
+	u32 imm_type, imm_payload;
+	bool w_inval = false;
+	struct rtrs_iu *iu;
+	u32 buf_id;
+	int err;
+
+	WARN_ON(sess->flags != RTRS_MSG_NEW_RKEY_F);
+
+	iu = container_of(wc->wr_cqe, struct rtrs_iu, cqe);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		rtrs_err(con->c.sess, "rkey response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, iu->dma_addr,
+				   iu->size, DMA_FROM_DEVICE);
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != RTRS_MSG_RKEY_RSP)) {
+		rtrs_err(sess->clt, "rkey response is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto out;
+	}
+	buf_id = le16_to_cpu(msg->buf_id);
+	if (WARN_ON(buf_id >= sess->queue_depth))
+		goto out;
+
+	rtrs_from_imm(be32_to_cpu(wc->ex.imm_data), &imm_type, &imm_payload);
+	if (likely(imm_type == RTRS_IO_RSP_IMM ||
+		   imm_type == RTRS_IO_RSP_W_INV_IMM)) {
+		u32 msg_id;
+
+		w_inval = (imm_type == RTRS_IO_RSP_W_INV_IMM);
+		rtrs_from_io_rsp_imm(imm_payload, &msg_id, &err);
+
+		if (WARN_ON(buf_id != msg_id))
+			goto out;
+		sess->rbufs[buf_id].rkey = le32_to_cpu(msg->rkey);
+		process_io_rsp(sess, msg_id, err, w_inval);
+	}
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, iu->dma_addr,
+				      iu->size, DMA_FROM_DEVICE);
+	return rtrs_clt_recv_done(con, wc);
+out:
+	rtrs_rdma_error_recovery(con);
+}
+
+static void rtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+
+static struct ib_cqe io_comp_cqe = {
+	.done = rtrs_clt_rdma_done
+};
+
+static void rtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_clt_con *con = cq->cq_context;
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u32 imm_type, imm_payload;
+	bool w_inval = false;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			rtrs_err(sess->clt, "RDMA failed: %s\n",
+				  ib_wc_status_msg(wc->status));
+			rtrs_rdma_error_recovery(con);
+		}
+		return;
+	}
+	rtrs_clt_update_wc_stats(con);
+
+	switch (wc->opcode) {
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe->done != rtrs_clt_rdma_done))
+			return;
+		rtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == RTRS_IO_RSP_IMM ||
+			   imm_type == RTRS_IO_RSP_W_INV_IMM)) {
+			u32 msg_id;
+
+			w_inval = (imm_type == RTRS_IO_RSP_W_INV_IMM);
+			rtrs_from_io_rsp_imm(imm_payload, &msg_id, &err);
+
+			process_io_rsp(sess, msg_id, err, w_inval);
+		} else if (imm_type == RTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			rtrs_send_hb_ack(&sess->s);
+			if (sess->flags == RTRS_MSG_NEW_RKEY_F)
+				return rtrs_clt_recv_done(con, wc);
+		} else if (imm_type == RTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+			if (sess->flags == RTRS_MSG_NEW_RKEY_F)
+				return rtrs_clt_recv_done(con, wc);
+		} else {
+			rtrs_wrn(con->c.sess, "Unknown IMM type %u\n",
+				  imm_type);
+		}
+		if (w_inval)
+			/*
+			 * Post x2 empty WRs: first is for this RDMA with IMM,
+			 * second is for RECV with INV, which happened earlier.
+			 */
+			err = rtrs_post_recv_empty_x2(&con->c, &io_comp_cqe);
+		else
+			err = rtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			rtrs_err(con->c.sess, "rtrs_post_recv_empty(): %d\n",
+				  err);
+			rtrs_rdma_error_recovery(con);
+			break;
+		}
+		break;
+	case IB_WC_RECV:
+		/*
+		 * Key invalidations from server side
+		 */
+		WARN_ON(!(wc->wc_flags & IB_WC_WITH_INVALIDATE ||
+			  wc->wc_flags & IB_WC_WITH_IMM));
+		WARN_ON(wc->wr_cqe->done != rtrs_clt_rdma_done);
+		if (sess->flags == RTRS_MSG_NEW_RKEY_F) {
+			if (wc->wc_flags & IB_WC_WITH_INVALIDATE)
+				return rtrs_clt_recv_done(con, wc);
+
+			return rtrs_clt_rkey_rsp_done(con, wc);
+		}
+		break;
+	case IB_WC_RDMA_WRITE:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+
+	default:
+		rtrs_wrn(sess->clt, "Unexpected WC type: %d\n", wc->opcode);
+		return;
+	}
+}
+
+static int post_recv_io(struct rtrs_clt_con *con, size_t q_size)
+{
+	int err, i;
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	for (i = 0; i < q_size; i++) {
+		if (sess->flags == RTRS_MSG_NEW_RKEY_F) {
+			struct rtrs_iu *iu = &con->rsp_ius[i];
+
+			err = rtrs_iu_post_recv(&con->c, iu);
+		} else {
+			err = rtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		}
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct rtrs_clt_sess *sess)
+{
+	size_t q_size = 0;
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (cid == 0)
+			q_size = SERVICE_CON_QUEUE_DEPTH;
+		else
+			q_size = sess->queue_depth;
+
+		/*
+		 * x2 for RDMA read responses + FR key invalidations,
+		 * RDMA writes do not require any FR registrations.
+		 */
+		q_size *= 2;
+
+		err = post_recv_io(to_clt_con(sess->s.con[cid]), q_size);
+		if (unlikely(err)) {
+			rtrs_err(sess->clt, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+struct path_it {
+	int i;
+	struct list_head skip_list;
+	struct rtrs_clt *clt;
+	struct rtrs_clt_sess *(*next_path)(struct path_it *it);
+};
+
+#define do_each_path(path, clt, it) {					\
+	path_it_init(it, clt);						\
+	rcu_read_lock();						\
+	for ((it)->i = 0; ((path) = ((it)->next_path)(it)) &&		\
+			  (it)->i < (it)->clt->paths_num;		\
+	     (it)->i++)
+
+#define while_each_path(it)						\
+	path_it_deinit(it);						\
+	rcu_read_unlock();						\
+	}
+
+/**
+ * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
+ * @head:	the head for the list.
+ * @ptr:        the list head to take the next element from.
+ * @type:       the type of the struct this is embedded in.
+ * @memb:       the name of the list_head within the struct.
+ *
+ * Next element returned in round-robin fashion, i.e. head will be skipped,
+ * but if list is observed as empty, NULL will be returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
+({ \
+	list_next_or_null_rcu(head, ptr, type, memb) ?: \
+		list_next_or_null_rcu(head, READ_ONCE((ptr)->next), \
+				      type, memb); \
+})
+
+/**
+ * get_next_path_rr() - Returns path in round-robin fashion.
+ * @it:	the path iterator
+ *
+ * Related to @MP_POLICY_RR
+ *
+ * Locks:
+ *    rcu_read_lock() must be held.
+ */
+static struct rtrs_clt_sess *get_next_path_rr(struct path_it *it)
+{
+	struct rtrs_clt_sess __rcu **ppcpu_path;
+	struct rtrs_clt_sess *path;
+	struct rtrs_clt *clt;
+
+	clt = it->clt;
+
+	/*
+	 * Here we use two RCU objects: @paths_list and @pcpu_path
+	 * pointer.  See rtrs_clt_remove_path_from_arr() for details
+	 * on how that is handled.
+	 */
+
+	ppcpu_path = this_cpu_ptr(clt->pcpu_path);
+	path = rcu_dereference(*ppcpu_path);
+	if (unlikely(!path))
+		path = list_first_or_null_rcu(&clt->paths_list,
+					      typeof(*path), s.entry);
+	else
+		path = list_next_or_null_rr_rcu(&clt->paths_list,
+						&path->s.entry,
+						typeof(*path),
+						s.entry);
+	rcu_assign_pointer(*ppcpu_path, path);
+
+	return path;
+}
+
+/**
+ * get_next_path_min_inflight() - Returns path with minimal inflight count.
+ * @it:	the path iterator
+ *
+ * Related to @MP_POLICY_MIN_INFLIGHT
+ *
+ * Locks:
+ *    rcu_read_lock() must be held.
+ */
+static struct rtrs_clt_sess *get_next_path_min_inflight(struct path_it *it)
+{
+	struct rtrs_clt_sess *min_path = NULL;
+	struct rtrs_clt *clt = it->clt;
+	struct rtrs_clt_sess *sess;
+	int min_inflight = INT_MAX;
+	int inflight;
+
+	list_for_each_entry_rcu(sess, &clt->paths_list, s.entry) {
+		if (unlikely(!list_empty(raw_cpu_ptr(sess->mp_skip_entry))))
+			continue;
+
+		inflight = atomic_read(&sess->stats.inflight);
+
+		if (inflight < min_inflight) {
+			min_inflight = inflight;
+			min_path = sess;
+		}
+	}
+
+	/*
+	 * add the path to the skip list, so that next time we can get
+	 * a different one
+	 */
+	if (min_path)
+		list_add(raw_cpu_ptr(min_path->mp_skip_entry), &it->skip_list);
+
+	return min_path;
+}
+
+static inline void path_it_init(struct path_it *it, struct rtrs_clt *clt)
+{
+	INIT_LIST_HEAD(&it->skip_list);
+	it->clt = clt;
+	it->i = 0;
+
+	if (clt->mp_policy == MP_POLICY_RR)
+		it->next_path = get_next_path_rr;
+	else
+		it->next_path = get_next_path_min_inflight;
+}
+
+static inline void path_it_deinit(struct path_it *it)
+{
+	struct list_head *skip, *tmp;
+	/*
+	 * The skip_list is used only for the MIN_INFLIGHT policy.
+	 * We need to remove paths from it, so that next IO can insert
+	 * paths (->mp_skip_entry) into a skip_list again.
+	 */
+	list_for_each_safe(skip, tmp, &it->skip_list)
+		list_del_init(skip);
+}
+
+/**
+ * rtrs_clt_init_req() - Initialize an rtrs_clt_io_req holding information
+ * about an inflight IO.
+ * The user buffer holding user control message (not data) is copied into
+ * the corresponding buffer of rtrs_iu (req->iu->buf), which later on will
+ * also hold the control message of rtrs.
+ */
+static inline void rtrs_clt_init_req(struct rtrs_clt_io_req *req,
+				      struct rtrs_clt_sess *sess,
+				      rtrs_conf_fn *conf,
+				      struct rtrs_permit *permit, void *priv,
+				      const struct kvec *vec, size_t usr_len,
+				      struct scatterlist *sg, size_t sg_cnt,
+				      size_t data_len, int dir)
+{
+	struct iov_iter iter;
+	size_t len;
+
+	req->permit = permit;
+	req->in_use = true;
+	req->usr_len = usr_len;
+	req->data_len = data_len;
+	req->sglist = sg;
+	req->sg_cnt = sg_cnt;
+	req->priv = priv;
+	req->dir = dir;
+	req->con = rtrs_permit_to_clt_con(sess, permit);
+	req->conf = conf;
+	req->need_inv = false;
+	req->need_inv_comp = false;
+	req->inv_errno = 0;
+
+	iov_iter_kvec(&iter, READ, vec, 1, usr_len);
+	len = _copy_from_iter(req->iu->buf, usr_len, &iter);
+	WARN_ON(len != usr_len);
+
+	reinit_completion(&req->inv_comp);
+	if (sess->stats.enable_rdma_lat)
+		req->start_jiffies = jiffies;
+}
+
+static inline struct rtrs_clt_io_req *
+rtrs_clt_get_req(struct rtrs_clt_sess *sess, rtrs_conf_fn *conf,
+		  struct rtrs_permit *permit, void *priv,
+		  const struct kvec *vec, size_t usr_len,
+		  struct scatterlist *sg, size_t sg_cnt,
+		  size_t data_len, int dir)
+{
+	struct rtrs_clt_io_req *req;
+
+	req = &sess->reqs[permit->mem_id];
+	rtrs_clt_init_req(req, sess, conf, permit, priv, vec, usr_len,
+			   sg, sg_cnt, data_len, dir);
+	return req;
+}
+
+static inline struct rtrs_clt_io_req *
+rtrs_clt_get_copy_req(struct rtrs_clt_sess *alive_sess,
+		       struct rtrs_clt_io_req *fail_req)
+{
+	struct rtrs_clt_io_req *req;
+	struct kvec vec = {
+		.iov_base = fail_req->iu->buf,
+		.iov_len  = fail_req->usr_len
+	};
+
+	req = &alive_sess->reqs[fail_req->permit->mem_id];
+	rtrs_clt_init_req(req, alive_sess, fail_req->conf, fail_req->permit,
+			   fail_req->priv, &vec, fail_req->usr_len,
+			   fail_req->sglist, fail_req->sg_cnt,
+			   fail_req->data_len, fail_req->dir);
+	return req;
+}
+
+static int rtrs_post_rdma_write_sg(struct rtrs_clt_con *con,
+				    struct rtrs_clt_io_req *req,
+				    struct rtrs_rbuf *rbuf,
+				    u32 size, u32 imm)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct ib_sge *sge = req->sge;
+	enum ib_send_flags flags;
+	struct scatterlist *sg;
+	size_t num_sge;
+	int i;
+
+	for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+		sge[i].addr   = sg_dma_address(sg);
+		sge[i].length = sg_dma_len(sg);
+		sge[i].lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+	}
+	sge[i].addr   = req->iu->dma_addr;
+	sge[i].length = size;
+	sge[i].lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+
+	num_sge = 1 + req->sg_cnt;
+
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->io_cnt) % sess->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, req->iu->dma_addr,
+				      size, DMA_TO_DEVICE);
+
+	return rtrs_iu_post_rdma_write_imm(&con->c, req->iu, sge, num_sge,
+					    rbuf->rkey, rbuf->addr, imm,
+					    flags, NULL);
+}
+
+static int rtrs_clt_write_req(struct rtrs_clt_io_req *req)
+{
+	struct rtrs_clt_con *con = req->con;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_clt_sess *sess = to_clt_sess(s);
+	struct rtrs_msg_rdma_write *msg;
+
+	struct rtrs_rbuf *rbuf;
+	int ret, count = 0;
+	u32 imm, buf_id;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		rtrs_wrn(s, "Write request failed, size too big %zu > %d\n",
+			  tsize, sess->chunk_size);
+		return -EMSGSIZE;
+	}
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(sess->s.dev->ib_dev, req->sglist,
+				      req->sg_cnt, req->dir);
+		if (unlikely(!count)) {
+			rtrs_wrn(s, "Write request failed, map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put rtrs msg after sg and user message */
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(RTRS_MSG_WRITE);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	/* rtrs message on server side will be after user data and message */
+	imm = req->permit->mem_off + req->data_len + req->usr_len;
+	imm = rtrs_to_io_req_imm(imm);
+	buf_id = req->permit->mem_id;
+	req->sg_size = tsize;
+	rbuf = &sess->rbufs[buf_id];
+
+	/*
+	 * Update stats now: once the request has been sent successfully
+	 * it is no longer safe to touch it.
+	 */
+	rtrs_clt_update_all_stats(req, WRITE);
+
+	ret = rtrs_post_rdma_write_sg(req->con, req, rbuf,
+				       req->usr_len + sizeof(*msg),
+				       imm);
+	if (unlikely(ret)) {
+		rtrs_err(s, "Write request failed: %d\n", ret);
+		rtrs_clt_decrease_inflight(&sess->stats);
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(sess->s.dev->ib_dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+static int rtrs_map_sg_fr(struct rtrs_clt_io_req *req, size_t count)
+{
+	int nr;
+
+	/* Align the MR to a 4K page size to match the block virt boundary */
+	nr = ib_map_mr_sg(req->mr, req->sglist, count, NULL, SZ_4K);
+	if (unlikely(nr < req->sg_cnt)) {
+		if (nr < 0)
+			return nr;
+		return -EINVAL;
+	}
+	ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
+
+	return nr;
+}
+
+static int rtrs_clt_read_req(struct rtrs_clt_io_req *req)
+{
+	struct rtrs_clt_con *con = req->con;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_clt_sess *sess = to_clt_sess(s);
+	struct rtrs_msg_rdma_read *msg;
+	struct rtrs_ib_dev *dev;
+	struct scatterlist *sg;
+
+	struct ib_reg_wr rwr;
+	struct ib_send_wr *wr = NULL;
+
+	int i, ret, count = 0;
+	u32 imm, buf_id;
+
+	const size_t tsize = sizeof(*msg) + req->data_len + req->usr_len;
+
+	s = &sess->s;
+	dev = sess->s.dev;
+
+	if (unlikely(tsize > sess->chunk_size)) {
+		rtrs_wrn(s,
+			  "Read request failed, message size is %zu, bigger than CHUNK_SIZE %d\n",
+			  tsize, sess->chunk_size);
+		return -EMSGSIZE;
+	}
+
+	if (req->sg_cnt) {
+		count = ib_dma_map_sg(dev->ib_dev, req->sglist, req->sg_cnt,
+				      req->dir);
+		if (unlikely(!count)) {
+			rtrs_wrn(s,
+				  "Read request failed, dma map failed\n");
+			return -EINVAL;
+		}
+	}
+	/* put our message into req->iu->buf after user message */
+	msg = req->iu->buf + req->usr_len;
+	msg->type = cpu_to_le16(RTRS_MSG_READ);
+	msg->usr_len = cpu_to_le16(req->usr_len);
+
+	if (count > noreg_cnt) {
+		ret = rtrs_map_sg_fr(req, count);
+		if (ret < 0) {
+			rtrs_err_rl(s,
+				     "Read request failed, failed to map fast reg. data, err: %d\n",
+				     ret);
+			ib_dma_unmap_sg(dev->ib_dev, req->sglist, req->sg_cnt,
+					req->dir);
+			return ret;
+		}
+		memset(&rwr, 0, sizeof(rwr));
+		rwr.wr.next = NULL;
+		rwr.wr.opcode = IB_WR_REG_MR;
+		rwr.wr.wr_cqe = &fast_reg_cqe;
+		rwr.wr.num_sge = 0;
+		rwr.mr = req->mr;
+		rwr.key = req->mr->rkey;
+		rwr.access = (IB_ACCESS_LOCAL_WRITE |
+			      IB_ACCESS_REMOTE_WRITE);
+		wr = &rwr.wr;
+
+		msg->sg_cnt = cpu_to_le16(1);
+		msg->flags = cpu_to_le16(RTRS_MSG_NEED_INVAL_F);
+
+		msg->desc[0].addr = cpu_to_le64(req->mr->iova);
+		msg->desc[0].key = cpu_to_le32(req->mr->rkey);
+		msg->desc[0].len = cpu_to_le32(req->mr->length);
+
+		/* Further invalidation is required */
+		req->need_inv = !!RTRS_MSG_NEED_INVAL_F;
+
+	} else {
+		msg->sg_cnt = cpu_to_le16(count);
+		msg->flags = 0;
+
+		for_each_sg(req->sglist, sg, req->sg_cnt, i) {
+			msg->desc[i].addr = cpu_to_le64(sg_dma_address(sg));
+			msg->desc[i].key =
+				cpu_to_le32(dev->ib_pd->unsafe_global_rkey);
+			msg->desc[i].len = cpu_to_le32(sg_dma_len(sg));
+		}
+	}
+	/*
+	 * rtrs message will be after the space reserved for disk data and
+	 * user message
+	 */
+	imm = req->permit->mem_off + req->data_len + req->usr_len;
+	imm = rtrs_to_io_req_imm(imm);
+	buf_id = req->permit->mem_id;
+
+	req->sg_size  = sizeof(*msg);
+	req->sg_size += le16_to_cpu(msg->sg_cnt) * sizeof(struct rtrs_sg_desc);
+	req->sg_size += req->usr_len;
+
+	/*
+	 * Update stats now: once the request has been sent successfully
+	 * it is no longer safe to touch it.
+	 */
+	rtrs_clt_update_all_stats(req, READ);
+
+	ret = rtrs_post_send_rdma(req->con, req, &sess->rbufs[buf_id],
+				   req->data_len, imm, wr);
+	if (unlikely(ret)) {
+		rtrs_err(s, "Read request failed: %d\n", ret);
+		rtrs_clt_decrease_inflight(&sess->stats);
+		req->need_inv = false;
+		if (req->sg_cnt)
+			ib_dma_unmap_sg(dev->ib_dev, req->sglist,
+					req->sg_cnt, req->dir);
+	}
+
+	return ret;
+}
+
+/**
+ * rtrs_clt_failover_req() - Try to find an active path for a failed request
+ */
+static int rtrs_clt_failover_req(struct rtrs_clt *clt,
+				  struct rtrs_clt_io_req *fail_req)
+{
+	struct rtrs_clt_sess *alive_sess;
+	struct rtrs_clt_io_req *req;
+	int err = -ECONNABORTED;
+	struct path_it it;
+
+	do_each_path(alive_sess, clt, &it) {
+		if (unlikely(READ_ONCE(alive_sess->state) !=
+			     RTRS_CLT_CONNECTED))
+			continue;
+		req = rtrs_clt_get_copy_req(alive_sess, fail_req);
+		if (req->dir == DMA_TO_DEVICE)
+			err = rtrs_clt_write_req(req);
+		else
+			err = rtrs_clt_read_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		rtrs_clt_inc_failover_cnt(&alive_sess->stats);
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+
+static void fail_all_outstanding_reqs(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt *clt = sess->clt;
+	struct rtrs_clt_io_req *req;
+	int i, err;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		if (!req->in_use)
+			continue;
+
+		/*
+		 * Safely (without notification) complete failed request.
+		 * After completion this request is still usable and can
+		 * be failed over to another path.
+		 */
+		complete_rdma_req(req, -ECONNABORTED, false, true);
+
+		err = rtrs_clt_failover_req(clt, req);
+		if (unlikely(err))
+			/* Failover failed, notify anyway */
+			req->conf(req->priv, err);
+	}
+}
+
+static void free_sess_reqs(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt_io_req *req;
+	int i;
+
+	if (!sess->reqs)
+		return;
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		if (req->mr)
+			ib_dereg_mr(req->mr);
+		kfree(req->sge);
+		rtrs_iu_free(req->iu, DMA_TO_DEVICE,
+			      sess->s.dev->ib_dev, 1);
+	}
+	kfree(sess->reqs);
+	sess->reqs = NULL;
+}
+
+static int alloc_sess_reqs(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt_io_req *req;
+	struct rtrs_clt *clt = sess->clt;
+	int i, err = -ENOMEM;
+
+	sess->reqs = kcalloc(sess->queue_depth, sizeof(*sess->reqs),
+			     GFP_KERNEL);
+	if (unlikely(!sess->reqs))
+		return -ENOMEM;
+
+	for (i = 0; i < sess->queue_depth; ++i) {
+		req = &sess->reqs[i];
+		req->iu = rtrs_iu_alloc(1, sess->max_hdr_size, GFP_KERNEL,
+					 sess->s.dev->ib_dev,
+					 DMA_TO_DEVICE,
+					 rtrs_clt_rdma_done);
+		if (unlikely(!req->iu))
+			goto out;
+
+		req->sge = kmalloc_array(clt->max_segments + 1,
+					 sizeof(*req->sge), GFP_KERNEL);
+		if (unlikely(!req->sge))
+			goto out;
+
+		req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
+				      sess->max_pages_per_mr);
+		if (IS_ERR(req->mr)) {
+			err = PTR_ERR(req->mr);
+			req->mr = NULL;
+			pr_err("Failed to alloc MR, max_pages_per_mr %d\n",
+			       sess->max_pages_per_mr);
+			goto out;
+		}
+
+		init_completion(&req->inv_comp);
+	}
+
+	return 0;
+
+out:
+	free_sess_reqs(sess);
+
+	return err;
+}
+
+static int alloc_permits(struct rtrs_clt *clt)
+{
+	unsigned int chunk_bits;
+	int err, i;
+
+	clt->permits_map = kcalloc(BITS_TO_LONGS(clt->queue_depth),
+				   sizeof(long), GFP_KERNEL);
+	if (unlikely(!clt->permits_map)) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+	clt->permits = kcalloc(clt->queue_depth, PERMIT_SIZE(clt), GFP_KERNEL);
+	if (unlikely(!clt->permits)) {
+		err = -ENOMEM;
+		goto err_map;
+	}
+	chunk_bits = ilog2(clt->queue_depth - 1) + 1;
+	for (i = 0; i < clt->queue_depth; i++) {
+		struct rtrs_permit *permit;
+
+		permit = GET_PERMIT(clt, i);
+		permit->mem_id = i;
+		permit->mem_off = i << (MAX_IMM_PAYL_BITS - chunk_bits);
+	}
+
+	return 0;
+
+err_map:
+	kfree(clt->permits_map);
+	clt->permits_map = NULL;
+out_err:
+	return err;
+}
+
+static void free_permits(struct rtrs_clt *clt)
+{
+	kfree(clt->permits_map);
+	clt->permits_map = NULL;
+	kfree(clt->permits);
+	clt->permits = NULL;
+}
+
+static void query_fast_reg_mode(struct rtrs_clt_sess *sess)
+{
+	struct ib_device *ib_dev;
+	u64 max_pages_per_mr;
+	int mr_page_shift;
+
+	ib_dev = sess->s.dev->ib_dev;
+
+	/*
+	 * Use the smallest page size supported by the HCA, down to a
+	 * minimum of 4096 bytes. We're unlikely to build large sglists
+	 * out of smaller entries.
+	 */
+	mr_page_shift      = max(12, ffs(ib_dev->attrs.page_size_cap) - 1);
+	max_pages_per_mr   = ib_dev->attrs.max_mr_size;
+	do_div(max_pages_per_mr, (1ull << mr_page_shift));
+	sess->max_pages_per_mr =
+		min3(sess->max_pages_per_mr, (u32)max_pages_per_mr,
+		     ib_dev->attrs.max_fast_reg_page_list_len);
+	sess->max_send_sge = ib_dev->attrs.max_send_sge;
+}
+
+static bool rtrs_clt_change_state_get_old(struct rtrs_clt_sess *sess,
+					   enum rtrs_clt_state new_state,
+					   enum rtrs_clt_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_wq.lock);
+	*old_state = sess->state;
+	changed = __rtrs_clt_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_wq.lock);
+
+	return changed;
+}
+
+static bool rtrs_clt_change_state(struct rtrs_clt_sess *sess,
+				   enum rtrs_clt_state new_state)
+{
+	enum rtrs_clt_state old_state;
+
+	return rtrs_clt_change_state_get_old(sess, new_state, &old_state);
+}
+
+static void rtrs_clt_hb_err_handler(struct rtrs_con *c)
+{
+	struct rtrs_clt_con *con = container_of(c, typeof(*con), c);
+
+	rtrs_rdma_error_recovery(con);
+}
+
+static void rtrs_clt_init_hb(struct rtrs_clt_sess *sess)
+{
+	rtrs_init_hb(&sess->s, &io_comp_cqe,
+		      RTRS_HB_INTERVAL_MS,
+		      RTRS_HB_MISSED_MAX,
+		      rtrs_clt_hb_err_handler,
+		      rtrs_wq);
+}
+
+static void rtrs_clt_start_hb(struct rtrs_clt_sess *sess)
+{
+	rtrs_start_hb(&sess->s);
+}
+
+static void rtrs_clt_stop_hb(struct rtrs_clt_sess *sess)
+{
+	rtrs_stop_hb(&sess->s);
+}
+
+static void rtrs_clt_reconnect_work(struct work_struct *work);
+static void rtrs_clt_close_work(struct work_struct *work);
+
+static struct rtrs_clt_sess *alloc_sess(struct rtrs_clt *clt,
+					 const struct rtrs_addr *path,
+					 size_t con_num, u16 max_segments)
+{
+	struct rtrs_clt_sess *sess;
+	int err = -ENOMEM;
+	int cpu;
+
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	/* Extra connection for user messages */
+	con_num += 1;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_sess;
+
+	mutex_init(&sess->init_mutex);
+	uuid_gen(&sess->s.uuid);
+	memcpy(&sess->s.dst_addr, path->dst,
+	       rdma_addr_size((struct sockaddr *)path->dst));
+
+	/*
+	 * rdma_resolve_addr() passes src_addr to cma_bind_addr, which
+	 * checks the sa_family to be non-zero. If user passed src_addr=NULL
+	 * the sess->src_addr will contain only zeros, which is then fine.
+	 */
+	if (path->src)
+		memcpy(&sess->s.src_addr, path->src,
+		       rdma_addr_size((struct sockaddr *)path->src));
+	strlcpy(sess->s.sessname, clt->sessname, sizeof(sess->s.sessname));
+	sess->s.con_num = con_num;
+	sess->clt = clt;
+	sess->max_pages_per_mr = max_segments * BLK_MAX_SEGMENT_SIZE >> 12;
+	init_waitqueue_head(&sess->state_wq);
+	sess->state = RTRS_CLT_CONNECTING;
+	atomic_set(&sess->connected_cnt, 0);
+	INIT_WORK(&sess->close_work, rtrs_clt_close_work);
+	INIT_DELAYED_WORK(&sess->reconnect_dwork, rtrs_clt_reconnect_work);
+	rtrs_clt_init_hb(sess);
+
+	sess->mp_skip_entry = alloc_percpu(typeof(*sess->mp_skip_entry));
+	if (unlikely(!sess->mp_skip_entry))
+		goto err_free_con;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(per_cpu_ptr(sess->mp_skip_entry, cpu));
+
+	err = rtrs_clt_init_stats(&sess->stats);
+	if (unlikely(err))
+		goto err_free_percpu;
+
+	return sess;
+
+err_free_percpu:
+	free_percpu(sess->mp_skip_entry);
+err_free_con:
+	kfree(sess->s.con);
+err_free_sess:
+	kfree(sess);
+err:
+	return ERR_PTR(err);
+}
+
+static void free_sess(struct rtrs_clt_sess *sess)
+{
+	rtrs_clt_free_stats(&sess->stats);
+	free_percpu(sess->mp_skip_entry);
+	kfree(sess->s.con);
+	kfree(sess->rbufs);
+	kfree(sess);
+}
+
+static int create_con(struct rtrs_clt_sess *sess, unsigned int cid)
+{
+	struct rtrs_clt_con *con;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con))
+		return -ENOMEM;
+
+	/* Map first two connections to the first CPU */
+	con->cpu  = (cid ? cid - 1 : 0) % nr_cpu_ids;
+	con->c.cid = cid;
+	con->c.sess = &sess->s;
+	atomic_set(&con->io_cnt, 0);
+
+	sess->s.con[cid] = &con->c;
+
+	return 0;
+}
+
+static void destroy_con(struct rtrs_clt_con *con)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	sess->s.con[con->c.cid] = NULL;
+	kfree(con);
+}
+
+static int create_con_cq_qp(struct rtrs_clt_con *con)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	u16 wr_queue_size;
+	int err, cq_vector;
+	struct rtrs_msg_rkey_rsp *rsp;
+
+	/*
+	 * This function can fail, but destroy_con_cq_qp() should still
+	 * be called. This is because create_con_cq_qp() is called on the
+	 * cm event path, so the caller/waiter never knows whether we failed
+	 * before create_con_cq_qp() or after. To solve this dilemma without
+	 * creating any additional flags, just allow destroy_con_cq_qp() to
+	 * be called many times.
+	 */
+
+	if (con->c.cid == 0) {
+		/*
+		 * One completion for each receive and two for each send
+		 * (send request + registration)
+		 * + 2 for drain and heartbeat
+		 * in case qp gets into error state
+		 */
+		wr_queue_size = SERVICE_CON_QUEUE_DEPTH * 3 + 2;
+		/* We must be the first here */
+		if (WARN_ON(sess->s.dev))
+			return -EINVAL;
+
+		/*
+		 * The whole session uses device from user connection.
+		 * Be careful not to close user connection before ib dev
+		 * is gracefully put.
+		 */
+		sess->s.dev = rtrs_ib_dev_find_or_add(con->c.cm_id->device,
+						       &dev_pool);
+		if (unlikely(!sess->s.dev)) {
+			rtrs_wrn(sess->clt,
+				  "rtrs_ib_dev_find_or_add(): no memory\n");
+			return -ENOMEM;
+		}
+		sess->s.dev_ref = 1;
+		query_fast_reg_mode(sess);
+	} else {
+		/*
+		 * Here we assume that session members are correctly set.
+		 * This is always true if user connection (cid == 0) is
+		 * established first.
+		 */
+		if (WARN_ON(!sess->s.dev))
+			return -EINVAL;
+		if (WARN_ON(!sess->queue_depth))
+			return -EINVAL;
+
+		/* Shared between connections */
+		sess->s.dev_ref++;
+		wr_queue_size =
+			min_t(int, sess->s.dev->ib_dev->attrs.max_qp_wr,
+			      /* QD * (REQ + RSP + FR REGS or INVS) + drain */
+			      sess->queue_depth * 3 + 1);
+	}
+	/* alloc iu to recv new rkey reply when server reports flags set */
+	if (sess->flags == RTRS_MSG_NEW_RKEY_F || con->c.cid == 0) {
+		con->rsp_ius = rtrs_iu_alloc(wr_queue_size, sizeof(*rsp),
+					      GFP_KERNEL, sess->s.dev->ib_dev,
+					      DMA_FROM_DEVICE,
+					      rtrs_clt_rdma_done);
+		if (unlikely(!con->rsp_ius))
+			return -ENOMEM;
+		con->queue_size = wr_queue_size;
+	}
+	cq_vector = con->cpu % sess->s.dev->ib_dev->num_comp_vectors;
+	err = rtrs_cq_qp_create(&sess->s, &con->c, sess->max_send_sge,
+				 cq_vector, wr_queue_size, wr_queue_size,
+				 IB_POLL_SOFTIRQ);
+	/*
+	 * In case of error we do not bother to clean previous allocations,
+	 * since destroy_con_cq_qp() must be called.
+	 */
+
+	if (unlikely(err))
+		return err;
+	return err;
+}
+
+static void destroy_con_cq_qp(struct rtrs_clt_con *con)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	/*
+	 * Be careful here: destroy_con_cq_qp() can be called even if
+	 * create_con_cq_qp() failed, see comments there.
+	 */
+
+	rtrs_cq_qp_destroy(&con->c);
+	if (con->rsp_ius) {
+		rtrs_iu_free(con->rsp_ius, DMA_FROM_DEVICE,
+			      sess->s.dev->ib_dev, con->queue_size);
+		con->rsp_ius = NULL;
+		con->queue_size = 0;
+	}
+	if (sess->s.dev_ref && !--sess->s.dev_ref) {
+		rtrs_ib_dev_put(sess->s.dev);
+		sess->s.dev = NULL;
+	}
+}
+
+static void stop_cm(struct rtrs_clt_con *con)
+{
+	rdma_disconnect(con->c.cm_id);
+	if (con->c.qp)
+		ib_drain_qp(con->c.qp);
+}
+
+static void destroy_cm(struct rtrs_clt_con *con)
+{
+	rdma_destroy_id(con->c.cm_id);
+	con->c.cm_id = NULL;
+}
+
+static int rtrs_rdma_addr_resolved(struct rtrs_clt_con *con)
+{
+	struct rtrs_sess *s = con->c.sess;
+	int err;
+
+	err = create_con_cq_qp(con);
+	if (unlikely(err)) {
+		rtrs_err(s, "create_con_cq_qp(), err: %d\n", err);
+		return err;
+	}
+	err = rdma_resolve_route(con->c.cm_id, RTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		rtrs_err(s, "Resolving route failed, err: %d\n", err);
+		destroy_con_cq_qp(con);
+	}
+
+	return err;
+}
+
+static int rtrs_rdma_route_resolved(struct rtrs_clt_con *con)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rtrs_clt *clt = sess->clt;
+	struct rtrs_msg_conn_req msg;
+	struct rdma_conn_param param;
+
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.retry_count = clamp(retry_cnt, MIN_RTR_CNT, MAX_RTR_CNT);
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	/*
+	 * These two fields are part of struct cma_hdr, which is shared
+	 * with private_data in case of AF_IB, so zero them to avoid
+	 * wrong validation inside cma.c on the receiver side.
+	 */
+	msg.__cma_version = 0;
+	msg.__ip_version = 0;
+	msg.magic = cpu_to_le16(RTRS_MAGIC);
+	msg.version = cpu_to_le16(RTRS_PROTO_VER);
+	msg.cid = cpu_to_le16(con->c.cid);
+	msg.cid_num = cpu_to_le16(sess->s.con_num);
+	msg.recon_cnt = cpu_to_le16(sess->s.recon_cnt);
+	uuid_copy(&msg.sess_uuid, &sess->s.uuid);
+	uuid_copy(&msg.paths_uuid, &clt->paths_uuid);
+
+	err = rdma_connect(con->c.cm_id, &param);
+	if (err)
+		rtrs_err(clt, "rdma_connect(): %d\n", err);
+
+	return err;
+}
+
+static int rtrs_rdma_conn_established(struct rtrs_clt_con *con,
+				       struct rdma_cm_event *ev)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rtrs_clt *clt = sess->clt;
+	const struct rtrs_msg_conn_rsp *msg;
+	u16 version, queue_depth;
+	int errno;
+	u8 len;
+
+	msg = ev->param.conn.private_data;
+	len = ev->param.conn.private_data_len;
+	if (unlikely(len < sizeof(*msg))) {
+		rtrs_err(clt, "Invalid RTRS connection response\n");
+		return -ECONNRESET;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != RTRS_MAGIC)) {
+		rtrs_err(clt, "Invalid RTRS magic\n");
+		return -ECONNRESET;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != RTRS_PROTO_VER_MAJOR)) {
+		rtrs_err(clt, "Unsupported major RTRS version: %d, expected %d\n",
+			  version >> 8, RTRS_PROTO_VER_MAJOR);
+		return -ECONNRESET;
+	}
+	errno = le16_to_cpu(msg->errno);
+	if (unlikely(errno)) {
+		rtrs_err(clt, "Invalid RTRS message: errno %d\n",
+			  errno);
+		return -ECONNRESET;
+	}
+	if (con->c.cid == 0) {
+		queue_depth = le16_to_cpu(msg->queue_depth);
+
+		if (queue_depth > MAX_SESS_QUEUE_DEPTH) {
+			rtrs_err(clt, "Invalid RTRS message: queue=%d\n",
+				  queue_depth);
+			return -ECONNRESET;
+		}
+		if (!sess->rbufs || sess->queue_depth < queue_depth) {
+			kfree(sess->rbufs);
+			sess->rbufs = kcalloc(queue_depth, sizeof(*sess->rbufs),
+					      GFP_KERNEL);
+			if (unlikely(!sess->rbufs)) {
+				rtrs_err(clt,
+					  "Failed to allocate queue_depth=%d\n",
+					  queue_depth);
+				return -ENOMEM;
+			}
+		}
+		sess->queue_depth = queue_depth;
+		sess->max_hdr_size = le32_to_cpu(msg->max_hdr_size);
+		sess->max_io_size = le32_to_cpu(msg->max_io_size);
+		sess->flags = le32_to_cpu(msg->flags);
+		sess->chunk_size = sess->max_io_size + sess->max_hdr_size;
+
+		/*
+		 * Global queue depth and IO size are always a minimum.
+		 * If on reconnection the server sends a slightly higher
+		 * value, the client ignores it and uses the cached minimum.
+		 *
+		 * Since we can have several sessions (paths) re-establishing
+		 * connections in parallel, use a lock.
+		 */
+		mutex_lock(&clt->paths_mutex);
+		clt->queue_depth = min_not_zero(sess->queue_depth,
+						clt->queue_depth);
+		clt->max_io_size = min_not_zero(sess->max_io_size,
+						clt->max_io_size);
+		mutex_unlock(&clt->paths_mutex);
+
+		/*
+		 * Cache the hca_port and hca_name for sysfs
+		 */
+		sess->hca_port = con->c.cm_id->port_num;
+		strlcpy(sess->hca_name, sess->s.dev->ib_dev->name,
+			sizeof(sess->hca_name));
+		sess->s.src_addr = con->c.cm_id->route.addr.src_addr;
+	}
+
+	return 0;
+}
+
+static inline void flag_success_on_conn(struct rtrs_clt_con *con)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+
+	atomic_inc(&sess->connected_cnt);
+	con->cm_err = 1;
+}
+
+static int rtrs_rdma_conn_rejected(struct rtrs_clt_con *con,
+				    struct rdma_cm_event *ev)
+{
+	struct rtrs_sess *s = con->c.sess;
+	const struct rtrs_msg_conn_rsp *msg;
+	const char *rej_msg;
+	int status, errno;
+	u8 data_len;
+
+	status = ev->status;
+	rej_msg = rdma_reject_msg(con->c.cm_id, status);
+	msg = rdma_consumer_reject_data(con->c.cm_id, ev, &data_len);
+
+	if (msg && data_len >= sizeof(*msg)) {
+		errno = (int16_t)le16_to_cpu(msg->errno);
+		if (errno == -EBUSY)
+			rtrs_err(s,
+				  "Previous session still exists on the server, please reconnect later\n");
+		else
+			rtrs_err(s,
+				  "Connect rejected: status %d (%s), rtrs errno %d\n",
+				  status, rej_msg, errno);
+	} else {
+		rtrs_err(s,
+			  "Connect rejected but with malformed message: status %d (%s)\n",
+			  status, rej_msg);
+	}
+
+	return -ECONNRESET;
+}
+
+static void rtrs_clt_close_conns(struct rtrs_clt_sess *sess, bool wait)
+{
+	if (rtrs_clt_change_state(sess, RTRS_CLT_CLOSING))
+		queue_work(rtrs_wq, &sess->close_work);
+	if (wait)
+		flush_work(&sess->close_work);
+}
+
+static inline void flag_error_on_conn(struct rtrs_clt_con *con, int cm_err)
+{
+	if (con->cm_err == 1) {
+		struct rtrs_clt_sess *sess;
+
+		sess = to_clt_sess(con->c.sess);
+		if (atomic_dec_and_test(&sess->connected_cnt))
+			wake_up(&sess->state_wq);
+	}
+	con->cm_err = cm_err;
+}
+
+static int rtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct rtrs_clt_con *con = cm_id->context;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_clt_sess *sess = to_clt_sess(s);
+	int cm_err = 0;
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		cm_err = rtrs_rdma_addr_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ROUTE_RESOLVED:
+		cm_err = rtrs_rdma_route_resolved(con);
+		break;
+	case RDMA_CM_EVENT_ESTABLISHED:
+		con->cm_err = rtrs_rdma_conn_established(con, ev);
+		if (likely(!con->cm_err)) {
+			/*
+			 * Report success and wake up. Here we abuse state_wq,
+			 * i.e. wake up without state change, but we set cm_err.
+			 */
+			flag_success_on_conn(con);
+			wake_up(&sess->state_wq);
+			return 0;
+		}
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+		cm_err = rtrs_rdma_conn_rejected(con, ev);
+		break;
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		rtrs_wrn(s, "CM error event %d\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_ADDR_ERROR:
+	case RDMA_CM_EVENT_ROUTE_ERROR:
+		cm_err = -EHOSTUNREACH;
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		cm_err = -ECONNRESET;
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		/*
+		 * Device removal is a special case.  Queue close and return 0.
+		 */
+		rtrs_clt_close_conns(sess, false);
+		return 0;
+	default:
+		rtrs_err(s, "Unexpected RDMA CM event (%d)\n", ev->event);
+		cm_err = -ECONNRESET;
+		break;
+	}
+
+	if (cm_err) {
+		/*
+		 * A cm error makes sense only while a connection is being
+		 * established; in other cases we rely on normal reconnection.
+		 */
+		flag_error_on_conn(con, cm_err);
+		rtrs_rdma_error_recovery(con);
+	}
+
+	return 0;
+}
+
+static int create_cm(struct rtrs_clt_con *con)
+{
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_clt_sess *sess = to_clt_sess(s);
+	struct rdma_cm_id *cm_id;
+	int err;
+
+	cm_id = rdma_create_id(&init_net, rtrs_clt_rdma_cm_handler, con,
+			       sess->s.dst_addr.ss_family == AF_IB ?
+			       RDMA_PS_IB : RDMA_PS_TCP, IB_QPT_RC);
+	if (IS_ERR(cm_id)) {
+		err = PTR_ERR(cm_id);
+		rtrs_err(s, "Failed to create CM ID, err: %d\n", err);
+
+		return err;
+	}
+	con->c.cm_id = cm_id;
+	con->cm_err = 0;
+	/* allow the port to be reused */
+	err = rdma_set_reuseaddr(cm_id, 1);
+	if (err != 0) {
+		rtrs_err(s, "Set address reuse failed, err: %d\n", err);
+		goto destroy_cm;
+	}
+	err = rdma_resolve_addr(cm_id, (struct sockaddr *)&sess->s.src_addr,
+				(struct sockaddr *)&sess->s.dst_addr,
+				RTRS_CONNECT_TIMEOUT_MS);
+	if (unlikely(err)) {
+		rtrs_err(s, "Failed to resolve address, err: %d\n", err);
+		goto destroy_cm;
+	}
+	/*
+	 * Combine connection status and session events. This is needed
+	 * to wait for either of two cases: cm_err holds something meaningful
+	 * or session state was really changed to error by device removal.
+	 */
+	err = wait_event_interruptible_timeout(
+			sess->state_wq,
+			con->cm_err || sess->state != RTRS_CLT_CONNECTING,
+			msecs_to_jiffies(RTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(err == 0 || err == -ERESTARTSYS)) {
+		if (err == 0)
+			err = -ETIMEDOUT;
+		/* Timed out or interrupted */
+		goto errr;
+	}
+	if (unlikely(con->cm_err < 0)) {
+		err = con->cm_err;
+		goto errr;
+	}
+	if (unlikely(READ_ONCE(sess->state) != RTRS_CLT_CONNECTING)) {
+		/* Device removal */
+		err = -ECONNABORTED;
+		goto errr;
+	}
+
+	return 0;
+
+errr:
+	stop_cm(con);
+	/* It is safe to call destroy even if cq_qp is not initialized */
+	destroy_con_cq_qp(con);
+destroy_cm:
+	destroy_cm(con);
+
+	return err;
+}
+
+static void rtrs_clt_sess_up(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt *clt = sess->clt;
+	int up;
+
+	/*
+	 * We can fire RECONNECTED event only when all paths were
+	 * connected on rtrs_clt_open(), then each was disconnected
+	 * and the first one connected again.  That's why this nasty
+	 * game with counter value.
+	 */
+
+	mutex_lock(&clt->paths_ev_mutex);
+	up = ++clt->paths_up;
+	/*
+	 * Here it is safe to access paths num directly since up counter
+	 * is greater than MAX_PATHS_NUM only while rtrs_clt_open() is
+	 * in progress, thus paths removals are impossible.
+	 */
+	if (up > MAX_PATHS_NUM && up == MAX_PATHS_NUM + clt->paths_num)
+		clt->paths_up = clt->paths_num;
+	else if (up == 1)
+		clt->link_ev(clt->priv, RTRS_CLT_LINK_EV_RECONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+	sess->reconnect_attempts = 0;
+	sess->stats.reconnects.successful_cnt++;
+}
+
+static void rtrs_clt_sess_down(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt *clt = sess->clt;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&clt->paths_ev_mutex);
+	WARN_ON(!clt->paths_up);
+	if (--clt->paths_up == 0)
+		clt->link_ev(clt->priv, RTRS_CLT_LINK_EV_DISCONNECTED);
+	mutex_unlock(&clt->paths_ev_mutex);
+}
+
+static void rtrs_clt_stop_and_destroy_conns(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt_con *con;
+	unsigned int cid;
+
+	WARN_ON(READ_ONCE(sess->state) == RTRS_CLT_CONNECTED);
+
+	/*
+	 * Possible race with rtrs_clt_open(), when DEVICE_REMOVAL comes
+	 * exactly in between.  Start destroying after it finishes.
+	 */
+	mutex_lock(&sess->init_mutex);
+	mutex_unlock(&sess->init_mutex);
+
+	/*
+	 * All IO paths must observe !CONNECTED state before we
+	 * free everything.
+	 */
+	synchronize_rcu();
+
+	rtrs_clt_stop_hb(sess);
+
+	/*
+	 * The order is utterly crucial: first disconnect and complete all
+	 * rdma requests with error (thus set in_use=false for requests),
+	 * then fail outstanding requests checking in_use for each, and
+	 * eventually notify upper layer about session disconnection.
+	 */
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (!sess->s.con[cid])
+			break;
+		con = to_clt_con(sess->s.con[cid]);
+		stop_cm(con);
+	}
+	fail_all_outstanding_reqs(sess);
+	free_sess_reqs(sess);
+	rtrs_clt_sess_down(sess);
+
+	/*
+	 * Wait for graceful shutdown, namely when peer side invokes
+	 * rdma_disconnect(). 'connected_cnt' is decremented only on
+	 * CM events, thus if the other side has crashed and hb has detected
+	 * that something is wrong, we will be stuck here for exactly timeout
+	 * ms, since CM does not fire anything.  That is fine, we are not in
+	 * a hurry.
+	 */
+	wait_event_timeout(sess->state_wq, !atomic_read(&sess->connected_cnt),
+			   msecs_to_jiffies(RTRS_CONNECT_TIMEOUT_MS));
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (!sess->s.con[cid])
+			break;
+		con = to_clt_con(sess->s.con[cid]);
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+}
+
+static inline bool xchg_sessions(struct rtrs_clt_sess __rcu **rcu_ppcpu_path,
+				 struct rtrs_clt_sess *sess,
+				 struct rtrs_clt_sess *next)
+{
+	struct rtrs_clt_sess **ppcpu_path;
+
+	/* Call cmpxchg() without sparse warnings */
+	ppcpu_path = (typeof(ppcpu_path))rcu_ppcpu_path;
+	return (sess == cmpxchg(ppcpu_path, sess, next));
+}
+
+static void rtrs_clt_remove_path_from_arr(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt *clt = sess->clt;
+	struct rtrs_clt_sess *next;
+	bool wait_for_grace = false;
+	int cpu;
+
+	mutex_lock(&clt->paths_mutex);
+	list_del_rcu(&sess->s.entry);
+
+	/* Make sure everybody observes path removal. */
+	synchronize_rcu();
+
+	/*
+	 * At this point nobody sees @sess in the list, but still we have
+	 * dangling pointer @pcpu_path which _can_ point to @sess.  Since
+	 * nobody can observe @sess in the list, we guarantee that IO path
+	 * will not assign @sess to @pcpu_path, i.e. @pcpu_path can be equal
+	 * to @sess, but can never again become @sess.
+	 */
+
+	/*
+	 * Decrement paths number only after grace period, because
+	 * caller of do_each_path() must first observe the list without
+	 * the path and only then the decremented paths number.
+	 *
+	 * Otherwise there can be the following situation:
+	 *    o Two paths exist and IO is coming.
+	 *    o One path is removed:
+	 *      CPU#0                          CPU#1
+	 *      do_each_path():                rtrs_clt_remove_path_from_arr():
+	 *          path = get_next_path()
+	 *          ^^^                            list_del_rcu(path)
+	 *          [!CONNECTED path]              clt->paths_num--
+	 *                                              ^^^^^^^^^
+	 *          load clt->paths_num                 from 2 to 1
+	 *                    ^^^^^^^^^
+	 *                    sees 1
+	 *
+	 *      path is observed as !CONNECTED, but do_each_path() loop
+	 *      ends, because expression i < clt->paths_num is false.
+	 */
+	clt->paths_num--;
+
+	/*
+	 * Get @next connection from current @sess which is going to be
+	 * removed.  If @sess is the last element, then @next is NULL.
+	 */
+	next = list_next_or_null_rr_rcu(&clt->paths_list, &sess->s.entry,
+					typeof(*next), s.entry);
+
+	/*
+	 * @pcpu paths can still point to the path which is going to be
+	 * removed, so change the pointer manually.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct rtrs_clt_sess __rcu **ppcpu_path;
+
+		ppcpu_path = per_cpu_ptr(clt->pcpu_path, cpu);
+		if (rcu_dereference(*ppcpu_path) != sess)
+			/*
+			 * synchronize_rcu() was called just after deleting
+			 * entry from the list, thus IO code path cannot
+			 * change pointer back to the pointer which is going
+			 * to be removed, we are safe here.
+			 */
+			continue;
+
+		/*
+		 * We race with IO code path, which also changes pointer,
+		 * thus we have to be careful not to overwrite it.
+		 */
+		if (xchg_sessions(ppcpu_path, sess, next))
+			/*
+			 * @ppcpu_path was successfully replaced with @next,
+			 * which means that someone could also have picked up
+			 * @sess and be dereferencing it right now, so waiting
+			 * for a grace period is required.
+			 */
+			wait_for_grace = true;
+	}
+	if (wait_for_grace)
+		synchronize_rcu();
+
+	mutex_unlock(&clt->paths_mutex);
+}
+
+static void rtrs_clt_add_path_to_arr(struct rtrs_clt_sess *sess,
+				      struct rtrs_addr *addr)
+{
+	struct rtrs_clt *clt = sess->clt;
+
+	mutex_lock(&clt->paths_mutex);
+	clt->paths_num++;
+
+	/*
+	 * First increase paths_num, wait for GP and then
+	 * add path to the list.  Why?  Since we add the path in
+	 * !CONNECTED state, the explanation is similar to what has
+	 * been written in rtrs_clt_remove_path_from_arr().
+	 */
+	synchronize_rcu();
+
+	list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+	mutex_unlock(&clt->paths_mutex);
+}
+
+static void rtrs_clt_close_work(struct work_struct *work)
+{
+	struct rtrs_clt_sess *sess;
+
+	sess = container_of(work, struct rtrs_clt_sess, close_work);
+
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	rtrs_clt_stop_and_destroy_conns(sess);
+	/*
+	 * Sounds stupid, huh?  No, it is not.  Consider this sequence:
+	 *
+	 *   #CPU0                              #CPU1
+	 *   1.  CONNECTED->RECONNECTING
+	 *   2.                                 RECONNECTING->CLOSING
+	 *   3.  queue_work(&reconnect_dwork)
+	 *   4.                                 queue_work(&close_work);
+	 *   5.  reconnect_work();              close_work();
+	 *
+	 * To avoid that case do cancel twice: before and after.
+	 */
+	cancel_delayed_work_sync(&sess->reconnect_dwork);
+	rtrs_clt_change_state(sess, RTRS_CLT_CLOSED);
+}
+
+static int init_conns(struct rtrs_clt_sess *sess)
+{
+	unsigned int cid;
+	int err;
+
+	/*
+	 * On every new set of session connections increase the reconnect
+	 * counter to avoid clashes with previous sessions not yet closed
+	 * on the server side.
+	 */
+	sess->s.recon_cnt++;
+
+	/* Establish all RDMA connections */
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		err = create_con(sess, cid);
+		if (unlikely(err))
+			goto destroy;
+
+		err = create_cm(to_clt_con(sess->s.con[cid]));
+		if (unlikely(err)) {
+			destroy_con(to_clt_con(sess->s.con[cid]));
+			goto destroy;
+		}
+	}
+	err = alloc_sess_reqs(sess);
+	if (unlikely(err))
+		goto destroy;
+
+	rtrs_clt_start_hb(sess);
+
+	return 0;
+
+destroy:
+	while (cid--) {
+		struct rtrs_clt_con *con = to_clt_con(sess->s.con[cid]);
+
+		stop_cm(con);
+		destroy_con_cq_qp(con);
+		destroy_cm(con);
+		destroy_con(con);
+	}
+	/*
+	 * If we've never taken async path and got an error, say,
+	 * doing rdma_resolve_addr(), switch to CONNECTING_ERR state
+	 * manually to keep reconnecting.
+	 */
+	rtrs_clt_change_state(sess, RTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+static void rtrs_clt_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_clt_con *con = cq->cq_context;
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct rtrs_iu, cqe);
+	rtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.dev->ib_dev, 1);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(sess->clt, "Sess info request send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		rtrs_clt_change_state(sess, RTRS_CLT_CONNECTING_ERR);
+		return;
+	}
+
+	rtrs_clt_update_wc_stats(con);
+}
+
+static int process_info_rsp(struct rtrs_clt_sess *sess,
+			    const struct rtrs_msg_info_rsp *msg)
+{
+	unsigned int sg_cnt, total_len;
+	int i, sgi;
+
+	sg_cnt = le16_to_cpu(msg->sg_cnt);
+	if (unlikely(!sg_cnt))
+		return -EINVAL;
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and
+	 * the offset inside the memory chunk.
+	 */
+	if (unlikely((ilog2(sg_cnt - 1) + 1) +
+		     (ilog2(sess->chunk_size - 1) + 1) >
+		     MAX_IMM_PAYL_BITS)) {
+		rtrs_err(sess->clt,
+			  "RDMA immediate size (%db) not enough to encode %d buffers of size %dB\n",
+			  MAX_IMM_PAYL_BITS, sg_cnt, sess->chunk_size);
+		return -EINVAL;
+	}
+	if (unlikely(!sg_cnt || (sess->queue_depth % sg_cnt))) {
+		rtrs_err(sess->clt, "sg_cnt %d does not divide queue_depth\n",
+			  sg_cnt);
+		return -EINVAL;
+	}
+	total_len = 0;
+	for (sgi = 0, i = 0; sgi < sg_cnt && i < sess->queue_depth; sgi++) {
+		const struct rtrs_sg_desc *desc = &msg->desc[sgi];
+		u32 len, rkey;
+		u64 addr;
+
+		addr = le64_to_cpu(desc->addr);
+		rkey = le32_to_cpu(desc->key);
+		len  = le32_to_cpu(desc->len);
+
+		total_len += len;
+
+		if (unlikely(!len || (len % sess->chunk_size))) {
+			rtrs_err(sess->clt, "Incorrect [%d].len %d\n", sgi,
+				  len);
+			return -EINVAL;
+		}
+		for ( ; len && i < sess->queue_depth; i++) {
+			sess->rbufs[i].addr = addr;
+			sess->rbufs[i].rkey = rkey;
+
+			len  -= sess->chunk_size;
+			addr += sess->chunk_size;
+		}
+	}
+	/* Sanity check */
+	if (unlikely(sgi != sg_cnt || i != sess->queue_depth)) {
+		rtrs_err(sess->clt, "Incorrect sg vector, not fully mapped\n");
+		return -EINVAL;
+	}
+	if (unlikely(total_len != sess->chunk_size * sess->queue_depth)) {
+		rtrs_err(sess->clt, "Incorrect total_len %d\n", total_len);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void rtrs_clt_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_clt_con *con = cq->cq_context;
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rtrs_msg_info_rsp *msg;
+	enum rtrs_clt_state state;
+	struct rtrs_iu *iu;
+	size_t rx_sz;
+	int err;
+
+	state = RTRS_CLT_CONNECTING_ERR;
+
+	WARN_ON(con->c.cid);
+	iu = container_of(wc->wr_cqe, struct rtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(sess->clt, "Sess info response recv failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto out;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		rtrs_err(sess->clt, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, iu->dma_addr,
+				   iu->size, DMA_FROM_DEVICE);
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != RTRS_MSG_INFO_RSP)) {
+		rtrs_err(sess->clt, "Sess info response is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto out;
+	}
+	rx_sz  = sizeof(*msg);
+	rx_sz += sizeof(msg->desc[0]) * le16_to_cpu(msg->sg_cnt);
+	if (unlikely(wc->byte_len < rx_sz)) {
+		rtrs_err(sess->clt, "Sess info response is malformed: size %d\n",
+			  wc->byte_len);
+		goto out;
+	}
+	err = process_info_rsp(sess, msg);
+	if (unlikely(err))
+		goto out;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err))
+		goto out;
+
+	state = RTRS_CLT_CONNECTED;
+
+out:
+	rtrs_clt_update_wc_stats(con);
+	rtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev, 1);
+	rtrs_clt_change_state(sess, state);
+}
+
+static int rtrs_send_sess_info(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt_con *usr_con = to_clt_con(sess->s.con[0]);
+	struct rtrs_msg_info_req *msg;
+	struct rtrs_iu *tx_iu, *rx_iu;
+	size_t rx_sz;
+	int err;
+
+	rx_sz  = sizeof(struct rtrs_msg_info_rsp);
+	rx_sz += sizeof(u64) * MAX_SESS_QUEUE_DEPTH;
+
+	tx_iu = rtrs_iu_alloc(1, sizeof(struct rtrs_msg_info_req), GFP_KERNEL,
+			       sess->s.dev->ib_dev, DMA_TO_DEVICE,
+			       rtrs_clt_info_req_done);
+	rx_iu = rtrs_iu_alloc(1, rx_sz, GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_FROM_DEVICE, rtrs_clt_info_rsp_done);
+	if (unlikely(!tx_iu || !rx_iu)) {
+		rtrs_err(sess->clt, "rtrs_iu_alloc(): no memory\n");
+		err = -ENOMEM;
+		goto out;
+	}
+	/* Prepare for getting info response */
+	err = rtrs_iu_post_recv(&usr_con->c, rx_iu);
+	if (unlikely(err)) {
+		rtrs_err(sess->clt, "rtrs_iu_post_recv(), err: %d\n", err);
+		goto out;
+	}
+	rx_iu = NULL;
+
+	msg = tx_iu->buf;
+	msg->type = cpu_to_le16(RTRS_MSG_INFO_REQ);
+	memcpy(msg->sessname, sess->s.sessname, sizeof(msg->sessname));
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, tx_iu->dma_addr,
+				      tx_iu->size, DMA_TO_DEVICE);
+
+	/* Send info request */
+	err = rtrs_iu_post_send(&usr_con->c, tx_iu, sizeof(*msg), NULL);
+	if (unlikely(err)) {
+		rtrs_err(sess->clt, "rtrs_iu_post_send(), err: %d\n", err);
+		goto out;
+	}
+	tx_iu = NULL;
+
+	/* Wait for state change */
+	wait_event_interruptible_timeout(sess->state_wq,
+					 sess->state != RTRS_CLT_CONNECTING,
+					 msecs_to_jiffies(
+						 RTRS_CONNECT_TIMEOUT_MS));
+	if (unlikely(READ_ONCE(sess->state) != RTRS_CLT_CONNECTED)) {
+		if (READ_ONCE(sess->state) == RTRS_CLT_CONNECTING_ERR)
+			err = -ECONNRESET;
+		else
+			err = -ETIMEDOUT;
+		goto out;
+	}
+
+out:
+	if (tx_iu)
+		rtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.dev->ib_dev, 1);
+	if (rx_iu)
+		rtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev, 1);
+	if (unlikely(err))
+		/* If we've never taken async path because of malloc problems */
+		rtrs_clt_change_state(sess, RTRS_CLT_CONNECTING_ERR);
+
+	return err;
+}
+
+/**
+ * init_sess() - establishes all session connections and does handshake
+ *
+ * In case of error, a full close or reconnect procedure should be taken,
+ * because async reconnect or close works may already have been started.
+ */
+static int init_sess(struct rtrs_clt_sess *sess)
+{
+	int err;
+
+	mutex_lock(&sess->init_mutex);
+	err = init_conns(sess);
+	if (unlikely(err)) {
+		rtrs_err(sess->clt, "init_conns(), err: %d\n", err);
+		goto out;
+	}
+	err = rtrs_send_sess_info(sess);
+	if (unlikely(err)) {
+		rtrs_err(sess->clt, "rtrs_send_sess_info(), err: %d\n", err);
+		goto out;
+	}
+	rtrs_clt_sess_up(sess);
+out:
+	mutex_unlock(&sess->init_mutex);
+
+	return err;
+}
+
+static void rtrs_clt_reconnect_work(struct work_struct *work)
+{
+	struct rtrs_clt_sess *sess;
+	struct rtrs_clt *clt;
+	unsigned int delay_ms;
+	int err;
+
+	sess = container_of(to_delayed_work(work), struct rtrs_clt_sess,
+			    reconnect_dwork);
+	clt = sess->clt;
+
+	if (READ_ONCE(sess->state) == RTRS_CLT_CLOSING)
+		/* User requested closing */
+		return;
+
+	if (sess->reconnect_attempts >= clt->max_reconnect_attempts) {
+		/* Close a session completely if max attempts is reached */
+		rtrs_clt_close_conns(sess, false);
+		return;
+	}
+	sess->reconnect_attempts++;
+
+	/* Stop everything */
+	rtrs_clt_stop_and_destroy_conns(sess);
+	rtrs_clt_change_state(sess, RTRS_CLT_CONNECTING);
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto reconnect_again;
+
+	return;
+
+reconnect_again:
+	if (rtrs_clt_change_state(sess, RTRS_CLT_RECONNECTING)) {
+		sess->stats.reconnects.fail_cnt++;
+		delay_ms = clt->reconnect_delay_sec * 1000;
+		queue_delayed_work(rtrs_wq, &sess->reconnect_dwork,
+				   msecs_to_jiffies(delay_ms));
+	}
+}
+
+static void rtrs_clt_dev_release(struct device *dev)
+{
+	struct rtrs_clt *clt  = container_of(dev, struct rtrs_clt, dev);
+
+	kfree(clt);
+}
+
+static struct rtrs_clt *alloc_clt(const char *sessname, size_t paths_num,
+				   short port, size_t pdu_sz,
+				   void *priv, link_clt_ev_fn *link_ev,
+				   unsigned int max_segments,
+				   unsigned int reconnect_delay_sec,
+				   unsigned int max_reconnect_attempts)
+{
+	struct rtrs_clt *clt;
+	int err;
+
+	if (unlikely(!paths_num || paths_num > MAX_PATHS_NUM))
+		return ERR_PTR(-EINVAL);
+
+	if (unlikely(strlen(sessname) >= sizeof(clt->sessname)))
+		return ERR_PTR(-EINVAL);
+
+	clt = kzalloc(sizeof(*clt), GFP_KERNEL);
+	if (unlikely(!clt))
+		return ERR_PTR(-ENOMEM);
+
+	clt->pcpu_path = alloc_percpu(typeof(*clt->pcpu_path));
+	if (unlikely(!clt->pcpu_path)) {
+		kfree(clt);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	uuid_gen(&clt->paths_uuid);
+	INIT_LIST_HEAD_RCU(&clt->paths_list);
+	clt->paths_num = paths_num;
+	clt->paths_up = MAX_PATHS_NUM;
+	clt->port = port;
+	clt->pdu_sz = pdu_sz;
+	clt->max_segments = max_segments;
+	clt->reconnect_delay_sec = reconnect_delay_sec;
+	clt->max_reconnect_attempts = max_reconnect_attempts;
+	clt->priv = priv;
+	clt->link_ev = link_ev;
+	clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	strlcpy(clt->sessname, sessname, sizeof(clt->sessname));
+	init_waitqueue_head(&clt->permits_wait);
+	mutex_init(&clt->paths_ev_mutex);
+	mutex_init(&clt->paths_mutex);
+
+	clt->dev.class = rtrs_dev_class;
+	clt->dev.release = rtrs_clt_dev_release;
+	dev_set_name(&clt->dev, "%s", sessname);
+
+	err = device_register(&clt->dev);
+	if (unlikely(err))
+		goto percpu_free;
+
+	err = rtrs_clt_create_sysfs_root_folders(clt);
+	if (unlikely(err))
+		goto dev_unregister;
+
+	return clt;
+
+dev_unregister:
+	device_del(&clt->dev);
+percpu_free:
+	free_percpu(clt->pcpu_path);
+	put_device(&clt->dev); /* release callback frees clt */
+	return ERR_PTR(err);
+}
+
+static void wait_for_inflight_permits(struct rtrs_clt *clt)
+{
+	if (clt->permits_map) {
+		size_t sz = clt->queue_depth;
+
+		wait_event(clt->permits_wait,
+			   find_first_bit(clt->permits_map, sz) >= sz);
+	}
+}
+
+static void free_clt(struct rtrs_clt *clt)
+{
+	rtrs_clt_destroy_sysfs_root_folders(clt);
+	wait_for_inflight_permits(clt);
+	free_permits(clt);
+	free_percpu(clt->pcpu_path);
+	/* release callback will free clt in last put */
+	device_unregister(&clt->dev);
+}
+
+struct rtrs_clt *rtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+				 const char *sessname,
+				 const struct rtrs_addr *paths,
+				 size_t paths_num,
+				 short port,
+				 size_t pdu_sz, u8 reconnect_delay_sec,
+				 u16 max_segments,
+				 s16 max_reconnect_attempts)
+{
+	struct rtrs_clt_sess *sess, *tmp;
+	struct rtrs_clt *clt;
+	int err, i;
+
+	clt = alloc_clt(sessname, paths_num, port, pdu_sz, priv, link_ev,
+			max_segments, reconnect_delay_sec,
+			max_reconnect_attempts);
+	if (IS_ERR(clt)) {
+		err = PTR_ERR(clt);
+		goto out;
+	}
+	for (i = 0; i < paths_num; i++) {
+		struct rtrs_clt_sess *sess;
+
+		sess = alloc_sess(clt, &paths[i], nr_cons_per_session,
+				  max_segments);
+		if (IS_ERR(sess)) {
+			err = PTR_ERR(sess);
+			rtrs_err(clt, "alloc_sess(), err: %d\n", err);
+			goto close_all_sess;
+		}
+		list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
+
+		err = init_sess(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+
+		err = rtrs_clt_create_sess_files(sess);
+		if (unlikely(err))
+			goto close_all_sess;
+	}
+	err = alloc_permits(clt);
+	if (unlikely(err)) {
+		rtrs_err(clt, "alloc_permits(), err: %d\n", err);
+		goto close_all_sess;
+	}
+	err = rtrs_clt_create_sysfs_root_files(clt);
+	if (unlikely(err))
+		goto close_all_sess;
+
+	/*
+	 * There is a race if someone decides to completely remove a
+	 * newly created path via its sysfs entry.  To avoid the race we
+	 * use a simple 'opened' flag, see rtrs_clt_remove_path_from_sysfs().
+	 */
+	clt->opened = true;
+
+	/* Do not let module be unloaded if client is alive */
+	__module_get(THIS_MODULE);
+
+	return clt;
+
+close_all_sess:
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		rtrs_clt_destroy_sess_files(sess, NULL);
+		rtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+
+out:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(rtrs_clt_open);
+
+void rtrs_clt_close(struct rtrs_clt *clt)
+{
+	struct rtrs_clt_sess *sess, *tmp;
+
+	/* Firstly forbid sysfs access */
+	rtrs_clt_destroy_sysfs_root_files(clt);
+	rtrs_clt_destroy_sysfs_root_folders(clt);
+
+	/* Now it is safe to iterate over all paths without locks */
+	list_for_each_entry_safe(sess, tmp, &clt->paths_list, s.entry) {
+		rtrs_clt_destroy_sess_files(sess, NULL);
+		rtrs_clt_close_conns(sess, true);
+		free_sess(sess);
+	}
+	free_clt(clt);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(rtrs_clt_close);
+
+int rtrs_clt_reconnect_from_sysfs(struct rtrs_clt_sess *sess)
+{
+	enum rtrs_clt_state old_state;
+	int err = -EBUSY;
+	bool changed;
+
+	changed = rtrs_clt_change_state_get_old(sess, RTRS_CLT_RECONNECTING,
+						 &old_state);
+	if (changed) {
+		sess->reconnect_attempts = 0;
+		queue_delayed_work(rtrs_wq, &sess->reconnect_dwork, 0);
+	}
+	if (changed || old_state == RTRS_CLT_RECONNECTING) {
+		/*
+		 * flush_delayed_work() queues pending work for immediate
+		 * execution, so do the flush if we have queued something
+		 * right now or work is pending.
+		 */
+		flush_delayed_work(&sess->reconnect_dwork);
+		err = (READ_ONCE(sess->state) ==
+		       RTRS_CLT_CONNECTED ? 0 : -ENOTCONN);
+	}
+
+	return err;
+}
+
+int rtrs_clt_disconnect_from_sysfs(struct rtrs_clt_sess *sess)
+{
+	rtrs_clt_close_conns(sess, true);
+
+	return 0;
+}
+
+int rtrs_clt_remove_path_from_sysfs(struct rtrs_clt_sess *sess,
+				     const struct attribute *sysfs_self)
+{
+	struct rtrs_clt *clt = sess->clt;
+	enum rtrs_clt_state old_state;
+	bool changed;
+
+	/*
+	 * That can happen only when userspace tries to remove path
+	 * very early, when rtrs_clt_open() is not yet finished.
+	 */
+	if (unlikely(!clt->opened))
+		return -EBUSY;
+
+	/*
+	 * Continue stopping path till state was changed to DEAD or
+	 * state was observed as DEAD:
+	 * 1. State was changed to DEAD - we were fast and nobody
+	 *    invoked rtrs_clt_reconnect(), which can again start
+	 *    reconnecting.
+	 * 2. State was observed as DEAD - we have someone in parallel
+	 *    removing the path.
+	 */
+	do {
+		rtrs_clt_close_conns(sess, true);
+	} while (!(changed = rtrs_clt_change_state_get_old(sess,
+							    RTRS_CLT_DEAD,
+							    &old_state)) &&
+		   old_state != RTRS_CLT_DEAD);
+
+	/*
+	 * If state was successfully changed to DEAD, commit suicide.
+	 */
+	if (likely(changed)) {
+		rtrs_clt_destroy_sess_files(sess, sysfs_self);
+		rtrs_clt_remove_path_from_arr(sess);
+		free_sess(sess);
+	}
+
+	return 0;
+}
+
+void rtrs_clt_set_max_reconnect_attempts(struct rtrs_clt *clt, int value)
+{
+	clt->max_reconnect_attempts = (unsigned int)value;
+}
+
+int rtrs_clt_get_max_reconnect_attempts(const struct rtrs_clt *clt)
+{
+	return (int)clt->max_reconnect_attempts;
+}
+
+int rtrs_clt_request(int dir, rtrs_conf_fn *conf, struct rtrs_clt *clt,
+		      struct rtrs_permit *permit, void *priv,
+		      const struct kvec *vec, size_t nr, size_t data_len,
+		      struct scatterlist *sg, unsigned int sg_cnt)
+{
+	struct rtrs_clt_io_req *req;
+	struct rtrs_clt_sess *sess;
+
+	enum dma_data_direction dma_dir;
+	int err = -ECONNABORTED, i;
+	size_t usr_len, hdr_len;
+	struct path_it it;
+
+	/* Get kvec length */
+	for (i = 0, usr_len = 0; i < nr; i++)
+		usr_len += vec[i].iov_len;
+
+	if (dir == READ) {
+		hdr_len = sizeof(struct rtrs_msg_rdma_read) +
+			  sg_cnt * sizeof(struct rtrs_sg_desc);
+		dma_dir = DMA_FROM_DEVICE;
+	} else {
+		hdr_len = sizeof(struct rtrs_msg_rdma_write);
+		dma_dir = DMA_TO_DEVICE;
+	}
+
+	do_each_path(sess, clt, &it) {
+		if (unlikely(READ_ONCE(sess->state) != RTRS_CLT_CONNECTED))
+			continue;
+
+		if (unlikely(usr_len + hdr_len > sess->max_hdr_size)) {
+			rtrs_wrn_rl(sess->clt,
+				     "%s request failed, user message size is %zu and header length %zu, but max size is %u\n",
+				     dir == READ ? "Read" : "Write",
+				     usr_len, hdr_len, sess->max_hdr_size);
+			err = -EMSGSIZE;
+			break;
+		}
+		req = rtrs_clt_get_req(sess, conf, permit, priv, vec, usr_len,
+					sg, sg_cnt, data_len, dma_dir);
+		if (dir == READ)
+			err = rtrs_clt_read_req(req);
+		else
+			err = rtrs_clt_write_req(req);
+		if (unlikely(err)) {
+			req->in_use = false;
+			continue;
+		}
+		/* Success path */
+		break;
+	} while_each_path(&it);
+
+	return err;
+}
+EXPORT_SYMBOL(rtrs_clt_request);
+
+int rtrs_clt_query(struct rtrs_clt *clt, struct rtrs_attrs *attr)
+{
+	if (unlikely(!rtrs_clt_is_connected(clt)))
+		return -ECOMM;
+
+	attr->queue_depth      = clt->queue_depth;
+	attr->max_io_size      = clt->max_io_size;
+	attr->sess_kobj	       = &clt->dev.kobj;
+	strlcpy(attr->sessname, clt->sessname, sizeof(attr->sessname));
+
+	return 0;
+}
+EXPORT_SYMBOL(rtrs_clt_query);
+
+int rtrs_clt_create_path_from_sysfs(struct rtrs_clt *clt,
+				     struct rtrs_addr *addr)
+{
+	struct rtrs_clt_sess *sess;
+	int err;
+
+	sess = alloc_sess(clt, addr, nr_cons_per_session, clt->max_segments);
+	if (IS_ERR(sess))
+		return PTR_ERR(sess);
+
+	/*
+	 * It is totally safe to add a path in CONNECTING state: incoming
+	 * IO will never grab it.  Also it is very important to add the
+	 * path before init, since init fires the LINK_CONNECTED event.
+	 */
+	rtrs_clt_add_path_to_arr(sess, addr);
+
+	err = init_sess(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	err = rtrs_clt_create_sess_files(sess);
+	if (unlikely(err))
+		goto close_sess;
+
+	return 0;
+
+close_sess:
+	rtrs_clt_remove_path_from_arr(sess);
+	rtrs_clt_close_conns(sess, true);
+	free_sess(sess);
+
+	return err;
+}
+
+static int check_module_params(void)
+{
+	if (nr_cons_per_session == 0)
+		nr_cons_per_session = min_t(unsigned int, nr_cpu_ids, U16_MAX);
+
+	return 0;
+}
+
+static int rtrs_clt_ib_dev_init(struct rtrs_ib_dev *dev)
+{
+	if (!(dev->ib_dev->attrs.device_cap_flags &
+	      IB_DEVICE_MEM_MGT_EXTENSIONS)) {
+		pr_err("Memory registrations not supported.\n");
+		return -ENOTSUPP;
+	}
+
+	return 0;
+}
+
+static const struct rtrs_ib_dev_pool_ops dev_pool_ops = {
+	.init = rtrs_clt_ib_dev_init
+};
+
+static int __init rtrs_client_init(void)
+{
+	int err;
+
+	pr_info("Loading module %s, proto %s: (retry_cnt: %d, noreg_cnt: %d)\n",
+		KBUILD_MODNAME, RTRS_PROTO_VER_STRING,
+		retry_cnt, noreg_cnt);
+
+	rtrs_ib_dev_pool_init(noreg_cnt ? IB_PD_UNSAFE_GLOBAL_RKEY : 0,
+			       &dev_pool);
+
+	err = check_module_params();
+	if (unlikely(err)) {
+		pr_err("Failed to load module, invalid module parameters, err: %d\n",
+		       err);
+		return err;
+	}
+	rtrs_dev_class = class_create(THIS_MODULE, "rtrs-client");
+	if (IS_ERR(rtrs_dev_class)) {
+		pr_err("Failed to create rtrs-client dev class\n");
+		return PTR_ERR(rtrs_dev_class);
+	}
+	rtrs_wq = alloc_workqueue("rtrs_client_wq", WQ_MEM_RECLAIM, 0);
+	if (unlikely(!rtrs_wq)) {
+		pr_err("Failed to load module, alloc rtrs_client_wq failed\n");
+		class_destroy(rtrs_dev_class);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void __exit rtrs_client_exit(void)
+{
+	destroy_workqueue(rtrs_wq);
+	class_destroy(rtrs_dev_class);
+	rtrs_ib_dev_pool_deinit(&dev_pool);
+}
+
+module_init(rtrs_client_init);
+module_exit(rtrs_client_exit);
-- 
2.17.1



* [PATCH v6 07/25] rtrs: client: statistics functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (5 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 06/25] rtrs: client: main functionality Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 21:07   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 08/25] rtrs: client: sysfs interface functions Jack Wang
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This introduces a set of functions used on the client side to account
statistics of RDMA data sent/received, the number of IOs in flight,
latency, CPU migrations, etc.  Almost all statistics are collected
using percpu variables.
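
The latency accounting can be illustrated with a small userspace sketch.
It is not the kernel code: the MIN_LOG_LAT/MAX_LOG_LAT values are assumed
for illustration (the real ones are defined in rtrs-clt.h), NR_SLOTS and
the lat_*() names are hypothetical, and a plain array stands in for the
percpu variables. It shows the two ideas the patch relies on: a latency
in ms is mapped to a power-of-two bucket, and each CPU only touches its
own counters, which are summed when the statistics are read.

```c
/* Assumed values for illustration; the real ones live in rtrs-clt.h. */
#define MIN_LOG_LAT 0
#define MAX_LOG_LAT 16
#define LOG_LAT_SZ  (MAX_LOG_LAT - MIN_LOG_LAT + 2)
#define NR_SLOTS    4	/* hypothetical stand-in for the possible CPUs */

struct lat_bucket {
	unsigned long long read;
	unsigned long long write;
};

/* One row of buckets per slot, mimicking per-CPU storage. */
static struct lat_bucket lat_distr[NR_SLOTS][LOG_LAT_SZ];

/* Floor of log2(v) for v > 0; portable stand-in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long v)
{
	int l = -1;

	while (v) {
		v >>= 1;
		l++;
	}
	return l;
}

/* Same mapping as rtrs_clt_ms_to_id(): bucket i collects ms < 2^(i + MIN_LOG_LAT). */
static int lat_ms_to_id(unsigned long ms)
{
	int id = ms ? ilog2_ul(ms) - MIN_LOG_LAT + 1 : 0;

	if (id < 0)
		id = 0;
	if (id > LOG_LAT_SZ - 1)
		id = LOG_LAT_SZ - 1;
	return id;
}

/* Writer side: only the caller's own slot is modified, so no locking. */
static void lat_update(unsigned int slot, int is_read, unsigned long ms)
{
	struct lat_bucket *b = &lat_distr[slot % NR_SLOTS][lat_ms_to_id(ms)];

	if (is_read)
		b->read++;
	else
		b->write++;
}

/* Reader side: sum one bucket over all slots, as the *_to_str() helpers do. */
static unsigned long long lat_sum(int id, int is_read)
{
	unsigned long long sum = 0;
	unsigned int slot;

	for (slot = 0; slot < NR_SLOTS; slot++)
		sum += is_read ? lat_distr[slot][id].read :
				 lat_distr[slot][id].write;
	return sum;
}
```

With these assumed constants, a 3 ms read recorded in slot 0 and a 2 ms
read recorded in slot 2 both land in bucket 2 ("< 4 ms"), so the summed
read count for that bucket is 2.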

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-clt-stats.c | 435 +++++++++++++++++++
 1 file changed, 435 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt-stats.c

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt-stats.c b/drivers/infiniband/ulp/rtrs/rtrs-clt-stats.c
new file mode 100644
index 000000000000..aff8ebd3b9f7
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt-stats.c
@@ -0,0 +1,435 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "rtrs-clt.h"
+
+static inline int rtrs_clt_ms_to_id(unsigned long ms)
+{
+	int id = ms ? ilog2(ms) - MIN_LOG_LAT + 1 : 0;
+
+	return clamp(id, 0, LOG_LAT_SZ - 1);
+}
+
+void rtrs_clt_update_rdma_lat(struct rtrs_clt_stats *stats, bool read,
+			       unsigned long ms)
+{
+	struct rtrs_clt_stats_pcpu *s;
+	int id;
+
+	id = rtrs_clt_ms_to_id(ms);
+	s = this_cpu_ptr(stats->pcpu_stats);
+	if (read) {
+		s->rdma_lat_distr[id].read++;
+		if (s->rdma_lat_max.read < ms)
+			s->rdma_lat_max.read = ms;
+	} else {
+		s->rdma_lat_distr[id].write++;
+		if (s->rdma_lat_max.write < ms)
+			s->rdma_lat_max.write = ms;
+	}
+}
+
+void rtrs_clt_decrease_inflight(struct rtrs_clt_stats *stats)
+{
+	atomic_dec(&stats->inflight);
+}
+
+void rtrs_clt_update_wc_stats(struct rtrs_clt_con *con)
+{
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rtrs_clt_stats *stats = &sess->stats;
+	struct rtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	cpu = raw_smp_processor_id();
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->wc_comp.cnt++;
+	s->wc_comp.total_cnt++;
+	if (unlikely(con->cpu != cpu)) {
+		s->cpu_migr.to++;
+
+		/* Careful here, override s pointer */
+		s = per_cpu_ptr(stats->pcpu_stats, con->cpu);
+		atomic_inc(&s->cpu_migr.from);
+	}
+}
+
+void rtrs_clt_inc_failover_cnt(struct rtrs_clt_stats *stats)
+{
+	struct rtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.failover_cnt++;
+}
+
+static inline u32 rtrs_clt_stats_get_avg_wc_cnt(struct rtrs_clt_stats *stats)
+{
+	u32 cnt = 0;
+	u64 sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct rtrs_clt_stats_pcpu *s;
+
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		sum += s->wc_comp.total_cnt;
+		cnt += s->wc_comp.cnt;
+	}
+
+	return cnt ? sum / cnt : 0;
+}
+
+int rtrs_clt_stats_wc_completion_to_str(struct rtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	return scnprintf(buf, len, "%u\n",
+			 rtrs_clt_stats_get_avg_wc_cnt(stats));
+}
+
+ssize_t rtrs_clt_stats_rdma_lat_distr_to_str(struct rtrs_clt_stats *stats,
+					      char *page, size_t len)
+{
+	struct rtrs_clt_stats_rdma_lat res[LOG_LAT_SZ];
+	struct rtrs_clt_stats_rdma_lat max;
+	struct rtrs_clt_stats_pcpu *s;
+
+	ssize_t cnt = 0;
+	int i, cpu;
+
+	max.write = 0;
+	max.read = 0;
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+		if (max.write < s->rdma_lat_max.write)
+			max.write = s->rdma_lat_max.write;
+		if (max.read < s->rdma_lat_max.read)
+			max.read = s->rdma_lat_max.read;
+	}
+	for (i = 0; i < ARRAY_SIZE(res); i++) {
+		res[i].write = 0;
+		res[i].read = 0;
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+			res[i].write += s->rdma_lat_distr[i].write;
+			res[i].read += s->rdma_lat_distr[i].read;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(res) - 1; i++)
+		cnt += scnprintf(page + cnt, len - cnt,
+				 "< %6d ms: %llu %llu\n",
+				 1 << (i + MIN_LOG_LAT), res[i].read,
+				 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, ">= %5d ms: %llu %llu\n",
+			 1 << (i - 1 + MIN_LOG_LAT), res[i].read,
+			 res[i].write);
+	cnt += scnprintf(page + cnt, len - cnt, " maximum ms: %llu %llu\n",
+			 max.read, max.write);
+
+	return cnt;
+}
+
+int rtrs_clt_stats_migration_cnt_to_str(struct rtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct rtrs_clt_stats_pcpu *s;
+
+	size_t used;
+	int cpu;
+
+	used = scnprintf(buf, len, "    ");
+	for_each_possible_cpu(cpu)
+		used += scnprintf(buf + used, len - used, " CPU%u", cpu);
+
+	used += scnprintf(buf + used, len - used, "\nfrom:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  atomic_read(&s->cpu_migr.from));
+	}
+
+	used += scnprintf(buf + used, len - used, "\nto  :");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		used += scnprintf(buf + used, len - used, " %d",
+				  s->cpu_migr.to);
+	}
+	used += scnprintf(buf + used, len - used, "\n");
+
+	return used;
+}
+
+int rtrs_clt_stats_reconnects_to_str(struct rtrs_clt_stats *stats, char *buf,
+				      size_t len)
+{
+	return scnprintf(buf, len, "%d %d\n",
+			 stats->reconnects.successful_cnt,
+			 stats->reconnects.fail_cnt);
+}
+
+ssize_t rtrs_clt_stats_rdma_to_str(struct rtrs_clt_stats *stats,
+				    char *page, size_t len)
+{
+	struct rtrs_clt_stats_rdma sum;
+	struct rtrs_clt_stats_rdma *r;
+	int cpu;
+
+	memset(&sum, 0, sizeof(sum));
+
+	for_each_possible_cpu(cpu) {
+		r = &per_cpu_ptr(stats->pcpu_stats, cpu)->rdma;
+
+		sum.dir[READ].cnt	  += r->dir[READ].cnt;
+		sum.dir[READ].size_total  += r->dir[READ].size_total;
+		sum.dir[WRITE].cnt	  += r->dir[WRITE].cnt;
+		sum.dir[WRITE].size_total += r->dir[WRITE].size_total;
+		sum.failover_cnt	  += r->failover_cnt;
+	}
+
+	return scnprintf(page, len, "%llu %llu %llu %llu %u %llu\n",
+			 sum.dir[READ].cnt, sum.dir[READ].size_total,
+			 sum.dir[WRITE].cnt, sum.dir[WRITE].size_total,
+			 atomic_read(&stats->inflight), sum.failover_cnt);
+}
+
+int rtrs_clt_stats_sg_list_distr_to_str(struct rtrs_clt_stats *stats,
+					 char *buf, size_t len)
+{
+	struct rtrs_clt_stats_pcpu *s;
+
+	int i, cpu, cnt;
+
+	cnt = scnprintf(buf, len, "n\\cpu:");
+	for_each_possible_cpu(cpu)
+		cnt += scnprintf(buf + cnt, len - cnt, "%5d", cpu);
+
+	for (i = 0; i < SG_DISTR_SZ; i++) {
+		if (i <= MAX_LIN_SG)
+			cnt += scnprintf(buf + cnt, len - cnt, "\n= %3d:", i);
+		else if (i < SG_DISTR_SZ - 1)
+			cnt += scnprintf(buf + cnt, len - cnt, "\n< %3d:",
+					 1 << (i + MIN_LOG_SG - MAX_LIN_SG));
+		else
+			cnt += scnprintf(buf + cnt, len - cnt, "\n>=%3d:",
+					 1 << (i + MIN_LOG_SG -
+					       MAX_LIN_SG - 1));
+
+		for_each_possible_cpu(cpu) {
+			unsigned int p, p_i, p_f;
+			u64 total, distr;
+
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			total = s->sg_list_total;
+			distr = s->sg_list_distr[i];
+
+			p = total ? distr * 1000 / total : 0;
+			p_i = p / 10;
+			p_f = p % 10;
+
+			if (distr)
+				cnt += scnprintf(buf + cnt, len - cnt,
+						 " %2u.%01u", p_i, p_f);
+			else
+				cnt += scnprintf(buf + cnt, len - cnt, "    0");
+		}
+	}
+
+	cnt += scnprintf(buf + cnt, len - cnt, "\ntotal:");
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		cnt += scnprintf(buf + cnt, len - cnt, " %llu",
+				 s->sg_list_total);
+	}
+	cnt += scnprintf(buf + cnt, len - cnt, "\n");
+
+	return cnt;
+}
+
+ssize_t rtrs_clt_reset_all_help(struct rtrs_clt_stats *s,
+				 char *page, size_t len)
+{
+	return scnprintf(page, len, "echo 1 to reset all statistics\n");
+}
+
+int rtrs_clt_reset_rdma_stats(struct rtrs_clt_stats *stats, bool enable)
+{
+	struct rtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->rdma, 0, sizeof(s->rdma));
+	}
+
+	return 0;
+}
+
+int rtrs_clt_reset_rdma_lat_distr_stats(struct rtrs_clt_stats *stats,
+					 bool enable)
+{
+	struct rtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (enable) {
+		for_each_possible_cpu(cpu) {
+			s = per_cpu_ptr(stats->pcpu_stats, cpu);
+			memset(&s->rdma_lat_max, 0, sizeof(s->rdma_lat_max));
+			memset(&s->rdma_lat_distr, 0,
+			       sizeof(s->rdma_lat_distr));
+		}
+	}
+	stats->enable_rdma_lat = enable;
+
+	return 0;
+}
+
+int rtrs_clt_reset_sg_list_distr_stats(struct rtrs_clt_stats *stats,
+					bool enable)
+{
+	struct rtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->sg_list_total, 0, sizeof(s->sg_list_total));
+		memset(&s->sg_list_distr, 0, sizeof(s->sg_list_distr));
+	}
+
+	return 0;
+}
+
+int rtrs_clt_reset_cpu_migr_stats(struct rtrs_clt_stats *stats, bool enable)
+{
+	struct rtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->cpu_migr, 0, sizeof(s->cpu_migr));
+	}
+
+	return 0;
+}
+
+int rtrs_clt_reset_reconnects_stat(struct rtrs_clt_stats *stats, bool enable)
+{
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	memset(&stats->reconnects, 0, sizeof(stats->reconnects));
+
+	return 0;
+}
+
+int rtrs_clt_reset_wc_comp_stats(struct rtrs_clt_stats *stats, bool enable)
+{
+	struct rtrs_clt_stats_pcpu *s;
+	int cpu;
+
+	if (unlikely(!enable))
+		return -EINVAL;
+
+	for_each_possible_cpu(cpu) {
+		s = per_cpu_ptr(stats->pcpu_stats, cpu);
+		memset(&s->wc_comp, 0, sizeof(s->wc_comp));
+	}
+
+	return 0;
+}
+
+int rtrs_clt_reset_all_stats(struct rtrs_clt_stats *s, bool enable)
+{
+	if (enable) {
+		rtrs_clt_reset_rdma_stats(s, enable);
+		rtrs_clt_reset_rdma_lat_distr_stats(s, enable);
+		rtrs_clt_reset_sg_list_distr_stats(s, enable);
+		rtrs_clt_reset_cpu_migr_stats(s, enable);
+		rtrs_clt_reset_reconnects_stat(s, enable);
+		rtrs_clt_reset_wc_comp_stats(s, enable);
+		atomic_set(&s->inflight, 0);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static inline void rtrs_clt_record_sg_distr(u64 stat[SG_DISTR_SZ], u64 *total,
+					     unsigned int cnt)
+{
+	int i;
+
+	i = cnt > MAX_LIN_SG ? ilog2(cnt) + MAX_LIN_SG - MIN_LOG_SG + 1 : cnt;
+	i = i < SG_DISTR_SZ ? i : SG_DISTR_SZ - 1;
+
+	stat[i]++;
+	(*total)++;
+}
+
+static inline void rtrs_clt_update_rdma_stats(struct rtrs_clt_stats *stats,
+					       size_t size, int d)
+{
+	struct rtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	s->rdma.dir[d].cnt++;
+	s->rdma.dir[d].size_total += size;
+}
+
+void rtrs_clt_update_all_stats(struct rtrs_clt_io_req *req, int dir)
+{
+	struct rtrs_clt_con *con = req->con;
+	struct rtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+	struct rtrs_clt_stats *stats = &sess->stats;
+	unsigned int len;
+
+	struct rtrs_clt_stats_pcpu *s;
+
+	s = this_cpu_ptr(stats->pcpu_stats);
+	rtrs_clt_record_sg_distr(s->sg_list_distr, &s->sg_list_total,
+				  req->sg_cnt);
+	len = req->usr_len + req->data_len;
+	rtrs_clt_update_rdma_stats(stats, len, dir);
+	atomic_inc(&stats->inflight);
+}
+
+int rtrs_clt_init_stats(struct rtrs_clt_stats *stats)
+{
+	stats->enable_rdma_lat = false;
+	stats->pcpu_stats = alloc_percpu(typeof(*stats->pcpu_stats));
+	if (unlikely(!stats->pcpu_stats))
+		return -ENOMEM;
+
+	/*
+	 * successful_cnt will be set to 0 after session
+	 * is established for the first time
+	 */
+	stats->reconnects.successful_cnt = -1;
+
+	return 0;
+}
+
+void rtrs_clt_free_stats(struct rtrs_clt_stats *stats)
+{
+	free_percpu(stats->pcpu_stats);
+}
-- 
2.17.1



* [PATCH v6 08/25] rtrs: client: sysfs interface functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (6 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 07/25] rtrs: client: statistics functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 21:14   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 09/25] rtrs: server: private header with server structs and functions Jack Wang
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the sysfs interface to rtrs sessions on the client side:

  /sys/devices/virtual/rtrs-client/<SESS-NAME>/
    *** rtrs session created by rtrs_clt_open() API call
    |
    |- max_reconnect_attempts
    |  *** number of reconnect attempts for session
    |
    |- add_path
    |  *** adds another connection path into rtrs session
    |
    |- paths/<SRC@DST>/
       *** established paths to server in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- reconnect
       |  *** reconnect path
       |
       |- remove_path
       |  *** remove current path
       |
       |- state
       |  *** retrieve current path state
       |
       |- hca_port
       |  *** HCA port number
       |
       |- hca_name
       |  *** HCA name
       |
       |- stats/
          *** current path statistics
          |
          |- cpu_migration
          |- rdma
          |- rdma_lat
          |- reconnects
          |- reset_all
          |- sg_entries
          |- wc_completions
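
As an illustration of how userspace might consume one of these files,
here is a minimal sketch that parses the stats/rdma entry. It assumes the
six-field layout emitted by rtrs_clt_stats_rdma_to_str() in the statistics
patch ("%llu %llu %llu %llu %u %llu": read count, read size total, write
count, write size total, inflight, failover count). The struct and the
function names are hypothetical and not part of this series.

```c
#include <stdio.h>

/* Hypothetical userspace view of one stats/rdma line. */
struct rdma_stats {
	unsigned long long read_cnt;
	unsigned long long read_size_total;
	unsigned long long write_cnt;
	unsigned long long write_size_total;
	unsigned int inflight;
	unsigned long long failover_cnt;
};

/* Returns 0 on success, -1 if the line does not hold all six fields. */
static int parse_rdma_stats(const char *line, struct rdma_stats *st)
{
	int n = sscanf(line, "%llu %llu %llu %llu %u %llu",
		       &st->read_cnt, &st->read_size_total,
		       &st->write_cnt, &st->write_size_total,
		       &st->inflight, &st->failover_cnt);

	return n == 6 ? 0 : -1;
}

/* Reads the first line of the file at path and parses it. */
static int read_rdma_stats(const char *path, struct rdma_stats *st)
{
	char buf[256];
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = fgets(buf, sizeof(buf), f) ? parse_rdma_stats(buf, st) : -1;
	fclose(f);
	return ret;
}
```

A caller would pass the path of a concrete session and path, i.e.
/sys/devices/virtual/rtrs-client/<SESS-NAME>/paths/<SRC@DST>/stats/rdma
with both placeholders filled in.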

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-clt-sysfs.c | 501 +++++++++++++++++++
 1 file changed, 501 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-clt-sysfs.c

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt-sysfs.c b/drivers/infiniband/ulp/rtrs/rtrs-clt-sysfs.c
new file mode 100644
index 000000000000..ad4eb2c58fb2
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt-sysfs.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "rtrs-pri.h"
+#include "rtrs-clt.h"
+#include "rtrs-log.h"
+
+#define MIN_MAX_RECONN_ATT -1
+#define MAX_MAX_RECONN_ATT 9999
+
+static struct kobj_type ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static ssize_t max_reconnect_attempts_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *page)
+{
+	struct rtrs_clt *clt;
+
+	clt = container_of(dev, struct rtrs_clt, dev);
+
+	return sprintf(page, "%d\n", rtrs_clt_get_max_reconnect_attempts(clt));
+}
+
+static ssize_t max_reconnect_attempts_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf,
+					    size_t count)
+{
+	struct rtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(dev, struct rtrs_clt, dev);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (unlikely(ret)) {
+		rtrs_err(clt, "%s: failed to convert string '%s' to int\n",
+			  attr->attr.name, buf);
+		return ret;
+	}
+	if (unlikely(value > MAX_MAX_RECONN_ATT ||
+		     value < MIN_MAX_RECONN_ATT)) {
+		rtrs_err(clt,
+			  "%s: invalid range (provided: '%s', accepted: min: %d, max: %d)\n",
+			  attr->attr.name, buf, MIN_MAX_RECONN_ATT,
+			  MAX_MAX_RECONN_ATT);
+		return -EINVAL;
+	}
+	rtrs_clt_set_max_reconnect_attempts(clt, value);
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(max_reconnect_attempts);
+
+static ssize_t mpath_policy_show(struct device *dev,
+				 struct device_attribute *attr,
+				 char *page)
+{
+	struct rtrs_clt *clt;
+
+	clt = container_of(dev, struct rtrs_clt, dev);
+
+	switch (clt->mp_policy) {
+	case MP_POLICY_RR:
+		return sprintf(page, "round-robin (RR: %d)\n", clt->mp_policy);
+	case MP_POLICY_MIN_INFLIGHT:
+		return sprintf(page, "min-inflight (MI: %d)\n", clt->mp_policy);
+	default:
+		return sprintf(page, "Unknown (%d)\n", clt->mp_policy);
+	}
+}
+
+static ssize_t mpath_policy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf,
+				  size_t count)
+{
+	struct rtrs_clt *clt;
+	int value;
+	int ret;
+
+	clt = container_of(dev, struct rtrs_clt, dev);
+
+	ret = kstrtoint(buf, 10, &value);
+	if (!ret && (value == MP_POLICY_RR ||
+		     value == MP_POLICY_MIN_INFLIGHT)) {
+		clt->mp_policy = value;
+		return count;
+	}
+
+	if (!strncasecmp(buf, "round-robin", 11) ||
+	    !strncasecmp(buf, "rr", 2))
+		clt->mp_policy = MP_POLICY_RR;
+	else if (!strncasecmp(buf, "min-inflight", 12) ||
+		 !strncasecmp(buf, "mi", 2))
+		clt->mp_policy = MP_POLICY_MIN_INFLIGHT;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(mpath_policy);
+
+static ssize_t add_path_show(struct device *dev,
+			     struct device_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE,
+			 "Usage: echo [<source addr>@]<destination addr> > %s\n\n*addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]\n",
+			 attr->attr.name);
+}
+
+static ssize_t add_path_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct sockaddr_storage srcaddr, dstaddr;
+	struct rtrs_addr addr = {
+		.src = &srcaddr,
+		.dst = &dstaddr
+	};
+	struct rtrs_clt *clt;
+	const char *nl;
+	size_t len;
+	int err;
+
+	clt = container_of(dev, struct rtrs_clt, dev);
+
+	nl = strchr(buf, '\n');
+	if (nl)
+		len = nl - buf;
+	else
+		len = count;
+	err = rtrs_addr_to_sockaddr(buf, len, clt->port, &addr);
+	if (unlikely(err))
+		return -EINVAL;
+
+	err = rtrs_clt_create_path_from_sysfs(clt, &addr);
+	if (unlikely(err))
+		return err;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(add_path);
+
+static ssize_t rtrs_clt_state_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *page)
+{
+	struct rtrs_clt_sess *sess;
+
+	sess = container_of(kobj, struct rtrs_clt_sess, kobj);
+	if (sess->state == RTRS_CLT_CONNECTED)
+		return sprintf(page, "connected\n");
+
+	return sprintf(page, "disconnected\n");
+}
+
+static struct kobj_attribute rtrs_clt_state_attr =
+	__ATTR(state, 0444, rtrs_clt_state_show, NULL);
+
+static ssize_t rtrs_clt_reconnect_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t rtrs_clt_reconnect_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct rtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct rtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		rtrs_err(sess->clt, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = rtrs_clt_reconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute rtrs_clt_reconnect_attr =
+	__ATTR(reconnect, 0644, rtrs_clt_reconnect_show,
+	       rtrs_clt_reconnect_store);
+
+static ssize_t rtrs_clt_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t rtrs_clt_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct rtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct rtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		rtrs_err(sess->clt, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = rtrs_clt_disconnect_from_sysfs(sess);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute rtrs_clt_disconnect_attr =
+	__ATTR(disconnect, 0644, rtrs_clt_disconnect_show,
+	       rtrs_clt_disconnect_store);
+
+static ssize_t rtrs_clt_remove_path_show(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t rtrs_clt_remove_path_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct rtrs_clt_sess *sess;
+	int ret;
+
+	sess = container_of(kobj, struct rtrs_clt_sess, kobj);
+	if (!sysfs_streq(buf, "1")) {
+		rtrs_err(sess->clt, "%s: unknown value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+	ret = rtrs_clt_remove_path_from_sysfs(sess, &attr->attr);
+	if (unlikely(ret))
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute rtrs_clt_remove_path_attr =
+	__ATTR(remove_path, 0644, rtrs_clt_remove_path_show,
+	       rtrs_clt_remove_path_store);
+
+STAT_ATTR(struct rtrs_clt_sess, cpu_migration,
+	  rtrs_clt_stats_migration_cnt_to_str,
+	  rtrs_clt_reset_cpu_migr_stats);
+
+STAT_ATTR(struct rtrs_clt_sess, sg_entries,
+	  rtrs_clt_stats_sg_list_distr_to_str,
+	  rtrs_clt_reset_sg_list_distr_stats);
+
+STAT_ATTR(struct rtrs_clt_sess, reconnects,
+	  rtrs_clt_stats_reconnects_to_str,
+	  rtrs_clt_reset_reconnects_stat);
+
+STAT_ATTR(struct rtrs_clt_sess, rdma_lat,
+	  rtrs_clt_stats_rdma_lat_distr_to_str,
+	  rtrs_clt_reset_rdma_lat_distr_stats);
+
+STAT_ATTR(struct rtrs_clt_sess, wc_completion,
+	  rtrs_clt_stats_wc_completion_to_str,
+	  rtrs_clt_reset_wc_comp_stats);
+
+STAT_ATTR(struct rtrs_clt_sess, rdma,
+	  rtrs_clt_stats_rdma_to_str,
+	  rtrs_clt_reset_rdma_stats);
+
+STAT_ATTR(struct rtrs_clt_sess, reset_all,
+	  rtrs_clt_reset_all_help,
+	  rtrs_clt_reset_all_stats);
+
+static struct attribute *rtrs_clt_stats_attrs[] = {
+	&sg_entries_attr.attr,
+	&cpu_migration_attr.attr,
+	&reconnects_attr.attr,
+	&rdma_lat_attr.attr,
+	&wc_completion_attr.attr,
+	&rdma_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group rtrs_clt_stats_attr_group = {
+	.attrs = rtrs_clt_stats_attrs,
+};
+
+static int rtrs_clt_create_stats_files(struct kobject *kobj,
+					struct kobject *kobj_stats)
+{
+	int ret;
+
+	ret = kobject_init_and_add(kobj_stats, &ktype, kobj, "stats");
+	if (ret) {
+		pr_err("Failed to init and add stats kobject, err: %d\n",
+		       ret);
+		return ret;
+	}
+
+	ret = sysfs_create_group(kobj_stats, &rtrs_clt_stats_attr_group);
+	if (ret) {
+		pr_err("failed to create stats sysfs group, err: %d\n",
+		       ret);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_del(kobj_stats);
+	kobject_put(kobj_stats);
+
+	return ret;
+}
+
+static ssize_t rtrs_clt_hca_port_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_clt_sess *sess;
+
+	sess = container_of(kobj, typeof(*sess), kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%u\n", sess->hca_port);
+}
+
+static struct kobj_attribute rtrs_clt_hca_port_attr =
+	__ATTR(hca_port, 0444, rtrs_clt_hca_port_show, NULL);
+
+static ssize_t rtrs_clt_hca_name_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_clt_sess *sess;
+
+	sess = container_of(kobj, struct rtrs_clt_sess, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", sess->hca_name);
+}
+
+static struct kobj_attribute rtrs_clt_hca_name_attr =
+	__ATTR(hca_name, 0444, rtrs_clt_hca_name_show, NULL);
+
+static ssize_t rtrs_clt_src_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_clt_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct rtrs_clt_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute rtrs_clt_src_addr_attr =
+	__ATTR(src_addr, 0444, rtrs_clt_src_addr_show, NULL);
+
+static ssize_t rtrs_clt_dst_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_clt_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct rtrs_clt_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute rtrs_clt_dst_addr_attr =
+	__ATTR(dst_addr, 0444, rtrs_clt_dst_addr_show, NULL);
+
+static struct attribute *rtrs_clt_sess_attrs[] = {
+	&rtrs_clt_hca_name_attr.attr,
+	&rtrs_clt_hca_port_attr.attr,
+	&rtrs_clt_src_addr_attr.attr,
+	&rtrs_clt_dst_addr_attr.attr,
+	&rtrs_clt_state_attr.attr,
+	&rtrs_clt_reconnect_attr.attr,
+	&rtrs_clt_disconnect_attr.attr,
+	&rtrs_clt_remove_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group rtrs_clt_sess_attr_group = {
+	.attrs = rtrs_clt_sess_attrs,
+};
+
+int rtrs_clt_create_sess_files(struct rtrs_clt_sess *sess)
+{
+	struct rtrs_clt *clt = sess->clt;
+	char str[NAME_MAX];
+	int err, cnt;
+
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			      str, sizeof(str));
+	cnt += scnprintf(str + cnt, sizeof(str) - cnt, "@");
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			str + cnt, sizeof(str) - cnt);
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &clt->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add: %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj, &rtrs_clt_sess_attr_group);
+	if (unlikely(err)) {
+		pr_err("sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = rtrs_clt_create_stats_files(&sess->kobj, &sess->kobj_stats);
+	if (unlikely(err))
+		goto put_kobj;
+
+	return 0;
+
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+
+	return err;
+}
+
+void rtrs_clt_destroy_sess_files(struct rtrs_clt_sess *sess,
+				  const struct attribute *sysfs_self)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		if (sysfs_self)
+			/* Remove our own file first to avoid deadlock */
+			sysfs_remove_file_self(&sess->kobj, sysfs_self);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+	}
+}
+
+static struct attribute *rtrs_clt_attrs[] = {
+	&dev_attr_max_reconnect_attempts.attr,
+	&dev_attr_mpath_policy.attr,
+	&dev_attr_add_path.attr,
+	NULL,
+};
+
+static struct attribute_group rtrs_clt_attr_group = {
+	.attrs = rtrs_clt_attrs,
+};
+
+int rtrs_clt_create_sysfs_root_folders(struct rtrs_clt *clt)
+{
+	return kobject_init_and_add(&clt->kobj_paths, &ktype,
+				    &clt->dev.kobj, "paths");
+}
+
+int rtrs_clt_create_sysfs_root_files(struct rtrs_clt *clt)
+{
+	return sysfs_create_group(&clt->dev.kobj, &rtrs_clt_attr_group);
+}
+
+void rtrs_clt_destroy_sysfs_root_folders(struct rtrs_clt *clt)
+{
+	if (clt->kobj_paths.state_in_sysfs) {
+		kobject_del(&clt->kobj_paths);
+		kobject_put(&clt->kobj_paths);
+	}
+}
+
+void rtrs_clt_destroy_sysfs_root_files(struct rtrs_clt *clt)
+{
+	sysfs_remove_group(&clt->dev.kobj, &rtrs_clt_attr_group);
+}
-- 
2.17.1
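
Note on the add_path attribute in this patch: its usage string describes
the input as "[<source addr>@]<destination addr>". The splitting step can
be sketched in user space as below. This is only an illustration under
assumptions: split_path() is a hypothetical helper, and the real parsing
is done by rtrs_addr_to_sockaddr(), whose implementation is not part of
this patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Split "[<src>@]<dst>" in place. On return *dst points at the
 * destination and *src at the source, or NULL if no source was given.
 */
static void split_path(char *buf, char **src, char **dst)
{
	char *nl, *at;

	assert(buf && src && dst);

	/* Drop the trailing newline that "echo" appends. */
	nl = strchr(buf, '\n');
	if (nl)
		*nl = '\0';

	/* The optional source address is separated by '@'. */
	at = strchr(buf, '@');
	if (at) {
		*at = '\0';
		*src = buf;
		*dst = at + 1;
	} else {
		*src = NULL;
		*dst = buf;
	}
}
```

Each half still carries its "ip:" or "gid:" prefix and would need to be
converted to a sockaddr separately.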


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 09/25] rtrs: server: private header with server structs and functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (7 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 08/25] rtrs: client: sysfs interface functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 21:24   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 10/25] rtrs: server: main functionality Jack Wang
                   ` (17 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This header describes the main structs and functions used by the
rtrs-server module, mainly for accepting rtrs sessions,
creating/destroying sysfs entries, and accounting statistics on the
server side.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-srv.h | 141 +++++++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv.h
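
Note on the struct layout in this header: the server structs embed the
generic ones by value (struct rtrs_srv_con contains a struct rtrs_con,
struct rtrs_srv_sess contains a struct rtrs_sess), so code that is
handed the generic pointer can get back to the server-specific struct
with container_of(). A minimal user-space sketch of the pattern follows;
the structs are trimmed-down stand-ins, the "cid" field is used purely
for illustration, and the macro is a simplified replacement for the
kernel's container_of().

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down versions of the structs in the patch. */
struct rtrs_con {
	int cid;
};

struct rtrs_srv_con {
	struct rtrs_con c;	/* generic part, embedded by value */
	int wr_cnt;		/* server-specific part (atomic_t in the patch) */
};

/* Recover the enclosing server connection from the generic pointer. */
static struct rtrs_srv_con *to_srv_con(struct rtrs_con *c)
{
	assert(c);
	return container_of(c, struct rtrs_srv_con, c);
}
```

The conversion is plain pointer arithmetic, so it only works on pointers
that really point into an embedded member of the outer struct.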

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.h b/drivers/infiniband/ulp/rtrs/rtrs-srv.h
new file mode 100644
index 000000000000..6ab6e2d3f564
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.h
@@ -0,0 +1,141 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#ifndef RTRS_SRV_H
+#define RTRS_SRV_H
+
+#include <linux/device.h>
+#include <linux/refcount.h>
+#include "rtrs-pri.h"
+
+/**
+ * enum rtrs_srv_state - Server states.
+ */
+enum rtrs_srv_state {
+	RTRS_SRV_CONNECTING,
+	RTRS_SRV_CONNECTED,
+	RTRS_SRV_CLOSING,
+	RTRS_SRV_CLOSED,
+};
+
+struct rtrs_stats_wc_comp {
+	atomic64_t	calls;
+	atomic64_t	total_wc_cnt;
+};
+
+struct rtrs_srv_stats_rdma_stats {
+	struct {
+		atomic64_t	cnt;
+		atomic64_t	size_total;
+	} dir[2];
+};
+
+struct rtrs_srv_stats {
+	struct rtrs_srv_stats_rdma_stats	rdma_stats;
+	struct rtrs_stats_wc_comp		wc_comp;
+};
+
+struct rtrs_srv_con {
+	struct rtrs_con	c;
+	atomic_t		wr_cnt;
+};
+
+struct rtrs_srv_op {
+	struct rtrs_srv_con		*con;
+	u32				msg_id;
+	u8				dir;
+	struct rtrs_msg_rdma_read	*rd_msg;
+	struct ib_rdma_wr		*tx_wr;
+	struct ib_sge			*tx_sg;
+};
+
+struct rtrs_srv_mr {
+	struct ib_mr	*mr;
+	struct sg_table	sgt;
+	struct ib_cqe	inv_cqe; /* only for always_invalidate=true */
+	u32		msg_id; /* only for always_invalidate=true */
+	u32		msg_off; /* only for always_invalidate=true */
+	struct rtrs_iu	*iu; /* send buffer for new rkey msg */
+};
+
+struct rtrs_srv_sess {
+	struct rtrs_sess	s;
+	struct rtrs_srv	*srv;
+	struct work_struct	close_work;
+	enum rtrs_srv_state	state;
+	spinlock_t		state_lock;
+	int			cur_cq_vector;
+	struct rtrs_srv_op	**ops_ids;
+	atomic_t		ids_inflight;
+	wait_queue_head_t	ids_waitq;
+	struct rtrs_srv_mr	*mrs;
+	unsigned int		mrs_num;
+	dma_addr_t		*dma_addr;
+	bool			established;
+	unsigned int		mem_bits;
+	struct kobject		kobj;
+	struct kobject		kobj_stats;
+	struct rtrs_srv_stats	stats;
+};
+
+struct rtrs_srv {
+	struct list_head	paths_list;
+	int			paths_up;
+	struct mutex		paths_ev_mutex;
+	size_t			paths_num;
+	struct mutex		paths_mutex;
+	uuid_t			paths_uuid;
+	refcount_t		refcount;
+	struct rtrs_srv_ctx	*ctx;
+	struct list_head	ctx_list;
+	void			*priv;
+	size_t			queue_depth;
+	struct page		**chunks;
+	struct device		dev;
+	unsigned int		dev_ref;
+	struct kobject		kobj_paths;
+};
+
+struct rtrs_srv_ctx {
+	rdma_ev_fn *rdma_ev;
+	link_ev_fn *link_ev;
+	struct rdma_cm_id *cm_id_ip;
+	struct rdma_cm_id *cm_id_ib;
+	struct mutex srv_mutex;
+	struct list_head srv_list;
+};
+
+extern struct class *rtrs_dev_class;
+
+void close_sess(struct rtrs_srv_sess *sess);
+
+/* rtrs-srv-stats.c */
+
+void rtrs_srv_update_rdma_stats(struct rtrs_srv_stats *s, size_t size, int d);
+void rtrs_srv_update_wc_stats(struct rtrs_srv_stats *s);
+
+int rtrs_srv_reset_rdma_stats(struct rtrs_srv_stats *stats, bool enable);
+ssize_t rtrs_srv_stats_rdma_to_str(struct rtrs_srv_stats *stats,
+				    char *page, size_t len);
+int rtrs_srv_reset_wc_completion_stats(struct rtrs_srv_stats *stats,
+					bool enable);
+int rtrs_srv_stats_wc_completion_to_str(struct rtrs_srv_stats *stats, char *buf,
+					 size_t len);
+int rtrs_srv_reset_all_stats(struct rtrs_srv_stats *stats, bool enable);
+ssize_t rtrs_srv_reset_all_help(struct rtrs_srv_stats *stats,
+				 char *page, size_t len);
+
+/* rtrs-srv-sysfs.c */
+
+int rtrs_srv_create_sess_files(struct rtrs_srv_sess *sess);
+void rtrs_srv_destroy_sess_files(struct rtrs_srv_sess *sess);
+
+#endif /* RTRS_SRV_H */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 10/25] rtrs: server: main functionality
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (8 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 09/25] rtrs: server: private header with server structs and functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 22:03   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 11/25] rtrs: server: statistics functions Jack Wang
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the main functionality of the rtrs-server module, which accepts
a set of RDMA connections (a so-called rtrs session), creates/destroys
sysfs entries associated with the rtrs session, and notifies the upper
layer (the user of the RTRS API) about RDMA requests or link events.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-srv.c | 2169 ++++++++++++++++++++++++
 1 file changed, 2169 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv.c
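
Note on the session state machine in __rtrs_srv_change_state(): the
nested switch in the diff below allows only CONNECTING -> CONNECTED,
CONNECTING or CONNECTED -> CLOSING, and CLOSING -> CLOSED. A compact
user-space sketch of the same transition table follows. The names are
shortened stand-ins, and the locking is left out (the patch requires
sess->state_lock to be held).

```c
#include <assert.h>
#include <stdbool.h>

enum srv_state { SRV_CONNECTING, SRV_CONNECTED, SRV_CLOSING, SRV_CLOSED };

/* Returns true and updates *state only if the transition is allowed. */
static bool srv_change_state(enum srv_state *state, enum srv_state new_state)
{
	bool changed = false;

	assert(state);

	switch (new_state) {
	case SRV_CONNECTED:
		changed = (*state == SRV_CONNECTING);
		break;
	case SRV_CLOSING:
		changed = (*state == SRV_CONNECTING ||
			   *state == SRV_CONNECTED);
		break;
	case SRV_CLOSED:
		changed = (*state == SRV_CLOSING);
		break;
	default:
		/* Nothing may move back to CONNECTING. */
		break;
	}
	if (changed)
		*state = new_state;

	return changed;
}
```

A second request for CLOSING returns false, which is what lets
close_sess() queue the close work only once.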

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
new file mode 100644
index 000000000000..7ab51d8a3b4e
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -0,0 +1,2169 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/mempool.h>
+
+#include "rtrs-srv.h"
+#include "rtrs-log.h"
+
+MODULE_DESCRIPTION("RTRS Server");
+MODULE_LICENSE("GPL");
+
+/* Must be power of 2, see mask from mr->page_size in ib_sg_to_pages() */
+#define DEFAULT_MAX_CHUNK_SIZE (128 << 10)
+#define DEFAULT_SESS_QUEUE_DEPTH 512
+#define MAX_HDR_SIZE PAGE_SIZE
+#define MAX_SG_COUNT ((MAX_HDR_SIZE - sizeof(struct rtrs_msg_rdma_read)) \
+		      / sizeof(struct rtrs_sg_desc))
+
+/* We guarantee to serve 10 paths at least */
+#define CHUNK_POOL_SZ 10
+
+static struct rtrs_ib_dev_pool dev_pool;
+static mempool_t *chunk_pool;
+struct class *rtrs_dev_class;
+
+static int __read_mostly max_chunk_size = DEFAULT_MAX_CHUNK_SIZE;
+static int __read_mostly sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
+
+static bool always_invalidate = true;
+module_param(always_invalidate, bool, 0444);
+MODULE_PARM_DESC(always_invalidate,
+		 "Invalidate memory registration for contiguous memory regions before accessing.");
+
+module_param_named(max_chunk_size, max_chunk_size, int, 0444);
+MODULE_PARM_DESC(max_chunk_size,
+		 "Max size for each IO request, in bytes (default: "
+		 __stringify(DEFAULT_MAX_CHUNK_SIZE) ")");
+
+module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
+MODULE_PARM_DESC(sess_queue_depth,
+		 "Number of buffers for pending I/O requests to allocate per session. Maximum: "
+		 __stringify(MAX_SESS_QUEUE_DEPTH) " (default: "
+		 __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
+
+static char cq_affinity_list[256];
+static cpumask_t cq_affinity_mask = { CPU_BITS_ALL };
+
+static void init_cq_affinity(void)
+{
+	sprintf(cq_affinity_list, "0-%d", nr_cpu_ids - 1);
+}
+
+static int cq_affinity_list_set(const char *val, const struct kernel_param *kp)
+{
+	int ret = 0, len = strlen(val);
+	cpumask_var_t new_value;
+
+	init_cq_affinity();
+
+	if (len >= sizeof(cq_affinity_list))
+		return -EINVAL;
+	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = cpulist_parse(val, new_value);
+	if (ret) {
+		pr_err("Can't set cq_affinity_list \"%s\": %d\n", val,
+		       ret);
+		goto free_cpumask;
+	}
+
+	strlcpy(cq_affinity_list, val, sizeof(cq_affinity_list));
+	*strchrnul(cq_affinity_list, '\n') = '\0';
+	cpumask_copy(&cq_affinity_mask, new_value);
+
+	pr_info("cq_affinity_list changed to %*pbl\n",
+		cpumask_pr_args(&cq_affinity_mask));
+free_cpumask:
+	free_cpumask_var(new_value);
+	return ret;
+}
+
+static struct kparam_string cq_affinity_list_kparam_str = {
+	.maxlen	= sizeof(cq_affinity_list),
+	.string	= cq_affinity_list
+};
+
+static const struct kernel_param_ops cq_affinity_list_ops = {
+	.set	= cq_affinity_list_set,
+	.get	= param_get_string,
+};
+
+module_param_cb(cq_affinity_list, &cq_affinity_list_ops,
+		&cq_affinity_list_kparam_str, 0644);
+MODULE_PARM_DESC(cq_affinity_list,
+		 "Sets the list of cpus to use as cq vectors. (default: use all possible CPUs)");
+
+static struct workqueue_struct *rtrs_wq;
+
+static inline struct rtrs_srv_con *to_srv_con(struct rtrs_con *c)
+{
+	return container_of(c, struct rtrs_srv_con, c);
+}
+
+static inline struct rtrs_srv_sess *to_srv_sess(struct rtrs_sess *s)
+{
+	return container_of(s, struct rtrs_srv_sess, s);
+}
+
+static bool __rtrs_srv_change_state(struct rtrs_srv_sess *sess,
+				     enum rtrs_srv_state new_state)
+{
+	enum rtrs_srv_state old_state;
+	bool changed = false;
+
+	lockdep_assert_held(&sess->state_lock);
+	old_state = sess->state;
+	switch (new_state) {
+	case RTRS_SRV_CONNECTED:
+		switch (old_state) {
+		case RTRS_SRV_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_SRV_CLOSING:
+		switch (old_state) {
+		case RTRS_SRV_CONNECTING:
+		case RTRS_SRV_CONNECTED:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	case RTRS_SRV_CLOSED:
+		switch (old_state) {
+		case RTRS_SRV_CLOSING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+	if (changed)
+		sess->state = new_state;
+
+	return changed;
+}
+
+static bool rtrs_srv_change_state_get_old(struct rtrs_srv_sess *sess,
+					   enum rtrs_srv_state new_state,
+					   enum rtrs_srv_state *old_state)
+{
+	bool changed;
+
+	spin_lock_irq(&sess->state_lock);
+	*old_state = sess->state;
+	changed = __rtrs_srv_change_state(sess, new_state);
+	spin_unlock_irq(&sess->state_lock);
+
+	return changed;
+}
+
+static bool rtrs_srv_change_state(struct rtrs_srv_sess *sess,
+				   enum rtrs_srv_state new_state)
+{
+	enum rtrs_srv_state old_state;
+
+	return rtrs_srv_change_state_get_old(sess, new_state, &old_state);
+}
+
+static void free_id(struct rtrs_srv_op *id)
+{
+	if (!id)
+		return;
+	kfree(id->tx_wr);
+	kfree(id->tx_sg);
+	kfree(id);
+}
+
+static void rtrs_srv_free_ops_ids(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	int i;
+
+	WARN_ON(atomic_read(&sess->ids_inflight));
+	if (sess->ops_ids) {
+		for (i = 0; i < srv->queue_depth; i++)
+			free_id(sess->ops_ids[i]);
+		kfree(sess->ops_ids);
+		sess->ops_ids = NULL;
+	}
+}
+
+static int rtrs_srv_alloc_ops_ids(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_srv_op *id;
+	int i;
+
+	sess->ops_ids = kcalloc(srv->queue_depth, sizeof(*sess->ops_ids),
+				GFP_KERNEL);
+	if (unlikely(!sess->ops_ids))
+		goto err;
+
+	for (i = 0; i < srv->queue_depth; ++i) {
+		id = kzalloc(sizeof(*id), GFP_KERNEL);
+		if (unlikely(!id))
+			goto err;
+
+		sess->ops_ids[i] = id;
+		id->tx_wr = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_wr),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_wr))
+			goto err;
+
+		id->tx_sg = kcalloc(MAX_SG_COUNT, sizeof(*id->tx_sg),
+				    GFP_KERNEL);
+		if (unlikely(!id->tx_sg))
+			goto err;
+	}
+	init_waitqueue_head(&sess->ids_waitq);
+	atomic_set(&sess->ids_inflight, 0);
+
+	return 0;
+
+err:
+	rtrs_srv_free_ops_ids(sess);
+	return -ENOMEM;
+}
+
+static void rtrs_srv_get_ops_ids(struct rtrs_srv_sess *sess)
+{
+	atomic_inc(&sess->ids_inflight);
+}
+
+static void rtrs_srv_put_ops_ids(struct rtrs_srv_sess *sess)
+{
+	if (atomic_dec_and_test(&sess->ids_inflight))
+		wake_up(&sess->ids_waitq);
+}
+
+static void rtrs_srv_wait_ops_ids(struct rtrs_srv_sess *sess)
+{
+	wait_event(sess->ids_waitq, !atomic_read(&sess->ids_inflight));
+}
+
+static void rtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+
+static struct ib_cqe io_comp_cqe = {
+	.done = rtrs_srv_rdma_done
+};
+
+static void rtrs_srv_reg_mr_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_srv_con *con = cq->cq_context;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(s, "REG MR failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+		return;
+	}
+}
+
+static struct ib_cqe local_reg_cqe = {
+	.done = rtrs_srv_reg_mr_done
+};
+
+static int rdma_write_sg(struct rtrs_srv_op *id)
+{
+	struct rtrs_sess *s = id->con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	dma_addr_t dma_addr = sess->dma_addr[id->msg_id];
+	struct rtrs_srv_mr *srv_mr;
+	struct rtrs_srv *srv = sess->srv;
+	struct ib_send_wr inv_wr, imm_wr;
+	struct ib_rdma_wr *wr = NULL;
+	const struct ib_send_wr *bad_wr;
+	enum ib_send_flags flags;
+	size_t sg_cnt;
+	int err, i, offset;
+	bool need_inval;
+	u32 rkey = 0;
+	struct ib_reg_wr rwr;
+
+	sg_cnt = le16_to_cpu(id->rd_msg->sg_cnt);
+	need_inval = le16_to_cpu(id->rd_msg->flags) & RTRS_MSG_NEED_INVAL_F;
+	if (unlikely(!sg_cnt))
+		return -EINVAL;
+
+	offset = 0;
+	for (i = 0; i < sg_cnt; i++) {
+		struct ib_sge *list;
+
+		wr		= &id->tx_wr[i];
+		list		= &id->tx_sg[i];
+		list->addr	= dma_addr + offset;
+		list->length	= le32_to_cpu(id->rd_msg->desc[i].len);
+
+		/* WR will fail with length error
+		 * if this is 0
+		 */
+		if (unlikely(list->length == 0)) {
+			rtrs_err(s, "Invalid RDMA-Write sg list length 0\n");
+			return -EINVAL;
+		}
+
+		list->lkey = sess->s.dev->ib_pd->local_dma_lkey;
+		offset += list->length;
+
+		wr->wr.wr_cqe	= &io_comp_cqe;
+		wr->wr.sg_list	= list;
+		wr->wr.num_sge	= 1;
+		wr->remote_addr	= le64_to_cpu(id->rd_msg->desc[i].addr);
+		wr->rkey	= le32_to_cpu(id->rd_msg->desc[i].key);
+		if (rkey == 0)
+			rkey = wr->rkey;
+		else
+			/* Only one key is actually used */
+			WARN_ON_ONCE(rkey != wr->rkey);
+
+		if (i < (sg_cnt - 1))
+			wr->wr.next = &id->tx_wr[i + 1].wr;
+
+		wr->wr.opcode = IB_WR_RDMA_WRITE;
+		wr->wr.ex.imm_data = 0;
+		wr->wr.send_flags  = 0;
+	}
+
+	if (need_inval && always_invalidate) {
+		wr->wr.next = &rwr.wr;
+		rwr.wr.next = &inv_wr;
+		inv_wr.next = &imm_wr;
+	} else if (always_invalidate) {
+		wr->wr.next = &rwr.wr;
+		rwr.wr.next = &imm_wr;
+	} else if (need_inval) {
+		wr->wr.next = &inv_wr;
+		inv_wr.next = &imm_wr;
+	} else {
+		wr->wr.next = &imm_wr;
+	}
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+
+	if (need_inval) {
+		inv_wr.wr_cqe = &io_comp_cqe;
+		inv_wr.sg_list = NULL;
+		inv_wr.num_sge = 0;
+		inv_wr.opcode = IB_WR_SEND_WITH_INV;
+		inv_wr.send_flags = 0;
+		inv_wr.ex.invalidate_rkey = rkey;
+	}
+
+	imm_wr.next = NULL;
+	imm_wr.wr_cqe = &io_comp_cqe;
+	if (always_invalidate) {
+		struct ib_sge list;
+		struct rtrs_msg_rkey_rsp *msg;
+
+		srv_mr = &sess->mrs[id->msg_id];
+		rwr.wr.opcode = IB_WR_REG_MR;
+		rwr.wr.wr_cqe = &local_reg_cqe;
+		rwr.wr.num_sge = 0;
+		rwr.mr = srv_mr->mr;
+		rwr.wr.send_flags = 0;
+		rwr.key = srv_mr->mr->rkey;
+		rwr.access = (IB_ACCESS_LOCAL_WRITE |
+			      IB_ACCESS_REMOTE_WRITE);
+		msg = srv_mr->iu->buf;
+		msg->buf_id = cpu_to_le16(id->msg_id);
+		msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
+		msg->rkey = cpu_to_le32(srv_mr->mr->rkey);
+
+		list.addr   = srv_mr->iu->dma_addr;
+		list.length = sizeof(*msg);
+		list.lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+		imm_wr.sg_list = &list;
+		imm_wr.num_sge = 1;
+		imm_wr.opcode = IB_WR_SEND_WITH_IMM;
+		ib_dma_sync_single_for_device(sess->s.dev->ib_dev,
+					      srv_mr->iu->dma_addr,
+					      srv_mr->iu->size, DMA_TO_DEVICE);
+	} else {
+		imm_wr.sg_list = NULL;
+		imm_wr.num_sge = 0;
+		imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
+	}
+	imm_wr.send_flags = flags;
+	imm_wr.ex.imm_data = cpu_to_be32(rtrs_to_io_rsp_imm(id->msg_id,
+							     0, need_inval));
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, dma_addr,
+				      offset, DMA_BIDIRECTIONAL);
+
+	err = ib_post_send(id->con->c.qp, &id->tx_wr[0].wr, &bad_wr);
+	if (unlikely(err))
+		rtrs_err(s,
+			  "Posting RDMA-Write-Request to QP failed, err: %d\n",
+			  err);
+
+	return err;
+}
+
+/**
+ * send_io_resp_imm() - respond with empty IMM on failed READ/WRITE requests or
+ *                      on successful WRITE request.
+ * @con:	the connection to send back result
+ * @id:		the id associated with the IO
+ * @errno:	the error number of the IO.
+ *
+ * Return: 0 on success, errno otherwise.
+ */
+static int send_io_resp_imm(struct rtrs_srv_con *con, struct rtrs_srv_op *id,
+			    int errno)
+{
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct ib_send_wr inv_wr, imm_wr, *wr = NULL;
+	struct ib_reg_wr rwr;
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_srv_mr *srv_mr;
+	bool need_inval = false;
+	enum ib_send_flags flags;
+	const struct ib_send_wr *bad_wr;
+	u32 imm;
+	int err;
+
+	if (id->dir == READ) {
+		struct rtrs_msg_rdma_read *rd_msg = id->rd_msg;
+		size_t sg_cnt;
+
+		need_inval = le16_to_cpu(rd_msg->flags) &
+				RTRS_MSG_NEED_INVAL_F;
+		sg_cnt = le16_to_cpu(rd_msg->sg_cnt);
+
+		if (need_inval) {
+			if (likely(sg_cnt)) {
+				inv_wr.wr_cqe = &io_comp_cqe;
+				inv_wr.sg_list = NULL;
+				inv_wr.num_sge = 0;
+				inv_wr.opcode = IB_WR_SEND_WITH_INV;
+				inv_wr.send_flags = 0;
+				/* Only one key is actually used */
+				inv_wr.ex.invalidate_rkey =
+					le32_to_cpu(rd_msg->desc[0].key);
+			} else {
+				WARN_ON_ONCE(1);
+				need_inval = false;
+			}
+		}
+	}
+
+	if (need_inval && always_invalidate) {
+		wr = &inv_wr;
+		inv_wr.next = &rwr.wr;
+		rwr.wr.next = &imm_wr;
+	} else if (always_invalidate) {
+		wr = &rwr.wr;
+		rwr.wr.next = &imm_wr;
+	} else if (need_inval) {
+		wr = &inv_wr;
+		inv_wr.next = &imm_wr;
+	} else {
+		wr = &imm_wr;
+	}
+	/*
+	 * From time to time we have to post signalled sends,
+	 * or send queue will fill up and only QP reset can help.
+	 */
+	flags = atomic_inc_return(&con->wr_cnt) % srv->queue_depth ?
+			0 : IB_SEND_SIGNALED;
+	imm = rtrs_to_io_rsp_imm(id->msg_id, errno, need_inval);
+	imm_wr.next = NULL;
+	imm_wr.wr_cqe = &io_comp_cqe;
+	if (always_invalidate) {
+		struct ib_sge list;
+		struct rtrs_msg_rkey_rsp *msg;
+
+		srv_mr = &sess->mrs[id->msg_id];
+		rwr.wr.next = &imm_wr;
+		rwr.wr.opcode = IB_WR_REG_MR;
+		rwr.wr.wr_cqe = &local_reg_cqe;
+		rwr.wr.num_sge = 0;
+		rwr.wr.send_flags = 0;
+		rwr.mr = srv_mr->mr;
+		rwr.key = srv_mr->mr->rkey;
+		rwr.access = (IB_ACCESS_LOCAL_WRITE |
+			      IB_ACCESS_REMOTE_WRITE);
+		msg = srv_mr->iu->buf;
+		msg->buf_id = cpu_to_le16(id->msg_id);
+		msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
+		msg->rkey = cpu_to_le32(srv_mr->mr->rkey);
+
+		list.addr   = srv_mr->iu->dma_addr;
+		list.length = sizeof(*msg);
+		list.lkey   = sess->s.dev->ib_pd->local_dma_lkey;
+		imm_wr.sg_list = &list;
+		imm_wr.num_sge = 1;
+		imm_wr.opcode = IB_WR_SEND_WITH_IMM;
+		ib_dma_sync_single_for_device(sess->s.dev->ib_dev,
+					      srv_mr->iu->dma_addr,
+					      srv_mr->iu->size, DMA_TO_DEVICE);
+	} else {
+		imm_wr.sg_list = NULL;
+		imm_wr.num_sge = 0;
+		imm_wr.opcode = IB_WR_RDMA_WRITE_WITH_IMM;
+	}
+	imm_wr.send_flags = flags;
+	imm_wr.ex.imm_data = cpu_to_be32(imm);
+
+	err = ib_post_send(id->con->c.qp, wr, &bad_wr);
+	if (unlikely(err))
+		rtrs_err_rl(s, "Posting RDMA-Reply to QP failed, err: %d\n",
+			     err);
+
+	return err;
+}
+
+void close_sess(struct rtrs_srv_sess *sess)
+{
+	enum rtrs_srv_state old_state;
+
+	if (rtrs_srv_change_state_get_old(sess, RTRS_SRV_CLOSING,
+					   &old_state))
+		queue_work(rtrs_wq, &sess->close_work);
+	WARN_ON(sess->state != RTRS_SRV_CLOSING);
+}
+
+static inline const char *rtrs_srv_state_str(enum rtrs_srv_state state)
+{
+	switch (state) {
+	case RTRS_SRV_CONNECTING:
+		return "RTRS_SRV_CONNECTING";
+	case RTRS_SRV_CONNECTED:
+		return "RTRS_SRV_CONNECTED";
+	case RTRS_SRV_CLOSING:
+		return "RTRS_SRV_CLOSING";
+	case RTRS_SRV_CLOSED:
+		return "RTRS_SRV_CLOSED";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+/*
+ * rtrs_srv_resp_rdma() - sends response to the client.
+ *
+ * Context: any
+ */
+void rtrs_srv_resp_rdma(struct rtrs_srv_op *id, int status)
+{
+	struct rtrs_srv_con *con;
+	struct rtrs_sess *s;
+	struct rtrs_srv_sess *sess;
+	int err;
+
+	if (WARN_ON(!id))
+		return;
+
+	/* Dereference id only after the NULL check above */
+	con = id->con;
+	s = con->c.sess;
+	sess = to_srv_sess(s);
+
+	if (unlikely(sess->state != RTRS_SRV_CONNECTED)) {
+		rtrs_err_rl(s,
+			     "Sending I/O response failed, session is disconnected, sess state %s\n",
+			     rtrs_srv_state_str(sess->state));
+		goto out;
+	}
+	if (always_invalidate) {
+		struct rtrs_srv_mr *mr = &sess->mrs[id->msg_id];
+
+		ib_update_fast_reg_key(mr->mr, ib_inc_rkey(mr->mr->rkey));
+	}
+	if (status || id->dir == WRITE || !id->rd_msg->sg_cnt)
+		err = send_io_resp_imm(con, id, status);
+	else
+		err = rdma_write_sg(id);
+	if (unlikely(err)) {
+		rtrs_err_rl(s, "IO response failed: %d\n", err);
+		close_sess(sess);
+	}
+out:
+	rtrs_srv_put_ops_ids(sess);
+}
+EXPORT_SYMBOL(rtrs_srv_resp_rdma);
+
+void rtrs_srv_set_sess_priv(struct rtrs_srv *srv, void *priv)
+{
+	srv->priv = priv;
+}
+EXPORT_SYMBOL(rtrs_srv_set_sess_priv);
+
+static void unmap_cont_bufs(struct rtrs_srv_sess *sess)
+{
+	int i;
+
+	for (i = 0; i < sess->mrs_num; i++) {
+		struct rtrs_srv_mr *srv_mr;
+
+		srv_mr = &sess->mrs[i];
+		rtrs_iu_free(srv_mr->iu, DMA_TO_DEVICE,
+			      sess->s.dev->ib_dev, 1);
+		ib_dereg_mr(srv_mr->mr);
+		ib_dma_unmap_sg(sess->s.dev->ib_dev, srv_mr->sgt.sgl,
+				srv_mr->sgt.nents, DMA_BIDIRECTIONAL);
+		sg_free_table(&srv_mr->sgt);
+	}
+	kfree(sess->mrs);
+}
+
+static int map_cont_bufs(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_sess *ss = &sess->s;
+	int i, mri, err, mrs_num;
+	unsigned int chunk_bits;
+	int chunks_per_mr = 1;
+
+	/*
+	 * Here we map queue_depth chunks to MRs.  First we have to
+	 * figure out how many chunks we can map per MR.
+	 */
+	if (always_invalidate) {
+		/*
+		 * In order to invalidate each chunk of memory individually,
+		 * we need one memory region per chunk.
+		 */
+		mrs_num = srv->queue_depth;
+	} else {
+		chunks_per_mr =
+			sess->s.dev->ib_dev->attrs.max_fast_reg_page_list_len;
+		mrs_num = DIV_ROUND_UP(srv->queue_depth, chunks_per_mr);
+		chunks_per_mr = DIV_ROUND_UP(srv->queue_depth, mrs_num);
+	}
+
+	sess->mrs = kcalloc(mrs_num, sizeof(*sess->mrs), GFP_KERNEL);
+	if (unlikely(!sess->mrs))
+		return -ENOMEM;
+
+	sess->mrs_num = mrs_num;
+
+	for (mri = 0; mri < mrs_num; mri++) {
+		struct rtrs_srv_mr *srv_mr = &sess->mrs[mri];
+		struct sg_table *sgt = &srv_mr->sgt;
+		struct scatterlist *s;
+		struct ib_mr *mr;
+		int nr, chunks;
+		struct rtrs_msg_rkey_rsp *rsp;
+
+		chunks = chunks_per_mr * mri;
+		if (!always_invalidate)
+			chunks_per_mr = min_t(int, chunks_per_mr,
+					      srv->queue_depth - chunks);
+
+		err = sg_alloc_table(sgt, chunks_per_mr, GFP_KERNEL);
+		if (unlikely(err))
+			goto err;
+
+		for_each_sg(sgt->sgl, s, chunks_per_mr, i)
+			sg_set_page(s, srv->chunks[chunks + i],
+				    max_chunk_size, 0);
+
+		nr = ib_dma_map_sg(sess->s.dev->ib_dev, sgt->sgl,
+				   sgt->nents, DMA_BIDIRECTIONAL);
+		if (unlikely(nr < sgt->nents)) {
+			err = nr < 0 ? nr : -EINVAL;
+			goto free_sg;
+		}
+		mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
+				 sgt->nents);
+		if (IS_ERR(mr)) {
+			err = PTR_ERR(mr);
+			goto unmap_sg;
+		}
+		nr = ib_map_mr_sg(mr, sgt->sgl, sgt->nents,
+				  NULL, max_chunk_size);
+		if (unlikely(nr < sgt->nents)) {
+			err = nr < 0 ? nr : -EINVAL;
+			goto dereg_mr;
+		}
+
+		if (always_invalidate) {
+			srv_mr->iu = rtrs_iu_alloc(1, sizeof(*rsp), GFP_KERNEL,
+						    sess->s.dev->ib_dev,
+						    DMA_TO_DEVICE,
+						    rtrs_srv_rdma_done);
+			if (unlikely(!srv_mr->iu)) {
+				err = -ENOMEM;
+				rtrs_err(ss, "rtrs_iu_alloc(), err: %d\n",
+					  err);
+				goto free_iu;
+			}
+		}
+		/* Eventually dma addr for each chunk can be cached */
+		for_each_sg(sgt->sgl, s, sgt->orig_nents, i)
+			sess->dma_addr[chunks + i] = sg_dma_address(s);
+
+		ib_update_fast_reg_key(mr, ib_inc_rkey(mr->rkey));
+		srv_mr->mr = mr;
+
+		continue;
+err:
+		while (mri--) {
+			srv_mr = &sess->mrs[mri];
+			sgt = &srv_mr->sgt;
+			mr = srv_mr->mr;
+free_iu:
+			rtrs_iu_free(srv_mr->iu, DMA_TO_DEVICE,
+				      sess->s.dev->ib_dev, 1);
+dereg_mr:
+			ib_dereg_mr(mr);
+unmap_sg:
+			ib_dma_unmap_sg(sess->s.dev->ib_dev, sgt->sgl,
+					sgt->nents, DMA_BIDIRECTIONAL);
+free_sg:
+			sg_free_table(sgt);
+		}
+		kfree(sess->mrs);
+
+		return err;
+	}
+
+	chunk_bits = ilog2(srv->queue_depth - 1) + 1;
+	sess->mem_bits = (MAX_IMM_PAYL_BITS - chunk_bits);
+
+	return 0;
+}
+
+static void rtrs_srv_hb_err_handler(struct rtrs_con *c)
+{
+	close_sess(to_srv_sess(c->sess));
+}
+
+static void rtrs_srv_init_hb(struct rtrs_srv_sess *sess)
+{
+	rtrs_init_hb(&sess->s, &io_comp_cqe,
+		      RTRS_HB_INTERVAL_MS,
+		      RTRS_HB_MISSED_MAX,
+		      rtrs_srv_hb_err_handler,
+		      rtrs_wq);
+}
+
+static void rtrs_srv_start_hb(struct rtrs_srv_sess *sess)
+{
+	rtrs_start_hb(&sess->s);
+}
+
+static void rtrs_srv_stop_hb(struct rtrs_srv_sess *sess)
+{
+	rtrs_stop_hb(&sess->s);
+}
+
+static void rtrs_srv_info_rsp_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_srv_con *con = cq->cq_context;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_iu *iu;
+
+	iu = container_of(wc->wr_cqe, struct rtrs_iu, cqe);
+	rtrs_iu_free(iu, DMA_TO_DEVICE, sess->s.dev->ib_dev, 1);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(s, "Sess info response send failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+		return;
+	}
+	WARN_ON(wc->opcode != IB_WC_SEND);
+	rtrs_srv_update_wc_stats(&sess->stats);
+}
+
+static void rtrs_srv_sess_up(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_srv_ctx *ctx = srv->ctx;
+	int up;
+
+	mutex_lock(&srv->paths_ev_mutex);
+	up = ++srv->paths_up;
+	if (up == 1)
+		ctx->link_ev(srv, RTRS_SRV_LINK_EV_CONNECTED, NULL);
+	mutex_unlock(&srv->paths_ev_mutex);
+
+	/* Mark session as established */
+	sess->established = true;
+}
+
+static void rtrs_srv_sess_down(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_srv_ctx *ctx = srv->ctx;
+
+	if (!sess->established)
+		return;
+
+	sess->established = false;
+	mutex_lock(&srv->paths_ev_mutex);
+	WARN_ON(!srv->paths_up);
+	if (--srv->paths_up == 0)
+		ctx->link_ev(srv, RTRS_SRV_LINK_EV_DISCONNECTED, srv->priv);
+	mutex_unlock(&srv->paths_ev_mutex);
+}
+
+static int post_recv_sess(struct rtrs_srv_sess *sess);
+
+static int process_info_req(struct rtrs_srv_con *con,
+			    struct rtrs_msg_info_req *msg)
+{
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct ib_send_wr *reg_wr = NULL;
+	struct rtrs_msg_info_rsp *rsp;
+	struct rtrs_iu *tx_iu;
+	struct ib_reg_wr *rwr;
+	int mri, err;
+	size_t tx_sz;
+
+	err = post_recv_sess(sess);
+	if (unlikely(err)) {
+		rtrs_err(s, "post_recv_sess(), err: %d\n", err);
+		return err;
+	}
+	rwr = kcalloc(sess->mrs_num, sizeof(*rwr), GFP_KERNEL);
+	if (unlikely(!rwr)) {
+		rtrs_err(s, "No memory\n");
+		return -ENOMEM;
+	}
+	memcpy(sess->s.sessname, msg->sessname, sizeof(sess->s.sessname));
+
+	tx_sz  = sizeof(*rsp);
+	tx_sz += sizeof(rsp->desc[0]) * sess->mrs_num;
+	tx_iu = rtrs_iu_alloc(1, tx_sz, GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_TO_DEVICE, rtrs_srv_info_rsp_done);
+	if (unlikely(!tx_iu)) {
+		rtrs_err(s, "rtrs_iu_alloc(), err: %d\n", -ENOMEM);
+		err = -ENOMEM;
+		goto rwr_free;
+	}
+
+	rsp = tx_iu->buf;
+	rsp->type = cpu_to_le16(RTRS_MSG_INFO_RSP);
+	rsp->sg_cnt = cpu_to_le16(sess->mrs_num);
+
+	for (mri = 0; mri < sess->mrs_num; mri++) {
+		struct ib_mr *mr = sess->mrs[mri].mr;
+
+		rsp->desc[mri].addr = cpu_to_le64(mr->iova);
+		rsp->desc[mri].key  = cpu_to_le32(mr->rkey);
+		rsp->desc[mri].len  = cpu_to_le32(mr->length);
+
+		/*
+		 * Fill in reg MR request and chain them *backwards*
+		 */
+		rwr[mri].wr.next = mri ? &rwr[mri - 1].wr : NULL;
+		rwr[mri].wr.opcode = IB_WR_REG_MR;
+		rwr[mri].wr.wr_cqe = &local_reg_cqe;
+		rwr[mri].wr.num_sge = 0;
+		rwr[mri].wr.send_flags = mri ? 0 : IB_SEND_SIGNALED;
+		rwr[mri].mr = mr;
+		rwr[mri].key = mr->rkey;
+		rwr[mri].access = (IB_ACCESS_LOCAL_WRITE |
+				   IB_ACCESS_REMOTE_WRITE);
+		reg_wr = &rwr[mri].wr;
+	}
+
+	err = rtrs_srv_create_sess_files(sess);
+	if (unlikely(err))
+		goto iu_free;
+	get_device(&sess->srv->dev);
+	rtrs_srv_change_state(sess, RTRS_SRV_CONNECTED);
+	rtrs_srv_start_hb(sess);
+
+	/*
+	 * We do not currently track the number of established connections;
+	 * we rely on the client to send the info request only after all
+	 * connections are successfully established.  Thus, simply notify
+	 * the listener with a proper event if we are the first path.
+	 */
+	rtrs_srv_sess_up(sess);
+
+	ib_dma_sync_single_for_device(sess->s.dev->ib_dev, tx_iu->dma_addr,
+				      tx_iu->size, DMA_TO_DEVICE);
+
+	/* Send info response */
+	err = rtrs_iu_post_send(&con->c, tx_iu, tx_sz, reg_wr);
+	if (unlikely(err)) {
+		rtrs_err(s, "rtrs_iu_post_send(), err: %d\n", err);
+iu_free:
+		rtrs_iu_free(tx_iu, DMA_TO_DEVICE, sess->s.dev->ib_dev, 1);
+	}
+rwr_free:
+	kfree(rwr);
+
+	return err;
+}
+
+static void rtrs_srv_info_req_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_srv_con *con = cq->cq_context;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_msg_info_req *msg;
+	struct rtrs_iu *iu;
+	int err;
+
+	WARN_ON(con->c.cid);
+
+	iu = container_of(wc->wr_cqe, struct rtrs_iu, cqe);
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(s, "Sess info request receive failed: %s\n",
+			  ib_wc_status_msg(wc->status));
+		goto close;
+	}
+	WARN_ON(wc->opcode != IB_WC_RECV);
+
+	if (unlikely(wc->byte_len < sizeof(*msg))) {
+		rtrs_err(s, "Sess info request is malformed: size %d\n",
+			  wc->byte_len);
+		goto close;
+	}
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, iu->dma_addr,
+				   iu->size, DMA_FROM_DEVICE);
+	msg = iu->buf;
+	if (unlikely(le16_to_cpu(msg->type) != RTRS_MSG_INFO_REQ)) {
+		rtrs_err(s, "Sess info request is malformed: type %d\n",
+			  le16_to_cpu(msg->type));
+		goto close;
+	}
+	err = process_info_req(con, msg);
+	if (unlikely(err))
+		goto close;
+
+out:
+	rtrs_iu_free(iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev, 1);
+	return;
+close:
+	close_sess(sess);
+	goto out;
+}
+
+static int post_recv_info_req(struct rtrs_srv_con *con)
+{
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_iu *rx_iu;
+	int err;
+
+	rx_iu = rtrs_iu_alloc(1, sizeof(struct rtrs_msg_info_req),
+			       GFP_KERNEL, sess->s.dev->ib_dev,
+			       DMA_FROM_DEVICE, rtrs_srv_info_req_done);
+	if (unlikely(!rx_iu)) {
+		rtrs_err(s, "rtrs_iu_alloc(): no memory\n");
+		return -ENOMEM;
+	}
+	/* Prepare for receiving the info request */
+	err = rtrs_iu_post_recv(&con->c, rx_iu);
+	if (unlikely(err)) {
+		rtrs_err(s, "rtrs_iu_post_recv(), err: %d\n", err);
+		rtrs_iu_free(rx_iu, DMA_FROM_DEVICE, sess->s.dev->ib_dev, 1);
+		return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_io(struct rtrs_srv_con *con, size_t q_size)
+{
+	int i, err;
+
+	for (i = 0; i < q_size; i++) {
+		err = rtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+static int post_recv_sess(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_sess *s = &sess->s;
+	size_t q_size;
+	int err, cid;
+
+	for (cid = 0; cid < sess->s.con_num; cid++) {
+		if (cid == 0)
+			q_size = SERVICE_CON_QUEUE_DEPTH;
+		else
+			q_size = srv->queue_depth;
+
+		err = post_recv_io(to_srv_con(sess->s.con[cid]), q_size);
+		if (unlikely(err)) {
+			rtrs_err(s, "post_recv_io(), err: %d\n", err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static void process_read(struct rtrs_srv_con *con,
+			 struct rtrs_msg_rdma_read *msg,
+			 u32 buf_id, u32 off)
+{
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_srv_ctx *ctx = srv->ctx;
+	struct rtrs_srv_op *id;
+
+	size_t usr_len, data_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != RTRS_SRV_CONNECTED)) {
+		rtrs_err_rl(s,
+			     "Processing read request failed, session is disconnected, sess state %s\n",
+			     rtrs_srv_state_str(sess->state));
+		return;
+	}
+	rtrs_srv_get_ops_ids(sess);
+	rtrs_srv_update_rdma_stats(&sess->stats, off, READ);
+	id = sess->ops_ids[buf_id];
+	id->con		= con;
+	id->dir		= READ;
+	id->msg_id	= buf_id;
+	id->rd_msg	= msg;
+	usr_len = le16_to_cpu(msg->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	ret = ctx->rdma_ev(srv, srv->priv, id, READ, data, data_len,
+			   data + data_len, usr_len);
+
+	if (unlikely(ret)) {
+		rtrs_err_rl(s,
+			     "Processing read request failed, user module cb reported for msg_id %d, err: %d\n",
+			     buf_id, ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, id, ret);
+	if (ret < 0) {
+		rtrs_err_rl(s,
+			     "Sending err msg for failed read request failed, msg_id %d, err: %d\n",
+			     buf_id, ret);
+		close_sess(sess);
+	}
+	rtrs_srv_put_ops_ids(sess);
+}
+
+static void process_write(struct rtrs_srv_con *con,
+			  struct rtrs_msg_rdma_write *req,
+			  u32 buf_id, u32 off)
+{
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_srv_ctx *ctx = srv->ctx;
+	struct rtrs_srv_op *id;
+
+	size_t data_len, usr_len;
+	void *data;
+	int ret;
+
+	if (unlikely(sess->state != RTRS_SRV_CONNECTED)) {
+		rtrs_err_rl(s,
+			     "Processing write request failed, session is disconnected, sess state %s\n",
+			     rtrs_srv_state_str(sess->state));
+		return;
+	}
+	rtrs_srv_get_ops_ids(sess);
+	rtrs_srv_update_rdma_stats(&sess->stats, off, WRITE);
+	id = sess->ops_ids[buf_id];
+	id->con    = con;
+	id->dir    = WRITE;
+	id->msg_id = buf_id;
+
+	usr_len = le16_to_cpu(req->usr_len);
+	data_len = off - usr_len;
+	data = page_address(srv->chunks[buf_id]);
+	ret = ctx->rdma_ev(srv, srv->priv, id, WRITE, data, data_len,
+			   data + data_len, usr_len);
+	if (unlikely(ret)) {
+		rtrs_err_rl(s,
+			     "Processing write request failed, user module callback reports err: %d\n",
+			     ret);
+		goto send_err_msg;
+	}
+
+	return;
+
+send_err_msg:
+	ret = send_io_resp_imm(con, id, ret);
+	if (ret < 0) {
+		rtrs_err_rl(s,
+			     "Processing write request failed, sending I/O response failed, msg_id %d, err: %d\n",
+			     buf_id, ret);
+		close_sess(sess);
+	}
+	rtrs_srv_put_ops_ids(sess);
+}
+
+static void process_io_req(struct rtrs_srv_con *con, void *msg,
+			   u32 id, u32 off)
+{
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_msg_rdma_hdr *hdr;
+	unsigned int type;
+
+	ib_dma_sync_single_for_cpu(sess->s.dev->ib_dev, sess->dma_addr[id],
+				   max_chunk_size, DMA_BIDIRECTIONAL);
+	hdr = msg;
+	type = le16_to_cpu(hdr->type);
+
+	switch (type) {
+	case RTRS_MSG_WRITE:
+		process_write(con, msg, id, off);
+		break;
+	case RTRS_MSG_READ:
+		process_read(con, msg, id, off);
+		break;
+	default:
+		rtrs_err(s,
+			  "Processing I/O request failed, unknown message type received: 0x%02x\n",
+			  type);
+		goto err;
+	}
+
+	return;
+
+err:
+	close_sess(sess);
+}
+
+static void rtrs_srv_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_srv_mr *mr =
+		container_of(wc->wr_cqe, typeof(*mr), inv_cqe);
+	struct rtrs_srv_con *con = cq->cq_context;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_srv *srv = sess->srv;
+	u32 msg_id, off;
+	void *data;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		rtrs_err(s, "Failed IB_WR_LOCAL_INV: %s\n",
+			  ib_wc_status_msg(wc->status));
+		close_sess(sess);
+	}
+	msg_id = mr->msg_id;
+	off = mr->msg_off;
+	data = page_address(srv->chunks[msg_id]) + off;
+	process_io_req(con, data, msg_id, off);
+}
+
+static int rtrs_srv_inv_rkey(struct rtrs_srv_con *con,
+			      struct rtrs_srv_mr *mr)
+{
+	const struct ib_send_wr *bad_wr;
+	struct ib_send_wr wr = {
+		.opcode		    = IB_WR_LOCAL_INV,
+		.wr_cqe		    = &mr->inv_cqe,
+		.next		    = NULL,
+		.num_sge	    = 0,
+		.send_flags	    = IB_SEND_SIGNALED,
+		.ex.invalidate_rkey = mr->mr->rkey,
+	};
+	mr->inv_cqe.done = rtrs_srv_inv_rkey_done;
+
+	return ib_post_send(con->c.qp, &wr, &bad_wr);
+}
+
+static void rtrs_srv_rdma_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+	struct rtrs_srv_con *con = cq->cq_context;
+	struct rtrs_sess *s = con->c.sess;
+	struct rtrs_srv_sess *sess = to_srv_sess(s);
+	struct rtrs_srv *srv = sess->srv;
+	u32 imm_type, imm_payload;
+	int err;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		if (wc->status != IB_WC_WR_FLUSH_ERR) {
+			rtrs_err(s,
+				  "%s (wr_cqe: %p, type: %d, vendor_err: 0x%x, len: %u)\n",
+				  ib_wc_status_msg(wc->status), wc->wr_cqe,
+				  wc->opcode, wc->vendor_err, wc->byte_len);
+			close_sess(sess);
+		}
+		return;
+	}
+	rtrs_srv_update_wc_stats(&sess->stats);
+
+	switch (wc->opcode) {
+	case IB_WC_RECV_RDMA_WITH_IMM:
+		/*
+		 * post_recv() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		if (WARN_ON(wc->wr_cqe != &io_comp_cqe))
+			return;
+		err = rtrs_post_recv_empty(&con->c, &io_comp_cqe);
+		if (unlikely(err)) {
+			rtrs_err(s, "rtrs_post_recv(), err: %d\n", err);
+			close_sess(sess);
+			break;
+		}
+		rtrs_from_imm(be32_to_cpu(wc->ex.imm_data),
+			       &imm_type, &imm_payload);
+		if (likely(imm_type == RTRS_IO_REQ_IMM)) {
+			u32 msg_id, off;
+			void *data;
+
+			msg_id = imm_payload >> sess->mem_bits;
+			off = imm_payload & ((1 << sess->mem_bits) - 1);
+			if (unlikely(msg_id >= srv->queue_depth ||
+				     off >= max_chunk_size)) {
+				rtrs_err(s, "Wrong msg_id %u, off %u\n",
+					  msg_id, off);
+				close_sess(sess);
+				return;
+			}
+			if (always_invalidate) {
+				struct rtrs_srv_mr *mr = &sess->mrs[msg_id];
+
+				mr->msg_off = off;
+				mr->msg_id = msg_id;
+				err = rtrs_srv_inv_rkey(con, mr);
+				if (unlikely(err)) {
+					rtrs_err(s, "rtrs_srv_inv_rkey(), err: %d\n",
+						  err);
+					close_sess(sess);
+					break;
+				}
+			} else {
+				data = page_address(srv->chunks[msg_id]) + off;
+				process_io_req(con, data, msg_id, off);
+			}
+		} else if (imm_type == RTRS_HB_MSG_IMM) {
+			WARN_ON(con->c.cid);
+			rtrs_send_hb_ack(&sess->s);
+		} else if (imm_type == RTRS_HB_ACK_IMM) {
+			WARN_ON(con->c.cid);
+			sess->s.hb_missed_cnt = 0;
+		} else {
+			rtrs_wrn(s, "Unknown IMM type %u\n", imm_type);
+		}
+		break;
+	case IB_WC_RDMA_WRITE:
+	case IB_WC_SEND:
+		/*
+		 * post_send() RDMA write completions of IO reqs (read/write)
+		 * and hb
+		 */
+		break;
+	default:
+		rtrs_wrn(s, "Unexpected WC type: %d\n", wc->opcode);
+		return;
+	}
+}
+
+int rtrs_srv_get_sess_name(struct rtrs_srv *srv, char *sessname, size_t len)
+{
+	struct rtrs_srv_sess *sess;
+	int err = -ENOTCONN;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (sess->state != RTRS_SRV_CONNECTED)
+			continue;
+		memcpy(sessname, sess->s.sessname,
+		       min_t(size_t, sizeof(sess->s.sessname), len));
+		err = 0;
+		break;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL(rtrs_srv_get_sess_name);
+
+int rtrs_srv_get_queue_depth(struct rtrs_srv *srv)
+{
+	return srv->queue_depth;
+}
+EXPORT_SYMBOL(rtrs_srv_get_queue_depth);
+
+static int find_next_bit_ring(struct rtrs_srv_sess *sess)
+{
+	struct ib_device *ib_dev = sess->s.dev->ib_dev;
+	int v;
+
+	v = cpumask_next(sess->cur_cq_vector, &cq_affinity_mask);
+	if (v >= nr_cpu_ids || v >= ib_dev->num_comp_vectors)
+		v = cpumask_first(&cq_affinity_mask);
+	return v;
+}
+
+static int rtrs_srv_get_next_cq_vector(struct rtrs_srv_sess *sess)
+{
+	sess->cur_cq_vector = find_next_bit_ring(sess);
+
+	return sess->cur_cq_vector;
+}
+
+static struct rtrs_srv *__alloc_srv(struct rtrs_srv_ctx *ctx,
+				     const uuid_t *paths_uuid)
+{
+	struct rtrs_srv *srv;
+	int i;
+
+	srv = kzalloc(sizeof(*srv), GFP_KERNEL);
+	if (unlikely(!srv))
+		return NULL;
+
+	refcount_set(&srv->refcount, 1);
+	INIT_LIST_HEAD(&srv->paths_list);
+	mutex_init(&srv->paths_mutex);
+	mutex_init(&srv->paths_ev_mutex);
+	uuid_copy(&srv->paths_uuid, paths_uuid);
+	srv->queue_depth = sess_queue_depth;
+	srv->ctx = ctx;
+
+	srv->chunks = kcalloc(srv->queue_depth, sizeof(*srv->chunks),
+			      GFP_KERNEL);
+	if (unlikely(!srv->chunks))
+		goto err_free_srv;
+
+	for (i = 0; i < srv->queue_depth; i++) {
+		srv->chunks[i] = mempool_alloc(chunk_pool, GFP_KERNEL);
+		if (unlikely(!srv->chunks[i])) {
+			pr_err("mempool_alloc() failed\n");
+			goto err_free_chunks;
+		}
+	}
+	list_add(&srv->ctx_list, &ctx->srv_list);
+
+	return srv;
+
+err_free_chunks:
+	while (i--)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+
+err_free_srv:
+	kfree(srv);
+
+	return NULL;
+}
+
+static void free_srv(struct rtrs_srv *srv)
+{
+	int i;
+
+	WARN_ON(refcount_read(&srv->refcount));
+	for (i = 0; i < srv->queue_depth; i++)
+		mempool_free(srv->chunks[i], chunk_pool);
+	kfree(srv->chunks);
+	/* last put to release the srv structure */
+	put_device(&srv->dev);
+}
+
+static inline struct rtrs_srv *__find_srv_and_get(struct rtrs_srv_ctx *ctx,
+						   const uuid_t *paths_uuid)
+{
+	struct rtrs_srv *srv;
+
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list) {
+		if (uuid_equal(&srv->paths_uuid, paths_uuid) &&
+		    refcount_inc_not_zero(&srv->refcount))
+			return srv;
+	}
+
+	return NULL;
+}
+
+static struct rtrs_srv *get_or_create_srv(struct rtrs_srv_ctx *ctx,
+					   const uuid_t *paths_uuid)
+{
+	struct rtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	srv = __find_srv_and_get(ctx, paths_uuid);
+	if (!srv)
+		srv = __alloc_srv(ctx, paths_uuid);
+	mutex_unlock(&ctx->srv_mutex);
+
+	return srv;
+}
+
+static void put_srv(struct rtrs_srv *srv)
+{
+	if (refcount_dec_and_test(&srv->refcount)) {
+		struct rtrs_srv_ctx *ctx = srv->ctx;
+
+		WARN_ON(srv->dev.kobj.state_in_sysfs);
+		WARN_ON(srv->kobj_paths.state_in_sysfs);
+
+		mutex_lock(&ctx->srv_mutex);
+		list_del(&srv->ctx_list);
+		mutex_unlock(&ctx->srv_mutex);
+		free_srv(srv);
+	}
+}
+
+static void __add_path_to_srv(struct rtrs_srv *srv,
+			      struct rtrs_srv_sess *sess)
+{
+	list_add_tail(&sess->s.entry, &srv->paths_list);
+	srv->paths_num++;
+	WARN_ON(srv->paths_num >= MAX_PATHS_NUM);
+}
+
+static void del_path_from_srv(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+
+	if (WARN_ON(!srv))
+		return;
+
+	mutex_lock(&srv->paths_mutex);
+	list_del(&sess->s.entry);
+	WARN_ON(!srv->paths_num);
+	srv->paths_num--;
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static inline int sockaddr_cmp(const struct sockaddr *a,
+			       const struct sockaddr *b)
+{
+	switch (a->sa_family) {
+	case AF_IB:
+		return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
+			      &((struct sockaddr_ib *)b)->sib_addr,
+			      sizeof(struct ib_addr));
+	case AF_INET:
+		return memcmp(&((struct sockaddr_in *)a)->sin_addr,
+			      &((struct sockaddr_in *)b)->sin_addr,
+			      sizeof(struct in_addr));
+	case AF_INET6:
+		return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
+			      &((struct sockaddr_in6 *)b)->sin6_addr,
+			      sizeof(struct in6_addr));
+	default:
+		return -ENOENT;
+	}
+}
+
+static inline bool __is_path_w_addr_exists(struct rtrs_srv *srv,
+					   struct rdma_addr *addr)
+{
+	struct rtrs_srv_sess *sess;
+
+	list_for_each_entry(sess, &srv->paths_list, s.entry)
+		if (!sockaddr_cmp((struct sockaddr *)&sess->s.dst_addr,
+				  (struct sockaddr *)&addr->dst_addr) &&
+		    !sockaddr_cmp((struct sockaddr *)&sess->s.src_addr,
+				  (struct sockaddr *)&addr->src_addr))
+			return true;
+
+	return false;
+}
+
+static void rtrs_srv_close_work(struct work_struct *work)
+{
+	struct rtrs_srv_sess *sess;
+	struct rtrs_srv_con *con;
+	int i;
+
+	sess = container_of(work, typeof(*sess), close_work);
+
+	rtrs_srv_destroy_sess_files(sess);
+	rtrs_srv_stop_hb(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		if (!sess->s.con[i])
+			continue;
+		con = to_srv_con(sess->s.con[i]);
+		rdma_disconnect(con->c.cm_id);
+		ib_drain_qp(con->c.qp);
+	}
+	/* Wait for all inflights */
+	rtrs_srv_wait_ops_ids(sess);
+
+	/* Notify upper layer if we are the last path */
+	rtrs_srv_sess_down(sess);
+
+	unmap_cont_bufs(sess);
+	rtrs_srv_free_ops_ids(sess);
+
+	for (i = 0; i < sess->s.con_num; i++) {
+		if (!sess->s.con[i])
+			continue;
+		con = to_srv_con(sess->s.con[i]);
+		rtrs_cq_qp_destroy(&con->c);
+		rdma_destroy_id(con->c.cm_id);
+		kfree(con);
+	}
+	rtrs_ib_dev_put(sess->s.dev);
+
+	del_path_from_srv(sess);
+	put_srv(sess->srv);
+	sess->srv = NULL;
+	rtrs_srv_change_state(sess, RTRS_SRV_CLOSED);
+
+	kfree(sess->dma_addr);
+	kfree(sess->s.con);
+	kfree(sess);
+}
+
+static int rtrs_rdma_do_accept(struct rtrs_srv_sess *sess,
+				struct rdma_cm_id *cm_id)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_msg_conn_rsp msg;
+	struct rdma_conn_param param;
+	int err;
+
+	memset(&param, 0, sizeof(param));
+	param.rnr_retry_count = 7;
+	param.private_data = &msg;
+	param.private_data_len = sizeof(msg);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(RTRS_MAGIC);
+	msg.version = cpu_to_le16(RTRS_PROTO_VER);
+	msg.errno = 0;
+	msg.queue_depth = cpu_to_le16(srv->queue_depth);
+	msg.max_io_size = cpu_to_le32(max_chunk_size - MAX_HDR_SIZE);
+	msg.max_hdr_size = cpu_to_le32(MAX_HDR_SIZE);
+
+	if (always_invalidate)
+		msg.flags = cpu_to_le32(RTRS_MSG_NEW_RKEY_F);
+
+	err = rdma_accept(cm_id, &param);
+	if (err)
+		pr_err("rdma_accept(), err: %d\n", err);
+
+	return err;
+}
+
+static int rtrs_rdma_do_reject(struct rdma_cm_id *cm_id, int errno)
+{
+	struct rtrs_msg_conn_rsp msg;
+	int err;
+
+	memset(&msg, 0, sizeof(msg));
+	msg.magic = cpu_to_le16(RTRS_MAGIC);
+	msg.version = cpu_to_le16(RTRS_PROTO_VER);
+	msg.errno = cpu_to_le16(errno);
+
+	err = rdma_reject(cm_id, &msg, sizeof(msg));
+	if (err)
+		pr_err("rdma_reject(), err: %d\n", err);
+
+	/* Bounce errno back */
+	return errno;
+}
+
+static struct rtrs_srv_sess *
+__find_sess(struct rtrs_srv *srv, const uuid_t *sess_uuid)
+{
+	struct rtrs_srv_sess *sess;
+
+	list_for_each_entry(sess, &srv->paths_list, s.entry) {
+		if (uuid_equal(&sess->s.uuid, sess_uuid))
+			return sess;
+	}
+
+	return NULL;
+}
+
+static int create_con(struct rtrs_srv_sess *sess,
+		      struct rdma_cm_id *cm_id,
+		      unsigned int cid)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_sess *s = &sess->s;
+	struct rtrs_srv_con *con;
+
+	u16 cq_size, wr_queue_size;
+	int err, cq_vector;
+
+	con = kzalloc(sizeof(*con), GFP_KERNEL);
+	if (unlikely(!con)) {
+		rtrs_err(s, "kzalloc() failed\n");
+		err = -ENOMEM;
+		goto err;
+	}
+
+	con->c.cm_id = cm_id;
+	con->c.sess = &sess->s;
+	con->c.cid = cid;
+	atomic_set(&con->wr_cnt, 0);
+
+	if (con->c.cid == 0) {
+		/*
+		 * All receive and all send (each requiring invalidate)
+		 * + 2 for drain and heartbeat
+		 */
+		wr_queue_size = SERVICE_CON_QUEUE_DEPTH * 3 + 2;
+		cq_size = wr_queue_size;
+	} else {
+		/*
+		 * Worst case: all receive requests are posted, all write
+		 * requests are posted, and each read request requires an
+		 * invalidate request; + 1 for the drain when the QP gets
+		 * into the error state.
+		 */
+		cq_size = srv->queue_depth * 3 + 1;
+		/*
+		 * In theory we might have queue_depth * 32
+		 * outstanding requests if an unsafe global key is used
+		 * and we have queue_depth read requests each consisting
+		 * of 32 different addresses. div 3 for mlx5.
+		 */
+		wr_queue_size = sess->s.dev->ib_dev->attrs.max_qp_wr / 3;
+	}
+
+	cq_vector = rtrs_srv_get_next_cq_vector(sess);
+
+	/* TODO: SOFTIRQ can be faster, but be careful with softirq context */
+	err = rtrs_cq_qp_create(&sess->s, &con->c, 1, cq_vector, cq_size,
+				 wr_queue_size, IB_POLL_WORKQUEUE);
+	if (unlikely(err)) {
+		rtrs_err(s, "rtrs_cq_qp_create(), err: %d\n", err);
+		goto free_con;
+	}
+	if (con->c.cid == 0) {
+		err = post_recv_info_req(con);
+		if (unlikely(err))
+			goto free_cqqp;
+	}
+	WARN_ON(sess->s.con[cid]);
+	sess->s.con[cid] = &con->c;
+
+	/*
+	 * Change context from server to current connection.  The other
+	 * way is to use cm_id->qp->qp_context, which does not work on OFED.
+	 */
+	cm_id->context = &con->c;
+
+	return 0;
+
+free_cqqp:
+	rtrs_cq_qp_destroy(&con->c);
+free_con:
+	kfree(con);
+
+err:
+	return err;
+}
+
+static struct rtrs_srv_sess *__alloc_sess(struct rtrs_srv *srv,
+					   struct rdma_cm_id *cm_id,
+					   unsigned int con_num,
+					   unsigned int recon_cnt,
+					   const uuid_t *uuid)
+{
+	struct rtrs_srv_sess *sess;
+	int err = -ENOMEM;
+
+	if (unlikely(srv->paths_num >= MAX_PATHS_NUM)) {
+		err = -ECONNRESET;
+		goto err;
+	}
+	if (unlikely(__is_path_w_addr_exists(srv, &cm_id->route.addr))) {
+		err = -EEXIST;
+		goto err;
+	}
+	sess = kzalloc(sizeof(*sess), GFP_KERNEL);
+	if (unlikely(!sess))
+		goto err;
+
+	sess->dma_addr = kcalloc(srv->queue_depth, sizeof(*sess->dma_addr),
+				 GFP_KERNEL);
+	if (unlikely(!sess->dma_addr))
+		goto err_free_sess;
+
+	sess->s.con = kcalloc(con_num, sizeof(*sess->s.con), GFP_KERNEL);
+	if (unlikely(!sess->s.con))
+		goto err_free_dma_addr;
+
+	sess->state = RTRS_SRV_CONNECTING;
+	sess->srv = srv;
+	sess->cur_cq_vector = -1;
+	sess->s.dst_addr = cm_id->route.addr.dst_addr;
+	sess->s.src_addr = cm_id->route.addr.src_addr;
+	sess->s.con_num = con_num;
+	sess->s.recon_cnt = recon_cnt;
+	uuid_copy(&sess->s.uuid, uuid);
+	spin_lock_init(&sess->state_lock);
+	INIT_WORK(&sess->close_work, rtrs_srv_close_work);
+	rtrs_srv_init_hb(sess);
+
+	sess->s.dev = rtrs_ib_dev_find_or_add(cm_id->device, &dev_pool);
+	if (unlikely(!sess->s.dev)) {
+		err = -ENOMEM;
+		goto err_free_con;
+	}
+	err = map_cont_bufs(sess);
+	if (unlikely(err))
+		goto err_put_dev;
+
+	err = rtrs_srv_alloc_ops_ids(sess);
+	if (unlikely(err))
+		goto err_unmap_bufs;
+
+	__add_path_to_srv(srv, sess);
+
+	return sess;
+
+err_unmap_bufs:
+	unmap_cont_bufs(sess);
+err_put_dev:
+	rtrs_ib_dev_put(sess->s.dev);
+err_free_con:
+	kfree(sess->s.con);
+err_free_dma_addr:
+	kfree(sess->dma_addr);
+err_free_sess:
+	kfree(sess);
+
+err:
+	return ERR_PTR(err);
+}
+
+static int rtrs_rdma_connect(struct rdma_cm_id *cm_id,
+			      const struct rtrs_msg_conn_req *msg,
+			      size_t len)
+{
+	struct rtrs_srv_ctx *ctx = cm_id->context;
+	struct rtrs_srv_sess *sess;
+	struct rtrs_srv *srv;
+
+	u16 version, con_num, cid;
+	u16 recon_cnt;
+	int err;
+
+	if (unlikely(len < sizeof(*msg))) {
+		pr_err("Invalid RTRS connection request\n");
+		goto reject_w_econnreset;
+	}
+	if (unlikely(le16_to_cpu(msg->magic) != RTRS_MAGIC)) {
+		pr_err("Invalid RTRS magic\n");
+		goto reject_w_econnreset;
+	}
+	version = le16_to_cpu(msg->version);
+	if (unlikely(version >> 8 != RTRS_PROTO_VER_MAJOR)) {
+		pr_err("Unsupported major RTRS version: %d, expected %d\n",
+		       version >> 8, RTRS_PROTO_VER_MAJOR);
+		goto reject_w_econnreset;
+	}
+	con_num = le16_to_cpu(msg->cid_num);
+	if (unlikely(con_num > 4096)) {
+		/* Sanity check */
+		pr_err("Too many connections requested: %d\n", con_num);
+		goto reject_w_econnreset;
+	}
+	cid = le16_to_cpu(msg->cid);
+	if (unlikely(cid >= con_num)) {
+		/* Sanity check */
+		pr_err("Incorrect cid: %d >= %d\n", cid, con_num);
+		goto reject_w_econnreset;
+	}
+	recon_cnt = le16_to_cpu(msg->recon_cnt);
+	srv = get_or_create_srv(ctx, &msg->paths_uuid);
+	if (unlikely(!srv)) {
+		err = -ENOMEM;
+		goto reject_w_err;
+	}
+	mutex_lock(&srv->paths_mutex);
+	sess = __find_sess(srv, &msg->sess_uuid);
+	if (sess) {
+		struct rtrs_sess *s = &sess->s;
+
+		/* Session already holds a reference */
+		put_srv(srv);
+
+		if (unlikely(sess->state != RTRS_SRV_CONNECTING)) {
+			rtrs_err(s, "Session in wrong state: %s\n",
+				  rtrs_srv_state_str(sess->state));
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		/*
+		 * Sanity checks
+		 */
+		if (unlikely(con_num != sess->s.con_num ||
+			     cid >= sess->s.con_num)) {
+			rtrs_err(s, "Incorrect request: %d, %d\n",
+				  cid, con_num);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+		if (unlikely(sess->s.con[cid])) {
+			rtrs_err(s, "Connection already exists: %d\n",
+				  cid);
+			mutex_unlock(&srv->paths_mutex);
+			goto reject_w_econnreset;
+		}
+	} else {
+		sess = __alloc_sess(srv, cm_id, con_num, recon_cnt,
+				    &msg->sess_uuid);
+		if (IS_ERR(sess)) {
+			mutex_unlock(&srv->paths_mutex);
+			put_srv(srv);
+			err = PTR_ERR(sess);
+			goto reject_w_err;
+		}
+	}
+	err = create_con(sess, cm_id, cid);
+	if (unlikely(err)) {
+		(void)rtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since session has other connections we follow normal way
+		 * through workqueue, but still return an error to tell cma.c
+		 * to call rdma_destroy_id() for current connection.
+		 */
+		goto close_and_return_err;
+	}
+	err = rtrs_rdma_do_accept(sess, cm_id);
+	if (unlikely(err)) {
+		(void)rtrs_rdma_do_reject(cm_id, err);
+		/*
+		 * Since current connection was successfully added to the
+		 * session we follow normal way through workqueue to close the
+		 * session, thus return 0 to tell cma.c we call
+		 * rdma_destroy_id() ourselves.
+		 */
+		err = 0;
+		goto close_and_return_err;
+	}
+	mutex_unlock(&srv->paths_mutex);
+
+	return 0;
+
+reject_w_err:
+	return rtrs_rdma_do_reject(cm_id, err);
+
+reject_w_econnreset:
+	return rtrs_rdma_do_reject(cm_id, -ECONNRESET);
+
+close_and_return_err:
+	close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+
+static int rtrs_srv_rdma_cm_handler(struct rdma_cm_id *cm_id,
+				     struct rdma_cm_event *ev)
+{
+	struct rtrs_srv_sess *sess = NULL;
+	struct rtrs_sess *s = NULL;
+
+	if (ev->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+		struct rtrs_con *c = cm_id->context;
+
+		s = c->sess;
+		sess = to_srv_sess(s);
+	}
+
+	switch (ev->event) {
+	case RDMA_CM_EVENT_CONNECT_REQUEST:
+		/*
+		 * In case of error cma.c will destroy cm_id,
+		 * see cma_process_remove()
+		 */
+		return rtrs_rdma_connect(cm_id, ev->param.conn.private_data,
+					  ev->param.conn.private_data_len);
+	case RDMA_CM_EVENT_ESTABLISHED:
+		/* Nothing here */
+		break;
+	case RDMA_CM_EVENT_REJECTED:
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+		rtrs_err(s, "CM error (CM event: %s, err: %d)\n",
+			  rdma_event_msg(ev->event), ev->status);
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DISCONNECTED:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		close_sess(sess);
+		break;
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+		close_sess(sess);
+		break;
+	default:
+		pr_err("Ignoring unexpected CM event %s, err %d\n",
+		       rdma_event_msg(ev->event), ev->status);
+		break;
+	}
+
+	return 0;
+}
+
+static struct rdma_cm_id *rtrs_srv_cm_init(struct rtrs_srv_ctx *ctx,
+					    struct sockaddr *addr,
+					    enum rdma_ucm_port_space ps)
+{
+	struct rdma_cm_id *cm_id;
+	int ret;
+
+	cm_id = rdma_create_id(&init_net, rtrs_srv_rdma_cm_handler,
+			       ctx, ps, IB_QPT_RC);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		pr_err("Creating id for RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_out;
+	}
+	ret = rdma_bind_addr(cm_id, addr);
+	if (ret) {
+		pr_err("Binding RDMA address failed, err: %d\n", ret);
+		goto err_cm;
+	}
+	ret = rdma_listen(cm_id, 64);
+	if (ret) {
+		pr_err("Listening on RDMA connection failed, err: %d\n",
+		       ret);
+		goto err_cm;
+	}
+
+	return cm_id;
+
+err_cm:
+	rdma_destroy_id(cm_id);
+err_out:
+
+	return ERR_PTR(ret);
+}
+
+static int rtrs_srv_rdma_init(struct rtrs_srv_ctx *ctx, unsigned int port)
+{
+	struct sockaddr_in6 sin = {
+		.sin6_family	= AF_INET6,
+		.sin6_addr	= IN6ADDR_ANY_INIT,
+		.sin6_port	= htons(port),
+	};
+	struct sockaddr_ib sib = {
+		.sib_family			= AF_IB,
+		.sib_addr.sib_subnet_prefix	= 0ULL,
+		.sib_addr.sib_interface_id	= 0ULL,
+		.sib_sid	= cpu_to_be64(RDMA_IB_IP_PS_IB | port),
+		.sib_sid_mask	= cpu_to_be64(0xffffffffffffffffULL),
+		.sib_pkey	= cpu_to_be16(0xffff),
+	};
+	struct rdma_cm_id *cm_ip, *cm_ib;
+	int ret;
+
+	/*
+	 * We accept both IPoIB and IB connections, so we need to keep
+	 * two cm id's, one for each socket type and port space.
+	 * If the cm initialization of one of the id's fails, we abort
+	 * everything.
+	 */
+	cm_ip = rtrs_srv_cm_init(ctx, (struct sockaddr *)&sin, RDMA_PS_TCP);
+	if (IS_ERR(cm_ip))
+		return PTR_ERR(cm_ip);
+
+	cm_ib = rtrs_srv_cm_init(ctx, (struct sockaddr *)&sib, RDMA_PS_IB);
+	if (IS_ERR(cm_ib)) {
+		ret = PTR_ERR(cm_ib);
+		goto free_cm_ip;
+	}
+
+	ctx->cm_id_ip = cm_ip;
+	ctx->cm_id_ib = cm_ib;
+
+	return 0;
+
+free_cm_ip:
+	rdma_destroy_id(cm_ip);
+
+	return ret;
+}
+
+static struct rtrs_srv_ctx *alloc_srv_ctx(rdma_ev_fn *rdma_ev,
+					   link_ev_fn *link_ev)
+{
+	struct rtrs_srv_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->rdma_ev = rdma_ev;
+	ctx->link_ev = link_ev;
+	mutex_init(&ctx->srv_mutex);
+	INIT_LIST_HEAD(&ctx->srv_list);
+
+	return ctx;
+}
+
+static void free_srv_ctx(struct rtrs_srv_ctx *ctx)
+{
+	WARN_ON(!list_empty(&ctx->srv_list));
+	kfree(ctx);
+}
+
+struct rtrs_srv_ctx *rtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
+				     unsigned int port)
+{
+	struct rtrs_srv_ctx *ctx;
+	int err;
+
+	ctx = alloc_srv_ctx(rdma_ev, link_ev);
+	if (unlikely(!ctx))
+		return ERR_PTR(-ENOMEM);
+
+	err = rtrs_srv_rdma_init(ctx, port);
+	if (unlikely(err)) {
+		free_srv_ctx(ctx);
+		return ERR_PTR(err);
+	}
+	/* Do not let module be unloaded if server context is alive */
+	__module_get(THIS_MODULE);
+
+	return ctx;
+}
+EXPORT_SYMBOL(rtrs_srv_open);
+
+static void close_sessions(struct rtrs_srv *srv)
+{
+	struct rtrs_srv_sess *sess;
+
+	mutex_lock(&srv->paths_mutex);
+	list_for_each_entry(sess, &srv->paths_list, s.entry)
+		close_sess(sess);
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static void close_ctx(struct rtrs_srv_ctx *ctx)
+{
+	struct rtrs_srv *srv;
+
+	mutex_lock(&ctx->srv_mutex);
+	list_for_each_entry(srv, &ctx->srv_list, ctx_list)
+		close_sessions(srv);
+	mutex_unlock(&ctx->srv_mutex);
+	flush_workqueue(rtrs_wq);
+}
+
+void rtrs_srv_close(struct rtrs_srv_ctx *ctx)
+{
+	rdma_destroy_id(ctx->cm_id_ip);
+	rdma_destroy_id(ctx->cm_id_ib);
+	close_ctx(ctx);
+	free_srv_ctx(ctx);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(rtrs_srv_close);
+
+static int check_module_params(void)
+{
+	if (sess_queue_depth < 1 || sess_queue_depth > MAX_SESS_QUEUE_DEPTH) {
+		pr_err("Invalid sess_queue_depth value %d, has to be >= %d, <= %d.\n",
+		       sess_queue_depth, 1, MAX_SESS_QUEUE_DEPTH);
+		return -EINVAL;
+	}
+	if (max_chunk_size < 4096 || !is_power_of_2(max_chunk_size)) {
+		pr_err("Invalid max_chunk_size value %d, has to be >= %d and should be power of two.\n",
+		       max_chunk_size, 4096);
+		return -EINVAL;
+	}
+
+	/*
+	 * Check if IB immediate data size is enough to hold the mem_id and the
+	 * offset inside the memory chunk
+	 */
+	if ((ilog2(sess_queue_depth - 1) + 1) +
+	    (ilog2(max_chunk_size - 1) + 1) > MAX_IMM_PAYL_BITS) {
+		pr_err("RDMA immediate size (%db) not enough to encode %d buffers of size %dB. Reduce 'sess_queue_depth' or 'max_chunk_size' parameters.\n",
+		       MAX_IMM_PAYL_BITS, sess_queue_depth, max_chunk_size);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init rtrs_server_init(void)
+{
+	int err;
+
+	init_cq_affinity();
+
+	pr_info("Loading module %s, proto %s: (cq_affinity_list: %s, max_chunk_size: %d (pure IO %ld, headers %ld), sess_queue_depth: %d, always_invalidate: %d)\n",
+		KBUILD_MODNAME, RTRS_PROTO_VER_STRING,
+		cq_affinity_list, max_chunk_size,
+		max_chunk_size - MAX_HDR_SIZE, MAX_HDR_SIZE,
+		sess_queue_depth, always_invalidate);
+
+	rtrs_ib_dev_pool_init(0, &dev_pool);
+
+	err = check_module_params();
+	if (err) {
+		pr_err("Failed to load module, invalid module parameters, err: %d\n",
+		       err);
+		return err;
+	}
+	chunk_pool = mempool_create_page_pool(sess_queue_depth * CHUNK_POOL_SZ,
+					      get_order(max_chunk_size));
+	if (unlikely(!chunk_pool)) {
+		pr_err("Failed to preallocate pool of chunks\n");
+		return -ENOMEM;
+	}
+	rtrs_dev_class = class_create(THIS_MODULE, "rtrs-server");
+	if (IS_ERR(rtrs_dev_class)) {
+		pr_err("Failed to create rtrs-server dev class\n");
+		err = PTR_ERR(rtrs_dev_class);
+		goto out_chunk_pool;
+	}
+	rtrs_wq = alloc_workqueue("rtrs_server_wq", WQ_MEM_RECLAIM, 0);
+	if (unlikely(!rtrs_wq)) {
+		pr_err("Failed to load module, alloc rtrs_server_wq failed\n");
+		err = -ENOMEM;
+		goto out_dev_class;
+	}
+
+	return 0;
+
+out_dev_class:
+	class_destroy(rtrs_dev_class);
+out_chunk_pool:
+	mempool_destroy(chunk_pool);
+
+	return err;
+}
+
+static void __exit rtrs_server_exit(void)
+{
+	destroy_workqueue(rtrs_wq);
+	class_destroy(rtrs_dev_class);
+	mempool_destroy(chunk_pool);
+	rtrs_ib_dev_pool_deinit(&dev_pool);
+}
+
+module_init(rtrs_server_init);
+module_exit(rtrs_server_exit);
-- 
2.17.1



* [PATCH v6 11/25] rtrs: server: statistics functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (9 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 10/25] rtrs: server: main functionality Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 22:02   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 12/25] rtrs: server: sysfs interface functions Jack Wang
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This introduces a set of functions used on the server side to account
statistics of RDMA data sent and received.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-srv-stats.c | 91 ++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv-stats.c

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv-stats.c b/drivers/infiniband/ulp/rtrs/rtrs-srv-stats.c
new file mode 100644
index 000000000000..515f7088db71
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv-stats.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "rtrs-srv.h"
+
+void rtrs_srv_update_rdma_stats(struct rtrs_srv_stats *s,
+				 size_t size, int d)
+{
+	atomic64_inc(&s->rdma_stats.dir[d].cnt);
+	atomic64_add(size, &s->rdma_stats.dir[d].size_total);
+}
+
+void rtrs_srv_update_wc_stats(struct rtrs_srv_stats *s)
+{
+	atomic64_inc(&s->wc_comp.calls);
+	atomic64_inc(&s->wc_comp.total_wc_cnt);
+}
+
+int rtrs_srv_reset_rdma_stats(struct rtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		struct rtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+
+		memset(r, 0, sizeof(*r));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+ssize_t rtrs_srv_stats_rdma_to_str(struct rtrs_srv_stats *stats,
+				    char *page, size_t len)
+{
+	struct rtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+	struct rtrs_srv_sess *sess;
+
+	sess = container_of(stats, typeof(*sess), stats);
+
+	return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
+			 (s64)atomic64_read(&r->dir[READ].cnt),
+			 (s64)atomic64_read(&r->dir[READ].size_total),
+			 (s64)atomic64_read(&r->dir[WRITE].cnt),
+			 (s64)atomic64_read(&r->dir[WRITE].size_total),
+			 atomic_read(&sess->ids_inflight));
+}
+
+int rtrs_srv_reset_wc_completion_stats(struct rtrs_srv_stats *stats,
+					bool enable)
+{
+	if (enable) {
+		memset(&stats->wc_comp, 0, sizeof(stats->wc_comp));
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+int rtrs_srv_stats_wc_completion_to_str(struct rtrs_srv_stats *stats,
+					 char *buf, size_t len)
+{
+	return scnprintf(buf, len, "%lld %lld\n",
+			 (s64)atomic64_read(&stats->wc_comp.total_wc_cnt),
+			 (s64)atomic64_read(&stats->wc_comp.calls));
+}
+
+ssize_t rtrs_srv_reset_all_help(struct rtrs_srv_stats *stats,
+				 char *page, size_t len)
+{
+	return scnprintf(page, len, "echo 1 to reset all statistics\n");
+}
+
+int rtrs_srv_reset_all_stats(struct rtrs_srv_stats *stats, bool enable)
+{
+	if (enable) {
+		rtrs_srv_reset_wc_completion_stats(stats, enable);
+		rtrs_srv_reset_rdma_stats(stats, enable);
+		return 0;
+	}
+
+	return -EINVAL;
+}
-- 
2.17.1



* [PATCH v6 12/25] rtrs: server: sysfs interface functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (10 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 11/25] rtrs: server: statistics functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 22:06   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 13/25] rtrs: include client and server modules into kernel compilation Jack Wang
                   ` (14 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the sysfs interface to rtrs sessions on the server side:

  /sys/devices/virtual/rtrs-server/<SESS-NAME>/
    *** rtrs session accepted from a client peer
    |
    |- paths/<SRC@DST>/
       *** established paths from a client in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- hca_name
       |  *** HCA name
       |
       |- hca_port
       |  *** HCA port
       |
       |- stats/
          *** current path statistics
          |
          |- rdma
          |- reset_all
          |- wc_completion

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/ulp/rtrs/rtrs-srv-sysfs.c | 297 +++++++++++++++++++
 1 file changed, 297 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/rtrs-srv-sysfs.c

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv-sysfs.c b/drivers/infiniband/ulp/rtrs/rtrs-srv-sysfs.c
new file mode 100644
index 000000000000..f5fb80f2a513
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv-sysfs.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "rtrs-pri.h"
+#include "rtrs-srv.h"
+#include "rtrs-log.h"
+
+static struct kobj_type ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+};
+
+static ssize_t rtrs_srv_disconnect_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t rtrs_srv_disconnect_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct rtrs_srv_sess *sess;
+	struct rtrs_sess *s;
+	char str[MAXHOSTNAMELEN];
+
+	sess = container_of(kobj, struct rtrs_srv_sess, kobj);
+	s = &sess->s;
+	if (!sysfs_streq(buf, "1")) {
+		rtrs_err(s, "%s: invalid value: '%s'\n",
+			  attr->attr.name, buf);
+		return -EINVAL;
+	}
+
+	sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+	rtrs_info(s, "disconnect for path %s requested\n", str);
+	close_sess(sess);
+
+	return count;
+}
+
+static struct kobj_attribute rtrs_srv_disconnect_attr =
+	__ATTR(disconnect, 0644,
+	       rtrs_srv_disconnect_show, rtrs_srv_disconnect_store);
+
+static ssize_t rtrs_srv_hca_port_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_srv_sess *sess;
+	struct rtrs_con *usr_con;
+
+	sess = container_of(kobj, typeof(*sess), kobj);
+	usr_con = sess->s.con[0];
+
+	return scnprintf(page, PAGE_SIZE, "%u\n",
+			 usr_con->cm_id->port_num);
+}
+
+static struct kobj_attribute rtrs_srv_hca_port_attr =
+	__ATTR(hca_port, 0444, rtrs_srv_hca_port_show, NULL);
+
+static ssize_t rtrs_srv_hca_name_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_srv_sess *sess;
+
+	sess = container_of(kobj, struct rtrs_srv_sess, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 sess->s.dev->ib_dev->name);
+}
+
+static struct kobj_attribute rtrs_srv_hca_name_attr =
+	__ATTR(hca_name, 0444, rtrs_srv_hca_name_show, NULL);
+
+static ssize_t rtrs_srv_src_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_srv_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct rtrs_srv_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute rtrs_srv_src_addr_attr =
+	__ATTR(src_addr, 0444, rtrs_srv_src_addr_show, NULL);
+
+static ssize_t rtrs_srv_dst_addr_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *page)
+{
+	struct rtrs_srv_sess *sess;
+	int cnt;
+
+	sess = container_of(kobj, struct rtrs_srv_sess, kobj);
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			      page, PAGE_SIZE);
+	return cnt + scnprintf(page + cnt, PAGE_SIZE - cnt, "\n");
+}
+
+static struct kobj_attribute rtrs_srv_dst_addr_attr =
+	__ATTR(dst_addr, 0444, rtrs_srv_dst_addr_show, NULL);
+
+static struct attribute *rtrs_srv_sess_attrs[] = {
+	&rtrs_srv_hca_name_attr.attr,
+	&rtrs_srv_hca_port_attr.attr,
+	&rtrs_srv_src_addr_attr.attr,
+	&rtrs_srv_dst_addr_attr.attr,
+	&rtrs_srv_disconnect_attr.attr,
+	NULL,
+};
+
+static struct attribute_group rtrs_srv_sess_attr_group = {
+	.attrs = rtrs_srv_sess_attrs,
+};
+
+STAT_ATTR(struct rtrs_srv_sess, rdma,
+	  rtrs_srv_stats_rdma_to_str,
+	  rtrs_srv_reset_rdma_stats);
+
+STAT_ATTR(struct rtrs_srv_sess, wc_completion,
+	  rtrs_srv_stats_wc_completion_to_str,
+	  rtrs_srv_reset_wc_completion_stats);
+
+STAT_ATTR(struct rtrs_srv_sess, reset_all,
+	  rtrs_srv_reset_all_help,
+	  rtrs_srv_reset_all_stats);
+
+static struct attribute *rtrs_srv_stats_attrs[] = {
+	&rdma_attr.attr,
+	&wc_completion_attr.attr,
+	&reset_all_attr.attr,
+	NULL,
+};
+
+static struct attribute_group rtrs_srv_stats_attr_group = {
+	.attrs = rtrs_srv_stats_attrs,
+};
+
+static void rtrs_srv_dev_release(struct device *dev)
+{
+	struct rtrs_srv *srv = container_of(dev, struct rtrs_srv, dev);
+
+	kfree(srv);
+}
+
+static int rtrs_srv_create_once_sysfs_root_folders(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	int err = 0;
+
+	mutex_lock(&srv->paths_mutex);
+	if (srv->dev_ref++) {
+		/*
+		 * Just increase device reference.  We can't use get_device()
+		 * because we need to unregister device when ref goes to 0,
+		 * not just to put it.
+		 */
+		goto unlock;
+	}
+	srv->dev.class = rtrs_dev_class;
+	srv->dev.release = rtrs_srv_dev_release;
+	dev_set_name(&srv->dev, "%s", sess->s.sessname);
+
+	err = device_register(&srv->dev);
+	if (unlikely(err)) {
+		pr_err("device_register(): %d\n", err);
+		goto unlock;
+	}
+	err = kobject_init_and_add(&srv->kobj_paths, &ktype,
+				   &srv->dev.kobj, "paths");
+	if (unlikely(err)) {
+		pr_err("kobject_init_and_add(): %d\n", err);
+		device_unregister(&srv->dev);
+		goto unlock;
+	}
+unlock:
+	mutex_unlock(&srv->paths_mutex);
+
+	return err;
+}
+
+static void
+rtrs_srv_destroy_once_sysfs_root_folders(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+
+	mutex_lock(&srv->paths_mutex);
+	if (!--srv->dev_ref) {
+		kobject_del(&srv->kobj_paths);
+		kobject_put(&srv->kobj_paths);
+		device_unregister(&srv->dev);
+	}
+	mutex_unlock(&srv->paths_mutex);
+}
+
+static int rtrs_srv_create_stats_files(struct rtrs_srv_sess *sess)
+{
+	int err;
+	struct rtrs_sess *s = &sess->s;
+
+	err = kobject_init_and_add(&sess->kobj_stats, &ktype,
+				   &sess->kobj, "stats");
+	if (unlikely(err)) {
+		rtrs_err(s, "kobject_init_and_add(): %d\n", err);
+		return err;
+	}
+	err = sysfs_create_group(&sess->kobj_stats,
+				 &rtrs_srv_stats_attr_group);
+	if (unlikely(err)) {
+		rtrs_err(s, "sysfs_create_group(): %d\n", err);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	kobject_del(&sess->kobj_stats);
+	kobject_put(&sess->kobj_stats);
+
+	return err;
+}
+
+int rtrs_srv_create_sess_files(struct rtrs_srv_sess *sess)
+{
+	struct rtrs_srv *srv = sess->srv;
+	struct rtrs_sess *s = &sess->s;
+	char str[NAME_MAX];
+	int err, cnt;
+
+	cnt = sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr,
+			      str, sizeof(str));
+	cnt += scnprintf(str + cnt, sizeof(str) - cnt, "@");
+	sockaddr_to_str((struct sockaddr *)&sess->s.src_addr,
+			str + cnt, sizeof(str) - cnt);
+
+	err = rtrs_srv_create_once_sysfs_root_folders(sess);
+	if (unlikely(err))
+		return err;
+
+	err = kobject_init_and_add(&sess->kobj, &ktype, &srv->kobj_paths,
+				   "%s", str);
+	if (unlikely(err)) {
+		rtrs_err(s, "kobject_init_and_add(): %d\n", err);
+		goto destroy_root;
+	}
+	err = sysfs_create_group(&sess->kobj, &rtrs_srv_sess_attr_group);
+	if (unlikely(err)) {
+		rtrs_err(s, "sysfs_create_group(): %d\n", err);
+		goto put_kobj;
+	}
+	err = rtrs_srv_create_stats_files(sess);
+	if (unlikely(err))
+		goto remove_group;
+
+	return 0;
+
+remove_group:
+	sysfs_remove_group(&sess->kobj, &rtrs_srv_sess_attr_group);
+put_kobj:
+	kobject_del(&sess->kobj);
+	kobject_put(&sess->kobj);
+destroy_root:
+	rtrs_srv_destroy_once_sysfs_root_folders(sess);
+
+	return err;
+}
+
+void rtrs_srv_destroy_sess_files(struct rtrs_srv_sess *sess)
+{
+	if (sess->kobj.state_in_sysfs) {
+		kobject_del(&sess->kobj_stats);
+		kobject_put(&sess->kobj_stats);
+		kobject_del(&sess->kobj);
+		kobject_put(&sess->kobj);
+
+		rtrs_srv_destroy_once_sysfs_root_folders(sess);
+	}
+}
-- 
2.17.1



* [PATCH v6 13/25] rtrs: include client and server modules into kernel compilation
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (11 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 12/25] rtrs: server: sysfs interface functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 22:11   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 14/25] rtrs: a bit of documentation Jack Wang
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

Add the rtrs Makefile and Kconfig, and the corresponding lines to the
upper-layer infiniband/ulp files.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/infiniband/Kconfig           |  1 +
 drivers/infiniband/ulp/Makefile      |  1 +
 drivers/infiniband/ulp/rtrs/Kconfig  | 27 +++++++++++++++++++++++++++
 drivers/infiniband/ulp/rtrs/Makefile | 17 +++++++++++++++++
 4 files changed, 46 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rtrs/Kconfig
 create mode 100644 drivers/infiniband/ulp/rtrs/Makefile

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ade86388434f..477418b37786 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -107,6 +107,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
+source "drivers/infiniband/ulp/rtrs/Kconfig"
 
 source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index 437813c7b481..4d0004b58377 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)		+= srpt/
 obj-$(CONFIG_INFINIBAND_ISER)		+= iser/
 obj-$(CONFIG_INFINIBAND_ISERT)		+= isert/
 obj-$(CONFIG_INFINIBAND_OPA_VNIC)	+= opa_vnic/
+obj-$(CONFIG_INFINIBAND_RTRS)		+= rtrs/
diff --git a/drivers/infiniband/ulp/rtrs/Kconfig b/drivers/infiniband/ulp/rtrs/Kconfig
new file mode 100644
index 000000000000..1d6c670a4504
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/Kconfig
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+config INFINIBAND_RTRS
+	tristate
+	depends on INFINIBAND_ADDR_TRANS
+
+config INFINIBAND_RTRS_CLIENT
+	tristate "RTRS client module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_RTRS
+	help
+	  RDMA transport client module.
+
+	  RTRS client allows for simplified data transfer and connection
+	  establishment over RDMA (InfiniBand, RoCE, iWARP). Uses BIO-like
+	  READ/WRITE semantics and provides multipath capabilities.
+
+config INFINIBAND_RTRS_SERVER
+	tristate "RTRS server module"
+	depends on INFINIBAND_ADDR_TRANS
+	select INFINIBAND_RTRS
+	help
+	  RDMA transport server module.
+
+	  The RTRS server module processes connection and IO requests
+	  received from the RTRS client module and passes the IO requests
+	  to its user, e.g. RNBD_server.
diff --git a/drivers/infiniband/ulp/rtrs/Makefile b/drivers/infiniband/ulp/rtrs/Makefile
new file mode 100644
index 000000000000..89332be15c9e
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/Makefile
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+rtrs-client-y := rtrs-clt.o \
+		  rtrs-clt-stats.o \
+		  rtrs-clt-sysfs.o
+
+rtrs-server-y := rtrs-srv.o \
+		  rtrs-srv-stats.o \
+		  rtrs-srv-sysfs.o
+
+rtrs-core-y := rtrs.o
+
+obj-$(CONFIG_INFINIBAND_RTRS)        += rtrs-core.o
+obj-$(CONFIG_INFINIBAND_RTRS_CLIENT) += rtrs-client.o
+obj-$(CONFIG_INFINIBAND_RTRS_SERVER) += rtrs-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.17.1



* [PATCH v6 14/25] rtrs: a bit of documentation
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (12 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 13/25] rtrs: include client and server modules into kernel compilation Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 23:19   ` Bart Van Assche
  2020-01-02 22:21   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 15/25] rnbd: private headers with rnbd protocol structs and helpers Jack Wang
                   ` (12 subsequent siblings)
  26 siblings, 2 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev, linux-kernel

From: Jack Wang <jinpu.wang@cloud.ionos.com>

README with a description of the major sysfs entries. The sysfs
documentation has been moved to the ABI dir as suggested by Bart.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: linux-kernel@vger.kernel.org
---
 .../ABI/testing/sysfs-class-rtrs-client       | 190 ++++++++++++++++++
 .../ABI/testing/sysfs-class-rtrs-server       |  81 ++++++++
 drivers/infiniband/ulp/rtrs/README            | 149 ++++++++++++++
 3 files changed, 420 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-rtrs-client
 create mode 100644 Documentation/ABI/testing/sysfs-class-rtrs-server
 create mode 100644 drivers/infiniband/ulp/rtrs/README

diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-client b/Documentation/ABI/testing/sysfs-class-rtrs-client
new file mode 100644
index 000000000000..8b219cf6c5c4
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-rtrs-client
@@ -0,0 +1,190 @@
+What:		/sys/class/rtrs-client
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+When a user of RTRS API creates a new session, a directory entry with
+the name of that session is created under /sys/class/rtrs-client/<session-name>/
+
+What:		/sys/class/rtrs-client/<session-name>/add_path
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RW, adds a new path (connection) to an existing session. Expected format is the
+following:
+
+  <[source addr,]destination addr>
+
+  *addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]
+
+What:		/sys/class/rtrs-client/<session-name>/max_reconnect_attempts
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Maximum number of reconnect attempts the client should make before giving up
+after the connection breaks unexpectedly.
+
+What:		/sys/class/rtrs-client/<session-name>/mp_policy
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Multipath policy specifies which path should be selected on each IO:
+
+   round-robin (0):
+       select path in per CPU round-robin manner.
+
+   min-inflight (1):
+       select path with minimum inflights.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Each path belonging to a given session is listed here by its source and
+destination address. When a new path is added to a session by writing to
+the "add_path" entry, a directory <src@dst> is created.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/state
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains "connected" if the session is connected to the peer and fully
+functional.  Otherwise the file contains "disconnected".
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/reconnect
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Write "1" to the file in order to reconnect the path.
+Operation is blocking and returns 0 if reconnect was successful.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/disconnect
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Write "1" to the file in order to disconnect the path.
+Operation blocks until RTRS path is disconnected.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/remove_path
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Write "1" to the file in order to disconnect and remove the path
+from the session.  Operation blocks until the path is disconnected
+and removed from the session.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_name
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the name of the HCA the connection is established on.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_port
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the port number of the active port the traffic is going through.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/src_addr
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the source address of the path
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/dst_addr
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the destination address of the path
+
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/reset_all
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RW, Read will return usage help, write 0 will clear all the statistics.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/sg_entries
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Data to be transferred via RDMA is passed to RTRS as scatter-gather
+list. A scatter-gather list can contain multiple entries.
+Scatter-gather lists with fewer entries require less processing power
+and can therefore be transferred faster. The file sg_entries outputs a
+per-CPU distribution table for the number of entries in the
+scatter-gather lists that were passed to the RTRS API function
+rtrs_clt_request (READ or WRITE).
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/cpu_migration
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RTRS expects that each HCA IRQ is pinned to a separate CPU. If that is not
+the case, an I/O response could be processed on a different CPU than the
+one it was originally submitted on.  This file shows how many interrupts
+were generated on an unexpected CPU.
+"from:" is the CPU on which the IRQ was expected, but not generated.
+"to:" is the CPU on which the IRQ was generated, but not expected.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/reconnects
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains 2 unsigned int values, the first one records number of successful
+reconnects in the path lifetime, the second one records number of failed
+reconnects in the path lifetime.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/rdma_lat
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Latency distribution of RTRS requests.
+The format is:
+   1 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   2 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   4 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   8 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  16 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  ...
+  65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  >= 65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  maximum ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/wc_completion
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains 2 unsigned int values, the first one records max number of work
+requests processed in work_completion in session lifetime, the second
+one records average number of work requests processed in work_completion
+in session lifetime.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/rdma
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 6 values:
+
+<read-count> <read-total-size> <write-count> <write-total-size> \
+<inflights> <failovered>
diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-server b/Documentation/ABI/testing/sysfs-class-rtrs-server
new file mode 100644
index 000000000000..cac2a093d56f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-rtrs-server
@@ -0,0 +1,81 @@
+What:		/sys/class/rtrs-server
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+When a user of RTRS API creates a new session on a client side, a
+directory entry with the name of that session is created in here.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+When a new path is created by writing to the "add_path" entry on the client
+side, a directory entry named <source address>@<destination address> is
+created on the server.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/disconnect
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+When "1" is written to the file, the RTRS session is disconnected.
+The operation is non-blocking and returns control immediately to the caller.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_name
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the name of the HCA the connection is established on.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_port
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the port number of the active port the traffic is going through.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/src_addr
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the source address of the path
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/dst_addr
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RO, Contains the destination address of the path
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/stats/reset_all
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RW, Read will return usage help, write 0 will clear all the statistics.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/stats/rdma
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 5 values:
+<read-count> <read-total-size> <write-count> <write-total-size> <inflights>
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/stats/wc_completion
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains 3 values, the first one is int, records max number of work
+requests processed in work_completion in session lifetime, the second
+one long int records total number of work requests processed in
+work_completion in session lifetime and the 3rd one long int records
+total number of calls to the cq completion handler. Division of 2nd
+number through 3rd gives the average number of completions processed
+in completion handler.
diff --git a/drivers/infiniband/ulp/rtrs/README b/drivers/infiniband/ulp/rtrs/README
new file mode 100644
index 000000000000..59ad60318a18
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/README
@@ -0,0 +1,149 @@
+***********************
+RDMA Transport (RTRS)
+***********************
+
+RTRS (RDMA Transport) is a reliable high speed transport library
+which provides support to establish an optimal number of connections
+between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
+transport. It is optimized to transfer (read/write) IO blocks.
+
+In its core interface it follows the BIO semantics of providing the
+possibility to either write data from an sg list to the remote side
+or to request ("read") data transfer from the remote side into a given
+sg list.
+
+RTRS provides I/O fail-over and load-balancing capabilities by using
+multipath I/O (see "add_path" and "mp_policy" configuration entries).
+
+RTRS is used by the RNBD (InfiniBand Network Block Device) modules.
+
+==================
+Transport protocol
+==================
+
+Overview
+--------
+An established connection between a client and a server is called an rtrs
+session. A session is associated with a set of memory chunks reserved on the
+server side for a given client for rdma transfer. A session
+consists of multiple paths, each representing a separate physical link
+between client and server. Those are used for load balancing and failover.
+Each path consists of as many connections (QPs) as there are cpus on
+the client.
+
+When processing an incoming rdma write or read request, the rtrs client uses
+memory chunks reserved for it on the server side. Their number, size and
+addresses need to be exchanged between client and server during the connection
+establishment phase. Apart from the memory related information, the client
+needs to inform the server about the session name and identify each path and
+connection individually.
+
+On an established session the client sends write or read messages to the
+server. The server uses the immediate field to tell the client which request is
+being acknowledged and to carry the errno. The client uses the immediate field
+to tell the server which of the memory chunks has been accessed and at which
+offset the message can be found.
+
+Connection establishment
+------------------------
+
+1. Client starts establishing connections belonging to a path of a session one
+by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
+Those include uuid of the session and uuid of the path to be
+established. They are used by the server to find a persisting session/path or
+to create a new one when necessary. The message also contains the protocol
+version and magic for compatibility, total number of connections per session
+(as many as cpus on the client), the id of the current connection and
+the reconnect counter, which is used to resolve the situations where
+client is trying to reconnect a path, while server is still destroying the old
+one.
+
+2. Server accepts the connection requests one by one and attaches
+RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
+protocol version, the messages include error code, queue depth supported by
+the server (number of memory chunks which are going to be allocated for that
+session) and the maximum size of one io.
+
+3. After all connections of a path are established client sends to server the
+RTRS_MSG_INFO_REQ message, containing the name of the session. This message
+requests the address information from the server.
+
+4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
+which contains the addresses and keys of the RDMA buffers allocated for that
+session.
+
+5. Session becomes connected after all paths to be established are connected
+(i.e. steps 1-4 finished for all paths requested for a session)
+
+6. Server and client periodically exchange heartbeat messages (empty rdma
+messages with an immediate field) which are used to detect a crash on the
+remote side or a network outage in the absence of IO.
+
+7. On any RDMA related error or in the case of a heartbeat timeout, the
+corresponding path is disconnected, all the inflight IO are failed over to a
+healthy path, if any, and the reconnect mechanism is triggered.
+
+CLT                                     SRV
+*for each connection belonging to a path and for each path:
+RTRS_MSG_CON_REQ  ------------------->
+                   <------------------- RTRS_MSG_CON_RSP
+...
+*after all connections are established:
+RTRS_MSG_INFO_REQ ------------------->
+                   <------------------- RTRS_MSG_INFO_RSP
+*heartbeat is started from both sides:
+                   -------------------> [RTRS_HB_MSG_IMM]
+[RTRS_HB_MSG_ACK] <-------------------
+[RTRS_HB_MSG_IMM] <-------------------
+                   -------------------> [RTRS_HB_MSG_ACK]
+
+IO path
+-------
+
+* Write *
+
+1. When processing a write request client selects one of the memory chunks
+on the server side and rdma writes there the user data, user header and the
+RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
+contains size of the user header. The client tells the server which chunk has
+been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
+using the IMM field.
+
+2. When confirming a write request server sends an "empty" rdma message with
+an immediate field. The 32 bit field is used to specify the outstanding
+inflight IO and for the error code.
+
+CLT                                                          SRV
+usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
+[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
+
+* Read *
+
+1. When processing a read request client selects one of the memory chunks
+on the server side and rdma writes there the user header and the
+RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
+the user header, flags (specifying if memory invalidation is necessary) and the
+list of addresses along with keys for the data to be read into.
+
+2. When confirming a read request server transfers the requested data first,
+attaches an invalidation message if requested and finally an "empty" rdma
+message with an immediate field. The 32 bit field is used to specify the
+outstanding inflight IO and the error code.
+
+CLT                                           SRV
+usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
+[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
+or in case client requested invalidation:
+[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
+
+=========================================
+Contributors List (in alphabetical order)
+=========================================
+Danil Kipnis <danil.kipnis@profitbricks.com>
+Fabian Holler <mail@fholler.de>
+Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
+Jack Wang <jinpu.wang@profitbricks.com>
+Kleber Souza <kleber.souza@profitbricks.com>
+Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
+Milind Dumbare <Milind.dumbare@gmail.com>
+Roman Penyaev <roman.penyaev@profitbricks.com>
-- 
2.17.1
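
The client-side stats/rdma entry documented above is described as a single
line of six values. As a rough illustration of how a monitoring tool might
consume it, here is a minimal userspace sketch. It assumes the values are
whitespace-separated unsigned decimals, which the ABI text does not state
explicitly; the struct and function names are hypothetical and not part of
the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical userspace view of one stats/rdma line. The field order
 * follows the ABI description:
 * <read-count> <read-total-size> <write-count> <write-total-size>
 * <inflights> <failovered>
 */
struct rtrs_clt_rdma_stats {
	uint64_t read_cnt;
	uint64_t read_size;
	uint64_t write_cnt;
	uint64_t write_size;
	uint64_t inflights;
	uint64_t failovered;
};

/*
 * Parse one line as read from the sysfs file.
 * Returns 0 on success, -1 if fewer than six values could be parsed.
 * Assumption: values are unsigned decimals separated by whitespace.
 */
static int rtrs_clt_parse_rdma_stats(const char *line,
				     struct rtrs_clt_rdma_stats *st)
{
	int n;

	assert(line && st);

	n = sscanf(line,
		   "%" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
		   " %" SCNu64 " %" SCNu64,
		   &st->read_cnt, &st->read_size, &st->write_cnt,
		   &st->write_size, &st->inflights, &st->failovered);

	return n == 6 ? 0 : -1;
}
```

A caller would read the file with fgets() and pass the buffer in. The
server-side entry lists only five values, so this parser would reject it and
a separate variant without the last field would be needed there.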



* [PATCH v6 15/25] rnbd: private headers with rnbd protocol structs and helpers
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (13 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 14/25] rtrs: a bit of documentation Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 22:34   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 16/25] rnbd: client: private header with client structs and functions Jack Wang
                   ` (11 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

These are common private headers with rnbd protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/block/rnbd/rnbd-common.c |  25 +++
 drivers/block/rnbd/rnbd-log.h    |  43 +++++
 drivers/block/rnbd/rnbd-proto.h  | 307 +++++++++++++++++++++++++++++++
 3 files changed, 375 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-common.c
 create mode 100644 drivers/block/rnbd/rnbd-log.h
 create mode 100644 drivers/block/rnbd/rnbd-proto.h

diff --git a/drivers/block/rnbd/rnbd-common.c b/drivers/block/rnbd/rnbd-common.c
new file mode 100644
index 000000000000..4de9df8cedb3
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-common.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#include "rnbd-proto.h"
+
+const char *rnbd_access_mode_str(enum rnbd_access_mode mode)
+{
+	switch (mode) {
+	case RNBD_ACCESS_RO:
+		return "ro";
+	case RNBD_ACCESS_RW:
+		return "rw";
+	case RNBD_ACCESS_MIGRATION:
+		return "migration";
+	default:
+		return "unknown";
+	}
+}
diff --git a/drivers/block/rnbd/rnbd-log.h b/drivers/block/rnbd/rnbd-log.h
new file mode 100644
index 000000000000..14e22bff1821
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-log.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#ifndef RNBD_LOG_H
+#define RNBD_LOG_H
+
+#include "rnbd-clt.h"
+#include "rnbd-srv.h"
+
+#define rnbd_clt_log(fn, dev, fmt, ...) (				\
+		fn("<%s@%s> " fmt, (dev)->pathname,			\
+		(dev)->sess->sessname,					\
+		   ##__VA_ARGS__))
+#define rnbd_srv_log(fn, dev, fmt, ...) (				\
+			fn("<%s@%s>: " fmt, (dev)->pathname,		\
+			   (dev)->sess->sessname, ##__VA_ARGS__))
+
+#define rnbd_clt_err(dev, fmt, ...)	\
+	rnbd_clt_log(pr_err, dev, fmt, ##__VA_ARGS__)
+#define rnbd_clt_err_rl(dev, fmt, ...)	\
+	rnbd_clt_log(pr_err_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define rnbd_clt_info(dev, fmt, ...) \
+	rnbd_clt_log(pr_info, dev, fmt, ##__VA_ARGS__)
+#define rnbd_clt_info_rl(dev, fmt, ...) \
+	rnbd_clt_log(pr_info_ratelimited, dev, fmt, ##__VA_ARGS__)
+
+#define rnbd_srv_err(dev, fmt, ...)	\
+	rnbd_srv_log(pr_err, dev, fmt, ##__VA_ARGS__)
+#define rnbd_srv_err_rl(dev, fmt, ...)	\
+	rnbd_srv_log(pr_err_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define rnbd_srv_info(dev, fmt, ...) \
+	rnbd_srv_log(pr_info, dev, fmt, ##__VA_ARGS__)
+#define rnbd_srv_info_rl(dev, fmt, ...) \
+	rnbd_srv_log(pr_info_ratelimited, dev, fmt, ##__VA_ARGS__)
+
+#endif /* RNBD_LOG_H */
diff --git a/drivers/block/rnbd/rnbd-proto.h b/drivers/block/rnbd/rnbd-proto.h
new file mode 100644
index 000000000000..069df2d1ae5e
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-proto.h
@@ -0,0 +1,307 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#ifndef RNBD_PROTO_H
+#define RNBD_PROTO_H
+
+#include <linux/types.h>
+#include <linux/blkdev.h>
+#include <linux/limits.h>
+#include <linux/inet.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <rdma/ib.h>
+
+#define RNBD_PROTO_VER_MAJOR 2
+#define RNBD_PROTO_VER_MINOR 0
+
+#define RNBD_PROTO_VER_STRING __stringify(RNBD_PROTO_VER_MAJOR) "." \
+			       __stringify(RNBD_PROTO_VER_MINOR)
+
+#define RTRS_PORT 1234
+
+/**
+ * enum rnbd_msg_type - RNBD message types
+ * @RNBD_MSG_SESS_INFO:	initial session info from client to server
+ * @RNBD_MSG_SESS_INFO_RSP:	initial session info from server to client
+ * @RNBD_MSG_OPEN:		open (map) device request
+ * @RNBD_MSG_OPEN_RSP:		response to an @RNBD_MSG_OPEN
+ * @RNBD_MSG_IO:		block IO request operation
+ * @RNBD_MSG_CLOSE:		close (unmap) device request
+ */
+enum rnbd_msg_type {
+	RNBD_MSG_SESS_INFO,
+	RNBD_MSG_SESS_INFO_RSP,
+	RNBD_MSG_OPEN,
+	RNBD_MSG_OPEN_RSP,
+	RNBD_MSG_IO,
+	RNBD_MSG_CLOSE,
+};
+
+/**
+ * struct rnbd_msg_hdr - header of RNBD messages
+ * @type:	Message type, valid values see: enum rnbd_msg_type
+ */
+struct rnbd_msg_hdr {
+	__le16		type;
+	__le16		__padding;
+};
+
+/*
+ * We allow to map RO many times and RW only once. We allow to map yet another
+ * time RW, if MIGRATION is provided (second RW export can be required for
+ * example for VM migration)
+ */
+enum rnbd_access_mode {
+	RNBD_ACCESS_RO,
+	RNBD_ACCESS_RW,
+	RNBD_ACCESS_MIGRATION,
+};
+
+/**
+ * struct rnbd_msg_sess_info - initial session info from client to server
+ * @hdr:		message header
+ * @ver:		RNBD protocol version
+ */
+struct rnbd_msg_sess_info {
+	struct rnbd_msg_hdr hdr;
+	u8		ver;
+	u8		reserved[31];
+};
+
+/**
+ * struct rnbd_msg_sess_info_rsp - initial session info from server to client
+ * @hdr:		message header
+ * @ver:		RNBD protocol version
+ */
+struct rnbd_msg_sess_info_rsp {
+	struct rnbd_msg_hdr hdr;
+	u8		ver;
+	u8		reserved[31];
+};
+
+/**
+ * struct rnbd_msg_open - request to open a remote device.
+ * @hdr:		message header
+ * @access_mode:	the mode to open remote device, valid values see:
+ *			enum rnbd_access_mode
+ * @dev_name:		device path on remote side
+ */
+struct rnbd_msg_open {
+	struct rnbd_msg_hdr hdr;
+	u8		access_mode;
+	u8		resv1;
+	s8		dev_name[NAME_MAX];
+	u8		reserved[3];
+};
+
+/**
+ * struct rnbd_msg_close - request to close a remote device.
+ * @hdr:	message header
+ * @device_id:	device_id on server side to identify the device
+ */
+struct rnbd_msg_close {
+	struct rnbd_msg_hdr hdr;
+	__le32		device_id;
+};
+
+/**
+ * struct rnbd_msg_open_rsp - response message to RNBD_MSG_OPEN
+ * @hdr:		message header
+ * @device_id:		device_id on server side to identify the device
+ * @nsectors:		number of sectors in the usual 512b unit
+ * @max_hw_sectors:	max hardware sectors in the usual 512b unit
+ * @max_write_same_sectors: max sectors for WRITE SAME in the 512b unit
+ * @max_discard_sectors: max. sectors that can be discarded at once in 512b
+ * unit.
+ * @discard_granularity: size of the internal discard allocation unit in bytes
+ * @discard_alignment: offset from internal allocation assignment in bytes
+ * @physical_block_size: physical block size device supports in bytes
+ * @logical_block_size: logical block size device supports in bytes
+ * @max_segments:	max segments hardware support in one transfer
+ * @secure_discard:	supports secure discard
+ * @rotational:		is a rotational disc?
+ */
+struct rnbd_msg_open_rsp {
+	struct rnbd_msg_hdr	hdr;
+	__le32			device_id;
+	__le64			nsectors;
+	__le32			max_hw_sectors;
+	__le32			max_write_same_sectors;
+	__le32			max_discard_sectors;
+	__le32			discard_granularity;
+	__le32			discard_alignment;
+	__le16			physical_block_size;
+	__le16			logical_block_size;
+	__le16			max_segments;
+	__le16			secure_discard;
+	u8			rotational;
+	u8			reserved[11];
+};
+
+/**
+ * struct rnbd_msg_io - message for I/O read/write
+ * @hdr:	message header
+ * @device_id:	device_id on server side to find the right device
+ * @sector:	bi_sector attribute from struct bio
+ * @rw:		valid values are defined in enum rnbd_io_flags
+ * @bi_size:    number of bytes for I/O read/write
+ * @prio:       priority
+ */
+struct rnbd_msg_io {
+	struct rnbd_msg_hdr hdr;
+	__le32		device_id;
+	__le64		sector;
+	__le32		rw;
+	__le32		bi_size;
+	__le16		prio;
+};
+
+#define RNBD_OP_BITS  8
+#define RNBD_OP_MASK  ((1 << RNBD_OP_BITS) - 1)
+
+/**
+ * enum rnbd_io_flags - RNBD request types from rq_flag_bits
+ * @RNBD_OP_READ:	     read sectors from the device
+ * @RNBD_OP_WRITE:	     write sectors to the device
+ * @RNBD_OP_FLUSH:	     flush the volatile write cache
+ * @RNBD_OP_DISCARD:        discard sectors
+ * @RNBD_OP_SECURE_ERASE:   securely erase sectors
+ * @RNBD_OP_WRITE_SAME:     write the same sectors many times
+ *
+ * @RNBD_F_SYNC:	     request is sync (sync write or read)
+ * @RNBD_F_FUA:             forced unit access
+ */
+enum rnbd_io_flags {
+
+	/* Operations */
+
+	RNBD_OP_READ		= 0,
+	RNBD_OP_WRITE		= 1,
+	RNBD_OP_FLUSH		= 2,
+	RNBD_OP_DISCARD	= 3,
+	RNBD_OP_SECURE_ERASE	= 4,
+	RNBD_OP_WRITE_SAME	= 5,
+
+	RNBD_OP_LAST,
+
+	/* Flags */
+
+	RNBD_F_SYNC  = 1<<(RNBD_OP_BITS + 0),
+	RNBD_F_FUA   = 1<<(RNBD_OP_BITS + 1),
+
+	RNBD_F_ALL   = (RNBD_F_SYNC | RNBD_F_FUA)
+
+};
+
+static inline u32 rnbd_op(u32 flags)
+{
+	return (flags & RNBD_OP_MASK);
+}
+
+static inline u32 rnbd_flags(u32 flags)
+{
+	return (flags & ~RNBD_OP_MASK);
+}
+
+static inline bool rnbd_flags_supported(u32 flags)
+{
+	u32 op;
+
+	op = rnbd_op(flags);
+	flags = rnbd_flags(flags);
+
+	if (op >= RNBD_OP_LAST)
+		return false;
+	if (flags & ~RNBD_F_ALL)
+		return false;
+
+	return true;
+}
+
+static inline u32 rnbd_to_bio_flags(u32 rnbd_opf)
+{
+	u32 bio_opf;
+
+	switch (rnbd_op(rnbd_opf)) {
+	case RNBD_OP_READ:
+		bio_opf = REQ_OP_READ;
+		break;
+	case RNBD_OP_WRITE:
+		bio_opf = REQ_OP_WRITE;
+		break;
+	case RNBD_OP_FLUSH:
+		bio_opf = REQ_OP_FLUSH | REQ_PREFLUSH;
+		break;
+	case RNBD_OP_DISCARD:
+		bio_opf = REQ_OP_DISCARD;
+		break;
+	case RNBD_OP_SECURE_ERASE:
+		bio_opf = REQ_OP_SECURE_ERASE;
+		break;
+	case RNBD_OP_WRITE_SAME:
+		bio_opf = REQ_OP_WRITE_SAME;
+		break;
+	default:
+		WARN(1, "Unknown RNBD type: %u (flags %u)\n",
+		     rnbd_op(rnbd_opf), rnbd_opf);
+		bio_opf = 0;
+	}
+
+	if (rnbd_opf & RNBD_F_SYNC)
+		bio_opf |= REQ_SYNC;
+
+	if (rnbd_opf & RNBD_F_FUA)
+		bio_opf |= REQ_FUA;
+
+	return bio_opf;
+}
+
+static inline u32 rq_to_rnbd_flags(struct request *rq)
+{
+	u32 rnbd_opf;
+
+	switch (req_op(rq)) {
+	case REQ_OP_READ:
+		rnbd_opf = RNBD_OP_READ;
+		break;
+	case REQ_OP_WRITE:
+		rnbd_opf = RNBD_OP_WRITE;
+		break;
+	case REQ_OP_DISCARD:
+		rnbd_opf = RNBD_OP_DISCARD;
+		break;
+	case REQ_OP_SECURE_ERASE:
+		rnbd_opf = RNBD_OP_SECURE_ERASE;
+		break;
+	case REQ_OP_WRITE_SAME:
+		rnbd_opf = RNBD_OP_WRITE_SAME;
+		break;
+	case REQ_OP_FLUSH:
+		rnbd_opf = RNBD_OP_FLUSH;
+		break;
+	default:
+		WARN(1, "Unknown request type %d (flags %llu)\n",
+		     req_op(rq), (unsigned long long)rq->cmd_flags);
+		rnbd_opf = 0;
+	}
+
+	if (op_is_sync(rq->cmd_flags))
+		rnbd_opf |= RNBD_F_SYNC;
+
+	if (op_is_flush(rq->cmd_flags))
+		rnbd_opf |= RNBD_F_FUA;
+
+	return rnbd_opf;
+}
+
+const char *rnbd_access_mode_str(enum rnbd_access_mode mode);
+
+#endif /* RNBD_PROTO_H */
-- 
2.17.1
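
The rw field of struct rnbd_msg_io packs the operation into the low
RNBD_OP_BITS bits and the modifier flags into the bits above them. The
following userspace sketch mirrors the definitions from rnbd-proto.h so the
encoding can be tried outside the kernel. It is an illustration, not the
patch's code: kernel types are replaced by <stdint.h> ones and the mask uses
an unsigned constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RNBD_OP_BITS 8
/* Unsigned constant here; the patch itself uses a plain int shift. */
#define RNBD_OP_MASK ((1u << RNBD_OP_BITS) - 1)

enum rnbd_io_flags {
	/* Operations live in the low RNBD_OP_BITS bits. */
	RNBD_OP_READ		= 0,
	RNBD_OP_WRITE		= 1,
	RNBD_OP_FLUSH		= 2,
	RNBD_OP_DISCARD		= 3,
	RNBD_OP_SECURE_ERASE	= 4,
	RNBD_OP_WRITE_SAME	= 5,
	RNBD_OP_LAST,

	/* Flags start right above the operation bits. */
	RNBD_F_SYNC = 1 << (RNBD_OP_BITS + 0),
	RNBD_F_FUA  = 1 << (RNBD_OP_BITS + 1),
	RNBD_F_ALL  = (RNBD_F_SYNC | RNBD_F_FUA)
};

/* Extract the operation part of a combined rw value. */
static inline uint32_t rnbd_op(uint32_t flags)
{
	return flags & RNBD_OP_MASK;
}

/* Extract the flag part of a combined rw value. */
static inline uint32_t rnbd_flags(uint32_t flags)
{
	return flags & ~RNBD_OP_MASK;
}

/* Reject unknown operations and any flag bit outside RNBD_F_ALL. */
static inline bool rnbd_flags_supported(uint32_t flags)
{
	if (rnbd_op(flags) >= RNBD_OP_LAST)
		return false;
	if (rnbd_flags(flags) & ~(uint32_t)RNBD_F_ALL)
		return false;
	return true;
}
```

With these definitions a synchronous FUA write is encoded as
RNBD_OP_WRITE | RNBD_F_SYNC | RNBD_F_FUA, i.e. 0x301, and any value with a
bit set above bit 9 is rejected by rnbd_flags_supported().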

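Since the rnbd-proto.h structures go over the wire, their layout matters. The
sketch below mirrors two of them in userspace to check offsets and sizes at
compile time. It is only an illustration under stated assumptions: the __le*
types are replaced by same-width <stdint.h> types, the header's padding field
is renamed, and the helper rnbd_msg_io_tail_padding() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct rnbd_msg_hdr (padding field renamed). */
struct rnbd_msg_hdr {
	uint16_t type;
	uint16_t padding;
};

/* Userspace mirror of struct rnbd_msg_open_rsp. */
struct rnbd_msg_open_rsp {
	struct rnbd_msg_hdr hdr;
	uint32_t device_id;
	uint64_t nsectors;
	uint32_t max_hw_sectors;
	uint32_t max_write_same_sectors;
	uint32_t max_discard_sectors;
	uint32_t discard_granularity;
	uint32_t discard_alignment;
	uint16_t physical_block_size;
	uint16_t logical_block_size;
	uint16_t max_segments;
	uint16_t secure_discard;
	uint8_t  rotational;
	uint8_t  reserved[11];
};

/* Userspace mirror of struct rnbd_msg_io. */
struct rnbd_msg_io {
	struct rnbd_msg_hdr hdr;
	uint32_t device_id;
	uint64_t sector;
	uint32_t rw;
	uint32_t bi_size;
	uint16_t prio;
};

/* The open response is fully covered by explicit fields: 56 bytes. */
static_assert(sizeof(struct rnbd_msg_hdr) == 4, "hdr is 4 bytes");
static_assert(offsetof(struct rnbd_msg_open_rsp, nsectors) == 8,
	      "nsectors is naturally aligned");
static_assert(sizeof(struct rnbd_msg_open_rsp) == 56,
	      "open_rsp has no implicit padding");

/* The explicit fields of the IO message end at byte 26. */
static_assert(offsetof(struct rnbd_msg_io, prio) == 24, "prio offset");

/* Bytes the compiler appends after prio; the amount depends on the ABI. */
static size_t rnbd_msg_io_tail_padding(void)
{
	return sizeof(struct rnbd_msg_io) -
	       (offsetof(struct rnbd_msg_io, prio) + sizeof(uint16_t));
}
```

The explicit fields of rnbd_msg_io stop at byte 26, so whatever follows is
compiler-inserted tail padding whose size can differ between ABIs. Whether
that matters for the protocol is a question for the authors; the sketch only
makes the layout visible.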


* [PATCH v6 16/25] rnbd: client: private header with client structs and functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (14 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 15/25] rnbd: private headers with rnbd protocol structs and helpers Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 22:37   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 17/25] rnbd: client: main functionality Jack Wang
                   ` (10 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This header describes main structs and functions used by rnbd-client
module, mainly for managing RNBD sessions and mapped block devices,
creating and destroying sysfs entries.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/block/rnbd/rnbd-clt.h | 151 ++++++++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-clt.h

diff --git a/drivers/block/rnbd/rnbd-clt.h b/drivers/block/rnbd/rnbd-clt.h
new file mode 100644
index 000000000000..a9ff25e36fdf
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-clt.h
@@ -0,0 +1,151 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#ifndef RNBD_CLT_H
+#define RNBD_CLT_H
+
+#include <linux/wait.h>
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/blk-mq.h>
+#include <linux/refcount.h>
+
+#include "rtrs.h"
+#include "rnbd-proto.h"
+#include "rnbd-log.h"
+
+#define BMAX_SEGMENTS 29
+#define RECONNECT_DELAY 30
+#define MAX_RECONNECTS -1
+
+enum rnbd_clt_dev_state {
+	DEV_STATE_INIT,
+	DEV_STATE_MAPPED,
+	DEV_STATE_MAPPED_DISCONNECTED,
+	DEV_STATE_UNMAPPED,
+};
+
+struct rnbd_iu_comp {
+	wait_queue_head_t wait;
+	int errno;
+};
+
+struct rnbd_iu {
+	union {
+		struct request *rq; /* for block io */
+		void *buf; /* for user messages */
+	};
+	struct rtrs_permit	*permit;
+	union {
+		/* use to send msg associated with a dev */
+		struct rnbd_clt_dev *dev;
+		/* use to send msg associated with a sess */
+		struct rnbd_clt_session *sess;
+	};
+	blk_status_t		status;
+	struct scatterlist	sglist[BMAX_SEGMENTS];
+	struct work_struct	work;
+	int			errno;
+	struct rnbd_iu_comp	comp;
+	atomic_t		refcount;
+};
+
+struct rnbd_cpu_qlist {
+	struct list_head	requeue_list;
+	spinlock_t		requeue_lock;
+	unsigned int		cpu;
+};
+
+struct rnbd_clt_session {
+	struct list_head        list;
+	struct rtrs_clt        *rtrs;
+	wait_queue_head_t       rtrs_waitq;
+	bool                    rtrs_ready;
+	struct rnbd_cpu_qlist	__percpu
+				*cpu_queues;
+	DECLARE_BITMAP(cpu_queues_bm, NR_CPUS);
+	int	__percpu	*cpu_rr; /* per-cpu var for CPU round-robin */
+	atomic_t		busy;
+	int			queue_depth;
+	u32			max_io_size;
+	struct blk_mq_tag_set	tag_set;
+	struct mutex		lock; /* protects state and devs_list */
+	struct list_head        devs_list; /* list of struct rnbd_clt_dev */
+	refcount_t		refcount;
+	char			sessname[NAME_MAX];
+	u8			ver; /* protocol version */
+};
+
+/*
+ * Submission queues.
+ */
+struct rnbd_queue {
+	struct list_head	requeue_list;
+	unsigned long		in_list;
+	struct rnbd_clt_dev	*dev;
+	struct blk_mq_hw_ctx	*hctx;
+};
+
+struct rnbd_clt_dev {
+	struct rnbd_clt_session	*sess;
+	struct request_queue	*queue;
+	struct rnbd_queue	*hw_queues;
+	u32			device_id;
+	/* local Idr index - used to track minor number allocations. */
+	u32			clt_device_id;
+	struct mutex		lock;
+	enum rnbd_clt_dev_state	dev_state;
+	char			pathname[NAME_MAX];
+	enum rnbd_access_mode	access_mode;
+	bool			read_only;
+	bool			rotational;
+	u32			max_hw_sectors;
+	u32			max_write_same_sectors;
+	u32			max_discard_sectors;
+	u32			discard_granularity;
+	u32			discard_alignment;
+	u16			secure_discard;
+	u16			physical_block_size;
+	u16			logical_block_size;
+	u16			max_segments;
+	size_t			nsectors;
+	u64			size;		/* device size in bytes */
+	struct list_head        list;
+	struct gendisk		*gd;
+	struct kobject		kobj;
+	char			blk_symlink_name[NAME_MAX];
+	refcount_t		refcount;
+	struct work_struct	unmap_on_rmmod_work;
+};
+
+/* rnbd-clt.c */
+
+struct rnbd_clt_dev *rnbd_clt_map_device(const char *sessname,
+					   struct rtrs_addr *paths,
+					   size_t path_cnt,
+					   const char *pathname,
+					   enum rnbd_access_mode access_mode);
+int rnbd_clt_unmap_device(struct rnbd_clt_dev *dev, bool force,
+			   const struct attribute *sysfs_self);
+
+int rnbd_clt_remap_device(struct rnbd_clt_dev *dev);
+int rnbd_clt_resize_disk(struct rnbd_clt_dev *dev, size_t newsize);
+
+/* rnbd-clt-sysfs.c */
+
+int rnbd_clt_create_sysfs_files(void);
+
+void rnbd_clt_destroy_sysfs_files(void);
+void rnbd_clt_destroy_default_group(void);
+
+void rnbd_clt_remove_dev_symlink(struct rnbd_clt_dev *dev);
+
+#endif /* RNBD_CLT_H */
-- 
2.17.1


* [PATCH v6 17/25] rnbd: client: main functionality
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (15 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 16/25] rnbd: client: private header with client structs and functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-02 23:55   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 18/25] rnbd: client: sysfs interface functions Jack Wang
                   ` (9 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the main functionality of the rnbd-client module, which provides
an interface to map a remote device as a local block device /dev/rnbd<N>
and feeds RTRS with IO requests.
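
As a rough illustration of how IO submission is throttled by permits, here
is a single-threaded userspace model. It is not the driver code: struct
sess_sketch and the sess_*() helpers are hypothetical names, and the atomics,
memory barriers and per-CPU requeue lists used by the driver are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a session: a permit pool plus a busy counter. */
struct sess_sketch {
	int free_permits;	/* permits still available */
	int busy;		/* permits currently held (in-flight requests) */
	int parked;		/* queues waiting to be rerun */
};

static void sess_init(struct sess_sketch *s, int permits)
{
	assert(permits > 0);
	s->free_permits = permits;
	s->busy = 0;
	s->parked = 0;
}

/* Take a permit without waiting; on failure the caller's queue is parked. */
static bool sess_get_permit(struct sess_sketch *s)
{
	if (s->free_permits == 0) {
		s->parked++;
		return false;
	}
	s->free_permits--;
	s->busy++;
	return true;
}

/*
 * Return a permit.  One parked queue is always rerun; if the session
 * became idle (busy == 0), all remaining parked queues are rerun as
 * well, since nobody else is left to wake them.  Returns the number of
 * queues rerun.
 */
static int sess_put_permit(struct sess_sketch *s)
{
	int rerun = 0;

	assert(s->busy > 0);
	s->free_permits++;
	s->busy--;
	do {
		if (s->parked == 0)
			break;
		s->parked--;
		rerun++;
	} while (s->busy == 0);

	return rerun;
}
```

In this model, with one permit and three parked queues, returning the permit
leaves the session idle, so sess_put_permit() reruns all three queues. While
other permits are still held it reruns only one.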

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/block/rnbd/rnbd-clt.c | 1743 +++++++++++++++++++++++++++++++++
 1 file changed, 1743 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-clt.c
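
The wrap-around search described in the rnbd_get_cpu_qlist() comment of this
patch can be modeled in userspace roughly as follows. This is only a sketch
under assumptions: a single 64-bit word stands in for the cpu_queues_bm
bitmap, and next_set_bit() and find_bit_wrap() are hypothetical names, not
kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: index of the first set bit in [from, to),
 * or `to` if no bit is set in that range.
 */
static unsigned int next_set_bit(uint64_t map, unsigned int to,
				 unsigned int from)
{
	unsigned int bit;

	assert(to <= 64);
	for (bit = from; bit < to; bit++)
		if (map & (UINT64_C(1) << bit))
			return bit;

	return to;
}

/*
 * Search [start, nbits) first ("first half"), then wrap around and
 * search [0, start) ("second half").  Returns the bit index found,
 * or -1 if no bit is set at all.
 */
static int find_bit_wrap(uint64_t map, unsigned int nbits,
			 unsigned int start)
{
	unsigned int bit;

	bit = next_set_bit(map, nbits, start);
	if (bit < nbits)
		return (int)bit;
	if (start != 0) {
		bit = next_set_bit(map, start, 0);
		if (bit < start)
			return (int)bit;
	}

	return -1;
}
```

With bits 1 and 3 set in an 8-bit map, a search starting at 2 finds bit 3,
while a search starting at 4 wraps around and finds bit 1. Starting each
search just past the last serviced CPU is what gives the round-robin order.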

diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c
new file mode 100644
index 000000000000..4d2c8475a6e5
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-clt.c
@@ -0,0 +1,1743 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/scatterlist.h>
+#include <linux/idr.h>
+
+#include "rnbd-clt.h"
+
+MODULE_DESCRIPTION("InfiniBand Network Block Device Client");
+MODULE_LICENSE("GPL");
+
+static int rnbd_client_major;
+static DEFINE_IDA(index_ida);
+static DEFINE_MUTEX(ida_lock);
+static DEFINE_MUTEX(sess_lock);
+static LIST_HEAD(sess_list);
+
+/*
+ * Maximum number of partitions an instance can have.
+ * 6 bits = 64 minors = 63 partitions (one minor is used for the device itself)
+ */
+#define RNBD_PART_BITS		6
+
+static inline bool rnbd_clt_get_sess(struct rnbd_clt_session *sess)
+{
+	return refcount_inc_not_zero(&sess->refcount);
+}
+
+static void free_sess(struct rnbd_clt_session *sess);
+
+static void rnbd_clt_put_sess(struct rnbd_clt_session *sess)
+{
+	might_sleep();
+
+	if (refcount_dec_and_test(&sess->refcount))
+		free_sess(sess);
+}
+
+static inline bool rnbd_clt_dev_is_mapped(struct rnbd_clt_dev *dev)
+{
+	return dev->dev_state == DEV_STATE_MAPPED;
+}
+
+static void rnbd_clt_put_dev(struct rnbd_clt_dev *dev)
+{
+	might_sleep();
+
+	if (refcount_dec_and_test(&dev->refcount)) {
+		mutex_lock(&ida_lock);
+		ida_simple_remove(&index_ida, dev->clt_device_id);
+		mutex_unlock(&ida_lock);
+		kfree(dev->hw_queues);
+		rnbd_clt_put_sess(dev->sess);
+		kfree(dev);
+	}
+}
+
+static inline bool rnbd_clt_get_dev(struct rnbd_clt_dev *dev)
+{
+	return refcount_inc_not_zero(&dev->refcount);
+}
+
+static int rnbd_clt_set_dev_attr(struct rnbd_clt_dev *dev,
+				  const struct rnbd_msg_open_rsp *rsp)
+{
+	struct rnbd_clt_session *sess = dev->sess;
+
+	if (unlikely(!rsp->logical_block_size))
+		return -EINVAL;
+
+	dev->device_id		    = le32_to_cpu(rsp->device_id);
+	dev->nsectors		    = le64_to_cpu(rsp->nsectors);
+	dev->logical_block_size	    = le16_to_cpu(rsp->logical_block_size);
+	dev->physical_block_size    = le16_to_cpu(rsp->physical_block_size);
+	dev->max_write_same_sectors = le32_to_cpu(rsp->max_write_same_sectors);
+	dev->max_discard_sectors    = le32_to_cpu(rsp->max_discard_sectors);
+	dev->discard_granularity    = le32_to_cpu(rsp->discard_granularity);
+	dev->discard_alignment	    = le32_to_cpu(rsp->discard_alignment);
+	dev->secure_discard	    = le16_to_cpu(rsp->secure_discard);
+	dev->rotational		    = rsp->rotational;
+
+	dev->max_hw_sectors = sess->max_io_size / dev->logical_block_size;
+	dev->max_segments = BMAX_SEGMENTS;
+
+	dev->max_hw_sectors = min_t(u32, dev->max_hw_sectors,
+				    le32_to_cpu(rsp->max_hw_sectors));
+	dev->max_segments = min_t(u16, dev->max_segments,
+				  le16_to_cpu(rsp->max_segments));
+
+	return 0;
+}
+
+static int rnbd_clt_change_capacity(struct rnbd_clt_dev *dev,
+				     size_t new_nsectors)
+{
+	int err = 0;
+
+	rnbd_clt_info(dev, "Device size changed from %zu to %zu sectors\n",
+		       dev->nsectors, new_nsectors);
+	dev->nsectors = new_nsectors;
+	set_capacity(dev->gd,
+		     dev->nsectors * (dev->logical_block_size /
+				      SECTOR_SIZE));
+	err = revalidate_disk(dev->gd);
+	if (err)
+		rnbd_clt_err(dev,
+			      "Failed to change device size to %zu sectors, err: %d\n",
+			      new_nsectors, err);
+	return err;
+}
+
+static int process_msg_open_rsp(struct rnbd_clt_dev *dev,
+				struct rnbd_msg_open_rsp *rsp)
+{
+	int err = 0;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state == DEV_STATE_UNMAPPED) {
+		rnbd_clt_info(dev,
+			       "Ignoring Open-Response message from server for unmapped device\n");
+		err = -ENOENT;
+		goto out;
+	}
+	if (dev->dev_state == DEV_STATE_MAPPED_DISCONNECTED) {
+		u64 nsectors = le64_to_cpu(rsp->nsectors);
+
+		/*
+		 * If the device was remapped and the size changed in the
+		 * meantime we need to revalidate it
+		 */
+		if (dev->nsectors != nsectors)
+			rnbd_clt_change_capacity(dev, nsectors);
+		rnbd_clt_info(dev, "Device online, device remapped successfully\n");
+	}
+	err = rnbd_clt_set_dev_attr(dev, rsp);
+	if (unlikely(err))
+		goto out;
+	dev->dev_state = DEV_STATE_MAPPED;
+
+out:
+	mutex_unlock(&dev->lock);
+
+	return err;
+}
+
+int rnbd_clt_resize_disk(struct rnbd_clt_dev *dev, size_t newsize)
+{
+	int ret = 0;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state != DEV_STATE_MAPPED) {
+		pr_err("Failed to set new size of the device, device is not opened\n");
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = rnbd_clt_change_capacity(dev, newsize);
+
+out:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+static inline void rnbd_clt_dev_requeue(struct rnbd_queue *q)
+{
+	if (WARN_ON(!q->hctx))
+		return;
+
+	/* We can come here from interrupt, thus async=true */
+	blk_mq_run_hw_queue(q->hctx, true);
+}
+
+enum {
+	RNBD_DELAY_10ms   = 10,
+	RNBD_DELAY_IFBUSY = -1,
+};
+
+/**
+ * rnbd_get_cpu_qlist() - finds a list with HW queues to be rerun
+ * @sess:	Session to find a queue for
+ * @cpu:	Cpu to start the search from
+ *
+ * Description:
+ *     Each CPU has a list of HW queues, which need to be rerun.  If a list
+ *     is not empty - it is marked with a bit.  This function finds the first
+ *     set bit in a bitmap and returns the corresponding CPU list.
+ */
+static struct rnbd_cpu_qlist *
+rnbd_get_cpu_qlist(struct rnbd_clt_session *sess, int cpu)
+{
+	int bit;
+
+	/* First half */
+	bit = find_next_bit(sess->cpu_queues_bm, nr_cpu_ids, cpu);
+	if (bit < nr_cpu_ids) {
+		return per_cpu_ptr(sess->cpu_queues, bit);
+	} else if (cpu != 0) {
+		/* Second half */
+		bit = find_next_bit(sess->cpu_queues_bm, cpu, 0);
+		if (bit < cpu)
+			return per_cpu_ptr(sess->cpu_queues, bit);
+	}
+
+	return NULL;
+}
+
+static inline int nxt_cpu(int cpu)
+{
+	return (cpu + 1) % nr_cpu_ids;
+}
+
+/**
+ * rnbd_rerun_if_needed() - rerun next queue marked as stopped
+ * @sess:	Session to rerun a queue on
+ *
+ * Description:
+ *     Each CPU has its own list of HW queues, which should be rerun.
+ *     The function finds such a list with HW queues, takes the list lock,
+ *     picks up the first HW queue out of the list and requeues it.
+ *
+ * Return:
+ *     True if the queue was requeued, false otherwise.
+ *
+ * Context:
+ *     Does not matter.
+ */
+static inline bool rnbd_rerun_if_needed(struct rnbd_clt_session *sess)
+{
+	struct rnbd_queue *q = NULL;
+	struct rnbd_cpu_qlist *cpu_q;
+	unsigned long flags;
+	int *cpup;
+
+	/*
+	 * To keep fairness and not to let other queues starve we always
+	 * try to wake up someone else in round-robin manner.  That of course
+	 * increases latency but queues always have a chance to be executed.
+	 */
+	cpup = get_cpu_ptr(sess->cpu_rr);
+	for (cpu_q = rnbd_get_cpu_qlist(sess, nxt_cpu(*cpup)); cpu_q;
+	     cpu_q = rnbd_get_cpu_qlist(sess, nxt_cpu(cpu_q->cpu))) {
+		if (!spin_trylock_irqsave(&cpu_q->requeue_lock, flags))
+			continue;
+		if (likely(test_bit(cpu_q->cpu, sess->cpu_queues_bm))) {
+			q = list_first_entry_or_null(&cpu_q->requeue_list,
+						     typeof(*q), requeue_list);
+			if (WARN_ON(!q))
+				goto clear_bit;
+			list_del_init(&q->requeue_list);
+			clear_bit_unlock(0, &q->in_list);
+
+			if (list_empty(&cpu_q->requeue_list)) {
+				/* Clear bit if nothing is left */
+clear_bit:
+				clear_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			}
+		}
+		spin_unlock_irqrestore(&cpu_q->requeue_lock, flags);
+
+		if (q)
+			break;
+	}
+
+	/*
+	 * Saves the CPU that is going to be requeued on the per-cpu var. Just
+	 * incrementing it doesn't work because rnbd_get_cpu_qlist() will
+	 * always return the first CPU with something on the queue list when the
+	 * value stored on the var is greater than the last CPU with something
+	 * on the list.
+	 */
+	if (cpu_q)
+		*cpup = cpu_q->cpu;
+	put_cpu_var(sess->cpu_rr);
+
+	if (q)
+		rnbd_clt_dev_requeue(q);
+
+	return !!q;
+}
+
+/**
+ * rnbd_rerun_all_if_idle() - rerun all queues left in the list if
+ *				 session is idling (there are no requests
+ *				 in-flight).
+ * @sess:	Session to rerun the queues on
+ *
+ * Description:
+ *     This function tries to rerun all stopped queues if there are no
+ *     requests in-flight anymore.  It addresses the case where the number
+ *     of tags is smaller than the number of queues (hctx) that are stopped
+ *     and put to sleep.  If the last permit that has just been put does not
+ *     wake up all remaining queues (hctxs), IO requests hang forever.
+ *
+ *     That can happen when all permits, say N, have been exhausted from
+ *     one CPU, and we have many block devices per session, say M.
+ *     Each block device has its own queue (hctx) for each CPU, so eventually
+ *     we can put that number of queues (hctxs) to sleep: M x nr_cpu_ids.
+ *     If N < M x nr_cpu_ids, we eventually get an IO hang.
+ *
+ *     To avoid this hang, the last caller of rnbd_put_permit() (the one who
+ *     observes sess->busy == 0) must wake up all remaining queues.
+ *
+ * Context:
+ *     Does not matter.
+ */
+static inline void rnbd_rerun_all_if_idle(struct rnbd_clt_session *sess)
+{
+	bool requeued;
+
+	do {
+		requeued = rnbd_rerun_if_needed(sess);
+	} while (atomic_read(&sess->busy) == 0 && requeued);
+}
+
+static struct rtrs_permit *rnbd_get_permit(struct rnbd_clt_session *sess,
+					     enum rtrs_clt_con_type con_type,
+					     int wait)
+{
+	struct rtrs_permit *permit;
+
+	permit = rtrs_clt_get_permit(sess->rtrs, con_type,
+				      wait ? RTRS_PERMIT_WAIT :
+				      RTRS_PERMIT_NOWAIT);
+	if (likely(permit))
+		/* We have a subtle rare case here, when all permits can be
+		 * consumed before busy counter increased.  This is safe,
+		 * because loser will get NULL as a permit, observe 0 busy
+		 * counter and immediately restart the queue himself.
+		 */
+		atomic_inc(&sess->busy);
+
+	return permit;
+}
+
+static void rnbd_put_permit(struct rnbd_clt_session *sess,
+			     struct rtrs_permit *permit)
+{
+	rtrs_clt_put_permit(sess->rtrs, permit);
+	atomic_dec(&sess->busy);
+	/* Paired with rnbd_clt_dev_add_to_requeue().  Decrement first
+	 * and then check queue bits.
+	 */
+	smp_mb__after_atomic();
+	rnbd_rerun_all_if_idle(sess);
+}
+
+static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
+				     enum rtrs_clt_con_type con_type,
+				     int wait)
+{
+	struct rnbd_iu *iu;
+	struct rtrs_permit *permit;
+
+	permit = rnbd_get_permit(sess, con_type,
+				  wait ? RTRS_PERMIT_WAIT :
+				  RTRS_PERMIT_NOWAIT);
+	if (unlikely(!permit))
+		return NULL;
+	iu = rtrs_permit_to_pdu(permit);
+	iu->permit = permit;
+	/* yes, rtrs_permit_from_pdu() can be nice here,
+	 * but also we have to think about MQ mode
+	 */
+	/*
+	 * 1st reference is dropped after finishing sending a "user" message,
+	 * 2nd reference is dropped after confirmation with the response is
+	 * returned.
+	 * 1st and 2nd can happen in any order, so the rnbd_iu should be
+	 * released (rtrs_permit returned to rtrs) only after both
+	 * are finished.
+	 */
+	atomic_set(&iu->refcount, 2);
+	init_waitqueue_head(&iu->comp.wait);
+	iu->comp.errno = INT_MAX;
+
+	return iu;
+}
+
+static void rnbd_put_iu(struct rnbd_clt_session *sess, struct rnbd_iu *iu)
+{
+	if (atomic_dec_and_test(&iu->refcount))
+		rnbd_put_permit(sess, iu->permit);
+}
+
+static void rnbd_softirq_done_fn(struct request *rq)
+{
+	struct rnbd_clt_dev *dev	= rq->rq_disk->private_data;
+	struct rnbd_clt_session *sess	= dev->sess;
+	struct rnbd_iu *iu;
+
+	iu = blk_mq_rq_to_pdu(rq);
+	rnbd_put_permit(sess, iu->permit);
+	blk_mq_end_request(rq, iu->status);
+}
+
+static void msg_io_conf(void *priv, int errno)
+{
+	struct rnbd_iu *iu = priv;
+	struct rnbd_clt_dev *dev = iu->dev;
+	struct request *rq = iu->rq;
+
+	iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
+
+	blk_mq_complete_request(rq);
+
+	if (errno)
+		rnbd_clt_info_rl(dev, "%s I/O failed with err: %d\n",
+				  rq_data_dir(rq) == READ ? "read" : "write",
+				  errno);
+}
+
+static void wake_up_iu_comp(struct rnbd_iu *iu, int errno)
+{
+	iu->comp.errno = errno;
+	wake_up(&iu->comp.wait);
+}
+
+static void msg_conf(void *priv, int errno)
+{
+	struct rnbd_iu *iu = priv;
+
+	iu->errno = errno;
+	schedule_work(&iu->work);
+}
+
+enum {
+	NO_WAIT = 0,
+	WAIT    = 1
+};
+
+static int send_usr_msg(struct rtrs_clt *rtrs, int dir,
+			struct rnbd_iu *iu, struct kvec *vec, size_t nr,
+			size_t len, struct scatterlist *sg, unsigned int sg_len,
+			void (*conf)(struct work_struct *work),
+			int *errno, bool wait)
+{
+	int err;
+
+	INIT_WORK(&iu->work, conf);
+	err = rtrs_clt_request(dir, msg_conf, rtrs, iu->permit,
+				iu, vec, nr, len, sg, sg_len);
+	if (!err && wait) {
+		wait_event(iu->comp.wait, iu->comp.errno != INT_MAX);
+		*errno = iu->comp.errno;
+	} else {
+		*errno = 0;
+	}
+
+	return err;
+}
+
+static void msg_close_conf(struct work_struct *work)
+{
+	struct rnbd_iu *iu = container_of(work, struct rnbd_iu, work);
+	struct rnbd_clt_dev *dev = iu->dev;
+
+	wake_up_iu_comp(iu, iu->errno);
+	rnbd_put_iu(dev->sess, iu);
+	rnbd_clt_put_dev(dev);
+}
+
+static int send_msg_close(struct rnbd_clt_dev *dev, u32 device_id, bool wait)
+{
+	struct rnbd_clt_session *sess = dev->sess;
+	struct rnbd_msg_close msg;
+	struct rnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	iu = rnbd_get_iu(sess, RTRS_USR_CON, RTRS_PERMIT_WAIT);
+	if (unlikely(!iu))
+		return -ENOMEM;
+
+	iu->buf = NULL;
+	iu->dev = dev;
+
+	sg_mark_end(&iu->sglist[0]);
+
+	msg.hdr.type	= cpu_to_le16(RNBD_MSG_CLOSE);
+	msg.device_id	= cpu_to_le32(device_id);
+
+	WARN_ON(!rnbd_clt_get_dev(dev));
+	err = send_usr_msg(sess->rtrs, WRITE, iu, &vec, 1, 0, NULL, 0,
+			   msg_close_conf, &errno, wait);
+	if (unlikely(err)) {
+		rnbd_clt_put_dev(dev);
+		rnbd_put_iu(sess, iu);
+	} else {
+		err = errno;
+	}
+
+	rnbd_put_iu(sess, iu);
+	return err;
+}
+
+static void msg_open_conf(struct work_struct *work)
+{
+	struct rnbd_iu *iu = container_of(work, struct rnbd_iu, work);
+	struct rnbd_msg_open_rsp *rsp = iu->buf;
+	struct rnbd_clt_dev *dev = iu->dev;
+	int errno = iu->errno;
+
+	if (errno) {
+		rnbd_clt_err(dev,
+			      "Opening failed, server responded: %d\n",
+			      errno);
+	} else {
+		errno = process_msg_open_rsp(dev, rsp);
+		if (unlikely(errno)) {
+			u32 device_id = le32_to_cpu(rsp->device_id);
+			/*
+			 * If server thinks it's fine, but we fail to process
+			 * then be nice and send a close to server.
+			 */
+			(void)send_msg_close(dev, device_id, NO_WAIT);
+		}
+	}
+	kfree(rsp);
+	wake_up_iu_comp(iu, errno);
+	rnbd_put_iu(dev->sess, iu);
+	rnbd_clt_put_dev(dev);
+}
+
+static void msg_sess_info_conf(struct work_struct *work)
+{
+	struct rnbd_iu *iu = container_of(work, struct rnbd_iu, work);
+	struct rnbd_msg_sess_info_rsp *rsp = iu->buf;
+	struct rnbd_clt_session *sess = iu->sess;
+
+	if (likely(!iu->errno))
+		sess->ver = min_t(u8, rsp->ver, RNBD_PROTO_VER_MAJOR);
+
+	kfree(rsp);
+	wake_up_iu_comp(iu, iu->errno);
+	rnbd_put_iu(sess, iu);
+	rnbd_clt_put_sess(sess);
+}
+
+static int send_msg_open(struct rnbd_clt_dev *dev, bool wait)
+{
+	struct rnbd_clt_session *sess = dev->sess;
+	struct rnbd_msg_open_rsp *rsp;
+	struct rnbd_msg_open msg;
+	struct rnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+	if (unlikely(!rsp))
+		return -ENOMEM;
+
+	iu = rnbd_get_iu(sess, RTRS_USR_CON, RTRS_PERMIT_WAIT);
+	if (unlikely(!iu)) {
+		kfree(rsp);
+		return -ENOMEM;
+	}
+
+	iu->buf = rsp;
+	iu->dev = dev;
+
+	sg_init_one(iu->sglist, rsp, sizeof(*rsp));
+
+	msg.hdr.type	= cpu_to_le16(RNBD_MSG_OPEN);
+	msg.access_mode	= dev->access_mode;
+	strlcpy(msg.dev_name, dev->pathname, sizeof(msg.dev_name));
+
+	WARN_ON(!rnbd_clt_get_dev(dev));
+	err = send_usr_msg(sess->rtrs, READ, iu,
+			   &vec, 1, sizeof(*rsp), iu->sglist, 1,
+			   msg_open_conf, &errno, wait);
+	if (unlikely(err)) {
+		rnbd_clt_put_dev(dev);
+		rnbd_put_iu(sess, iu);
+		kfree(rsp);
+	} else {
+		err = errno;
+	}
+
+	rnbd_put_iu(sess, iu);
+	return err;
+}
+
+static int send_msg_sess_info(struct rnbd_clt_session *sess, bool wait)
+{
+	struct rnbd_msg_sess_info_rsp *rsp;
+	struct rnbd_msg_sess_info msg;
+	struct rnbd_iu *iu;
+	struct kvec vec = {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	int err, errno;
+
+	rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+	if (unlikely(!rsp))
+		return -ENOMEM;
+
+	iu = rnbd_get_iu(sess, RTRS_USR_CON, RTRS_PERMIT_WAIT);
+	if (unlikely(!iu)) {
+		kfree(rsp);
+		return -ENOMEM;
+	}
+
+	iu->buf = rsp;
+	iu->sess = sess;
+
+	sg_init_one(iu->sglist, rsp, sizeof(*rsp));
+
+	msg.hdr.type = cpu_to_le16(RNBD_MSG_SESS_INFO);
+	msg.ver      = RNBD_PROTO_VER_MAJOR;
+
+	if (unlikely(!rnbd_clt_get_sess(sess))) {
+		/*
+		 * That can happen only in one case, when RTRS has reestablished
+		 * the connection and link_ev() is called, but session is almost
+		 * dead, last reference on session is put and caller is waiting
+		 * for RTRS to close everything.
+		 */
+		err = -ENODEV;
+		goto put_iu;
+	}
+	err = send_usr_msg(sess->rtrs, READ, iu,
+			   &vec, 1, sizeof(*rsp), iu->sglist, 1,
+			   msg_sess_info_conf, &errno, wait);
+	if (unlikely(err)) {
+		rnbd_clt_put_sess(sess);
+put_iu:
+		rnbd_put_iu(sess, iu);
+		kfree(rsp);
+	} else {
+		err = errno;
+	}
+
+	rnbd_put_iu(sess, iu);
+	return err;
+}
+
+static void set_dev_states_to_disconnected(struct rnbd_clt_session *sess)
+{
+	struct rnbd_clt_dev *dev;
+
+	mutex_lock(&sess->lock);
+	list_for_each_entry(dev, &sess->devs_list, list) {
+		rnbd_clt_err(dev, "Device disconnected.\n");
+
+		mutex_lock(&dev->lock);
+		if (dev->dev_state == DEV_STATE_MAPPED)
+			dev->dev_state = DEV_STATE_MAPPED_DISCONNECTED;
+		mutex_unlock(&dev->lock);
+	}
+	mutex_unlock(&sess->lock);
+}
+
+static void remap_devs(struct rnbd_clt_session *sess)
+{
+	struct rnbd_clt_dev *dev;
+	struct rtrs_attrs attrs;
+	int err;
+
+	/*
+	 * Careful here: we are called from RTRS link event directly,
+	 * thus we can't send any RTRS request and wait for response
+	 * or RTRS will not be able to complete request with failure
+	 * if something goes wrong (failing of outstanding requests
+	 * happens exactly from the context where we are blocking now).
+	 *
+	 * So to avoid deadlocks each usr message sent from here must
+	 * be asynchronous.
+	 */
+
+	err = send_msg_sess_info(sess, NO_WAIT);
+	if (unlikely(err)) {
+		pr_err("send_msg_sess_info(\"%s\"): %d\n", sess->sessname, err);
+		return;
+	}
+
+	rtrs_clt_query(sess->rtrs, &attrs);
+	mutex_lock(&sess->lock);
+	sess->max_io_size = attrs.max_io_size;
+
+	list_for_each_entry(dev, &sess->devs_list, list) {
+		bool skip;
+
+		mutex_lock(&dev->lock);
+		skip = (dev->dev_state == DEV_STATE_INIT);
+		mutex_unlock(&dev->lock);
+		if (skip)
+			/*
+			 * When device is establishing connection for the first
+			 * time - do not remap, it will be closed soon.
+			 */
+			continue;
+
+		rnbd_clt_info(dev, "session reconnected, remapping device\n");
+		err = send_msg_open(dev, NO_WAIT);
+		if (unlikely(err)) {
+			rnbd_clt_err(dev, "send_msg_open(): %d\n", err);
+			break;
+		}
+	}
+	mutex_unlock(&sess->lock);
+}
+
+static void rnbd_clt_link_ev(void *priv, enum rtrs_clt_link_ev ev)
+{
+	struct rnbd_clt_session *sess = priv;
+
+	switch (ev) {
+	case RTRS_CLT_LINK_EV_DISCONNECTED:
+		set_dev_states_to_disconnected(sess);
+		break;
+	case RTRS_CLT_LINK_EV_RECONNECTED:
+		remap_devs(sess);
+		break;
+	default:
+		pr_err("Unknown session event received (%d), session: %s\n",
+		       ev, sess->sessname);
+	}
+}
+
+static void rnbd_init_cpu_qlists(struct rnbd_cpu_qlist __percpu *cpu_queues)
+{
+	unsigned int cpu;
+	struct rnbd_cpu_qlist *cpu_q;
+
+	for_each_possible_cpu(cpu) {
+		cpu_q = per_cpu_ptr(cpu_queues, cpu);
+
+		cpu_q->cpu = cpu;
+		INIT_LIST_HEAD(&cpu_q->requeue_list);
+		spin_lock_init(&cpu_q->requeue_lock);
+	}
+}
+
+static void destroy_mq_tags(struct rnbd_clt_session *sess)
+{
+	if (sess->tag_set.tags)
+		blk_mq_free_tag_set(&sess->tag_set);
+}
+
+static inline void wake_up_rtrs_waiters(struct rnbd_clt_session *sess)
+{
+	sess->rtrs_ready = true;
+	wake_up_all(&sess->rtrs_waitq);
+}
+
+static void close_rtrs(struct rnbd_clt_session *sess)
+{
+	might_sleep();
+
+	if (!IS_ERR_OR_NULL(sess->rtrs)) {
+		rtrs_clt_close(sess->rtrs);
+		sess->rtrs = NULL;
+		wake_up_rtrs_waiters(sess);
+	}
+}
+
+static void free_sess(struct rnbd_clt_session *sess)
+{
+	WARN_ON(!list_empty(&sess->devs_list));
+
+	might_sleep();
+
+	close_rtrs(sess);
+	destroy_mq_tags(sess);
+	if (!list_empty(&sess->list)) {
+		mutex_lock(&sess_lock);
+		list_del(&sess->list);
+		mutex_unlock(&sess_lock);
+	}
+	free_percpu(sess->cpu_queues);
+	free_percpu(sess->cpu_rr);
+	kfree(sess);
+}
+
+static struct rnbd_clt_session *alloc_sess(const char *sessname)
+{
+	struct rnbd_clt_session *sess;
+	int err, cpu;
+
+	sess = kzalloc_node(sizeof(*sess), GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!sess)) {
+		pr_err("Failed to create session %s, allocating session struct failed\n",
+		       sessname);
+		return ERR_PTR(-ENOMEM);
+	}
+	strlcpy(sess->sessname, sessname, sizeof(sess->sessname));
+	atomic_set(&sess->busy, 0);
+	mutex_init(&sess->lock);
+	INIT_LIST_HEAD(&sess->devs_list);
+	INIT_LIST_HEAD(&sess->list);
+	bitmap_zero(sess->cpu_queues_bm, NR_CPUS);
+	init_waitqueue_head(&sess->rtrs_waitq);
+	refcount_set(&sess->refcount, 1);
+
+	sess->cpu_queues = alloc_percpu(struct rnbd_cpu_qlist);
+	if (unlikely(!sess->cpu_queues)) {
+		pr_err("Failed to create session %s, alloc of percpu var (cpu_queues) failed\n",
+		       sessname);
+		err = -ENOMEM;
+		goto err;
+	}
+	rnbd_init_cpu_qlists(sess->cpu_queues);
+
+	/*
+	 * That is a simple percpu variable which stores cpu indices, which are
+	 * incremented on each access.  We need that for the sake of fairness
+	 * to wake up queues in a round-robin manner.
+	 */
+	sess->cpu_rr = alloc_percpu(int);
+	if (unlikely(!sess->cpu_rr)) {
+		pr_err("Failed to create session %s, alloc of percpu var (cpu_rr) failed\n",
+		       sessname);
+		err = -ENOMEM;
+		goto err;
+	}
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(sess->cpu_rr, cpu) = cpu;
+
+	return sess;
+
+err:
+	free_sess(sess);
+
+	return ERR_PTR(err);
+}
+
+static int wait_for_rtrs_connection(struct rnbd_clt_session *sess)
+{
+	wait_event(sess->rtrs_waitq, sess->rtrs_ready);
+	if (IS_ERR_OR_NULL(sess->rtrs))
+		return -ECONNRESET;
+
+	return 0;
+}
+
+static void wait_for_rtrs_disconnection(struct rnbd_clt_session *sess)
+__releases(&sess_lock)
+__acquires(&sess_lock)
+{
+	DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
+
+	prepare_to_wait(&sess->rtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
+	if (IS_ERR_OR_NULL(sess->rtrs)) {
+		finish_wait(&sess->rtrs_waitq, &wait);
+		return;
+	}
+	mutex_unlock(&sess_lock);
+	/* After unlock session can be freed, so careful */
+	schedule();
+	mutex_lock(&sess_lock);
+}
+
+static struct rnbd_clt_session *__find_and_get_sess(const char *sessname)
+__releases(&sess_lock)
+__acquires(&sess_lock)
+{
+	struct rnbd_clt_session *sess;
+	int err;
+
+again:
+	list_for_each_entry(sess, &sess_list, list) {
+		if (strcmp(sessname, sess->sessname))
+			continue;
+
+		if (unlikely(sess->rtrs_ready && IS_ERR_OR_NULL(sess->rtrs)))
+			/*
+			 * No RTRS connection, session is dying.
+			 */
+			continue;
+
+		if (likely(rnbd_clt_get_sess(sess))) {
+			/*
+			 * Alive session is found, wait for RTRS connection.
+			 */
+			mutex_unlock(&sess_lock);
+			err = wait_for_rtrs_connection(sess);
+			if (unlikely(err))
+				rnbd_clt_put_sess(sess);
+			mutex_lock(&sess_lock);
+
+			if (unlikely(err))
+				/* Session is dying, repeat the loop */
+				goto again;
+
+			return sess;
+		}
+		/*
+		 * Ref is 0, session is dying, wait for RTRS disconnect
+		 * in order to avoid session names clashes.
+		 */
+		wait_for_rtrs_disconnection(sess);
+		/*
+		 * RTRS is disconnected and soon session will be freed,
+		 * so repeat a loop.
+		 */
+		goto again;
+	}
+
+	return NULL;
+}
+
+static struct rnbd_clt_session *find_and_get_sess(const char *sessname)
+{
+	struct rnbd_clt_session *sess;
+
+	mutex_lock(&sess_lock);
+	sess = __find_and_get_sess(sessname);
+	mutex_unlock(&sess_lock);
+
+	return sess;
+}
+
+static struct rnbd_clt_session *
+find_and_get_or_insert_sess(struct rnbd_clt_session *sess)
+{
+	struct rnbd_clt_session *found;
+
+	mutex_lock(&sess_lock);
+	found = __find_and_get_sess(sess->sessname);
+	if (!found)
+		list_add(&sess->list, &sess_list);
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static int rnbd_client_open(struct block_device *block_device, fmode_t mode)
+{
+	struct rnbd_clt_dev *dev = block_device->bd_disk->private_data;
+
+	if (dev->read_only && (mode & FMODE_WRITE))
+		return -EPERM;
+
+	if (dev->dev_state == DEV_STATE_UNMAPPED ||
+	    !rnbd_clt_get_dev(dev))
+		return -EIO;
+
+	return 0;
+}
+
+static void rnbd_client_release(struct gendisk *gen, fmode_t mode)
+{
+	struct rnbd_clt_dev *dev = gen->private_data;
+
+	rnbd_clt_put_dev(dev);
+}
+
+static int rnbd_client_getgeo(struct block_device *block_device,
+			       struct hd_geometry *geo)
+{
+	u64 size;
+	struct rnbd_clt_dev *dev;
+
+	dev = block_device->bd_disk->private_data;
+	size = dev->size * (dev->logical_block_size / SECTOR_SIZE);
+	geo->cylinders	= (size & ~0x3f) >> 6;	/* size/64 */
+	geo->heads	= 4;
+	geo->sectors	= 16;
+	geo->start	= 0;
+
+	return 0;
+}
+
+static const struct block_device_operations rnbd_client_ops = {
+	.owner		= THIS_MODULE,
+	.open		= rnbd_client_open,
+	.release	= rnbd_client_release,
+	.getgeo		= rnbd_client_getgeo
+};
+
+static size_t rnbd_clt_get_sg_size(struct scatterlist *sglist, u32 len)
+{
+	struct scatterlist *sg;
+	size_t tsize = 0;
+	int i;
+
+	for_each_sg(sglist, sg, len, i)
+		tsize += sg->length;
+	return tsize;
+}
+
+static int rnbd_client_xfer_request(struct rnbd_clt_dev *dev,
+				     struct request *rq,
+				     struct rnbd_iu *iu)
+{
+	struct rtrs_clt *rtrs = dev->sess->rtrs;
+	struct rtrs_permit *permit = iu->permit;
+	struct rnbd_msg_io msg;
+	unsigned int sg_cnt = 0;
+	struct kvec vec;
+	size_t size;
+	int err;
+
+	iu->rq		= rq;
+	iu->dev		= dev;
+	msg.sector	= cpu_to_le64(blk_rq_pos(rq));
+	msg.bi_size	= cpu_to_le32(blk_rq_bytes(rq));
+	msg.rw		= cpu_to_le32(rq_to_rnbd_flags(rq));
+	msg.prio	= cpu_to_le16(req_get_ioprio(rq));
+
+	/*
+	 * We only support discards with single segment for now.
+	 * See queue limits.
+	 */
+	if (req_op(rq) != REQ_OP_DISCARD)
+		sg_cnt = blk_rq_map_sg(dev->queue, rq, iu->sglist);
+
+	if (sg_cnt == 0)
+		/* Do not forget to mark the end */
+		sg_mark_end(&iu->sglist[0]);
+
+	msg.hdr.type	= cpu_to_le16(RNBD_MSG_IO);
+	msg.device_id	= cpu_to_le32(dev->device_id);
+
+	vec = (struct kvec) {
+		.iov_base = &msg,
+		.iov_len  = sizeof(msg)
+	};
+	size = rnbd_clt_get_sg_size(iu->sglist, sg_cnt);
+	err = rtrs_clt_request(rq_data_dir(rq), msg_io_conf, rtrs, permit,
+				iu, &vec, 1, size, iu->sglist, sg_cnt);
+	if (unlikely(err)) {
+		rnbd_clt_err_rl(dev, "RTRS failed to transfer IO, err: %d\n",
+				 err);
+		return err;
+	}
+
+	return 0;
+}
+
+/**
+ * rnbd_clt_dev_add_to_requeue() - add device to requeue if session is busy
+ * @dev:	Device to be checked
+ * @q:		Queue to be added to the requeue list if required
+ *
+ * Description:
+ *     If session is busy, that means someone will requeue us when resources
+ *     are freed.  If session is not doing anything - device is not added to
+ *     the list and @false is returned.
+ */
+static inline bool rnbd_clt_dev_add_to_requeue(struct rnbd_clt_dev *dev,
+						struct rnbd_queue *q)
+{
+	struct rnbd_clt_session *sess = dev->sess;
+	struct rnbd_cpu_qlist *cpu_q;
+	unsigned long flags;
+	bool added = true;
+	bool need_set;
+
+	cpu_q = get_cpu_ptr(sess->cpu_queues);
+	spin_lock_irqsave(&cpu_q->requeue_lock, flags);
+
+	if (likely(!test_and_set_bit_lock(0, &q->in_list))) {
+		if (WARN_ON(!list_empty(&q->requeue_list)))
+			goto unlock;
+
+		need_set = !test_bit(cpu_q->cpu, sess->cpu_queues_bm);
+		if (need_set) {
+			set_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			/* Paired with rnbd_put_permit().  Set a bit first
+			 * and then observe the busy counter.
+			 */
+			smp_mb__before_atomic();
+		}
+		if (likely(atomic_read(&sess->busy))) {
+			list_add_tail(&q->requeue_list, &cpu_q->requeue_list);
+		} else {
+			/* Very unlikely, but possible: busy counter was
+			 * observed as zero.  Drop all bits and return
+			 * false to restart the queue by ourselves.
+			 */
+			if (need_set)
+				clear_bit(cpu_q->cpu, sess->cpu_queues_bm);
+			clear_bit_unlock(0, &q->in_list);
+			added = false;
+		}
+	}
+unlock:
+	spin_unlock_irqrestore(&cpu_q->requeue_lock, flags);
+	put_cpu_ptr(sess->cpu_queues);
+
+	return added;
+}
+
+static void rnbd_clt_dev_kick_mq_queue(struct rnbd_clt_dev *dev,
+					struct blk_mq_hw_ctx *hctx,
+					int delay)
+{
+	struct rnbd_queue *q = hctx->driver_data;
+
+	if (delay != RNBD_DELAY_IFBUSY)
+		blk_mq_delay_run_hw_queue(hctx, delay);
+	else if (unlikely(!rnbd_clt_dev_add_to_requeue(dev, q)))
+		/*
+		 * If session is not busy we have to restart
+		 * the queue ourselves.
+		 */
+		blk_mq_delay_run_hw_queue(hctx, RNBD_DELAY_10ms);
+}
+
+static blk_status_t rnbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+				   const struct blk_mq_queue_data *bd)
+{
+	struct request *rq = bd->rq;
+	struct rnbd_clt_dev *dev = rq->rq_disk->private_data;
+	struct rnbd_iu *iu = blk_mq_rq_to_pdu(rq);
+	int err;
+
+	if (unlikely(!rnbd_clt_dev_is_mapped(dev)))
+		return BLK_STS_IOERR;
+
+	iu->permit = rnbd_get_permit(dev->sess, RTRS_IO_CON,
+				      RTRS_PERMIT_NOWAIT);
+	if (unlikely(!iu->permit)) {
+		rnbd_clt_dev_kick_mq_queue(dev, hctx, RNBD_DELAY_IFBUSY);
+		return BLK_STS_RESOURCE;
+	}
+
+	blk_mq_start_request(rq);
+	err = rnbd_client_xfer_request(dev, rq, iu);
+	if (likely(err == 0))
+		return BLK_STS_OK;
+	if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
+		rnbd_clt_dev_kick_mq_queue(dev, hctx, RNBD_DELAY_10ms);
+		rnbd_put_permit(dev->sess, iu->permit);
+		return BLK_STS_RESOURCE;
+	}
+
+	rnbd_put_permit(dev->sess, iu->permit);
+	return BLK_STS_IOERR;
+}
+
+static int rnbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			      unsigned int hctx_idx, unsigned int numa_node)
+{
+	struct rnbd_iu *iu = blk_mq_rq_to_pdu(rq);
+
+	sg_init_table(iu->sglist, BMAX_SEGMENTS);
+	return 0;
+}
+
+static struct blk_mq_ops rnbd_mq_ops = {
+	.queue_rq	= rnbd_queue_rq,
+	.init_request	= rnbd_init_request,
+	.complete	= rnbd_softirq_done_fn,
+};
+
+static int setup_mq_tags(struct rnbd_clt_session *sess)
+{
+	struct blk_mq_tag_set *tags = &sess->tag_set;
+
+	memset(tags, 0, sizeof(*tags));
+	tags->ops		= &rnbd_mq_ops;
+	tags->queue_depth	= sess->queue_depth;
+	tags->numa_node		= NUMA_NO_NODE;
+	tags->flags		= BLK_MQ_F_SHOULD_MERGE |
+				  BLK_MQ_F_TAG_SHARED;
+	tags->cmd_size		= sizeof(struct rnbd_iu);
+	tags->nr_hw_queues	= num_online_cpus();
+
+	return blk_mq_alloc_tag_set(tags);
+}
+
+static struct rnbd_clt_session *
+find_and_get_or_create_sess(const char *sessname,
+			    const struct rtrs_addr *paths,
+			    size_t path_cnt)
+{
+	struct rnbd_clt_session *sess, *found;
+	struct rtrs_attrs attrs;
+	int err;
+
+	sess = find_and_get_sess(sessname);
+	if (sess)
+		return sess;
+
+	sess = alloc_sess(sessname);
+	if (IS_ERR(sess))
+		return sess;
+
+	found = find_and_get_or_insert_sess(sess);
+	if (unlikely(found)) {
+		free_sess(sess);
+
+		return found;
+	}
+	/*
+	 * Nothing was found, establish rtrs connection and proceed further.
+	 */
+	sess->rtrs = rtrs_clt_open(sess, rnbd_clt_link_ev, sessname,
+				     paths, path_cnt, RTRS_PORT,
+				     sizeof(struct rnbd_iu),
+				     RECONNECT_DELAY, BMAX_SEGMENTS,
+				     MAX_RECONNECTS);
+	if (IS_ERR(sess->rtrs)) {
+		err = PTR_ERR(sess->rtrs);
+		goto wake_up_and_put;
+	}
+	rtrs_clt_query(sess->rtrs, &attrs);
+	sess->max_io_size = attrs.max_io_size;
+	sess->queue_depth = attrs.queue_depth;
+
+	err = setup_mq_tags(sess);
+	if (unlikely(err))
+		goto close_rtrs;
+
+	err = send_msg_sess_info(sess, WAIT);
+	if (unlikely(err))
+		goto close_rtrs;
+
+	wake_up_rtrs_waiters(sess);
+
+	return sess;
+
+close_rtrs:
+	close_rtrs(sess);
+put_sess:
+	rnbd_clt_put_sess(sess);
+
+	return ERR_PTR(err);
+
+wake_up_and_put:
+	wake_up_rtrs_waiters(sess);
+	goto put_sess;
+}
+
+static inline void rnbd_init_hw_queue(struct rnbd_clt_dev *dev,
+				       struct rnbd_queue *q,
+				       struct blk_mq_hw_ctx *hctx)
+{
+	INIT_LIST_HEAD(&q->requeue_list);
+	q->dev  = dev;
+	q->hctx = hctx;
+}
+
+static void rnbd_init_mq_hw_queues(struct rnbd_clt_dev *dev)
+{
+	int i;
+	struct blk_mq_hw_ctx *hctx;
+	struct rnbd_queue *q;
+
+	queue_for_each_hw_ctx(dev->queue, hctx, i) {
+		q = &dev->hw_queues[i];
+		rnbd_init_hw_queue(dev, q, hctx);
+		hctx->driver_data = q;
+	}
+}
+
+static int index_to_minor(int index)
+{
+	return index << RNBD_PART_BITS;
+}
+
+static int minor_to_index(int minor)
+{
+	return minor >> RNBD_PART_BITS;
+}
+
+static int setup_mq_dev(struct rnbd_clt_dev *dev)
+{
+	dev->queue = blk_mq_init_queue(&dev->sess->tag_set);
+	if (IS_ERR(dev->queue)) {
+		rnbd_clt_err(dev, "Initializing multiqueue queue failed, err: %ld\n",
+			      PTR_ERR(dev->queue));
+		return PTR_ERR(dev->queue);
+	}
+	rnbd_init_mq_hw_queues(dev);
+	return 0;
+}
+
+static void setup_request_queue(struct rnbd_clt_dev *dev)
+{
+	blk_queue_logical_block_size(dev->queue, dev->logical_block_size);
+	blk_queue_physical_block_size(dev->queue, dev->physical_block_size);
+	blk_queue_max_hw_sectors(dev->queue, dev->max_hw_sectors);
+	blk_queue_max_write_same_sectors(dev->queue,
+					 dev->max_write_same_sectors);
+
+	/*
+	 * We don't support discards to "discontiguous" segments
+	 * in one request.
+	 */
+	blk_queue_max_discard_segments(dev->queue, 1);
+
+	blk_queue_max_discard_sectors(dev->queue, dev->max_discard_sectors);
+	dev->queue->limits.discard_granularity	= dev->discard_granularity;
+	dev->queue->limits.discard_alignment	= dev->discard_alignment;
+	if (dev->max_discard_sectors)
+		blk_queue_flag_set(QUEUE_FLAG_DISCARD, dev->queue);
+	if (dev->secure_discard)
+		blk_queue_flag_set(QUEUE_FLAG_SECERASE, dev->queue);
+
+	blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, dev->queue);
+	blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, dev->queue);
+	blk_queue_max_segments(dev->queue, dev->max_segments);
+	blk_queue_io_opt(dev->queue, dev->sess->max_io_size);
+	blk_queue_virt_boundary(dev->queue, 4095);
+	blk_queue_write_cache(dev->queue, true, true);
+	dev->queue->queuedata = dev;
+}
+
+static void rnbd_clt_setup_gen_disk(struct rnbd_clt_dev *dev, int idx)
+{
+	dev->gd->major		= rnbd_client_major;
+	dev->gd->first_minor	= index_to_minor(idx);
+	dev->gd->fops		= &rnbd_client_ops;
+	dev->gd->queue		= dev->queue;
+	dev->gd->private_data	= dev;
+	snprintf(dev->gd->disk_name, sizeof(dev->gd->disk_name), "rnbd%d",
+		 idx);
+	pr_debug("disk_name=%s, capacity=%zu\n",
+		 dev->gd->disk_name,
+		 dev->nsectors * (dev->logical_block_size / SECTOR_SIZE)
+		 );
+
+	set_capacity(dev->gd, dev->nsectors * (dev->logical_block_size /
+					       SECTOR_SIZE));
+
+	if (dev->access_mode == RNBD_ACCESS_RO) {
+		dev->read_only = true;
+		set_disk_ro(dev->gd, true);
+	} else {
+		dev->read_only = false;
+	}
+
+	if (!dev->rotational)
+		blk_queue_flag_set(QUEUE_FLAG_NONROT, dev->queue);
+}
+
+static void rnbd_clt_add_gen_disk(struct rnbd_clt_dev *dev)
+{
+	add_disk(dev->gd);
+}
+
+static int rnbd_client_setup_device(struct rnbd_clt_session *sess,
+				     struct rnbd_clt_dev *dev, int idx)
+{
+	int err;
+
+	dev->size = dev->nsectors * dev->logical_block_size;
+
+	err = setup_mq_dev(dev);
+	if (err)
+		return err;
+
+	setup_request_queue(dev);
+
+	dev->gd = alloc_disk_node(1 << RNBD_PART_BITS,	NUMA_NO_NODE);
+	if (!dev->gd) {
+		rnbd_clt_err(dev, "Failed to allocate disk node\n");
+		blk_cleanup_queue(dev->queue);
+		return -ENOMEM;
+	}
+
+	rnbd_clt_setup_gen_disk(dev, idx);
+
+	return 0;
+}
+
+static struct rnbd_clt_dev *init_dev(struct rnbd_clt_session *sess,
+				      enum rnbd_access_mode access_mode,
+				      const char *pathname)
+{
+	struct rnbd_clt_dev *dev;
+	int ret;
+
+	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, NUMA_NO_NODE);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	dev->hw_queues = kcalloc(nr_cpu_ids, sizeof(*dev->hw_queues),
+				 GFP_KERNEL);
+	if (unlikely(!dev->hw_queues)) {
+		pr_err("Failed to initialize device '%s' from session %s, allocating hw_queues failed.\n",
+		       pathname, sess->sessname);
+		ret = -ENOMEM;
+		goto out_alloc;
+	}
+
+	mutex_lock(&ida_lock);
+	ret = ida_simple_get(&index_ida, 0, minor_to_index(1 << MINORBITS),
+			     GFP_KERNEL);
+	mutex_unlock(&ida_lock);
+	if (ret < 0) {
+		pr_err("Failed to initialize device '%s' from session %s, allocating idr failed, err: %d\n",
+		       pathname, sess->sessname, ret);
+		goto out_queues;
+	}
+	dev->clt_device_id	= ret;
+	dev->sess		= sess;
+	dev->access_mode	= access_mode;
+	strlcpy(dev->pathname, pathname, sizeof(dev->pathname));
+	mutex_init(&dev->lock);
+	refcount_set(&dev->refcount, 1);
+	dev->dev_state = DEV_STATE_INIT;
+
+	/*
+	 * Here we are called from a sysfs entry, thus clt-sysfs is
+	 * responsible for ensuring that the session does not disappear.
+	 */
+	WARN_ON(!rnbd_clt_get_sess(sess));
+
+	return dev;
+
+out_queues:
+	kfree(dev->hw_queues);
+out_alloc:
+	kfree(dev);
+	return ERR_PTR(ret);
+}
+
+static bool __exists_dev(const char *pathname)
+{
+	struct rnbd_clt_session *sess;
+	struct rnbd_clt_dev *dev;
+	bool found = false;
+
+	list_for_each_entry(sess, &sess_list, list) {
+		mutex_lock(&sess->lock);
+		list_for_each_entry(dev, &sess->devs_list, list) {
+			if (!strncmp(dev->pathname, pathname,
+				     sizeof(dev->pathname))) {
+				found = true;
+				break;
+			}
+		}
+		mutex_unlock(&sess->lock);
+		if (found)
+			break;
+	}
+
+	return found;
+}
+
+static bool exists_devpath(const char *pathname)
+{
+	bool found;
+
+	mutex_lock(&sess_lock);
+	found = __exists_dev(pathname);
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static bool insert_dev_if_not_exists_devpath(const char *pathname,
+					     struct rnbd_clt_session *sess,
+					     struct rnbd_clt_dev *dev)
+{
+	bool found;
+
+	mutex_lock(&sess_lock);
+	found = __exists_dev(pathname);
+	if (!found) {
+		mutex_lock(&sess->lock);
+		list_add_tail(&dev->list, &sess->devs_list);
+		mutex_unlock(&sess->lock);
+	}
+	mutex_unlock(&sess_lock);
+
+	return found;
+}
+
+static void delete_dev(struct rnbd_clt_dev *dev)
+{
+	struct rnbd_clt_session *sess = dev->sess;
+
+	mutex_lock(&sess->lock);
+	list_del(&dev->list);
+	mutex_unlock(&sess->lock);
+}
+
+struct rnbd_clt_dev *rnbd_clt_map_device(const char *sessname,
+					   struct rtrs_addr *paths,
+					   size_t path_cnt,
+					   const char *pathname,
+					   enum rnbd_access_mode access_mode)
+{
+	struct rnbd_clt_session *sess;
+	struct rnbd_clt_dev *dev;
+	int ret;
+
+	if (unlikely(exists_devpath(pathname)))
+		return ERR_PTR(-EEXIST);
+
+	sess = find_and_get_or_create_sess(sessname, paths, path_cnt);
+	if (IS_ERR(sess))
+		return ERR_CAST(sess);
+
+	dev = init_dev(sess, access_mode, pathname);
+	if (IS_ERR(dev)) {
+		pr_err("map_device: failed to map device '%s' from session %s, can't initialize device, err: %ld\n",
+		       pathname, sess->sessname, PTR_ERR(dev));
+		ret = PTR_ERR(dev);
+		goto put_sess;
+	}
+	if (unlikely(insert_dev_if_not_exists_devpath(pathname, sess, dev))) {
+		ret = -EEXIST;
+		goto put_dev;
+	}
+	ret = send_msg_open(dev, WAIT);
+	if (unlikely(ret)) {
+		rnbd_clt_err(dev,
+			      "map_device: failed, can't open remote device, err: %d\n",
+			      ret);
+		goto del_dev;
+	}
+	mutex_lock(&dev->lock);
+	pr_debug("Opened remote device: session=%s, path='%s'\n",
+		 sess->sessname, pathname);
+	ret = rnbd_client_setup_device(sess, dev, dev->clt_device_id);
+	if (ret) {
+		rnbd_clt_err(dev,
+			      "map_device: Failed to configure device, err: %d\n",
+			      ret);
+		mutex_unlock(&dev->lock);
+		goto del_dev;
+	}
+
+	rnbd_clt_info(dev,
+		       "map_device: Device mapped as %s (nsectors: %zu, logical_block_size: %d, physical_block_size: %d, max_write_same_sectors: %d, max_discard_sectors: %d, discard_granularity: %d, discard_alignment: %d, secure_discard: %d, max_segments: %d, max_hw_sectors: %d, rotational: %d)\n",
+		       dev->gd->disk_name, dev->nsectors,
+		       dev->logical_block_size, dev->physical_block_size,
+		       dev->max_write_same_sectors, dev->max_discard_sectors,
+		       dev->discard_granularity, dev->discard_alignment,
+		       dev->secure_discard, dev->max_segments,
+		       dev->max_hw_sectors, dev->rotational);
+
+	mutex_unlock(&dev->lock);
+
+	rnbd_clt_add_gen_disk(dev);
+	rnbd_clt_put_sess(sess);
+
+	return dev;
+
+del_dev:
+	delete_dev(dev);
+put_dev:
+	rnbd_clt_put_dev(dev);
+put_sess:
+	rnbd_clt_put_sess(sess);
+
+	return ERR_PTR(ret);
+}
+
+static void destroy_gen_disk(struct rnbd_clt_dev *dev)
+{
+	del_gendisk(dev->gd);
+	blk_cleanup_queue(dev->queue);
+	put_disk(dev->gd);
+}
+
+static void destroy_sysfs(struct rnbd_clt_dev *dev,
+			  const struct attribute *sysfs_self)
+{
+	rnbd_clt_remove_dev_symlink(dev);
+	if (dev->kobj.state_initialized) {
+		if (sysfs_self)
+			/* To avoid deadlock, first remove the file itself */
+			sysfs_remove_file_self(&dev->kobj, sysfs_self);
+		kobject_del(&dev->kobj);
+		kobject_put(&dev->kobj);
+	}
+}
+
+int rnbd_clt_unmap_device(struct rnbd_clt_dev *dev, bool force,
+			   const struct attribute *sysfs_self)
+{
+	struct rnbd_clt_session *sess = dev->sess;
+	int refcount, ret = 0;
+	bool was_mapped;
+
+	mutex_lock(&dev->lock);
+	if (dev->dev_state == DEV_STATE_UNMAPPED) {
+		rnbd_clt_info(dev, "Device is already being unmapped\n");
+		ret = -EALREADY;
+		goto err;
+	}
+	refcount = refcount_read(&dev->refcount);
+	if (!force && refcount > 1) {
+		rnbd_clt_err(dev,
+			      "Closing device failed, device is in use, (%d device users)\n",
+			      refcount - 1);
+		ret = -EBUSY;
+		goto err;
+	}
+	was_mapped = (dev->dev_state == DEV_STATE_MAPPED);
+	dev->dev_state = DEV_STATE_UNMAPPED;
+	mutex_unlock(&dev->lock);
+
+	delete_dev(dev);
+	destroy_sysfs(dev, sysfs_self);
+	destroy_gen_disk(dev);
+	if (was_mapped && sess->rtrs)
+		send_msg_close(dev, dev->device_id, WAIT);
+
+	rnbd_clt_info(dev, "Device is unmapped\n");
+
+	/* Likely last reference put */
+	rnbd_clt_put_dev(dev);
+
+	/*
+	 * Here the device and session may have vanished!
+	 */
+
+	return 0;
+err:
+	mutex_unlock(&dev->lock);
+
+	return ret;
+}
+
+int rnbd_clt_remap_device(struct rnbd_clt_dev *dev)
+{
+	int err;
+
+	mutex_lock(&dev->lock);
+	if (likely(dev->dev_state == DEV_STATE_MAPPED_DISCONNECTED))
+		err = 0;
+	else if (dev->dev_state == DEV_STATE_UNMAPPED)
+		err = -ENODEV;
+	else if (dev->dev_state == DEV_STATE_MAPPED)
+		err = -EALREADY;
+	else
+		err = -EBUSY;
+	mutex_unlock(&dev->lock);
+	if (likely(!err)) {
+		rnbd_clt_info(dev, "Remapping device.\n");
+		err = send_msg_open(dev, WAIT);
+		if (unlikely(err))
+			rnbd_clt_err(dev, "remap_device: %d\n", err);
+	}
+
+	return err;
+}
+
+static void unmap_device_work(struct work_struct *work)
+{
+	struct rnbd_clt_dev *dev;
+
+	dev = container_of(work, typeof(*dev), unmap_on_rmmod_work);
+	rnbd_clt_unmap_device(dev, true, NULL);
+}
+
+static void rnbd_destroy_sessions(void)
+{
+	struct rnbd_clt_session *sess, *sn;
+	struct rnbd_clt_dev *dev, *tn;
+
+	/* First forbid access through the sysfs interface */
+	rnbd_clt_destroy_default_group();
+	rnbd_clt_destroy_sysfs_files();
+
+	/*
+	 * At this point there is no concurrent access to the sessions
+	 * list and devices list:
+	 *   1. New session or device can't be created - session sysfs files
+	 *      are removed.
+	 *   2. Device or session can't be removed - module reference is taken
+	 *      into account in unmap device sysfs callback.
+	 *   3. No IO requests inflight - each file open of block_dev increases
+	 *      module reference in get_disk().
+	 *
+	 * But there can still be user requests in flight, which are sent by
+	 * asynchronous send_msg_*() functions, thus before unmapping devices
+	 * RTRS session must be explicitly closed.
+	 */
+
+	list_for_each_entry_safe(sess, sn, &sess_list, list) {
+		WARN_ON(!rnbd_clt_get_sess(sess));
+		close_rtrs(sess);
+		list_for_each_entry_safe(dev, tn, &sess->devs_list, list) {
+			/*
+			 * Here unmap happens in parallel for only one reason:
+			 * blk_cleanup_queue() takes around half a second, so
+			 * with many devices the whole module unload
+			 * procedure takes minutes.
+			 */
+			INIT_WORK(&dev->unmap_on_rmmod_work, unmap_device_work);
+			queue_work(system_long_wq, &dev->unmap_on_rmmod_work);
+		}
+		rnbd_clt_put_sess(sess);
+	}
+	/* Wait for all scheduled unmap works */
+	flush_workqueue(system_long_wq);
+	WARN_ON(!list_empty(&sess_list));
+}
+
+static int __init rnbd_client_init(void)
+{
+	int err = 0;
+
+	pr_info("Loading module %s, proto %s:\n",
+		KBUILD_MODNAME, RNBD_PROTO_VER_STRING);
+
+	rnbd_client_major = register_blkdev(rnbd_client_major, "rnbd");
+	if (rnbd_client_major <= 0) {
+		pr_err("Failed to load module, block device registration failed\n");
+		return -EBUSY;
+	}
+
+	err = rnbd_clt_create_sysfs_files();
+	if (err) {
+		pr_err("Failed to load module, creating sysfs device files failed, err: %d\n",
+		       err);
+		unregister_blkdev(rnbd_client_major, "rnbd");
+	}
+
+	return err;
+}
+
+static void __exit rnbd_client_exit(void)
+{
+	pr_info("Unloading module\n");
+	rnbd_destroy_sessions();
+	unregister_blkdev(rnbd_client_major, "rnbd");
+	ida_destroy(&index_ida);
+	pr_info("Module unloaded\n");
+}
+
+module_init(rnbd_client_init);
+module_exit(rnbd_client_exit);
-- 
2.17.1



* [PATCH v6 18/25] rnbd: client: sysfs interface functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (16 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 17/25] rnbd: client: main functionality Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2020-01-03  0:03   ` Bart Van Assche
  2019-12-30 10:29 ` [PATCH v6 19/25] rnbd: server: private header with server structs and functions Jack Wang
                   ` (8 subsequent siblings)
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the sysfs interface to rnbd block devices on the client side:

  /sys/devices/virtual/rnbd-client/ctl/
    |- map_device
    |  *** maps remote device
    |
    |- devices/
       *** all mapped devices

  /sys/block/rnbd<N>/rnbd/
    |- unmap_device
    |  *** unmaps device
    |
    |- state
    |  *** device state
    |
    |- session
    |  *** session name
    |
    |- mapping_path
       *** path of the dev that was mapped on server

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/block/rnbd/rnbd-clt-sysfs.c | 641 ++++++++++++++++++++++++++++
 1 file changed, 641 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-clt-sysfs.c

diff --git a/drivers/block/rnbd/rnbd-clt-sysfs.c b/drivers/block/rnbd/rnbd-clt-sysfs.c
new file mode 100644
index 000000000000..0b889cc8a9f9
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-clt-sysfs.c
@@ -0,0 +1,641 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/parser.h>
+#include <linux/module.h>
+#include <linux/in6.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/device.h>
+#include <rdma/ib.h>
+#include <rdma/rdma_cm.h>
+
+#include "rnbd-clt.h"
+
+static struct device *rnbd_dev;
+static struct class *rnbd_dev_class;
+static struct kobject *rnbd_devs_kobj;
+
+enum {
+	RNBD_OPT_ERR		= 0,
+	RNBD_OPT_PATH		= 1 << 0,
+	RNBD_OPT_DEV_PATH	= 1 << 1,
+	RNBD_OPT_ACCESS_MODE	= 1 << 3,
+	RNBD_OPT_SESSNAME	= 1 << 6,
+};
+
+static const unsigned int rnbd_opt_mandatory[] = {
+	RNBD_OPT_PATH,
+	RNBD_OPT_DEV_PATH,
+	RNBD_OPT_SESSNAME,
+};
+
+static const match_table_t rnbd_opt_tokens = {
+	{	RNBD_OPT_PATH,		"path=%s"		},
+	{	RNBD_OPT_DEV_PATH,	"device_path=%s"	},
+	{	RNBD_OPT_ACCESS_MODE,	"access_mode=%s"	},
+	{	RNBD_OPT_SESSNAME,	"sessname=%s"		},
+	{	RNBD_OPT_ERR,		NULL			},
+};
+
+/* remove new line from string */
+static void strip(char *s)
+{
+	char *p = s;
+
+	while (*s != '\0') {
+		if (*s != '\n')
+			*p++ = *s++;
+		else
+			++s;
+	}
+	*p = '\0';
+}
+
+struct rnbd_map_options {
+	char *sessname;
+	struct rtrs_addr *paths;
+	size_t *path_cnt;
+	char *pathname;
+	enum rnbd_access_mode *access_mode;
+};
+
+static int rnbd_clt_parse_map_options(const char *buf, size_t max_path_cnt,
+				       struct rnbd_map_options *opt)
+{
+	char *options, *sep_opt;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int opt_mask = 0;
+	int token;
+	int ret = -EINVAL;
+	int i;
+	int p_cnt = 0;
+
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	sep_opt = strstrip(options);
+	strip(sep_opt);
+	while ((p = strsep(&sep_opt, " ")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, rnbd_opt_tokens, args);
+		opt_mask |= token;
+
+		switch (token) {
+		case RNBD_OPT_SESSNAME:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (strlen(p) >= NAME_MAX) {
+				pr_err("map_device: sessname too long\n");
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			strlcpy(opt->sessname, p, NAME_MAX);
+			kfree(p);
+			break;
+
+		case RNBD_OPT_PATH:
+			if (p_cnt >= max_path_cnt) {
+				pr_err("map_device: too many (> %zu) paths provided\n",
+				       max_path_cnt);
+				ret = -ENOMEM;
+				goto out;
+			}
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			ret = rtrs_addr_to_sockaddr(p, strlen(p), RTRS_PORT,
+						     &opt->paths[p_cnt]);
+			if (ret) {
+				pr_err("Can't parse path %s: %d\n", p, ret);
+				kfree(p);
+				goto out;
+			}
+
+			p_cnt++;
+
+			kfree(p);
+			break;
+
+		case RNBD_OPT_DEV_PATH:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (strlen(p) >= NAME_MAX) {
+				pr_err("map_device: Device path too long\n");
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+			strlcpy(opt->pathname, p, NAME_MAX);
+			kfree(p);
+			break;
+
+		case RNBD_OPT_ACCESS_MODE:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			if (!strcmp(p, "ro")) {
+				*opt->access_mode = RNBD_ACCESS_RO;
+			} else if (!strcmp(p, "rw")) {
+				*opt->access_mode = RNBD_ACCESS_RW;
+			} else if (!strcmp(p, "migration")) {
+				*opt->access_mode = RNBD_ACCESS_MIGRATION;
+			} else {
+				pr_err("map_device: Invalid access_mode: '%s'\n",
+				       p);
+				ret = -EINVAL;
+				kfree(p);
+				goto out;
+			}
+
+			kfree(p);
+			break;
+
+		default:
+			pr_err("map_device: Unknown parameter or missing value '%s'\n",
+			       p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(rnbd_opt_mandatory); i++) {
+		if ((opt_mask & rnbd_opt_mandatory[i])) {
+			ret = 0;
+		} else {
+			pr_err("map_device: Parameters missing\n");
+			ret = -EINVAL;
+			break;
+		}
+	}
+
+out:
+	*opt->path_cnt = p_cnt;
+	kfree(options);
+	return ret;
+}
+
+static ssize_t state_show(struct kobject *kobj,
+			  struct kobj_attribute *attr, char *page)
+{
+	struct rnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct rnbd_clt_dev, kobj);
+
+	switch (dev->dev_state) {
+	case DEV_STATE_INIT:
+		return snprintf(page, PAGE_SIZE, "init\n");
+	case DEV_STATE_MAPPED:
+		/* TODO fix cli tool before changing to proper state */
+		return snprintf(page, PAGE_SIZE, "open\n");
+	case DEV_STATE_MAPPED_DISCONNECTED:
+		/* TODO fix cli tool before changing to proper state */
+		return snprintf(page, PAGE_SIZE, "closed\n");
+	case DEV_STATE_UNMAPPED:
+		return snprintf(page, PAGE_SIZE, "unmapped\n");
+	default:
+		return snprintf(page, PAGE_SIZE, "unknown\n");
+	}
+}
+
+static struct kobj_attribute rnbd_clt_state_attr = __ATTR_RO(state);
+
+static ssize_t mapping_path_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *page)
+{
+	struct rnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct rnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", dev->pathname);
+}
+
+static struct kobj_attribute rnbd_clt_mapping_path_attr =
+	__ATTR_RO(mapping_path);
+
+static ssize_t access_mode_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *page)
+{
+	struct rnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct rnbd_clt_dev, kobj);
+
+	return snprintf(page, PAGE_SIZE, "%s\n",
+			rnbd_access_mode_str(dev->access_mode));
+}
+
+static struct kobj_attribute rnbd_clt_access_mode =
+	__ATTR_RO(access_mode);
+
+static ssize_t rnbd_clt_unmap_dev_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo <normal|force> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t rnbd_clt_unmap_dev_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct rnbd_clt_dev *dev;
+	char *opt, *options;
+	bool force;
+	int err;
+
+	opt = kstrdup(buf, GFP_KERNEL);
+	if (!opt)
+		return -ENOMEM;
+
+	options = strstrip(opt);
+	strip(options);
+
+	dev = container_of(kobj, struct rnbd_clt_dev, kobj);
+
+	if (sysfs_streq(options, "normal")) {
+		force = false;
+	} else if (sysfs_streq(options, "force")) {
+		force = true;
+	} else {
+		rnbd_clt_err(dev,
+			      "unmap_device: Invalid value: %s\n",
+			      options);
+		err = -EINVAL;
+		goto out;
+	}
+
+	rnbd_clt_info(dev, "Unmapping device, option: %s.\n",
+		       force ? "force" : "normal");
+
+	/*
+	 * We take explicit module reference only for one reason: do not
+	 * race with lockless rnbd_destroy_sessions().
+	 */
+	if (!try_module_get(THIS_MODULE)) {
+		err = -ENODEV;
+		goto out;
+	}
+	err = rnbd_clt_unmap_device(dev, force, &attr->attr);
+	if (unlikely(err)) {
+		if (unlikely(err != -EALREADY))
+			rnbd_clt_err(dev, "unmap_device: %d\n",  err);
+		goto module_put;
+	}
+
+	/*
+	 * Here the device may have vanished!
+	 */
+
+	err = count;
+
+module_put:
+	module_put(THIS_MODULE);
+out:
+	kfree(opt);
+
+	return err;
+}
+
+static struct kobj_attribute rnbd_clt_unmap_device_attr =
+	__ATTR(unmap_device, 0644, rnbd_clt_unmap_dev_show,
+	       rnbd_clt_unmap_dev_store);
+
+static ssize_t rnbd_clt_resize_dev_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE,
+			 "Usage: echo <new size in sectors> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t rnbd_clt_resize_dev_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int ret;
+	unsigned long sectors;
+	struct rnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct rnbd_clt_dev, kobj);
+
+	ret = kstrtoul(buf, 0, &sectors);
+	if (ret)
+		return ret;
+
+	ret = rnbd_clt_resize_disk(dev, (size_t)sectors);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute rnbd_clt_resize_dev_attr =
+	__ATTR(resize, 0644, rnbd_clt_resize_dev_show,
+	       rnbd_clt_resize_dev_store);
+
+static ssize_t rnbd_clt_remap_dev_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *page)
+{
+	return scnprintf(page, PAGE_SIZE, "Usage: echo <1> > %s\n",
+			 attr->attr.name);
+}
+
+static ssize_t rnbd_clt_remap_dev_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct rnbd_clt_dev *dev;
+	char *opt, *options;
+	int err;
+
+	opt = kstrdup(buf, GFP_KERNEL);
+	if (!opt)
+		return -ENOMEM;
+
+	options = strstrip(opt);
+	strip(options);
+
+	dev = container_of(kobj, struct rnbd_clt_dev, kobj);
+	if (!sysfs_streq(options, "1")) {
+		rnbd_clt_err(dev,
+			      "remap_device: Invalid value: %s\n",
+			      options);
+		err = -EINVAL;
+		goto out;
+	}
+	err = rnbd_clt_remap_device(dev);
+	if (likely(!err))
+		err = count;
+
+out:
+	kfree(opt);
+
+	return err;
+}
+
+static struct kobj_attribute rnbd_clt_remap_device_attr =
+	__ATTR(remap_device, 0644, rnbd_clt_remap_dev_show,
+	       rnbd_clt_remap_dev_store);
+
+static ssize_t session_show(struct kobject *kobj, struct kobj_attribute *attr,
+			    char *page)
+{
+	struct rnbd_clt_dev *dev;
+
+	dev = container_of(kobj, struct rnbd_clt_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", dev->sess->sessname);
+}
+
+static struct kobj_attribute rnbd_clt_session_attr =
+	__ATTR_RO(session);
+
+static struct attribute *rnbd_dev_attrs[] = {
+	&rnbd_clt_unmap_device_attr.attr,
+	&rnbd_clt_resize_dev_attr.attr,
+	&rnbd_clt_remap_device_attr.attr,
+	&rnbd_clt_mapping_path_attr.attr,
+	&rnbd_clt_state_attr.attr,
+	&rnbd_clt_session_attr.attr,
+	&rnbd_clt_access_mode.attr,
+	NULL,
+};
+
+void rnbd_clt_remove_dev_symlink(struct rnbd_clt_dev *dev)
+{
+	/*
+	 * The module_is_live() check is crucial and helps to avoid annoying
+	 * sysfs warning raised in sysfs_remove_link(), when the whole sysfs
+	 * path was just removed, see rnbd_destroy_sessions().
+	 */
+	if (strlen(dev->blk_symlink_name) && module_is_live(THIS_MODULE))
+		sysfs_remove_link(rnbd_devs_kobj, dev->blk_symlink_name);
+}
+
+static struct kobj_type rnbd_dev_ktype = {
+	.sysfs_ops      = &kobj_sysfs_ops,
+	.default_attrs  = rnbd_dev_attrs,
+};
+
+static int rnbd_clt_add_dev_kobj(struct rnbd_clt_dev *dev)
+{
+	int ret;
+	struct kobject *gd_kobj = &disk_to_dev(dev->gd)->kobj;
+
+	ret = kobject_init_and_add(&dev->kobj, &rnbd_dev_ktype, gd_kobj, "%s",
+				   "rnbd");
+	if (ret)
+		rnbd_clt_err(dev, "Failed to create device sysfs dir, err: %d\n",
+			      ret);
+
+	return ret;
+}
+
+static ssize_t rnbd_clt_map_device_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *page)
+{
+	return scnprintf(page, PAGE_SIZE,
+			 "Usage: echo \"sessname=<name of the rtrs session> path=<[srcaddr@]dstaddr> [path=<[srcaddr@]dstaddr>] device_path=<full path on remote side> [access_mode=<ro|rw|migration>]\" > %s\n\naddr ::= [ ip:<ipv4> | ip:<ipv6> | gid:<gid> ]\n",
+			 attr->attr.name);
+}
+
+static int rnbd_clt_get_path_name(struct rnbd_clt_dev *dev, char *buf,
+				   size_t len)
+{
+	int ret;
+	char pathname[NAME_MAX], *s;
+
+	strlcpy(pathname, dev->pathname, sizeof(pathname));
+	while ((s = strchr(pathname, '/')))
+		s[0] = '!';
+
+	ret = snprintf(buf, len, "%s", pathname);
+	if (ret >= len)
+		return -ENAMETOOLONG;
+
+	return 0;
+}
+
+static int rnbd_clt_add_dev_symlink(struct rnbd_clt_dev *dev)
+{
+	struct kobject *gd_kobj = &disk_to_dev(dev->gd)->kobj;
+	int ret;
+
+	ret = rnbd_clt_get_path_name(dev, dev->blk_symlink_name,
+				      sizeof(dev->blk_symlink_name));
+	if (ret) {
+		rnbd_clt_err(dev, "Failed to get /sys/block symlink path, err: %d\n",
+			      ret);
+		goto out_err;
+	}
+
+	ret = sysfs_create_link(rnbd_devs_kobj, gd_kobj,
+				dev->blk_symlink_name);
+	if (ret) {
+		rnbd_clt_err(dev, "Creating /sys/block symlink failed, err: %d\n",
+			      ret);
+		goto out_err;
+	}
+
+	return 0;
+
+out_err:
+	dev->blk_symlink_name[0] = '\0';
+	return ret;
+}
+
+static ssize_t rnbd_clt_map_device_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct rnbd_clt_dev *dev;
+	struct rnbd_map_options opt;
+	int ret;
+	char pathname[NAME_MAX];
+	char sessname[NAME_MAX];
+	enum rnbd_access_mode access_mode = RNBD_ACCESS_RW;
+
+	struct sockaddr_storage *addrs;
+	struct rtrs_addr paths[6];
+	size_t path_cnt;
+
+	opt.sessname = sessname;
+	opt.paths = paths;
+	opt.path_cnt = &path_cnt;
+	opt.pathname = pathname;
+	opt.access_mode = &access_mode;
+	addrs = kcalloc(ARRAY_SIZE(paths) * 2, sizeof(*addrs), GFP_KERNEL);
+	if (!addrs)
+		return -ENOMEM;
+
+	for (path_cnt = 0; path_cnt < ARRAY_SIZE(paths); path_cnt++) {
+		paths[path_cnt].src = &addrs[path_cnt * 2];
+		paths[path_cnt].dst = &addrs[path_cnt * 2 + 1];
+	}
+
+	ret = rnbd_clt_parse_map_options(buf, ARRAY_SIZE(paths), &opt);
+	if (ret)
+		goto out;
+
+	pr_info("Mapping device %s on session %s, (access_mode: %s)\n",
+		pathname, sessname,
+		rnbd_access_mode_str(access_mode));
+
+	dev = rnbd_clt_map_device(sessname, paths, path_cnt, pathname,
+				   access_mode);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+
+	ret = rnbd_clt_add_dev_kobj(dev);
+	if (unlikely(ret))
+		goto unmap_dev;
+
+	ret = rnbd_clt_add_dev_symlink(dev);
+	if (ret)
+		goto unmap_dev;
+
+	kfree(addrs);
+	return count;
+
+unmap_dev:
+	rnbd_clt_unmap_device(dev, true, NULL);
+out:
+	kfree(addrs);
+	return ret;
+}
+
+static struct kobj_attribute rnbd_clt_map_device_attr =
+	__ATTR(map_device, 0644,
+	       rnbd_clt_map_device_show, rnbd_clt_map_device_store);
+
+static struct attribute *default_attrs[] = {
+	&rnbd_clt_map_device_attr.attr,
+	NULL,
+};
+
+static struct attribute_group default_attr_group = {
+	.attrs = default_attrs,
+};
+
+static const struct attribute_group *default_attr_groups[] = {
+	&default_attr_group,
+	NULL,
+};
+
+int rnbd_clt_create_sysfs_files(void)
+{
+	int err;
+
+	rnbd_dev_class = class_create(THIS_MODULE, "rnbd-client");
+	if (IS_ERR(rnbd_dev_class))
+		return PTR_ERR(rnbd_dev_class);
+
+	rnbd_dev = device_create_with_groups(rnbd_dev_class, NULL,
+					      MKDEV(0, 0), NULL,
+					      default_attr_groups, "ctl");
+	if (IS_ERR(rnbd_dev)) {
+		err = PTR_ERR(rnbd_dev);
+		goto cls_destroy;
+	}
+	rnbd_devs_kobj = kobject_create_and_add("devices", &rnbd_dev->kobj);
+	if (unlikely(!rnbd_devs_kobj)) {
+		err = -ENOMEM;
+		goto dev_destroy;
+	}
+
+	return 0;
+
+dev_destroy:
+	device_destroy(rnbd_dev_class, MKDEV(0, 0));
+cls_destroy:
+	class_destroy(rnbd_dev_class);
+
+	return err;
+}
+
+void rnbd_clt_destroy_default_group(void)
+{
+	sysfs_remove_group(&rnbd_dev->kobj, &default_attr_group);
+}
+
+void rnbd_clt_destroy_sysfs_files(void)
+{
+	kobject_del(rnbd_devs_kobj);
+	kobject_put(rnbd_devs_kobj);
+	device_destroy(rnbd_dev_class, MKDEV(0, 0));
+	class_destroy(rnbd_dev_class);
+}
-- 
2.17.1
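
A note on rnbd_clt_get_path_name() in the patch above: it derives the
symlink name from the device path by replacing every '/' with '!', since a
sysfs entry name cannot contain a slash. Below is a minimal user-space
sketch of that idea. The name mangle_path_name() is hypothetical and not
part of the patch, and snprintf() stands in for the kernel's strlcpy(), so
an over-long path is reported with -ENAMETOOLONG instead of first being cut
at NAME_MAX.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * User-space sketch (hypothetical helper, not part of the patch):
 * copy a device path into buf and replace every '/' with '!' so the
 * result can be used as a single directory entry name.
 * Returns 0 on success or -ENAMETOOLONG if buf is too small.
 */
static int mangle_path_name(const char *path, char *buf, size_t len)
{
	int ret;
	char *s;

	ret = snprintf(buf, len, "%s", path);
	if (ret < 0 || (size_t)ret >= len)
		return -ENAMETOOLONG;

	/* Walk from one '/' to the next and overwrite each in place. */
	for (s = buf; (s = strchr(s, '/')); s++)
		*s = '!';

	return 0;
}
```

With this sketch, a path such as "/dev/nullb0" would become "!dev!nullb0".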



* [PATCH v6 19/25] rnbd: server: private header with server structs and functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (17 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 18/25] rnbd: client: sysfs interface functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 10:29 ` [PATCH v6 20/25] rnbd: server: main functionality Jack Wang
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This header describes the main structs and functions used by the
rnbd-server module, namely the structs for managing sessions from
different clients and mapped (opened) devices.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
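
Note (not part of the commit): struct rnbd_srv_sess_dev is the object that
"binds N devices and N sessions". One such object exists per (session,
device) pair and it is linked into both the session's and the device's
list. The following is a minimal user-space sketch of that layout. All
names in it (list_node, session, device, sess_dev, link_sess_dev) are
simplified stand-ins chosen for illustration, and locking and reference
counting are left out.

```c
#include <stddef.h>

/* Minimal intrusive doubly linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *h)
{
	h->prev = h->next = h;
}

static void list_add_tail_node(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static size_t list_count(const struct list_node *h)
{
	const struct list_node *p;
	size_t n = 0;

	for (p = h->next; p != h; p = p->next)
		n++;
	return n;
}

/* Hypothetical, stripped-down counterparts of the structs in this header. */
struct session { struct list_node sess_dev_list; };
struct device  { struct list_node sess_dev_list; };

/* One link object per (session, device) pair, a member of both lists. */
struct sess_dev {
	struct list_node dev_list;	/* entry in device->sess_dev_list */
	struct list_node sess_list;	/* entry in session->sess_dev_list */
	struct session *sess;
	struct device *dev;
};

static void link_sess_dev(struct sess_dev *sd, struct session *s,
			  struct device *d)
{
	sd->sess = s;
	sd->dev = d;
	list_add_tail_node(&sd->dev_list, &d->sess_dev_list);
	list_add_tail_node(&sd->sess_list, &s->sess_dev_list);
}
```

Two sessions mapping the same device then yield two link objects on the
device's list and one on each session's list.
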
 drivers/block/rnbd/rnbd-srv.h | 81 +++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-srv.h

diff --git a/drivers/block/rnbd/rnbd-srv.h b/drivers/block/rnbd/rnbd-srv.h
new file mode 100644
index 000000000000..2c1f4c302ab1
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-srv.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#ifndef RNBD_SRV_H
+#define RNBD_SRV_H
+
+#include <linux/types.h>
+#include <linux/idr.h>
+#include <linux/kref.h>
+
+#include "rtrs.h"
+#include "rnbd-proto.h"
+#include "rnbd-log.h"
+
+struct rnbd_srv_session {
+	/* Entry inside global sess_list */
+	struct list_head        list;
+	struct rtrs_srv	*rtrs;
+	char			sessname[NAME_MAX];
+	int			queue_depth;
+	struct bio_set		sess_bio_set;
+
+	rwlock_t                index_lock ____cacheline_aligned;
+	struct idr              index_idr;
+	/* List of struct rnbd_srv_sess_dev */
+	struct list_head        sess_dev_list;
+	struct mutex		lock;
+	u8			ver;
+};
+
+struct rnbd_srv_dev {
+	/* Entry inside global dev_list */
+	struct list_head                list;
+	struct kobject                  dev_kobj;
+	struct kobject                  dev_sessions_kobj;
+	struct kref                     kref;
+	char				id[NAME_MAX];
+	/* List of rnbd_srv_sess_dev structs */
+	struct list_head		sess_dev_list;
+	struct mutex			lock;
+	int				open_write_cnt;
+};
+
+/* Structure which binds N devices and N sessions */
+struct rnbd_srv_sess_dev {
+	/* Entry inside rnbd_srv_dev struct */
+	struct list_head		dev_list;
+	/* Entry inside rnbd_srv_session struct */
+	struct list_head		sess_list;
+	struct rnbd_dev		*rnbd_dev;
+	struct rnbd_srv_session        *sess;
+	struct rnbd_srv_dev		*dev;
+	struct kobject                  kobj;
+	struct completion		*sysfs_release_compl;
+	u32                             device_id;
+	fmode_t                         open_flags;
+	struct kref			kref;
+	struct completion               *destroy_comp;
+	char				pathname[NAME_MAX];
+	enum rnbd_access_mode		access_mode;
+};
+
+/* rnbd-srv-sysfs.c */
+
+int rnbd_srv_create_dev_sysfs(struct rnbd_srv_dev *dev,
+			       struct block_device *bdev,
+			       const char *dir_name);
+void rnbd_srv_destroy_dev_sysfs(struct rnbd_srv_dev *dev);
+int rnbd_srv_create_dev_session_sysfs(struct rnbd_srv_sess_dev *sess_dev);
+void rnbd_srv_destroy_dev_session_sysfs(struct rnbd_srv_sess_dev *sess_dev);
+int rnbd_srv_create_sysfs_files(void);
+void rnbd_srv_destroy_sysfs_files(void);
+
+#endif /* RNBD_SRV_H */
-- 
2.17.1



* [PATCH v6 20/25] rnbd: server: main functionality
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (18 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 19/25] rnbd: server: private header with server structs and functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 10:29 ` [PATCH v6 21/25] rnbd: server: functionality for IO submission to file or block dev Jack Wang
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the main functionality of the rnbd-server module, which handles
RTRS events and rnbd protocol requests, such as mapping (opening) or
unmapping (closing) a device.  The server side is also responsible for
processing incoming RTRS IO requests and forwarding them to the locally
mapped devices.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
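
Note (not part of the commit): the admission rule implemented by
rnbd_srv_check_update_open_perm() can be summarised as follows. Read-only
mappings are always allowed, a read-write mapping is allowed only while
nobody else holds write access, and a migration mapping is allowed while
fewer than two writers exist. A minimal user-space sketch of that rule is
given below. The enum and struct names are hypothetical stand-ins for
enum rnbd_access_mode and struct rnbd_srv_dev, and the locking done in
the patch is omitted.

```c
#include <errno.h>

/* Hypothetical stand-ins for enum rnbd_access_mode and struct rnbd_srv_dev. */
enum access_mode { ACCESS_RO, ACCESS_RW, ACCESS_MIGRATION };

struct srv_dev {
	int open_write_cnt;	/* number of clients holding write access */
};

/*
 * Decide whether a new mapping is allowed and account for it.
 * Locking is omitted; the patch takes srv_dev->lock around this logic.
 */
static int check_update_open_perm(struct srv_dev *dev, enum access_mode mode)
{
	switch (mode) {
	case ACCESS_RO:
		return 0;		/* readers are always admitted */
	case ACCESS_RW:
		if (dev->open_write_cnt != 0)
			return -EPERM;	/* a writer already exists */
		break;
	case ACCESS_MIGRATION:
		if (dev->open_write_cnt >= 2)
			return -EPERM;	/* at most two writers in total */
		break;
	default:
		return -EINVAL;
	}

	dev->open_write_cnt++;
	return 0;
}
```

The counter is only incremented for the two writing modes, which matches
the decrement on close being conditional on FMODE_WRITE.
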
 drivers/block/rnbd/rnbd-srv.c | 864 ++++++++++++++++++++++++++++++++++
 1 file changed, 864 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-srv.c

diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
new file mode 100644
index 000000000000..ac2d1d558fbc
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-srv.c
@@ -0,0 +1,864 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <linux/module.h>
+#include <linux/blkdev.h>
+
+#include "rnbd-srv.h"
+#include "rnbd-srv-dev.h"
+
+MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
+MODULE_LICENSE("GPL");
+
+static int __read_mostly port_nr = RTRS_PORT;
+
+module_param_named(port_nr, port_nr, int, 0444);
+MODULE_PARM_DESC(port_nr,
+		 "The port number the server is listening on (default: "
+		 __stringify(RTRS_PORT)")");
+
+#define DEFAULT_DEV_SEARCH_PATH "/"
+
+static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
+
+static int dev_search_path_set(const char *val, const struct kernel_param *kp)
+{
+	size_t len = strlen(val);
+
+	/* Ignore a single trailing newline, e.g. from "echo" */
+	if (len && val[len - 1] == '\n')
+		len--;
+
+	if (len >= sizeof(dev_search_path))
+		return -EINVAL;
+
+	/* No temporary copy is needed: copy len bytes and terminate */
+	memcpy(dev_search_path, val, len);
+	dev_search_path[len] = '\0';
+
+	pr_info("dev_search_path changed to '%s'\n", dev_search_path);
+
+	return 0;
+}
+
+static struct kparam_string dev_search_path_kparam_str = {
+	.maxlen	= sizeof(dev_search_path),
+	.string	= dev_search_path
+};
+
+static const struct kernel_param_ops dev_search_path_ops = {
+	.set	= dev_search_path_set,
+	.get	= param_get_string,
+};
+
+module_param_cb(dev_search_path, &dev_search_path_ops,
+		&dev_search_path_kparam_str, 0444);
+MODULE_PARM_DESC(dev_search_path,
+		 "Sets the dev_search_path. When a device is mapped this path is prepended to the device path from the map device operation.  If %SESSNAME% is specified in a path, then device will be searched in a session namespace. (default: "
+		 DEFAULT_DEV_SEARCH_PATH ")");
+
+static DEFINE_MUTEX(sess_lock);
+static DEFINE_SPINLOCK(dev_lock);
+
+static LIST_HEAD(sess_list);
+static LIST_HEAD(dev_list);
+
+struct rnbd_io_private {
+	struct rtrs_srv_op		*id;
+	struct rnbd_srv_sess_dev	*sess_dev;
+};
+
+static void rnbd_sess_dev_release(struct kref *kref)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kref, struct rnbd_srv_sess_dev, kref);
+	complete(sess_dev->destroy_comp);
+}
+
+static inline void rnbd_put_sess_dev(struct rnbd_srv_sess_dev *sess_dev)
+{
+	kref_put(&sess_dev->kref, rnbd_sess_dev_release);
+}
+
+static void rnbd_endio(void *priv, int error)
+{
+	struct rnbd_io_private *rnbd_priv = priv;
+	struct rnbd_srv_sess_dev *sess_dev = rnbd_priv->sess_dev;
+
+	rnbd_put_sess_dev(sess_dev);
+
+	rtrs_srv_resp_rdma(rnbd_priv->id, error);
+
+	kfree(priv);
+}
+
+static struct rnbd_srv_sess_dev *
+rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session *srv_sess)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+	int ret = 0;
+
+	read_lock(&srv_sess->index_lock);
+	sess_dev = idr_find(&srv_sess->index_idr, dev_id);
+	if (likely(sess_dev))
+		ret = kref_get_unless_zero(&sess_dev->kref);
+	read_unlock(&srv_sess->index_lock);
+
+	if (unlikely(!sess_dev || !ret))
+		return ERR_PTR(-ENXIO);
+
+	return sess_dev;
+}
+
+static int process_rdma(struct rtrs_srv *sess,
+			struct rnbd_srv_session *srv_sess,
+			struct rtrs_srv_op *id, void *data, u32 datalen,
+			const void *usr, size_t usrlen)
+{
+	const struct rnbd_msg_io *msg = usr;
+	struct rnbd_io_private *priv;
+	struct rnbd_srv_sess_dev *sess_dev;
+	u32 dev_id;
+	int err;
+
+	priv = kmalloc(sizeof(*priv), GFP_KERNEL);
+	if (unlikely(!priv))
+		return -ENOMEM;
+
+	dev_id = le32_to_cpu(msg->device_id);
+
+	sess_dev = rnbd_get_sess_dev(dev_id, srv_sess);
+	if (IS_ERR(sess_dev)) {
+		pr_err_ratelimited("Got I/O request on session %s for unknown device id %d\n",
+				   srv_sess->sessname, dev_id);
+		err = -ENOTCONN;
+		goto err;
+	}
+
+	priv->sess_dev = sess_dev;
+	priv->id = id;
+
+	err = rnbd_dev_submit_io(sess_dev->rnbd_dev, le64_to_cpu(msg->sector),
+				  data, datalen, le32_to_cpu(msg->bi_size),
+				  le32_to_cpu(msg->rw),
+				  srv_sess->ver < RNBD_PROTO_VER_MAJOR ||
+				  usrlen < sizeof(*msg) ?
+				  0 : le16_to_cpu(msg->prio), priv);
+	if (unlikely(err)) {
+		rnbd_srv_err(sess_dev, "Submitting I/O to device failed, err: %d\n",
+			      err);
+		goto sess_dev_put;
+	}
+
+	return 0;
+
+sess_dev_put:
+	rnbd_put_sess_dev(sess_dev);
+err:
+	kfree(priv);
+	return err;
+}
+
+static void destroy_device(struct rnbd_srv_dev *dev)
+{
+	WARN(!list_empty(&dev->sess_dev_list),
+	     "Device %s is being destroyed but still in use!\n",
+	     dev->id);
+
+	spin_lock(&dev_lock);
+	list_del(&dev->list);
+	spin_unlock(&dev_lock);
+
+	if (dev->dev_kobj.state_in_sysfs)
+		/*
+		 * Destroy kobj only if it was really created.
+		 * The following call should be sync, because
+		 *  we free the memory afterwards.
+		 */
+		rnbd_srv_destroy_dev_sysfs(dev);
+
+	kfree(dev);
+}
+
+static void destroy_device_cb(struct kref *kref)
+{
+	struct rnbd_srv_dev *dev;
+
+	dev = container_of(kref, struct rnbd_srv_dev, kref);
+
+	destroy_device(dev);
+}
+
+static void rnbd_put_srv_dev(struct rnbd_srv_dev *dev)
+{
+	kref_put(&dev->kref, destroy_device_cb);
+}
+
+static void rnbd_destroy_sess_dev(struct rnbd_srv_sess_dev *sess_dev)
+{
+	DECLARE_COMPLETION_ONSTACK(dc);
+
+	write_lock(&sess_dev->sess->index_lock);
+	idr_remove(&sess_dev->sess->index_idr, sess_dev->device_id);
+	write_unlock(&sess_dev->sess->index_lock);
+
+	sess_dev->destroy_comp = &dc;
+	rnbd_put_sess_dev(sess_dev);
+	wait_for_completion(&dc);
+
+	rnbd_dev_close(sess_dev->rnbd_dev);
+	list_del(&sess_dev->sess_list);
+	mutex_lock(&sess_dev->dev->lock);
+	list_del(&sess_dev->dev_list);
+	if (sess_dev->open_flags & FMODE_WRITE)
+		sess_dev->dev->open_write_cnt--;
+	mutex_unlock(&sess_dev->dev->lock);
+
+	rnbd_put_srv_dev(sess_dev->dev);
+
+	rnbd_srv_info(sess_dev, "Device closed\n");
+	kfree(sess_dev);
+}
+
+static void destroy_sess(struct rnbd_srv_session *srv_sess)
+{
+	struct rnbd_srv_sess_dev *sess_dev, *tmp;
+
+	if (list_empty(&srv_sess->sess_dev_list))
+		goto out;
+
+	mutex_lock(&srv_sess->lock);
+	list_for_each_entry_safe(sess_dev, tmp, &srv_sess->sess_dev_list,
+				 sess_list) {
+		rnbd_srv_destroy_dev_session_sysfs(sess_dev);
+		rnbd_destroy_sess_dev(sess_dev);
+	}
+	mutex_unlock(&srv_sess->lock);
+
+out:
+	idr_destroy(&srv_sess->index_idr);
+	bioset_exit(&srv_sess->sess_bio_set);
+
+	pr_info("RTRS Session %s disconnected\n", srv_sess->sessname);
+
+	mutex_lock(&sess_lock);
+	list_del(&srv_sess->list);
+	mutex_unlock(&sess_lock);
+
+	kfree(srv_sess);
+}
+
+static int create_sess(struct rtrs_srv *rtrs)
+{
+	struct rnbd_srv_session *srv_sess;
+	char sessname[NAME_MAX];
+	int err;
+
+	err = rtrs_srv_get_sess_name(rtrs, sessname, sizeof(sessname));
+	if (unlikely(err)) {
+		pr_err("rtrs_srv_get_sess_name(%s): %d\n", sessname, err);
+
+		return err;
+	}
+	srv_sess = kzalloc(sizeof(*srv_sess), GFP_KERNEL);
+	if (!srv_sess)
+		return -ENOMEM;
+	srv_sess->queue_depth = rtrs_srv_get_queue_depth(rtrs);
+
+	err = bioset_init(&srv_sess->sess_bio_set, srv_sess->queue_depth,
+			  offsetof(struct rnbd_dev_blk_io, bio),
+			  BIOSET_NEED_BVECS);
+	if (err) {
+		pr_err("Allocating srv_session for session %s failed\n",
+		       sessname);
+		kfree(srv_sess);
+		return err;
+	}
+
+	idr_init(&srv_sess->index_idr);
+	rwlock_init(&srv_sess->index_lock);
+	INIT_LIST_HEAD(&srv_sess->sess_dev_list);
+	mutex_init(&srv_sess->lock);
+	mutex_lock(&sess_lock);
+	list_add(&srv_sess->list, &sess_list);
+	mutex_unlock(&sess_lock);
+
+	srv_sess->rtrs = rtrs;
+	strlcpy(srv_sess->sessname, sessname, sizeof(srv_sess->sessname));
+
+	rtrs_srv_set_sess_priv(rtrs, srv_sess);
+
+	return 0;
+}
+
+static int rnbd_srv_link_ev(struct rtrs_srv *rtrs,
+			     enum rtrs_srv_link_ev ev, void *priv)
+{
+	struct rnbd_srv_session *srv_sess = priv;
+
+	switch (ev) {
+	case RTRS_SRV_LINK_EV_CONNECTED:
+		return create_sess(rtrs);
+
+	case RTRS_SRV_LINK_EV_DISCONNECTED:
+		if (WARN_ON(!srv_sess))
+			return -EINVAL;
+
+		destroy_sess(srv_sess);
+		return 0;
+
+	default:
+		pr_warn("Received unknown RTRS session event %d from session %s\n",
+			ev, srv_sess ? srv_sess->sessname : "(unknown)");
+		return -EINVAL;
+	}
+}
+
+static int process_msg_close(struct rtrs_srv *rtrs,
+			     struct rnbd_srv_session *srv_sess,
+			     void *data, size_t datalen, const void *usr,
+			     size_t usrlen)
+{
+	const struct rnbd_msg_close *close_msg = usr;
+	struct rnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = rnbd_get_sess_dev(le32_to_cpu(close_msg->device_id),
+				      srv_sess);
+	if (IS_ERR(sess_dev))
+		return 0;
+
+	rnbd_srv_destroy_dev_session_sysfs(sess_dev);
+	rnbd_put_sess_dev(sess_dev);
+	mutex_lock(&srv_sess->lock);
+	rnbd_destroy_sess_dev(sess_dev);
+	mutex_unlock(&srv_sess->lock);
+	return 0;
+}
+
+static int process_msg_open(struct rtrs_srv *rtrs,
+			    struct rnbd_srv_session *srv_sess,
+			    const void *msg, size_t len,
+			    void *data, size_t datalen);
+
+static int process_msg_sess_info(struct rtrs_srv *rtrs,
+				 struct rnbd_srv_session *srv_sess,
+				 const void *msg, size_t len,
+				 void *data, size_t datalen);
+
+static int rnbd_srv_rdma_ev(struct rtrs_srv *rtrs, void *priv,
+			     struct rtrs_srv_op *id, int dir,
+			     void *data, size_t datalen, const void *usr,
+			     size_t usrlen)
+{
+	struct rnbd_srv_session *srv_sess = priv;
+	const struct rnbd_msg_hdr *hdr = usr;
+	int ret = 0;
+	u16 type;
+
+	if (WARN_ON(!srv_sess))
+		return -ENODEV;
+
+	type = le16_to_cpu(hdr->type);
+
+	switch (type) {
+	case RNBD_MSG_IO:
+		return process_rdma(rtrs, srv_sess, id, data, datalen, usr,
+				    usrlen);
+	case RNBD_MSG_CLOSE:
+		ret = process_msg_close(rtrs, srv_sess, data, datalen,
+					usr, usrlen);
+		break;
+	case RNBD_MSG_OPEN:
+		ret = process_msg_open(rtrs, srv_sess, usr, usrlen,
+				       data, datalen);
+		break;
+	case RNBD_MSG_SESS_INFO:
+		ret = process_msg_sess_info(rtrs, srv_sess, usr, usrlen,
+					    data, datalen);
+		break;
+	default:
+		pr_warn("Received unexpected message type %d with dir %d from session %s\n",
+			type, dir, srv_sess->sessname);
+		return -EINVAL;
+	}
+
+	rtrs_srv_resp_rdma(id, ret);
+	return 0;
+}
+
+static struct rnbd_srv_sess_dev
+*rnbd_sess_dev_alloc(struct rnbd_srv_session *srv_sess)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+	int error;
+
+	sess_dev = kzalloc(sizeof(*sess_dev), GFP_KERNEL);
+	if (!sess_dev)
+		return ERR_PTR(-ENOMEM);
+
+	idr_preload(GFP_KERNEL);
+	write_lock(&srv_sess->index_lock);
+
+	error = idr_alloc(&srv_sess->index_idr, sess_dev, 0, -1, GFP_NOWAIT);
+	if (error < 0) {
+		pr_warn("Allocating idr failed, err: %d\n", error);
+		goto out_unlock;
+	}
+
+	sess_dev->device_id = error;
+	error = 0;
+
+out_unlock:
+	write_unlock(&srv_sess->index_lock);
+	idr_preload_end();
+	if (error) {
+		kfree(sess_dev);
+		return ERR_PTR(error);
+	}
+
+	return sess_dev;
+}
+
+static struct rnbd_srv_dev *rnbd_srv_init_srv_dev(const char *id)
+{
+	struct rnbd_srv_dev *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	strlcpy(dev->id, id, sizeof(dev->id));
+	kref_init(&dev->kref);
+	INIT_LIST_HEAD(&dev->sess_dev_list);
+	mutex_init(&dev->lock);
+
+	return dev;
+}
+
+static struct rnbd_srv_dev *
+rnbd_srv_find_or_add_srv_dev(struct rnbd_srv_dev *new_dev)
+{
+	struct rnbd_srv_dev *dev;
+
+	spin_lock(&dev_lock);
+	list_for_each_entry(dev, &dev_list, list) {
+		if (!strncmp(dev->id, new_dev->id, sizeof(dev->id))) {
+			if (!kref_get_unless_zero(&dev->kref))
+				/*
+				 * We lost the race, device is almost dead.
+				 *  Continue traversing to find a valid one.
+				 */
+				continue;
+			spin_unlock(&dev_lock);
+			return dev;
+		}
+	}
+	list_add(&new_dev->list, &dev_list);
+	spin_unlock(&dev_lock);
+
+	return new_dev;
+}
+
+static int rnbd_srv_check_update_open_perm(struct rnbd_srv_dev *srv_dev,
+					    struct rnbd_srv_session *srv_sess,
+					    enum rnbd_access_mode access_mode)
+{
+	int ret = -EPERM;
+
+	mutex_lock(&srv_dev->lock);
+
+	switch (access_mode) {
+	case RNBD_ACCESS_RO:
+		ret = 0;
+		break;
+	case RNBD_ACCESS_RW:
+		if (srv_dev->open_write_cnt == 0)  {
+			srv_dev->open_write_cnt++;
+			ret = 0;
+		} else {
+			pr_err("Mapping device '%s' for session %s with RW permissions failed. Device already opened as 'RW' by %d client(s), access mode %s.\n",
+			       srv_dev->id, srv_sess->sessname,
+			       srv_dev->open_write_cnt,
+			       rnbd_access_mode_str(access_mode));
+		}
+		break;
+	case RNBD_ACCESS_MIGRATION:
+		if (srv_dev->open_write_cnt < 2) {
+			srv_dev->open_write_cnt++;
+			ret = 0;
+		} else {
+			pr_err("Mapping device '%s' for session %s with migration permissions failed. Device already opened as 'RW' by %d client(s), access mode %s.\n",
+			       srv_dev->id, srv_sess->sessname,
+			       srv_dev->open_write_cnt,
+			       rnbd_access_mode_str(access_mode));
+		}
+		break;
+	default:
+		pr_err("Received mapping request for device '%s' on session %s with invalid access mode: %d\n",
+		       srv_dev->id, srv_sess->sessname, access_mode);
+		ret = -EINVAL;
+	}
+
+	mutex_unlock(&srv_dev->lock);
+
+	return ret;
+}
+
+static struct rnbd_srv_dev *
+rnbd_srv_get_or_create_srv_dev(struct rnbd_dev *rnbd_dev,
+				struct rnbd_srv_session *srv_sess,
+				enum rnbd_access_mode access_mode)
+{
+	int ret;
+	struct rnbd_srv_dev *new_dev, *dev;
+
+	new_dev = rnbd_srv_init_srv_dev(rnbd_dev->name);
+	if (IS_ERR(new_dev))
+		return new_dev;
+
+	dev = rnbd_srv_find_or_add_srv_dev(new_dev);
+	if (dev != new_dev)
+		kfree(new_dev);
+
+	ret = rnbd_srv_check_update_open_perm(dev, srv_sess, access_mode);
+	if (ret) {
+		rnbd_put_srv_dev(dev);
+		return ERR_PTR(ret);
+	}
+
+	return dev;
+}
+
+static void rnbd_srv_fill_msg_open_rsp(struct rnbd_msg_open_rsp *rsp,
+					struct rnbd_srv_sess_dev *sess_dev)
+{
+	struct rnbd_dev *rnbd_dev = sess_dev->rnbd_dev;
+
+	rsp->hdr.type = cpu_to_le16(RNBD_MSG_OPEN_RSP);
+	rsp->device_id =
+		cpu_to_le32(sess_dev->device_id);
+	rsp->nsectors =
+		cpu_to_le64(get_capacity(rnbd_dev->bdev->bd_disk));
+	rsp->logical_block_size	=
+		cpu_to_le16(rnbd_dev_get_logical_bsize(rnbd_dev));
+	rsp->physical_block_size =
+		cpu_to_le16(rnbd_dev_get_phys_bsize(rnbd_dev));
+	rsp->max_segments =
+		cpu_to_le16(rnbd_dev_get_max_segs(rnbd_dev));
+	rsp->max_hw_sectors =
+		cpu_to_le32(rnbd_dev_get_max_hw_sects(rnbd_dev));
+	rsp->max_write_same_sectors =
+		cpu_to_le32(rnbd_dev_get_max_write_same_sects(rnbd_dev));
+	rsp->max_discard_sectors =
+		cpu_to_le32(rnbd_dev_get_max_discard_sects(rnbd_dev));
+	rsp->discard_granularity =
+		cpu_to_le32(rnbd_dev_get_discard_granularity(rnbd_dev));
+	rsp->discard_alignment =
+		cpu_to_le32(rnbd_dev_get_discard_alignment(rnbd_dev));
+	rsp->secure_discard =
+		cpu_to_le16(rnbd_dev_get_secure_discard(rnbd_dev));
+	rsp->rotational =
+		!blk_queue_nonrot(bdev_get_queue(rnbd_dev->bdev));
+}
+
+static struct rnbd_srv_sess_dev *
+rnbd_srv_create_set_sess_dev(struct rnbd_srv_session *srv_sess,
+			      const struct rnbd_msg_open *open_msg,
+			      struct rnbd_dev *rnbd_dev, fmode_t open_flags,
+			      struct rnbd_srv_dev *srv_dev)
+{
+	struct rnbd_srv_sess_dev *sdev = rnbd_sess_dev_alloc(srv_sess);
+
+	if (IS_ERR(sdev))
+		return sdev;
+
+	kref_init(&sdev->kref);
+
+	strlcpy(sdev->pathname, open_msg->dev_name, sizeof(sdev->pathname));
+
+	sdev->rnbd_dev		= rnbd_dev;
+	sdev->sess		= srv_sess;
+	sdev->dev		= srv_dev;
+	sdev->open_flags	= open_flags;
+	sdev->access_mode	= open_msg->access_mode;
+
+	return sdev;
+}
+
+static char *rnbd_srv_get_full_path(struct rnbd_srv_session *srv_sess,
+				     const char *dev_name)
+{
+	char *full_path;
+	char *a, *b;
+
+	full_path = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!full_path)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Replace %SESSNAME% with a real session name in order to
+	 * create device namespace.
+	 */
+	a = strnstr(dev_search_path, "%SESSNAME%", sizeof(dev_search_path));
+	if (a) {
+		int len = a - dev_search_path;
+
+		len = snprintf(full_path, PATH_MAX, "%.*s/%s/%s", len,
+			       dev_search_path, srv_sess->sessname, dev_name);
+		if (len >= PATH_MAX) {
+			pr_err("Path too long: %s, %s, %s\n",
+			       dev_search_path, srv_sess->sessname, dev_name);
+			kfree(full_path);
+			return ERR_PTR(-EINVAL);
+		}
+	} else {
+		snprintf(full_path, PATH_MAX, "%s/%s",
+			 dev_search_path, dev_name);
+	}
+
+	/* eliminate duplicated slashes */
+	a = strchr(full_path, '/');
+	b = a;
+	while (*b != '\0') {
+		if (*b == '/' && *a == '/') {
+			b++;
+		} else {
+			a++;
+			*a = *b;
+			b++;
+		}
+	}
+	a++;
+	*a = '\0';
+
+	return full_path;
+}
+
+static int process_msg_sess_info(struct rtrs_srv *rtrs,
+				 struct rnbd_srv_session *srv_sess,
+				 const void *msg, size_t len,
+				 void *data, size_t datalen)
+{
+	const struct rnbd_msg_sess_info *sess_info_msg = msg;
+	struct rnbd_msg_sess_info_rsp *rsp = data;
+
+	srv_sess->ver = min_t(u8, sess_info_msg->ver, RNBD_PROTO_VER_MAJOR);
+	pr_debug("Session %s using protocol version %d (client version: %d, server version: %d)\n",
+		 srv_sess->sessname, srv_sess->ver,
+		 sess_info_msg->ver, RNBD_PROTO_VER_MAJOR);
+
+	rsp->hdr.type = cpu_to_le16(RNBD_MSG_SESS_INFO_RSP);
+	rsp->ver = srv_sess->ver;
+
+	return 0;
+}
+
+/**
+ * find_srv_sess_dev() - find a device already opened under this name
+ * @srv_sess:	the session to search.
+ * @dev_name:	string containing the name of the device.
+ *
+ * Return: struct rnbd_srv_sess_dev if srv_sess already opened the dev_name,
+ * NULL if the session didn't open the device yet.
+ */
+static struct rnbd_srv_sess_dev *
+find_srv_sess_dev(struct rnbd_srv_session *srv_sess, const char *dev_name)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+
+	if (list_empty(&srv_sess->sess_dev_list))
+		return NULL;
+
+	list_for_each_entry(sess_dev, &srv_sess->sess_dev_list, sess_list)
+		if (!strcmp(sess_dev->pathname, dev_name))
+			return sess_dev;
+
+	return NULL;
+}
+
+static int process_msg_open(struct rtrs_srv *rtrs,
+			    struct rnbd_srv_session *srv_sess,
+			    const void *msg, size_t len,
+			    void *data, size_t datalen)
+{
+	int ret;
+	struct rnbd_srv_dev *srv_dev;
+	struct rnbd_srv_sess_dev *srv_sess_dev;
+	const struct rnbd_msg_open *open_msg = msg;
+	fmode_t open_flags;
+	char *full_path;
+	struct rnbd_dev *rnbd_dev;
+	struct rnbd_msg_open_rsp *rsp = data;
+
+	pr_debug("Open message received: session='%s' path='%s' access_mode=%d\n",
+		 srv_sess->sessname, open_msg->dev_name,
+		 open_msg->access_mode);
+	open_flags = FMODE_READ;
+	if (open_msg->access_mode != RNBD_ACCESS_RO)
+		open_flags |= FMODE_WRITE;
+
+	mutex_lock(&srv_sess->lock);
+
+	srv_sess_dev = find_srv_sess_dev(srv_sess, open_msg->dev_name);
+	if (srv_sess_dev)
+		goto fill_response;
+
+	if ((strlen(dev_search_path) + strlen(open_msg->dev_name))
+	    >= PATH_MAX) {
+		pr_err("Opening device for session %s failed, device path too long. '%s/%s' is longer than PATH_MAX (%d)\n",
+		       srv_sess->sessname, dev_search_path, open_msg->dev_name,
+		       PATH_MAX);
+		ret = -EINVAL;
+		goto reject;
+	}
+	if (strstr(open_msg->dev_name, "..")) {
+		pr_err("Opening device for session %s failed, device path %s contains relative path ..\n",
+		       srv_sess->sessname, open_msg->dev_name);
+		ret = -EINVAL;
+		goto reject;
+	}
+	full_path = rnbd_srv_get_full_path(srv_sess, open_msg->dev_name);
+	if (IS_ERR(full_path)) {
+		ret = PTR_ERR(full_path);
+		pr_err("Opening device '%s' for client %s failed, failed to get device full path, err: %d\n",
+		       open_msg->dev_name, srv_sess->sessname, ret);
+		goto reject;
+	}
+
+	rnbd_dev = rnbd_dev_open(full_path, open_flags,
+				   &srv_sess->sess_bio_set, rnbd_endio);
+	if (IS_ERR(rnbd_dev)) {
+		pr_err("Opening device '%s' on session %s failed, failed to open the block device, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(rnbd_dev));
+		ret = PTR_ERR(rnbd_dev);
+		goto free_path;
+	}
+
+	srv_dev = rnbd_srv_get_or_create_srv_dev(rnbd_dev, srv_sess,
+						  open_msg->access_mode);
+	if (IS_ERR(srv_dev)) {
+		pr_err("Opening device '%s' on session %s failed, creating srv_dev failed, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(srv_dev));
+		ret = PTR_ERR(srv_dev);
+		goto rnbd_dev_close;
+	}
+
+	srv_sess_dev = rnbd_srv_create_set_sess_dev(srv_sess, open_msg,
+						     rnbd_dev, open_flags,
+						     srv_dev);
+	if (IS_ERR(srv_sess_dev)) {
+		pr_err("Opening device '%s' on session %s failed, creating sess_dev failed, err: %ld\n",
+		       full_path, srv_sess->sessname, PTR_ERR(srv_sess_dev));
+		ret = PTR_ERR(srv_sess_dev);
+		goto srv_dev_put;
+	}
+
+	/* Create the srv_dev sysfs files if they haven't been created yet. The
+	 * reason to delay the creation is not to create the sysfs files before
+	 * we are sure the device can be opened.
+	 */
+	mutex_lock(&srv_dev->lock);
+	if (!srv_dev->dev_kobj.state_in_sysfs) {
+		ret = rnbd_srv_create_dev_sysfs(srv_dev, rnbd_dev->bdev,
+						 rnbd_dev->name);
+		if (ret) {
+			mutex_unlock(&srv_dev->lock);
+			rnbd_srv_err(srv_sess_dev,
+				      "Opening device failed, failed to create device sysfs files, err: %d\n",
+				      ret);
+			goto free_srv_sess_dev;
+		}
+	}
+
+	ret = rnbd_srv_create_dev_session_sysfs(srv_sess_dev);
+	if (ret) {
+		mutex_unlock(&srv_dev->lock);
+		rnbd_srv_err(srv_sess_dev,
+			      "Opening device failed, failed to create dev client sysfs files, err: %d\n",
+			      ret);
+		goto free_srv_sess_dev;
+	}
+
+	list_add(&srv_sess_dev->dev_list, &srv_dev->sess_dev_list);
+	mutex_unlock(&srv_dev->lock);
+
+	list_add(&srv_sess_dev->sess_list, &srv_sess->sess_dev_list);
+
+	rnbd_srv_info(srv_sess_dev, "Opened device '%s'\n", srv_dev->id);
+
+	kfree(full_path);
+
+fill_response:
+	rnbd_srv_fill_msg_open_rsp(rsp, srv_sess_dev);
+	mutex_unlock(&srv_sess->lock);
+	return 0;
+
+free_srv_sess_dev:
+	write_lock(&srv_sess->index_lock);
+	idr_remove(&srv_sess->index_idr, srv_sess_dev->device_id);
+	write_unlock(&srv_sess->index_lock);
+	kfree(srv_sess_dev);
+srv_dev_put:
+	if (open_msg->access_mode != RNBD_ACCESS_RO) {
+		mutex_lock(&srv_dev->lock);
+		srv_dev->open_write_cnt--;
+		mutex_unlock(&srv_dev->lock);
+	}
+	rnbd_put_srv_dev(srv_dev);
+rnbd_dev_close:
+	rnbd_dev_close(rnbd_dev);
+free_path:
+	kfree(full_path);
+reject:
+	mutex_unlock(&srv_sess->lock);
+	return ret;
+}
+
+static struct rtrs_srv_ctx *rtrs_ctx;
+
+static int __init rnbd_srv_init_module(void)
+{
+	int err;
+
+	pr_info("Loading module %s, proto %s\n",
+		KBUILD_MODNAME, RNBD_PROTO_VER_STRING);
+
+	rtrs_ctx = rtrs_srv_open(rnbd_srv_rdma_ev, rnbd_srv_link_ev,
+				   port_nr);
+	if (IS_ERR(rtrs_ctx)) {
+		err = PTR_ERR(rtrs_ctx);
+		pr_err("rtrs_srv_open(), err: %d\n", err);
+		return err;
+	}
+
+	err = rnbd_srv_create_sysfs_files();
+	if (err) {
+		pr_err("rnbd_srv_create_sysfs_files(), err: %d\n", err);
+		rtrs_srv_close(rtrs_ctx);
+		return err;
+	}
+
+	return 0;
+}
+
+static void __exit rnbd_srv_cleanup_module(void)
+{
+	rtrs_srv_close(rtrs_ctx);
+	WARN_ON(!list_empty(&sess_list));
+	rnbd_srv_destroy_sysfs_files();
+	pr_info("Module unloaded\n");
+}
+
+module_init(rnbd_srv_init_module);
+module_exit(rnbd_srv_cleanup_module);
-- 
2.17.1
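
A note on rnbd_srv_get_full_path() in the patch above: the function builds
the device path from dev_search_path, substitutes the session name at the
%SESSNAME% placeholder, and then collapses duplicated slashes. Here is a
minimal user-space sketch of those two steps. The helper names
squeeze_slashes() and build_full_path() are hypothetical, the slash
collapsing is written as a plain read/write pointer loop rather than the
exact loop used in the patch, and a plain -1 is returned on overflow.

```c
#include <stdio.h>
#include <string.h>

/* Collapse every run of '/' in path into a single '/', in place. */
static void squeeze_slashes(char *path)
{
	char *dst = path;
	const char *src;

	for (src = path; *src; src++) {
		if (*src == '/' && dst > path && dst[-1] == '/')
			continue;	/* previous kept character was already '/' */
		*dst++ = *src;
	}
	*dst = '\0';
}

/*
 * Build "<search_path up to %SESSNAME%>/<sessname>/<dev_name>", or
 * "<search_path>/<dev_name>" when the placeholder is absent.
 * Returns 0 on success, -1 if the result does not fit into out.
 */
static int build_full_path(char *out, size_t outlen, const char *search_path,
			   const char *sessname, const char *dev_name)
{
	const char *tag = strstr(search_path, "%SESSNAME%");
	int n;

	if (tag)
		n = snprintf(out, outlen, "%.*s/%s/%s",
			     (int)(tag - search_path), search_path,
			     sessname, dev_name);
	else
		n = snprintf(out, outlen, "%s/%s", search_path, dev_name);
	if (n < 0 || (size_t)n >= outlen)
		return -1;

	squeeze_slashes(out);
	return 0;
}
```

As in the patch, anything that follows %SESSNAME% in the search path is
dropped. Unlike the patch, the sketch checks for truncation in both
branches.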



* [PATCH v6 21/25] rnbd: server: functionality for IO submission to file or block dev
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (19 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 20/25] rnbd: server: main functionality Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 10:29 ` [PATCH v6 22/25] rnbd: server: sysfs interface functions Jack Wang
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This provides helper functions for IO submission to a file or block device.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/block/rnbd/rnbd-srv-dev.c | 144 ++++++++++++++++++++++++++++++
 drivers/block/rnbd/rnbd-srv-dev.h | 112 +++++++++++++++++++++++
 2 files changed, 256 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-srv-dev.c
 create mode 100644 drivers/block/rnbd/rnbd-srv-dev.h

diff --git a/drivers/block/rnbd/rnbd-srv-dev.c b/drivers/block/rnbd/rnbd-srv-dev.c
new file mode 100644
index 000000000000..49305e0b2d40
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-srv-dev.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RDMA Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "rnbd-srv-dev.h"
+#include "rnbd-log.h"
+
+struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
+				 struct bio_set *bs, rnbd_dev_io_fn io_cb)
+{
+	struct rnbd_dev *dev;
+	int ret;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	dev->blk_open_flags = flags;
+	dev->bdev = blkdev_get_by_path(path, flags, THIS_MODULE);
+	ret = PTR_ERR_OR_ZERO(dev->bdev);
+	if (ret)
+		goto err;
+
+	dev->blk_open_flags	= flags;
+	dev->io_cb		= io_cb;
+	bdevname(dev->bdev, dev->name);
+	dev->ibd_bio_set	= bs;
+
+	return dev;
+
+err:
+	kfree(dev);
+	return ERR_PTR(ret);
+}
+
+void rnbd_dev_close(struct rnbd_dev *dev)
+{
+	blkdev_put(dev->bdev, dev->blk_open_flags);
+	kfree(dev);
+}
+
+static void rnbd_dev_bi_end_io(struct bio *bio)
+{
+	struct rnbd_dev_blk_io *io = bio->bi_private;
+
+	io->dev->io_cb(io->priv, blk_status_to_errno(bio->bi_status));
+	bio_put(bio);
+}
+
+/**
+ *	rnbd_bio_map_kern	-	map kernel address into bio
+ *	@q: the struct request_queue for the bio
+ *	@data: pointer to buffer to map
+ *	@bs: bio_set to use.
+ *	@len: length in bytes
+ *	@gfp_mask: allocation flags for bio allocation
+ *
+ *	Map the kernel address into a bio suitable for io to a block
+ *	device. Returns an error pointer in case of error.
+ */
+static struct bio *rnbd_bio_map_kern(struct request_queue *q, void *data,
+				      struct bio_set *bs,
+				      unsigned int len, gfp_t gfp_mask)
+{
+	unsigned long kaddr = (unsigned long)data;
+	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	unsigned long start = kaddr >> PAGE_SHIFT;
+	const int nr_pages = end - start;
+	int offset, i;
+	struct bio *bio;
+
+	bio = bio_alloc_bioset(gfp_mask, nr_pages, bs);
+	if (!bio)
+		return ERR_PTR(-ENOMEM);
+
+	offset = offset_in_page(kaddr);
+	for (i = 0; i < nr_pages; i++) {
+		unsigned int bytes = PAGE_SIZE - offset;
+
+		if (len == 0)
+			break;
+
+		if (bytes > len)
+			bytes = len;
+
+		if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
+				    offset) < bytes) {
+			/* we don't support partial mappings */
+			bio_put(bio);
+			return ERR_PTR(-EINVAL);
+		}
+
+		data += bytes;
+		len -= bytes;
+		offset = 0;
+	}
+
+	bio->bi_end_io = bio_put;
+	return bio;
+}
+
+int rnbd_dev_submit_io(struct rnbd_dev *dev, sector_t sector, void *data,
+			size_t len, u32 bi_size, enum rnbd_io_flags flags,
+			short prio, void *priv)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+	struct rnbd_dev_blk_io *io;
+	struct bio *bio;
+
+	/* check if the buffer is suitable for bdev */
+	if (WARN_ON(!blk_rq_aligned(q, (unsigned long)data, len)))
+		return -EINVAL;
+
+	/* Generate bio with pages pointing to the rdma buffer */
+	bio = rnbd_bio_map_kern(q, data, dev->ibd_bio_set, len, GFP_KERNEL);
+	if (IS_ERR(bio))
+		return PTR_ERR(bio);
+
+	io = container_of(bio, struct rnbd_dev_blk_io, bio);
+
+	io->dev		= dev;
+	io->priv	= priv;
+
+	bio->bi_end_io		= rnbd_dev_bi_end_io;
+	bio->bi_private		= io;
+	bio->bi_opf		= rnbd_to_bio_flags(flags);
+	bio->bi_iter.bi_sector	= sector;
+	bio->bi_iter.bi_size	= bi_size;
+	bio_set_prio(bio, prio);
+	bio_set_dev(bio, dev->bdev);
+
+	submit_bio(bio);
+
+	return 0;
+}
diff --git a/drivers/block/rnbd/rnbd-srv-dev.h b/drivers/block/rnbd/rnbd-srv-dev.h
new file mode 100644
index 000000000000..f2b5f8d56383
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-srv-dev.h
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * RDMA Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#ifndef RNBD_SRV_DEV_H
+#define RNBD_SRV_DEV_H
+
+#include <linux/fs.h>
+#include "rnbd-proto.h"
+
+typedef void rnbd_dev_io_fn(void *priv, int error);
+
+struct rnbd_dev {
+	struct block_device	*bdev;
+	struct bio_set		*ibd_bio_set;
+	fmode_t			blk_open_flags;
+	char			name[BDEVNAME_SIZE];
+	rnbd_dev_io_fn		*io_cb;
+};
+
+struct rnbd_dev_blk_io {
+	struct rnbd_dev *dev;
+	void		 *priv;
+	/* must be the last member: bioset_init() front_pad relies on it */
+	struct bio	bio;
+};
+
+/**
+ * rnbd_dev_open() - Open a device
+ * @flags:	open flags
+ * @bs:		bio_set to use during block io,
+ * @io_cb:	is called when I/O finished
+ */
+struct rnbd_dev *rnbd_dev_open(const char *path, fmode_t flags,
+				 struct bio_set *bs, rnbd_dev_io_fn io_cb);
+
+/**
+ * rnbd_dev_close() - Close a device
+ */
+void rnbd_dev_close(struct rnbd_dev *dev);
+
+static inline int rnbd_dev_get_logical_bsize(const struct rnbd_dev *dev)
+{
+	return bdev_logical_block_size(dev->bdev);
+}
+
+static inline int rnbd_dev_get_phys_bsize(const struct rnbd_dev *dev)
+{
+	return bdev_physical_block_size(dev->bdev);
+}
+
+static inline int rnbd_dev_get_max_segs(const struct rnbd_dev *dev)
+{
+	return queue_max_segments(bdev_get_queue(dev->bdev));
+}
+
+static inline int rnbd_dev_get_max_hw_sects(const struct rnbd_dev *dev)
+{
+	return queue_max_hw_sectors(bdev_get_queue(dev->bdev));
+}
+
+static inline int
+rnbd_dev_get_max_write_same_sects(const struct rnbd_dev *dev)
+{
+	return bdev_write_same(dev->bdev);
+}
+
+static inline int rnbd_dev_get_secure_discard(const struct rnbd_dev *dev)
+{
+	return blk_queue_secure_erase(bdev_get_queue(dev->bdev));
+}
+
+static inline int rnbd_dev_get_max_discard_sects(const struct rnbd_dev *dev)
+{
+	if (!blk_queue_discard(bdev_get_queue(dev->bdev)))
+		return 0;
+
+	return blk_queue_get_max_sectors(bdev_get_queue(dev->bdev),
+					 REQ_OP_DISCARD);
+}
+
+static inline int rnbd_dev_get_discard_granularity(const struct rnbd_dev *dev)
+{
+	return bdev_get_queue(dev->bdev)->limits.discard_granularity;
+}
+
+static inline int rnbd_dev_get_discard_alignment(const struct rnbd_dev *dev)
+{
+	return bdev_get_queue(dev->bdev)->limits.discard_alignment;
+}
+
+/**
+ * rnbd_dev_submit_io() - Submit an I/O to the disk
+ * @dev:	device to which the I/O is submitted
+ * @sector:	start sector to read from or write to
+ * @data:	I/O data to write or buffer to read I/O data into
+ * @len:	length of @data
+ * @bi_size:	amount of data that will be read/written
+ * @prio:	IO priority
+ * @priv:	private data passed to the io_cb callback
+ */
+int rnbd_dev_submit_io(struct rnbd_dev *dev, sector_t sector, void *data,
+			size_t len, u32 bi_size, enum rnbd_io_flags flags,
+			short prio, void *priv);
+
+#endif /* RNBD_SRV_DEV_H */
-- 
2.17.1



* [PATCH v6 22/25] rnbd: server: sysfs interface functions
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (20 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 21/25] rnbd: server: functionality for IO submission to file or block dev Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 10:29 ` [PATCH v6 23/25] rnbd: include client and server modules into kernel compilation Jack Wang
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

This is the sysfs interface to rnbd mapped devices on server side:

  /sys/devices/virtual/rnbd-server/ctl/devices/<device_name>/
    |- block_dev
    |  *** link pointing to the corresponding block device sysfs entry
    |
    |- sessions/<session-name>/
       |  *** one directory per session the device is exported to
       |
       |- read_only
       |  *** '1' if the device is mapped read-only, otherwise '0'
       |
       |- access_mode
       |  *** access mode of the mapping: ro, rw or migration
       |
       |- mapping_path
          *** relative device path provided by the client during mapping

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/block/rnbd/rnbd-srv-sysfs.c | 213 ++++++++++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 drivers/block/rnbd/rnbd-srv-sysfs.c
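
Note (illustration only, not part of the patch): the show callbacks below
recover the per-session structure from the embedded kobject with
container_of(). A userspace sketch of that pattern, modelled on
read_only_show(), follows. The sketch_* types are hypothetical, trimmed-down
stand-ins for struct kobject and struct rnbd_srv_sess_dev.

```c
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define SKETCH_FMODE_WRITE 0x2u	/* same bit value as the kernel's FMODE_WRITE */

struct sketch_kobject {
	const char *name;
};

/* Hypothetical, trimmed-down stand-in for struct rnbd_srv_sess_dev. */
struct sketch_sess_dev {
	unsigned int open_flags;
	struct sketch_kobject kobj;	/* embedded, as in the driver */
};

/* Mirrors read_only_show(): "1" when the mapping lacks write access. */
static int sketch_read_only_show(struct sketch_kobject *kobj,
				 char *page, size_t size)
{
	struct sketch_sess_dev *sess_dev;

	sess_dev = sketch_container_of(kobj, struct sketch_sess_dev, kobj);

	return snprintf(page, size, "%s\n",
			(sess_dev->open_flags & SKETCH_FMODE_WRITE) ? "0" : "1");
}
```

Because the kobject is embedded rather than pointed to, no lookup table is
needed to get from the sysfs object back to the session device.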

diff --git a/drivers/block/rnbd/rnbd-srv-sysfs.c b/drivers/block/rnbd/rnbd-srv-sysfs.c
new file mode 100644
index 000000000000..31d0ff520b3b
--- /dev/null
+++ b/drivers/block/rnbd/rnbd-srv-sysfs.c
@@ -0,0 +1,213 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RDMA Network Block Driver
+ *
+ * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
+ *
+ * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
+ *
+ * Copyright (c) 2019 1&1 IONOS SE. All rights reserved.
+ */
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include <uapi/linux/limits.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/stat.h>
+#include <linux/genhd.h>
+#include <linux/list.h>
+#include <linux/moduleparam.h>
+#include <linux/device.h>
+
+#include "rnbd-srv.h"
+
+static struct device *rnbd_dev;
+static struct class *rnbd_dev_class;
+static struct kobject *rnbd_devs_kobj;
+
+static struct kobj_type ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+};
+
+int rnbd_srv_create_dev_sysfs(struct rnbd_srv_dev *dev,
+			       struct block_device *bdev,
+			       const char *dir_name)
+{
+	struct kobject *bdev_kobj;
+	int ret;
+
+	ret = kobject_init_and_add(&dev->dev_kobj, &ktype,
+				   rnbd_devs_kobj, "%s", dir_name);
+	if (ret)
+		return ret;
+
+	ret = kobject_init_and_add(&dev->dev_sessions_kobj,
+				   &ktype,
+				   &dev->dev_kobj, "sessions");
+	if (ret)
+		goto err;
+
+	bdev_kobj = &disk_to_dev(bdev->bd_disk)->kobj;
+	ret = sysfs_create_link(&dev->dev_kobj, bdev_kobj, "block_dev");
+	if (ret)
+		goto err2;
+
+	return 0;
+
+err2:
+	kobject_put(&dev->dev_sessions_kobj);
+err:
+	kobject_put(&dev->dev_kobj);
+	return ret;
+}
+
+void rnbd_srv_destroy_dev_sysfs(struct rnbd_srv_dev *dev)
+{
+	sysfs_remove_link(&dev->dev_kobj, "block_dev");
+	kobject_del(&dev->dev_sessions_kobj);
+	kobject_put(&dev->dev_sessions_kobj);
+	kobject_del(&dev->dev_kobj);
+	kobject_put(&dev->dev_kobj);
+}
+
+static ssize_t read_only_show(struct kobject *kobj, struct kobj_attribute *attr,
+			      char *page)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct rnbd_srv_sess_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 (sess_dev->open_flags & FMODE_WRITE) ? "0" : "1");
+}
+
+static struct kobj_attribute rnbd_srv_dev_session_ro_attr =
+	__ATTR_RO(read_only);
+
+static ssize_t access_mode_show(struct kobject *kobj,
+				struct kobj_attribute *attr,
+				char *page)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct rnbd_srv_sess_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n",
+			 rnbd_access_mode_str(sess_dev->access_mode));
+}
+
+static struct kobj_attribute rnbd_srv_dev_session_access_mode_attr =
+	__ATTR_RO(access_mode);
+
+static ssize_t mapping_path_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *page)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct rnbd_srv_sess_dev, kobj);
+
+	return scnprintf(page, PAGE_SIZE, "%s\n", sess_dev->pathname);
+}
+
+static struct kobj_attribute rnbd_srv_dev_session_mapping_path_attr =
+	__ATTR_RO(mapping_path);
+
+static struct attribute *rnbd_srv_default_dev_sessions_attrs[] = {
+	&rnbd_srv_dev_session_access_mode_attr.attr,
+	&rnbd_srv_dev_session_ro_attr.attr,
+	&rnbd_srv_dev_session_mapping_path_attr.attr,
+	NULL,
+};
+
+static struct attribute_group rnbd_srv_default_dev_session_attr_group = {
+	.attrs = rnbd_srv_default_dev_sessions_attrs,
+};
+
+void rnbd_srv_destroy_dev_session_sysfs(struct rnbd_srv_sess_dev *sess_dev)
+{
+	DECLARE_COMPLETION_ONSTACK(sysfs_compl);
+
+	sysfs_remove_group(&sess_dev->kobj,
+			   &rnbd_srv_default_dev_session_attr_group);
+
+	sess_dev->sysfs_release_compl = &sysfs_compl;
+	kobject_del(&sess_dev->kobj);
+	kobject_put(&sess_dev->kobj);
+	wait_for_completion(&sysfs_compl);
+}
+
+static void rnbd_srv_sess_dev_release(struct kobject *kobj)
+{
+	struct rnbd_srv_sess_dev *sess_dev;
+
+	sess_dev = container_of(kobj, struct rnbd_srv_sess_dev, kobj);
+	if (sess_dev->sysfs_release_compl)
+		complete_all(sess_dev->sysfs_release_compl);
+}
+
+static struct kobj_type rnbd_srv_sess_dev_ktype = {
+	.sysfs_ops	= &kobj_sysfs_ops,
+	.release	= rnbd_srv_sess_dev_release,
+};
+
+int rnbd_srv_create_dev_session_sysfs(struct rnbd_srv_sess_dev *sess_dev)
+{
+	int ret;
+
+	ret = kobject_init_and_add(&sess_dev->kobj, &rnbd_srv_sess_dev_ktype,
+				   &sess_dev->dev->dev_sessions_kobj, "%s",
+				   sess_dev->sess->sessname);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_group(&sess_dev->kobj,
+				 &rnbd_srv_default_dev_session_attr_group);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	kobject_put(&sess_dev->kobj);
+
+	return ret;
+}
+
+int rnbd_srv_create_sysfs_files(void)
+{
+	int err;
+
+	rnbd_dev_class = class_create(THIS_MODULE, "rnbd-server");
+	if (IS_ERR(rnbd_dev_class))
+		return PTR_ERR(rnbd_dev_class);
+
+	rnbd_dev = device_create(rnbd_dev_class, NULL,
+				  MKDEV(0, 0), NULL, "ctl");
+	if (IS_ERR(rnbd_dev)) {
+		err = PTR_ERR(rnbd_dev);
+		goto cls_destroy;
+	}
+	rnbd_devs_kobj = kobject_create_and_add("devices", &rnbd_dev->kobj);
+	if (unlikely(!rnbd_devs_kobj)) {
+		err = -ENOMEM;
+		goto dev_destroy;
+	}
+
+	return 0;
+
+dev_destroy:
+	device_destroy(rnbd_dev_class, MKDEV(0, 0));
+cls_destroy:
+	class_destroy(rnbd_dev_class);
+
+	return err;
+}
+
+void rnbd_srv_destroy_sysfs_files(void)
+{
+	kobject_del(rnbd_devs_kobj);
+	kobject_put(rnbd_devs_kobj);
+	device_destroy(rnbd_dev_class, MKDEV(0, 0));
+	class_destroy(rnbd_dev_class);
+}
-- 
2.17.1



* [PATCH v6 23/25] rnbd: include client and server modules into kernel compilation
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (21 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 22/25] rnbd: server: sysfs interface functions Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 10:29 ` [PATCH v6 24/25] rnbd: a bit of documentation Jack Wang
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

Add the rnbd Makefile and Kconfig, and hook them into the parent
drivers/block Kconfig and Makefile.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 drivers/block/Kconfig       |  2 ++
 drivers/block/Makefile      |  1 +
 drivers/block/rnbd/Kconfig  | 28 ++++++++++++++++++++++++++++
 drivers/block/rnbd/Makefile | 17 +++++++++++++++++
 4 files changed, 48 insertions(+)
 create mode 100644 drivers/block/rnbd/Kconfig
 create mode 100644 drivers/block/rnbd/Makefile

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1bb8ec575352..1636a7d9e91e 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -468,4 +468,6 @@ config BLK_DEV_RSXX
 	  To compile this driver as a module, choose M here: the
 	  module will be called rsxx.
 
+source "drivers/block/rnbd/Kconfig"
+
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index a53cc1e3a2d3..914f9d07835c 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
 
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_BLK_DEV_RNBD)	+= rnbd/
 
 obj-$(CONFIG_BLK_DEV_NULL_BLK)	+= null_blk.o
 null_blk-objs	:= null_blk_main.o
diff --git a/drivers/block/rnbd/Kconfig b/drivers/block/rnbd/Kconfig
new file mode 100644
index 000000000000..67882d2ff58f
--- /dev/null
+++ b/drivers/block/rnbd/Kconfig
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+config BLK_DEV_RNBD
+	bool
+
+config BLK_DEV_RNBD_CLIENT
+	tristate "Network block device driver on top of RTRS transport"
+	depends on INFINIBAND_RTRS_CLIENT
+	select BLK_DEV_RNBD
+	help
+	  RNBD client is a network block device driver using RDMA transport.
+
+	  RNBD client allows for mapping of remote block devices over the
+	  RTRS protocol from a target system where the RNBD server is running.
+
+	  If unsure, say N.
+
+config BLK_DEV_RNBD_SERVER
+	tristate "Network block device over RDMA Infiniband server support"
+	depends on INFINIBAND_RTRS_SERVER
+	select BLK_DEV_RNBD
+	help
+	  RNBD server is the server side of RNBD using RDMA transport.
+
+	  RNBD server allows for exporting local block devices to a remote client
+	  over the RTRS protocol.
+
+	  If unsure, say N.
diff --git a/drivers/block/rnbd/Makefile b/drivers/block/rnbd/Makefile
new file mode 100644
index 000000000000..125c3576f221
--- /dev/null
+++ b/drivers/block/rnbd/Makefile
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+ccflags-y := -Idrivers/infiniband/ulp/rtrs
+
+rnbd-client-y := rnbd-clt.o \
+		  rnbd-common.o \
+		  rnbd-clt-sysfs.o
+
+rnbd-server-y := rnbd-srv.o \
+		  rnbd-common.o \
+		  rnbd-srv-dev.o \
+		  rnbd-srv-sysfs.o
+
+obj-$(CONFIG_BLK_DEV_RNBD_CLIENT) += rnbd-client.o
+obj-$(CONFIG_BLK_DEV_RNBD_SERVER) += rnbd-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.17.1



* [PATCH v6 24/25] rnbd: a bit of documentation
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (22 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 23/25] rnbd: include client and server modules into kernel compilation Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 10:29 ` [PATCH v6 25/25] MAINTAINERS: Add maintainers for RNBD/RTRS modules Jack Wang
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev, linux-kernel

From: Jack Wang <jinpu.wang@cloud.ionos.com>

Add a README and describe the major sysfs entries. The sysfs documentation
is moved to the ABI directory, as Bart suggested.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: linux-kernel@vger.kernel.org
---
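
Note (illustration only, not part of the patch): the map_device format
documented below can be composed programmatically before it is written to
sysfs. sketch_format_map_cmd() is a hypothetical userspace helper, not an
RNBD interface. It only checks the three access modes that the ABI file
lists and does not validate addresses.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: compose the string written to map_device.
 * access_mode may be NULL to rely on the documented default ("rw").
 * Returns the string length, or -1 on invalid input or truncation.
 */
static int sketch_format_map_cmd(char *buf, size_t size,
				 const char *sessname, const char *path,
				 const char *device_path,
				 const char *access_mode)
{
	int n;

	if (!sessname || !path || !device_path)
		return -1;

	if (access_mode &&
	    strcmp(access_mode, "ro") != 0 &&
	    strcmp(access_mode, "rw") != 0 &&
	    strcmp(access_mode, "migration") != 0)
		return -1;

	if (access_mode)
		n = snprintf(buf, size,
			     "sessname=%s path=%s device_path=%s access_mode=%s",
			     sessname, path, device_path, access_mode);
	else
		n = snprintf(buf, size, "sessname=%s path=%s device_path=%s",
			     sessname, path, device_path);

	if (n < 0 || (size_t)n >= size)
		return -1;

	return n;
}
```

With the values from the README quick start, this yields the same string
that is echoed into map_device there.
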
 Documentation/ABI/testing/sysfs-block-rnbd    |  51 ++++++++
 .../ABI/testing/sysfs-class-rnbd-client       | 117 ++++++++++++++++++
 .../ABI/testing/sysfs-class-rnbd-server       |  57 +++++++++
 drivers/block/rnbd/README                     |  92 ++++++++++++++
 4 files changed, 317 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-block-rnbd
 create mode 100644 Documentation/ABI/testing/sysfs-class-rnbd-client
 create mode 100644 Documentation/ABI/testing/sysfs-class-rnbd-server
 create mode 100644 drivers/block/rnbd/README
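
Note (illustration only, not part of the patch): two path conventions from
the ABI files below, sketched in userspace C. The first derives <device_id>
from device_path, the second expands dev_search_path with %SESSNAME%. Both
helpers are hypothetical and only reproduce the documented examples; in
particular, replacing only the first %SESSNAME% occurrence is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Copy device_path into id, replacing every '/' with '!'. */
static int sketch_device_id(char *id, size_t size, const char *device_path)
{
	size_t i, len = strlen(device_path);

	if (len >= size)
		return -1;
	for (i = 0; i <= len; i++)	/* <= copies the terminating NUL too */
		id[i] = device_path[i] == '/' ? '!' : device_path[i];
	return 0;
}

/*
 * Build "<dev_search_path>/<device_path>", substituting the first
 * "%SESSNAME%" in dev_search_path with sessname.
 */
static int sketch_full_path(char *out, size_t size, const char *search_path,
			    const char *sessname, const char *device_path)
{
	static const char token[] = "%SESSNAME%";
	const char *p = strstr(search_path, token);
	int n;

	if (p)
		n = snprintf(out, size, "%.*s%s%s/%s",
			     (int)(p - search_path), search_path, sessname,
			     p + strlen(token), device_path);
	else
		n = snprintf(out, size, "%s/%s", search_path, device_path);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For dev_search_path=/run/rnbd-devs/%SESSNAME%, sessname=blya and
device_path=sda this gives /run/rnbd-devs/blya/sda, matching the example in
sysfs-class-rnbd-client.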

diff --git a/Documentation/ABI/testing/sysfs-block-rnbd b/Documentation/ABI/testing/sysfs-block-rnbd
new file mode 100644
index 000000000000..dea84280291a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-block-rnbd
@@ -0,0 +1,51 @@
+What:		/sys/block/rnbd<N>/rnbd/unmap_device
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+To unmap a volume, "normal" or "force" has to be written to:
+  /sys/block/rnbd<N>/rnbd/unmap_device
+
+When "normal" is used, the operation will fail with EBUSY if any process
+is using the device.  When "force" is used, the device is unmapped even
+if it is in use, and all I/Os that are in progress will fail.
+
+Example:
+
+   # echo "normal" > /sys/block/rnbd0/rnbd/unmap_device
+
+What:		/sys/block/rnbd<N>/rnbd/state
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+The file contains the current state of the block device. The state file
+returns "open" when the device is successfully mapped from the server
+and accepting I/O requests. When the connection to the server gets
+disconnected in case of an error (e.g. link failure), the state file
+returns "closed" and all I/O requests submitted to it will fail with -EIO.
+
+What:		/sys/block/rnbd<N>/rnbd/session
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+RNBD uses an RTRS session to transport the data between client and
+server.  The entry "session" contains the name that was used to
+establish the RTRS session.  It is the same name that was passed as
+the "sessname" parameter to the map_device entry.
+
+What:		/sys/block/rnbd<N>/rnbd/mapping_path
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains the path that was passed as "device_path" to the map_device
+operation.
+
+What:		/sys/block/rnbd<N>/rnbd/access_mode
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains the device access mode: ro, rw or migration.
diff --git a/Documentation/ABI/testing/sysfs-class-rnbd-client b/Documentation/ABI/testing/sysfs-class-rnbd-client
new file mode 100644
index 000000000000..7466a6aa2641
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-rnbd-client
@@ -0,0 +1,117 @@
+What:		/sys/class/rnbd-client
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Provide information about RNBD-client.
+All sysfs files that are not read-only provide the usage information on read:
+
+Example:
+  # cat /sys/class/rnbd-client/ctl/map_device
+
+  > Usage: echo "sessname=<name of the rtrs session> path=<[srcaddr,]dstaddr>
+  > [path=<[srcaddr,]dstaddr>] device_path=<full path on remote side>
+  > [access_mode=<ro|rw|migration>]" > map_device
+  >
+  > addr ::= [ ip:<ipv4> | ip:<ipv6> | gid:<gid> ]
+
+What:		/sys/class/rnbd-client/ctl/map_device
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Expected format is the following:
+
+    sessname=<name of the rtrs session>
+    path=<[srcaddr,]dstaddr> [path=<[srcaddr,]dstaddr> ...]
+    device_path=<full path on remote side>
+    [access_mode=<ro|rw|migration>]
+
+Where:
+
+sessname: accepts a string not bigger than 256 chars, which identifies
+          a given session on the client and on the server.
+          E.g. "clt_hostname-srv_hostname" could be a natural choice.
+
+path:     describes a connection between the client and the server by
+          specifying destination and, when required, the source address.
+          The addresses are to be provided in the following format:
+
+            ip:<IPv6>
+            ip:<IPv4>
+            gid:<GID>
+
+          for example:
+
+          path=ip:10.0.0.66
+                         The single addr is treated as the destination.
+                         The connection will be established to this
+                         server from any client IP address.
+
+          path=ip:10.0.0.66,ip:10.0.1.66
+                         First addr is the source address and the second
+                         is the destination.
+
+          If multiple "path=" options are specified, multiple connections
+          will be established and data will be sent according to
+          the selected multipath policy (see RTRS mp_policy sysfs entry
+          description).
+
+device_path: Path to the block device on the server side. Path is specified
+         relative to the directory on server side configured in the
+         'dev_search_path' module parameter of the rnbd_server.
+         The rnbd_server prepends the <device_path> received from client
+         with <dev_search_path> and tries to open the
+         <dev_search_path>/<device_path> block device.  On success,
+         a /dev/rnbd<N> device file, a /sys/block/rnbd<N>/
+         directory and an entry in /sys/class/rnbd-client/ctl/devices
+         will be created.
+
+         If 'dev_search_path' contains '%SESSNAME%', then each session can
+         have its own device namespace, e.g. if the server was configured with
+         the parameter "dev_search_path=/run/rnbd-devs/%SESSNAME%" and the
+         client maps with "sessname=blya device_path=sda", then the server
+         will try to open: /run/rnbd-devs/blya/sda.
+
+access_mode: the access_mode parameter specifies if the device is to be
+             mapped as "ro" read-only or "rw" read-write. The server allows
+             a device to be exported in rw mode only once. The "migration"
+             access mode has to be specified if a second mapping in read-write
+             mode is desired.
+
+             By default "rw" is used.
+
+Exit Codes:
+
+If the device is already mapped it will fail with EEXIST. If the input
+has an invalid format it will return EINVAL. If the device path cannot
+be found on the server, it will fail with ENOENT.
+
+Finding device file after mapping
+---------------------------------
+
+After mapping, the device file can be found by:
+ o  The symlink /sys/class/rnbd-client/ctl/devices/<device_id>
+    points to /sys/block/<dev-name>. The last part of the symlink destination
+    is the same as the device name.  By extracting the last part of the
+    path the path to the device /dev/<dev-name> can be built.
+
+ o  /dev/block/$(cat /sys/class/rnbd-client/ctl/devices/<device_id>/dev)
+
+How to find the <device_id> of the device is described in the next
+section.
+
+What:		/sys/class/rnbd-client/ctl/devices/
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+For each device mapped on the client a new symbolic link is created as
+/sys/class/rnbd-client/ctl/devices/<device_id>, which points
+to the block device created by rnbd (/sys/block/rnbd<N>/).
+The <device_id> of each device is created as follows:
+
+- If the 'device_path' provided during mapping contains slashes ("/"),
+  they are replaced by exclamation marks ("!") and the result is used as
+  the <device_id>. Otherwise, the <device_id> will be the same as the
+  "device_path" provided.
diff --git a/Documentation/ABI/testing/sysfs-class-rnbd-server b/Documentation/ABI/testing/sysfs-class-rnbd-server
new file mode 100644
index 000000000000..1faa9faa8ca3
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-rnbd-server
@@ -0,0 +1,57 @@
+What:		/sys/class/rnbd-server
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:   provide information about RNBD-server.
+
+What:		/sys/class/rnbd-server/ctl/
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+When a client maps a device, a directory entry with the name of the
+block device is created under /sys/class/rnbd-server/ctl/devices/.
+
+What:		/sys/class/rnbd-server/ctl/devices/<device_name>/block_dev
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Is a symlink to the sysfs entry of the exported device.
+
+Example:
+
+  block_dev -> ../../../../class/block/ram0
+
+What:		/sys/class/rnbd-server/ctl/devices/<device_name>/sessions/
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+For each client a particular device is exported to, the following directory
+is created:
+
+/sys/class/rnbd-server/ctl/devices/<device_name>/sessions/<session-name>/
+
+When the device is unmapped by that client, the directory will be removed.
+
+What: /sys/class/rnbd-server/ctl/devices/<device_name>/sessions/<session-name>/read_only
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains '1' if device is mapped read-only, otherwise '0'.
+
+What: /sys/class/rnbd-server/ctl/devices/<device_name>/sessions/<session-name>/mapping_path
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains the relative device path provided by the user during mapping.
+
+What: /sys/class/rnbd-server/ctl/devices/<device_name>/sessions/<session-name>/access_mode
+Date:		Jan 2020
+KernelVersion:	5.6
+Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
+Description:
+Contains the device access mode: ro, rw or migration.
diff --git a/drivers/block/rnbd/README b/drivers/block/rnbd/README
new file mode 100644
index 000000000000..3ee009d72fbb
--- /dev/null
+++ b/drivers/block/rnbd/README
@@ -0,0 +1,92 @@
+********************************
+RDMA Network Block Device (RNBD)
+********************************
+
+Introduction
+------------
+
+RNBD (RDMA Network Block Device) is a pair of kernel modules
+(client and server) that allow for remote access of a block device on
+the server over RTRS protocol using the RDMA (InfiniBand, RoCE, iWarp)
+transport. After being mapped, the remote block devices can be accessed
+on the client side as local block devices.
+
+I/O is transferred between client and server by the RTRS transport
+modules. The administration of RNBD and RTRS modules is done via
+sysfs entries.
+
+Requirements
+------------
+
+  RTRS kernel modules
+
+Quick Start
+-----------
+
+Server side:
+  # modprobe rnbd_server
+
+Client side:
+  # modprobe rnbd_client
+  # echo "sessname=blya path=ip:10.50.100.66 device_path=/dev/ram0" > \
+            /sys/devices/virtual/rnbd-client/ctl/map_device
+
+  Where "sessname=" is a session name, a string to identify the session
+  on client and on server sides; "path=" is a destination IP address or
+  a pair of a source and a destination IPs, separated by comma.  Multiple
+  "path=" options can be specified in order to use multipath  (see RTRS
+  description for details); "device_path=" is the block device to be
+  mapped from the server side. After the session to the server machine is
+  established, the mapped device will appear on the client side under
+  /dev/rnbd<N>.
+
+
+RNBD-Server Module Parameters
+==============================
+
+dev_search_path
+---------------
+
+When a device is mapped from the client, the server generates the path
+to the block device on the server side by concatenating dev_search_path
+and the "device_path" that was specified in the map_device operation.
+
+The default dev_search_path is: "/".
+
+dev_search_path option can also contain %SESSNAME% in order to provide
+different device namespaces for different sessions.  See "device_path"
+option for details.
+
+==============================
+Protocol (rnbd/rnbd-proto.h)
+==============================
+
+1. Before mapping the first device from a given server, the client sends
+an RNBD_MSG_SESS_INFO to the server. The server responds with
+RNBD_MSG_SESS_INFO_RSP. Currently the messages only contain the protocol
+version for backward compatibility.
+
+2. The client requests to open a device by sending an RNBD_MSG_OPEN message.
+This contains the path to the device and the access mode (read-only or
+writable). The server responds with RNBD_MSG_OPEN_RSP, which contains
+a 32-bit device id to be used for IOs and device "geometry" related
+information: size, max_hw_sectors, etc.
+
+3. The client attaches RNBD_MSG_IO to each IO message sent to a device. This
+message contains the device id provided by the server in its
+rnbd_msg_open_rsp, the sector to be accessed, read-write flags and bi_size.
+
+4. Client closes a device by sending RNBD_MSG_CLOSE which contains only the
+device id provided by the server.
+
+=========================================
+Contributors List (in alphabetical order)
+=========================================
+Danil Kipnis <danil.kipnis@profitbricks.com>
+Fabian Holler <mail@fholler.de>
+Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
+Jack Wang <jinpu.wang@profitbricks.com>
+Kleber Souza <kleber.souza@profitbricks.com>
+Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
+Milind Dumbare <Milind.dumbare@gmail.com>
+Roman Penyaev <roman.penyaev@profitbricks.com>
-- 
2.17.1


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 25/25] MAINTAINERS: Add maintainers for RNBD/RTRS modules
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (23 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 24/25] rnbd: a bit of documentation Jack Wang
@ 2019-12-30 10:29 ` Jack Wang
  2019-12-30 12:22   ` Gal Pressman
  2019-12-31  0:11 ` [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Bart Van Assche
  2019-12-31  2:39 ` Bart Van Assche
  26 siblings, 1 reply; 89+ messages in thread
From: Jack Wang @ 2019-12-30 10:29 UTC (permalink / raw)
  To: linux-block, linux-rdma
  Cc: axboe, hch, sagi, bvanassche, leon, dledford, danil.kipnis,
	jinpu.wang, rpenyaev

From: Jack Wang <jinpu.wang@cloud.ionos.com>

Danil and me will maintain RNBD/RTRS modules.

Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
---
 MAINTAINERS | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e09bd92a1e44..2ba370d8145d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14125,6 +14125,13 @@ F:	arch/riscv/
 K:	riscv
 N:	riscv
 
+RNBD BLOCK DRIVERS
+M:	Danil Kipnis <danil.kipnis@cloud.ionos.com>
+M:	Jack Wang <jinpu.wang@cloud.ionos.com>
+L:	linux-block@vger.kernel.org
+S:	Maintained
+F:	drivers/block/rnbd/
+
 ROCCAT DRIVERS
 M:	Stefan Achatz <erazor_de@users.sourceforge.net>
 W:	http://sourceforge.net/projects/roccat/
@@ -14192,6 +14199,13 @@ F:	include/net/rose.h
 F:	include/uapi/linux/rose.h
 F:	net/rose/
 
+RTRS TRANSPORT DRIVERS
+M:	Danil Kipnis <danil.kipnis@cloud.ionos.com>
+M:	Jack Wang <jinpu.wang@cloud.ionos.com>
+L:	linux-rdma@vger.kernel.org
+S:	Maintained
+F:	drivers/infiniband/ulp/rtrs/
+
 RTL2830 MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
 L:	linux-media@vger.kernel.org
-- 
2.17.1



* Re: [PATCH v6 25/25] MAINTAINERS: Add maintainers for RNBD/RTRS modules
  2019-12-30 10:29 ` [PATCH v6 25/25] MAINTAINERS: Add maintainers for RNBD/RTRS modules Jack Wang
@ 2019-12-30 12:22   ` Gal Pressman
  2020-01-02  8:41     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Gal Pressman @ 2019-12-30 12:22 UTC (permalink / raw)
  To: Jack Wang
  Cc: linux-block, linux-rdma, axboe, hch, sagi, bvanassche, leon,
	dledford, danil.kipnis, jinpu.wang, rpenyaev

On 30/12/2019 12:29, Jack Wang wrote:
> From: Jack Wang <jinpu.wang@cloud.ionos.com>
> 
> Danil and me will maintain RNBD/RTRS modules.
> 
> Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
> Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
> ---
>  MAINTAINERS | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e09bd92a1e44..2ba370d8145d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14125,6 +14125,13 @@ F:	arch/riscv/
>  K:	riscv
>  N:	riscv
>  
> +RNBD BLOCK DRIVERS
> +M:	Danil Kipnis <danil.kipnis@cloud.ionos.com>
> +M:	Jack Wang <jinpu.wang@cloud.ionos.com>
> +L:	linux-block@vger.kernel.org
> +S:	Maintained
> +F:	drivers/block/rnbd/
> +
>  ROCCAT DRIVERS
>  M:	Stefan Achatz <erazor_de@users.sourceforge.net>
>  W:	http://sourceforge.net/projects/roccat/
> @@ -14192,6 +14199,13 @@ F:	include/net/rose.h
>  F:	include/uapi/linux/rose.h
>  F:	net/rose/
>  
> +RTRS TRANSPORT DRIVERS
> +M:	Danil Kipnis <danil.kipnis@cloud.ionos.com>
> +M:	Jack Wang <jinpu.wang@cloud.ionos.com>
> +L:	linux-rdma@vger.kernel.org
> +S:	Maintained
> +F:	drivers/infiniband/ulp/rtrs/
> +
>  RTL2830 MEDIA DRIVER
>  M:	Antti Palosaari <crope@iki.fi>
>  L:	linux-media@vger.kernel.org

RTRS should be after RTL, right :)?


* Re: [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections
  2019-12-30 10:29 ` [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections Jack Wang
@ 2019-12-30 19:25   ` Bart Van Assche
  2020-01-02 13:35     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-30 19:25 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> +/*
> + * Here goes RTRS client API
> + */

A comment that explains what the abbreviation "RTRS" stands for would be
welcome here. Additionally, I think that "Here goes" can be left out.

> +/**
> + * rtrs_clt_open() - Open a session to an RTRS server
> + * @priv: User supplied private data.
> + * @link_ev: Event notification for connection state changes

Please mention that @link_ev is a callback function.

> + *	@priv: User supplied data that was passed to rtrs_clt_open()
> + *	@ev: Occurred event

Is this patch series W=1 clean? @link_ev arguments should be documented
above the link_clt_ev_fn typedef.

> + * @path_cnt: Number of elemnts in the @paths array

elemnts -> elements?

> + * Starts session establishment with the rtrs_server. The function can block
> + * up to ~2000ms until it returns.

until -> before?

> +struct rtrs_clt *rtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
> +				 const char *sessname,
> +				 const struct rtrs_addr *paths,
> +				 size_t path_cnt, short port,
> +				 size_t pdu_sz, u8 reconnect_delay_sec,
> +				 u16 max_segments,
> +				 s16 max_reconnect_attempts);

Since the range for port numbers is 1..65535, please change "short port"
into "u16 port".
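As a rough userspace illustration of why the type matters (a sketch only; the helper names below are hypothetical and not part of the RTRS API): on the usual Linux ABIs, where short is 16 bits wide, a port number above 32767 does not survive being stored in a short, while it does survive in an unsigned 16-bit type.

```c
#include <stdint.h>

/* Hypothetical helper: keep a port in a signed 16-bit type, as "short port" does. */
int demo_port_via_short(unsigned int port)
{
	short p = (short)port;	/* implementation-defined for values above SHRT_MAX */

	return p;		/* promoted back to int with sign extension */
}

/* Hypothetical helper: keep a port in an unsigned 16-bit type (u16 in the kernel). */
int demo_port_via_u16(unsigned int port)
{
	uint16_t p = (uint16_t)port;

	return p;
}
```

With these helpers, demo_port_via_u16(50000) gives back 50000, whereas demo_port_via_short(50000) gives back a different (on common compilers, negative) value.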

> +/**
> + * enum rtrs_clt_con_type() type of ib connection to use with a given permit

What is a "permit"?

> + * @vec:	Message that is send to server together with the request.

send -> sent?

> + *		Sum of len of all @vec elements limited to <= IO_MSG_SIZE.
> + *		Since the msg is copied internally it can be allocated on stack.
> + * @nr:		Number of elements in @vec.
> + * @len:	length of data send to/from server

send -> sent?

> +/**
> + * link_ev_fn():	Events about connective state changes

connective -> connection?

> +/**
> + * rtrs_srv_open() - open RTRS server context
> + * @ops:		callback functions
> + *
> + * Creates server context with specified callbacks.
> + *
> + * Return a valid pointer on success otherwise PTR_ERR.
> + */
> +struct rtrs_srv_ctx *rtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
> +				     unsigned int port);

Is this patch series W=1 clean? The documented argument does not match
the actual argument list.

Bart.


* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2019-12-30 10:29 ` [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers Jack Wang
@ 2019-12-30 19:48   ` Bart Van Assche
  2020-01-02 15:27     ` Jinpu Wang
  2019-12-31  0:07   ` Bart Van Assche
  1 sibling, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-30 19:48 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> + * InfiniBand Transport Layer

Is RTRS an InfiniBand or an RDMA transport layer?

> +#define rtrs_prefix(obj) (obj->sessname)

Is it really worth it to introduce a macro for accessing a single member
of a single pointer?

> + * InfiniBand Transport Layer

Same question here: is RTRS an InfiniBand or an RDMA transport layer?

> +enum {
> +	SERVICE_CON_QUEUE_DEPTH = 512,

What is a service connection?

> +	/*
> +	 * With the current size of the tag allocated on the client, 4K
> +	 * is the maximum number of tags we can allocate.  This number is
> +	 * also used on the client to allocate the IU for the user connection
> +	 * to receive the RDMA addresses from the server.
> +	 */

What does the word 'tag' mean in the context of the RTRS protocol?

> +struct rtrs_ib_dev;

What does the "rtrs_ib_dev" data structure represent? Additionally, I
think it's confusing that a single name has an "r" that refers to "RDMA"
and "ib" that refers to InfiniBand.

> +struct rtrs_ib_dev_pool {
> +	struct mutex		mutex;
> +	struct list_head	list;
> +	enum ib_pd_flags	pd_flags;
> +	const struct rtrs_ib_dev_pool_ops *ops;
> +};

What is the purpose of an rtrs_ib_dev_pool and what does it contain?

> +struct rtrs_iu {

A comment that explains what the "iu" abbreviation stands for would be
welcome.

> +/**
> + * enum rtrs_msg_types - RTRS message types.
> + * @RTRS_MSG_INFO_REQ:		Client additional info request to the server
> + * @RTRS_MSG_INFO_RSP:		Server additional info response to the client
> + * @RTRS_MSG_WRITE:		Client writes data per RDMA to server
> + * @RTRS_MSG_READ:		Client requests data transfer from server
> + * @RTRS_MSG_RKEY_RSP:		Server refreshed rkey for rbuf
> + */

What is "additional info" in this context?

> +/**
> + * struct rtrs_msg_conn_req - Client connection request to the server
> + * @magic:	   RTRS magic
> + * @version:	   RTRS protocol version
> + * @cid:	   Current connection id
> + * @cid_num:	   Number of connections per session
> + * @recon_cnt:	   Reconnections counter
> + * @sess_uuid:	   UUID of a session (path)
> + * @paths_uuid:	   UUID of a group of sessions (paths)
> + *
> + * NOTE: max size 56 bytes, see man rdma_connect().
> + */
> +struct rtrs_msg_conn_req {
> +	u8		__cma_version; /* Is set to 0 by cma.c in case of
> +					* AF_IB, do not touch that.
> +					*/
> +	u8		__ip_version;  /* On sender side that should be
> +					* set to 0, or cma_save_ip_info()
> +					* extract garbage and will fail.
> +					*/

The above two fields and the comments next to it look suspicious to me.
Does RTRS perhaps try to generate CMA-formatted messages without using
the CMA to format these messages?

> +	u8		reserved[12];

Please leave out the reserved data. If future versions of the protocol
would need any of these bytes it is easy to add more data to this structure.

> +/**
> + * struct rtrs_msg_conn_rsp - Server connection response to the client
> + * @magic:	   RTRS magic
> + * @version:	   RTRS protocol version
> + * @errno:	   If rdma_accept() then 0, if rdma_reject() indicates error
> + * @queue_depth:   max inflight messages (queue-depth) in this session
> + * @max_io_size:   max io size server supports
> + * @max_hdr_size:  max msg header size server supports
> + *
> + * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
> + */
> +struct rtrs_msg_conn_rsp {
> +	__le16		magic;
> +	__le16		version;
> +	__le16		errno;
> +	__le16		queue_depth;
> +	__le32		max_io_size;
> +	__le32		max_hdr_size;
> +	__le32		flags;
> +	u8		reserved[36];
> +};

Same comment here: please leave out the "reserved[]" array. Sending a
bunch of zero-bytes at the end of a message over the wire is not useful.

> +static inline void rtrs_from_imm(u32 imm, u32 *type, u32 *payload)
> +{
> +	*payload = (imm & MAX_IMM_PAYL_MASK);
> +	*type = (imm >> MAX_IMM_PAYL_BITS);
> +}

Please do not use parentheses when not necessary. Such superfluous
parentheses hurt the readability of the code.

> +	type = (w_inval ? RTRS_IO_RSP_W_INV_IMM : RTRS_IO_RSP_IMM);

Same comment here: I think the parentheses can be left out from the
above statement.

> +static inline void rtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
> +{
> +	/* 9 bits for errno, 19 bits for msg_id */
> +	*msg_id = (payload & 0x7ffff);

Are the parentheses in the above expression necessary?

> +	*errno = -(int)((payload >> 19) & 0x1ff);

Is the '(int)' cast useful in the above expression? Can it be left out?

> +#define STAT_ATTR(type, stat, print, reset)				\
> +STAT_STORE_FUNC(type, stat, reset)					\
> +STAT_SHOW_FUNC(type, stat, print)					\
> +static struct kobj_attribute stat##_attr =				\
> +		__ATTR(stat, 0644,					\
> +		       stat##_show,					\
> +		       stat##_store)

Is the above use of __ATTR() perhaps an open-coded version of __ATTR_RW()?

Thanks,

Bart.


* Re: [PATCH v6 04/25] rtrs: core: lib functions shared between client and server modules
  2019-12-30 10:29 ` [PATCH v6 04/25] rtrs: core: lib functions shared between client and server modules Jack Wang
@ 2019-12-30 22:25   ` Bart Van Assche
  2020-01-07 12:22     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-30 22:25 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> + * InfiniBand Transport Layer

Is RTRS an InfiniBand or an RDMA transport layer?

> +MODULE_DESCRIPTION("RTRS Core");

Please write out RTRS in full and consider changing the word "Core" into
"client and server".

> +	WARN_ON(!queue_size);
> +	ius = kcalloc(queue_size, sizeof(*ius), gfp_mask);
> +
> +	if (unlikely(!ius))
> +		return NULL;

No blank line between the 'ius' assignment and the 'ius' check please.

> +int rtrs_iu_post_recv(struct rtrs_con *con, struct rtrs_iu *iu)
> +{
> +	struct rtrs_sess *sess = con->sess;
> +	struct ib_recv_wr wr;
> +	const struct ib_recv_wr *bad_wr;
> +	struct ib_sge list;
> +
> +	list.addr   = iu->dma_addr;
> +	list.length = iu->size;
> +	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
> +
> +	if (WARN_ON(list.length == 0)) {
> +		rtrs_wrn(con->sess,
> +			  "Posting receive work request failed, sg list is empty\n");
> +		return -EINVAL;
> +	}
> +
> +	wr.next    = NULL;
> +	wr.wr_cqe  = &iu->cqe;
> +	wr.sg_list = &list;
> +	wr.num_sge = 1;
> +
> +	return ib_post_recv(con->qp, &wr, &bad_wr);
> +}
> +EXPORT_SYMBOL_GPL(rtrs_iu_post_recv);

The above code is fragile: although this is unlikely, if a member is ever
added to struct ib_sge or to struct ib_recv_wr, the above code will leave
that member uninitialized. Have you considered initializing these
structures with a single assignment statement, e.g. as follows:

	wr = (struct ib_recv_wr) {
		.wr_cqe = ...,
		.sg_list = ...,
		.num_sge = 1,
	};
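
The difference between the two styles can be shown with a small standalone sketch. The structure below is only a stand-in for struct ib_recv_wr (the real one lives in <rdma/ib_verbs.h>), and the "added_later" member and the demo_* names are assumptions made up for the illustration:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for struct ib_recv_wr; not the real kernel structure. */
struct demo_recv_wr {
	struct demo_recv_wr *next;
	void *wr_cqe;
	void *sg_list;
	int num_sge;
	int added_later;	/* imagine this member is added in a future kernel */
};

/* Field-by-field assignment: members that are not listed keep whatever was in *wr. */
void demo_init_by_assignment(struct demo_recv_wr *wr, void *cqe, void *sge)
{
	wr->next = NULL;
	wr->wr_cqe = cqe;
	wr->sg_list = sge;
	wr->num_sge = 1;
}

/* Compound literal: members that are not named are zero-initialized. */
void demo_init_by_literal(struct demo_recv_wr *wr, void *cqe, void *sge)
{
	*wr = (struct demo_recv_wr) {
		.wr_cqe = cqe,
		.sg_list = sge,
		.num_sge = 1,
	};
}
```

If both functions are applied to a structure that was previously filled with 0xff bytes, only the compound-literal version ends up with added_later == 0 and next == NULL.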

> +int rtrs_post_recv_empty(struct rtrs_con *con, struct ib_cqe *cqe)
> +{
> +	struct ib_recv_wr wr;
> +	const struct ib_recv_wr *bad_wr;
> +
> +	wr.next    = NULL;
> +	wr.wr_cqe  = cqe;
> +	wr.sg_list = NULL;
> +	wr.num_sge = 0;
> +
> +	return ib_post_recv(con->qp, &wr, &bad_wr);
> +}
> +EXPORT_SYMBOL_GPL(rtrs_post_recv_empty);

Same comment for this function.

> +int rtrs_post_recv_empty_x2(struct rtrs_con *con, struct ib_cqe *cqe)
> +{
> +	struct ib_recv_wr wr_arr[2], *wr;
> +	const struct ib_recv_wr *bad_wr;
> +	int i;
> +
> +	memset(wr_arr, 0, sizeof(wr_arr));
> +	for (i = 0; i < ARRAY_SIZE(wr_arr); i++) {
> +		wr = &wr_arr[i];
> +		wr->wr_cqe  = cqe;
> +		if (i)
> +			/* Chain backwards */
> +			wr->next = &wr_arr[i - 1];
> +	}
> +
> +	return ib_post_recv(con->qp, wr, &bad_wr);
> +}
> +EXPORT_SYMBOL_GPL(rtrs_post_recv_empty_x2);

I have not yet seen any other RDMA code that is similar to the above
function. A comment above this function that explains its purpose would
be more than welcome.

> +int rtrs_iu_post_send(struct rtrs_con *con, struct rtrs_iu *iu, size_t size,
> +		       struct ib_send_wr *head)
> +{
> +	struct rtrs_sess *sess = con->sess;
> +	struct ib_send_wr wr;
> +	const struct ib_send_wr *bad_wr;
> +	struct ib_sge list;
> +
> +	if ((WARN_ON(size == 0)))
> +		return -EINVAL;

No superfluous parentheses please.

> +	list.addr   = iu->dma_addr;
> +	list.length = size;
> +	list.lkey   = sess->dev->ib_pd->local_dma_lkey;
> +
> +	memset(&wr, 0, sizeof(wr));
> +	wr.next       = NULL;
> +	wr.wr_cqe     = &iu->cqe;
> +	wr.sg_list    = &list;
> +	wr.num_sge    = 1;
> +	wr.opcode     = IB_WR_SEND;
> +	wr.send_flags = IB_SEND_SIGNALED;

Has it been considered to use designated initializers instead of a
memset() followed by multiple assignments? Same question for
rtrs_iu_post_rdma_write_imm() and rtrs_post_rdma_write_imm_empty().

> +static int create_qp(struct rtrs_con *con, struct ib_pd *pd,
> +		     u16 wr_queue_size, u32 max_sge)
> +{
> +	struct ib_qp_init_attr init_attr = {NULL};
> +	struct rdma_cm_id *cm_id = con->cm_id;
> +	int ret;
> +
> +	init_attr.cap.max_send_wr = wr_queue_size;
> +	init_attr.cap.max_recv_wr = wr_queue_size;

What code is responsible for ensuring that neither max_send_wr nor
max_recv_wr exceeds the device limits? Please document this in a comment
above this function.

> +	init_attr.cap.max_recv_sge = 1;
> +	init_attr.event_handler = qp_event_handler;
> +	init_attr.qp_context = con;
> +#undef max_send_sge
> +	init_attr.cap.max_send_sge = max_sge;

Is the "undef max_send_sge" really necessary? If so, please add a
comment that explains why it is necessary.

> +static int rtrs_str_gid_to_sockaddr(const char *addr, size_t len,
> +				     short port, struct sockaddr_storage *dst)
> +{
> +	struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
> +	int ret;
> +
> +	/*
> +	 * We can use some of the I6 functions since GID is a valid
> +	 * IPv6 address format
> +	 */
> +	ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
> +	if (ret == 0)
> +		return -EINVAL;

What is "I6"?

Is the fourth argument to this function correct? From the comment above
in6_pton(): "@delim: the delimiter of the IPv6 address in @src, -1 means
no delimiter".

> +int sockaddr_to_str(const struct sockaddr *addr, char *buf, size_t len)
> +{
> +	int cnt;
> +
> +	switch (addr->sa_family) {
> +	case AF_IB:
> +		cnt = scnprintf(buf, len, "gid:%pI6",
> +			&((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
> +		return cnt;
> +	case AF_INET:
> +		cnt = scnprintf(buf, len, "ip:%pI4",
> +			&((struct sockaddr_in *)addr)->sin_addr);
> +		return cnt;
> +	case AF_INET6:
> +		cnt = scnprintf(buf, len, "ip:%pI6c",
> +			  &((struct sockaddr_in6 *)addr)->sin6_addr);
> +		return cnt;
> +	}
> +	cnt = scnprintf(buf, len, "<invalid address family>");
> +	pr_err("Invalid address family\n");
> +	return cnt;
> +}
> +EXPORT_SYMBOL(sockaddr_to_str);

Is the pr_err() statement in the above function useful? Will anyone be
able to figure out what is going on if the "Invalid address family"
string appears in the system log? Please consider changing that pr_err()
statement into a WARN_ON_ONCE() statement.

> +	ret = rtrs_str_to_sockaddr(str, len, port, addr->dst);
> +
> +	return ret;

Please change this into a single return statement.

> +EXPORT_SYMBOL(rtrs_addr_to_sockaddr);
> +
> +void rtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
> +			    struct rtrs_ib_dev_pool *pool)
> +{
> +	WARN_ON(pool->ops && (!pool->ops->alloc ^ !pool->ops->free));
> +	INIT_LIST_HEAD(&pool->list);
> +	mutex_init(&pool->mutex);
> +	pool->pd_flags = pd_flags;
> +}
> +EXPORT_SYMBOL(rtrs_ib_dev_pool_init);
> +
> +void rtrs_ib_dev_pool_deinit(struct rtrs_ib_dev_pool *pool)
> +{
> +	WARN_ON(!list_empty(&pool->list));
> +}
> +EXPORT_SYMBOL(rtrs_ib_dev_pool_deinit);

Since rtrs_ib_dev_pool_init() calls mutex_init(), should
rtrs_ib_dev_pool_deinit() call mutex_destroy()?

Thanks,

Bart.



* Re: [PATCH v6 05/25] rtrs: client: private header with client structs and functions
  2019-12-30 10:29 ` [PATCH v6 05/25] rtrs: client: private header with client structs and functions Jack Wang
@ 2019-12-30 22:51   ` Bart Van Assche
  2020-01-07 12:39     ` Jinpu Wang
  2019-12-30 23:03   ` Bart Van Assche
  1 sibling, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-30 22:51 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> + * InfiniBand Transport Layer

InfiniBand or RDMA?

> +static inline const char *rtrs_clt_state_str(enum rtrs_clt_state state)
> +{
> +	switch (state) {
> +	case RTRS_CLT_CONNECTING:
> +		return "RTRS_CLT_CONNECTING";
> +	case RTRS_CLT_CONNECTING_ERR:
> +		return "RTRS_CLT_CONNECTING_ERR";
> +	case RTRS_CLT_RECONNECTING:
> +		return "RTRS_CLT_RECONNECTING";
> +	case RTRS_CLT_CONNECTED:
> +		return "RTRS_CLT_CONNECTED";
> +	case RTRS_CLT_CLOSING:
> +		return "RTRS_CLT_CLOSING";
> +	case RTRS_CLT_CLOSED:
> +		return "RTRS_CLT_CLOSED";
> +	case RTRS_CLT_DEAD:
> +		return "RTRS_CLT_DEAD";
> +	default:
> +		return "UNKNOWN";
> +	}
> +}

This function is not in the hot path so it shouldn't be inline.

> +#define MIN_LOG_SG 2
> +#define MAX_LOG_SG 5
> +#define MAX_LIN_SG BIT(MIN_LOG_SG)
> +#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)

I think these constants deserve a comment that explains what their
meaning is.

> +/**
> + * rtrs_permit - permits the memory allocation for future RDMA operation
> + */
> +struct rtrs_permit {
> +	enum rtrs_clt_con_type con_type;
> +	unsigned int cpu_id;
> +	unsigned int mem_id;
> +	unsigned int mem_off;
> +};

The comment above this structure is confusing. Please make it more clear.

Thanks,

Bart.


* Re: [PATCH v6 05/25] rtrs: client: private header with client structs and functions
  2019-12-30 10:29 ` [PATCH v6 05/25] rtrs: client: private header with client structs and functions Jack Wang
  2019-12-30 22:51   ` Bart Van Assche
@ 2019-12-30 23:03   ` Bart Van Assche
  2020-01-07 12:39     ` Jinpu Wang
  1 sibling, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-30 23:03 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> +#define GET_PERMIT(clt, idx) ((clt)->permits + PERMIT_SIZE(clt) * idx)

Please surround 'idx' with parentheses.
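
A minimal userspace sketch of what goes wrong without the parentheses, assuming a caller passes an expression such as "i + 1" as the index. The demo_* structure and macros are hypothetical stand-ins; only the macro arithmetic mirrors the quoted line:

```c
#include <stddef.h>

/* Hypothetical stand-in for the client structure; only the arithmetic matters. */
struct demo_clt {
	char *permits;
	size_t permit_size;
};

#define DEMO_PERMIT_SIZE(clt) ((clt)->permit_size)

/* As posted: idx is not parenthesized. */
#define DEMO_GET_PERMIT_UNSAFE(clt, idx) \
	((clt)->permits + DEMO_PERMIT_SIZE(clt) * idx)

/* With the requested change. */
#define DEMO_GET_PERMIT(clt, idx) \
	((clt)->permits + DEMO_PERMIT_SIZE(clt) * (idx))

size_t demo_offset_unsafe(struct demo_clt *clt, int i)
{
	/* Expands to permits + size * i + 1: one byte past the start of entry i. */
	return (size_t)(DEMO_GET_PERMIT_UNSAFE(clt, i + 1) - clt->permits);
}

size_t demo_offset_safe(struct demo_clt *clt, int i)
{
	/* Expands to permits + size * (i + 1): the start of entry i + 1. */
	return (size_t)(DEMO_GET_PERMIT(clt, i + 1) - clt->permits);
}
```

With a permit size of 16 and i = 2, the unparenthesized macro yields byte offset 33 instead of the intended 48.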

Thanks,

Bart.


* Re: [PATCH v6 14/25] rtrs: a bit of documentation
  2019-12-30 10:29 ` [PATCH v6 14/25] rtrs: a bit of documentation Jack Wang
@ 2019-12-30 23:19   ` Bart Van Assche
  2020-01-07 14:48     ` Jinpu Wang
  2020-01-02 22:21   ` Bart Van Assche
  1 sibling, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-30 23:19 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang,
	rpenyaev, linux-kernel

On 2019-12-30 02:29, Jack Wang wrote:
> diff --git a/drivers/infiniband/ulp/rtrs/README b/drivers/infiniband/ulp/rtrs/README

Other kernel driver documentation exists under the Documentation/
directory. Should this README file perhaps be moved to a subdirectory of
the Documentation/ directory?

> +****************************
> +InfiniBand Transport (RTRS)
> +****************************

The abbreviation does not match the full title. Do you agree that this
is confusing?

> +RTRS is used by the RNBD (Infiniband Network Block Device) modules.

Is RNBD an RDMA or an InfiniBand network block device?

> +
> +==================
> +Transport protocol
> +==================
> +
> +Overview
> +--------
> +An established connection between a client and a server is called rtrs
> +session. A session is associated with a set of memory chunks reserved on the
> +server side for a given client for rdma transfer. A session
> +consists of multiple paths, each representing a separate physical link
> +between client and server. Those are used for load balancing and failover.
> +Each path consists of as many connections (QPs) as there are cpus on
> +the client.
> +
> +When processing an incoming rdma write or read request rtrs client uses memory

A quote from
https://linuxplumbersconf.org/event/4/contributions/367/attachments/331/555/LPC_2019_RMDA_MC_IBNBD_IBTRS_Upstreaming.pdf:
"Only RDMA writes with immediate". Has the wire protocol perhaps been
changed such that both RDMA reads and writes are used? I haven't found
any references to RDMA reads in the "IO path" section in this file. Did
I perhaps overlook something?

Thanks,

Bart.


* Re: [PATCH v6 06/25] rtrs: client: main functionality
  2019-12-30 10:29 ` [PATCH v6 06/25] rtrs: client: main functionality Jack Wang
@ 2019-12-30 23:53   ` Bart Van Assche
  2020-01-02 18:23     ` Jason Gunthorpe
  2020-01-03 14:30     ` Jinpu Wang
  0 siblings, 2 replies; 89+ messages in thread
From: Bart Van Assche @ 2019-12-30 23:53 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> + * InfiniBand Transport Layer

InfiniBand or RDMA?

> +MODULE_DESCRIPTION("RTRS Client");

Please spell out RTRS in full.

> +static const struct rtrs_ib_dev_pool_ops dev_pool_ops;

Can this forward declaration be avoided?

> +static struct rtrs_ib_dev_pool dev_pool = {
> +	.ops = &dev_pool_ops
> +};

Can this structure be declared 'const'?

> +static inline struct rtrs_permit *
> +__rtrs_get_permit(struct rtrs_clt *clt, enum rtrs_clt_con_type con_type)
> +{
> +	size_t max_depth = clt->queue_depth;
> +	struct rtrs_permit *permit;
> +	int cpu, bit;
> +
> +	cpu = get_cpu();
> +	do {
> +		bit = find_first_zero_bit(clt->permits_map, max_depth);
> +		if (unlikely(bit >= max_depth)) {
> +			put_cpu();
> +			return NULL;
> +		}
> +
> +	} while (unlikely(test_and_set_bit_lock(bit, clt->permits_map)));
> +	put_cpu();

Are the get_cpu() and put_cpu() calls around this loop useful? If not,
please remove these calls. Otherwise please add a comment that explains
the purpose of these calls.

An additional question: is it possible to replace the above loop with an
sbitmap_get() call?
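
For readers unfamiliar with the pattern, here is a userspace analogue of the quoted loop, written with C11 atomics instead of the kernel bitops. It is only a sketch under assumptions: the demo_* names, the fixed depth of 8 and the single-word bitmap are made up, and it says nothing about whether the kernel code needs preemption disabled for some other reason. It does show that the find-then-test-and-set retry loop itself does not depend on which CPU it runs on.

```c
#include <stdatomic.h>

#define DEMO_DEPTH 8	/* stand-in for clt->queue_depth */

static atomic_uint demo_permits_map;	/* one bit per permit, 0 means free */

/* Returns the index of the permit that was taken, or -1 if none is free. */
int demo_get_permit(void)
{
	unsigned int map, bit;

	for (;;) {
		map = atomic_load(&demo_permits_map);
		/* Rough equivalent of find_first_zero_bit(). */
		for (bit = 0; bit < DEMO_DEPTH; bit++)
			if (!(map & (1u << bit)))
				break;
		if (bit >= DEMO_DEPTH)
			return -1;
		/* Rough equivalent of test_and_set_bit_lock(): retry on a lost race. */
		if (!(atomic_fetch_or(&demo_permits_map, 1u << bit) & (1u << bit)))
			return (int)bit;
	}
}

/* Marks a previously taken permit as free again. */
void demo_put_permit(int bit)
{
	atomic_fetch_and(&demo_permits_map, ~(1u << bit));
}
```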

> +static void complete_rdma_req(struct rtrs_clt_io_req *req, int errno,
> +			      bool notify, bool can_wait)
> +{
> +	struct rtrs_clt_con *con = req->con;
> +	struct rtrs_clt_sess *sess;
> +	int err;
> +
> +	if (WARN_ON(!req->in_use))
> +		return;
> +	if (WARN_ON(!req->con))
> +		return;
> +	sess = to_clt_sess(con->c.sess);
> +
> +	if (req->sg_cnt) {
> +		if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
> +			/*
> +			 * We are here to invalidate RDMA read requests
> +			 * ourselves.  In normal scenario server should
> +			 * send INV for all requested RDMA reads, but
> +			 * we are here, thus two things could happen:
> +			 *
> +			 *    1.  this is failover, when errno != 0
> +			 *        and can_wait == 1,
> +			 *
> +			 *    2.  something totally bad happened and
> +			 *        server forgot to send INV, so we
> +			 *        should do that ourselves.
> +			 */

Please document in the protocol documentation when RDMA reads are used.

What does "server forgot to send INV" mean?

Additionally, if I remember correctly Jason considers it very important
that invalidation happens from the submitting context because otherwise
the RDMA retry mechanism can't work.

> +static void process_io_rsp(struct rtrs_clt_sess *sess, u32 msg_id,
> +			   s16 errno, bool w_inval)
> +{
> +	struct rtrs_clt_io_req *req;
> +
> +	if (WARN_ON(msg_id >= sess->queue_depth))
> +		return;
> +
> +	req = &sess->reqs[msg_id];
> +	/* Drop need_inv if server responsed with invalidation */
> +	req->need_inv &= !w_inval;
> +	complete_rdma_req(req, errno, true, false);
> +}

Please document the meaning of the "w_inval" argument. Please also fix
the spelling of "responsed".

Thanks,

Bart.


* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2019-12-30 10:29 ` [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers Jack Wang
  2019-12-30 19:48   ` Bart Van Assche
@ 2019-12-31  0:07   ` Bart Van Assche
  2020-01-03 13:48     ` Jinpu Wang
  1 sibling, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-31  0:07 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> +static inline u32 rtrs_to_io_rsp_imm(u32 msg_id, int errno, bool w_inval)
> +{
> +	enum rtrs_imm_type type;
> +	u32 payload;
> +
> +	/* 9 bits for errno, 19 bits for msg_id */
> +	payload = (abs(errno) & 0x1ff) << 19 | (msg_id & 0x7ffff);
> +	type = (w_inval ? RTRS_IO_RSP_W_INV_IMM : RTRS_IO_RSP_IMM);
> +
> +	return rtrs_to_imm(type, payload);
> +}
> +
> +static inline void rtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
> +{
> +	/* 9 bits for errno, 19 bits for msg_id */
> +	*msg_id = (payload & 0x7ffff);
> +	*errno = -(int)((payload >> 19) & 0x1ff);
> +}

The above comments mention that 19 bits are used for msg_id. The 0x7ffff
mask however has 23 bits set. Did I see that correctly? If so, does that
mean that the errno and msg_id bitfields overlap partially?
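
The bit layout in question can be checked with a small standalone sketch. Counting directly, 0x7ffff equals 2^19 - 1 and so has 19 bits set, while 0x1ff shifted left by 19 occupies bits 19 to 27. The code below mirrors the two quoted helpers in userspace; the demo_* names are assumptions, and "err" is used instead of "errno" because the latter is a macro in userspace C.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MSG_ID_BITS  19
#define DEMO_MSG_ID_MASK  0x7ffffu	/* (1 << 19) - 1 */
#define DEMO_ERRNO_MASK   0x1ffu	/* (1 << 9) - 1 */

/* Counts the bits set in a mask. */
unsigned int demo_popcount(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Mirrors the payload computation quoted from rtrs_to_io_rsp_imm(). */
uint32_t demo_pack(uint32_t msg_id, int err)
{
	return ((uint32_t)abs(err) & DEMO_ERRNO_MASK) << DEMO_MSG_ID_BITS |
	       (msg_id & DEMO_MSG_ID_MASK);
}

/* Mirrors rtrs_from_io_rsp_imm(). */
void demo_unpack(uint32_t payload, uint32_t *msg_id, int *err)
{
	*msg_id = payload & DEMO_MSG_ID_MASK;
	*err = -(int)((payload >> DEMO_MSG_ID_BITS) & DEMO_ERRNO_MASK);
}
```

In this sketch the two fields do not overlap: packing the largest msg_id (0x7ffff) together with an error of -5 and unpacking it again returns both values unchanged.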

Thanks,

Bart.


* Re: [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (24 preceding siblings ...)
  2019-12-30 10:29 ` [PATCH v6 25/25] MAINTAINERS: Add maintainers for RNBD/RTRS modules Jack Wang
@ 2019-12-31  0:11 ` Bart Van Assche
  2020-01-02  8:48   ` Jinpu Wang
  2019-12-31  2:39 ` Bart Van Assche
  26 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2019-12-31  0:11 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> here is V6 of the RTRS (former IBTRS) rdma transport library and the
> corresponding RNBD (former IBNBD) rdma network block device.

Please provide more information about the RTRS_IO_RSP_IMM and
RTRS_IO_RSP_W_INV_IMM server to client message types. Does one of these
message types perhaps mean that the receiver of the message is
responsible for invalidating the rkey associated with the RDMA transfer?

Thanks,

Bart.


* Re: [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device
  2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
                   ` (25 preceding siblings ...)
  2019-12-31  0:11 ` [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Bart Van Assche
@ 2019-12-31  2:39 ` Bart Van Assche
  2020-01-02  9:20   ` Jinpu Wang
  2020-01-02 18:28   ` Jason Gunthorpe
  26 siblings, 2 replies; 89+ messages in thread
From: Bart Van Assche @ 2019-12-31  2:39 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 2019-12-30 02:29, Jack Wang wrote:
> here is V6 of the RTRS (former IBTRS) rdma transport library and the
> corresponding RNBD (former IBNBD) rdma network block device.
> 
> Changelog since v5:
> 1 rebased to linux-5.5-rc4
> 2 fix typo in my email address in first patch
> 3 cleanup copyright as suggested by Leon Romanovsky
> 4 remove 2 redudant kobject_del in error path as suggested by Leon Romanovsky
> 5 add MAINTAINERS entries in alphabetical order as Gal Pressman suggested

Please always include the full changelog when posting a new version.
Every other Linux kernel patch series I have seen includes a full
changelog in version two and later versions of its cover letter.

Information about how this patch series has been tested would be
welcome. How big were the changes between v4 and v5 and how much testing
have these changes received? Was this patch series tested in the Ionos
data center or is it the out-of-tree version of these drivers that runs
in the Ionos data center?

Thanks,

Bart.


* Re: [PATCH v6 25/25] MAINTAINERS: Add maintainers for RNBD/RTRS modules
  2019-12-30 12:22   ` Gal Pressman
@ 2020-01-02  8:41     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-02  8:41 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Bart Van Assche,
	Leon Romanovsky, Doug Ledford, Danil Kipnis, rpenyaev

On Mon, Dec 30, 2019 at 1:22 PM Gal Pressman <galpress@amazon.com> wrote:
>
> On 30/12/2019 12:29, Jack Wang wrote:
> > From: Jack Wang <jinpu.wang@cloud.ionos.com>
> >
> > Danil and me will maintain RNBD/RTRS modules.
> >
> > Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
> > Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
> > ---
> >  MAINTAINERS | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index e09bd92a1e44..2ba370d8145d 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -14125,6 +14125,13 @@ F:   arch/riscv/
> >  K:   riscv
> >  N:   riscv
> >
> > +RNBD BLOCK DRIVERS
> > +M:   Danil Kipnis <danil.kipnis@cloud.ionos.com>
> > +M:   Jack Wang <jinpu.wang@cloud.ionos.com>
> > +L:   linux-block@vger.kernel.org
> > +S:   Maintained
> > +F:   drivers/block/rnbd/
> > +
> >  ROCCAT DRIVERS
> >  M:   Stefan Achatz <erazor_de@users.sourceforge.net>
> >  W:   http://sourceforge.net/projects/roccat/
> > @@ -14192,6 +14199,13 @@ F:   include/net/rose.h
> >  F:   include/uapi/linux/rose.h
> >  F:   net/rose/
> >
> > +RTRS TRANSPORT DRIVERS
> > +M:   Danil Kipnis <danil.kipnis@cloud.ionos.com>
> > +M:   Jack Wang <jinpu.wang@cloud.ionos.com>
> > +L:   linux-rdma@vger.kernel.org
> > +S:   Maintained
> > +F:   drivers/infiniband/ulp/rtrs/
> > +
> >  RTL2830 MEDIA DRIVER
> >  M:   Antti Palosaari <crope@iki.fi>
> >  L:   linux-media@vger.kernel.org
>
> RTRS should be after RTL, right :)?
Yes, that's right. -.-
Thanks for double-checking, will be fixed.


* Re: [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device
  2019-12-31  0:11 ` [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Bart Van Assche
@ 2020-01-02  8:48   ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-02  8:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Dec 31, 2019 at 1:11 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > here is V6 of the RTRS (former IBTRS) rdma transport library and the
> > corresponding RNBD (former IBNBD) rdma network block device.
>
> Please provide more information about the RTRS_IO_RSP_IMM and
> RTRS_IO_RSP_W_INV_IMM server to client message types. Does one of these
> message types perhaps mean that the receiver of the message is
> responsible for invalidating the rkey associated with the RDMA transfer?
>
> Thanks,
>
> Bart.
Hi Bart,

You're right, RTRS_IO_RSP_W_INV_IMM means that the client, upon
receiving the message, should invalidate the rkey associated with the
RDMA transfer.

We will document it in the PROTOCOL part of the README.

Thanks,
Jack


* Re: [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device
  2019-12-31  2:39 ` Bart Van Assche
@ 2020-01-02  9:20   ` Jinpu Wang
  2020-01-02 18:28   ` Jason Gunthorpe
  1 sibling, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-02  9:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Dec 31, 2019 at 3:39 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > here is V6 of the RTRS (former IBTRS) rdma transport library and the
> > corresponding RNBD (former IBNBD) rdma network block device.
> >
> > Changelog since v5:
> > 1 rebased to linux-5.5-rc4
> > 2 fix typo in my email address in first patch
> > 3 cleanup copyright as suggested by Leon Romanovsky
> > 4 remove 2 redudant kobject_del in error path as suggested by Leon Romanovsky
> > 5 add MAINTAINERS entries in alphabetical order as Gal Pressman suggested
>
> Please always include the full changelog when posting a new version.
> Every other Linux kernel patch series I have seen includes a full
> changelog in version two and later versions of its cover letter.
Sorry, it was my mistake, will include the full changelog next time.
>
> Information about how this patch series has been tested would be
> welcome. How big were the changes between v4 and v5 and how much testing
> have these changes received? Was this patch series tested in the Ionos
> data center or is it the out-of-tree version of these drivers that runs
> in the Ionos data center?
As mentioned in the v5 cover letter, the changes between v4 and v5 were:
"'
 Main changes are the following:
1. Fix the security problem pointed out by Jason
2. Implement code-style/readability/API/etc suggestions by Bart van Assche
3. Rename IBTRS and IBNBD to RTRS and RNBD accordingly
4. Fileio mode support in rnbd-srv has been removed.

The main functional change is a fix for the security problem pointed out by
Jason and discussed both on the mailing list and during the last LPC
RDMA MC 2019.
On the server side we now invalidate in RTRS each rdma buffer before we hand it
over to RNBD server and in turn to the block layer. A new rkey is generated and
registered for the buffer after it returns back from the block layer and RNBD
server. The new rkey is sent back to the client along with the IO result.
The procedure is the default behaviour of the driver. This invalidation and
registration on each IO causes a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N), if they understand and can accept the risk of a malicious
client being able to corrupt memory of a server it is connected to. This might
be a reasonable option in a scenario where all the clients and all the servers
are located within a secure datacenter.

Huge thanks to Bart van Assche for the very detailed review of both RNBD and
RTRS. These included suggestions for style fixes, better readability and
documentation, code simplifications, eliminating usage of deprecated APIs,
too many to name.

The transport library and the network block device using it have been renamed to
RTRS and RNBD accordingly in order to reflect the fact that they are based on
the rdma subsystem and not bound to InfiniBand only.

Fileio mode support in rnbd-server is not very efficient, as pointed out by Bart,
and a loop device can be used in between if needed, hence we just
removed the fileio mode support.
"'
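The per-IO invalidation described in the quoted cover letter can be
modelled in a few lines of userspace C. This is only a toy sketch under
stated assumptions: struct toy_rbuf and the toy_* helpers are
hypothetical names, the rkey is just a counter, and nothing here is RTRS
or verbs code. It shows why a stale rkey stops working once
always_invalidate is on, and keeps working when it is off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one server-side RDMA buffer; not an RTRS structure. */
struct toy_rbuf {
	uint32_t rkey;	/* key a client must present to access the buffer */
	bool valid;	/* false while the buffer is owned by the block layer */
};

/* Would an RDMA access that presents @rkey be accepted right now? */
static bool toy_rdma_access_ok(const struct toy_rbuf *buf, uint32_t rkey)
{
	return buf->valid && buf->rkey == rkey;
}

/*
 * Model one IO on the server. Returns the rkey the client has to use for
 * its next request to this buffer.
 */
static uint32_t toy_server_handle_io(struct toy_rbuf *buf,
				     bool always_invalidate)
{
	assert(buf->valid);
	if (always_invalidate) {
		buf->valid = false;	/* invalidate before the block layer sees it */
		/* ... the block layer would process the IO here ... */
		buf->rkey++;		/* stand-in for registering a fresh rkey */
		buf->valid = true;
	}
	return buf->rkey;
}
```

With always_invalidate set, the value returned by toy_server_handle_io()
differs from the rkey the client used before, so a client replaying the
old key is rejected by toy_rdma_access_ok(). With it cleared, the old
key stays usable, which is the risk the cover letter describes.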
Regarding testing, all the changes have been tested with our
regression tests in our staging environment in the IONOS data center.
That is around 200 test cases, run for both the always_invalidate=N and
always_invalidate=Y configurations.

I will mention it in the cover letter next time.

Thanks for your comments, Bart.
>
> Thanks,
>
> Bart.


* Re: [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections
  2019-12-30 19:25   ` Bart Van Assche
@ 2020-01-02 13:35     ` Jinpu Wang
  2020-01-02 16:36       ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Jinpu Wang @ 2020-01-02 13:35 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Mon, Dec 30, 2019 at 8:25 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > +/*
> > + * Here goes RTRS client API
> > + */
>
> A comment that explains what the abbreviation "RTRS" stands for would be
> welcome here. Additionally, I think that "Here goes" can be left out.
will do.
>
> > +/**
> > + * rtrs_clt_open() - Open a session to an RTRS server
> > + * @priv: User supplied private data.
> > + * @link_ev: Event notification for connection state changes
>
> Please mention that @link_ev is a callback function.
Ok.
>
> > + *   @priv: User supplied data that was passed to rtrs_clt_open()
> > + *   @ev: Occurred event
>
> Is this patch series W=1 clean? @link_ev arguments should be documented
> above the link_clt_ev_fn typedef.
We will make sure it's W=1 clean in the next round.
>
> > + * @path_cnt: Number of elemnts in the @paths array
>
> elemnts -> elements?
will fix.
>
> > + * Starts session establishment with the rtrs_server. The function can block
> > + * up to ~2000ms until it returns.
>
> until -> before?
will fix
>
> > +struct rtrs_clt *rtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
> > +                              const char *sessname,
> > +                              const struct rtrs_addr *paths,
> > +                              size_t path_cnt, short port,
> > +                              size_t pdu_sz, u8 reconnect_delay_sec,
> > +                              u16 max_segments,
> > +                              s16 max_reconnect_attempts);
>
> Since the range for port numbers is 1..65535, please change "short port"
> into "u16 port".
ok.
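A minimal userspace sketch of why the type matters, assuming a platform
where short is 16 bits wide (true for the usual Linux targets). The
store_port_*() helpers are hypothetical and not part of RTRS; uint16_t
stands in for the kernel's u16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep a port number in a signed short. */
static short store_port_short(unsigned int port)
{
	assert(port >= 1 && port <= 65535);
	/* Out-of-range conversion: implementation-defined result. */
	return (short)port;
}

/* Hypothetical helper: keep a port number in an unsigned 16-bit type. */
static uint16_t store_port_u16(unsigned int port)
{
	assert(port >= 1 && port <= 65535);
	/* Every valid port number fits, so the value is preserved. */
	return (uint16_t)port;
}
```

Ports up to 32767 survive in both variants. A port such as 50000 is
preserved by store_port_u16() but not by store_port_short(), where it
typically comes back as a negative number.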
>
> > +/**
> > + * enum rtrs_clt_con_type() type of ib connection to use with a given permit
>
> What is a "permit"?
Would using rtrs_permit sound better?
>
> > + * @vec:     Message that is send to server together with the request.
>
> send -> sent?
right.
>
> > + *           Sum of len of all @vec elements limited to <= IO_MSG_SIZE.
> > + *           Since the msg is copied internally it can be allocated on stack.
> > + * @nr:              Number of elements in @vec.
> > + * @len:     length of data send to/from server
>
> send -> sent?
right.
>
> > +/**
> > + * link_ev_fn():     Events about connective state changes
>
> connective -> connection?
connectivity I think, will fix.
>
> > +/**
> > + * rtrs_srv_open() - open RTRS server context
> > + * @ops:             callback functions
> > + *
> > + * Creates server context with specified callbacks.
> > + *
> > + * Return a valid pointer on success otherwise PTR_ERR.
> > + */
> > +struct rtrs_srv_ctx *rtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
> > +                                  unsigned int port);
>
> Is this patch series W=1 clean? The documented argument does not match
> the actual argument list.
As replied above, we will make sure it's W=1 clean when sending the next round.
>
> Bart.
Thanks, Bart


* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2019-12-30 19:48   ` Bart Van Assche
@ 2020-01-02 15:27     ` Jinpu Wang
  2020-01-02 17:00       ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Jinpu Wang @ 2020-01-02 15:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Mon, Dec 30, 2019 at 8:48 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > + * InfiniBand Transport Layer
>
> Is RTRS an InfiniBand or an RDMA transport layer?
The latter, will fix.
>
> > +#define rtrs_prefix(obj) (obj->sessname)
>
> Is it really worth it to introduce a macro for accessing a single member
> of a single pointer?
maybe not, will remove.
>
> > + * InfiniBand Transport Layer
>
> Same question here: is RTRS an InfiniBand or an RDMA transport layer?
will fix.

>
> > +enum {
> > +     SERVICE_CON_QUEUE_DEPTH = 512,
>
> What is a service connection?
s/SERVICE_CON_QUEUE_DEPTH/CON_QUEUE_DEPTH/g, do you think
CON_QUEUE_DEPTH is better or just QUEUE_DEPTH?
>
> > +     /*
> > +      * With the current size of the tag allocated on the client, 4K
> > +      * is the maximum number of tags we can allocate.  This number is
> > +      * also used on the client to allocate the IU for the user connection
> > +      * to receive the RDMA addresses from the server.
> > +      */
>
> What does the word 'tag' mean in the context of the RTRS protocol?
should be struct rtrs_permit, will fix.
>
> > +struct rtrs_ib_dev;
>
> What does the "rtrs_ib_dev" data structure represent? Additionally, I
> think it's confusing that a single name has an "r" that refers to "RDMA"
> and "ib" that refers to InfiniBand.
Naming is hard. It's a structure that mainly caches a struct ib_device
pointer and an ib_pd pointer.
More info can be found below; Roman did try to push it upstream,
see the comment below.
>
> > +struct rtrs_ib_dev_pool {
> > +     struct mutex            mutex;
> > +     struct list_head        list;
> > +     enum ib_pd_flags        pd_flags;
> > +     const struct rtrs_ib_dev_pool_ops *ops;
> > +};
>
> What is the purpose of an rtrs_ib_dev_pool and what does it contain?
The idea was documented in the patchset here:
https://www.spinics.net/lists/linux-rdma/msg64025.html
"'
This is an attempt to make a device pool API out of a common code,
which caches pair of ib_device and ib_pd pointers. I found 4 places,
where this common functionality can be replaced by some lib calls:
nvme, nvmet, iser and isert. Total deduplication gain in loc is not
quite significant, but eventually new ULP IB code can also require
the same device/pd pair cache, e.g. in our IBTRS module the same
code has to be repeated again, which was observed by Sagi and he
suggested to make a common helper function instead of producing
another copy.
'''

>
> > +struct rtrs_iu {
>
> A comment that explains what the "iu" abbreviation stands for would be
> welcome.
ok.
>
> > +/**
> > + * enum rtrs_msg_types - RTRS message types.
> > + * @RTRS_MSG_INFO_REQ:               Client additional info request to the server
> > + * @RTRS_MSG_INFO_RSP:               Server additional info response to the client
> > + * @RTRS_MSG_WRITE:          Client writes data per RDMA to server
> > + * @RTRS_MSG_READ:           Client requests data transfer from server
> > + * @RTRS_MSG_RKEY_RSP:               Server refreshed rkey for rbuf
> > + */
>
> What is "additional info" in this context?
We have a bit more documentation in rtrs/README in patch 14,
"'
3. After all connections of a path are established client sends to server the
RTRS_MSG_INFO_REQ message, containing the name of the session. This message
requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.
"'
>
> > +/**
> > + * struct rtrs_msg_conn_req - Client connection request to the server
> > + * @magic:      RTRS magic
> > + * @version:    RTRS protocol version
> > + * @cid:        Current connection id
> > + * @cid_num:    Number of connections per session
> > + * @recon_cnt:          Reconnections counter
> > + * @sess_uuid:          UUID of a session (path)
> > + * @paths_uuid:         UUID of a group of sessions (paths)
> > + *
> > + * NOTE: max size 56 bytes, see man rdma_connect().
> > + */
> > +struct rtrs_msg_conn_req {
> > +     u8              __cma_version; /* Is set to 0 by cma.c in case of
> > +                                     * AF_IB, do not touch that.
> > +                                     */
> > +     u8              __ip_version;  /* On sender side that should be
> > +                                     * set to 0, or cma_save_ip_info()
> > +                                     * extract garbage and will fail.
> > +                                     */
>
> The above two fields and the comments next to it look suspicious to me.
> Does RTRS perhaps try to generate CMA-formatted messages without using
> the CMA to format these messages?
The problem is that cma_format_hdr() overwrites the first byte for AF_IB:
https://www.spinics.net/lists/linux-rdma/msg22397.html

No one has fixed the problem since then.

>
> > +     u8              reserved[12];
>
> Please leave out the reserved data. If future versions of the protocol
> would need any of these bytes it is easy to add more data to this structure.
Sorry, we can't do that. As I explained in the past, we have code
running in production, and there are checks that expect the size of
the message to stay the same. Changing the size of the control message
would make the future transition to the upstream version a lot harder.
>
> > +/**
> > + * struct rtrs_msg_conn_rsp - Server connection response to the client
> > + * @magic:      RTRS magic
> > + * @version:    RTRS protocol version
> > + * @errno:      If rdma_accept() then 0, if rdma_reject() indicates error
> > + * @queue_depth:   max inflight messages (queue-depth) in this session
> > + * @max_io_size:   max io size server supports
> > + * @max_hdr_size:  max msg header size server supports
> > + *
> > + * NOTE: size is 56 bytes, max possible is 136 bytes, see man rdma_accept().
> > + */
> > +struct rtrs_msg_conn_rsp {
> > +     __le16          magic;
> > +     __le16          version;
> > +     __le16          errno;
> > +     __le16          queue_depth;
> > +     __le32          max_io_size;
> > +     __le32          max_hdr_size;
> > +     __le32          flags;
> > +     u8              reserved[36];
> > +};
>
> Same comment here: please leave out the "reserved[]" array. Sending a
> bunch of zero-bytes at the end of a message over the wire is not useful.
same here.
>
> > +static inline void rtrs_from_imm(u32 imm, u32 *type, u32 *payload)
> > +{
> > +     *payload = (imm & MAX_IMM_PAYL_MASK);
> > +     *type = (imm >> MAX_IMM_PAYL_BITS);
> > +}
>
> Please do not use parentheses when not necessary. Such superfluous
> parentheses namely hurt readability of the code.
ok, will remove.
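For reference, a minimal userspace sketch of the type/payload split that
rtrs_from_imm() performs on the 32-bit immediate value, written without
the extra parentheses. The 28-bit payload width is an assumption
inferred from the "9 bits for errno, 19 bits for msg_id" comment quoted
further down; the TOY_* constants and toy_* helpers are hypothetical
names, not the RTRS definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 4 bits of message type above 28 bits of payload. */
#define TOY_IMM_PAYL_BITS 28u
#define TOY_IMM_PAYL_MASK ((1u << TOY_IMM_PAYL_BITS) - 1)

/* Combine a message type and a payload into one immediate value. */
static uint32_t toy_to_imm(uint32_t type, uint32_t payload)
{
	assert(type < 16);
	return type << TOY_IMM_PAYL_BITS | (payload & TOY_IMM_PAYL_MASK);
}

/* Extract the message type from the upper bits. */
static uint32_t toy_imm_type(uint32_t imm)
{
	return imm >> TOY_IMM_PAYL_BITS;
}

/* Extract the payload from the lower bits. */
static uint32_t toy_imm_payload(uint32_t imm)
{
	return imm & TOY_IMM_PAYL_MASK;
}
```

Packing a type and a payload with toy_to_imm() and reading them back
with toy_imm_type() and toy_imm_payload() returns the original values as
long as the payload fits into the assumed 28 bits.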
>
> > +     type = (w_inval ? RTRS_IO_RSP_W_INV_IMM : RTRS_IO_RSP_IMM);
>
> Same comment here: I think the parentheses can be left out from the
> above statement.
ok, will remove
>
> > +static inline void rtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
> > +{
> > +     /* 9 bits for errno, 19 bits for msg_id */
> > +     *msg_id = (payload & 0x7ffff);
>
> Are the parentheses in the above expression necessary?
will remove.
>
> > +     *errno = -(int)((payload >> 19) & 0x1ff);
>
> Is the '(int)' cast useful in the above expression? Can it be left out?
I think it's necessary, and it makes it clearer that errno is a negative
int value, doesn't it?

>
> > +#define STAT_ATTR(type, stat, print, reset)                          \
> > +STAT_STORE_FUNC(type, stat, reset)                                   \
> > +STAT_SHOW_FUNC(type, stat, print)                                    \
> > +static struct kobj_attribute stat##_attr =                           \
> > +             __ATTR(stat, 0644,                                      \
> > +                    stat##_show,                                     \
> > +                    stat##_store)
>
> Is the above use of __ATTR() perhaps an open-coded version of __ATTR_RW()?
right, will use __ATTR_RW() instead.
>
> Thanks,
>
> Bart.
Thanks Bart!


* Re: [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections
  2020-01-02 13:35     ` Jinpu Wang
@ 2020-01-02 16:36       ` Bart Van Assche
  2020-01-02 16:47         ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 16:36 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On 1/2/20 5:35 AM, Jinpu Wang wrote:
> On Mon, Dec 30, 2019 at 8:25 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>> +/**
>>> + * enum rtrs_clt_con_type() type of ib connection to use with a given permit
>>
>> What is a "permit"?
> Does use rtrs_permit sound better?

I think keeping the word "permit" is fine. How about adding a comment 
above rtrs_permit that explains more clearly what the role of that data 
structure is? This is what I found in rtrs-clt.h:

/**
  * rtrs_permit - permits the memory allocation for future RDMA operation
  */
struct rtrs_permit {
         enum rtrs_clt_con_type con_type;
         unsigned int cpu_id;
         unsigned int mem_id;
         unsigned int mem_off;
};

Thanks,

Bart.


* Re: [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections
  2020-01-02 16:36       ` Bart Van Assche
@ 2020-01-02 16:47         ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-02 16:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 5:36 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 1/2/20 5:35 AM, Jinpu Wang wrote:
> > On Mon, Dec 30, 2019 at 8:25 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>> +/**
> >>> + * enum rtrs_clt_con_type() type of ib connection to use with a given permit
> >>
> >> What is a "permit"?
> > Does use rtrs_permit sound better?
>
> I think keeping the word "permit" is fine. How about adding a comment
> above rtrs_permit that explains more clearly what the role of that data
> structure is? This is what I found in rtrs-clt.h:
>
> /**
>   * rtrs_permit - permits the memory allocation for future RDMA operation
>   */
> struct rtrs_permit {
>          enum rtrs_clt_con_type con_type;
>          unsigned int cpu_id;
>          unsigned int mem_id;
>          unsigned int mem_off;
> };
>
> Thanks,
>
> Bart.
Ok, will do.
Thanks.


* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2020-01-02 15:27     ` Jinpu Wang
@ 2020-01-02 17:00       ` Bart Van Assche
  2020-01-02 18:26         ` Jason Gunthorpe
  2020-01-03 12:27         ` Jinpu Wang
  0 siblings, 2 replies; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 17:00 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On 1/2/20 7:27 AM, Jinpu Wang wrote:
> On Mon, Dec 30, 2019 at 8:48 PM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 2019-12-30 02:29, Jack Wang wrote:
>>> +enum {
>>> +     SERVICE_CON_QUEUE_DEPTH = 512,
>>
>> What is a service connection?
> s/SERVICE_CON_QUEUE_DEPTH/CON_QUEUE_DEPTH/g, do you think
> CON_QUEUE_DEPTH is better or just QUEUE_DEPTH?

The name of the constant is fine, but what I meant is the following: has 
it been documented anywhere what the role of a "service connection" is?

>>> +struct rtrs_ib_dev_pool {
>>> +     struct mutex            mutex;
>>> +     struct list_head        list;
>>> +     enum ib_pd_flags        pd_flags;
>>> +     const struct rtrs_ib_dev_pool_ops *ops;
>>> +};
>>
>> What is the purpose of an rtrs_ib_dev_pool and what does it contain?
> The idea was documented in the patchset here:
> https://www.spinics.net/lists/linux-rdma/msg64025.html
> "'
> This is an attempt to make a device pool API out of a common code,
> which caches pair of ib_device and ib_pd pointers. I found 4 places,
> where this common functionality can be replaced by some lib calls:
> nvme, nvmet, iser and isert. Total deduplication gain in loc is not
> quite significant, but eventually new ULP IB code can also require
> the same device/pd pair cache, e.g. in our IBTRS module the same
> code has to be repeated again, which was observed by Sagi and he
> suggested to make a common helper function instead of producing
> another copy.
> '''

The word "pool" suggests ownership. Since struct rtrs_ib_dev_pool owns 
protection domains instead of RDMA devices, how about renaming that data 
structure into rtrs_pd_per_rdma_dev, rtrs_rdma_dev_pd or something 
similar? How about adding a comment like the following above that data 
structure?

/*
  * Data structure used to associate one protection domain (PD) with each
  * RDMA device.
  */
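
The "one PD per RDMA device" association can be pictured with a small
userspace sketch. Everything here is hypothetical: the toy_* types only
stand in for ib_device and ib_pd, the list is a plain singly linked list,
and the mutex that the quoted structure carries is left out, so this is
not the RTRS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_dev { int id; };			/* stand-in for ib_device */
struct toy_pd { const struct toy_dev *dev; };	/* stand-in for ib_pd */

/* One cached entry: a device together with the single PD created for it. */
struct toy_dev_pd {
	const struct toy_dev *dev;
	struct toy_pd pd;
	unsigned int refs;
	struct toy_dev_pd *next;
};

struct toy_dev_pd_pool {
	struct toy_dev_pd *head;
};

/* Return the entry for @dev, creating it (and its PD) on first use. */
static struct toy_dev_pd *toy_pool_get(struct toy_dev_pd_pool *pool,
				       const struct toy_dev *dev)
{
	struct toy_dev_pd *e;

	for (e = pool->head; e; e = e->next) {
		if (e->dev == dev) {
			e->refs++;
			return e;
		}
	}
	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->dev = dev;
	e->pd.dev = dev;	/* "allocate" the PD for this device */
	e->refs = 1;
	e->next = pool->head;
	pool->head = e;
	return e;
}

/* Drop one reference; free the entry when the last user is gone. */
static void toy_pool_put(struct toy_dev_pd_pool *pool, struct toy_dev_pd *e)
{
	struct toy_dev_pd **pp;

	assert(e->refs > 0);
	if (--e->refs)
		return;
	for (pp = &pool->head; *pp; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			break;
		}
	}
	free(e);
}
```

Two toy_pool_get() calls for the same device return the same entry and
therefore the same PD, while a different device gets its own entry.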

>>> +/**
>>> + * struct rtrs_msg_conn_req - Client connection request to the server
>>> + * @magic:      RTRS magic
>>> + * @version:    RTRS protocol version
>>> + * @cid:        Current connection id
>>> + * @cid_num:    Number of connections per session
>>> + * @recon_cnt:          Reconnections counter
>>> + * @sess_uuid:          UUID of a session (path)
>>> + * @paths_uuid:         UUID of a group of sessions (paths)
>>> + *
>>> + * NOTE: max size 56 bytes, see man rdma_connect().
>>> + */
>>> +struct rtrs_msg_conn_req {
>>> +     u8              __cma_version; /* Is set to 0 by cma.c in case of
>>> +                                     * AF_IB, do not touch that.
>>> +                                     */
>>> +     u8              __ip_version;  /* On sender side that should be
>>> +                                     * set to 0, or cma_save_ip_info()
>>> +                                     * extract garbage and will fail.
>>> +                                     */
>>
>> The above two fields and the comments next to it look suspicious to me.
>> Does RTRS perhaps try to generate CMA-formatted messages without using
>> the CMA to format these messages?
> The problem is in cma_format_hdr over-writes the first byte for AF_IB
> https://www.spinics.net/lists/linux-rdma/msg22397.html
> 
> No one fixes the problem since then.

How about adding that URL to the comment block above struct 
rtrs_msg_conn_req?

>>
>>> +     *errno = -(int)((payload >> 19) & 0x1ff);
>>
>> Is the '(int)' cast useful in the above expression? Can it be left out?
> I think it's necessary, and make it more clear errno is a negative int
> value, isn't it?

According to the C standard operations on unsigned integers "wrap 
around" so removing the cast should be safe. Anyway, this is not 
something I consider important.
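
A minimal userspace sketch of the two variants, assuming the layout from
the quoted comment (9 bits for the error code above 19 bits for msg_id).
The toy_* names are hypothetical, and "err" is used instead of "errno"
because the latter is a macro in userspace C. Without the cast the
negation happens in unsigned arithmetic and wraps; converting that
result back to int is implementation-defined in older C standards but
yields the same negative value with gcc and clang.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a non-positive error code and a message id into a 28-bit payload. */
static uint32_t toy_pack_rsp(uint32_t msg_id, int err)
{
	assert(err <= 0 && err > -512);
	return ((uint32_t)(-err) & 0x1ff) << 19 | (msg_id & 0x7ffff);
}

/* Variant from the patch: cast to int first, then negate. */
static int toy_err_with_cast(uint32_t payload)
{
	return -(int)(payload >> 19 & 0x1ff);
}

/* Variant without the cast: unsigned negation wraps, then converts to int. */
static int toy_err_without_cast(uint32_t payload)
{
	return -(payload >> 19 & 0x1ff);
}

/* The message id lives in the low 19 bits. */
static uint32_t toy_msg_id(uint32_t payload)
{
	return payload & 0x7ffff;
}
```

On such compilers both toy_err_with_cast() and toy_err_without_cast()
return the error code that was packed, so the cast mostly documents
intent.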

Thanks,

Bart.


* Re: [PATCH v6 06/25] rtrs: client: main functionality
  2019-12-30 23:53   ` Bart Van Assche
@ 2020-01-02 18:23     ` Jason Gunthorpe
  2020-01-03 14:30     ` Jinpu Wang
  1 sibling, 0 replies; 89+ messages in thread
From: Jason Gunthorpe @ 2020-01-02 18:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, axboe, hch, sagi, leon,
	dledford, danil.kipnis, jinpu.wang, rpenyaev

On Mon, Dec 30, 2019 at 03:53:10PM -0800, Bart Van Assche wrote:
> > +	if (req->sg_cnt) {
> > +		if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
> > +			/*
> > +			 * We are here to invalidate RDMA read requests
> > +			 * ourselves.  In normal scenario server should
> > +			 * send INV for all requested RDMA reads, but
> > +			 * we are here, thus two things could happen:
> > +			 *
> > +			 *    1.  this is failover, when errno != 0
> > +			 *        and can_wait == 1,
> > +			 *
> > +			 *    2.  something totally bad happened and
> > +			 *        server forgot to send INV, so we
> > +			 *        should do that ourselves.
> > +			 */
> 
> Please document in the protocol documentation when RDMA reads are used.
> 
> What does "server forgot to send INV" mean?
> 
> Additionally, if I remember correctly Jason considers it very important
> that invalidation happens from the submitting context because otherwise
> the RDMA retry mechanism can't work.

I think my point has usually been that you can't use completions on the
RQ to deduce the state of the SQ.

But if you are doing inv by posting it on the same SQ then things will
get ordered OK as the HW shouldn't progress the INV until any work
touching that rkey is also concluded.

Jason


* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2020-01-02 17:00       ` Bart Van Assche
@ 2020-01-02 18:26         ` Jason Gunthorpe
  2020-01-03 12:31           ` Jinpu Wang
  2020-01-03 12:27         ` Jinpu Wang
  1 sibling, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-01-02 18:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jinpu Wang, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 02, 2020 at 09:00:53AM -0800, Bart Van Assche wrote:

> > > > +/**
> > > > + * struct rtrs_msg_conn_req - Client connection request to the server
> > > > + * @magic:      RTRS magic
> > > > + * @version:    RTRS protocol version
> > > > + * @cid:        Current connection id
> > > > + * @cid_num:    Number of connections per session
> > > > + * @recon_cnt:          Reconnections counter
> > > > + * @sess_uuid:          UUID of a session (path)
> > > > + * @paths_uuid:         UUID of a group of sessions (paths)
> > > > + *
> > > > + * NOTE: max size 56 bytes, see man rdma_connect().
> > > > + */
> > > > +struct rtrs_msg_conn_req {
> > > > +     u8              __cma_version; /* Is set to 0 by cma.c in case of
> > > > +                                     * AF_IB, do not touch that.
> > > > +                                     */
> > > > +     u8              __ip_version;  /* On sender side that should be
> > > > +                                     * set to 0, or cma_save_ip_info()
> > > > +                                     * extract garbage and will fail.
> > > > +                                     */
> > > 
> > > The above two fields and the comments next to it look suspicious to me.
> > > Does RTRS perhaps try to generate CMA-formatted messages without using
> > > the CMA to format these messages?
> > The problem is in cma_format_hdr over-writes the first byte for AF_IB
> > https://www.spinics.net/lists/linux-rdma/msg22397.html
> > 
> > No one fixes the problem since then.
> 
> How about adding that URL to the comment block above struct
> rtrs_msg_conn_req?

Or just fixing whatever the problem is..

Jason


* Re: [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device
  2019-12-31  2:39 ` Bart Van Assche
  2020-01-02  9:20   ` Jinpu Wang
@ 2020-01-02 18:28   ` Jason Gunthorpe
  2020-01-03 12:34     ` Jinpu Wang
  1 sibling, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-01-02 18:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, axboe, hch, sagi, leon,
	dledford, danil.kipnis, jinpu.wang, rpenyaev

On Mon, Dec 30, 2019 at 06:39:00PM -0800, Bart Van Assche wrote:
> On 2019-12-30 02:29, Jack Wang wrote:
> > here is V6 of the RTRS (former IBTRS) rdma transport library and the
> > corresponding RNBD (former IBNBD) rdma network block device.
> > 
> > Changelog since v5:
> > 1 rebased to linux-5.5-rc4
> > 2 fix typo in my email address in first patch
> > 3 cleanup copyright as suggested by Leon Romanovsky
> > 4 remove 2 redundant kobject_del in error path as suggested by Leon Romanovsky
> > 5 add MAINTAINERS entries in alphabetical order as Gal Pressman suggested
> 
> Please always include the full changelog when posting a new version.
> Every other Linux kernel patch series I have seen includes a full
> changelog in version two and later versions of its cover letter.

We now also like it if you include URLs to lore.kernel.org for the
prior submissions.

Jason

* Re: [PATCH v6 07/25] rtrs: client: statistics functions
  2019-12-30 10:29 ` [PATCH v6 07/25] rtrs: client: statistics functions Jack Wang
@ 2020-01-02 21:07   ` Bart Van Assche
  2020-01-03 14:39     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 21:07 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> From: Jack Wang <jinpu.wang@cloud.ionos.com>
> 
> This introduces set of functions used on client side to account
> statistics of RDMA data sent/received, amount of IOs inflight,
> latency, cpu migrations, etc.  Almost all statistics is collected
                                                        ^^
                                                        are?
> using percpu variables.
> [ ... ]
> +static inline int rtrs_clt_ms_to_id(unsigned long ms)
> +{
> +	int id = ms ? ilog2(ms) - MIN_LOG_LAT + 1 : 0;
> +
> +	return clamp(id, 0, LOG_LAT_SZ - 1);
> +}

I think it is unusual to call the returned value an "id" in this 
context. How about changing "id" into "bin" or "bucket"? See also 
https://en.wikipedia.org/wiki/Histogram.
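
A minimal userspace sketch of the renaming suggested above. The values of
MIN_LOG_LAT and LOG_LAT_SZ are assumptions chosen for illustration, and
ilog2_ul() and clamp_int() are hypothetical stand-ins for the kernel's
ilog2() and clamp():

```c
/* Assumed values for illustration; the real constants live in the patch. */
#define MIN_LOG_LAT	0
#define LOG_LAT_SZ	10

/* Userspace stand-in for the kernel's ilog2(): floor(log2(v)) for v > 0. */
static int ilog2_ul(unsigned long v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* Userspace stand-in for the kernel's clamp() on ints. */
static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Map a latency in milliseconds to a histogram bucket index. */
static int lat_ms_to_bucket(unsigned long ms)
{
	int bucket = ms ? ilog2_ul(ms) - MIN_LOG_LAT + 1 : 0;

	return clamp_int(bucket, 0, LOG_LAT_SZ - 1);
}
```

With these assumed constants, 0 ms lands in bucket 0, each power of two opens
a new bucket, and anything beyond the last bucket is clamped into it.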

Thanks,

Bart.

* Re: [PATCH v6 08/25] rtrs: client: sysfs interface functions
  2019-12-30 10:29 ` [PATCH v6 08/25] rtrs: client: sysfs interface functions Jack Wang
@ 2020-01-02 21:14   ` Bart Van Assche
  2020-01-03 14:59     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 21:14 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +static struct kobj_type ktype = {
> +	.sysfs_ops = &kobj_sysfs_ops,
> +};

Can this data structure be declared 'const'?

> +static ssize_t max_reconnect_attempts_show(struct device *dev,
> +					   struct device_attribute *attr,
> +					   char *page)
> +{
> +	struct rtrs_clt *clt;
> +
> +	clt = container_of(dev, struct rtrs_clt, dev);

If the above two statements were combined into a single statement, 
would the result still fit in 80 columns? If so, please combine these two 
statements into a single statement.

> +static ssize_t max_reconnect_attempts_store(struct device *dev,
> +					    struct device_attribute *attr,
> +					    const char *buf,
> +					    size_t count)
> +{
> +	struct rtrs_clt *clt;
> +	int value;
> +	int ret;
> +
> +	clt = container_of(dev, struct rtrs_clt, dev);

Same comment here and also for other uses of 'clt': how about combining 
the declaration and initialization of 'clt' into a single line of code?

> +static ssize_t mpath_policy_show(struct device *dev,
> +				 struct device_attribute *attr,
> +				 char *page)
> +{
> +	struct rtrs_clt *clt;
> +
> +	clt = container_of(dev, struct rtrs_clt, dev);
> +
> +	switch (clt->mp_policy) {
> +	case MP_POLICY_RR:
> +		return sprintf(page, "round-robin (RR: %d)\n", clt->mp_policy);
> +	case MP_POLICY_MIN_INFLIGHT:
> +		return sprintf(page, "min-inflight (MI: %d)\n", clt->mp_policy);
> +	default:
> +		return sprintf(page, "Unknown (%d)\n", clt->mp_policy);
> +	}
> +}

Is the above show function compatible with the sysfs one-value-per-file 
rule?
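
One possible reading of the one-value-per-file rule, sketched in userspace C:
the file reports only the policy name, without the extra "(RR: %d)" field.
The enum and function below are illustrative stand-ins, not the patch's code,
and the exact strings are assumptions:

```c
#include <stdio.h>

/* Illustrative copy of the two policies named in the patch. */
enum mp_policy {
	MP_POLICY_RR,
	MP_POLICY_MIN_INFLIGHT,
};

/*
 * Write a single value (the policy name plus a newline) into @page and
 * return the number of characters written, as snprintf() does.
 */
static int mpath_policy_show(enum mp_policy policy, char *page, size_t len)
{
	switch (policy) {
	case MP_POLICY_RR:
		return snprintf(page, len, "round-robin\n");
	case MP_POLICY_MIN_INFLIGHT:
		return snprintf(page, len, "min-inflight\n");
	default:
		return snprintf(page, len, "unknown\n");
	}
}
```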

> +static struct kobj_attribute rtrs_clt_remove_path_attr =
> +	__ATTR(remove_path, 0644, rtrs_clt_remove_path_show,
> +	       rtrs_clt_remove_path_store);

Could __ATTR_RW() have been used here?

> +static struct kobj_attribute rtrs_clt_src_addr_attr =
> +	__ATTR(src_addr, 0444, rtrs_clt_src_addr_show, NULL);

Could __ATTR_RO() have been used here?

> +static struct attribute_group rtrs_clt_sess_attr_group = {
> +	.attrs = rtrs_clt_sess_attrs,
> +};

Can this data structure be declared 'const'?

Thanks,

Bart.

* Re: [PATCH v6 09/25] rtrs: server: private header with server structs and functions
  2019-12-30 10:29 ` [PATCH v6 09/25] rtrs: server: private header with server structs and functions Jack Wang
@ 2020-01-02 21:24   ` Bart Van Assche
  2020-01-08 16:33     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 21:24 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +struct rtrs_stats_wc_comp {
> +	atomic64_t	calls;
> +	atomic64_t	total_wc_cnt;
> +};

Please document the meaning of the members of this data structure.

> +struct rtrs_srv_stats_rdma_stats {
> +	struct {
> +		atomic64_t	cnt;
> +		atomic64_t	size_total;
> +	} dir[2];
> +};

Please document the meaning of the members of this data structure and 
also which index (0, 1) corresponds to which direction (read, write).

> +struct rtrs_srv_op {
> +	struct rtrs_srv_con		*con;
> +	u32				msg_id;
> +	u8				dir;
> +	struct rtrs_msg_rdma_read	*rd_msg;
> +	struct ib_rdma_wr		*tx_wr;
> +	struct ib_sge			*tx_sg;
> +};

Please document the role of this data structure.

> +struct rtrs_srv_mr {
> +	struct ib_mr	*mr;
> +	struct sg_table	sgt;
> +	struct ib_cqe	inv_cqe; /* only for always_invalidate=true */
> +	u32		msg_id; /* only for always_invalidate=true */
> +	u32		msg_off; /* only for always_invalidate=true */
> +	struct rtrs_iu	*iu; /* send buffer for new rkey msg */
> +};

Please document the role of this data structure.

> +extern struct class *rtrs_dev_class;

Please make sure that the static 'rtrs_dev_class' variable in rtrs-clt.c 
and the variable declared in this header file have different names.

Thanks,

Bart.

* Re: [PATCH v6 11/25] rtrs: server: statistics functions
  2019-12-30 10:29 ` [PATCH v6 11/25] rtrs: server: statistics functions Jack Wang
@ 2020-01-02 22:02   ` Bart Van Assche
  2020-01-08 12:55     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 22:02 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +int rtrs_srv_reset_rdma_stats(struct rtrs_srv_stats *stats, bool enable)
> +{
> +	if (enable) {
> +		struct rtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> +
> +		memset(r, 0, sizeof(*r));
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}

I think the traditional kernel coding style is "if (!enable) return ...".
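
A userspace sketch of the early-return form referred to above. The struct
is a simplified, hypothetical stand-in for rtrs_srv_stats_rdma_stats:

```c
#include <errno.h>
#include <string.h>

/* Simplified stand-in for the per-direction RDMA counters. */
struct rdma_stats {
	long long cnt[2];
	long long size_total[2];
};

/* Handle the error case first, then fall through to the main path. */
static int reset_rdma_stats(struct rdma_stats *stats, int enable)
{
	if (!enable)
		return -EINVAL;

	memset(stats, 0, sizeof(*stats));
	return 0;
}
```

The behaviour is the same as in the quoted function; only the nesting level
of the main path changes.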

> +ssize_t rtrs_srv_stats_rdma_to_str(struct rtrs_srv_stats *stats,
> +				    char *page, size_t len)
> +{
> +	struct rtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> +	struct rtrs_srv_sess *sess;
> +
> +	sess = container_of(stats, typeof(*sess), stats);
> +
> +	return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
> +			 (s64)atomic64_read(&r->dir[READ].cnt),
> +			 (s64)atomic64_read(&r->dir[READ].size_total),
> +			 (s64)atomic64_read(&r->dir[WRITE].cnt),
> +			 (s64)atomic64_read(&r->dir[WRITE].size_total),
> +			 atomic_read(&sess->ids_inflight));
> +}

Does this follow the sysfs one-value-per-file rule?

> +int rtrs_srv_stats_wc_completion_to_str(struct rtrs_srv_stats *stats,
> +					 char *buf, size_t len)
> +{
> +	return snprintf(buf, len, "%lld %lld\n",
> +			(s64)atomic64_read(&stats->wc_comp.total_wc_cnt),
> +			(s64)atomic64_read(&stats->wc_comp.calls));
> +}

Same comment here.

Thanks,

Bart.

* Re: [PATCH v6 10/25] rtrs: server: main functionality
  2019-12-30 10:29 ` [PATCH v6 10/25] rtrs: server: main functionality Jack Wang
@ 2020-01-02 22:03   ` Bart Van Assche
  2020-01-07 13:19     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 22:03 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +MODULE_DESCRIPTION("RTRS Server");

Please expand the "RTRS" abbreviation in the module description.

> +static void rtrs_srv_get_ops_ids(struct rtrs_srv_sess *sess)
> +{
> +	atomic_inc(&sess->ids_inflight);
> +}
> +
> +static void rtrs_srv_put_ops_ids(struct rtrs_srv_sess *sess)
> +{
> +	if (atomic_dec_and_test(&sess->ids_inflight))
> +		wake_up(&sess->ids_waitq);
> +}
> +
> +static void rtrs_srv_wait_ops_ids(struct rtrs_srv_sess *sess)
> +{
> +	wait_event(sess->ids_waitq, !atomic_read(&sess->ids_inflight));
> +}

So rtrs_srv_wait_ops_ids() returns without grabbing any synchronization 
object? What guarantees that ids_inflight is not increased after 
wait_event() has returned and before rtrs_srv_wait_ops_ids() returns?

> +	/*
> +	 * From time to time we have to post signalled sends,
> +	 * or send queue will fill up and only QP reset can help.
> +	 */
> +	flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
> +			0 : IB_SEND_SIGNALED;

Should "signalled" perhaps be changed into "signaled"?

How can posting a signaled send prevent the send queue from overflowing? 
Isn't that something that can only be guaranteed by tracking the number 
of WQEs in the send queue?
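
What tracking the number of WQEs could look like, as a single-threaded
userspace sketch. The struct and function names are hypothetical and do not
come from the patch; kernel code would need atomic operations or a lock:

```c
#include <stdbool.h>

/* Hypothetical send-queue accounting, not taken from the patch. */
struct sq_tracker {
	unsigned int depth;	/* send queue size in WQEs */
	unsigned int used;	/* WQEs posted but not yet completed */
};

/* Reserve one WQE slot before posting; fail if the queue is full. */
static bool sq_reserve(struct sq_tracker *sq)
{
	if (sq->used == sq->depth)
		return false;
	sq->used++;
	return true;
}

/*
 * Release @n slots when a signaled completion arrives. @n would be the
 * number of WQEs posted since the previous signaled one.
 */
static void sq_complete(struct sq_tracker *sq, unsigned int n)
{
	if (n > sq->used)
		n = sq->used;
	sq->used -= n;
}
```

In this scheme a post is refused once the queue is full, rather than relying
on the interval between signaled sends alone.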

> +/**
> + * send_io_resp_imm() - response with empty IMM on failed READ/WRITE requests or
> + *                      on successful WRITE request.
> + * @con		the connection to send back result
> + * @id		the id associated to io
> + * @errno	the error number of the IO.
> + *
> + * Return 0 on success, errno otherwise.
> + */

Should "response ... on" perhaps be changed into "respond ... to"? 
Should "associated to" perhaps be changed into "associated with"?

> +static int map_cont_bufs(struct rtrs_srv_sess *sess)

A comment that explains what "cont" in this function name means would be 
welcome.

> +static inline int sockaddr_cmp(const struct sockaddr *a,
> +			       const struct sockaddr *b)
> +{
> +	switch (a->sa_family) {
> +	case AF_IB:
> +		return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
> +			      &((struct sockaddr_ib *)b)->sib_addr,
> +			      sizeof(struct ib_addr));
> +	case AF_INET:
> +		return memcmp(&((struct sockaddr_in *)a)->sin_addr,
> +			      &((struct sockaddr_in *)b)->sin_addr,
> +			      sizeof(struct in_addr));
> +	case AF_INET6:
> +		return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
> +			      &((struct sockaddr_in6 *)b)->sin6_addr,
> +			      sizeof(struct in6_addr));
> +	default:
> +		return -ENOENT;
> +	}
> +}

The memcmp() return value can be used to sort values. Since that is not 
the case for the sockaddr_cmp() return value, please document this. 
Additionally, it seems like a comparison of a->sa_family and 
b->sa_family is missing?
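
A userspace sketch of a comparison that checks the address family first and
returns a plain equal/not-equal result instead of a memcmp()-style ordering.
AF_IB is omitted because struct sockaddr_ib is not available in userspace
headers; the function name is hypothetical:

```c
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Return true only if both addresses share family and address bytes. */
static bool sockaddr_equal(const struct sockaddr *a, const struct sockaddr *b)
{
	if (a->sa_family != b->sa_family)
		return false;

	switch (a->sa_family) {
	case AF_INET:
		return !memcmp(&((const struct sockaddr_in *)a)->sin_addr,
			       &((const struct sockaddr_in *)b)->sin_addr,
			       sizeof(struct in_addr));
	case AF_INET6:
		return !memcmp(&((const struct sockaddr_in6 *)a)->sin6_addr,
			       &((const struct sockaddr_in6 *)b)->sin6_addr,
			       sizeof(struct in6_addr));
	default:
		/* Unknown families never compare equal. */
		return false;
	}
}
```

Returning a bool makes it explicit that the result cannot be used for
sorting.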

> +static int rtrs_rdma_do_accept(struct rtrs_srv_sess *sess,
> +				struct rdma_cm_id *cm_id)
> +{
> +	struct rtrs_srv *srv = sess->srv;
> +	struct rtrs_msg_conn_rsp msg;
> +	struct rdma_conn_param param;
> +	int err;
> +
> +	memset(&param, 0, sizeof(param));
> +	param.rnr_retry_count = 7;
> +	param.private_data = &msg;
> +	param.private_data_len = sizeof(msg);
> +
> +	memset(&msg, 0, sizeof(msg));
> +	msg.magic = cpu_to_le16(RTRS_MAGIC);
> +	msg.version = cpu_to_le16(RTRS_PROTO_VER);
> +	msg.errno = 0;
> +	msg.queue_depth = cpu_to_le16(srv->queue_depth);
> +	msg.max_io_size = cpu_to_le32(max_chunk_size - MAX_HDR_SIZE);
> +	msg.max_hdr_size = cpu_to_le32(MAX_HDR_SIZE);
> +
> +	if (always_invalidate)
> +		msg.flags = cpu_to_le32(RTRS_MSG_NEW_RKEY_F);
> +
> +	err = rdma_accept(cm_id, &param);
> +	if (err)
> +		pr_err("rdma_accept(), err: %d\n", err);
> +
> +	return err;
> +}

Please use a designated initializer list instead of memset() followed by 
initialization of multiple structure members.
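
A userspace sketch of the designated-initializer form. The struct layout and
the EX_* constants are hypothetical and only mirror the shape of
rtrs_msg_conn_rsp; endianness conversions are left out for brevity:

```c
#include <stdint.h>

/* Hypothetical constants, chosen only for this example. */
#define EX_MAGIC	0x1234
#define EX_PROTO_VER	1
#define EX_NEW_RKEY_F	(1u << 0)

/* Simplified stand-in for the connection response message. */
struct conn_rsp {
	uint16_t magic;
	uint16_t version;
	uint16_t err;
	uint16_t queue_depth;
	uint32_t max_io_size;
	uint32_t max_hdr_size;
	uint32_t flags;
};

static struct conn_rsp make_conn_rsp(uint16_t queue_depth, int new_rkey)
{
	/* Members not named here (err, max_*_size) are zero-initialized. */
	struct conn_rsp msg = {
		.magic		= EX_MAGIC,
		.version	= EX_PROTO_VER,
		.queue_depth	= queue_depth,
		.flags		= new_rkey ? EX_NEW_RKEY_F : 0,
	};

	return msg;
}
```

One caveat: an initializer list zeroes unnamed members but does not
guarantee that padding bytes are zeroed, which can matter for structures sent
over the wire. The example struct has no padding.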

> +static int rtrs_srv_rdma_init(struct rtrs_srv_ctx *ctx, unsigned int port)
> +{
> +	struct sockaddr_in6 sin = {
> +		.sin6_family	= AF_INET6,
> +		.sin6_addr	= IN6ADDR_ANY_INIT,
> +		.sin6_port	= htons(port),
> +	};
> +	struct sockaddr_ib sib = {
> +		.sib_family			= AF_IB,
> +		.sib_addr.sib_subnet_prefix	= 0ULL,
> +		.sib_addr.sib_interface_id	= 0ULL,
> +		.sib_sid	= cpu_to_be64(RDMA_IB_IP_PS_IB | port),
> +		.sib_sid_mask	= cpu_to_be64(0xffffffffffffffffULL),
> +		.sib_pkey	= cpu_to_be16(0xffff),
> +	};

A minor comment: structure members that are zero do not have to be 
initialized explicitly. The compiler does that automatically.

> +struct rtrs_srv_ctx *rtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
> +				     unsigned int port)
> +{
> +	struct rtrs_srv_ctx *ctx;
> +	int err;
> +
> +	ctx = alloc_srv_ctx(rdma_ev, link_ev);
> +	if (unlikely(!ctx))
> +		return ERR_PTR(-ENOMEM);
> +
> +	err = rtrs_srv_rdma_init(ctx, port);
> +	if (unlikely(err)) {
> +		free_srv_ctx(ctx);
> +		return ERR_PTR(err);
> +	}
> +	/* Do not let module be unloaded if server context is alive */
> +	__module_get(THIS_MODULE);
> +
> +	return ctx;
> +}
> +EXPORT_SYMBOL(rtrs_srv_open);

Isn't it inconvenient for users if module unloading is prevented while 
one or more connections are active? This requires users to figure out 
how to trigger a log out if they want to unload a kernel module. 
Additionally, how are users expected to prevent the client from logging 
in again after the server has told it to log out and before the server 
kernel module is unloaded?

Thanks,

Bart.

* Re: [PATCH v6 12/25] rtrs: server: sysfs interface functions
  2019-12-30 10:29 ` [PATCH v6 12/25] rtrs: server: sysfs interface functions Jack Wang
@ 2020-01-02 22:06   ` Bart Van Assche
  2020-01-07 14:40     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 22:06 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +static struct kobj_attribute rtrs_srv_disconnect_attr =
> +	__ATTR(disconnect, 0644,
> +	       rtrs_srv_disconnect_show, rtrs_srv_disconnect_store);

Could __ATTR_RW() have been used here?

> +static struct kobj_attribute rtrs_srv_hca_port_attr =
> +	__ATTR(hca_port, 0444, rtrs_srv_hca_port_show, NULL);

Could __ATTR_RO() have been used here?

Thanks,

Bart.

* Re: [PATCH v6 13/25] rtrs: include client and server modules into kernel compilation
  2019-12-30 10:29 ` [PATCH v6 13/25] rtrs: include client and server modules into kernel compilation Jack Wang
@ 2020-01-02 22:11   ` Bart Van Assche
  2020-01-03 16:19     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 22:11 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +config INFINIBAND_RTRS
> +	tristate
> +	depends on INFINIBAND_ADDR_TRANS
> +
> +config INFINIBAND_RTRS_CLIENT
> +	tristate "RTRS client module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_RTRS
> +	help
> +	  RDMA transport client module.
> +
> +	  RTRS client allows for simplified data transfer and connection
> +	  establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
> +	  READ/WRITE semantics and provides multipath capabilities.

What does "simplified" mean in this context? I'm concerned that 
including that word will cause confusion. How about writing that RTRS 
implements a reliable transport layer and also multipathing 
functionality and that it is intended to be the base layer for a block 
storage initiator over RDMA?

> +config INFINIBAND_RTRS_SERVER
> +	tristate "RTRS server module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_RTRS
> +	help
> +	  RDMA transport server module.
> +
> +	  RTRS server module processing connection and IO requests received
> +	  from the RTRS client module, it will pass the IO requests to its
> +	  user eg. RNBD_server.

Users who see these help texts will be left wondering what RTRS stands 
for. Please add some text that explains what the RTRS abbreviation 
stands for.

Thanks,

Bart.

* Re: [PATCH v6 14/25] rtrs: a bit of documentation
  2019-12-30 10:29 ` [PATCH v6 14/25] rtrs: a bit of documentation Jack Wang
  2019-12-30 23:19   ` Bart Van Assche
@ 2020-01-02 22:21   ` Bart Van Assche
  2020-01-07 15:49     ` Jinpu Wang
  1 sibling, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 22:21 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang,
	rpenyaev, linux-kernel

On 12/30/19 2:29 AM, Jack Wang wrote:
> diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-client b/Documentation/ABI/testing/sysfs-class-rtrs-client
> new file mode 100644
> index 000000000000..8b219cf6c5c4
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-class-rtrs-client
> @@ -0,0 +1,190 @@
> +What:		/sys/class/rtrs-client
> +Date:		Jan 2020
> +KernelVersion:	5.6
> +Contact:	Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
> +Description:
> +When a user of RTRS API creates a new session, a directory entry with
> +the name of that session is created under /sys/class/rtrs-client/<session-name>/

Thank you for having included this ABI description. This is very 
helpful. Please follow the format documented in Documentation/ABI/README 
and make sure that all text, including the description, starts in column 
17. Please also use tabs for indentation.

> diff --git a/drivers/infiniband/ulp/rtrs/README b/drivers/infiniband/ulp/rtrs/README
> new file mode 100644
> index 000000000000..59ad60318a18
> --- /dev/null
> +++ b/drivers/infiniband/ulp/rtrs/README
> @@ -0,0 +1,149 @@
> +****************************
> +InfiniBand Transport (RTRS)
> +****************************
> +
> +RTRS (InfiniBand Transport) is a reliable high speed transport library
> +which provides support to establish optimal number of connections
> +between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
> +transport. It is optimized to transfer (read/write) IO blocks.

Is it explained somewhere how the optimal number of connections is 
determined and also according to which metric the number of connections 
is optimized? Is the number of connections chosen to minimize latency, 
maximize IOPS or perhaps to optimize yet another metric?

Thanks,

Bart.

* Re: [PATCH v6 15/25] rnbd: private headers with rnbd protocol structs and helpers
  2019-12-30 10:29 ` [PATCH v6 15/25] rnbd: private headers with rnbd protocol structs and helpers Jack Wang
@ 2020-01-02 22:34   ` Bart Van Assche
  2020-01-07 16:53     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 22:34 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> + * @device_id:		device_id on server side to identify the device

Is this a number that only has a meaning inside the RTRS software? Is 
the role of this number perhaps similar to an NVMe namespace or SCSI 
LUN? If so, please mention this. Additionally, does this number start 
from zero or from one?

> + * @max_segments:	max segments hardware support in one transfer

Which "hardware" does this comment refer to? The RDMA adapter or the 
block device in the server? In the latter case, what if the block device 
has been implemented in software?

What kind of transfer does this comment refer to? A DMA transfer? If so, 
please mention this.

> +struct rnbd_msg_open_rsp {
> +	struct rnbd_msg_hdr	hdr;
> +	__le32			device_id;
> +	__le64			nsectors;
> +	__le32			max_hw_sectors;
> +	__le32			max_write_same_sectors;
> +	__le32			max_discard_sectors;
> +	__le32			discard_granularity;
> +	__le32			discard_alignment;
> +	__le16			physical_block_size;
> +	__le16			logical_block_size;
> +	__le16			max_segments;
> +	__le16			secure_discard;
> +	u8			rotational;
> +	u8			reserved[11];
> +};

Thanks,

Bart.

* Re: [PATCH v6 16/25] rnbd: client: private header with client structs and functions
  2019-12-30 10:29 ` [PATCH v6 16/25] rnbd: client: private header with client structs and functions Jack Wang
@ 2020-01-02 22:37   ` Bart Van Assche
  2020-01-07 17:09     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 22:37 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +struct rnbd_iu {
> +	union {
> +		struct request *rq; /* for block io */
> +		void *buf; /* for user messages */
> +	};
> +	struct rtrs_permit	*permit;
> +	union {
> +		/* use to send msg associated with a dev */
> +		struct rnbd_clt_dev *dev;
> +		/* use to send msg associated with a sess */
> +		struct rnbd_clt_session *sess;
> +	};
> +	blk_status_t		status;
> +	struct scatterlist	sglist[BMAX_SEGMENTS];
> +	struct work_struct	work;
> +	int			errno;
> +	struct rnbd_iu_comp	comp;
> +	atomic_t		refcount;
> +};

This data structure includes both a blk_status_t and an errno value. Can 
these two members be combined into a single member?

Thanks,

Bart.

* Re: [PATCH v6 17/25] rnbd: client: main functionality
  2019-12-30 10:29 ` [PATCH v6 17/25] rnbd: client: main functionality Jack Wang
@ 2020-01-02 23:55   ` Bart Van Assche
  2020-01-08 14:22     ` Jinpu Wang
  2020-01-10 14:45     ` Jinpu Wang
  0 siblings, 2 replies; 89+ messages in thread
From: Bart Van Assche @ 2020-01-02 23:55 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +MODULE_DESCRIPTION("InfiniBand Network Block Device Client");

InfiniBand or RDMA?

> +static int rnbd_clt_set_dev_attr(struct rnbd_clt_dev *dev,
> +				  const struct rnbd_msg_open_rsp *rsp)
> +{
> +	struct rnbd_clt_session *sess = dev->sess;
> +
> +	if (unlikely(!rsp->logical_block_size))
> +		return -EINVAL;
> +
> +	dev->device_id		    = le32_to_cpu(rsp->device_id);
> +	dev->nsectors		    = le64_to_cpu(rsp->nsectors);
> +	dev->logical_block_size	    = le16_to_cpu(rsp->logical_block_size);
> +	dev->physical_block_size    = le16_to_cpu(rsp->physical_block_size);
> +	dev->max_write_same_sectors = le32_to_cpu(rsp->max_write_same_sectors);
> +	dev->max_discard_sectors    = le32_to_cpu(rsp->max_discard_sectors);
> +	dev->discard_granularity    = le32_to_cpu(rsp->discard_granularity);
> +	dev->discard_alignment	    = le32_to_cpu(rsp->discard_alignment);
> +	dev->secure_discard	    = le16_to_cpu(rsp->secure_discard);
> +	dev->rotational		    = rsp->rotational;
> +
> +	dev->max_hw_sectors = sess->max_io_size / dev->logical_block_size;

The above statement looks suspicious to me. The unit of the second 
argument of blk_queue_max_hw_sectors() is 512 bytes. Since 
dev->max_hw_sectors is passed as the second argument to 
blk_queue_max_hw_sectors() I think it should also have 512 bytes as unit 
instead of the logical block size.
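
The difference between the two units can be checked with a small userspace
calculation. The helper names are hypothetical, and the 128 KiB I/O size used
in the note below is an assumed example value:

```c
#include <stdint.h>

#define SECTOR_SHIFT	9	/* 512-byte units, as the block layer expects */

/* Limit expressed in 512-byte sectors. */
static uint32_t max_hw_sectors_512(uint32_t max_io_size)
{
	return max_io_size >> SECTOR_SHIFT;
}

/* Limit as computed in the quoted code: in logical blocks. */
static uint32_t max_hw_sectors_lbs(uint32_t max_io_size, uint16_t lbs)
{
	return max_io_size / lbs;
}
```

For a 128 KiB max_io_size, the first helper gives 256 and the second gives 32
when the logical block size is 4096, so the two only agree for 512-byte
logical blocks.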

> +static int rnbd_clt_change_capacity(struct rnbd_clt_dev *dev,
> +				     size_t new_nsectors)
> +{
> +	int err = 0;
> +
> +	rnbd_clt_info(dev, "Device size changed from %zu to %zu sectors\n",
> +		       dev->nsectors, new_nsectors);
> +	dev->nsectors = new_nsectors;
> +	set_capacity(dev->gd,
> +		     dev->nsectors * (dev->logical_block_size /
> +				      SECTOR_SIZE));
> +	err = revalidate_disk(dev->gd);
> +	if (err)
> +		rnbd_clt_err(dev,
> +			      "Failed to change device size from %zu to %zu, err: %d\n",
> +			      dev->nsectors, new_nsectors, err);
> +	return err;
> +}

Please document the unit of nsectors in struct rnbd_clt_dev. Please also 
document the unit of the 'new_nsectors' argument.

The set_capacity() call can only be correct if the unit of dev->nsectors 
is one logical block. Is that really the case?
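
A userspace sketch of the conversion performed before set_capacity(), which
takes a count of 512-byte sectors. The helper name is hypothetical:

```c
#include <stdint.h>

#define SECTOR_SIZE	512

/*
 * Convert a count of logical blocks into 512-byte sectors. This mirrors
 * the expression in the quoted code and is only meaningful if the input
 * really counts logical blocks.
 */
static uint64_t lblocks_to_sectors(uint64_t nblocks, uint32_t lbs)
{
	return nblocks * (lbs / SECTOR_SIZE);
}
```

If dev->nsectors were already in 512-byte units, the same expression would
overstate the capacity by a factor of lbs / 512 (8 for 4096-byte logical
blocks).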

> +static void msg_io_conf(void *priv, int errno)
> +{
> +	struct rnbd_iu *iu = priv;
> +	struct rnbd_clt_dev *dev = iu->dev;
> +	struct request *rq = iu->rq;
> +
> +	iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
> +
> +	blk_mq_complete_request(rq);
> +
> +	if (errno)
> +		rnbd_clt_info_rl(dev, "%s I/O failed with err: %d\n",
> +				  rq_data_dir(rq) == READ ? "read" : "write",
> +				  errno);
> +}

Accessing 'rq' after having called blk_mq_complete_request() may trigger 
a use-after-free. Please don't do that.

> +static void wait_for_rtrs_disconnection(struct rnbd_clt_session *sess)
> +__releases(&sess_lock)
> +__acquires(&sess_lock)

Please indent __releases() and __acquires() annotations.

> +{
> +	DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
> +
> +	prepare_to_wait(&sess->rtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
> +	if (IS_ERR_OR_NULL(sess->rtrs)) {
> +		finish_wait(&sess->rtrs_waitq, &wait);
> +		return;
> +	}
> +	mutex_unlock(&sess_lock);
> +	/* After unlock session can be freed, so careful */
> +	schedule();
> +	mutex_lock(&sess_lock);
> +}

How can a function that calls schedule() and that is not surrounded by a 
loop be correct? What if e.g. schedule() finishes due to a spurious wakeup?

> +static struct rnbd_clt_session *__find_and_get_sess(const char *sessname)
> +__releases(&sess_lock)
> +__acquires(&sess_lock)
> +{
> +	struct rnbd_clt_session *sess;
> +	int err;
> +
> +again:
> +	list_for_each_entry(sess, &sess_list, list) {
> +		if (strcmp(sessname, sess->sessname))
> +			continue;
> +
> +		if (unlikely(sess->rtrs_ready && IS_ERR_OR_NULL(sess->rtrs)))
> +			/*
> +			 * No RTRS connection, session is dying.
> +			 */
> +			continue;
> +
> +		if (likely(rnbd_clt_get_sess(sess))) {
> +			/*
> +			 * Alive session is found, wait for RTRS connection.
> +			 */
> +			mutex_unlock(&sess_lock);
> +			err = wait_for_rtrs_connection(sess);
> +			if (unlikely(err))
> +				rnbd_clt_put_sess(sess);
> +			mutex_lock(&sess_lock);
> +
> +			if (unlikely(err))
> +				/* Session is dying, repeat the loop */
> +				goto again;
> +
> +			return sess;
> +		}
> +		/*
> +		 * Ref is 0, session is dying, wait for RTRS disconnect
> +		 * in order to avoid session names clashes.
> +		 */
> +		wait_for_rtrs_disconnection(sess);
> +		/*
> +		 * RTRS is disconnected and soon session will be freed,
> +		 * so repeat a loop.
> +		 */
> +		goto again;
> +	}
> +
> +	return NULL;
> +}

Since wait_for_rtrs_disconnection() unlocks sess_lock, can the 
list_for_each_entry() above trigger a use-after-free of sess->next?

> +static size_t rnbd_clt_get_sg_size(struct scatterlist *sglist, u32 len)
> +{
> +	struct scatterlist *sg;
> +	size_t tsize = 0;
> +	int i;
> +
> +	for_each_sg(sglist, sg, len, i)
> +		tsize += sg->length;
> +	return tsize;
> +}

Please follow the example of other block drivers and use blk_rq_bytes() 
instead of iterating over the sg-list.

> +static int setup_mq_tags(struct rnbd_clt_session *sess)
> +{
> +	struct blk_mq_tag_set *tags = &sess->tag_set;
> +
> +	memset(tags, 0, sizeof(*tags));
> +	tags->ops		= &rnbd_mq_ops;
> +	tags->queue_depth	= sess->queue_depth;
> +	tags->numa_node		= NUMA_NO_NODE;
> +	tags->flags		= BLK_MQ_F_SHOULD_MERGE |
> +				  BLK_MQ_F_TAG_SHARED;
> +	tags->cmd_size		= sizeof(struct rnbd_iu);
> +	tags->nr_hw_queues	= num_online_cpus();
> +
> +	return blk_mq_alloc_tag_set(tags);
> +}

Please change the name of the "tags" pointer into "tag_set".

> +static int index_to_minor(int index)
> +{
> +	return index << RNBD_PART_BITS;
> +}
> +
> +static int minor_to_index(int minor)
> +{
> +	return minor >> RNBD_PART_BITS;
> +}

Is it useful to introduce functions that encapsulate a single shift 
operation?

> +	blk_queue_virt_boundary(dev->queue, 4095);

The virt_boundary parameter must match the RDMA memory registration page 
size. Please introduce a symbolic constant for the RDMA memory 
registration page size such that these two parameters stay in sync in 
case anyone would want to change the memory registration page size.
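
A sketch of how the boundary mask could be derived from a single constant.
The EX_MR_PAGE_* names are hypothetical; the patch itself hard-codes 4095:

```c
#include <stdbool.h>

/* Hypothetical names for the RDMA memory registration page size. */
#define EX_MR_PAGE_SHIFT	12
#define EX_MR_PAGE_SIZE		(1UL << EX_MR_PAGE_SHIFT)

/* Boundary mask derived from the page size: all in-page offset bits set. */
static unsigned long ex_virt_boundary_mask(void)
{
	return EX_MR_PAGE_SIZE - 1;
}

/* True if @addr starts exactly on a registration page boundary. */
static bool ex_is_page_aligned(unsigned long addr)
{
	return (addr & ex_virt_boundary_mask()) == 0;
}
```

In the driver, the call would then take the form
blk_queue_virt_boundary(dev->queue, EX_MR_PAGE_SIZE - 1), so that changing the
shift updates both places.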

> +static void rnbd_clt_setup_gen_disk(struct rnbd_clt_dev *dev, int idx)
> +{
> +	dev->gd->major		= rnbd_client_major;
> +	dev->gd->first_minor	= index_to_minor(idx);
> +	dev->gd->fops		= &rnbd_client_ops;
> +	dev->gd->queue		= dev->queue;
> +	dev->gd->private_data	= dev;
> +	snprintf(dev->gd->disk_name, sizeof(dev->gd->disk_name), "rnbd%d",
> +		 idx);
> +	pr_debug("disk_name=%s, capacity=%zu\n",
> +		 dev->gd->disk_name,
> +		 dev->nsectors * (dev->logical_block_size / SECTOR_SIZE)
> +		 );
> +
> +	set_capacity(dev->gd, dev->nsectors * (dev->logical_block_size /
> +					       SECTOR_SIZE));

Again, what is the unit of dev->nsectors?

> +static void rnbd_clt_add_gen_disk(struct rnbd_clt_dev *dev)
> +{
> +	add_disk(dev->gd);
> +}

Is it useful to introduce this wrapper around add_disk()?

Thanks,

Bart.

* Re: [PATCH v6 18/25] rnbd: client: sysfs interface functions
  2019-12-30 10:29 ` [PATCH v6 18/25] rnbd: client: sysfs interface functions Jack Wang
@ 2020-01-03  0:03   ` Bart Van Assche
  2020-01-08 13:06     ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-03  0:03 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: axboe, hch, sagi, leon, dledford, danil.kipnis, jinpu.wang, rpenyaev

On 12/30/19 2:29 AM, Jack Wang wrote:
> +static const match_table_t rnbd_opt_tokens = {
> +	{	RNBD_OPT_PATH,		"path=%s"		},
> +	{	RNBD_OPT_DEV_PATH,	"device_path=%s"	},
> +	{	RNBD_OPT_ACCESS_MODE,	"access_mode=%s"	},
> +	{	RNBD_OPT_SESSNAME,	"sessname=%s"		},
> +	{	RNBD_OPT_ERR,		NULL			},
> +};


Please follow the example of other kernel code and change 
"{<tab>...<tab>}" into "{ ... }".

> +/* remove new line from string */
> +static void strip(char *s)
> +{
> +	char *p = s;
> +
> +	while (*s != '\0') {
> +		if (*s != '\n')
> +			*p++ = *s++;
> +		else
> +			++s;
> +	}
> +	*p = '\0';
> +}

Does this function change a multiline string into a single line? I'm not 
sure that is how sysfs input should be processed ... Is this perhaps 
what you want?

static inline void kill_final_newline(char *str)
{
	char *newline = strrchr(str, '\n');

	if (newline && !newline[1])
		*newline = 0;
}
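
The difference between the two helpers can be seen in a userspace sketch.
strip_all_newlines() reproduces the logic of the quoted strip() under a
different, hypothetical name:

```c
#include <string.h>

/* Same logic as strip() in the patch: drop every '\n' in the string. */
static void strip_all_newlines(char *s)
{
	char *p = s;

	while (*s != '\0') {
		if (*s != '\n')
			*p++ = *s++;
		else
			++s;
	}
	*p = '\0';
}

/* Suggested alternative: drop only a single trailing '\n'. */
static void kill_final_newline(char *str)
{
	char *newline = strrchr(str, '\n');

	if (newline && !newline[1])
		*newline = 0;
}
```

For the input "a\nb\n", the first helper yields "ab" while the second yields
"a\nb", leaving embedded newlines for later validation.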

> +static struct kobj_attribute rnbd_clt_map_device_attr =
> +	__ATTR(map_device, 0644,
> +	       rnbd_clt_map_device_show, rnbd_clt_map_device_store);

Could __ATTR_RW() have been used here?

Thanks,

Bart.

* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2020-01-02 17:00       ` Bart Van Assche
  2020-01-02 18:26         ` Jason Gunthorpe
@ 2020-01-03 12:27         ` Jinpu Wang
  1 sibling, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 12:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 6:00 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 1/2/20 7:27 AM, Jinpu Wang wrote:
> > On Mon, Dec 30, 2019 at 8:48 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >> On 2019-12-30 02:29, Jack Wang wrote:
> >>> +enum {
> >>> +     SERVICE_CON_QUEUE_DEPTH = 512,
> >>
> >> What is a service connection?
> > s/SERVICE_CON_QUEUE_DEPTH/CON_QUEUE_DEPTH/g, do you think
> > CON_QUEUE_DEPTH is better or just QUEUE_DEPTH?
>
> The name of the constant is fine, but what I meant is the following: has
> it been documented anywhere what the role of a "service connection" is?
Ah, I get your point now; I will add a comment before the constant.
>
> >>> +struct rtrs_ib_dev_pool {
> >>> +     struct mutex            mutex;
> >>> +     struct list_head        list;
> >>> +     enum ib_pd_flags        pd_flags;
> >>> +     const struct rtrs_ib_dev_pool_ops *ops;
> >>> +};
> >>
> >> What is the purpose of an rtrs_ib_dev_pool and what does it contain?
> > The idea was documented in the patchset here:
> > https://www.spinics.net/lists/linux-rdma/msg64025.html
> > '''
> > This is an attempt to make a device pool API out of a common code,
> > which caches pair of ib_device and ib_pd pointers. I found 4 places,
> > where this common functionality can be replaced by some lib calls:
> > nvme, nvmet, iser and isert. Total deduplication gain in loc is not
> > quite significant, but eventually new ULP IB code can also require
> > the same device/pd pair cache, e.g. in our IBTRS module the same
> > code has to be repeated again, which was observed by Sagi and he
> > suggested to make a common helper function instead of producing
> > another copy.
> > '''
>
> The word "pool" suggests ownership. Since struct rtrs_ib_dev_pool owns
> protection domains instead of RDMA devices, how about renaming that data
> structure into rtrs_pd_per_rdma_dev, rtrs_rdma_dev_pd or something
> similar? How about adding a comment like the following above that data
> structure?
rtrs_rdma_dev_pd sounds better to me, will also add the comments.
>
> /*
>   * Data structure used to associate one protection domain (PD) with each
>   * RDMA device.
>   */
>
> >>> +/**
> >>> + * struct rtrs_msg_conn_req - Client connection request to the server
> >>> + * @magic:      RTRS magic
> >>> + * @version:    RTRS protocol version
> >>> + * @cid:        Current connection id
> >>> + * @cid_num:    Number of connections per session
> >>> + * @recon_cnt:          Reconnections counter
> >>> + * @sess_uuid:          UUID of a session (path)
> >>> + * @paths_uuid:         UUID of a group of sessions (paths)
> >>> + *
> >>> + * NOTE: max size 56 bytes, see man rdma_connect().
> >>> + */
> >>> +struct rtrs_msg_conn_req {
> >>> +     u8              __cma_version; /* Is set to 0 by cma.c in case of
> >>> +                                     * AF_IB, do not touch that.
> >>> +                                     */
> >>> +     u8              __ip_version;  /* On sender side that should be
> >>> +                                     * set to 0, or cma_save_ip_info()
> >>> +                                     * extract garbage and will fail.
> >>> +                                     */
> >>
> >> The above two fields and the comments next to it look suspicious to me.
> >> Does RTRS perhaps try to generate CMA-formatted messages without using
> >> the CMA to format these messages?
> > The problem is in cma_format_hdr over-writes the first byte for AF_IB
> > https://www.spinics.net/lists/linux-rdma/msg22397.html
> >
> > No one fixes the problem since then.
>
> How about adding that URL to the comment block above struct
> rtrs_msg_conn_req?
Ok

Thanks

* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2020-01-02 18:26         ` Jason Gunthorpe
@ 2020-01-03 12:31           ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 12:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 7:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jan 02, 2020 at 09:00:53AM -0800, Bart Van Assche wrote:
>
> > > > > +/**
> > > > > + * struct rtrs_msg_conn_req - Client connection request to the server
> > > > > + * @magic:      RTRS magic
> > > > > + * @version:    RTRS protocol version
> > > > > + * @cid:        Current connection id
> > > > > + * @cid_num:    Number of connections per session
> > > > > + * @recon_cnt:          Reconnections counter
> > > > > + * @sess_uuid:          UUID of a session (path)
> > > > > + * @paths_uuid:         UUID of a group of sessions (paths)
> > > > > + *
> > > > > + * NOTE: max size 56 bytes, see man rdma_connect().
> > > > > + */
> > > > > +struct rtrs_msg_conn_req {
> > > > > +     u8              __cma_version; /* Is set to 0 by cma.c in case of
> > > > > +                                     * AF_IB, do not touch that.
> > > > > +                                     */
> > > > > +     u8              __ip_version;  /* On sender side that should be
> > > > > +                                     * set to 0, or cma_save_ip_info()
> > > > > +                                     * extract garbage and will fail.
> > > > > +                                     */
> > > >
> > > > The above two fields and the comments next to it look suspicious to me.
> > > > Does RTRS perhaps try to generate CMA-formatted messages without using
> > > > the CMA to format these messages?
> > > The problem is in cma_format_hdr over-writes the first byte for AF_IB
> > > https://www.spinics.net/lists/linux-rdma/msg22397.html
> > >
> > > No one fixes the problem since then.
> >
> > How about adding that URL to the comment block above struct
> > rtrs_msg_conn_req?
>
> Or just fixing whatever the problem is..
>
> Jason
I can give it another try if no one else fixes the problem, but I have plenty on my to-do list.

Thanks

* Re: [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device
  2020-01-02 18:28   ` Jason Gunthorpe
@ 2020-01-03 12:34     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 12:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 7:28 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Dec 30, 2019 at 06:39:00PM -0800, Bart Van Assche wrote:
> > On 2019-12-30 02:29, Jack Wang wrote:
> > > here is V6 of the RTRS (former IBTRS) rdma transport library and the
> > > corresponding RNBD (former IBNBD) rdma network block device.
> > >
> > > Changelog since v5:
> > > 1 rebased to linux-5.5-rc4
> > > 2 fix typo in my email address in first patch
> > > 3 cleanup copyright as suggested by Leon Romanovsky
> > > 4 remove 2 redudant kobject_del in error path as suggested by Leon Romanovsky
> > > 5 add MAINTAINERS entries in alphabetical order as Gal Pressman suggested
> >
> > Please always include the full changelog when posting a new version.
> > Every other Linux kernel patch series I have seen includes a full
> > changelog in version two and later versions of its cover letter.
>
> We now also like it if you include URLs to lore.kernel.org for the
> prior submissions.
>
> Jason
Will do.

* Re: [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers
  2019-12-31  0:07   ` Bart Van Assche
@ 2020-01-03 13:48     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 13:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Dec 31, 2019 at 1:07 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > +static inline u32 rtrs_to_io_rsp_imm(u32 msg_id, int errno, bool w_inval)
> > +{
> > +     enum rtrs_imm_type type;
> > +     u32 payload;
> > +
> > +     /* 9 bits for errno, 19 bits for msg_id */
> > +     payload = (abs(errno) & 0x1ff) << 19 | (msg_id & 0x7ffff);
> > +     type = (w_inval ? RTRS_IO_RSP_W_INV_IMM : RTRS_IO_RSP_IMM);
> > +
> > +     return rtrs_to_imm(type, payload);
> > +}
> > +
> > +static inline void rtrs_from_io_rsp_imm(u32 payload, u32 *msg_id, int *errno)
> > +{
> > +     /* 9 bits for errno, 19 bits for msg_id */
> > +     *msg_id = (payload & 0x7ffff);
> > +     *errno = -(int)((payload >> 19) & 0x1ff);
> > +}
>
> The above comments mention that 19 bits are used for msg_id. The 0x7ffff
> mask however has 23 bits set. Did I see that correctly? If so, does that
> mean that the errno and msg_id bitfields overlap partially?
I double-checked with a calculator: 0x7ffff has 19 bits set, not 23 :)
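
The bit counts can also be checked in code. Below is a minimal user-space
sketch, assuming only the two masks and the shift from the quoted helpers;
the names count_bits(), pack_io_rsp() and unpack_io_rsp() are made up for
this example, and the parameter is called err because errno is a macro in
user space. The rtrs_to_imm() type encoding is left out.

```c
#include <stdint.h>
#include <stdlib.h>

#define MSG_ID_BITS 19
#define MSG_ID_MASK 0x7ffffu	/* low 19 bits: msg_id */
#define ERRNO_MASK  0x1ffu	/* 9 bits: abs(errno) */

/* Count the bits set in a mask. */
static unsigned int count_bits(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)	/* clears the lowest set bit each round */
		n++;
	return n;
}

/* Same layout as the payload built in rtrs_to_io_rsp_imm(). */
static uint32_t pack_io_rsp(uint32_t msg_id, int err)
{
	return (((uint32_t)abs(err) & ERRNO_MASK) << MSG_ID_BITS) |
	       (msg_id & MSG_ID_MASK);
}

/* Same decoding as rtrs_from_io_rsp_imm(). */
static void unpack_io_rsp(uint32_t payload, uint32_t *msg_id, int *err)
{
	*msg_id = payload & MSG_ID_MASK;
	*err = -(int)((payload >> MSG_ID_BITS) & ERRNO_MASK);
}
```

0x7ffff is one hex digit 7 (3 bits) followed by four hex digits f (16 bits),
so 19 bits in total; the errno field starts at bit 19 and the two fields do
not overlap.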
>
> Thanks,
>
> Bart.

* Re: [PATCH v6 06/25] rtrs: client: main functionality
  2019-12-30 23:53   ` Bart Van Assche
  2020-01-02 18:23     ` Jason Gunthorpe
@ 2020-01-03 14:30     ` Jinpu Wang
  2020-01-03 16:12       ` Bart Van Assche
  1 sibling, 1 reply; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 14:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Dec 31, 2019 at 12:53 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > + * InfiniBand Transport Layer
>
> InfiniBand or RDMA?
will fix.
>
> > +MODULE_DESCRIPTION("RTRS Client");
>
> Please spell out RTRS in full.
ok.

>
> > +static const struct rtrs_ib_dev_pool_ops dev_pool_ops;
>
> Can this forward declaration be avoided?
I don't see how to do it easily.

>
> > +static struct rtrs_ib_dev_pool dev_pool = {
> > +     .ops = &dev_pool_ops
> > +};
>
> Can this structure be declared 'const'?
No, it cannot be const; we also initialize it in rtrs_ib_dev_pool_init().
>
> > +static inline struct rtrs_permit *
> > +__rtrs_get_permit(struct rtrs_clt *clt, enum rtrs_clt_con_type con_type)
> > +{
> > +     size_t max_depth = clt->queue_depth;
> > +     struct rtrs_permit *permit;
> > +     int cpu, bit;
> > +
> > +     cpu = get_cpu();
> > +     do {
> > +             bit = find_first_zero_bit(clt->permits_map, max_depth);
> > +             if (unlikely(bit >= max_depth)) {
> > +                     put_cpu();
> > +                     return NULL;
> > +             }
> > +
> > +     } while (unlikely(test_and_set_bit_lock(bit, clt->permits_map)));
> > +     put_cpu();
>
> Are the get_cpu() and put_cpu() calls around this loop useful? If not,
> please remove these calls. Otherwise please add a comment that explains
> the purpose of these calls.
>
> An additional question: is it possible to replace the above loop with an
> sbitmap_get() call?
will check.
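
For reference, the retry loop under discussion can be sketched in user space
with C11 atomics. This is only an illustration of the "find a zero bit, then
atomically test-and-set it, retry on a race" pattern; it does not model
get_cpu()/put_cpu() or sbitmap, and PERMITS, permits_map, get_permit() and
put_permit() are hypothetical names.

```c
#include <stdatomic.h>

#define PERMITS 64	/* hypothetical depth; one 64-bit word keeps this short */

static _Atomic unsigned long long permits_map;

/* Returns the index of the acquired permit, or -1 if all are in use. */
static int get_permit(void)
{
	for (;;) {
		unsigned long long map = atomic_load(&permits_map);
		unsigned long long bit;
		int idx;

		/* Equivalent of find_first_zero_bit() on a snapshot. */
		for (idx = 0; idx < PERMITS; idx++)
			if (!(map & (1ULL << idx)))
				break;
		if (idx >= PERMITS)
			return -1;

		bit = 1ULL << idx;
		/*
		 * Equivalent of test_and_set_bit_lock(): if the bit was
		 * already set, another thread won the race, so retry.
		 */
		if (!(atomic_fetch_or(&permits_map, bit) & bit))
			return idx;
	}
}

/* Releases a permit previously returned by get_permit(). */
static void put_permit(int idx)
{
	atomic_fetch_and(&permits_map, ~(1ULL << idx));
}
```

Nothing in this loop depends on which CPU it runs on, which is why the
question about get_cpu()/put_cpu() arises.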
>
> > +static void complete_rdma_req(struct rtrs_clt_io_req *req, int errno,
> > +                           bool notify, bool can_wait)
> > +{
> > +     struct rtrs_clt_con *con = req->con;
> > +     struct rtrs_clt_sess *sess;
> > +     int err;
> > +
> > +     if (WARN_ON(!req->in_use))
> > +             return;
> > +     if (WARN_ON(!req->con))
> > +             return;
> > +     sess = to_clt_sess(con->c.sess);
> > +
> > +     if (req->sg_cnt) {
> > +             if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
> > +                     /*
> > +                      * We are here to invalidate RDMA read requests
> > +                      * ourselves.  In normal scenario server should
> > +                      * send INV for all requested RDMA reads, but
> > +                      * we are here, thus two things could happen:
> > +                      *
> > +                      *    1.  this is failover, when errno != 0
> > +                      *        and can_wait == 1,
> > +                      *
> > +                      *    2.  something totally bad happened and
> > +                      *        server forgot to send INV, so we
> > +                      *        should do that ourselves.
> > +                      */
>
> Please document in the protocol documentation when RDMA reads are used.
We don't use RDMA READ. "Requested RDMA read" means the client asks for
data and the server side does an RDMA write to the buffers.
>
> What does "server forgot to send INV" mean?
It means the server side malfunctioned, panicked, etc. and did not send the
SEND_WITH_INV WR,
so the client has to do a local invalidate.
>
> Additionally, if I remember correctly Jason considers it very important
> that invalidation happens from the submitting context because otherwise
> the RDMA retry mechanism can't work.

>
> > +static void process_io_rsp(struct rtrs_clt_sess *sess, u32 msg_id,
> > +                        s16 errno, bool w_inval)
> > +{
> > +     struct rtrs_clt_io_req *req;
> > +
> > +     if (WARN_ON(msg_id >= sess->queue_depth))
> > +             return;
> > +
> > +     req = &sess->reqs[msg_id];
> > +     /* Drop need_inv if server responsed with invalidation */
> > +     req->need_inv &= !w_inval;
> > +     complete_rdma_req(req, errno, true, false);
> > +}
>
> Please document the meaning of the "w_inval" argument. Please also fix
> the spelling of "responsed".
>
> Thanks,
>
> Bart.
OK, thanks

* Re: [PATCH v6 07/25] rtrs: client: statistics functions
  2020-01-02 21:07   ` Bart Van Assche
@ 2020-01-03 14:39     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 14:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 10:07 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > From: Jack Wang <jinpu.wang@cloud.ionos.com>
> >
> > This introduces set of functions used on client side to account
> > statistics of RDMA data sent/received, amount of IOs inflight,
> > latency, cpu migrations, etc.  Almost all statistics is collected
>                                                         ^^
>                                                         are?
will fix.
> > using percpu variables.
> > [ ... ]
> > +static inline int rtrs_clt_ms_to_id(unsigned long ms)
> > +{
> > +     int id = ms ? ilog2(ms) - MIN_LOG_LAT + 1 : 0;
> > +
> > +     return clamp(id, 0, LOG_LAT_SZ - 1);
> > +}
>
> I think it is unusual to call the returned value an "id" in this
> context. How about changing "id" into "bin" or "bucket"? See also
> https://en.wikipedia.org/wiki/Histogram.
will rename id to bin
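
A small user-space sketch of the latency-to-bin mapping, using the "bin"
naming. The values of MIN_LOG_LAT and LOG_LAT_SZ below are assumptions picked
for the example (the real ones are defined elsewhere in the patch), and
ilog2_ul() and clamp_int() are stand-ins for the kernel's ilog2() and clamp().

```c
/* Assumed values, for illustration only. */
#define MIN_LOG_LAT 0
#define LOG_LAT_SZ  8

/* floor(log2(v)) for v > 0; stand-in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long v)
{
	int l = -1;

	while (v) {
		v >>= 1;
		l++;
	}
	return l;
}

/* Stand-in for the kernel's clamp(). */
static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Maps a latency in milliseconds to a histogram bin (log2 buckets). */
static int ms_to_bin(unsigned long ms)
{
	int bin = ms ? ilog2_ul(ms) - MIN_LOG_LAT + 1 : 0;

	return clamp_int(bin, 0, LOG_LAT_SZ - 1);
}
```

With these assumed constants, 0 ms lands in bin 0, 1 ms in bin 1, 4 to 7 ms
in bin 3, and everything from 64 ms upwards in the last bin, 7.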
>
> Thanks,
>
> Bart.
Thanks

* Re: [PATCH v6 08/25] rtrs: client: sysfs interface functions
  2020-01-02 21:14   ` Bart Van Assche
@ 2020-01-03 14:59     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 14:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 10:14 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +static struct kobj_type ktype = {
> > +     .sysfs_ops = &kobj_sysfs_ops,
> > +};
>
> Can this data structure be declared 'const'?
No, kobject_init_and_add() expects a struct kobj_type *.
>
> > +static ssize_t max_reconnect_attempts_show(struct device *dev,
> > +                                        struct device_attribute *attr,
> > +                                        char *page)
> > +{
> > +     struct rtrs_clt *clt;
> > +
> > +     clt = container_of(dev, struct rtrs_clt, dev);
>
> If the above two statements would be combined into a single statement,
> does the result still fit in 80 columns? If so, please combine these two
> statements into a single statement.
ok.
>
> > +static ssize_t max_reconnect_attempts_store(struct device *dev,
> > +                                         struct device_attribute *attr,
> > +                                         const char *buf,
> > +                                         size_t count)
> > +{
> > +     struct rtrs_clt *clt;
> > +     int value;
> > +     int ret;
> > +
> > +     clt = container_of(dev, struct rtrs_clt, dev);
>
> Same comment here and also for other uses of 'clt': how about combining
> the declaration and initialization of 'clt' into a single line of code?
ok.
>
> > +static ssize_t mpath_policy_show(struct device *dev,
> > +                              struct device_attribute *attr,
> > +                              char *page)
> > +{
> > +     struct rtrs_clt *clt;
> > +
> > +     clt = container_of(dev, struct rtrs_clt, dev);
> > +
> > +     switch (clt->mp_policy) {
> > +     case MP_POLICY_RR:
> > +             return sprintf(page, "round-robin (RR: %d)\n", clt->mp_policy);
> > +     case MP_POLICY_MIN_INFLIGHT:
> > +             return sprintf(page, "min-inflight (MI: %d)\n", clt->mp_policy);
> > +     default:
> > +             return sprintf(page, "Unknown (%d)\n", clt->mp_policy);
> > +     }
> > +}
>
> Is the above show function compatible with the sysfs one-value-per-file
> rule?
It's a single string :)
>
> > +static struct kobj_attribute rtrs_clt_remove_path_attr =
> > +     __ATTR(remove_path, 0644, rtrs_clt_remove_path_show,
> > +            rtrs_clt_remove_path_store);
>
> Could __ATTR_RW() have been used here?
can be used, but I prefer to keep the rtrs_clt_ prefix for function names.
>
> > +static struct kobj_attribute rtrs_clt_src_addr_attr =
> > +     __ATTR(src_addr, 0444, rtrs_clt_src_addr_show, NULL);
>
> Could __ATTR_RO() have been used here?
ditto.
>
> > +static struct attribute_group rtrs_clt_sess_attr_group = {
> > +     .attrs = rtrs_clt_sess_attrs,
> > +};
>
> Can this data structure be declared 'const'?
Yes.
>
> Thanks,
>
> Bart.
Thanks

* Re: [PATCH v6 06/25] rtrs: client: main functionality
  2020-01-03 14:30     ` Jinpu Wang
@ 2020-01-03 16:12       ` Bart Van Assche
  0 siblings, 0 replies; 89+ messages in thread
From: Bart Van Assche @ 2020-01-03 16:12 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On 1/3/20 6:30 AM, Jinpu Wang wrote:
> On Tue, Dec 31, 2019 at 12:53 AM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 2019-12-30 02:29, Jack Wang wrote:
>>> +static void complete_rdma_req(struct rtrs_clt_io_req *req, int errno,
>>> +                           bool notify, bool can_wait)
>>> +{
>>> +     struct rtrs_clt_con *con = req->con;
>>> +     struct rtrs_clt_sess *sess;
>>> +     int err;
>>> +
>>> +     if (WARN_ON(!req->in_use))
>>> +             return;
>>> +     if (WARN_ON(!req->con))
>>> +             return;
>>> +     sess = to_clt_sess(con->c.sess);
>>> +
>>> +     if (req->sg_cnt) {
>>> +             if (unlikely(req->dir == DMA_FROM_DEVICE && req->need_inv)) {
>>> +                     /*
>>> +                      * We are here to invalidate RDMA read requests
>>> +                      * ourselves.  In normal scenario server should
>>> +                      * send INV for all requested RDMA reads, but
>>> +                      * we are here, thus two things could happen:
>>> +                      *
>>> +                      *    1.  this is failover, when errno != 0
>>> +                      *        and can_wait == 1,
>>> +                      *
>>> +                      *    2.  something totally bad happened and
>>> +                      *        server forgot to send INV, so we
>>> +                      *        should do that ourselves.
>>> +                      */
>>
>> Please document in the protocol documentation when RDMA reads are used.
> We don't use RDMA READ. "Requested RDMA read" means the client asks for
> data and the server side does an RDMA write to the buffers.

Please make the comment more clear. The comment says "RDMA read" twice.

Thanks,

Bart.

* Re: [PATCH v6 13/25] rtrs: include client and server modules into kernel compilation
  2020-01-02 22:11   ` Bart Van Assche
@ 2020-01-03 16:19     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-03 16:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 11:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +config INFINIBAND_RTRS
> > +     tristate
> > +     depends on INFINIBAND_ADDR_TRANS
> > +
> > +config INFINIBAND_RTRS_CLIENT
> > +     tristate "RTRS client module"
> > +     depends on INFINIBAND_ADDR_TRANS
> > +     select INFINIBAND_RTRS
> > +     help
> > +       RDMA transport client module.
> > +
> > +       RTRS client allows for simplified data transfer and connection
> > +       establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
> > +       READ/WRITE semantics and provides multipath capabilities.
>
> What does "simplified" mean in this context? I'm concerned that
> including that word will cause confusion. How about writing that RTRS
> implements a reliable transport layer and also multipathing
> functionality and that it is intended to be the base layer for a block
> storage initiator over RDMA?
Sounds fine; I will also explain what the RTRS abbreviation stands for.
>
> > +config INFINIBAND_RTRS_SERVER
> > +     tristate "RTRS server module"
> > +     depends on INFINIBAND_ADDR_TRANS
> > +     select INFINIBAND_RTRS
> > +     help
> > +       RDMA transport server module.
> > +
> > +       RTRS server module processing connection and IO requests received
> > +       from the RTRS client module, it will pass the IO requests to its
> > +       user eg. RNBD_server.
>
> Users who see these help texts will be left wondering what RTRS stands
> for. Please add some text that explains what the RTRS abbreviation
> stands for.
>
> Thanks,
>
> Bart.
Thanks.

* Re: [PATCH v6 04/25] rtrs: core: lib functions shared between client and server modules
  2019-12-30 22:25   ` Bart Van Assche
@ 2020-01-07 12:22     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 12:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Mon, Dec 30, 2019 at 11:25 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > + * InfiniBand Transport Layer
>
> Is RTRS an InfiniBand or an RDMA transport layer?
will fix.
>
> > +MODULE_DESCRIPTION("RTRS Core");
>
> Please write out RTRS in full and consider changing the word "Core" into
> "client and server".
will do.
>
> > +     WARN_ON(!queue_size);
> > +     ius = kcalloc(queue_size, sizeof(*ius), gfp_mask);
> > +
> > +     if (unlikely(!ius))
> > +             return NULL;
>
> No blank line between the 'ius' assignment and the 'ius' check please.
ok.
>
> > +int rtrs_iu_post_recv(struct rtrs_con *con, struct rtrs_iu *iu)
> > +{
> > +     struct rtrs_sess *sess = con->sess;
> > +     struct ib_recv_wr wr;
> > +     const struct ib_recv_wr *bad_wr;
> > +     struct ib_sge list;
> > +
> > +     list.addr   = iu->dma_addr;
> > +     list.length = iu->size;
> > +     list.lkey   = sess->dev->ib_pd->local_dma_lkey;
> > +
> > +     if (WARN_ON(list.length == 0)) {
> > +             rtrs_wrn(con->sess,
> > +                       "Posting receive work request failed, sg list is empty\n");
> > +             return -EINVAL;
> > +     }
> > +
> > +     wr.next    = NULL;
> > +     wr.wr_cqe  = &iu->cqe;
> > +     wr.sg_list = &list;
> > +     wr.num_sge = 1;
> > +
> > +     return ib_post_recv(con->qp, &wr, &bad_wr);
> > +}
> > +EXPORT_SYMBOL_GPL(rtrs_iu_post_recv);
>
> The above code is fragile: although this is unlikely, if a member would
> be added in struct ib_sge or in struct ib_recv_wr then the above code
> will leave some member variables uninitialized. Has it been considered
> to initialize these structures using a single assignment statement, e.g.
> as follows:
>
>         wr = (struct ib_recv_wr) {
>                 .wr_cqe = ...,
>                 .sg_list = ...,
>                 .num_sge = 1,
>         };
Will do.
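
Why the single assignment is more robust can be shown with a small
user-space sketch. struct sketch_wr is a hypothetical stand-in, not the real
struct ib_recv_wr; the point is only that members not named in a compound
literal are zero-initialized, including ones added later.

```c
#include <stddef.h>

/* Hypothetical stand-in for a work request structure. */
struct sketch_wr {
	struct sketch_wr *next;
	void *wr_cqe;
	void *sg_list;
	int num_sge;
	int added_later;	/* imagine this member was added afterwards */
};

/*
 * One assignment from a compound literal: every member that is not
 * named (next, added_later) is set to zero / NULL.
 */
static void init_wr(struct sketch_wr *wr, void *cqe, void *sge)
{
	*wr = (struct sketch_wr) {
		.wr_cqe = cqe,
		.sg_list = sge,
		.num_sge = 1,
	};
}
```

With member-by-member assignment, added_later would keep whatever was on the
stack; with the compound literal it is always 0.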
>
> > +int rtrs_post_recv_empty(struct rtrs_con *con, struct ib_cqe *cqe)
> > +{
> > +     struct ib_recv_wr wr;
> > +     const struct ib_recv_wr *bad_wr;
> > +
> > +     wr.next    = NULL;
> > +     wr.wr_cqe  = cqe;
> > +     wr.sg_list = NULL;
> > +     wr.num_sge = 0;
> > +
> > +     return ib_post_recv(con->qp, &wr, &bad_wr);
> > +}
> > +EXPORT_SYMBOL_GPL(rtrs_post_recv_empty);
>
> Same comment for this function.
ditto.
>
> > +int rtrs_post_recv_empty_x2(struct rtrs_con *con, struct ib_cqe *cqe)
> > +{
> > +     struct ib_recv_wr wr_arr[2], *wr;
> > +     const struct ib_recv_wr *bad_wr;
> > +     int i;
> > +
> > +     memset(wr_arr, 0, sizeof(wr_arr));
> > +     for (i = 0; i < ARRAY_SIZE(wr_arr); i++) {
> > +             wr = &wr_arr[i];
> > +             wr->wr_cqe  = cqe;
> > +             if (i)
> > +                     /* Chain backwards */
> > +                     wr->next = &wr_arr[i - 1];
> > +     }
> > +
> > +     return ib_post_recv(con->qp, wr, &bad_wr);
> > +}
> > +EXPORT_SYMBOL_GPL(rtrs_post_recv_empty_x2);
>
> I have not yet seen any other RDMA code that is similar to the above
> function. A comment above this function that explains its purpose would
> be more than welcome.
Will add comment.
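
As a reading aid, the chaining done by the loop can be reproduced in user
space. struct chain_wr and chain_backwards() are hypothetical names; the loop
body mirrors the quoted function, and the sketch only shows which element
ends up as the head of the list handed to the post call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct ib_recv_wr. */
struct chain_wr {
	struct chain_wr *next;
	void *wr_cqe;
};

/* Mirrors the loop in rtrs_post_recv_empty_x2(); returns the chain head. */
static struct chain_wr *chain_backwards(struct chain_wr *arr, size_t n,
					void *cqe)
{
	struct chain_wr *wr = NULL;
	size_t i;

	memset(arr, 0, n * sizeof(*arr));
	for (i = 0; i < n; i++) {
		wr = &arr[i];
		wr->wr_cqe = cqe;
		if (i)
			/* Chain backwards: element i points at element i - 1 */
			wr->next = &arr[i - 1];
	}
	return wr;	/* last element, i.e. the head of the chain */
}
```

For a two-element array the head is arr[1], whose next is arr[0], whose next
is NULL, so a single post call submits two empty receives that share one cqe.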
>
> > +int rtrs_iu_post_send(struct rtrs_con *con, struct rtrs_iu *iu, size_t size,
> > +                    struct ib_send_wr *head)
> > +{
> > +     struct rtrs_sess *sess = con->sess;
> > +     struct ib_send_wr wr;
> > +     const struct ib_send_wr *bad_wr;
> > +     struct ib_sge list;
> > +
> > +     if ((WARN_ON(size == 0)))
> > +             return -EINVAL;
>
> No superfluous parentheses please.
ok

>
> > +     list.addr   = iu->dma_addr;
> > +     list.length = size;
> > +     list.lkey   = sess->dev->ib_pd->local_dma_lkey;
> > +
> > +     memset(&wr, 0, sizeof(wr));
> > +     wr.next       = NULL;
> > +     wr.wr_cqe     = &iu->cqe;
> > +     wr.sg_list    = &list;
> > +     wr.num_sge    = 1;
> > +     wr.opcode     = IB_WR_SEND;
> > +     wr.send_flags = IB_SEND_SIGNALED;
>
> Has it been considered to use designated initializers instead of a
> memset() followed by multiple assignments? Same question for
> rtrs_iu_post_rdma_write_imm() and rtrs_post_rdma_write_imm_empty().
Sounds good, will do.

>
> > +static int create_qp(struct rtrs_con *con, struct ib_pd *pd,
> > +                  u16 wr_queue_size, u32 max_sge)
> > +{
> > +     struct ib_qp_init_attr init_attr = {NULL};
> > +     struct rdma_cm_id *cm_id = con->cm_id;
> > +     int ret;
> > +
> > +     init_attr.cap.max_send_wr = wr_queue_size;
> > +     init_attr.cap.max_recv_wr = wr_queue_size;
>
> What code is responsible for ensuring that neither max_send_wr nor
> max_recv_wr exceeds the device limits? Please document this in a comment
> above this function.
rtrs-clt/srv query the device limits to ensure the settings do not
exceed them.
Will add a comment.

>
> > +     init_attr.cap.max_recv_sge = 1;
> > +     init_attr.event_handler = qp_event_handler;
> > +     init_attr.qp_context = con;
> > +#undef max_send_sge
> > +     init_attr.cap.max_send_sge = max_sge;
>
> Is the "undef max_send_sge" really necessary? If so, please add a
> comment that explains why it is necessary.
it's not, will remove.
>
> > +static int rtrs_str_gid_to_sockaddr(const char *addr, size_t len,
> > +                                  short port, struct sockaddr_storage *dst)
> > +{
> > +     struct sockaddr_ib *dst_ib = (struct sockaddr_ib *)dst;
> > +     int ret;
> > +
> > +     /*
> > +      * We can use some of the I6 functions since GID is a valid
> > +      * IPv6 address format
> > +      */
> > +     ret = in6_pton(addr, len, dst_ib->sib_addr.sib_raw, '\0', NULL);
> > +     if (ret == 0)
> > +             return -EINVAL;
>
> What is "I6"?
IPv6, will fix.
>
> Is the fourth argument to this function correct? From the comment above
> in6_pton(): "@delim: the delimiter of the IPv6 address in @src, -1 means
> no delimiter".
'\0' means end of the string here, seems correct to me.
>
> > +int sockaddr_to_str(const struct sockaddr *addr, char *buf, size_t len)
> > +{
> > +     int cnt;
> > +
> > +     switch (addr->sa_family) {
> > +     case AF_IB:
> > +             cnt = scnprintf(buf, len, "gid:%pI6",
> > +                     &((struct sockaddr_ib *)addr)->sib_addr.sib_raw);
> > +             return cnt;
> > +     case AF_INET:
> > +             cnt = scnprintf(buf, len, "ip:%pI4",
> > +                     &((struct sockaddr_in *)addr)->sin_addr);
> > +             return cnt;
> > +     case AF_INET6:
> > +             cnt = scnprintf(buf, len, "ip:%pI6c",
> > +                       &((struct sockaddr_in6 *)addr)->sin6_addr);
> > +             return cnt;
> > +     }
> > +     cnt = scnprintf(buf, len, "<invalid address family>");
> > +     pr_err("Invalid address family\n");
> > +     return cnt;
> > +}
> > +EXPORT_SYMBOL(sockaddr_to_str);
>
> Is the pr_err() statement in the above function useful? Will anyone be
> able to figure out what is going on if the "Invalid address family"
> string appears in the system log? Please consider changing that pr_err()
> statement into a WARN_ON_ONCE() statement.
I expect the caller to also print something to syslog; combining
the two messages will help.
>
> > +     ret = rtrs_str_to_sockaddr(str, len, port, addr->dst);
> > +
> > +     return ret;
>
> Please change this into a single return statement.
ok
>
> > +EXPORT_SYMBOL(rtrs_addr_to_sockaddr);
> > +
> > +void rtrs_ib_dev_pool_init(enum ib_pd_flags pd_flags,
> > +                         struct rtrs_ib_dev_pool *pool)
> > +{
> > +     WARN_ON(pool->ops && (!pool->ops->alloc ^ !pool->ops->free));
> > +     INIT_LIST_HEAD(&pool->list);
> > +     mutex_init(&pool->mutex);
> > +     pool->pd_flags = pd_flags;
> > +}
> > +EXPORT_SYMBOL(rtrs_ib_dev_pool_init);
> > +
> > +void rtrs_ib_dev_pool_deinit(struct rtrs_ib_dev_pool *pool)
> > +{
> > +     WARN_ON(!list_empty(&pool->list));
> > +}
> > +EXPORT_SYMBOL(rtrs_ib_dev_pool_deinit);
>
> Since rtrs_ib_dev_pool_init() calls mutex_init(), should
> rtrs_ib_dev_pool_deinit() call mutex_destroy()?
You're right.

>
> Thanks,
>
> Bart.
>
Thanks Bart.

* Re: [PATCH v6 05/25] rtrs: client: private header with client structs and functions
  2019-12-30 22:51   ` Bart Van Assche
@ 2020-01-07 12:39     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 12:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Mon, Dec 30, 2019 at 11:51 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > + * InfiniBand Transport Layer
>
> InfiniBand or RDMA?
will fix.
>
> > +static inline const char *rtrs_clt_state_str(enum rtrs_clt_state state)
> > +{
> > +     switch (state) {
> > +     case RTRS_CLT_CONNECTING:
> > +             return "RTRS_CLT_CONNECTING";
> > +     case RTRS_CLT_CONNECTING_ERR:
> > +             return "RTRS_CLT_CONNECTING_ERR";
> > +     case RTRS_CLT_RECONNECTING:
> > +             return "RTRS_CLT_RECONNECTING";
> > +     case RTRS_CLT_CONNECTED:
> > +             return "RTRS_CLT_CONNECTED";
> > +     case RTRS_CLT_CLOSING:
> > +             return "RTRS_CLT_CLOSING";
> > +     case RTRS_CLT_CLOSED:
> > +             return "RTRS_CLT_CLOSED";
> > +     case RTRS_CLT_DEAD:
> > +             return "RTRS_CLT_DEAD";
> > +     default:
> > +             return "UNKNOWN";
> > +     }
> > +}
>
> This function is not in the hot path so it shouldn't be inline.
no longer in use, will remove.
>
> > +#define MIN_LOG_SG 2
> > +#define MAX_LOG_SG 5
> > +#define MAX_LIN_SG BIT(MIN_LOG_SG)
> > +#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)
>
> I think these constants deserve a comment that explains what their
> meaning is.
will add comment.
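
Until that comment exists, the values can at least be worked out from the
quoted definitions. The sketch below only evaluates the macros; BIT() is
redefined here as a userspace stand-in, demo_sg_distr_sz() is a hypothetical
helper, and nothing in it says what the constants mean.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(n) (1UL << (n))

/* Definitions copied from the quoted patch. */
#define MIN_LOG_SG 2
#define MAX_LOG_SG 5
#define MAX_LIN_SG BIT(MIN_LOG_SG)
#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)

/* Compile-time check of the values the macros evaluate to. */
static_assert(MAX_LIN_SG == 4, "MAX_LIN_SG is 1 << 2");
static_assert(SG_DISTR_SZ == 9, "SG_DISTR_SZ is 5 - 2 + 4 + 2");

/* Hypothetical helper: returns SG_DISTR_SZ, e.g. to size an array. */
static unsigned long demo_sg_distr_sz(void)
{
	return SG_DISTR_SZ;
}
```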
>
> > +/**
> > + * rtrs_permit - permits the memory allocation for future RDMA operation
> > + */
> > +struct rtrs_permit {
> > +     enum rtrs_clt_con_type con_type;
> > +     unsigned int cpu_id;
> > +     unsigned int mem_id;
> > +     unsigned int mem_off;
> > +};
>
> The comment above this structure is confusing. Please make it more clear.
will extend.
>
> Thanks,
>
> Bart.
Thanks Bart

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 05/25] rtrs: client: private header with client structs and functions
  2019-12-30 23:03   ` Bart Van Assche
@ 2020-01-07 12:39     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 12:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Dec 31, 2019 at 12:03 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > +#define GET_PERMIT(clt, idx) ((clt)->permits + PERMIT_SIZE(clt) * idx)
>
> Please surround 'idx' with parentheses.
>
> Thanks,
>
> Bart.
will do, thanks
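
For the record, a small userspace sketch of why the parentheses matter.
struct demo_clt, its fields and the two offset helpers are hypothetical
stand-ins; only the shape of the macro follows the quoted line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the client structure; only the fields used here. */
struct demo_clt {
	char *permits;       /* base of the permit array */
	size_t permit_size;  /* size of one permit in bytes */
};

#define PERMIT_SIZE(clt) ((clt)->permit_size)

/* As posted: 'idx' is not parenthesized. */
#define GET_PERMIT_UNSAFE(clt, idx) ((clt)->permits + PERMIT_SIZE(clt) * idx)

/* As requested in the review: 'idx' is parenthesized. */
#define GET_PERMIT(clt, idx) ((clt)->permits + PERMIT_SIZE(clt) * (idx))

/* Byte offset of permit i + 1 as computed by the unparenthesized macro. */
static size_t unsafe_offset_of_next(struct demo_clt *clt, size_t i)
{
	assert(clt && clt->permits);
	/* Expands to permits + size * i + 1, not permits + size * (i + 1). */
	return (size_t)(GET_PERMIT_UNSAFE(clt, i + 1) - clt->permits);
}

/* Byte offset of permit i + 1 as computed by the parenthesized macro. */
static size_t safe_offset_of_next(struct demo_clt *clt, size_t i)
{
	assert(clt && clt->permits);
	return (size_t)(GET_PERMIT(clt, i + 1) - clt->permits);
}
```

With a 64-byte permit and i = 2, the first helper lands one byte into the
third permit while the second lands at the start of the fourth.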

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 10/25] rtrs: server: main functionality
  2020-01-02 22:03   ` Bart Van Assche
@ 2020-01-07 13:19     ` Jinpu Wang
  2020-01-07 18:25       ` Jason Gunthorpe
  0 siblings, 1 reply; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 13:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 11:03 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +MODULE_DESCRIPTION("RTRS Server");
>
> Please expand the "RTRS" abbreviation in the module description.
will do.
>
> > +static void rtrs_srv_get_ops_ids(struct rtrs_srv_sess *sess)
> > +{
> > +     atomic_inc(&sess->ids_inflight);
> > +}
> > +
> > +static void rtrs_srv_put_ops_ids(struct rtrs_srv_sess *sess)
> > +{
> > +     if (atomic_dec_and_test(&sess->ids_inflight))
> > +             wake_up(&sess->ids_waitq);
> > +}
> > +
> > +static void rtrs_srv_wait_ops_ids(struct rtrs_srv_sess *sess)
> > +{
> > +     wait_event(sess->ids_waitq, !atomic_read(&sess->ids_inflight));
> > +}
>
> So rtrs_srv_wait_ops_ids() returns without grabbing any synchronization
> object? What guarantees that ids_inflight is not increased after
> wait_event() has returned and before rtrs_srv_wait_ops_ids() returns?
We do rdma_disconnect/ib_drain_qp first, so no new IO from the client
can reach the server,
and then wait for all pending IO to finish.
>
> > +     /*
> > +      * From time to time we have to post signalled sends,
> > +      * or send queue will fill up and only QP reset can help.
> > +      */
> > +     flags = atomic_inc_return(&id->con->wr_cnt) % srv->queue_depth ?
> > +                     0 : IB_SEND_SIGNALED;
>
> Should "signalled" perhaps be changed into "signaled"?
will fix.
>
> How can posting a signaled send prevent that the send queue overflows?
> Isn't that something that can only be guaranteed by tracking the number
> of WQE's in the send queue?
Selective signaling works. All we need to do is signal one WR for
every SQ-depth worth of WRs posted. For example, If the SQ depth is
16, we must signal at least one out of every 16. This ensures proper
flow control for HW resources.
Courtesy: section 8.2.1 of the iWARP Verbs draft
http://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.2.1

See also: https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/

>
> > +/**
> > + * send_io_resp_imm() - response with empty IMM on failed READ/WRITE requests or
> > + *                      on successful WRITE request.
> > + * @con              the connection to send back result
> > + * @id               the id associated to io
> > + * @errno    the error number of the IO.
> > + *
> > + * Return 0 on success, errno otherwise.
> > + */
>
> Should "response ... on" perhaps be changed into "respond ... to"?
> Should "associated to" perhaps be changed into "associated with"?
Yes.
>
> > +static int map_cont_bufs(struct rtrs_srv_sess *sess)
>
> A comment that explains what "cont" in this function name means would be
> welcome.
will do.
>
> > +static inline int sockaddr_cmp(const struct sockaddr *a,
> > +                            const struct sockaddr *b)
> > +{
> > +     switch (a->sa_family) {
> > +     case AF_IB:
> > +             return memcmp(&((struct sockaddr_ib *)a)->sib_addr,
> > +                           &((struct sockaddr_ib *)b)->sib_addr,
> > +                           sizeof(struct ib_addr));
> > +     case AF_INET:
> > +             return memcmp(&((struct sockaddr_in *)a)->sin_addr,
> > +                           &((struct sockaddr_in *)b)->sin_addr,
> > +                           sizeof(struct in_addr));
> > +     case AF_INET6:
> > +             return memcmp(&((struct sockaddr_in6 *)a)->sin6_addr,
> > +                           &((struct sockaddr_in6 *)b)->sin6_addr,
> > +                           sizeof(struct in6_addr));
> > +     default:
> > +             return -ENOENT;
> > +     }
> > +}
>
> The memcmp() return value can be used to sort values. Since that is not
> the case for the sockaddr_cmp() return value, please document this.
> Additionally, it seems like a comparison of a->sa_family and
> b->sa_family is missing?
you're right, will fix.
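
One possible shape for the fix, as a userspace sketch only: compare the
families first and return a plain equal/not-equal result, so nobody mistakes
it for a sort order. demo_sockaddr_equal() is a hypothetical name, and AF_IB
is left out because struct sockaddr_ib is not available in userspace headers.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical userspace variant: returns true only when both addresses
 * have the same family and the same address bytes. The result is an
 * equality test and is not suitable for sorting.
 */
static bool demo_sockaddr_equal(const struct sockaddr *a,
				const struct sockaddr *b)
{
	assert(a && b);

	/* Addresses of different families never compare equal. */
	if (a->sa_family != b->sa_family)
		return false;

	switch (a->sa_family) {
	case AF_INET:
		return !memcmp(&((const struct sockaddr_in *)a)->sin_addr,
			       &((const struct sockaddr_in *)b)->sin_addr,
			       sizeof(struct in_addr));
	case AF_INET6:
		return !memcmp(&((const struct sockaddr_in6 *)a)->sin6_addr,
			       &((const struct sockaddr_in6 *)b)->sin6_addr,
			       sizeof(struct in6_addr));
	default:
		/* Unknown family: treat as not equal. */
		return false;
	}
}
```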
>
> > +static int rtrs_rdma_do_accept(struct rtrs_srv_sess *sess,
> > +                             struct rdma_cm_id *cm_id)
> > +{
> > +     struct rtrs_srv *srv = sess->srv;
> > +     struct rtrs_msg_conn_rsp msg;
> > +     struct rdma_conn_param param;
> > +     int err;
> > +
> > +     memset(&param, 0, sizeof(param));
> > +     param.rnr_retry_count = 7;
> > +     param.private_data = &msg;
> > +     param.private_data_len = sizeof(msg);
> > +
> > +     memset(&msg, 0, sizeof(msg));
> > +     msg.magic = cpu_to_le16(RTRS_MAGIC);
> > +     msg.version = cpu_to_le16(RTRS_PROTO_VER);
> > +     msg.errno = 0;
> > +     msg.queue_depth = cpu_to_le16(srv->queue_depth);
> > +     msg.max_io_size = cpu_to_le32(max_chunk_size - MAX_HDR_SIZE);
> > +     msg.max_hdr_size = cpu_to_le32(MAX_HDR_SIZE);
> > +
> > +     if (always_invalidate)
> > +             msg.flags = cpu_to_le32(RTRS_MSG_NEW_RKEY_F);
> > +
> > +     err = rdma_accept(cm_id, &param);
> > +     if (err)
> > +             pr_err("rdma_accept(), err: %d\n", err);
> > +
> > +     return err;
> > +}
>
> Please use a designated initializer list instead of memset() followed by
> initialization of multiple structure members.
ok, will do.
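
A minimal userspace sketch of what that could look like. struct
demo_conn_rsp, its field widths and the magic/version numbers are
hypothetical, and the cpu_to_le*() conversions of the real code are left out
to keep the example standalone.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical message layout, loosely modelled on the quoted fields. */
struct demo_conn_rsp {
	uint16_t magic;
	uint16_t version;
	uint16_t errno_code;
	uint16_t queue_depth;
	uint32_t max_io_size;
	uint32_t max_hdr_size;
	uint32_t flags;
};

static struct demo_conn_rsp demo_build_rsp(uint16_t queue_depth,
					   uint32_t max_chunk_size,
					   uint32_t max_hdr_size)
{
	assert(max_chunk_size >= max_hdr_size);

	/*
	 * Members that are not named here (errno_code, flags) are
	 * zero-initialized by the compiler, so no memset() is needed.
	 */
	struct demo_conn_rsp msg = {
		.magic        = 0x1BBD,  /* made-up value */
		.version      = 2,       /* made-up value */
		.queue_depth  = queue_depth,
		.max_io_size  = max_chunk_size - max_hdr_size,
		.max_hdr_size = max_hdr_size,
	};

	return msg;
}
```

One caveat: the C standard does not promise that padding bytes are zeroed by
an initializer list, which matters for a struct that is sent on the wire.
The layout above has no padding.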
>
> > +static int rtrs_srv_rdma_init(struct rtrs_srv_ctx *ctx, unsigned int port)
> > +{
> > +     struct sockaddr_in6 sin = {
> > +             .sin6_family    = AF_INET6,
> > +             .sin6_addr      = IN6ADDR_ANY_INIT,
> > +             .sin6_port      = htons(port),
> > +     };
> > +     struct sockaddr_ib sib = {
> > +             .sib_family                     = AF_IB,
> > +             .sib_addr.sib_subnet_prefix     = 0ULL,
> > +             .sib_addr.sib_interface_id      = 0ULL,
> > +             .sib_sid        = cpu_to_be64(RDMA_IB_IP_PS_IB | port),
> > +             .sib_sid_mask   = cpu_to_be64(0xffffffffffffffffULL),
> > +             .sib_pkey       = cpu_to_be16(0xffff),
> > +     };
>
> A minor comment: structure members that are zero do not have to be
> initialized explicitly. The compiler does that automatically.
will drop some zero initialization.
>
> > +struct rtrs_srv_ctx *rtrs_srv_open(rdma_ev_fn *rdma_ev, link_ev_fn *link_ev,
> > +                                  unsigned int port)
> > +{
> > +     struct rtrs_srv_ctx *ctx;
> > +     int err;
> > +
> > +     ctx = alloc_srv_ctx(rdma_ev, link_ev);
> > +     if (unlikely(!ctx))
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     err = rtrs_srv_rdma_init(ctx, port);
> > +     if (unlikely(err)) {
> > +             free_srv_ctx(ctx);
> > +             return ERR_PTR(err);
> > +     }
> > +     /* Do not let module be unloaded if server context is alive */
> > +     __module_get(THIS_MODULE);
> > +
> > +     return ctx;
> > +}
> > +EXPORT_SYMBOL(rtrs_srv_open);
>
> Isn't it inconvenient for users if module unloading is prevented while
> one or more connections are active? This requires users to figure out
> how to trigger a log out if they want to unload a kernel module.
The logic here is that while the rnbd_server module is loaded, we don't
allow rtrs_server to be unloaded;
you can still unload rnbd_server first and then rtrs_server.

> Additionally, how are users expected to prevent that the client relogins
> after the server has told them to log out and before the server kernel
> module is unloaded?

We don't support such a use case yet.

>
> Thanks,
>
> Bart.
Thanks

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 12/25] rtrs: server: sysfs interface functions
  2020-01-02 22:06   ` Bart Van Assche
@ 2020-01-07 14:40     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 14:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 11:06 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +static struct kobj_attribute rtrs_srv_disconnect_attr =
> > +     __ATTR(disconnect, 0644,
> > +            rtrs_srv_disconnect_show, rtrs_srv_disconnect_store);
>
> Could __ATTR_RW() have been used here?
>
> > +static struct kobj_attribute rtrs_srv_hca_port_attr =
> > +     __ATTR(hca_port, 0444, rtrs_srv_hca_port_show, NULL);
>
> Could __ATTR_RO() have been used here?
>
> Thanks,
>
> Bart.
Both can be used, but I prefer to keep the rtrs_srv prefix.
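
To illustrate the trade-off: the shorthand macros build the callback names
from the attribute name by token pasting, so the callbacks cannot carry a
module prefix. The sketch below is a userspace miniature with hypothetical
DEMO_* macros and a hypothetical struct demo_attr; it is not the kernel's
definition.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature of a sysfs attribute; not the kernel's struct. */
struct demo_attr {
	const char *name;
	unsigned int mode;
	int (*show)(void);
	int (*store)(int);
};

/* Explicit form: attribute name and callback names are independent. */
#define DEMO_ATTR(_name, _mode, _show, _store) \
	{ .name = #_name, .mode = _mode, .show = _show, .store = _store }

/* Shorthand form: callbacks must be named <name>_show and <name>_store. */
#define DEMO_ATTR_RW(_name) \
	DEMO_ATTR(_name, 0644, _name##_show, _name##_store)

static int last_stored;

/* Prefixed callbacks, as in the patch: usable only with the explicit form. */
static int demo_srv_disconnect_show(void) { return last_stored; }
static int demo_srv_disconnect_store(int v) { last_stored = v; return 0; }

/* Unprefixed callbacks, as the shorthand form requires. */
static int disconnect_show(void) { return demo_srv_disconnect_show(); }
static int disconnect_store(int v) { return demo_srv_disconnect_store(v); }

static struct demo_attr demo_explicit =
	DEMO_ATTR(disconnect, 0644,
		  demo_srv_disconnect_show, demo_srv_disconnect_store);
static struct demo_attr demo_shorthand = DEMO_ATTR_RW(disconnect);

/* Both forms describe the same file name and mode. */
static int demo_same_file(const struct demo_attr *a, const struct demo_attr *b)
{
	assert(a && b);
	return !strcmp(a->name, b->name) && a->mode == b->mode;
}
```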
Thanks Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 14/25] rtrs: a bit of documentation
  2019-12-30 23:19   ` Bart Van Assche
@ 2020-01-07 14:48     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 14:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev, linux-kernel

On Tue, Dec 31, 2019 at 12:19 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2019-12-30 02:29, Jack Wang wrote:
> > diff --git a/drivers/infiniband/ulp/rtrs/README b/drivers/infiniband/ulp/rtrs/README
>
> Other kernel driver documentation exists under the Documentation/
> directory. Should this README file perhaps be moved to a subdirectory of
> the Documentation/ directory?
I did check: most of the README files live next to the code, e.g.:
find ./ -name README
./fs/reiserfs/README
./fs/qnx4/README
./fs/qnx6/README
./fs/cramfs/README
./Documentation/ABI/README
./Documentation/virt/kvm/devices/README
./README
./tools/usb/usbip/README
./tools/virtio/ringtest/README
./tools/virtio/virtio-trace/README
./tools/power/pm-graph/README
./tools/power/cpupower/README
./tools/memory-model/README
./tools/memory-model/scripts/README
./tools/memory-model/litmus-tests/README
./tools/testing/vsock/README
./tools/testing/ktest/examples/README
./tools/testing/selftests/ftrace/README
./tools/testing/selftests/arm64/signal/README
./tools/testing/selftests/arm64/README
./tools/testing/selftests/android/ion/README
./tools/testing/selftests/zram/README
./tools/testing/selftests/livepatch/README
./tools/testing/selftests/net/forwarding/README
./tools/testing/selftests/futex/README
./tools/testing/selftests/tc-testing/README
./tools/thermal/tmon/README
./tools/build/tests/ex/empty2/README
./tools/perf/tests/attr/README
./tools/perf/pmu-events/README
./tools/perf/scripts/perl/Perf-Trace-Util/README
./tools/io_uring/README
./net/decnet/README
./scripts/ksymoops/README
./scripts/selinux/README
./arch/powerpc/boot/README
./arch/m68k/q40/README
./arch/m68k/ifpsp060/README
./arch/m68k/fpsp040/README
./arch/parisc/math-emu/README
./arch/x86/math-emu/README
./drivers/bcma/README
./drivers/char/mwave/README
./drivers/staging/nvec/README
./drivers/staging/wlan-ng/README
./drivers/staging/axis-fifo/README
./drivers/staging/fbtft/README
./drivers/staging/fsl-dpaa2/ethsw/README
./drivers/staging/goldfish/README
./drivers/staging/gs_fpgaboot/README
./drivers/staging/comedi/drivers/ni_routing/README
./drivers/net/wireless/marvell/mwifiex/README
./drivers/net/wireless/marvell/libertas/README

>
> > +****************************
> > +InfiniBand Transport (RTRS)
> > +****************************
>
> The abbreviation does not match the full title. Do you agree that this
> is confusing?
>
> > +RTRS is used by the RNBD (Infiniband Network Block Device) modules.
>
> Is RNBD an RDMA or an InfiniBand network block device?
will fix.
>
> > +
> > +==================
> > +Transport protocol
> > +==================
> > +
> > +Overview
> > +--------
> > +An established connection between a client and a server is called rtrs
> > +session. A session is associated with a set of memory chunks reserved on the
> > +server side for a given client for rdma transfer. A session
> > +consists of multiple paths, each representing a separate physical link
> > +between client and server. Those are used for load balancing and failover.
> > +Each path consists of as many connections (QPs) as there are cpus on
> > +the client.
> > +
> > +When processing an incoming rdma write or read request rtrs client uses memory
>
> A quote from
> https://linuxplumbersconf.org/event/4/contributions/367/attachments/331/555/LPC_2019_RMDA_MC_IBNBD_IBTRS_Upstreaming.pdf:
> "Only RDMA writes with immediate". Has the wire protocol perhaps been
> changed such that both RDMA reads and writes are used? I haven't found
> any references to RDMA reads in the "IO path" section in this file. Did
> I perhaps overlook something?
>
> Thanks,
>
> Bart.
We do not use RDMA_READ, only RDMA_WRITE/RDMA_WRITE_WITH_IMM/SEND_WITH_IMM.
SEND_WITH_IMM is used only when always_invalidate=Y.
Will extend the document.

Thanks Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 14/25] rtrs: a bit of documentation
  2020-01-02 22:21   ` Bart Van Assche
@ 2020-01-07 15:49     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 15:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev, linux-kernel

On Thu, Jan 2, 2020 at 11:21 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-client b/Documentation/ABI/testing/sysfs-class-rtrs-client
> > new file mode 100644
> > index 000000000000..8b219cf6c5c4
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-class-rtrs-client
> > @@ -0,0 +1,190 @@
> > +What:                /sys/class/rtrs-client
> > +Date:                Jan 2020
> > +KernelVersion:       5.6
> > +Contact:     Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
> > +Description:
> > +When a user of RTRS API creates a new session, a directory entry with
> > +the name of that session is created under /sys/class/rtrs-client/<session-name>/
>
> Thank you for having included this ABI description. This is very
> helpful. Please follow the format documented in Documentation/ABI/README
> and make sure that all text, including the description, start in column
> 17 and please use tabs for indentation.
will fix.
>
> > diff --git a/drivers/infiniband/ulp/rtrs/README b/drivers/infiniband/ulp/rtrs/README
> > new file mode 100644
> > index 000000000000..59ad60318a18
> > --- /dev/null
> > +++ b/drivers/infiniband/ulp/rtrs/README
> > @@ -0,0 +1,149 @@
> > +****************************
> > +InfiniBand Transport (RTRS)
> > +****************************
> > +
> > +RTRS (InfiniBand Transport) is a reliable high speed transport library
> > +which provides support to establish optimal number of connections
> > +between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
> > +transport. It is optimized to transfer (read/write) IO blocks.
>
> Is it explained somewhere how the optimal number of connections is
> determined and also according to which metric the number of connections
> is optimized? Is the number of connections chosen to minimize latency,
> maximize IOPS or perhaps to optimize yet another metric?
RTRS creates one connection per CPU, which is meant to minimize latency
and maximize IOPS, I would say.
>
> Thanks,
>
> Bart.
Thanks Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 15/25] rnbd: private headers with rnbd protocol structs and helpers
  2020-01-02 22:34   ` Bart Van Assche
@ 2020-01-07 16:53     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 16:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 11:34 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > + * @device_id:               device_id on server side to identify the device
>
> Is this a number that only has a meaning inside the RTRS software? Is
> the role of this number perhaps similar to an NVMe namespace or SCSI
> LUN? If so, please mention this. Additionally, does this number start
> from zero or from one?
device_id is only used in RNBD; RTRS doesn't care. We documented it a bit
in rnbd/README,
and we can extend the doc.
It is similar to a SCSI LUN or NVMe namespace: it's an id to associate
the device with IO.

Thanks Bart.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 16/25] rnbd: client: private header with client structs and functions
  2020-01-02 22:37   ` Bart Van Assche
@ 2020-01-07 17:09     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-07 17:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 11:37 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +struct rnbd_iu {
> > +     union {
> > +             struct request *rq; /* for block io */
> > +             void *buf; /* for user messages */
> > +     };
> > +     struct rtrs_permit      *permit;
> > +     union {
> > +             /* use to send msg associated with a dev */
> > +             struct rnbd_clt_dev *dev;
> > +             /* use to send msg associated with a sess */
> > +             struct rnbd_clt_session *sess;
> > +     };
> > +     blk_status_t            status;
> > +     struct scatterlist      sglist[BMAX_SEGMENTS];
> > +     struct work_struct      work;
> > +     int                     errno;
> > +     struct rnbd_iu_comp     comp;
> > +     atomic_t                refcount;
> > +};
>
> This data structure includes both a blk_status_t and an errno value. Can
> these two members be combined into a single member?
I guess you were suggesting to remove status and use
errno_to_blk_status(iu->errno) to call into blk_mq_end_request.
will do.
>
> Thanks,
>
> Bart.
Thanks.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 10/25] rtrs: server: main functionality
  2020-01-07 13:19     ` Jinpu Wang
@ 2020-01-07 18:25       ` Jason Gunthorpe
  2020-01-10 17:38         ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Jason Gunthorpe @ 2020-01-07 18:25 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Tue, Jan 07, 2020 at 02:19:53PM +0100, Jinpu Wang wrote:

> > How can posting a signaled send prevent that the send queue overflows?
> > Isn't that something that can only be guaranteed by tracking the number
> > of WQE's in the send queue?
> Selective signaling works. All we need to do is signal one WR for
> every SQ-depth worth of WRs posted. For example, If the SQ depth is
> 16, we must signal at least one out of every 16. This ensures proper
> flow control for HW resources.
> Courtesy: section 8.2.1 of the iWARP Verbs draft
> http://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.2.1

Not quite. If the SQ depth is 16 and you post 16 things and then
signal the last one, you *cannot* post new work until you see the
completion.

More SQ space *ONLY* becomes available upon receipt of a
completion. This is why you can't have an unsignaled SQ.
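
The point can be sketched with a toy userspace model of a send queue. This
is not the verbs API: struct demo_sq and both functions are hypothetical,
and the model assumes every depth-th WR is signaled, as in the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a send queue; not the verbs API. */
struct demo_sq {
	unsigned int depth;     /* total number of WQE slots */
	unsigned int in_flight; /* posted but not yet reclaimed */
	unsigned int posted;    /* running count of posted WRs */
};

/* Post one WR; every 'depth'-th WR is signaled. Returns false when full. */
static bool demo_post_send(struct demo_sq *sq, bool *signaled)
{
	assert(sq && signaled && sq->depth > 0);

	if (sq->in_flight == sq->depth)
		return false; /* no free slot until a completion is seen */

	sq->posted++;
	sq->in_flight++;
	*signaled = (sq->posted % sq->depth) == 0;
	return true;
}

/*
 * Completion of the signaled WR. In this model the only signaled WR is
 * the one that fills the queue, so its completion retires every slot,
 * including the unsignaled WRs posted before it.
 */
static void demo_handle_completion(struct demo_sq *sq)
{
	assert(sq && sq->in_flight == sq->depth);
	sq->in_flight = 0;
}
```

With depth 16, the 16th post is the signaled one and the 17th fails until
demo_handle_completion() has run, so signaling alone does not replace
tracking how many slots are in use.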

Jason

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 11/25] rtrs: server: statistics functions
  2020-01-02 22:02   ` Bart Van Assche
@ 2020-01-08 12:55     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-08 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 11:02 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +int rtrs_srv_reset_rdma_stats(struct rtrs_srv_stats *stats, bool enable)
> > +{
> > +     if (enable) {
> > +             struct rtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> > +
> > +             memset(r, 0, sizeof(*r));
> > +             return 0;
> > +     }
> > +
> > +     return -EINVAL;
> > +}
>
> I think the traditional kernel coding style is "if (!enable) return ...".
This can be changed.
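
Something along these lines, shown as a userspace sketch with a hypothetical
struct demo_rdma_stats in place of the real statistics structure:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the statistics being reset. */
struct demo_rdma_stats {
	long long cnt;
	long long size_total;
};

static int demo_reset_rdma_stats(struct demo_rdma_stats *r, bool enable)
{
	assert(r);

	/* Handle the error case first and return early. */
	if (!enable)
		return -EINVAL;

	memset(r, 0, sizeof(*r));
	return 0;
}
```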
>
> > +ssize_t rtrs_srv_stats_rdma_to_str(struct rtrs_srv_stats *stats,
> > +                                 char *page, size_t len)
> > +{
> > +     struct rtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
> > +     struct rtrs_srv_sess *sess;
> > +
> > +     sess = container_of(stats, typeof(*sess), stats);
> > +
> > +     return scnprintf(page, len, "%lld %lld %lld %lld %u\n",
> > +                      (s64)atomic64_read(&r->dir[READ].cnt),
> > +                      (s64)atomic64_read(&r->dir[READ].size_total),
> > +                      (s64)atomic64_read(&r->dir[WRITE].cnt),
> > +                      (s64)atomic64_read(&r->dir[WRITE].size_total),
> > +                      atomic_read(&sess->ids_inflight));
> > +}
>
> Does this follow the sysfs one-value-per-file rule?
We already have user-space tools that depend on it.
On the other hand, the one-value-per-file rule was never really enforced,
see https://lwn.net/Articles/378884/
I think it doesn't really make sense for this use case.
>
> > +int rtrs_srv_stats_wc_completion_to_str(struct rtrs_srv_stats *stats,
> > +                                      char *buf, size_t len)
> > +{
> > +     return snprintf(buf, len, "%lld %lld\n",
> > +                     (s64)atomic64_read(&stats->wc_comp.total_wc_cnt),
> > +                     (s64)atomic64_read(&stats->wc_comp.calls));
> > +}
>
> Same comment here.
See comment above.
>
> Thanks,
>
> Bart.
Thanks Bart

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 18/25] rnbd: client: sysfs interface functions
  2020-01-03  0:03   ` Bart Van Assche
@ 2020-01-08 13:06     ` Jinpu Wang
  2020-01-08 16:39       ` Bart Van Assche
  0 siblings, 1 reply; 89+ messages in thread
From: Jinpu Wang @ 2020-01-08 13:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Fri, Jan 3, 2020 at 1:03 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +static const match_table_t rnbd_opt_tokens = {
> > +     {       RNBD_OPT_PATH,          "path=%s"               },
> > +     {       RNBD_OPT_DEV_PATH,      "device_path=%s"        },
> > +     {       RNBD_OPT_ACCESS_MODE,   "access_mode=%s"        },
> > +     {       RNBD_OPT_SESSNAME,      "sessname=%s"           },
> > +     {       RNBD_OPT_ERR,           NULL                    },
> > +};
>
>
> Please follow the example of other kernel code and change
> "{<tab>...<tab>}" into "{ ... }".
ok.
>
> > +/* remove new line from string */
> > +static void strip(char *s)
> > +{
> > +     char *p = s;
> > +
> > +     while (*s != '\0') {
> > +             if (*s != '\n')
> > +                     *p++ = *s++;
> > +             else
> > +                     ++s;
> > +     }
> > +     *p = '\0';
> > +}
>
> Does this function change a multiline string into a single line? I'm not
> sure that is how sysfs input should be processed ... Is this perhaps
> what you want?
>
> static inline void kill_final_newline(char *str)
> {
>         char *newline = strrchr(str, '\n');
>
>         if (newline && !newline[1])
>                 *newline = 0;
probably you meant "*newline = '\0'"

Your version only removes the last newline, our version removes all
the newlines in the string.

> }
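
The difference between the two helpers, as a userspace sketch. The function
bodies follow the two quoted versions; strip() is renamed to
strip_all_newlines() here for clarity.

```c
#include <assert.h>
#include <string.h>

/* Removes every '\n' from the string (behaviour of the posted strip()). */
static void strip_all_newlines(char *s)
{
	char *p = s;

	assert(s);
	while (*s != '\0') {
		if (*s != '\n')
			*p++ = *s++;
		else
			++s;
	}
	*p = '\0';
}

/* Removes only a single trailing '\n' (the reviewer's suggestion). */
static void kill_final_newline(char *str)
{
	char *newline;

	assert(str);
	newline = strrchr(str, '\n');
	if (newline && !newline[1])
		*newline = '\0';
}
```

For the input "path=a\nb\n" the first helper yields "path=ab" and the second
yields "path=a\nb", so only the first one joins a multiline input into a
single line.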
>
> > +static struct kobj_attribute rnbd_clt_map_device_attr =
> > +     __ATTR(map_device, 0644,
> > +            rnbd_clt_map_device_show, rnbd_clt_map_device_store);
>
> Could __ATTR_RW() have been used here?
As replied in other emails, it can be used, but we prefer to keep the
prefix part.
>
> Thanks,
>
> Bart.
Thanks Bart

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 17/25] rnbd: client: main functionality
  2020-01-02 23:55   ` Bart Van Assche
@ 2020-01-08 14:22     ` Jinpu Wang
  2020-01-10 14:45     ` Jinpu Wang
  1 sibling, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-08 14:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Fri, Jan 3, 2020 at 12:55 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +MODULE_DESCRIPTION("InfiniBand Network Block Device Client");
>
> InfiniBand or RDMA?
will fix.
>
> > +static int rnbd_clt_set_dev_attr(struct rnbd_clt_dev *dev,
> > +                               const struct rnbd_msg_open_rsp *rsp)
> > +{
> > +     struct rnbd_clt_session *sess = dev->sess;
> > +
> > +     if (unlikely(!rsp->logical_block_size))
> > +             return -EINVAL;
> > +
> > +     dev->device_id              = le32_to_cpu(rsp->device_id);
> > +     dev->nsectors               = le64_to_cpu(rsp->nsectors);
> > +     dev->logical_block_size     = le16_to_cpu(rsp->logical_block_size);
> > +     dev->physical_block_size    = le16_to_cpu(rsp->physical_block_size);
> > +     dev->max_write_same_sectors = le32_to_cpu(rsp->max_write_same_sectors);
> > +     dev->max_discard_sectors    = le32_to_cpu(rsp->max_discard_sectors);
> > +     dev->discard_granularity    = le32_to_cpu(rsp->discard_granularity);
> > +     dev->discard_alignment      = le32_to_cpu(rsp->discard_alignment);
> > +     dev->secure_discard         = le16_to_cpu(rsp->secure_discard);
> > +     dev->rotational             = rsp->rotational;
> > +
> > +     dev->max_hw_sectors = sess->max_io_size / dev->logical_block_size;
>
> The above statement looks suspicious to me. The unit of the second
> argument of blk_queue_max_hw_sectors() is 512 bytes. Since
> dev->max_hw_sectors is passed as the second argument to
> blk_queue_max_hw_sectors() I think it should also have 512 bytes as unit
> instead of the logical block size.
You're right, will fix.
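
A worked example of the unit mismatch, as a userspace sketch. The helper
names, DEMO_SECTOR_SHIFT and the numbers in the note below are illustrative
assumptions, not values taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9 /* sectors are counted in 512-byte units */

/* Limit expressed in 512-byte sectors, as the block layer expects. */
static uint32_t demo_max_hw_sectors(uint32_t max_io_size)
{
	return max_io_size >> DEMO_SECTOR_SHIFT;
}

/* As posted: limit expressed in logical blocks instead of sectors. */
static uint32_t demo_max_hw_sectors_lbs(uint32_t max_io_size,
					uint16_t logical_block_size)
{
	assert(logical_block_size != 0);
	return max_io_size / logical_block_size;
}
```

Assuming a 128 KiB max_io_size and a 4096-byte logical block size, the first
helper gives 256 sectors (128 KiB) and the second gives 32, which the block
layer would read as 16 KiB. The two agree only when the logical block size
is 512.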
>
> > +static int rnbd_clt_change_capacity(struct rnbd_clt_dev *dev,
> > +                                  size_t new_nsectors)
> > +{
> > +     int err = 0;
> > +
> > +     rnbd_clt_info(dev, "Device size changed from %zu to %zu sectors\n",
> > +                    dev->nsectors, new_nsectors);
> > +     dev->nsectors = new_nsectors;
> > +     set_capacity(dev->gd,
> > +                  dev->nsectors * (dev->logical_block_size /
> > +                                   SECTOR_SIZE));
> > +     err = revalidate_disk(dev->gd);
> > +     if (err)
> > +             rnbd_clt_err(dev,
> > +                           "Failed to change device size from %zu to %zu, err: %d\n",
> > +                           dev->nsectors, new_nsectors, err);
> > +     return err;
> > +}
>
> Please document the unit of nsectors in struct rnbd_clt_dev. Please also
> document the unit of the 'new_nsectors' argument.
will do. The unit of nsectors is 512 bytes.
>
> > +static void msg_io_conf(void *priv, int errno)
> > +{
> > +     struct rnbd_iu *iu = priv;
> > +     struct rnbd_clt_dev *dev = iu->dev;
> > +     struct request *rq = iu->rq;
> > +
> > +     iu->status = errno ? BLK_STS_IOERR : BLK_STS_OK;
> > +
> > +     blk_mq_complete_request(rq);
> > +
> > +     if (errno)
> > +             rnbd_clt_info_rl(dev, "%s I/O failed with err: %d\n",
> > +                               rq_data_dir(rq) == READ ? "read" : "write",
> > +                               errno);
> > +}
>
> Accessing 'rq' after having called blk_mq_complete_request() may trigger
> a use-after-free. Please don't do that.
You are right, will fix.
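
The usual pattern is to read everything that is needed from the request
before completing it. A userspace sketch, with a hypothetical struct
demo_request and free() standing in for whatever the completion path does
with the request:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block request. */
struct demo_request {
	bool is_read;
};

/* Stand-in for request completion: ownership of rq ends here. */
static void demo_complete_request(struct demo_request *rq)
{
	free(rq);
}

/* Returns the direction that would be logged, or NULL if nothing is logged. */
static const char *demo_io_conf(struct demo_request *rq, int err)
{
	bool is_read;

	assert(rq);
	is_read = rq->is_read;     /* read before ownership is handed over */
	demo_complete_request(rq); /* rq must not be dereferenced after this */

	if (!err)
		return NULL;
	return is_read ? "read" : "write";
}
```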

>
> > +static void wait_for_rtrs_disconnection(struct rnbd_clt_session *sess)
> > +__releases(&sess_lock)
> > +__acquires(&sess_lock)
>
> Please indent __releases() and __acquires() annotations.
ok.


>
> > +static int setup_mq_tags(struct rnbd_clt_session *sess)
> > +{
> > +     struct blk_mq_tag_set *tags = &sess->tag_set;
> > +
> > +     memset(tags, 0, sizeof(*tags));
> > +     tags->ops               = &rnbd_mq_ops;
> > +     tags->queue_depth       = sess->queue_depth;
> > +     tags->numa_node         = NUMA_NO_NODE;
> > +     tags->flags             = BLK_MQ_F_SHOULD_MERGE |
> > +                               BLK_MQ_F_TAG_SHARED;
> > +     tags->cmd_size          = sizeof(struct rnbd_iu);
> > +     tags->nr_hw_queues      = num_online_cpus();
> > +
> > +     return blk_mq_alloc_tag_set(tags);
> > +}
>
> Please change the name of the "tags" pointer into "tag_set".
ok.
>
> > +static int index_to_minor(int index)
> > +{
> > +     return index << RNBD_PART_BITS;
> > +}
> > +
> > +static int minor_to_index(int minor)
> > +{
> > +     return minor >> RNBD_PART_BITS;
> > +}
>
> Is it useful to introduce functions that encapsulate a single shift
> operation?
Can be dropped, although it's common to do it this way; there are plenty
of examples in the kernel tree.
>
> > +     blk_queue_virt_boundary(dev->queue, 4095);
>
> The virt_boundary parameter must match the RDMA memory registration page
> size. Please introduce a symbolic constant for the RDMA memory
> registration page size such that these two parameters stay in sync in
> case anyone would want to change the memory registration page size.
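
A userspace sketch of the idea: derive the boundary mask from one named page
size instead of writing 4095 by hand. DEMO_MR_PAGE_SIZE, struct demo_seg and
demo_fits_boundary() are hypothetical; the gap check is a simplified model
of what a virt boundary asks of a segment list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single definition of the memory registration page size. */
#define DEMO_MR_PAGE_SIZE 4096UL
/* The boundary mask is derived from it, so the two cannot drift apart. */
#define DEMO_MR_PAGE_MASK (DEMO_MR_PAGE_SIZE - 1)

struct demo_seg {
	unsigned long addr;
	unsigned long len;
};

/*
 * Simplified model: every segment except the first must start on a page
 * boundary and every segment except the last must end on one.
 */
static bool demo_fits_boundary(const struct demo_seg *segs, size_t n)
{
	assert(segs || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (i > 0 && (segs[i].addr & DEMO_MR_PAGE_MASK))
			return false;
		if (i + 1 < n &&
		    ((segs[i].addr + segs[i].len) & DEMO_MR_PAGE_MASK))
			return false;
	}
	return true;
}
```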
>
> > +static void rnbd_clt_setup_gen_disk(struct rnbd_clt_dev *dev, int idx)
> > +{
> > +     dev->gd->major          = rnbd_client_major;
> > +     dev->gd->first_minor    = index_to_minor(idx);
> > +     dev->gd->fops           = &rnbd_client_ops;
> > +     dev->gd->queue          = dev->queue;
> > +     dev->gd->private_data   = dev;
> > +     snprintf(dev->gd->disk_name, sizeof(dev->gd->disk_name), "rnbd%d",
> > +              idx);
> > +     pr_debug("disk_name=%s, capacity=%zu\n",
> > +              dev->gd->disk_name,
> > +              dev->nsectors * (dev->logical_block_size / SECTOR_SIZE)
> > +              );
> > +
> > +     set_capacity(dev->gd, dev->nsectors * (dev->logical_block_size /
> > +                                            SECTOR_SIZE));
>
> Again, what is the unit of dev->nsectors?
The unit is 512 bytes. I will remove the multiplier; in most cases
logical_block_size is SECTOR_SIZE.
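
To illustrate why the multiplier matters: set_capacity() expects a
count of 512-byte sectors, so a conversion is only needed when the
count is expressed in logical blocks. A minimal sketch, assuming a
hypothetical helper name blocks_to_sectors() and a logical block size
that is a multiple of 512:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/*
 * Convert a count of logical blocks into 512-byte sectors.
 * Hypothetical helper for illustration; not part of the patch.
 */
static uint64_t blocks_to_sectors(uint64_t nblocks, uint32_t logical_block_size)
{
	assert(logical_block_size >= SECTOR_SIZE &&
	       logical_block_size % SECTOR_SIZE == 0);

	return nblocks * (logical_block_size / SECTOR_SIZE);
}
```

If the count is already in 512-byte units, as stated above, applying
this factor again would overstate the capacity by 8x on a device with
4096-byte logical blocks.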
>
> > +static void rnbd_clt_add_gen_disk(struct rnbd_clt_dev *dev)
> > +{
> > +     add_disk(dev->gd);
> > +}
>
> Is it useful to introduce this wrapper around add_disk()?
will remove the wrapper.

>
> Thanks,
>
> Bart.
Thanks Bart.


* Re: [PATCH v6 09/25] rtrs: server: private header with server structs and functions
  2020-01-02 21:24   ` Bart Van Assche
@ 2020-01-08 16:33     ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-08 16:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Thu, Jan 2, 2020 at 10:24 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 12/30/19 2:29 AM, Jack Wang wrote:
> > +struct rtrs_stats_wc_comp {
> > +     atomic64_t      calls;
> > +     atomic64_t      total_wc_cnt;
> > +};
>
> Please document the meaning of the members of this data structure.
will do.
>
> > +struct rtrs_srv_stats_rdma_stats {
> > +     struct {
> > +             atomic64_t      cnt;
> > +             atomic64_t      size_total;
> > +     } dir[2];
> > +};
>
> Please document the meaning of the members of this data structure and
> also which index (0, 1) corresponds to which direction (read, write).
yes, will do.
>
> > +struct rtrs_srv_op {
> > +     struct rtrs_srv_con             *con;
> > +     u32                             msg_id;
> > +     u8                              dir;
> > +     struct rtrs_msg_rdma_read       *rd_msg;
> > +     struct ib_rdma_wr               *tx_wr;
> > +     struct ib_sge                   *tx_sg;
> > +};
>
> Please document the role of this data structure.
ok.
>
> > +struct rtrs_srv_mr {
> > +     struct ib_mr    *mr;
> > +     struct sg_table sgt;
> > +     struct ib_cqe   inv_cqe; /* only for always_invalidate=true */
> > +     u32             msg_id; /* only for always_invalidate=true */
> > +     u32             msg_off; /* only for always_invalidate=true */
> > +     struct rtrs_iu  *iu; /* send buffer for new rkey msg */
> > +};
>
> Please document the role of this data structure.
ok
>
> > +extern struct class *rtrs_dev_class;
>
> Please make sure that the static 'rtrs_dev_class' variable in rtrs-clt.c
> and in this header file have different names.
ok
>
> Thanks,
>
> Bart.
Thanks


* Re: [PATCH v6 18/25] rnbd: client: sysfs interface functions
  2020-01-08 13:06     ` Jinpu Wang
@ 2020-01-08 16:39       ` Bart Van Assche
  2020-01-08 16:51         ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Bart Van Assche @ 2020-01-08 16:39 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On 1/8/20 5:06 AM, Jinpu Wang wrote:
> On Fri, Jan 3, 2020 at 1:03 AM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 12/30/19 2:29 AM, Jack Wang wrote:
>>> +/* remove new line from string */
>>> +static void strip(char *s)
>>> +{
>>> +     char *p = s;
>>> +
>>> +     while (*s != '\0') {
>>> +             if (*s != '\n')
>>> +                     *p++ = *s++;
>>> +             else
>>> +                     ++s;
>>> +     }
>>> +     *p = '\0';
>>> +}
>>
>> Does this function change a multiline string into a single line? I'm not
>> sure that is how sysfs input should be processed ... Is this perhaps
>> what you want?
>>
>> static inline void kill_final_newline(char *str)
>> {
>>          char *newline = strrchr(str, '\n');
>>
>>          if (newline && !newline[1])
>>                  *newline = 0;
> probably you meant "*newline = '\0'"
> 
> Your version only removes the last newline, our version removes all
> the newlines in the string.

Removing all newlines seems dubious to me. I am not aware of any other 
sysfs code that does that.

Thanks,

Bart.


* Re: [PATCH v6 18/25] rnbd: client: sysfs interface functions
  2020-01-08 16:39       ` Bart Van Assche
@ 2020-01-08 16:51         ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-08 16:51 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, rpenyaev

On Wed, Jan 8, 2020 at 5:39 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 1/8/20 5:06 AM, Jinpu Wang wrote:
> > On Fri, Jan 3, 2020 at 1:03 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >> On 12/30/19 2:29 AM, Jack Wang wrote:
> >>> +/* remove new line from string */
> >>> +static void strip(char *s)
> >>> +{
> >>> +     char *p = s;
> >>> +
> >>> +     while (*s != '\0') {
> >>> +             if (*s != '\n')
> >>> +                     *p++ = *s++;
> >>> +             else
> >>> +                     ++s;
> >>> +     }
> >>> +     *p = '\0';
> >>> +}
> >>
> >> Does this function change a multiline string into a single line? I'm not
> >> sure that is how sysfs input should be processed ... Is this perhaps
> >> what you want?
> >>
> >> static inline void kill_final_newline(char *str)
> >> {
> >>          char *newline = strrchr(str, '\n');
> >>
> >>          if (newline && !newline[1])
> >>                  *newline = 0;
> > probably you meant "*newline = '\0'"
> >
> > Your version only removes the last newline, our version removes all
> > the newlines in the string.
>
> Removing all newlines seems dubious to me. I am not aware of any other
> sysfs code that does that.
>
> Thanks,
>
> Bart.
I agree writing a string with many newlines is not common. We can
require the user to write a single line.
I will drop it after verifying with our regression tests.

Thanks Bart
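
For reference, the difference between the two behaviours discussed
above can be sketched in user space as follows. kill_final_newline()
follows the helper quoted in this thread; strip_all_newlines() is a
hypothetical name for the behaviour of the original strip().

```c
#include <assert.h>
#include <string.h>

/* Remove only a newline that is the very last character of the string. */
static void kill_final_newline(char *str)
{
	char *newline;

	assert(str != NULL);
	newline = strrchr(str, '\n');
	if (newline && newline[1] == '\0')
		*newline = '\0';
}

/* Remove every newline, which also joins multi-line input into one line. */
static void strip_all_newlines(char *s)
{
	char *p = s;

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		if (*s != '\n')
			*p++ = *s;
	}
	*p = '\0';
}
```

With input "a\nb\n", the first helper leaves "a\nb" and the second
leaves "ab"; for the usual single-line sysfs write, such as "sess1\n",
both produce "sess1".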


* Re: [PATCH v6 17/25] rnbd: client: main functionality
  2020-01-02 23:55   ` Bart Van Assche
  2020-01-08 14:22     ` Jinpu Wang
@ 2020-01-10 14:45     ` Jinpu Wang
  2020-01-10 15:09       ` Roman Penyaev
  1 sibling, 1 reply; 89+ messages in thread
From: Jinpu Wang @ 2020-01-10 14:45 UTC (permalink / raw)
  To: Bart Van Assche, rpenyaev
  Cc: Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis

> > +{
> > +     DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
> > +
> > +     prepare_to_wait(&sess->rtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
> > +     if (IS_ERR_OR_NULL(sess->rtrs)) {
> > +             finish_wait(&sess->rtrs_waitq, &wait);
> > +             return;
> > +     }
> > +     mutex_unlock(&sess_lock);
> > +     /* After unlock session can be freed, so careful */
> > +     schedule();
> > +     mutex_lock(&sess_lock);
> > +}
>
> How can a function that calls schedule() and that is not surrounded by a
> loop be correct? What if e.g. schedule() finishes due to a spurious wakeup?
I checked the git history; there is no clear explanation of why we have
to do the mutex_unlock()/schedule()/mutex_lock() magic.
It's allowed to call schedule() while holding a mutex, so it seems we can
remove the code snippet. @Roman Penyaev, do you remember why it was
introduced?
>
> > +static struct rnbd_clt_session *__find_and_get_sess(const char *sessname)
> > +__releases(&sess_lock)
> > +__acquires(&sess_lock)
> > +{
> > +     struct rnbd_clt_session *sess;
> > +     int err;
> > +
> > +again:
> > +     list_for_each_entry(sess, &sess_list, list) {
> > +             if (strcmp(sessname, sess->sessname))
> > +                     continue;
> > +
> > +             if (unlikely(sess->rtrs_ready && IS_ERR_OR_NULL(sess->rtrs)))
> > +                     /*
> > +                      * No RTRS connection, session is dying.
> > +                      */
> > +                     continue;
> > +
> > +             if (likely(rnbd_clt_get_sess(sess))) {
> > +                     /*
> > +                      * Alive session is found, wait for RTRS connection.
> > +                      */
> > +                     mutex_unlock(&sess_lock);
> > +                     err = wait_for_rtrs_connection(sess);
> > +                     if (unlikely(err))
> > +                             rnbd_clt_put_sess(sess);
> > +                     mutex_lock(&sess_lock);
> > +
> > +                     if (unlikely(err))
> > +                             /* Session is dying, repeat the loop */
> > +                             goto again;
> > +
> > +                     return sess;
> > +             }
> > +             /*
> > +              * Ref is 0, session is dying, wait for RTRS disconnect
> > +              * in order to avoid session names clashes.
> > +              */
> > +             wait_for_rtrs_disconnection(sess);
> > +             /*
> > +              * RTRS is disconnected and soon session will be freed,
> > +              * so repeat a loop.
> > +              */
> > +             goto again;
> > +     }
> > +
> > +     return NULL;
> > +}
>
> Since wait_for_rtrs_disconnection() unlocks sess_lock, can the
> list_for_each_entry() above trigger a use-after-free of sess->next?


>
> > +static size_t rnbd_clt_get_sg_size(struct scatterlist *sglist, u32 len)
> > +{
> > +     struct scatterlist *sg;
> > +     size_t tsize = 0;
> > +     int i;
> > +
> > +     for_each_sg(sglist, sg, len, i)
> > +             tsize += sg->length;
> > +     return tsize;
> > +}
>
> Please follow the example of other block drivers and use blk_rq_bytes()
> instead of iterating over the sg-list.
    The amount of data that belongs to an I/O and the amount of data that
    should be read from or written to the disk (bi_size) can differ.

    E.g. when WRITE_SAME is used, only a small amount of data is
    transferred, which is then written repeatedly over many sectors.

    This is why we get the size of the data to be transferred via RTRS by
    summing up the sizes of the scatter-gather list entries.

Will add a comment.

Thanks
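
A small user-space sketch of the distinction described above, with
hypothetical names throughout (struct fake_sg and sg_payload_bytes()
are not kernel types or functions): the payload that goes over the
wire is the sum of the scatter-gather entry lengths, which for a
WRITE_SAME-like request can be much smaller than the range the request
covers on disk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scatterlist; only the length matters here. */
struct fake_sg {
	unsigned int length;
};

/* Sum the bytes actually carried by the scatter-gather entries. */
static size_t sg_payload_bytes(const struct fake_sg *sgl, unsigned int nents)
{
	size_t total = 0;
	unsigned int i;

	assert(sgl != NULL || nents == 0);
	for (i = 0; i < nents; i++)
		total += sgl[i].length;

	return total;
}
```

For example, a request that repeats one 512-byte block over 2048
sectors covers 1 MiB on disk, while sg_payload_bytes() on its single
512-byte entry reports only 512 bytes to transfer.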


* Re: [PATCH v6 17/25] rnbd: client: main functionality
  2020-01-10 14:45     ` Jinpu Wang
@ 2020-01-10 15:09       ` Roman Penyaev
  2020-01-10 15:29         ` Jinpu Wang
  0 siblings, 1 reply; 89+ messages in thread
From: Roman Penyaev @ 2020-01-10 15:09 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis

On 2020-01-10 15:45, Jinpu Wang wrote:
>> > +{
>> > +     DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
>> > +
>> > +     prepare_to_wait(&sess->rtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
>> > +     if (IS_ERR_OR_NULL(sess->rtrs)) {
>> > +             finish_wait(&sess->rtrs_waitq, &wait);
>> > +             return;
>> > +     }
>> > +     mutex_unlock(&sess_lock);
>> > +     /* After unlock session can be freed, so careful */
>> > +     schedule();
>> > +     mutex_lock(&sess_lock);
>> > +}
>> 
>> How can a function that calls schedule() and that is not surrounded by 
>> a
>> loop be correct? What if e.g. schedule() finishes due to a spurious 
>> wakeup?
> I checked in git history, this no clean explanation why we have to
> call the mutex_unlock/schedul/mutex_lock magic
> It's allowed to call schedule inside mutex, seems we can remove the
> code snip, @Roman Penyaev do you remember why it was introduced?

The loop in question is in the caller, see __find_and_get_sess().
You can't leave mutex locked and call schedule(), you will catch a
deadlock with a caller of free_sess(), which has just put the last
reference and is about to take the sess_lock in order to delete
the session from the list.

--
Roman



* Re: [PATCH v6 17/25] rnbd: client: main functionality
  2020-01-10 15:09       ` Roman Penyaev
@ 2020-01-10 15:29         ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-10 15:29 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis

On Fri, Jan 10, 2020 at 4:09 PM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> On 2020-01-10 15:45, Jinpu Wang wrote:
> >> > +{
> >> > +     DEFINE_WAIT_FUNC(wait, autoremove_wake_function);
> >> > +
> >> > +     prepare_to_wait(&sess->rtrs_waitq, &wait, TASK_UNINTERRUPTIBLE);
> >> > +     if (IS_ERR_OR_NULL(sess->rtrs)) {
> >> > +             finish_wait(&sess->rtrs_waitq, &wait);
> >> > +             return;
> >> > +     }
> >> > +     mutex_unlock(&sess_lock);
> >> > +     /* After unlock session can be freed, so careful */
> >> > +     schedule();
> >> > +     mutex_lock(&sess_lock);
> >> > +}
> >>
> >> How can a function that calls schedule() and that is not surrounded by
> >> a
> >> loop be correct? What if e.g. schedule() finishes due to a spurious
> >> wakeup?
> > I checked in git history, this no clean explanation why we have to
> > call the mutex_unlock/schedul/mutex_lock magic
> > It's allowed to call schedule inside mutex, seems we can remove the
> > code snip, @Roman Penyaev do you remember why it was introduced?
>
> The loop in question is in the caller, see __find_and_get_sess().
> You can't leave mutex locked and call schedule(), you will catch a
> deadlock with a caller of free_sess(), which has just put the last
> reference and is about to take the sess_lock in order to delete
> the session from the list.
>
> --
> Roman
Thanks Roman for the quick reply, will extend the comment.


* Re: [PATCH v6 10/25] rtrs: server: main functionality
  2020-01-07 18:25       ` Jason Gunthorpe
@ 2020-01-10 17:38         ` Jinpu Wang
  0 siblings, 0 replies; 89+ messages in thread
From: Jinpu Wang @ 2020-01-10 17:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, Jack Wang, linux-block, linux-rdma, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Leon Romanovsky, Doug Ledford,
	Danil Kipnis, Roman Penyaev

On Tue, Jan 7, 2020 at 7:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jan 07, 2020 at 02:19:53PM +0100, Jinpu Wang wrote:
>
> > > How can posting a signaled send prevent that the send queue overflows?
> > > Isn't that something that can only be guaranteed by tracking the number
> > > of WQE's in the send queue?
> > Selective signaling works. All we need to do is signal one WR for
> > every SQ-depth worth of WRs posted. For example, If the SQ depth is
> > 16, we must signal at least one out of every 16. This ensures proper
> > flow control for HW resources.
> > Courtesy: section 8.2.1 of the iWARP Verbs draft
> > http://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.2.1
>
> Not quite. If the SQ depth is 16 and you post 16 things and then
> signal the last one, you *cannot* post new work until you see the
> completion.
>
> More SQ space *ONLY* becomes available upon receipt of a
> completion. This is why you can't have an unsignaled SQ
>
> Jason
Thanks for clarifying.
The HW seems to complete WRs very quickly, which is why we never saw the
problem. iser has similar logic, see iser_signal_comp.

Jack
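
To make Jason's point concrete, here is a toy user-space model of send
queue accounting. It is a sketch under simplifying assumptions, not the
RTRS or iser code: struct fake_sq, sq_post() and sq_handle_completion()
are hypothetical names, exactly one WR in every signal_every is
signaled, and the completion of a signaled WR is taken to imply that
the unsignaled WRs posted before it are done as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a send queue; all names are hypothetical. */
struct fake_sq {
	unsigned int depth;        /* total WQE slots in the queue */
	unsigned int signal_every; /* request a completion every N posts */
	unsigned int in_flight;    /* slots not yet reclaimed by a completion */
	unsigned int since_signal; /* posts since the last signaled WR */
};

/*
 * Try to post one WR. Returns 0 on success and sets *signaled to tell
 * whether this WR requests a completion; returns -1 if no slot is free.
 */
static int sq_post(struct fake_sq *sq, bool *signaled)
{
	assert(sq->signal_every > 0 && sq->signal_every <= sq->depth);

	if (sq->in_flight == sq->depth)
		return -1; /* full: must wait for a completion first */

	sq->in_flight++;
	sq->since_signal++;
	*signaled = (sq->since_signal == sq->signal_every);
	if (*signaled)
		sq->since_signal = 0;

	return 0;
}

/* A completion for a signaled WR frees that WR and the batch before it. */
static void sq_handle_completion(struct fake_sq *sq)
{
	assert(sq->in_flight >= sq->signal_every);
	sq->in_flight -= sq->signal_every;
}
```

With depth = 16 and signal_every = 16, sixteen posts succeed and the
last one is signaled, but a seventeenth post fails until
sq_handle_completion() has run. Signaling alone does not create space;
only the tracked completion does.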


Thread overview: 89+ messages
2019-12-30 10:29 [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Jack Wang
2019-12-30 10:29 ` [PATCH v6 01/25] sysfs: export sysfs_remove_file_self() Jack Wang
2019-12-30 10:29 ` [PATCH v6 02/25] rtrs: public interface header to establish RDMA connections Jack Wang
2019-12-30 19:25   ` Bart Van Assche
2020-01-02 13:35     ` Jinpu Wang
2020-01-02 16:36       ` Bart Van Assche
2020-01-02 16:47         ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 03/25] rtrs: private headers with rtrs protocol structs and helpers Jack Wang
2019-12-30 19:48   ` Bart Van Assche
2020-01-02 15:27     ` Jinpu Wang
2020-01-02 17:00       ` Bart Van Assche
2020-01-02 18:26         ` Jason Gunthorpe
2020-01-03 12:31           ` Jinpu Wang
2020-01-03 12:27         ` Jinpu Wang
2019-12-31  0:07   ` Bart Van Assche
2020-01-03 13:48     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 04/25] rtrs: core: lib functions shared between client and server modules Jack Wang
2019-12-30 22:25   ` Bart Van Assche
2020-01-07 12:22     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 05/25] rtrs: client: private header with client structs and functions Jack Wang
2019-12-30 22:51   ` Bart Van Assche
2020-01-07 12:39     ` Jinpu Wang
2019-12-30 23:03   ` Bart Van Assche
2020-01-07 12:39     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 06/25] rtrs: client: main functionality Jack Wang
2019-12-30 23:53   ` Bart Van Assche
2020-01-02 18:23     ` Jason Gunthorpe
2020-01-03 14:30     ` Jinpu Wang
2020-01-03 16:12       ` Bart Van Assche
2019-12-30 10:29 ` [PATCH v6 07/25] rtrs: client: statistics functions Jack Wang
2020-01-02 21:07   ` Bart Van Assche
2020-01-03 14:39     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 08/25] rtrs: client: sysfs interface functions Jack Wang
2020-01-02 21:14   ` Bart Van Assche
2020-01-03 14:59     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 09/25] rtrs: server: private header with server structs and functions Jack Wang
2020-01-02 21:24   ` Bart Van Assche
2020-01-08 16:33     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 10/25] rtrs: server: main functionality Jack Wang
2020-01-02 22:03   ` Bart Van Assche
2020-01-07 13:19     ` Jinpu Wang
2020-01-07 18:25       ` Jason Gunthorpe
2020-01-10 17:38         ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 11/25] rtrs: server: statistics functions Jack Wang
2020-01-02 22:02   ` Bart Van Assche
2020-01-08 12:55     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 12/25] rtrs: server: sysfs interface functions Jack Wang
2020-01-02 22:06   ` Bart Van Assche
2020-01-07 14:40     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 13/25] rtrs: include client and server modules into kernel compilation Jack Wang
2020-01-02 22:11   ` Bart Van Assche
2020-01-03 16:19     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 14/25] rtrs: a bit of documentation Jack Wang
2019-12-30 23:19   ` Bart Van Assche
2020-01-07 14:48     ` Jinpu Wang
2020-01-02 22:21   ` Bart Van Assche
2020-01-07 15:49     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 15/25] rnbd: private headers with rnbd protocol structs and helpers Jack Wang
2020-01-02 22:34   ` Bart Van Assche
2020-01-07 16:53     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 16/25] rnbd: client: private header with client structs and functions Jack Wang
2020-01-02 22:37   ` Bart Van Assche
2020-01-07 17:09     ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 17/25] rnbd: client: main functionality Jack Wang
2020-01-02 23:55   ` Bart Van Assche
2020-01-08 14:22     ` Jinpu Wang
2020-01-10 14:45     ` Jinpu Wang
2020-01-10 15:09       ` Roman Penyaev
2020-01-10 15:29         ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 18/25] rnbd: client: sysfs interface functions Jack Wang
2020-01-03  0:03   ` Bart Van Assche
2020-01-08 13:06     ` Jinpu Wang
2020-01-08 16:39       ` Bart Van Assche
2020-01-08 16:51         ` Jinpu Wang
2019-12-30 10:29 ` [PATCH v6 19/25] rnbd: server: private header with server structs and functions Jack Wang
2019-12-30 10:29 ` [PATCH v6 20/25] rnbd: server: main functionality Jack Wang
2019-12-30 10:29 ` [PATCH v6 21/25] rnbd: server: functionality for IO submission to file or block dev Jack Wang
2019-12-30 10:29 ` [PATCH v6 22/25] rnbd: server: sysfs interface functions Jack Wang
2019-12-30 10:29 ` [PATCH v6 23/25] rnbd: include client and server modules into kernel compilation Jack Wang
2019-12-30 10:29 ` [PATCH v6 24/25] rnbd: a bit of documentation Jack Wang
2019-12-30 10:29 ` [PATCH v6 25/25] MAINTAINERS: Add maintainers for RNBD/RTRS modules Jack Wang
2019-12-30 12:22   ` Gal Pressman
2020-01-02  8:41     ` Jinpu Wang
2019-12-31  0:11 ` [PATCH v6 00/25] RTRS (former IBTRS) rdma transport library and RNBD (former IBNBD) rdma network block device Bart Van Assche
2020-01-02  8:48   ` Jinpu Wang
2019-12-31  2:39 ` Bart Van Assche
2020-01-02  9:20   ` Jinpu Wang
2020-01-02 18:28   ` Jason Gunthorpe
2020-01-03 12:34     ` Jinpu Wang
