* [PATCH v3 00/13] Request for Comments on SoftiWarp
@ 2018-01-14 22:35 Bernard Metzler
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

This patch set contributes a new version of the SoftiWarp
driver, as originally introduced Oct 6th, 2017.

Thank you all for the comments on my last submission. They
were very helpful and encouraging. I hereby re-send the
SoftiWarp patch set after taking, I hope, all comments into
account. Without introducing SoftiWarp (siw) again, let me
fast-forward to the changes made since the last submission:


1. Removed all kernel module parameters, as requested.
   I am not completely happy yet with the result, since
   parameters like the names of the interfaces siw shall
   attach to, or whether to set TCP_NODELAY for low
   ping-pong delay or to clear it for higher message rates
   with multiple outstanding operations, were quite handy.
   I understand I have to find a better solution for that.
   Ideally, we would be able to distinguish between global
   siw driver control (such as the interfaces it attaches
   to) and per-QP parameters (somewhat similar to socket
   options), which would allow, for example, tuning delay,
   switching CRC on/off, or controlling GSO usage.

   For now, the former module parameters became global
   const values, as defined in siw_main.c (see the short
   excerpt after this list). So I gave up on flexibility
   for now, but hope to regain it with a proper interface.

2. Changed all debug printouts to device debug printouts,
   or removed them.

3. Implemented sending and handling of TERMINATE
   messages. Determining the information to be carried
   in those messages, which is scattered over several
   RFCs, was unexpectedly complex.

   It turned out that some iWARP hardware implements the
   iWARP termination protocol only rudimentarily, and does
   not send back useful information in case of application
   failure.

   I still think it is rather helpful to understand why a
   certain RDMA operation failed at the peer side, since
   the TERMINATE message carries that information back to
   the initiator.

4. Used Sparse to fix the various style and consistency
   issues of the last version.

5. Removed siw's private implementation of DMA mapping ops
   in favour of the in-kernel dma_virt_ops.

6. Fixed local SQ processing starvation due to a high
   inbound rate of READ.requests to the same QP.
   
7. Introduced some level of NUMA awareness at the
   transmit side. The code tries to serve SQ processing
   with a core from the same NUMA node as the Ethernet
   adapter in use.
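
To illustrate point 1: the former module parameters now show
up as global const values in siw_main.c. A condensed excerpt
from patch 03 of this series, with the defaults used in this
submission:

   /* Global siw tuning knobs, formerly module parameters */
   const bool zcopy_tx;                 /* tx from user buffer, if possible */
   const int  gso_seg_limit;            /* 0: use GSO, 1: one MTU per frame */
   const bool loopback_enabled = true;  /* also attach to loopback devices */
   const bool mpa_crc_required;         /* try to negotiate CRC on, if true */
   const bool mpa_crc_strict;           /* MPA CRC on/off enforced */
   const bool siw_lowdelay = true;      /* set TCP_NODELAY */
   const bool tcp_quickack;             /* set TCP_QUICKACK */
   u_char mpa_version = MPA_REVISION_2; /* MPA version for connection setup */
   const bool peer_to_peer;             /* MPA P2P mode (extra handshake) */

siw.h declares these extern, so the whole driver picks them
up at compile time; changing a default currently means
rebuilding the module.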


The current code comes with the following known
limitations:

1. To get rid of module parameters, siw now attaches
   to all interfaces of type ETHER (and additionally
   LOOPBACK, if enabled); the check is quoted after
   this list.

   Please take this as an ad-hoc solution. I understand
   it will likely not be acceptable, but I wanted to
   come out with a reasonable and working update of siw.
   So far, I have not come across any good solution for
   dynamic interface management for siw. I would highly
   appreciate suggestions, and look forward to the
   discussion.

2. Performance needs some additional investigation.
   The dynamics of the interaction with kernel TCP
   still need to be better understood, especially
   regarding clever setting of message flags, socket
   parameters, etc. On a 100Gb/s link, one QP can reach
   around 60Gb/s for large messages and a 6.5us WRITE
   delay for small messages, but there is still some
   fluctuation, and therefore probably some headroom.

3. I kept the siw_tx_hdt() function, which tries to
   push all pieces of an iWARP message to TCP as one
   message vector. This arguably makes the function
   rather complex, but performance seems to benefit
   from that approach.

4. Zero-copy transmission is currently switched off
   (bool zcopy_tx is false in siw_main.c). Switching
   it on lowers the CPU load at the sending side, but
   introduces a high rate of mid-frame fragmentation
   at the TCP layer (an RDMAP frame gets broken into
   two wire transmissions), which hurts receiver
   performance. It will get re-enabled once those
   effects are better understood and controlled.
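
For reference, the interface qualification mentioned in
limitation 1 boils down to the following check in siw_main.c
(quoted, slightly condensed, from patch 03 of this series):

   static int siw_dev_qualified(struct net_device *netdev)
   {
           if (netdev->type == ARPHRD_ETHER ||
               netdev->type == ARPHRD_IEEE802 ||
               (netdev->type == ARPHRD_LOOPBACK && loopback_enabled))
                   return 1;

           return 0;
   }

Any mechanism for dynamically attaching siw to, or detaching
it from, individual net_devices would replace this blanket
check.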


As always, I am very thankful to everyone reviewing the
code, and much appreciate the time and effort it takes.
I am very open to discussion and suggestions.

Thanks very much,
Bernard.

Bernard Metzler (13):
  iWARP wire packet format definition
  Main SoftiWarp include file
  Attach/detach SoftiWarp to/from network and RDMA subsystem
  SoftiWarp object management
  SoftiWarp application interface
  SoftiWarp connection management
  SoftiWarp application buffer management
  SoftiWarp Queue Pair methods
  SoftiWarp transmit path
  SoftiWarp receive path
  SoftiWarp Completion Queue methods
  SoftiWarp debugging code
  Add SoftiWarp to kernel build environment

 drivers/infiniband/Kconfig            |    1 +
 drivers/infiniband/sw/Makefile        |    1 +
 drivers/infiniband/sw/siw/Kconfig     |   18 +
 drivers/infiniband/sw/siw/Makefile    |   15 +
 drivers/infiniband/sw/siw/iwarp.h     |  415 +++++++
 drivers/infiniband/sw/siw/siw.h       |  823 +++++++++++++
 drivers/infiniband/sw/siw/siw_ae.c    |  120 ++
 drivers/infiniband/sw/siw/siw_cm.c    | 2184 +++++++++++++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_cm.h    |  156 +++
 drivers/infiniband/sw/siw/siw_cq.c    |  150 +++
 drivers/infiniband/sw/siw/siw_debug.c |  463 +++++++
 drivers/infiniband/sw/siw/siw_debug.h |   87 ++
 drivers/infiniband/sw/siw/siw_main.c  |  816 ++++++++++++
 drivers/infiniband/sw/siw/siw_mem.c   |  243 ++++
 drivers/infiniband/sw/siw/siw_obj.c   |  338 +++++
 drivers/infiniband/sw/siw/siw_obj.h   |  200 +++
 drivers/infiniband/sw/siw/siw_qp.c    | 1445 ++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_qp_rx.c | 1531 +++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_qp_tx.c | 1346 ++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_verbs.c | 1876 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_verbs.h |  119 ++
 include/uapi/rdma/siw_user.h          |  216 ++++
 22 files changed, 12563 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/Kconfig
 create mode 100644 drivers/infiniband/sw/siw/Makefile
 create mode 100644 drivers/infiniband/sw/siw/iwarp.h
 create mode 100644 drivers/infiniband/sw/siw/siw.h
 create mode 100644 drivers/infiniband/sw/siw/siw_ae.c
 create mode 100644 drivers/infiniband/sw/siw/siw_cm.c
 create mode 100644 drivers/infiniband/sw/siw/siw_cm.h
 create mode 100644 drivers/infiniband/sw/siw/siw_cq.c
 create mode 100644 drivers/infiniband/sw/siw/siw_debug.c
 create mode 100644 drivers/infiniband/sw/siw/siw_debug.h
 create mode 100644 drivers/infiniband/sw/siw/siw_main.c
 create mode 100644 drivers/infiniband/sw/siw/siw_mem.c
 create mode 100644 drivers/infiniband/sw/siw/siw_obj.c
 create mode 100644 drivers/infiniband/sw/siw/siw_obj.h
 create mode 100644 drivers/infiniband/sw/siw/siw_qp.c
 create mode 100644 drivers/infiniband/sw/siw/siw_qp_rx.c
 create mode 100644 drivers/infiniband/sw/siw/siw_qp_tx.c
 create mode 100644 drivers/infiniband/sw/siw/siw_verbs.c
 create mode 100644 drivers/infiniband/sw/siw/siw_verbs.h
 create mode 100644 include/uapi/rdma/siw_user.h

-- 
2.13.6


* [PATCH v3 01/13] iWARP wire packet format definition
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
@ 2018-01-14 22:35   ` Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 02/13] Main SoftiWarp include file Bernard Metzler
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/iwarp.h | 415 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 415 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/iwarp.h

diff --git a/drivers/infiniband/sw/siw/iwarp.h b/drivers/infiniband/sw/siw/iwarp.h
new file mode 100644
index 000000000000..e44a7e1936fe
--- /dev/null
+++ b/drivers/infiniband/sw/siw/iwarp.h
@@ -0,0 +1,415 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Fredy Neeser <nfd-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _IWARP_H
+#define _IWARP_H
+
+#include <rdma/rdma_user_cm.h>	/* RDMA_MAX_PRIVATE_DATA */
+#include <linux/types.h>
+#include <asm/byteorder.h>
+
+#define RDMAP_VERSION		1
+#define DDP_VERSION		1
+#define MPA_REVISION_1		1
+#define MPA_REVISION_2		2
+#define MPA_MAX_PRIVDATA	RDMA_MAX_PRIVATE_DATA
+#define MPA_KEY_REQ		"MPA ID Req Frame"
+#define MPA_KEY_REP		"MPA ID Rep Frame"
+#define	MPA_IRD_ORD_MASK	0x3fff
+
+struct mpa_rr_params {
+	__be16	bits;
+	__be16	pd_len;
+};
+
+/*
+ * MPA request/response header bits & fields
+ */
+enum {
+	MPA_RR_FLAG_MARKERS	= cpu_to_be16(0x8000),
+	MPA_RR_FLAG_CRC		= cpu_to_be16(0x4000),
+	MPA_RR_FLAG_REJECT	= cpu_to_be16(0x2000),
+	MPA_RR_FLAG_ENHANCED	= cpu_to_be16(0x1000),
+	MPA_RR_RESERVED		= cpu_to_be16(0x0f00),
+	MPA_RR_MASK_REVISION	= cpu_to_be16(0x00ff)
+};
+
+/*
+ * MPA request/reply header
+ */
+struct mpa_rr {
+	__u8	key[16];
+	struct mpa_rr_params params;
+};
+
+static inline void __mpa_rr_set_revision(__be16 *bits, u8 rev)
+{
+	*bits = (*bits & ~MPA_RR_MASK_REVISION)
+		| (cpu_to_be16(rev) & MPA_RR_MASK_REVISION);
+}
+
+static inline u8 __mpa_rr_revision(__be16 mpa_rr_bits)
+{
+	__be16 rev = mpa_rr_bits & MPA_RR_MASK_REVISION;
+
+	return (u8)be16_to_cpu(rev);
+}
+
+enum mpa_v2_ctrl {
+	MPA_V2_PEER_TO_PEER	= cpu_to_be16(0x8000),
+	MPA_V2_ZERO_LENGTH_RTR	= cpu_to_be16(0x4000),
+	MPA_V2_RDMA_WRITE_RTR	= cpu_to_be16(0x8000),
+	MPA_V2_RDMA_READ_RTR	= cpu_to_be16(0x4000),
+	MPA_V2_RDMA_NO_RTR	= cpu_to_be16(0x0000),
+	MPA_V2_MASK_IRD_ORD	= cpu_to_be16(0x3fff)
+};
+
+struct mpa_v2_data {
+	__be16		ird;
+	__be16		ord;
+};
+
+struct mpa_marker {
+	__be16	rsvd;
+	__be16	fpdu_hmd; /* FPDU header-marker distance (= MPA's FPDUPTR) */
+};
+
+/*
+ * maximum MPA trailer
+ */
+struct mpa_trailer {
+	__u8	pad[4];
+	__be32	crc;
+};
+
+#define MPA_HDR_SIZE	2
+#define MPA_CRC_SIZE	4
+
+/*
+ * Common portion of iWARP headers (MPA, DDP, RDMAP)
+ * for any FPDU
+ */
+struct iwarp_ctrl {
+	__be16	mpa_len;
+	__be16	ddp_rdmap_ctrl;
+};
+
+/*
+ * DDP/RDMAP Hdr bits & fields
+ */
+enum {
+	DDP_FLAG_TAGGED		= cpu_to_be16(0x8000),
+	DDP_FLAG_LAST		= cpu_to_be16(0x4000),
+	DDP_MASK_RESERVED	= cpu_to_be16(0x3C00),
+	DDP_MASK_VERSION	= cpu_to_be16(0x0300),
+	RDMAP_MASK_VERSION	= cpu_to_be16(0x00C0),
+	RDMAP_MASK_RESERVED	= cpu_to_be16(0x0030),
+	RDMAP_MASK_OPCODE	= cpu_to_be16(0x000f)
+};
+
+static inline u8 __ddp_version(struct iwarp_ctrl *ctrl)
+{
+	return (u8)(be16_to_cpu(ctrl->ddp_rdmap_ctrl & DDP_MASK_VERSION) >> 8);
+}
+
+static inline void __ddp_set_version(struct iwarp_ctrl *ctrl, u8 version)
+{
+	ctrl->ddp_rdmap_ctrl = (ctrl->ddp_rdmap_ctrl & ~DDP_MASK_VERSION)
+		| (cpu_to_be16((u16)version << 8) & DDP_MASK_VERSION);
+}
+
+static inline u8 __rdmap_version(struct iwarp_ctrl *ctrl)
+{
+	__be16 ver = ctrl->ddp_rdmap_ctrl & RDMAP_MASK_VERSION;
+
+	return (u8)(be16_to_cpu(ver) >> 6);
+}
+
+static inline void __rdmap_set_version(struct iwarp_ctrl *ctrl, u8 version)
+{
+	ctrl->ddp_rdmap_ctrl = (ctrl->ddp_rdmap_ctrl & ~RDMAP_MASK_VERSION)
+			| (cpu_to_be16(version << 6) & RDMAP_MASK_VERSION);
+}
+
+static inline u8 __rdmap_opcode(struct iwarp_ctrl *ctrl)
+{
+	return (u8)be16_to_cpu(ctrl->ddp_rdmap_ctrl & RDMAP_MASK_OPCODE);
+}
+
+static inline void __rdmap_set_opcode(struct iwarp_ctrl *ctrl, u8 opcode)
+{
+	ctrl->ddp_rdmap_ctrl = (ctrl->ddp_rdmap_ctrl & ~RDMAP_MASK_OPCODE)
+				| (cpu_to_be16(opcode) & RDMAP_MASK_OPCODE);
+}
+
+struct iwarp_rdma_write {
+	struct iwarp_ctrl	ctrl;
+	__be32			sink_stag;
+	__be64			sink_to;
+};
+
+struct iwarp_rdma_rreq {
+	struct iwarp_ctrl	ctrl;
+	__be32			rsvd;
+	__be32			ddp_qn;
+	__be32			ddp_msn;
+	__be32			ddp_mo;
+	__be32			sink_stag;
+	__be64			sink_to;
+	__be32			read_size;
+	__be32			source_stag;
+	__be64			source_to;
+};
+
+struct iwarp_rdma_rresp {
+	struct iwarp_ctrl	ctrl;
+	__be32			sink_stag;
+	__be64			sink_to;
+};
+
+struct iwarp_send {
+	struct iwarp_ctrl	ctrl;
+	__be32			rsvd;
+	__be32			ddp_qn;
+	__be32			ddp_msn;
+	__be32			ddp_mo;
+};
+
+struct iwarp_send_inv {
+	struct iwarp_ctrl	ctrl;
+	__be32			inval_stag;
+	__be32			ddp_qn;
+	__be32			ddp_msn;
+	__be32			ddp_mo;
+};
+
+struct iwarp_terminate {
+	struct iwarp_ctrl ctrl;
+	__be32	rsvd;
+	__be32	ddp_qn;
+	__be32	ddp_msn;
+	__be32	ddp_mo;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	__u16	layer:4,
+		etype:4,
+		ecode:8;
+	__u16	flag_m:1,
+		flag_d:1,
+		flag_r:1,
+		reserved:13;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	__u16	etype:4,
+		layer:4,
+		ecode:8;
+	__u16	reserved:13,
+		flag_r:1,
+		flag_d:1,
+		flag_m:1;
+#else
+#error "undefined byte order"
+#endif
+};
+
+/*
+ * Terminate Hdr bits & fields
+ */
+enum {
+	TERM_MASK_LAYER	= cpu_to_be32(0xf0000000),
+	TERM_MASK_ETYPE	= cpu_to_be32(0x0f000000),
+	TERM_MASK_ECODE	= cpu_to_be32(0x00ff0000),
+	TERM_FLAG_M	= cpu_to_be32(0x00008000),
+	TERM_FLAG_D	= cpu_to_be32(0x00004000),
+	TERM_FLAG_R	= cpu_to_be32(0x00002000),
+	TERM_MASK_RESVD	= cpu_to_be32(0x00001fff)
+};
+
+static inline u8 __rdmap_term_layer(struct iwarp_terminate *term)
+{
+	return term->layer;
+}
+
+static inline void __rdmap_term_set_layer(struct iwarp_terminate *term,
+					  u8 layer)
+{
+	term->layer = layer & 0xf;
+}
+
+static inline u8 __rdmap_term_etype(struct iwarp_terminate *term)
+{
+	return term->etype;
+}
+
+static inline void __rdmap_term_set_etype(struct iwarp_terminate *term,
+					  u8 etype)
+{
+	term->etype = etype & 0xf;
+}
+
+static inline u8 __rdmap_term_ecode(struct iwarp_terminate *term)
+{
+	return term->ecode;
+}
+
+static inline void __rdmap_term_set_ecode(struct iwarp_terminate *term,
+					  u8 ecode)
+{
+	term->ecode = ecode;
+}
+
+/*
+ * Common portion of iWARP headers (MPA, DDP, RDMAP)
+ * for an FPDU carrying an untagged DDP segment
+ */
+struct iwarp_ctrl_untagged {
+	struct iwarp_ctrl	ctrl;
+	__be32			rsvd;
+	__be32			ddp_qn;
+	__be32			ddp_msn;
+	__be32			ddp_mo;
+};
+
+/*
+ * Common portion of iWARP headers (MPA, DDP, RDMAP)
+ * for an FPDU carrying a tagged DDP segment
+ */
+struct iwarp_ctrl_tagged {
+	struct iwarp_ctrl	ctrl;
+	__be32			ddp_stag;
+	__be64			ddp_to;
+};
+
+union iwarp_hdr {
+	struct iwarp_ctrl		ctrl;
+	struct iwarp_ctrl_untagged	c_untagged;
+	struct iwarp_ctrl_tagged	c_tagged;
+	struct iwarp_rdma_write		rwrite;
+	struct iwarp_rdma_rreq		rreq;
+	struct iwarp_rdma_rresp		rresp;
+	struct iwarp_terminate		terminate;
+	struct iwarp_send		send;
+	struct iwarp_send_inv		send_inv;
+};
+
+enum term_elayer {
+	TERM_ERROR_LAYER_RDMAP	= 0x00,
+	TERM_ERROR_LAYER_DDP	= 0x01,
+	TERM_ERROR_LAYER_LLP	= 0x02	/* eg., MPA */
+};
+
+enum ddp_etype {
+	DDP_ETYPE_CATASTROPHIC	= 0x0,
+	DDP_ETYPE_TAGGED_BUF	= 0x1,
+	DDP_ETYPE_UNTAGGED_BUF	= 0x2,
+	DDP_ETYPE_RSVD		= 0x3
+};
+
+enum ddp_ecode {
+	/* unspecified, set to zero */
+	DDP_ECODE_CATASTROPHIC		= 0x00,
+	/* Tagged Buffer Errors */
+	DDP_ECODE_T_INVALID_STAG	= 0x00,
+	DDP_ECODE_T_BASE_BOUNDS		= 0x01,
+	DDP_ECODE_T_STAG_NOT_ASSOC	= 0x02,
+	DDP_ECODE_T_TO_WRAP		= 0x03,
+	DDP_ECODE_T_VERSION		= 0x04,
+	/* Untagged Buffer Errors */
+	DDP_ECODE_UT_INVALID_QN		= 0x01,
+	DDP_ECODE_UT_INVALID_MSN_NOBUF	= 0x02,
+	DDP_ECODE_UT_INVALID_MSN_RANGE	= 0x03,
+	DDP_ECODE_UT_INVALID_MO		= 0x04,
+	DDP_ECODE_UT_MSG_TOOLONG	= 0x05,
+	DDP_ECODE_UT_VERSION		= 0x06
+};
+
+enum rdmap_untagged_qn {
+	RDMAP_UNTAGGED_QN_SEND		= 0,
+	RDMAP_UNTAGGED_QN_RDMA_READ	= 1,
+	RDMAP_UNTAGGED_QN_TERMINATE	= 2,
+	RDMAP_UNTAGGED_QN_COUNT		= 3
+};
+
+enum rdmap_etype {
+	RDMAP_ETYPE_CATASTROPHIC	= 0x0,
+	RDMAP_ETYPE_REMOTE_PROTECTION	= 0x1,
+	RDMAP_ETYPE_REMOTE_OPERATION	= 0x2
+};
+
+enum rdmap_ecode {
+	RDMAP_ECODE_INVALID_STAG	= 0x00,
+	RDMAP_ECODE_BASE_BOUNDS		= 0x01,
+	RDMAP_ECODE_ACCESS_RIGHTS	= 0x02,
+	RDMAP_ECODE_STAG_NOT_ASSOC	= 0x03,
+	RDMAP_ECODE_TO_WRAP		= 0x04,
+	RDMAP_ECODE_VERSION		= 0x05,
+	RDMAP_ECODE_OPCODE		= 0x06,
+	RDMAP_ECODE_CATASTROPHIC_STREAM	= 0x07,
+	RDMAP_ECODE_CATASTROPHIC_GLOBAL	= 0x08,
+	RDMAP_ECODE_CANNOT_INVALIDATE	= 0x09,
+	RDMAP_ECODE_UNSPECIFIED		= 0xff
+};
+
+enum llp_ecode {
+	LLP_ECODE_TCP_STREAM_LOST	= 0x01, /* How to transfer this ?? */
+	LLP_ECODE_RECEIVED_CRC		= 0x02,
+	LLP_ECODE_FPDU_START		= 0x03,
+	LLP_ECODE_INVALID_REQ_RESP	= 0x04,
+
+	/* Errors for Enhanced Connection Establishment only */
+	LLP_ECODE_LOCAL_CATASTROPHIC	= 0x05,
+	LLP_ECODE_INSUFFICIENT_IRD	= 0x06,
+	LLP_ECODE_NO_MATCHING_RTR	= 0x07
+};
+
+enum llp_etype {
+	LLP_ETYPE_MPA	= 0x00
+};
+
+enum rdma_opcode {
+	RDMAP_RDMA_WRITE	= 0x0,
+	RDMAP_RDMA_READ_REQ	= 0x1,
+	RDMAP_RDMA_READ_RESP	= 0x2,
+	RDMAP_SEND		= 0x3,
+	RDMAP_SEND_INVAL	= 0x4,
+	RDMAP_SEND_SE		= 0x5,
+	RDMAP_SEND_SE_INVAL	= 0x6,
+	RDMAP_TERMINATE		= 0x7,
+	RDMAP_NOT_SUPPORTED	= RDMAP_TERMINATE + 1
+};
+
+#endif
-- 
2.13.6


* [PATCH v3 02/13] Main SoftiWarp include file
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
  2018-01-14 22:35   ` [PATCH v3 01/13] iWARP wire packet format definition Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem Bernard Metzler
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw.h | 823 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 823 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw.h

diff --git a/drivers/infiniband/sw/siw/siw.h b/drivers/infiniband/sw/siw/siw.h
new file mode 100644
index 000000000000..54bc71d1867d
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw.h
@@ -0,0 +1,823 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _SIW_H
+#define _SIW_H
+
+#include <linux/idr.h>
+#include <rdma/ib_verbs.h>
+#include <linux/socket.h>
+#include <linux/skbuff.h>
+#include <linux/in.h>
+#include <linux/fs.h>
+#include <linux/netdevice.h>
+#include <crypto/hash.h>
+#include <linux/resource.h>	/* MLOCK_LIMIT */
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/llist.h>
+#include <linux/mm.h>
+#include <linux/sched/signal.h>
+
+#include <rdma/siw_user.h>
+#include "iwarp.h"
+
+#define DEVICE_ID_SOFTIWARP	0x0815
+#define SIW_VENDOR_ID		0x626d74	/* ascii 'bmt' for now */
+#define SIW_VENDORT_PART_ID	0
+#define SIW_MAX_QP		(1024 * 100)
+#define SIW_MAX_QP_WR		(1024 * 32)
+#define SIW_MAX_ORD_QP		128
+#define SIW_MAX_IRD_QP		128
+#define SIW_MAX_SGE_PBL		256	/* max num sge's for PBL */
+#define SIW_MAX_SGE_RD		1	/* iwarp limitation. we could relax */
+#define SIW_MAX_CQ		(1024 * 100)
+#define SIW_MAX_CQE		(SIW_MAX_QP_WR * 100)
+#define SIW_MAX_MR		(SIW_MAX_QP * 10)
+#define SIW_MAX_PD		SIW_MAX_QP
+#define SIW_MAX_MW		0	/* to be set if MW's are supported */
+#define SIW_MAX_FMR		SIW_MAX_MR
+#define SIW_MAX_SRQ		SIW_MAX_QP
+#define SIW_MAX_SRQ_WR		(SIW_MAX_QP_WR * 10)
+#define SIW_MAX_CONTEXT		SIW_MAX_PD
+
+/* Min number of bytes for using zero copy transmit */
+#define SENDPAGE_THRESH		PAGE_SIZE
+
+/* Maximum number of frames which can be sent in one SQ processing */
+#define SQ_USER_MAXBURST	100
+
+/* Maximum number of consecutive IRQ elements which get served
+ * if SQ has pending work
+ */
+#define SIW_IRQ_MAXBURST_SQ_ACTIVE 4
+
+struct siw_devinfo {
+	unsigned int	device;
+	unsigned int	version;
+
+	u32	vendor_id;
+	u32	vendor_part_id;
+	u32	sw_version;
+	int	max_qp;
+	int	max_qp_wr;
+	int	max_ord; /* max. outbound read queue depth */
+	int	max_ird; /* max. inbound read queue depth */
+
+	enum ib_device_cap_flags cap_flags;
+	int	max_sge;
+	int	max_sge_rd;
+	int	max_cq;
+	int	max_cqe;
+	u64	max_mr_size;
+	int	max_mr;
+	int	max_pd;
+	int	max_mw;
+	int	max_fmr;
+	int	max_srq;
+	int	max_srq_wr;
+	int	max_srq_sge;
+};
+
+
+struct siw_device {
+	struct ib_device	base_dev;
+	struct list_head	list;
+	struct net_device	*netdev;
+	struct siw_devinfo	attrs;
+	int			is_registered; /* Registered with RDMA core */
+	int			numa_node;
+
+	/* physical port state (only one port per device) */
+	enum ib_port_state	state;
+
+	/* object management */
+	struct list_head	cep_list;
+	struct list_head	qp_list;
+	struct list_head	mr_list;
+	spinlock_t		idr_lock;
+	struct idr		qp_idr;
+	struct idr		cq_idr;
+	struct idr		pd_idr;
+	struct idr		mem_idr;
+
+	/* active objects statistics */
+	atomic_t		num_qp;
+	atomic_t		num_cq;
+	atomic_t		num_pd;
+	atomic_t		num_mr;
+	atomic_t		num_srq;
+	atomic_t		num_cep;
+	atomic_t		num_ctx;
+
+	struct dentry		*debugfs;
+};
+
+struct siw_objhdr {
+	u32			id;	/* for idr based object lookup */
+	struct kref		ref;
+	struct siw_device	*sdev;
+};
+
+struct siw_uobj {
+	struct list_head	list;
+	void	*addr;
+	u32	size;
+	u32	key;
+};
+
+struct siw_ucontext {
+	struct ib_ucontext	ib_ucontext;
+	struct siw_device	*sdev;
+
+	/* List of user mappable queue objects */
+	struct list_head	uobj_list;
+	spinlock_t		uobj_lock;
+	u32			uobj_key;
+};
+
+struct siw_pd {
+	struct siw_objhdr	hdr;
+	struct ib_pd		base_pd;
+};
+
+enum siw_access_flags {
+	SIW_MEM_LREAD	= (1<<0),
+	SIW_MEM_LWRITE	= (1<<1),
+	SIW_MEM_RREAD	= (1<<2),
+	SIW_MEM_RWRITE	= (1<<3),
+
+	SIW_MEM_FLAGS_LOCAL =
+		(SIW_MEM_LREAD | SIW_MEM_LWRITE),
+	SIW_MEM_FLAGS_REMOTE =
+		(SIW_MEM_RWRITE | SIW_MEM_RREAD)
+};
+
+#define SIW_STAG_MAX	0xffffffff
+
+struct siw_mr;
+
+/*
+ * siw presentation of user memory registered as source
+ * or target of RDMA operations.
+ */
+
+struct siw_page_chunk {
+	struct page **p;
+};
+
+struct siw_umem {
+	struct siw_page_chunk	*page_chunk;
+	int			num_pages;
+	u64			fp_addr;	/* First page base address */
+	struct pid		*pid;
+	struct mm_struct	*mm_s;
+	struct work_struct	work;
+};
+
+struct siw_pble {
+	u64	addr;		/* Address of assigned user buffer */
+	u64	size;		/* Size of this entry */
+	u64	pbl_off;	/* Total offset from start of PBL */
+};
+
+struct siw_pbl {
+	unsigned int	num_buf;
+	unsigned int	max_buf;
+	struct siw_pble	pbe[1];
+};
+
+/*
+ * generic memory representation for registered siw memory.
+ * memory lookup always via higher 24 bits of stag (stag index).
+ * the stag is stored as part of the siw object header (id).
+ * object relates to memory window if embedded mr pointer is valid
+ */
+struct siw_mem {
+	struct siw_objhdr	hdr;
+
+	struct siw_mr	*mr;	/* assoc. MR if MW, NULL if MR */
+	u64	va;		/* VA of memory */
+	u64	len;		/* amount of memory bytes */
+
+	u32	stag_valid:1,		/* VALID or INVALID */
+		is_pbl:1,		/* PBL or user space mem */
+		is_zbva:1,		/* zero based virt. addr. */
+		mw_bind_enabled:1,	/* check only if MR */
+		remote_inval_enabled:1,	/* VALID or INVALID */
+		consumer_owns_key:1,	/* key/index split ? */
+		rsvd:26;
+
+	enum siw_access_flags	perms;	/* local/remote READ & WRITE */
+};
+
+#define SIW_MEM_IS_MW(m)	((m)->mr != NULL)
+
+/*
+ * MR and MW definition.
+ * Used RDMA base structs ib_mr/ib_mw holding:
+ * lkey, rkey, MW reference count on MR
+ */
+struct siw_mr {
+	struct ib_mr		base_mr;
+	struct siw_mem		mem;
+	struct list_head	devq;
+	struct rcu_head		rcu;
+	union {
+		struct siw_umem	*umem;
+		struct siw_pbl	*pbl;
+		void *mem_obj;
+	};
+	struct siw_pd	*pd;
+};
+
+struct siw_mw {
+	struct ib_mw	base_mw;
+	struct siw_mem	mem;
+	struct rcu_head rcu;
+};
+
+/*
+ * Error codes for local or remote
+ * access to registered memory
+ */
+enum siw_access_state {
+	E_ACCESS_OK = 0,
+	E_STAG_INVALID,
+	E_BASE_BOUNDS,
+	E_ACCESS_PERM,
+	E_PD_MISMATCH
+};
+
+enum siw_wr_state {
+	SIW_WR_IDLE		= 0,
+	SIW_WR_QUEUED		= 1,	/* processing has not started yet */
+	SIW_WR_INPROGRESS	= 2	/* initiated processing of the WR */
+};
+
+union siw_mem_resolved {
+	struct siw_mem	*obj;	/* reference to registered memory */
+	char		*buf;	/* linear kernel buffer */
+};
+
+/* The WQE currently being processed (RX or TX) */
+struct siw_wqe {
+	/* Copy of applications SQE or RQE */
+	union {
+		struct siw_sqe	sqe;
+		struct siw_rqe	rqe;
+	};
+	struct siw_mem		*mem[SIW_MAX_SGE]; /* per sge's resolved mem */
+	enum siw_wr_state	wr_status;
+	enum siw_wc_status	wc_status;
+	u32			bytes;		/* total bytes to process */
+	u32			processed;	/* bytes processed */
+	int			error;
+};
+
+struct siw_cq {
+	struct ib_cq		base_cq;
+	struct siw_objhdr	hdr;
+	enum siw_notify_flags	*notify;
+	spinlock_t		lock;
+	struct siw_cqe		*queue;
+	u32			cq_put;
+	u32			cq_get;
+	u32			num_cqe;
+	int			kernel_verbs;
+};
+
+enum siw_qp_state {
+	SIW_QP_STATE_IDLE	= 0,
+	SIW_QP_STATE_RTR	= 1,
+	SIW_QP_STATE_RTS	= 2,
+	SIW_QP_STATE_CLOSING	= 3,
+	SIW_QP_STATE_TERMINATE	= 4,
+	SIW_QP_STATE_ERROR	= 5,
+	SIW_QP_STATE_COUNT	= 6
+};
+
+enum siw_qp_flags {
+	SIW_RDMA_BIND_ENABLED	= (1 << 0),
+	SIW_RDMA_WRITE_ENABLED	= (1 << 1),
+	SIW_RDMA_READ_ENABLED	= (1 << 2),
+	SIW_SIGNAL_ALL_WR	= (1 << 3),
+	SIW_MPA_CRC		= (1 << 4),
+	SIW_QP_IN_DESTROY	= (1 << 5)
+};
+
+enum siw_qp_attr_mask {
+	SIW_QP_ATTR_STATE		= (1 << 0),
+	SIW_QP_ATTR_ACCESS_FLAGS	= (1 << 1),
+	SIW_QP_ATTR_LLP_HANDLE		= (1 << 2),
+	SIW_QP_ATTR_ORD			= (1 << 3),
+	SIW_QP_ATTR_IRD			= (1 << 4),
+	SIW_QP_ATTR_SQ_SIZE		= (1 << 5),
+	SIW_QP_ATTR_RQ_SIZE		= (1 << 6),
+	SIW_QP_ATTR_MPA			= (1 << 7)
+};
+
+struct siw_sk_upcalls {
+	void	(*sk_state_change)(struct sock *sk);
+	void	(*sk_data_ready)(struct sock *sk, int bytes);
+	void	(*sk_write_space)(struct sock *sk);
+	void	(*sk_error_report)(struct sock *sk);
+};
+
+struct siw_srq {
+	struct ib_srq		base_srq;
+	struct siw_pd		*pd;
+	atomic_t		rq_index;
+	spinlock_t		lock;
+	atomic_t		space;	/* current space for posting wqe's */
+	u32			max_sge;
+	u32			limit;	/* low watermark for async event */
+	struct siw_rqe		*recvq;
+	u32			rq_put;
+	u32			rq_get;
+	u32			num_rqe;/* max # of wqe's allowed */
+	char			armed;	/* inform user if limit hit */
+	char			kernel_verbs; /* '1' if kernel client */
+};
+
+struct siw_qp_attrs {
+	enum siw_qp_state	state;
+	u32			rq_hiwat;
+	u32			sq_size;
+	u32			rq_size;
+	u32			orq_size;
+	u32			irq_size;
+	u32			sq_max_sges;
+	u32			sq_max_sges_rdmaw;
+	u32			rq_max_sges;
+	enum siw_qp_flags	flags;
+
+	struct socket		*llp_stream_handle;
+};
+
+enum siw_tx_ctx {
+	SIW_SEND_HDR = 0,	/* start or continue sending HDR */
+	SIW_SEND_DATA = 1,	/* start or continue sending DDP payload */
+	SIW_SEND_TRAILER = 2,	/* start or continue sending TRAILER */
+	SIW_SEND_SHORT_FPDU = 3 /* send whole FPDU hdr|data|trailer at once */
+};
+
+enum siw_rx_state {
+	SIW_GET_HDR = 0,	/* await new hdr or within hdr */
+	SIW_GET_DATA_START = 1,	/* start of inbound DDP payload */
+	SIW_GET_DATA_MORE = 2,	/* continuation of (misaligned) DDP payload */
+	SIW_GET_TRAILER	= 3	/* await new trailer or within trailer */
+};
+
+
+struct siw_iwarp_rx {
+	struct sk_buff		*skb;
+	union iwarp_hdr		hdr;
+	struct mpa_trailer	trailer;
+	/*
+	 * local destination memory of inbound iwarp operation.
+	 * valid, according to wqe->wr_status
+	 */
+	struct siw_wqe		wqe_active;
+
+	struct shash_desc	*mpa_crc_hd;
+	/*
+	 * Next expected DDP MSN for each QN +
+	 * expected steering tag +
+	 * expected DDP tagged offset (all HBO)
+	 */
+	u32			ddp_msn[RDMAP_UNTAGGED_QN_COUNT];
+	u32			ddp_stag;
+	u64			ddp_to;
+
+	/*
+	 * For each FPDU, main RX loop runs through 3 stages:
+	 * Receiving protocol headers, placing DDP payload and receiving
+	 * trailer information (CRC + eventual padding).
+	 * Next two variables keep state on receive status of the
+	 * current FPDU part (hdr, data, trailer).
+	 */
+	int			fpdu_part_rcvd;/* bytes in pkt part copied */
+	int			fpdu_part_rem; /* bytes in pkt part not seen */
+
+	int			skb_new;      /* pending unread bytes in skb */
+	int			skb_offset;   /* offset in skb */
+	int			skb_copied;   /* processed bytes in skb */
+
+	int			pbl_idx;	/* Index into current PBL */
+
+	int			sge_idx;	/* current sge in rx */
+	unsigned int		sge_off;	/* already rcvd in curr. sge */
+
+	enum siw_rx_state	state;
+
+	u32			inval_stag;	/* Stag to be invalidated */
+
+	u8			first_ddp_seg:1,   /* this is first DDP seg */
+				more_ddp_segs:1,   /* more DDP segs expected */
+				rx_suspend:1,	   /* stop rcv DDP segs. */
+				unused:1,
+				prev_rdmap_opcode:4; /* opcode of prev msg */
+	char			pad;		/* # of pad bytes expected */
+};
+
+#define siw_rx_data(qp, rctx)	\
+	(iwarp_pktinfo[__rdmap_opcode(&rctx->hdr.ctrl)].proc_data(qp, rctx))
+
+/*
+ * Shorthands for short packets w/o payload
+ * to be transmitted more efficiently.
+ */
+struct siw_send_pkt {
+	struct iwarp_send	send;
+	__be32			crc;
+};
+
+struct siw_write_pkt {
+	struct iwarp_rdma_write	write;
+	__be32			crc;
+};
+
+struct siw_rreq_pkt {
+	struct iwarp_rdma_rreq	rreq;
+	__be32			crc;
+};
+
+struct siw_rresp_pkt {
+	struct iwarp_rdma_rresp	rresp;
+	__be32			crc;
+};
+
+struct siw_iwarp_tx {
+	union {
+		union iwarp_hdr			hdr;
+
+		/* Generic part of FPDU header */
+		struct iwarp_ctrl		ctrl;
+		struct iwarp_ctrl_untagged	c_untagged;
+		struct iwarp_ctrl_tagged	c_tagged;
+
+		/* FPDU headers */
+		struct iwarp_rdma_write		rwrite;
+		struct iwarp_rdma_rreq		rreq;
+		struct iwarp_rdma_rresp		rresp;
+		struct iwarp_terminate		terminate;
+		struct iwarp_send		send;
+		struct iwarp_send_inv		send_inv;
+
+		/* complete short FPDUs */
+		struct siw_send_pkt		send_pkt;
+		struct siw_write_pkt		write_pkt;
+		struct siw_rreq_pkt		rreq_pkt;
+		struct siw_rresp_pkt		rresp_pkt;
+	} pkt;
+
+	struct mpa_trailer			trailer;
+	/* DDP MSN for untagged messages */
+	u32			ddp_msn[RDMAP_UNTAGGED_QN_COUNT];
+
+	enum siw_tx_ctx		state;
+	wait_queue_head_t	waitq;
+	u16			ctrl_len;	/* ddp+rdmap hdr */
+	u16			ctrl_sent;
+	int			burst;
+
+	int			bytes_unsent;	/* ddp payload bytes */
+
+	struct shash_desc	*mpa_crc_hd;
+
+	atomic_t		in_use;		/* tx currently under way */
+
+	u8			do_crc:1,	/* do crc for segment */
+				use_sendpage:1,	/* send w/o copy */
+				tx_suspend:1,	/* stop sending DDP segs. */
+				pad:2,		/* # pad in current fpdu */
+				orq_fence:1,	/* ORQ full or Send fenced */
+				unused:2;
+
+	u16			fpdu_len;	/* len of FPDU to tx */
+
+	unsigned int		tcp_seglen;	/* remaining tcp seg space */
+
+	struct siw_wqe		wqe_active;
+
+	int			pbl_idx;	/* Index into current PBL */
+	int			sge_idx;	/* current sge in tx */
+	u32			sge_off;	/* already sent in curr. sge */
+	int			in_syscall;	/* TX out of user context */
+};
+
+struct siw_qp {
+	struct ib_qp		base_qp;
+	struct siw_objhdr	hdr;
+	struct list_head	devq;
+	int			tx_cpu;
+	int			kernel_verbs;
+	struct siw_qp_attrs	attrs;
+
+	struct siw_cep		*cep;
+	struct rw_semaphore	state_lock;
+
+	struct siw_pd		*pd;
+	struct siw_cq		*scq;
+	struct siw_cq		*rcq;
+	struct siw_srq		*srq;
+
+	struct siw_iwarp_tx	tx_ctx;	/* Transmit context */
+	spinlock_t		sq_lock;
+	struct siw_sqe		*sendq;	/* send queue element array */
+	uint32_t		sq_get;	/* consumer index into sq array */
+	uint32_t		sq_put;	/* kernel prod. index into sq array */
+	struct llist_node	tx_list;
+
+	struct siw_sqe		*orq; /* outbound read queue element array */
+	spinlock_t		orq_lock;
+	uint32_t		orq_get;/* consumer index into orq array */
+	uint32_t		orq_put;/* shared producer index for ORQ */
+
+	struct siw_iwarp_rx	rx_ctx;	/* Receive context */
+	spinlock_t		rq_lock;
+	struct siw_rqe		*recvq;	/* recv queue element array */
+	uint32_t		rq_get;	/* consumer index into rq array */
+	uint32_t		rq_put;	/* kernel prod. index into rq array */
+
+	struct siw_sqe		*irq;	/* inbound read queue element array */
+	uint32_t		irq_get;/* consumer index into irq array */
+	uint32_t		irq_put;/* producer index into irq array */
+	int			irq_burst;
+
+	struct { /* information to be carried in TERMINATE pkt, if valid */
+		u8 valid;
+		u8 in_tx;
+		u8 layer:4,
+		   etype:4;
+		u8 ecode;
+	} term_info;
+};
+
+/* helper macros */
+#define rx_qp(rx)		container_of(rx, struct siw_qp, rx_ctx)
+#define tx_qp(tx)		container_of(tx, struct siw_qp, tx_ctx)
+#define tx_wqe(qp)		(&(qp)->tx_ctx.wqe_active)
+#define rx_wqe(qp)		(&(qp)->rx_ctx.wqe_active)
+#define rx_mem(qp)		((qp)->rx_ctx.wqe_active.mem[0])
+#define tx_type(wqe)		((wqe)->sqe.opcode)
+#define rx_type(wqe)		((wqe)->rqe.opcode)
+#define tx_flags(wqe)		((wqe)->sqe.flags)
+
+#define tx_more_wqe(qp)		(!siw_sq_empty(qp) || !siw_irq_empty(qp))
+
+#define QP_ID(qp)		((qp)->hdr.id)
+#define OBJ_ID(obj)		((obj)->hdr.id)
+#define RX_QPID(rx)		QP_ID(rx_qp(rx))
+#define TX_QPID(tx)		QP_ID(tx_qp(tx))
+
+#define ddp_data_len(op, mpa_len) \
+	(mpa_len - (iwarp_pktinfo[op].hdr_len - MPA_HDR_SIZE))
+
+struct iwarp_msg_info {
+	int			hdr_len;
+	struct iwarp_ctrl	ctrl;
+	int	(*proc_data)(struct siw_qp *qp, struct siw_iwarp_rx *rctx);
+};
+
+/* Global siw parameters. Currently set in siw_main.c */
+extern const bool zcopy_tx;
+extern const int gso_seg_limit;
+extern const bool loopback_enabled;
+extern const bool mpa_crc_required;
+extern const bool mpa_crc_strict;
+extern const bool siw_lowdelay;
+extern const bool tcp_quickack;
+extern u_char mpa_version;
+extern const bool peer_to_peer;
+
+extern struct iwarp_msg_info iwarp_pktinfo[RDMAP_TERMINATE + 1];
+
+/* QP general functions */
+int siw_qp_modify(struct siw_qp *qp, struct siw_qp_attrs *attr,
+		  enum siw_qp_attr_mask mask);
+int siw_qp_mpa_rts(struct siw_qp *qp, enum mpa_v2_ctrl ctrl);
+void siw_qp_llp_close(struct siw_qp *qp);
+void siw_qp_cm_drop(struct siw_qp *qp, int when);
+void siw_send_terminate(struct siw_qp *qp);
+
+
+struct ib_qp *siw_get_base_qp(struct ib_device *dev, int id);
+void siw_qp_get_ref(struct ib_qp *qp);
+void siw_qp_put_ref(struct ib_qp *qp);
+
+enum siw_qp_state siw_map_ibstate(enum ib_qp_state state);
+
+void siw_init_terminate(struct siw_qp *qp, enum term_elayer layer,
+			u8 etype, u8 ecode, int in_tx);
+enum ddp_ecode siw_tagged_error(enum siw_access_state state);
+enum rdmap_ecode siw_rdmap_error(enum siw_access_state state);
+
+int siw_check_mem(struct siw_pd *pd, struct siw_mem *mem, u64 addr,
+		  enum siw_access_flags perm, int len);
+int siw_check_sge(struct siw_pd *pd, struct siw_sge *sge,
+		  struct siw_mem **mem, enum siw_access_flags perm,
+		  u32 off, int len);
+
+void siw_read_to_orq(struct siw_sqe *rreq, struct siw_sqe *sqe);
+int siw_sqe_complete(struct siw_qp *qp, struct siw_sqe *sqe, u32 bytes,
+		     enum siw_wc_status status);
+int siw_rqe_complete(struct siw_qp *qp, struct siw_rqe *rqe, u32 bytes,
+		     enum siw_wc_status status);
+void siw_qp_llp_data_ready(struct sock *sock);
+void siw_qp_llp_write_space(struct sock *sock);
+
+/* SIW user memory management */
+
+#define CHUNK_SHIFT	9	/* sets number of pages per chunk */
+#define PAGES_PER_CHUNK	(_AC(1, UL) << CHUNK_SHIFT)
+#define CHUNK_MASK	(~(PAGES_PER_CHUNK - 1))
+#define PAGE_CHUNK_SIZE	(PAGES_PER_CHUNK * sizeof(struct page *))
+
+/*
+ * siw_get_upage()
+ *
+ * Get page pointer for address on given umem.
+ *
+ * @umem: two dimensional list of page pointers
+ * @addr: user virtual address
+ */
+static inline struct page *siw_get_upage(struct siw_umem *umem, u64 addr)
+{
+	unsigned int	page_idx	= (addr - umem->fp_addr) >> PAGE_SHIFT,
+			chunk_idx	= page_idx >> CHUNK_SHIFT,
+			page_in_chunk	= page_idx & ~CHUNK_MASK;
+
+	if (likely(page_idx < umem->num_pages))
+		return umem->page_chunk[chunk_idx].p[page_in_chunk];
+
+	return NULL;
+}
+
+extern struct siw_umem *siw_umem_get(u64 start, u64 len);
+extern void siw_umem_release(struct siw_umem *umem);
+extern struct siw_pbl *siw_pbl_alloc(u32 num_buf);
+extern u64 siw_pbl_get_buffer(struct siw_pbl *pbl, u64 off, int *len, int *idx);
+extern void siw_pbl_free(struct siw_pbl *pbl);
+
+
+/* QP TX path functions */
+extern int siw_run_sq(void *arg);
+extern int siw_qp_sq_process(struct siw_qp *qp);
+extern int siw_sq_start(struct siw_qp *qp);
+extern int siw_activate_tx(struct siw_qp *qp);
+extern void siw_stop_tx_thread(int nr_cpu);
+extern int siw_get_tx_cpu(struct siw_device *sdev);
+extern void siw_put_tx_cpu(int cpu);
+
+/* QP RX path functions */
+extern int siw_proc_send(struct siw_qp *qp, struct siw_iwarp_rx *rx);
+extern int siw_proc_rreq(struct siw_qp *qp, struct siw_iwarp_rx *rx);
+extern int siw_proc_rresp(struct siw_qp *qp, struct siw_iwarp_rx *rx);
+extern int siw_proc_write(struct siw_qp *qp, struct siw_iwarp_rx *rx);
+extern int siw_proc_terminate(struct siw_qp *qp, struct siw_iwarp_rx *rx);
+extern int siw_proc_unsupp(struct siw_qp *qp, struct siw_iwarp_rx *rx);
+
+extern int siw_tcp_rx_data(read_descriptor_t *rd_desc, struct sk_buff *skb,
+			   unsigned int off, size_t len);
+
+/* MPA utilities */
+static inline int siw_crc_array(struct shash_desc *desc, u8 *start,
+				size_t len)
+{
+	return crypto_shash_update(desc, start, len);
+}
+
+static inline int siw_crc_page(struct shash_desc *desc, struct page *p,
+			       int off, int len)
+{
+	return crypto_shash_update(desc, page_address(p) + off, len);
+}
+extern struct crypto_shash *siw_crypto_shash;
+
+extern struct task_struct *siw_tx_thread[];
+extern int default_tx_cpu;
+
+/* Varia */
+extern void siw_cq_flush(struct siw_cq *cq);
+extern void siw_sq_flush(struct siw_qp *qp);
+extern void siw_rq_flush(struct siw_qp *qp);
+extern int siw_reap_cqe(struct siw_cq *cq, struct ib_wc *wc);
+extern void siw_wqe_put_mem(struct siw_wqe *wqe, enum siw_opcode op);
+
+/* RDMA core event dispatching */
+extern void siw_qp_event(struct siw_qp *qp, enum ib_event_type type);
+extern void siw_cq_event(struct siw_cq *cq, enum ib_event_type type);
+extern void siw_srq_event(struct siw_srq *srq, enum ib_event_type type);
+extern void siw_port_event(struct siw_device *dev, u8 port,
+			   enum ib_event_type type);
+
+static inline struct siw_qp *siw_qp_base2siw(struct ib_qp *base_qp)
+{
+	return container_of(base_qp, struct siw_qp, base_qp);
+}
+
+static inline int siw_sq_empty(struct siw_qp *qp)
+{
+	return qp->sendq[qp->sq_get % qp->attrs.sq_size].flags == 0;
+}
+
+static inline struct siw_sqe *sq_get_next(struct siw_qp *qp)
+{
+	struct siw_sqe *sqe = &qp->sendq[qp->sq_get % qp->attrs.sq_size];
+
+	if (sqe->flags & SIW_WQE_VALID)
+		return sqe;
+
+	return NULL;
+}
+
+static inline struct siw_sqe *orq_get_current(struct siw_qp *qp)
+{
+	return &qp->orq[qp->orq_get % qp->attrs.orq_size];
+}
+
+static inline struct siw_sqe *orq_get_tail(struct siw_qp *qp)
+{
+	if (likely(qp->attrs.orq_size))
+		return &qp->orq[qp->orq_put % qp->attrs.orq_size];
+
+	pr_warn("QP[%d]: ORQ has zero length", QP_ID(qp));
+	return NULL;
+}
+
+static inline struct siw_sqe *orq_get_free(struct siw_qp *qp)
+{
+	struct siw_sqe *orq_e = orq_get_tail(qp);
+
+	if (orq_e && orq_e->flags == 0)
+		return orq_e;
+
+	return NULL;
+}
+
+static inline int siw_orq_empty(struct siw_qp *qp)
+{
+	return qp->orq[qp->orq_get % qp->attrs.orq_size].flags == 0 ? 1 : 0;
+}
+
+static inline struct siw_sqe *irq_alloc_free(struct siw_qp *qp)
+{
+	struct siw_sqe *irq_e = &qp->irq[qp->irq_put % qp->attrs.irq_size];
+
+	if (irq_e->flags == 0) {
+		qp->irq_put++;
+		return irq_e;
+	}
+	return NULL;
+}
+
+static inline int siw_irq_empty(struct siw_qp *qp)
+{
+	return qp->irq[qp->irq_get % qp->attrs.irq_size].flags == 0;
+}
+
+static inline struct siw_mr *siw_mem2mr(struct siw_mem *m)
+{
+	if (!SIW_MEM_IS_MW(m))
+		return container_of(m, struct siw_mr, mem);
+	return m->mr;
+}
+
+#endif
-- 
2.13.6


* [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
  2018-01-14 22:35   ` [PATCH v3 01/13] iWARP wire packet format definition Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 02/13] Main SoftiWarp include file Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
       [not found]     ` <20180114223603.19961-4-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
  2018-01-14 22:35   ` [PATCH v3 04/13] SoftiWarp object management Bernard Metzler
                     ` (11 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_main.c | 816 +++++++++++++++++++++++++++++++++++
 1 file changed, 816 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_main.c

diff --git a/drivers/infiniband/sw/siw/siw_main.c b/drivers/infiniband/sw/siw/siw_main.c
new file mode 100644
index 000000000000..1b7fc58d4eb9
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_main.c
@@ -0,0 +1,816 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <net/net_namespace.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_arp.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/dma-mapping.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+#include "siw_verbs.h"
+#include <linux/kthread.h>
+
+MODULE_AUTHOR("Bernard Metzler");
+MODULE_DESCRIPTION("Software iWARP Driver");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.2");
+
+/* transmit from user buffer, if possible */
+const bool zcopy_tx;
+
+/* Restrict usage of GSO, if hardware peer iwarp is unable to process
+ * large packets. gso_seg_limit = 1 lets siw send only packets up to
+ * one real MTU in size, but severely limits maximum bandwidth.
+ * gso_seg_limit = 0 makes use of GSO (and more than doubles throughput
+ * for large transfers).
+ */
+const int gso_seg_limit;
+
+/* Attach siw also with loopback devices */
+const bool loopback_enabled = true;
+
+/* We try to negotiate CRC on, if true */
+const bool mpa_crc_required;
+
+/* MPA CRC on/off enforced */
+const bool mpa_crc_strict;
+
+/* Set TCP_NODELAY, and push messages asap */
+const bool siw_lowdelay = true;
+/* Set TCP_QUICKACK */
+const bool tcp_quickack;
+
+/* Select MPA version to be used during connection setup */
+u_char mpa_version = MPA_REVISION_2;
+
+/* Selects MPA P2P mode (additional handshake during connection
+ * setup, if true
+ */
+const bool peer_to_peer;
+
+static LIST_HEAD(siw_devlist);
+
+struct task_struct *siw_tx_thread[NR_CPUS];
+struct crypto_shash *siw_crypto_shash;
+
+static ssize_t show_sw_version(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct siw_device *sdev = container_of(dev, struct siw_device,
+					       base_dev.dev);
+
+	return sprintf(buf, "%x\n", sdev->attrs.version);
+}
+
+static DEVICE_ATTR(sw_version, 0444, show_sw_version, NULL);
+
+static struct device_attribute *siw_dev_attributes[] = {
+	&dev_attr_sw_version
+};
+
+static int siw_modify_port(struct ib_device *base_dev, u8 port, int mask,
+			   struct ib_port_modify *props)
+{
+	return -EOPNOTSUPP;
+}
+
+static int siw_device_register(struct siw_device *sdev)
+{
+	struct ib_device *base_dev = &sdev->base_dev;
+	int rv, i;
+	static int dev_id = 1;
+
+	rv = ib_register_device(base_dev, NULL);
+	if (rv) {
+		pr_warn("siw: %s: registration error %d\n",
+			base_dev->name, rv);
+		return rv;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(siw_dev_attributes); ++i) {
+		rv = device_create_file(&base_dev->dev, siw_dev_attributes[i]);
+		if (rv) {
+			pr_warn("siw: %s: create file error: rv=%d\n",
+				base_dev->name, rv);
+			ib_unregister_device(base_dev);
+			return rv;
+		}
+	}
+	siw_debugfs_add_device(sdev);
+
+	sdev->attrs.vendor_part_id = dev_id++;
+
+	siw_dbg(sdev, "HWaddr=%02x.%02x.%02x.%02x.%02x.%02x\n",
+		*(u8 *)sdev->netdev->dev_addr,
+		*((u8 *)sdev->netdev->dev_addr + 1),
+		*((u8 *)sdev->netdev->dev_addr + 2),
+		*((u8 *)sdev->netdev->dev_addr + 3),
+		*((u8 *)sdev->netdev->dev_addr + 4),
+		*((u8 *)sdev->netdev->dev_addr + 5));
+
+	sdev->is_registered = 1;
+
+	return 0;
+}
+
+static void siw_device_deregister(struct siw_device *sdev)
+{
+	int i;
+
+	siw_debugfs_del_device(sdev);
+
+	if (sdev->is_registered) {
+
+		siw_dbg(sdev, "deregister\n");
+
+		for (i = 0; i < ARRAY_SIZE(siw_dev_attributes); ++i)
+			device_remove_file(&sdev->base_dev.dev,
+					   siw_dev_attributes[i]);
+
+		ib_unregister_device(&sdev->base_dev);
+	}
+	if (atomic_read(&sdev->num_ctx) || atomic_read(&sdev->num_srq) ||
+	    atomic_read(&sdev->num_mr) || atomic_read(&sdev->num_cep) ||
+	    atomic_read(&sdev->num_qp) || atomic_read(&sdev->num_cq) ||
+	    atomic_read(&sdev->num_pd)) {
+		pr_warn("siw at %s: orphaned resources!\n",
+			sdev->netdev->name);
+		pr_warn("           CTX %d, SRQ %d, QP %d, CQ %d, MEM %d, CEP %d, PD %d\n",
+			atomic_read(&sdev->num_ctx),
+			atomic_read(&sdev->num_srq),
+			atomic_read(&sdev->num_qp),
+			atomic_read(&sdev->num_cq),
+			atomic_read(&sdev->num_mr),
+			atomic_read(&sdev->num_cep),
+			atomic_read(&sdev->num_pd));
+	}
+
+	while (!list_empty(&sdev->cep_list)) {
+		struct siw_cep *cep = list_entry(sdev->cep_list.next,
+						 struct siw_cep, devq);
+		list_del(&cep->devq);
+		pr_warn("siw: at %s: free orphaned CEP 0x%p, state %d\n",
+			sdev->base_dev.name, cep, cep->state);
+		kfree(cep);
+	}
+	sdev->is_registered = 0;
+}
+
+static void siw_device_destroy(struct siw_device *sdev)
+{
+	siw_dbg(sdev, "destroy device\n");
+	siw_idr_release(sdev);
+
+	kfree(sdev->base_dev.iwcm);
+	dev_put(sdev->netdev);
+
+	ib_dealloc_device(&sdev->base_dev);
+}
+
+static struct siw_device *siw_dev_from_netdev(struct net_device *dev)
+{
+	if (!list_empty(&siw_devlist)) {
+		struct list_head *pos;
+
+		list_for_each(pos, &siw_devlist) {
+			struct siw_device *sdev =
+				list_entry(pos, struct siw_device, list);
+			if (sdev->netdev == dev)
+				return sdev;
+		}
+	}
+	return NULL;
+}
+
+static int siw_create_tx_threads(void)
+{
+	int cpu, rv, assigned = 0;
+
+	for_each_online_cpu(cpu) {
+		/* Skip HT cores */
+		if (cpu % cpumask_weight(topology_sibling_cpumask(cpu))) {
+			siw_tx_thread[cpu] = NULL;
+			continue;
+		}
+		siw_tx_thread[cpu] = kthread_create(siw_run_sq,
+						   (unsigned long *)(long)cpu,
+						   "siw_tx/%d", cpu);
+		if (IS_ERR(siw_tx_thread[cpu])) {
+			rv = PTR_ERR(siw_tx_thread[cpu]);
+			siw_tx_thread[cpu] = NULL;
+			pr_info("Creating TX thread for CPU %d failed", cpu);
+			continue;
+		}
+		kthread_bind(siw_tx_thread[cpu], cpu);
+
+		wake_up_process(siw_tx_thread[cpu]);
+		assigned++;
+	}
+	return assigned;
+}
+
+static int siw_dev_qualified(struct net_device *netdev)
+{
+	/*
+	 * Additional hardware support can be added here
+	 * (e.g. ARPHRD_FDDI, ARPHRD_ATM, ...) - see
+	 * <linux/if_arp.h> for type identifiers.
+	 */
+	if (netdev->type == ARPHRD_ETHER ||
+	    netdev->type == ARPHRD_IEEE802 ||
+	    (netdev->type == ARPHRD_LOOPBACK && loopback_enabled))
+		return 1;
+
+	return 0;
+}
+
+static DEFINE_PER_CPU(atomic_t, use_cnt = ATOMIC_INIT(0));
+
+static struct {
+	struct cpumask **tx_valid_cpus;
+	int num_nodes;
+} siw_cpu_info;
+
+static int siw_init_cpulist(void)
+{
+	int i, num_nodes;
+
+	num_nodes = num_possible_nodes();
+	siw_cpu_info.num_nodes = num_nodes;
+
+	siw_cpu_info.tx_valid_cpus = kcalloc(num_nodes, sizeof(void *),
+					     GFP_KERNEL);
+	if (!siw_cpu_info.tx_valid_cpus) {
+		siw_cpu_info.num_nodes = 0;
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < siw_cpu_info.num_nodes; i++) {
+		siw_cpu_info.tx_valid_cpus[i] = kzalloc(sizeof(struct cpumask),
+							GFP_KERNEL);
+		if (!siw_cpu_info.tx_valid_cpus[i])
+			goto out_err;
+
+		cpumask_clear(siw_cpu_info.tx_valid_cpus[i]);
+	}
+	for_each_possible_cpu(i)
+		cpumask_set_cpu(i, siw_cpu_info.tx_valid_cpus[cpu_to_node(i)]);
+
+	return 0;
+
+out_err:
+	siw_cpu_info.num_nodes = 0;
+	while (i) {
+		kfree(siw_cpu_info.tx_valid_cpus[i]);
+		siw_cpu_info.tx_valid_cpus[i--] = NULL;
+	}
+	kfree(siw_cpu_info.tx_valid_cpus);
+	siw_cpu_info.tx_valid_cpus = NULL;
+
+	return -ENOMEM;
+}
+
+static void siw_destroy_cpulist(void)
+{
+	int i = 0;
+
+	while (i < siw_cpu_info.num_nodes)
+		kfree(siw_cpu_info.tx_valid_cpus[i++]);
+
+	kfree(siw_cpu_info.tx_valid_cpus);
+}
+
+/*
+ * Choose CPU with least number of active QP's from NUMA node of
+ * TX interface.
+ */
+int siw_get_tx_cpu(struct siw_device *sdev)
+{
+	const struct cpumask *tx_cpumask;
+	int i, num_cpus, cpu, tx_cpu = -1, min_use,
+	    node = sdev->numa_node;
+
+	if (node < 0)
+		tx_cpumask = cpu_online_mask;
+	else
+		tx_cpumask = siw_cpu_info.tx_valid_cpus[node];
+
+	num_cpus = cpumask_weight(tx_cpumask);
+	if (!num_cpus) {
+		/* no CPU on this NUMA node */
+		tx_cpumask = cpu_online_mask;
+		num_cpus = cpumask_weight(tx_cpumask);
+	}
+	if (!num_cpus) {
+		pr_warn("siw: no tx cpu found\n");
+		return tx_cpu;
+	}
+	cpu = cpumask_first(tx_cpumask);
+
+	for (i = 0, min_use = SIW_MAX_QP; i < num_cpus;
+	     i++, cpu = cpumask_next(cpu, tx_cpumask)) {
+		int usage;
+
+		/* Skip any cores which have no TX thread */
+		if (!siw_tx_thread[cpu])
+			continue;
+
+		usage = atomic_inc_return(&per_cpu(use_cnt, cpu));
+
+		if (usage < min_use) {
+			min_use = usage;
+			tx_cpu = cpu;
+		} else {
+			atomic_dec_return(&per_cpu(use_cnt, cpu));
+		}
+		if (min_use == 1)
+			break;
+	}
+	siw_dbg(sdev, "tx cpu %d, node %d, %d qp's\n",
+		cpu, node, min_use);
+
+	return tx_cpu;
+}
+
+void siw_put_tx_cpu(int cpu)
+{
+	atomic_dec(&per_cpu(use_cnt, cpu));
+}
+
+static void siw_verbs_sq_flush(struct ib_qp *base_qp)
+{
+	struct siw_qp *qp = siw_qp_base2siw(base_qp);
+
+	down_write(&qp->state_lock);
+	siw_sq_flush(qp);
+	up_write(&qp->state_lock);
+}
+
+static void siw_verbs_rq_flush(struct ib_qp *base_qp)
+{
+	struct siw_qp *qp = siw_qp_base2siw(base_qp);
+
+	down_write(&qp->state_lock);
+	siw_rq_flush(qp);
+	up_write(&qp->state_lock);
+}
+
+static struct ib_ah *siw_create_ah(struct ib_pd *pd, struct rdma_ah_attr *attr,
+				   struct ib_udata *udata)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static int siw_destroy_ah(struct ib_ah *ah)
+{
+	return -EOPNOTSUPP;
+}
+
+static struct siw_device *siw_device_create(struct net_device *netdev)
+{
+	struct siw_device *sdev;
+	struct ib_device *base_dev;
+	struct device *parent = netdev->dev.parent;
+
+	sdev = (struct siw_device *)ib_alloc_device(sizeof(*sdev));
+	if (!sdev)
+		goto out;
+
+	base_dev = &sdev->base_dev;
+
+	if (!parent) {
+		/*
+		 * The loopback device has no parent device,
+		 * so it appears as a top-level device. To support
+		 * loopback device connectivity, take this device
+		 * as the parent device. Skip all other devices
+		 * w/o parent device.
+		 */
+		if (netdev->type != ARPHRD_LOOPBACK) {
+			pr_warn("siw: device %s skipped (no parent dev)\n",
+				netdev->name);
+			ib_dealloc_device(base_dev);
+			sdev = NULL;
+			goto out;
+		}
+		parent = &netdev->dev;
+	}
+	base_dev->iwcm = kmalloc(sizeof(struct iw_cm_verbs), GFP_KERNEL);
+	if (!base_dev->iwcm) {
+		ib_dealloc_device(base_dev);
+		sdev = NULL;
+		goto out;
+	}
+
+	sdev->netdev = netdev;
+	list_add_tail(&sdev->list, &siw_devlist);
+
+	strcpy(base_dev->name, SIW_IBDEV_PREFIX);
+	strlcpy(base_dev->name + strlen(SIW_IBDEV_PREFIX), netdev->name,
+		IB_DEVICE_NAME_MAX - strlen(SIW_IBDEV_PREFIX));
+
+	memset(&base_dev->node_guid, 0, sizeof(base_dev->node_guid));
+
+	if (netdev->type != ARPHRD_LOOPBACK) {
+		memcpy(&base_dev->node_guid, netdev->dev_addr, 6);
+	} else {
+		/*
+		 * The loopback device does not have a HW address,
+		 * but the connection management lib expects gid != 0
+		 */
+		size_t gidlen = min_t(size_t, strlen(base_dev->name), 6);
+
+		memcpy(&base_dev->node_guid, base_dev->name, gidlen);
+	}
+	base_dev->owner = THIS_MODULE;
+
+	base_dev->uverbs_cmd_mask =
+	    (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+	    (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_REG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+	    (1ull << IB_USER_VERBS_CMD_POST_RECV) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV);
+
+	base_dev->node_type = RDMA_NODE_RNIC;
+	memcpy(base_dev->node_desc, SIW_NODE_DESC_COMMON,
+	       sizeof(SIW_NODE_DESC_COMMON));
+
+	/*
+	 * Current model (one-to-one device association):
+	 * One Softiwarp device per net_device or, equivalently,
+	 * per physical port.
+	 */
+	base_dev->phys_port_cnt = 1;
+
+	base_dev->dev.parent = parent;
+	base_dev->dev.dma_ops = &dma_virt_ops;
+
+	base_dev->num_comp_vectors = num_possible_cpus();
+	base_dev->query_device = siw_query_device;
+	base_dev->query_port = siw_query_port;
+	base_dev->get_port_immutable = siw_get_port_immutable;
+	base_dev->query_qp = siw_query_qp;
+	base_dev->modify_port = siw_modify_port;
+	base_dev->query_pkey = siw_query_pkey;
+	base_dev->query_gid = siw_query_gid;
+	base_dev->alloc_ucontext = siw_alloc_ucontext;
+	base_dev->dealloc_ucontext = siw_dealloc_ucontext;
+	base_dev->mmap = siw_mmap;
+	base_dev->alloc_pd = siw_alloc_pd;
+	base_dev->dealloc_pd = siw_dealloc_pd;
+	base_dev->create_ah = siw_create_ah;
+	base_dev->destroy_ah = siw_destroy_ah;
+	base_dev->create_qp = siw_create_qp;
+	base_dev->modify_qp = siw_verbs_modify_qp;
+	base_dev->destroy_qp = siw_destroy_qp;
+	base_dev->create_cq = siw_create_cq;
+	base_dev->destroy_cq = siw_destroy_cq;
+	base_dev->resize_cq = NULL;
+	base_dev->poll_cq = siw_poll_cq;
+	base_dev->get_dma_mr = siw_get_dma_mr;
+	base_dev->reg_user_mr = siw_reg_user_mr;
+	base_dev->dereg_mr = siw_dereg_mr;
+	base_dev->alloc_mr = siw_alloc_mr;
+	base_dev->map_mr_sg = siw_map_mr_sg;
+	base_dev->dealloc_mw = NULL;
+
+	base_dev->create_srq = siw_create_srq;
+	base_dev->modify_srq = siw_modify_srq;
+	base_dev->query_srq = siw_query_srq;
+	base_dev->destroy_srq = siw_destroy_srq;
+	base_dev->post_srq_recv = siw_post_srq_recv;
+
+	base_dev->attach_mcast = NULL;
+	base_dev->detach_mcast = NULL;
+	base_dev->process_mad = siw_no_mad;
+
+	base_dev->req_notify_cq = siw_req_notify_cq;
+	base_dev->post_send = siw_post_send;
+	base_dev->post_recv = siw_post_receive;
+
+	base_dev->drain_sq = siw_verbs_sq_flush;
+	base_dev->drain_rq = siw_verbs_rq_flush;
+
+	base_dev->iwcm->connect = siw_connect;
+	base_dev->iwcm->accept = siw_accept;
+	base_dev->iwcm->reject = siw_reject;
+	base_dev->iwcm->create_listen = siw_create_listen;
+	base_dev->iwcm->destroy_listen = siw_destroy_listen;
+	base_dev->iwcm->add_ref = siw_qp_get_ref;
+	base_dev->iwcm->rem_ref = siw_qp_put_ref;
+	base_dev->iwcm->get_qp = siw_get_base_qp;
+
+	sdev->attrs.version = VERSION_ID_SOFTIWARP;
+	sdev->attrs.vendor_id = SIW_VENDOR_ID;
+	sdev->attrs.sw_version = VERSION_ID_SOFTIWARP;
+	sdev->attrs.max_qp = SIW_MAX_QP;
+	sdev->attrs.max_qp_wr = SIW_MAX_QP_WR;
+	sdev->attrs.max_ord = SIW_MAX_ORD_QP;
+	sdev->attrs.max_ird = SIW_MAX_IRD_QP;
+	sdev->attrs.cap_flags = IB_DEVICE_MEM_MGT_EXTENSIONS;
+	sdev->attrs.max_sge = SIW_MAX_SGE;
+	sdev->attrs.max_sge_rd = SIW_MAX_SGE_RD;
+	sdev->attrs.max_cq = SIW_MAX_CQ;
+	sdev->attrs.max_cqe = SIW_MAX_CQE;
+	sdev->attrs.max_mr = SIW_MAX_MR;
+	sdev->attrs.max_mr_size = rlimit(RLIMIT_MEMLOCK);
+	sdev->attrs.max_pd = SIW_MAX_PD;
+	sdev->attrs.max_mw = SIW_MAX_MW;
+	sdev->attrs.max_fmr = SIW_MAX_FMR;
+	sdev->attrs.max_srq = SIW_MAX_SRQ;
+	sdev->attrs.max_srq_wr = SIW_MAX_SRQ_WR;
+	sdev->attrs.max_srq_sge = SIW_MAX_SGE;
+
+	siw_idr_init(sdev);
+	INIT_LIST_HEAD(&sdev->cep_list);
+	INIT_LIST_HEAD(&sdev->qp_list);
+	INIT_LIST_HEAD(&sdev->mr_list);
+
+	atomic_set(&sdev->num_ctx, 0);
+	atomic_set(&sdev->num_srq, 0);
+	atomic_set(&sdev->num_qp, 0);
+	atomic_set(&sdev->num_cq, 0);
+	atomic_set(&sdev->num_mr, 0);
+	atomic_set(&sdev->num_pd, 0);
+	atomic_set(&sdev->num_cep, 0);
+
+	sdev->numa_node = dev_to_node(parent);
+
+	sdev->is_registered = 0;
+out:
+	if (sdev)
+		dev_hold(netdev);
+
+	return sdev;
+}
+
+static int siw_netdev_event(struct notifier_block *nb, unsigned long event,
+			    void *arg)
+{
+	struct net_device	*netdev = netdev_notifier_info_to_dev(arg);
+	struct in_device	*in_dev;
+	struct siw_device	*sdev;
+
+	dev_dbg(&netdev->dev, "siw: event %lu\n", event);
+
+	if (dev_net(netdev) != &init_net)
+		goto done;
+
+	sdev = siw_dev_from_netdev(netdev);
+
+	switch (event) {
+
+	case NETDEV_UP:
+		if (!sdev)
+			break;
+
+		if (sdev->is_registered) {
+			sdev->state = IB_PORT_ACTIVE;
+			siw_port_event(sdev, 1, IB_EVENT_PORT_ACTIVE);
+			break;
+		}
+		in_dev = in_dev_get(netdev);
+		if (!in_dev) {
+			dev_dbg(&netdev->dev, "siw: no in_device\n");
+			sdev->state = IB_PORT_INIT;
+			break;
+		}
+		if (in_dev->ifa_list) {
+			sdev->state = IB_PORT_ACTIVE;
+			if (siw_device_register(sdev))
+				sdev->state = IB_PORT_INIT;
+		} else {
+			dev_dbg(&netdev->dev, "siw: no ifa_list\n");
+			sdev->state = IB_PORT_INIT;
+		}
+		in_dev_put(in_dev);
+
+		break;
+
+	case NETDEV_DOWN:
+		if (sdev && sdev->is_registered) {
+			sdev->state = IB_PORT_DOWN;
+			siw_port_event(sdev, 1, IB_EVENT_PORT_ERR);
+			break;
+		}
+		break;
+
+	case NETDEV_REGISTER:
+		if (!sdev) {
+			if (!siw_dev_qualified(netdev))
+				break;
+
+			sdev = siw_device_create(netdev);
+			if (sdev) {
+				sdev->state = IB_PORT_INIT;
+				dev_dbg(&netdev->dev, "siw: new device\n");
+			}
+		}
+		break;
+
+	case NETDEV_UNREGISTER:
+		if (sdev) {
+			if (sdev->is_registered)
+				siw_device_deregister(sdev);
+			list_del(&sdev->list);
+			siw_device_destroy(sdev);
+		}
+		break;
+
+	case NETDEV_CHANGEADDR:
+		if (sdev && sdev->is_registered)
+			siw_port_event(sdev, 1, IB_EVENT_LID_CHANGE);
+
+		break;
+	/*
+	 * Todo: Below netdev events are currently not handled.
+	 */
+	case NETDEV_CHANGEMTU:
+	case NETDEV_GOING_DOWN:
+	case NETDEV_CHANGE:
+
+		break;
+
+	default:
+		break;
+	}
+done:
+	return NOTIFY_OK;
+}
+
+static struct notifier_block siw_netdev_nb = {
+	.notifier_call = siw_netdev_event,
+};
+
+/*
+ * siw_init_module - Initialize Softiwarp module and register with netdev
+ *                   subsystem to create Softiwarp devices per net_device
+ */
+static __init int siw_init_module(void)
+{
+	int rv;
+	int nr_cpu;
+
+	if (SENDPAGE_THRESH < SIW_MAX_INLINE) {
+		pr_info("siw: sendpage threshold too small: %d\n",
+			(int)SENDPAGE_THRESH);
+		rv = -EINVAL;
+		goto out_error;
+	}
+	rv = siw_init_cpulist();
+	if (rv)
+		goto out_error;
+
+	rv = siw_cm_init();
+	if (rv)
+		goto out_error;
+
+	siw_debug_init();
+
+	/*
+	 * Allocate CRC SHASH object. Fail loading siw only if CRC
+	 * computation is required by the kernel module (mpa_crc_required).
+	 */
+	siw_crypto_shash = crypto_alloc_shash("crc32c", 0, 0);
+	if (IS_ERR(siw_crypto_shash)) {
+		rv = PTR_ERR(siw_crypto_shash);
+		pr_info("siw: Loading CRC32c failed: %d\n", rv);
+		siw_crypto_shash = NULL;
+		if (mpa_crc_required)
+			goto out_error;
+	}
+	rv = register_netdevice_notifier(&siw_netdev_nb);
+	if (rv) {
+		siw_debugfs_delete();
+		goto out_error;
+	}
+	for (nr_cpu = 0; nr_cpu < nr_cpu_ids; nr_cpu++)
+		siw_tx_thread[nr_cpu] = NULL;
+
+	if (!siw_create_tx_threads()) {
+		pr_info("siw: Could not start any TX thread\n");
+		rv = -ENOMEM;
+		unregister_netdevice_notifier(&siw_netdev_nb);
+		goto out_error;
+	}
+	pr_info("SoftiWARP attached\n");
+	return 0;
+
+out_error:
+	for (nr_cpu = 0; nr_cpu < nr_cpu_ids; nr_cpu++) {
+		if (siw_tx_thread[nr_cpu]) {
+			siw_stop_tx_thread(nr_cpu);
+			siw_tx_thread[nr_cpu] = NULL;
+		}
+	}
+	if (siw_crypto_shash)
+		crypto_free_shash(siw_crypto_shash);
+
+	pr_info("SoftIWARP attach failed. Error: %d\n", rv);
+
+	siw_cm_exit();
+	siw_destroy_cpulist();
+
+	return rv;
+}
+
+static void __exit siw_exit_module(void)
+{
+	int nr_cpu;
+
+	for (nr_cpu = 0; nr_cpu < nr_cpu_ids; nr_cpu++) {
+		if (siw_tx_thread[nr_cpu]) {
+			siw_stop_tx_thread(nr_cpu);
+			siw_tx_thread[nr_cpu] = NULL;
+		}
+	}
+	unregister_netdevice_notifier(&siw_netdev_nb);
+
+	siw_cm_exit();
+
+	while (!list_empty(&siw_devlist)) {
+		struct siw_device *sdev =
+			list_entry(siw_devlist.next, struct siw_device, list);
+		list_del(&sdev->list);
+		if (sdev->is_registered)
+			siw_device_deregister(sdev);
+
+		siw_device_destroy(sdev);
+	}
+	if (siw_crypto_shash)
+		crypto_free_shash(siw_crypto_shash);
+
+	siw_debugfs_delete();
+	siw_destroy_cpulist();
+
+	pr_info("SoftiWARP detached\n");
+}
+
+module_init(siw_init_module);
+module_exit(siw_exit_module);
-- 
2.13.6


* [PATCH v3 04/13] SoftiWarp object management
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (2 preceding siblings ...)
  2018-01-14 22:35   ` [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 05/13] SoftiWarp application interface Bernard Metzler
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_obj.c | 338 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_obj.h | 200 +++++++++++++++++++++
 2 files changed, 538 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_obj.c
 create mode 100644 drivers/infiniband/sw/siw/siw_obj.h

diff --git a/drivers/infiniband/sw/siw/siw_obj.c b/drivers/infiniband/sw/siw/siw_obj.c
new file mode 100644
index 000000000000..9ae05345dd64
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_obj.c
@@ -0,0 +1,338 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/spinlock.h>
+#include <linux/kref.h>
+#include <linux/vmalloc.h>
+
+#include "siw.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+
+void siw_objhdr_init(struct siw_objhdr *hdr)
+{
+	kref_init(&hdr->ref);
+}
+
+void siw_idr_init(struct siw_device *sdev)
+{
+	spin_lock_init(&sdev->idr_lock);
+
+	idr_init(&sdev->qp_idr);
+	idr_init(&sdev->cq_idr);
+	idr_init(&sdev->pd_idr);
+	idr_init(&sdev->mem_idr);
+}
+
+void siw_idr_release(struct siw_device *sdev)
+{
+	idr_destroy(&sdev->qp_idr);
+	idr_destroy(&sdev->cq_idr);
+	idr_destroy(&sdev->pd_idr);
+	idr_destroy(&sdev->mem_idr);
+}
+
+static int siw_add_obj(spinlock_t *lock, struct idr *idr,
+		       struct siw_objhdr *obj)
+{
+	unsigned long flags;
+	int id, pre_id;
+
+	do {
+		get_random_bytes(&pre_id, sizeof(pre_id));
+		pre_id &= 0xffffff;
+	} while (pre_id == 0);
+again:
+	spin_lock_irqsave(lock, flags);
+	id = idr_alloc(idr, obj, pre_id, 0xffffff - 1, GFP_KERNEL);
+	spin_unlock_irqrestore(lock, flags);
+
+	if (id > 0) {
+		siw_objhdr_init(obj);
+		obj->id = id;
+	} else if (id == -ENOSPC && pre_id != 1) {
+		pre_id = 1;
+		goto again;
+	} else {
+		pr_warn("SIW: idr new object failed!\n");
+	}
+	return id > 0 ? 0 : id;
+}
+
+int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp)
+{
+	int rv = siw_add_obj(&sdev->idr_lock, &sdev->qp_idr, &qp->hdr);
+
+	if (!rv) {
+		qp->hdr.sdev = sdev;
+		siw_dbg_obj(qp, "new qp\n");
+	}
+	return rv;
+}
+
+int siw_cq_add(struct siw_device *sdev, struct siw_cq *cq)
+{
+	int rv = siw_add_obj(&sdev->idr_lock, &sdev->cq_idr, &cq->hdr);
+
+	if (!rv) {
+		cq->hdr.sdev = sdev;
+		siw_dbg_obj(cq, "new cq\n");
+	}
+	return rv;
+}
+
+int siw_pd_add(struct siw_device *sdev, struct siw_pd *pd)
+{
+	int rv = siw_add_obj(&sdev->idr_lock, &sdev->pd_idr, &pd->hdr);
+
+	if (!rv) {
+		pd->hdr.sdev = sdev;
+		siw_dbg_obj(pd, "new pd\n");
+	}
+	return rv;
+}
+
+/*
+ * STag lookup is based on its index part only (24 bits).
+ * The code avoids the special STag of zero and randomizes
+ * STag index values between 1 and SIW_STAG_MAX.
+ */
+int siw_mem_add(struct siw_device *sdev, struct siw_mem *m)
+{
+	unsigned long flags;
+	int id, pre_id;
+
+	do {
+		get_random_bytes(&pre_id, sizeof(pre_id));
+		pre_id &= 0xffffff;
+	} while (pre_id == 0);
+again:
+	spin_lock_irqsave(&sdev->idr_lock, flags);
+	id = idr_alloc(&sdev->mem_idr, m, pre_id, SIW_STAG_MAX, GFP_KERNEL);
+	spin_unlock_irqrestore(&sdev->idr_lock, flags);
+
+	if (id == -ENOSPC || id > SIW_STAG_MAX) {
+		if (pre_id == 1) {
+			pr_warn("SIW: idr new memory object failed!\n");
+			return -ENOSPC;
+		}
+		pre_id = 1;
+		goto again;
+	}
+	siw_objhdr_init(&m->hdr);
+	m->hdr.id = id;
+	m->hdr.sdev = sdev;
+
+	siw_dbg_obj(m, "new mem\n");
+
+	return 0;
+}
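
The id allocated above becomes the 24-bit index part of an STag, with the
low 8 bits carrying a key byte. A hedged sketch of that split (helper names
are illustrative only; the stag >> 8 lookup matches siw_invalidate_stag()
below):

	/* illustration only: compose an STag from the IDR index and a key */
	static inline u32 stag_from_index(u32 idx, u8 key)
	{
		return (idx << 8) | key;
	}

	/* recover the index part used for siw_mem_id2obj() lookups */
	static inline u32 stag_to_index(u32 stag)
	{
		return stag >> 8;
	}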
+
+void siw_remove_obj(spinlock_t *lock, struct idr *idr,
+		      struct siw_objhdr *hdr)
+{
+	unsigned long	flags;
+
+	spin_lock_irqsave(lock, flags);
+	idr_remove(idr, hdr->id);
+	spin_unlock_irqrestore(lock, flags);
+}
+
+/********** routines to put objs back and free if no ref left *****/
+
+void siw_free_cq(struct kref *ref)
+{
+	struct siw_cq *cq =
+		(container_of(container_of(ref, struct siw_objhdr, ref),
+			      struct siw_cq, hdr));
+
+	siw_dbg_obj(cq, "free cq\n");
+
+	atomic_dec(&cq->hdr.sdev->num_cq);
+	vfree(cq->queue);
+	kfree(cq);
+}
+
+void siw_free_qp(struct kref *ref)
+{
+	struct siw_qp *qp =
+		container_of(container_of(ref, struct siw_objhdr, ref),
+			     struct siw_qp, hdr);
+	struct siw_device *sdev = qp->hdr.sdev;
+	unsigned long flags;
+
+	siw_dbg_obj(qp, "free qp\n");
+
+	if (qp->cep)
+		siw_cep_put(qp->cep);
+
+	siw_remove_obj(&sdev->idr_lock, &sdev->qp_idr, &qp->hdr);
+
+	spin_lock_irqsave(&sdev->idr_lock, flags);
+	list_del(&qp->devq);
+	spin_unlock_irqrestore(&sdev->idr_lock, flags);
+
+	vfree(qp->sendq);
+	vfree(qp->recvq);
+	vfree(qp->irq);
+	vfree(qp->orq);
+
+	siw_put_tx_cpu(qp->tx_cpu);
+
+	atomic_dec(&sdev->num_qp);
+	kfree(qp);
+}
+
+void siw_free_pd(struct kref *ref)
+{
+	struct siw_pd	*pd =
+		container_of(container_of(ref, struct siw_objhdr, ref),
+			     struct siw_pd, hdr);
+
+	siw_dbg_obj(pd, "free pd\n");
+
+	atomic_dec(&pd->hdr.sdev->num_pd);
+	kfree(pd);
+}
+
+void siw_free_mem(struct kref *ref)
+{
+	struct siw_mem *m;
+	struct siw_device *sdev;
+
+	m = container_of(container_of(ref, struct siw_objhdr, ref),
+			 struct siw_mem, hdr);
+	sdev = m->hdr.sdev;
+
+	siw_dbg_obj(m, "free mem\n");
+
+	atomic_dec(&sdev->num_mr);
+
+	if (SIW_MEM_IS_MW(m)) {
+		struct siw_mw *mw = container_of(m, struct siw_mw, mem);
+
+		kfree_rcu(mw, rcu);
+	} else {
+		struct siw_mr *mr = container_of(m, struct siw_mr, mem);
+		unsigned long flags;
+
+		siw_dbg(m->hdr.sdev, "[MEM %d]: has pbl: %s\n",
+			OBJ_ID(m), mr->mem.is_pbl ? "y" : "n");
+
+		if (mr->pd)
+			siw_pd_put(mr->pd);
+
+		if (mr->mem_obj) {
+			if (mr->mem.is_pbl == 0)
+				siw_umem_release(mr->umem);
+			else
+				siw_pbl_free(mr->pbl);
+
+		}
+		spin_lock_irqsave(&sdev->idr_lock, flags);
+		list_del(&mr->devq);
+		spin_unlock_irqrestore(&sdev->idr_lock, flags);
+
+		kfree_rcu(mr, rcu);
+	}
+}
+
+/***** routines for WQE handling ***/
+
+void siw_wqe_put_mem(struct siw_wqe *wqe, enum siw_opcode op)
+{
+	switch (op) {
+
+	case SIW_OP_SEND:
+	case SIW_OP_WRITE:
+	case SIW_OP_SEND_WITH_IMM:
+	case SIW_OP_SEND_REMOTE_INV:
+	case SIW_OP_READ:
+	case SIW_OP_READ_LOCAL_INV:
+		if (!(wqe->sqe.flags & SIW_WQE_INLINE))
+			siw_unref_mem_sgl(wqe->mem, wqe->sqe.num_sge);
+		break;
+
+	case SIW_OP_RECEIVE:
+		siw_unref_mem_sgl(wqe->mem, wqe->rqe.num_sge);
+		break;
+
+	case SIW_OP_READ_RESPONSE:
+		siw_unref_mem_sgl(wqe->mem, 1);
+		break;
+
+	default:
+		/*
+		 * SIW_OP_INVAL_STAG and SIW_OP_REG_MR
+		 * do not hold memory references
+		 */
+		break;
+	}
+}
+
+int siw_invalidate_stag(struct siw_pd *pd, u32 stag)
+{
+	u32 stag_idx = stag >> 8;
+	struct siw_mem *mem = siw_mem_id2obj(pd->hdr.sdev, stag_idx);
+	int rv = 0;
+
+	if (unlikely(!mem)) {
+		siw_dbg(pd->hdr.sdev, "stag %u unknown\n", stag_idx);
+		return -EINVAL;
+	}
+	if (unlikely(siw_mem2mr(mem)->pd != pd)) {
+		siw_dbg(pd->hdr.sdev, "pd mismatch for stag %u\n", stag_idx);
+		rv = -EACCES;
+		goto out;
+	}
+	/*
+	 * Per RDMA verbs definition, an STag may already be in invalid
+	 * state if invalidation is requested. So no state check here.
+	 */
+	mem->stag_valid = 0;
+
+	siw_dbg(pd->hdr.sdev, "stag %u now valid\n", stag_idx);
+out:
+	siw_mem_put(mem);
+	return rv;
+}
diff --git a/drivers/infiniband/sw/siw/siw_obj.h b/drivers/infiniband/sw/siw/siw_obj.h
new file mode 100644
index 000000000000..c76f0f4d4b91
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_obj.h
@@ -0,0 +1,200 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _SIW_OBJ_H
+#define _SIW_OBJ_H
+
+#include <linux/idr.h>
+#include <linux/rwsem.h>
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/semaphore.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "siw_debug.h"
+
+extern void siw_free_qp(struct kref *ref);
+extern void siw_free_pd(struct kref *ref);
+extern void siw_free_cq(struct kref *ref);
+extern void siw_free_mem(struct kref *ref);
+
+static inline struct siw_device *siw_dev_base2siw(struct ib_device *base_dev)
+{
+	return container_of(base_dev, struct siw_device, base_dev);
+}
+
+static inline struct siw_mr *siw_mr_base2siw(struct ib_mr *base_mr)
+{
+	return container_of(base_mr, struct siw_mr, base_mr);
+}
+
+static inline void siw_cq_get(struct siw_cq *cq)
+{
+	kref_get(&cq->hdr.ref);
+	siw_dbg_obj(cq, "new refcount: %d\n", refcount_read(&cq->hdr.ref));
+}
+static inline void siw_qp_get(struct siw_qp *qp)
+{
+	kref_get(&qp->hdr.ref);
+	siw_dbg_obj(qp, "new refcount: %d\n", refcount_read(&qp->hdr.ref));
+}
+
+static inline void siw_pd_get(struct siw_pd *pd)
+{
+	kref_get(&pd->hdr.ref);
+	siw_dbg_obj(pd, "new refcount: %d\n", refcount_read(&pd->hdr.ref));
+}
+
+static inline void siw_mem_get(struct siw_mem *mem)
+{
+	kref_get(&mem->hdr.ref);
+	siw_dbg_obj(mem, "new refcount: %d\n", refcount_read(&mem->hdr.ref));
+}
+
+static inline void siw_mem_put(struct siw_mem *mem)
+{
+	siw_dbg_obj(mem, "old refcount: %d\n", refcount_read(&mem->hdr.ref));
+	kref_put(&mem->hdr.ref, siw_free_mem);
+}
+
+static inline void siw_unref_mem_sgl(struct siw_mem **mem,
+				     unsigned int num_sge)
+{
+	while (num_sge) {
+		if (*mem == NULL)
+			break;
+
+		siw_mem_put(*mem);
+		*mem = NULL;
+		mem++;
+		num_sge--;
+	}
+}
+
+static inline struct siw_objhdr *siw_get_obj(struct idr *idr, int id)
+{
+	struct siw_objhdr *obj = idr_find(idr, id);
+
+	if (obj)
+		kref_get(&obj->ref);
+
+	return obj;
+}
+
+static inline struct siw_cq *siw_cq_id2obj(struct siw_device *sdev, int id)
+{
+	struct siw_objhdr *obj = siw_get_obj(&sdev->cq_idr, id);
+
+	if (obj)
+		return container_of(obj, struct siw_cq, hdr);
+
+	return NULL;
+}
+
+static inline struct siw_qp *siw_qp_id2obj(struct siw_device *sdev, int id)
+{
+	struct siw_objhdr *obj = siw_get_obj(&sdev->qp_idr, id);
+
+	if (obj)
+		return container_of(obj, struct siw_qp, hdr);
+
+	return NULL;
+}
+
+static inline void siw_cq_put(struct siw_cq *cq)
+{
+	siw_dbg_obj(cq, "old refcount: %d\n", refcount_read(&cq->hdr.ref));
+	kref_put(&cq->hdr.ref, siw_free_cq);
+}
+
+static inline void siw_qp_put(struct siw_qp *qp)
+{
+	siw_dbg_obj(qp, "old refcount: %d\n", refcount_read(&qp->hdr.ref));
+	kref_put(&qp->hdr.ref, siw_free_qp);
+}
+
+static inline void siw_pd_put(struct siw_pd *pd)
+{
+	siw_dbg_obj(pd, "old refcount: %d\n", refcount_read(&pd->hdr.ref));
+	kref_put(&pd->hdr.ref, siw_free_pd);
+}
+
+/*
+ * siw_mem_id2obj()
+ *
+ * Resolves memory from an STag index given by id. Might be called from:
+ * o process context before sending out of sgl, or
+ * o in softirq when resolving target memory
+ */
+static inline struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int id)
+{
+	struct siw_objhdr *obj;
+	struct siw_mem *mem = NULL;
+
+	rcu_read_lock();
+	obj = siw_get_obj(&sdev->mem_idr, id);
+	rcu_read_unlock();
+
+	if (likely(obj)) {
+		mem = container_of(obj, struct siw_mem, hdr);
+		siw_dbg_obj(mem, "new refcount: %d\n",
+			    refcount_read(&obj->ref));
+	}
+	return mem;
+}
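
A hedged sketch of the lookup/put pairing this helper expects from its
callers, mirroring siw_invalidate_stag() in siw_obj.c (variable names are
placeholders):

	struct siw_mem *mem = siw_mem_id2obj(sdev, stag >> 8);

	if (!mem)
		return -EINVAL;
	/* ... use the object under the reference taken by the lookup ... */
	siw_mem_put(mem);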
+
+extern void siw_remove_obj(spinlock_t *lock, struct idr *idr,
+				struct siw_objhdr *hdr);
+
+extern void siw_objhdr_init(struct siw_objhdr *hdr);
+extern void siw_idr_init(struct siw_device *dev);
+extern void siw_idr_release(struct siw_device *dev);
+
+extern struct siw_cq *siw_cq_id2obj(struct siw_device *dev, int id);
+extern struct siw_qp *siw_qp_id2obj(struct siw_device *dev, int id);
+extern struct siw_mem *siw_mem_id2obj(struct siw_device *dev, int id);
+
+extern int siw_qp_add(struct siw_device *dev, struct siw_qp *qp);
+extern int siw_cq_add(struct siw_device *dev, struct siw_cq *cq);
+extern int siw_pd_add(struct siw_device *dev, struct siw_pd *pd);
+extern int siw_mem_add(struct siw_device *dev, struct siw_mem *mem);
+
+extern struct siw_wqe *siw_freeq_wqe_get(struct siw_qp *qp);
+
+extern int siw_invalidate_stag(struct siw_pd *pd, u32 stag);
+#endif
-- 
2.13.6


* [PATCH v3 05/13] SoftiWarp application interface
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (3 preceding siblings ...)
  2018-01-14 22:35   ` [PATCH v3 04/13] SoftiWarp object management Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
       [not found]     ` <20180114223603.19961-6-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
  2018-01-23 18:07     ` Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 06/13] SoftiWarp connection management Bernard Metzler
                     ` (9 subsequent siblings)
  14 siblings, 2 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_ae.c    |  120 +++
 drivers/infiniband/sw/siw/siw_verbs.c | 1876 +++++++++++++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_verbs.h |  119 +++
 include/uapi/rdma/siw_user.h          |  216 ++++
 4 files changed, 2331 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_ae.c
 create mode 100644 drivers/infiniband/sw/siw/siw_verbs.c
 create mode 100644 drivers/infiniband/sw/siw/siw_verbs.h
 create mode 100644 include/uapi/rdma/siw_user.h

diff --git a/drivers/infiniband/sw/siw/siw_ae.c b/drivers/infiniband/sw/siw/siw_ae.c
new file mode 100644
index 000000000000..eac55ce7f56d
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_ae.c
@@ -0,0 +1,120 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/net.h>
+#include <linux/scatterlist.h>
+#include <linux/highmem.h>
+#include <net/sock.h>
+#include <net/tcp_states.h>
+#include <net/tcp.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+
+void siw_qp_event(struct siw_qp *qp, enum ib_event_type etype)
+{
+	struct ib_event event;
+	struct ib_qp	*base_qp = &qp->base_qp;
+
+	/*
+	 * Do not report asynchronous errors on QP which gets
+	 * destroyed via verbs interface (siw_destroy_qp())
+	 */
+	if (qp->attrs.flags & SIW_QP_IN_DESTROY)
+		return;
+
+	event.event = etype;
+	event.device = base_qp->device;
+	event.element.qp = base_qp;
+
+	if (base_qp->event_handler) {
+		siw_dbg_qp(qp, "reporting event %d\n", etype);
+		(*base_qp->event_handler)(&event, base_qp->qp_context);
+	}
+}
+
+void siw_cq_event(struct siw_cq *cq, enum ib_event_type etype)
+{
+	struct ib_event event;
+	struct ib_cq	*base_cq = &cq->base_cq;
+
+	event.event = etype;
+	event.device = base_cq->device;
+	event.element.cq = base_cq;
+
+	if (base_cq->event_handler) {
+		siw_dbg(cq->hdr.sdev, "reporting CQ event %d\n", etype);
+		(*base_cq->event_handler)(&event, base_cq->cq_context);
+	}
+}
+
+void siw_srq_event(struct siw_srq *srq, enum ib_event_type etype)
+{
+	struct ib_event event;
+	struct ib_srq	*base_srq = &srq->base_srq;
+
+	event.event = etype;
+	event.device = base_srq->device;
+	event.element.srq = base_srq;
+
+	if (base_srq->event_handler) {
+		siw_dbg(srq->pd->hdr.sdev, "reporting SRQ event %d\n", etype);
+		(*base_srq->event_handler)(&event, base_srq->srq_context);
+	}
+}
+
+void siw_port_event(struct siw_device *sdev, u8 port, enum ib_event_type etype)
+{
+	struct ib_event event;
+
+	event.event = etype;
+	event.device = &sdev->base_dev;
+	event.element.port_num = port;
+
+	siw_dbg(sdev, "reporting port event %d\n", etype);
+
+	ib_dispatch_event(&event);
+}
diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
new file mode 100644
index 000000000000..667b004e5c6d
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -0,0 +1,1876 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_verbs.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+
+static int ib_qp_state_to_siw_qp_state[IB_QPS_ERR+1] = {
+	[IB_QPS_RESET]	= SIW_QP_STATE_IDLE,
+	[IB_QPS_INIT]	= SIW_QP_STATE_IDLE,
+	[IB_QPS_RTR]	= SIW_QP_STATE_RTR,
+	[IB_QPS_RTS]	= SIW_QP_STATE_RTS,
+	[IB_QPS_SQD]	= SIW_QP_STATE_CLOSING,
+	[IB_QPS_SQE]	= SIW_QP_STATE_TERMINATE,
+	[IB_QPS_ERR]	= SIW_QP_STATE_ERROR
+};
+
+static struct siw_pd *siw_pd_base2siw(struct ib_pd *base_pd)
+{
+	return container_of(base_pd, struct siw_pd, base_pd);
+}
+
+static struct siw_ucontext *siw_ctx_base2siw(struct ib_ucontext *base_ctx)
+{
+	return container_of(base_ctx, struct siw_ucontext, ib_ucontext);
+}
+
+static struct siw_cq *siw_cq_base2siw(struct ib_cq *base_cq)
+{
+	return container_of(base_cq, struct siw_cq, base_cq);
+}
+
+static struct siw_srq *siw_srq_base2siw(struct ib_srq *base_srq)
+{
+	return container_of(base_srq, struct siw_srq, base_srq);
+}
+
+static u32 siw_insert_uobj(struct siw_ucontext *uctx, void *vaddr, u32 size)
+{
+	struct siw_uobj *uobj;
+	u32	key;
+
+	uobj = kzalloc(sizeof(*uobj), GFP_KERNEL);
+	if (!uobj)
+		return SIW_INVAL_UOBJ_KEY;
+
+	size = PAGE_ALIGN(size);
+
+	spin_lock(&uctx->uobj_lock);
+
+	if (list_empty(&uctx->uobj_list))
+		uctx->uobj_key = 0;
+
+	key = uctx->uobj_key;
+	if (key > SIW_MAX_UOBJ_KEY) {
+		spin_unlock(&uctx->uobj_lock);
+		kfree(uobj);
+		return SIW_INVAL_UOBJ_KEY;
+	}
+	uobj->key = key;
+	uobj->size = size;
+	uobj->addr = vaddr;
+
+	uctx->uobj_key += size; /* advance for next object */
+
+	list_add_tail(&uobj->list, &uctx->uobj_list);
+
+	spin_unlock(&uctx->uobj_lock);
+
+	return key;
+}
+
+static struct siw_uobj *siw_remove_uobj(struct siw_ucontext *uctx, u32 key,
+					u32 size)
+{
+	struct list_head *pos, *nxt;
+
+	spin_lock(&uctx->uobj_lock);
+
+	list_for_each_safe(pos, nxt, &uctx->uobj_list) {
+		struct siw_uobj *uobj = list_entry(pos, struct siw_uobj, list);
+
+		if (uobj->key == key && uobj->size == size) {
+			list_del(&uobj->list);
+			spin_unlock(&uctx->uobj_lock);
+			return uobj;
+		}
+	}
+	spin_unlock(&uctx->uobj_lock);
+
+	return NULL;
+}
+
+int siw_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma)
+{
+	struct siw_ucontext     *uctx = siw_ctx_base2siw(ctx);
+	struct siw_uobj         *uobj;
+	u32     key = vma->vm_pgoff << PAGE_SHIFT;
+	int     size = vma->vm_end - vma->vm_start;
+	int     rv = -EINVAL;
+
+	/*
+	 * Must be page aligned
+	 */
+	if (vma->vm_start & (PAGE_SIZE - 1)) {
+		pr_warn("map not page aligned\n");
+		goto out;
+	}
+
+	uobj = siw_remove_uobj(uctx, key, size);
+	if (!uobj) {
+		pr_warn("mmap lookup failed: %u, %d\n", key, size);
+		goto out;
+	}
+	rv = remap_vmalloc_range(vma, uobj->addr, 0);
+	if (rv)
+		pr_warn("remap_vmalloc_range failed: %u, %d\n", key, size);
+
+	kfree(uobj);
+out:
+	return rv;
+}
+
+struct ib_ucontext *siw_alloc_ucontext(struct ib_device *base_dev,
+				       struct ib_udata *udata)
+{
+	struct siw_ucontext *ctx = NULL;
+	struct siw_device *sdev = siw_dev_base2siw(base_dev);
+	int rv;
+
+	if (atomic_inc_return(&sdev->num_ctx) > SIW_MAX_CONTEXT) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	spin_lock_init(&ctx->uobj_lock);
+	INIT_LIST_HEAD(&ctx->uobj_list);
+	ctx->uobj_key = 0;
+
+	ctx->sdev = sdev;
+	if (udata) {
+		struct siw_uresp_alloc_ctx uresp;
+
+		memset(&uresp, 0, sizeof(uresp));
+		uresp.dev_id = sdev->attrs.vendor_part_id;
+
+		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		if (rv)
+			goto err_out;
+	}
+	siw_dbg(sdev, "success. now %d context(s)\n",
+		atomic_read(&sdev->num_ctx));
+
+	return &ctx->ib_ucontext;
+
+err_out:
+	kfree(ctx);
+
+	atomic_dec(&sdev->num_ctx);
+	siw_dbg(sdev, "failure %d. now %d context(s)\n",
+		rv, atomic_read(&sdev->num_ctx));
+
+	return ERR_PTR(rv);
+}
+
+int siw_dealloc_ucontext(struct ib_ucontext *base_ctx)
+{
+	struct siw_ucontext *ctx = siw_ctx_base2siw(base_ctx);
+
+	atomic_dec(&ctx->sdev->num_ctx);
+	kfree(ctx);
+	return 0;
+}
+
+int siw_query_device(struct ib_device *base_dev, struct ib_device_attr *attr,
+		     struct ib_udata *unused)
+{
+	struct siw_device *sdev = siw_dev_base2siw(base_dev);
+	/*
+	 * A process context is needed to report avail memory resources.
+	 */
+	if (in_interrupt())
+		return -EINVAL;
+
+	memset(attr, 0, sizeof(*attr));
+
+	attr->max_mr_size = rlimit(RLIMIT_MEMLOCK); /* per process */
+	attr->vendor_id = sdev->attrs.vendor_id;
+	attr->vendor_part_id = sdev->attrs.vendor_part_id;
+	attr->max_qp = sdev->attrs.max_qp;
+	attr->max_qp_wr = sdev->attrs.max_qp_wr;
+
+	attr->max_qp_rd_atom = sdev->attrs.max_ord;
+	attr->max_qp_init_rd_atom = sdev->attrs.max_ird;
+	attr->max_res_rd_atom = sdev->attrs.max_qp * sdev->attrs.max_ird;
+	attr->device_cap_flags = sdev->attrs.cap_flags;
+	attr->max_sge = sdev->attrs.max_sge;
+	attr->max_sge_rd = sdev->attrs.max_sge_rd;
+	attr->max_cq = sdev->attrs.max_cq;
+	attr->max_cqe = sdev->attrs.max_cqe;
+	attr->max_mr = sdev->attrs.max_mr;
+	attr->max_pd = sdev->attrs.max_pd;
+	attr->max_mw = sdev->attrs.max_mw;
+	attr->max_fmr = sdev->attrs.max_fmr;
+	attr->max_srq = sdev->attrs.max_srq;
+	attr->max_srq_wr = sdev->attrs.max_srq_wr;
+	attr->max_srq_sge = sdev->attrs.max_srq_sge;
+	attr->max_fast_reg_page_list_len = SIW_MAX_SGE_PBL;
+	attr->page_size_cap = PAGE_SIZE;
+	/* Revisit if RFC 7306 gets supported */
+	attr->atomic_cap = 0;
+
+	memcpy(&attr->sys_image_guid, sdev->netdev->dev_addr, 6);
+
+	return 0;
+}
+
+/*
+ * Approximate translation of real MTU for IB.
+ */
+static inline enum ib_mtu siw_mtu_net2base(unsigned short mtu)
+{
+	if (mtu >= 4096)
+		return IB_MTU_4096;
+	if (mtu >= 2048)
+		return IB_MTU_2048;
+	if (mtu >= 1024)
+		return IB_MTU_1024;
+	if (mtu >= 512)
+		return IB_MTU_512;
+	if (mtu >= 256)
+		return IB_MTU_256;
+	return IB_MTU_4096;
+}
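
As a worked example of this mapping: a standard 1500-byte Ethernet MTU
translates to IB_MTU_1024, while a 9000-byte jumbo-frame MTU translates
to IB_MTU_4096.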
+
+int siw_query_port(struct ib_device *base_dev, u8 port,
+		     struct ib_port_attr *attr)
+{
+	struct siw_device *sdev = siw_dev_base2siw(base_dev);
+
+	memset(attr, 0, sizeof(*attr));
+
+	attr->state = sdev->state;
+	attr->max_mtu = siw_mtu_net2base(sdev->netdev->mtu);
+	attr->active_mtu = attr->max_mtu;
+	attr->gid_tbl_len = 1;
+	attr->port_cap_flags = IB_PORT_CM_SUP;
+	attr->port_cap_flags |= IB_PORT_DEVICE_MGMT_SUP;
+	attr->max_msg_sz = -1;
+	attr->pkey_tbl_len = 1;
+	attr->active_width = 2;
+	attr->active_speed = 2;
+	attr->phys_state = sdev->state == IB_PORT_ACTIVE ? 5 : 3;
+	/*
+	 * All zero
+	 *
+	 * attr->lid = 0;
+	 * attr->bad_pkey_cntr = 0;
+	 * attr->qkey_viol_cntr = 0;
+	 * attr->sm_lid = 0;
+	 * attr->lmc = 0;
+	 * attr->max_vl_num = 0;
+	 * attr->sm_sl = 0;
+	 * attr->subnet_timeout = 0;
+	 * attr->init_type_repy = 0;
+	 */
+	return 0;
+}
+
+int siw_get_port_immutable(struct ib_device *base_dev, u8 port,
+			   struct ib_port_immutable *port_immutable)
+{
+	struct ib_port_attr attr;
+	int rv = siw_query_port(base_dev, port, &attr);
+
+	if (rv)
+		return rv;
+
+	port_immutable->pkey_tbl_len = attr.pkey_tbl_len;
+	port_immutable->gid_tbl_len = attr.gid_tbl_len;
+	port_immutable->core_cap_flags = RDMA_CORE_PORT_IWARP;
+
+	return 0;
+}
+
+int siw_query_pkey(struct ib_device *base_dev, u8 port, u16 idx, u16 *pkey)
+{
+	/* Report the default pkey */
+	*pkey = 0xffff;
+	return 0;
+}
+
+int siw_query_gid(struct ib_device *base_dev, u8 port, int idx,
+		   union ib_gid *gid)
+{
+	struct siw_device *sdev = siw_dev_base2siw(base_dev);
+
+	/* subnet_prefix == interface_id == 0; */
+	memset(gid, 0, sizeof(*gid));
+	memcpy(&gid->raw[0], sdev->netdev->dev_addr, 6);
+
+	return 0;
+}
+
+struct ib_pd *siw_alloc_pd(struct ib_device *base_dev,
+			   struct ib_ucontext *context, struct ib_udata *udata)
+{
+	struct siw_pd *pd = NULL;
+	struct siw_device *sdev  = siw_dev_base2siw(base_dev);
+	int rv;
+
+	if (atomic_inc_return(&sdev->num_pd) > SIW_MAX_PD) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	pd = kmalloc(sizeof(*pd), GFP_KERNEL);
+	if (!pd) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	rv = siw_pd_add(sdev, pd);
+	if (rv) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (context) {
+		if (ib_copy_to_udata(udata, &pd->hdr.id, sizeof(pd->hdr.id))) {
+			rv = -EFAULT;
+			goto err_out_idr;
+		}
+	}
+	siw_dbg(sdev, "success. now %d pd's(s)\n",
+		atomic_read(&sdev->num_pd));
+	return &pd->base_pd;
+
+err_out_idr:
+	siw_remove_obj(&sdev->idr_lock, &sdev->pd_idr, &pd->hdr);
+err_out:
+	kfree(pd);
+	atomic_dec(&sdev->num_pd);
+	siw_dbg(sdev, "failure %d. now %d pd's(s)\n",
+		rv, atomic_read(&sdev->num_pd));
+
+	return ERR_PTR(rv);
+}
+
+int siw_dealloc_pd(struct ib_pd *base_pd)
+{
+	struct siw_pd *pd = siw_pd_base2siw(base_pd);
+	struct siw_device *sdev = siw_dev_base2siw(base_pd->device);
+
+	siw_remove_obj(&sdev->idr_lock, &sdev->pd_idr, &pd->hdr);
+	siw_pd_put(pd);
+
+	return 0;
+}
+
+void siw_qp_get_ref(struct ib_qp *base_qp)
+{
+	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
+
+	siw_dbg_qp(qp, "get user reference\n");
+	siw_qp_get(qp);
+}
+
+void siw_qp_put_ref(struct ib_qp *base_qp)
+{
+	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
+
+	siw_dbg_qp(qp, "put user reference\n");
+	siw_qp_put(qp);
+}
+
+int siw_no_mad(struct ib_device *base_dev, int flags, u8 port,
+	       const struct ib_wc *wc, const struct ib_grh *grh,
+	       const struct ib_mad_hdr *in_mad, size_t in_mad_size,
+	       struct ib_mad_hdr *out_mad, size_t *out_mad_size,
+	       u16 *outmad_pkey_index)
+{
+	return -EOPNOTSUPP;
+}
+
+/*
+ * siw_create_qp()
+ *
+ * Create QP of requested size on given device.
+ *
+ * @base_pd:	Base PD contained in siw PD
+ * @attrs:	Initial QP attributes.
+ * @udata:	used to provide QP ID, SQ and RQ size back to user.
+ */
+
+struct ib_qp *siw_create_qp(struct ib_pd *base_pd,
+			    struct ib_qp_init_attr *attrs,
+			    struct ib_udata *udata)
+{
+	struct siw_qp		*qp = NULL;
+	struct siw_pd		*pd = siw_pd_base2siw(base_pd);
+	struct ib_device	 *base_dev = base_pd->device;
+	struct siw_device	*sdev = siw_dev_base2siw(base_dev);
+	struct siw_cq		*scq = NULL, *rcq = NULL;
+
+	unsigned long flags;
+	int num_sqe, num_rqe, rv = 0;
+
+	siw_dbg(sdev, "create new qp\n");
+
+	if (atomic_inc_return(&sdev->num_qp) > SIW_MAX_QP) {
+		siw_dbg(sdev, "too many qp's\n");
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (attrs->qp_type != IB_QPT_RC) {
+		siw_dbg(sdev, "only rc qp's supported\n");
+		rv = -EINVAL;
+		goto err_out;
+	}
+	if ((attrs->cap.max_send_wr > SIW_MAX_QP_WR) ||
+	    (attrs->cap.max_recv_wr > SIW_MAX_QP_WR) ||
+	    (attrs->cap.max_send_sge > SIW_MAX_SGE)  ||
+	    (attrs->cap.max_recv_sge > SIW_MAX_SGE)) {
+		siw_dbg(sdev, "qp size error\n");
+		rv = -EINVAL;
+		goto err_out;
+	}
+	if (attrs->cap.max_inline_data > SIW_MAX_INLINE) {
+		siw_dbg(sdev, "max inline send: %d > %d\n",
+			attrs->cap.max_inline_data, (int)SIW_MAX_INLINE);
+		rv = -EINVAL;
+		goto err_out;
+	}
+	/*
+	 * NOTE: zero-element SGLs are allowed for SQ and RQ WQEs,
+	 * but the QP must be able to hold at least one WQE (SQ + RQ)
+	 */
+	if (attrs->cap.max_send_wr + attrs->cap.max_recv_wr == 0) {
+		siw_dbg(sdev, "qp must have send or receive queue\n");
+		rv = -EINVAL;
+		goto err_out;
+	}
+
+	scq = siw_cq_id2obj(sdev, siw_cq_base2siw(attrs->send_cq)->hdr.id);
+	rcq = siw_cq_id2obj(sdev, siw_cq_base2siw(attrs->recv_cq)->hdr.id);
+
+	if (!scq || (!rcq && !attrs->srq)) {
+		siw_dbg(sdev, "send cq or receive cq invalid\n");
+		rv = -EINVAL;
+		goto err_out;
+	}
+	qp = kzalloc(sizeof(*qp), GFP_KERNEL);
+	if (!qp) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+
+	init_rwsem(&qp->state_lock);
+	spin_lock_init(&qp->sq_lock);
+	spin_lock_init(&qp->rq_lock);
+	spin_lock_init(&qp->orq_lock);
+
+	init_waitqueue_head(&qp->tx_ctx.waitq);
+
+	if (!base_pd->uobject)
+		qp->kernel_verbs = 1;
+
+	rv = siw_qp_add(sdev, qp);
+	if (rv)
+		goto err_out;
+
+	num_sqe = roundup_pow_of_two(attrs->cap.max_send_wr);
+	num_rqe = roundup_pow_of_two(attrs->cap.max_recv_wr);
+
+	if (qp->kernel_verbs)
+		qp->sendq = vzalloc(num_sqe * sizeof(struct siw_sqe));
+	else
+		qp->sendq = vmalloc_user(num_sqe * sizeof(struct siw_sqe));
+
+	if (qp->sendq == NULL) {
+		siw_dbg_qp(qp, "send queue size %d alloc failed\n", num_sqe);
+		rv = -ENOMEM;
+		goto err_out_idr;
+	}
+	if (attrs->sq_sig_type != IB_SIGNAL_REQ_WR) {
+		if (attrs->sq_sig_type == IB_SIGNAL_ALL_WR)
+			qp->attrs.flags |= SIW_SIGNAL_ALL_WR;
+		else {
+			rv = -EINVAL;
+			goto err_out_idr;
+		}
+	}
+	qp->pd  = pd;
+	qp->scq = scq;
+	qp->rcq = rcq;
+
+	if (attrs->srq) {
+		/*
+		 * SRQ support.
+		 * Verbs 6.3.7: ignore RQ size, if SRQ present
+		 * Verbs 6.3.5: do not check PD of SRQ against PD of QP
+		 */
+		qp->srq = siw_srq_base2siw(attrs->srq);
+		qp->attrs.rq_size = 0;
+		siw_dbg_qp(qp, "[SRQ 0x%p] attached\n", qp->srq);
+	} else if (num_rqe) {
+		if (qp->kernel_verbs)
+			qp->recvq = vzalloc(num_rqe * sizeof(struct siw_rqe));
+		else
+			qp->recvq = vmalloc_user(num_rqe *
+						 sizeof(struct siw_rqe));
+
+		if (qp->recvq == NULL) {
+			siw_dbg_qp(qp, "recv queue size %d alloc failed\n",
+				   num_rqe);
+			rv = -ENOMEM;
+			goto err_out_idr;
+		}
+
+		qp->attrs.rq_size = num_rqe;
+	}
+	qp->attrs.sq_size = num_sqe;
+	qp->attrs.sq_max_sges = attrs->cap.max_send_sge;
+	/*
+	 * ofed has no max_send_sge_rdmawrite
+	 */
+	qp->attrs.sq_max_sges_rdmaw = attrs->cap.max_send_sge;
+	qp->attrs.rq_max_sges = attrs->cap.max_recv_sge;
+
+	qp->attrs.state = SIW_QP_STATE_IDLE;
+
+	if (udata) {
+		struct siw_uresp_create_qp uresp;
+		struct siw_ucontext *ctx;
+
+		memset(&uresp, 0, sizeof(uresp));
+		ctx = siw_ctx_base2siw(base_pd->uobject->context);
+
+		uresp.sq_key = uresp.rq_key = SIW_INVAL_UOBJ_KEY;
+		uresp.num_sqe = num_sqe;
+		uresp.num_rqe = num_rqe;
+		uresp.qp_id = QP_ID(qp);
+
+		if (qp->sendq) {
+			uresp.sq_key = siw_insert_uobj(ctx, qp->sendq,
+					num_sqe * sizeof(struct siw_sqe));
+			if (uresp.sq_key > SIW_MAX_UOBJ_KEY)
+				siw_dbg_qp(qp, "preparing mmap sq failed\n");
+		}
+		if (qp->recvq) {
+			uresp.rq_key = siw_insert_uobj(ctx, qp->recvq,
+					num_rqe * sizeof(struct siw_rqe));
+			if (uresp.rq_key > SIW_MAX_UOBJ_KEY)
+				siw_dbg_qp(qp, "preparing mmap rq failed\n");
+		}
+		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		if (rv)
+			goto err_out_idr;
+	}
+	qp->tx_cpu = siw_get_tx_cpu(sdev);
+	if (qp->tx_cpu < 0) {
+		rv = -EINVAL;
+		goto err_out_idr;
+	}
+	atomic_set(&qp->tx_ctx.in_use, 0);
+
+	qp->base_qp.qp_num = QP_ID(qp);
+
+	siw_pd_get(pd);
+
+	INIT_LIST_HEAD(&qp->devq);
+	spin_lock_irqsave(&sdev->idr_lock, flags);
+	list_add_tail(&qp->devq, &sdev->qp_list);
+	spin_unlock_irqrestore(&sdev->idr_lock, flags);
+
+	return &qp->base_qp;
+
+err_out_idr:
+	siw_remove_obj(&sdev->idr_lock, &sdev->qp_idr, &qp->hdr);
+err_out:
+	if (scq)
+		siw_cq_put(scq);
+	if (rcq)
+		siw_cq_put(rcq);
+
+	if (qp) {
+		vfree(qp->sendq);
+		vfree(qp->recvq);
+		kfree(qp);
+	}
+	atomic_dec(&sdev->num_qp);
+
+	return ERR_PTR(rv);
+}
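
For reference, a kernel consumer reaches this verb through the RDMA core.
A minimal sketch of init attributes that satisfy the size checks above
(pd, scq and rcq are assumed to exist already; the sizes are arbitrary but
within the SIW_MAX_* limits):

	struct ib_qp_init_attr init_attr = {
		.qp_type	 = IB_QPT_RC,		/* only RC QPs are accepted */
		.sq_sig_type	 = IB_SIGNAL_REQ_WR,
		.send_cq	 = scq,
		.recv_cq	 = rcq,
		.cap = {
			.max_send_wr	 = 64,		/* <= SIW_MAX_QP_WR */
			.max_recv_wr	 = 64,
			.max_send_sge	 = 4,		/* <= SIW_MAX_SGE */
			.max_recv_sge	 = 4,
			.max_inline_data = 0,		/* <= SIW_MAX_INLINE */
		},
	};
	struct ib_qp *qp = ib_create_qp(pd, &init_attr);

	if (IS_ERR(qp))
		return PTR_ERR(qp);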
+
+/*
+ * Minimum siw_query_qp() verb interface.
+ *
+ * @qp_attr_mask is not used but all available information is provided
+ */
+int siw_query_qp(struct ib_qp *base_qp, struct ib_qp_attr *qp_attr,
+		 int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr)
+{
+	struct siw_qp *qp;
+	struct siw_device *sdev;
+
+	if (base_qp && qp_attr && qp_init_attr) {
+		qp = siw_qp_base2siw(base_qp);
+		sdev = siw_dev_base2siw(base_qp->device);
+	} else
+		return -EINVAL;
+
+	qp_attr->cap.max_inline_data = SIW_MAX_INLINE;
+	qp_attr->cap.max_send_wr = qp->attrs.sq_size;
+	qp_attr->cap.max_send_sge = qp->attrs.sq_max_sges;
+	qp_attr->cap.max_recv_wr = qp->attrs.rq_size;
+	qp_attr->cap.max_recv_sge = qp->attrs.rq_max_sges;
+	qp_attr->path_mtu = siw_mtu_net2base(sdev->netdev->mtu);
+	qp_attr->max_rd_atomic = qp->attrs.irq_size;
+	qp_attr->max_dest_rd_atomic = qp->attrs.orq_size;
+
+	qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE |
+			IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ;
+
+	qp_init_attr->qp_type = base_qp->qp_type;
+	qp_init_attr->send_cq = base_qp->send_cq;
+	qp_init_attr->recv_cq = base_qp->recv_cq;
+	qp_init_attr->srq = base_qp->srq;
+
+	qp_init_attr->cap = qp_attr->cap;
+
+	return 0;
+}
+
+int siw_verbs_modify_qp(struct ib_qp *base_qp, struct ib_qp_attr *attr,
+			int attr_mask, struct ib_udata *udata)
+{
+	struct siw_qp_attrs	new_attrs;
+	enum siw_qp_attr_mask	siw_attr_mask = 0;
+	struct siw_qp		*qp = siw_qp_base2siw(base_qp);
+	int			rv = 0;
+
+	if (!attr_mask)
+		return 0;
+
+	memset(&new_attrs, 0, sizeof(new_attrs));
+
+	if (attr_mask & IB_QP_ACCESS_FLAGS) {
+
+		siw_attr_mask = SIW_QP_ATTR_ACCESS_FLAGS;
+
+		if (attr->qp_access_flags & IB_ACCESS_REMOTE_READ)
+			new_attrs.flags |= SIW_RDMA_READ_ENABLED;
+		if (attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE)
+			new_attrs.flags |= SIW_RDMA_WRITE_ENABLED;
+		if (attr->qp_access_flags & IB_ACCESS_MW_BIND)
+			new_attrs.flags |= SIW_RDMA_BIND_ENABLED;
+	}
+	if (attr_mask & IB_QP_STATE) {
+		siw_dbg_qp(qp, "desired ib qp state: %s\n",
+			   ib_qp_state_to_string[attr->qp_state]);
+
+		new_attrs.state = ib_qp_state_to_siw_qp_state[attr->qp_state];
+
+		if (new_attrs.state > SIW_QP_STATE_RTS)
+			qp->tx_ctx.tx_suspend = 1;
+
+		siw_attr_mask |= SIW_QP_ATTR_STATE;
+	}
+	if (!attr_mask)
+		goto out;
+
+	down_write(&qp->state_lock);
+
+	rv = siw_qp_modify(qp, &new_attrs, siw_attr_mask);
+
+	up_write(&qp->state_lock);
+out:
+	return rv;
+}
+
+int siw_destroy_qp(struct ib_qp *base_qp)
+{
+	struct siw_qp		*qp = siw_qp_base2siw(base_qp);
+	struct siw_qp_attrs	qp_attrs;
+
+	siw_dbg_qp(qp, "state %d, cep 0x%p\n", qp->attrs.state, qp->cep);
+
+	/*
+	 * Mark QP as in process of destruction to prevent from
+	 * any async callbacks to RDMA core
+	 */
+	qp->attrs.flags |= SIW_QP_IN_DESTROY;
+	qp->rx_ctx.rx_suspend = 1;
+
+	down_write(&qp->state_lock);
+
+	qp_attrs.state = SIW_QP_STATE_ERROR;
+	(void)siw_qp_modify(qp, &qp_attrs, SIW_QP_ATTR_STATE);
+
+	if (qp->cep) {
+		siw_cep_put(qp->cep);
+		qp->cep = NULL;
+	}
+
+	up_write(&qp->state_lock);
+
+	kfree(qp->rx_ctx.mpa_crc_hd);
+	kfree(qp->tx_ctx.mpa_crc_hd);
+
+	/* Drop references */
+	siw_cq_put(qp->scq);
+	siw_cq_put(qp->rcq);
+	siw_pd_put(qp->pd);
+	qp->scq = qp->rcq = NULL;
+
+	siw_qp_put(qp);
+
+	return 0;
+}
+
+/*
+ * siw_copy_sgl()
+ *
+ * Copy SGL from RDMA core representation to local
+ * representation.
+ */
+static inline void siw_copy_sgl(struct ib_sge *sge, struct siw_sge *siw_sge,
+			       int num_sge)
+{
+	while (num_sge--) {
+		siw_sge->laddr = sge->addr;
+		siw_sge->length = sge->length;
+		siw_sge->lkey = sge->lkey;
+
+		siw_sge++;
+		sge++;
+	}
+}
+
+/*
+ * siw_copy_inline_sgl()
+ *
+ * Prepare the sgl of inlined data for sending. For userland callers,
+ * the function checks whether the given buffer addresses and lengths
+ * are within process context bounds.
+ * Data from all provided sge's are copied together into the wqe,
+ * referenced by a single sge.
+ */
+static int siw_copy_inline_sgl(struct ib_send_wr *core_wr, struct siw_sqe *sqe)
+{
+	struct ib_sge	*core_sge = core_wr->sg_list;
+	void		*kbuf	 = &sqe->sge[1];
+	int		num_sge	 = core_wr->num_sge,
+			bytes	 = 0;
+
+	sqe->sge[0].laddr = (u64)kbuf;
+	sqe->sge[0].lkey = 0;
+
+	while (num_sge--) {
+		if (!core_sge->length) {
+			core_sge++;
+			continue;
+		}
+		bytes += core_sge->length;
+		if (bytes > SIW_MAX_INLINE) {
+			bytes = -EINVAL;
+			break;
+		}
+		memcpy(kbuf, (void *)(uintptr_t)core_sge->addr,
+		       core_sge->length);
+
+		kbuf += core_sge->length;
+		core_sge++;
+	}
+	sqe->sge[0].length = bytes > 0 ? bytes : 0;
+	sqe->num_sge = bytes > 0 ? 1 : 0;
+
+	return bytes;
+}
+
+/*
+ * siw_post_send()
+ *
+ * Post a list of S-WR's to a SQ.
+ *
+ * @base_qp:	Base QP contained in siw QP
+ * @wr:		Null terminated list of user WR's
+ * @bad_wr:	Points to failing WR in case of synchronous failure.
+ */
+int siw_post_send(struct ib_qp *base_qp, struct ib_send_wr *wr,
+		  struct ib_send_wr **bad_wr)
+{
+	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
+	struct siw_wqe	*wqe = tx_wqe(qp);
+
+	unsigned long flags;
+	int rv = 0;
+
+	siw_dbg_qp(qp, "state %d\n", qp->attrs.state);
+
+	/*
+	 * Try to acquire QP state lock. Must be non-blocking
+	 * to accommodate kernel clients needs.
+	 */
+	if (!down_read_trylock(&qp->state_lock)) {
+		*bad_wr = wr;
+		return -ENOTCONN;
+	}
+
+	if (unlikely(qp->attrs.state != SIW_QP_STATE_RTS)) {
+		up_read(&qp->state_lock);
+		*bad_wr = wr;
+		return -ENOTCONN;
+	}
+	if (wr && !qp->kernel_verbs) {
+		siw_dbg_qp(qp, "wr must be empty for user mapped sq\n");
+		up_read(&qp->state_lock);
+		*bad_wr = wr;
+		return -EINVAL;
+	}
+
+	spin_lock_irqsave(&qp->sq_lock, flags);
+
+	while (wr) {
+		u32 idx = qp->sq_put % qp->attrs.sq_size;
+		struct siw_sqe *sqe = &qp->sendq[idx];
+
+		if (sqe->flags) {
+			siw_dbg_qp(qp, "sq full\n");
+			rv = -ENOMEM;
+			break;
+		}
+		if (wr->num_sge > qp->attrs.sq_max_sges) {
+			siw_dbg_qp(qp, "too many sge's: %d\n", wr->num_sge);
+			rv = -EINVAL;
+			break;
+		}
+		sqe->id = wr->wr_id;
+
+		if ((wr->send_flags & IB_SEND_SIGNALED) ||
+		    (qp->attrs.flags & SIW_SIGNAL_ALL_WR))
+			sqe->flags |= SIW_WQE_SIGNALLED;
+
+		if (wr->send_flags & IB_SEND_FENCE)
+			sqe->flags |= SIW_WQE_READ_FENCE;
+
+		switch (wr->opcode) {
+
+		case IB_WR_SEND:
+		case IB_WR_SEND_WITH_INV:
+			if (wr->send_flags & IB_SEND_SOLICITED)
+				sqe->flags |= SIW_WQE_SOLICITED;
+
+			if (!(wr->send_flags & IB_SEND_INLINE)) {
+				siw_copy_sgl(wr->sg_list, sqe->sge,
+					     wr->num_sge);
+				sqe->num_sge = wr->num_sge;
+			} else {
+				rv = siw_copy_inline_sgl(wr, sqe);
+				if (rv <= 0) {
+					rv = -EINVAL;
+					break;
+				}
+				sqe->flags |= SIW_WQE_INLINE;
+				sqe->num_sge = 1;
+			}
+			if (wr->opcode == IB_WR_SEND)
+				sqe->opcode = SIW_OP_SEND;
+			else {
+				sqe->opcode = SIW_OP_SEND_REMOTE_INV;
+				sqe->rkey = wr->ex.invalidate_rkey;
+			}
+			break;
+
+		case IB_WR_RDMA_READ_WITH_INV:
+		case IB_WR_RDMA_READ:
+			/*
+			 * OFED WR restricts RREAD sink to SGL containing
+			 * 1 SGE only. we could relax to SGL with multiple
+			 * elements referring the SAME ltag or even sending
+			 * a private per-rreq tag referring to a checked
+			 * local sgl with MULTIPLE ltag's. would be easy
+			 * to do...
+			 */
+			if (unlikely(wr->num_sge != 1)) {
+				rv = -EINVAL;
+				break;
+			}
+			siw_copy_sgl(wr->sg_list, &sqe->sge[0], 1);
+			/*
+			 * NOTE: zero length RREAD is allowed!
+			 */
+			sqe->raddr	= rdma_wr(wr)->remote_addr;
+			sqe->rkey	= rdma_wr(wr)->rkey;
+			sqe->num_sge	= 1;
+
+			if (wr->opcode == IB_WR_RDMA_READ)
+				sqe->opcode = SIW_OP_READ;
+			else
+				sqe->opcode = SIW_OP_READ_LOCAL_INV;
+			break;
+
+		case IB_WR_RDMA_WRITE:
+			if (!(wr->send_flags & IB_SEND_INLINE)) {
+				siw_copy_sgl(wr->sg_list, &sqe->sge[0],
+					     wr->num_sge);
+				sqe->num_sge = wr->num_sge;
+			} else {
+				rv = siw_copy_inline_sgl(wr, sqe);
+				if (unlikely(rv < 0)) {
+					rv = -EINVAL;
+					break;
+				}
+				sqe->flags |= SIW_WQE_INLINE;
+				sqe->num_sge = 1;
+			}
+			sqe->raddr	= rdma_wr(wr)->remote_addr;
+			sqe->rkey	= rdma_wr(wr)->rkey;
+			sqe->opcode	= SIW_OP_WRITE;
+
+			break;
+
+		case IB_WR_REG_MR:
+			sqe->base_mr = (uint64_t)reg_wr(wr)->mr;
+			sqe->rkey = reg_wr(wr)->key;
+			sqe->access = SIW_MEM_LREAD;
+			if (reg_wr(wr)->access & IB_ACCESS_LOCAL_WRITE)
+				sqe->access |= SIW_MEM_LWRITE;
+			if (reg_wr(wr)->access & IB_ACCESS_REMOTE_WRITE)
+				sqe->access |= SIW_MEM_RWRITE;
+			if (reg_wr(wr)->access & IB_ACCESS_REMOTE_READ)
+				sqe->access |= SIW_MEM_RREAD;
+			sqe->opcode = SIW_OP_REG_MR;
+
+			break;
+
+		case IB_WR_LOCAL_INV:
+			sqe->rkey = wr->ex.invalidate_rkey;
+			sqe->opcode = SIW_OP_INVAL_STAG;
+
+			break;
+
+		default:
+			siw_dbg_qp(qp, "ib wr type %d unsupported\n",
+				   wr->opcode);
+			rv = -EINVAL;
+			break;
+		}
+		siw_dbg_qp(qp, "opcode %d, flags 0x%x\n",
+			   sqe->opcode, sqe->flags);
+
+		if (unlikely(rv < 0))
+			break;
+
+		/* make SQE only valid after it is completely written */
+		smp_wmb();
+		sqe->flags |= SIW_WQE_VALID;
+
+		qp->sq_put++;
+		wr = wr->next;
+	}
+
+	/*
+	 * Start sending directly if SQ processing is not in progress.
+	 * Any immediate error (rv < 0) does not affect the involved
+	 * RI resources (Verbs, 8.3.1) and thus does not prevent SQ
+	 * processing of work that is already pending. However, rv must
+	 * still be passed back to the caller.
+	 */
+	if (wqe->wr_status != SIW_WR_IDLE) {
+		spin_unlock_irqrestore(&qp->sq_lock, flags);
+		goto skip_direct_sending;
+	}
+	rv = siw_activate_tx(qp);
+	spin_unlock_irqrestore(&qp->sq_lock, flags);
+
+	if (rv <= 0)
+		goto skip_direct_sending;
+
+	if (qp->kernel_verbs) {
+		rv = siw_sq_start(qp);
+	} else {
+		qp->tx_ctx.in_syscall = 1;
+
+		if (siw_qp_sq_process(qp) != 0 && !(qp->tx_ctx.tx_suspend))
+			siw_qp_cm_drop(qp, 0);
+
+		qp->tx_ctx.in_syscall = 0;
+	}
+skip_direct_sending:
+
+	up_read(&qp->state_lock);
+
+	if (rv >= 0)
+		return 0;
+	/*
+	 * Immediate error
+	 */
+	siw_dbg_qp(qp, "error %d\n", rv);
+
+	*bad_wr = wr;
+	return rv;
+}
+
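+/*
+ * Usage sketch (illustrative only, not part of the driver): in-kernel
+ * clients reach siw_post_send() through the generic ib_post_send()
+ * verb. 'qp', 'buf_dma_addr' and 'mr' are placeholders for an
+ * established QP and a locally registered buffer.
+ *
+ *	struct ib_send_wr *bad_wr;
+ *	struct ib_sge sge = {
+ *		.addr   = buf_dma_addr,
+ *		.length = 64,
+ *		.lkey   = mr->lkey,
+ *	};
+ *	struct ib_send_wr wr = {
+ *		.wr_id      = 1,
+ *		.sg_list    = &sge,
+ *		.num_sge    = 1,
+ *		.opcode     = IB_WR_SEND,
+ *		.send_flags = IB_SEND_SIGNALED,
+ *	};
+ *
+ *	if (ib_post_send(qp, &wr, &bad_wr))
+ *		pr_err("post_send failed at wr_id %llu\n", bad_wr->wr_id);
+ */
+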
+/*
+ * siw_post_receive()
+ *
+ * Post a list of R-WR's to a RQ.
+ *
+ * @base_qp:	Base QP contained in siw QP
+ * @wr:		Null terminated list of user WR's
+ * @bad_wr:	Points to failing WR in case of synchronous failure.
+ */
+int siw_post_receive(struct ib_qp *base_qp, struct ib_recv_wr *wr,
+		     struct ib_recv_wr **bad_wr)
+{
+	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
+	int rv = 0;
+
+	siw_dbg_qp(qp, "state %d\n", qp->attrs.state);
+
+	if (qp->srq) {
+		*bad_wr = wr;
+		return -EOPNOTSUPP; /* what else from errno.h? */
+	}
+	/*
+	 * Try to acquire QP state lock. Must be non-blocking
+	 * to accommodate kernel clients' needs.
+	 */
+	if (!down_read_trylock(&qp->state_lock)) {
+		*bad_wr = wr;
+		return -ENOTCONN;
+	}
+	if (!qp->kernel_verbs) {
+		siw_dbg_qp(qp, "no kernel post_recv for user mapped sq\n");
+		up_read(&qp->state_lock);
+		*bad_wr = wr;
+		return -EINVAL;
+	}
+	if (qp->attrs.state > SIW_QP_STATE_RTS) {
+		up_read(&qp->state_lock);
+		*bad_wr = wr;
+		return -EINVAL;
+	}
+	while (wr) {
+		u32 idx = qp->rq_put % qp->attrs.rq_size;
+		struct siw_rqe *rqe = &qp->recvq[idx];
+
+		if (rqe->flags) {
+			siw_dbg_qp(qp, "rq full\n");
+			rv = -ENOMEM;
+			break;
+		}
+		if (wr->num_sge > qp->attrs.rq_max_sges) {
+			siw_dbg_qp(qp, "too many sge's: %d\n", wr->num_sge);
+			rv = -EINVAL;
+			break;
+		}
+		rqe->id = wr->wr_id;
+		rqe->num_sge = wr->num_sge;
+		siw_copy_sgl(wr->sg_list, rqe->sge, wr->num_sge);
+
+		/* make sure RQE is completely written before valid */
+		smp_wmb();
+
+		rqe->flags = SIW_WQE_VALID;
+
+		qp->rq_put++;
+		wr = wr->next;
+	}
+	if (rv < 0) {
+		siw_dbg_qp(qp, "error %d\n", rv);
+		*bad_wr = wr;
+	}
+	up_read(&qp->state_lock);
+
+	return rv > 0 ? 0 : rv;
+}
+
+int siw_destroy_cq(struct ib_cq *base_cq)
+{
+	struct siw_cq		*cq  = siw_cq_base2siw(base_cq);
+	struct ib_device	*base_dev = base_cq->device;
+	struct siw_device	*sdev = siw_dev_base2siw(base_dev);
+
+	siw_cq_flush(cq);
+
+	siw_remove_obj(&sdev->idr_lock, &sdev->cq_idr, &cq->hdr);
+	siw_cq_put(cq);
+
+	return 0;
+}
+
+/*
+ * siw_create_cq()
+ *
+ * Create CQ of requested size on given device.
+ *
+ * @base_dev:	RDMA device contained in siw device
+ * @size:	maximum number of CQE's allowed.
+ * @ib_context: user context.
+ * @udata:	used to provide CQ ID back to user.
+ */
+
+struct ib_cq *siw_create_cq(struct ib_device *base_dev,
+			    const struct ib_cq_init_attr *attr,
+			    struct ib_ucontext *ib_context,
+			    struct ib_udata *udata)
+{
+	struct siw_cq			*cq = NULL;
+	struct siw_device		*sdev = siw_dev_base2siw(base_dev);
+	struct siw_uresp_create_cq	uresp;
+	int rv, size = attr->cqe;
+
+	if (!base_dev) {
+		rv = -ENODEV;
+		goto err_out;
+	}
+	if (atomic_inc_return(&sdev->num_cq) > SIW_MAX_CQ) {
+		siw_dbg(sdev, "too many cq's\n");
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (size < 1 || size > SIW_MAX_CQE) {
+		siw_dbg(sdev, "cq size error: %d\n", size);
+		rv = -EINVAL;
+		goto err_out;
+	}
+	cq = kzalloc(sizeof(*cq), GFP_KERNEL);
+	if (!cq) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	size = roundup_pow_of_two(size);
+	cq->base_cq.cqe = size;
+	cq->num_cqe = size;
+
+	if (!ib_context) {
+		cq->kernel_verbs = 1;
+		cq->queue = vzalloc(size * sizeof(struct siw_cqe)
+				+ sizeof(struct siw_cq_ctrl));
+	} else {
+		cq->queue = vmalloc_user(size * sizeof(struct siw_cqe)
+				+ sizeof(struct siw_cq_ctrl));
+	}
+
+	if (cq->queue == NULL) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+
+	rv = siw_cq_add(sdev, cq);
+	if (rv)
+		goto err_out;
+
+	spin_lock_init(&cq->lock);
+
+	cq->notify = &((struct siw_cq_ctrl *)&cq->queue[size])->notify;
+
+	if (!cq->kernel_verbs) {
+		struct siw_ucontext *ctx = siw_ctx_base2siw(ib_context);
+
+		uresp.cq_key = siw_insert_uobj(ctx, cq->queue,
+					size * sizeof(struct siw_cqe) +
+					sizeof(struct siw_cq_ctrl));
+
+		if (uresp.cq_key > SIW_MAX_UOBJ_KEY)
+			siw_dbg(sdev, "[CQ %d]: preparing mmap rq failed\n",
+				OBJ_ID(cq));
+
+		uresp.cq_id = OBJ_ID(cq);
+		uresp.num_cqe = size;
+
+		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		if (rv)
+			goto err_out_idr;
+	}
+	return &cq->base_cq;
+
+err_out_idr:
+	siw_remove_obj(&sdev->idr_lock, &sdev->cq_idr, &cq->hdr);
+err_out:
+	siw_dbg(sdev, "cq creation failed: %d", rv);
+
+	if (cq && cq->queue)
+		vfree(cq->queue);
+
+	kfree(cq);
+	atomic_dec(&sdev->num_cq);
+
+	return ERR_PTR(rv);
+}
+
+/*
+ * siw_poll_cq()
+ *
+ * Reap CQ entries if available and copy work completion status into
+ * array of WC's provided by caller. Returns number of reaped CQE's.
+ *
+ * @base_cq:	Base CQ contained in siw CQ.
+ * @num_cqe:	Maximum number of CQE's to reap.
+ * @wc:		Array of work completions to be filled by siw.
+ */
+int siw_poll_cq(struct ib_cq *base_cq, int num_cqe, struct ib_wc *wc)
+{
+	struct siw_cq		*cq  = siw_cq_base2siw(base_cq);
+	int			i;
+
+	for (i = 0; i < num_cqe; i++) {
+		if (!(siw_reap_cqe(cq, wc)))
+			break;
+		wc++;
+	}
+	return i;
+}
+
+/*
+ * siw_req_notify_cq()
+ *
+ * Request notification for new CQE's added to that CQ.
+ * Defined flags:
+ * o SIW_CQ_NOTIFY_SOLICITED lets siw trigger a notification
+ *   event if a completion for a WQE with solicited flag set
+ *   enters the CQ.
+ * o SIW_CQ_NOTIFY_NEXT_COMP lets siw trigger a notification
+ *   event if any completion enters the CQ.
+ * o IB_CQ_REPORT_MISSED_EVENTS: the return value provides the
+ *   number of CQE's not yet reaped, regardless of their
+ *   notification type and of the current or new CQ notification
+ *   settings.
+ *
+ * @base_cq:	Base CQ contained in siw CQ.
+ * @flags:	Requested notification flags.
+ */
+int siw_req_notify_cq(struct ib_cq *base_cq, enum ib_cq_notify_flags flags)
+{
+	struct siw_cq	 *cq  = siw_cq_base2siw(base_cq);
+
+	siw_dbg(cq->hdr.sdev, "[CQ %d]: flags: 0x%8x\n", OBJ_ID(cq), flags);
+
+	if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED)
+		/* CQ event for next solicited completion */
+		smp_store_mb(*cq->notify, SIW_NOTIFY_SOLICITED);
+	else
+		/* CQ event for any signalled completion */
+		smp_store_mb(*cq->notify, SIW_NOTIFY_ALL);
+
+	if (flags & IB_CQ_REPORT_MISSED_EVENTS)
+		return cq->cq_put - cq->cq_get;
+
+	return 0;
+}
+
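+/*
+ * Usage sketch (illustrative only, not part of the driver): the usual
+ * poll/re-arm loop a kernel client runs against this CQ implementation.
+ * A positive return from ib_req_notify_cq() with
+ * IB_CQ_REPORT_MISSED_EVENTS signals completions that arrived between
+ * polling and re-arming, so polling continues. 'process_wc' is a
+ * placeholder for the client's completion handler.
+ *
+ *	struct ib_wc wc[8];
+ *	int i, n;
+ *
+ *	do {
+ *		while ((n = ib_poll_cq(cq, ARRAY_SIZE(wc), wc)) > 0)
+ *			for (i = 0; i < n; i++)
+ *				process_wc(&wc[i]);
+ *	} while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
+ *				  IB_CQ_REPORT_MISSED_EVENTS) > 0);
+ */
+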
+/*
+ * siw_dereg_mr()
+ *
+ * Release Memory Region.
+ *
+ * TODO: Update this function once siw supports Memory Windows:
+ *       does the OFED core check for MW dependencies on the
+ *       current MR before calling MR deregistration?
+ *
+ * @base_mr:     Base MR contained in siw MR.
+ */
+int siw_dereg_mr(struct ib_mr *base_mr)
+{
+	struct siw_mr *mr;
+	struct siw_device *sdev = siw_dev_base2siw(base_mr->device);
+
+	mr = siw_mr_base2siw(base_mr);
+
+	siw_dbg(sdev, "[MEM %d]: deregister mr, #ref's %d\n",
+		mr->mem.hdr.id, refcount_read(&mr->mem.hdr.ref));
+
+	mr->mem.stag_valid = 0;
+
+	siw_remove_obj(&sdev->idr_lock, &sdev->mem_idr, &mr->mem.hdr);
+	siw_mem_put(&mr->mem);
+
+	return 0;
+}
+
+static struct siw_mr *siw_create_mr(struct siw_device *sdev, void *mem_obj,
+				    u64 start, u64 len, int rights)
+{
+	struct siw_mr *mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	unsigned long flags;
+
+	if (!mr)
+		return NULL;
+
+	mr->mem.stag_valid = 0;
+
+	if (siw_mem_add(sdev, &mr->mem) < 0) {
+		kfree(mr);
+		return NULL;
+	}
+	siw_dbg(sdev, "[MEM %d]: new mr, object 0x%p\n",
+		mr->mem.hdr.id, mem_obj);
+
+	mr->base_mr.lkey = mr->base_mr.rkey = mr->mem.hdr.id << 8;
+
+	mr->mem.va  = start;
+	mr->mem.len = len;
+	mr->mem.mr  = NULL;
+	mr->mem.perms = SIW_MEM_LREAD | /* not selectable in RDMA core */
+		(rights & IB_ACCESS_REMOTE_READ  ? SIW_MEM_RREAD  : 0) |
+		(rights & IB_ACCESS_LOCAL_WRITE  ? SIW_MEM_LWRITE : 0) |
+		(rights & IB_ACCESS_REMOTE_WRITE ? SIW_MEM_RWRITE : 0);
+
+	mr->mem_obj = mem_obj;
+
+	INIT_LIST_HEAD(&mr->devq);
+	spin_lock_irqsave(&sdev->idr_lock, flags);
+	list_add_tail(&mr->devq, &sdev->mr_list);
+	spin_unlock_irqrestore(&sdev->idr_lock, flags);
+
+	return mr;
+}
+
+/*
+ * siw_reg_user_mr()
+ *
+ * Register Memory Region.
+ *
+ * @base_pd:	Base PD contained in siw PD.
+ * @start:	starting address of MR (virtual address)
+ * @len:	len of MR
+ * @rnic_va:	not used by siw
+ * @rights:	MR access rights
+ * @udata:	user buffer to communicate STag and Key.
+ */
+struct ib_mr *siw_reg_user_mr(struct ib_pd *base_pd, u64 start, u64 len,
+			      u64 rnic_va, int rights, struct ib_udata *udata)
+{
+	struct siw_mr		*mr = NULL;
+	struct siw_pd		*pd = siw_pd_base2siw(base_pd);
+	struct siw_umem		*umem = NULL;
+	struct siw_ureq_reg_mr	ureq;
+	struct siw_uresp_reg_mr	uresp;
+	struct siw_device	*sdev = pd->hdr.sdev;
+
+	unsigned long mem_limit = rlimit(RLIMIT_MEMLOCK);
+	int rv;
+
+	siw_dbg(sdev, "[PD %d]: start: 0x%016llx, va: 0x%016llx, len: %llu\n",
+		OBJ_ID(pd), (unsigned long long)start,
+		(unsigned long long)rnic_va, (unsigned long long)len);
+
+	if (atomic_inc_return(&sdev->num_mr) > SIW_MAX_MR) {
+		siw_dbg(sdev, "[PD %d]: too many mr's\n", OBJ_ID(pd));
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (!len) {
+		rv = -EINVAL;
+		goto err_out;
+	}
+	if (mem_limit != RLIM_INFINITY) {
+		unsigned long num_pages =
+			(PAGE_ALIGN(len + (start & ~PAGE_MASK))) >> PAGE_SHIFT;
+		mem_limit >>= PAGE_SHIFT;
+
+		if (num_pages > mem_limit - current->mm->locked_vm) {
+			siw_dbg(sdev,
+				"[PD %d]: pages req %lu, max %lu, lock %lu\n",
+				OBJ_ID(pd), num_pages, mem_limit,
+				current->mm->locked_vm);
+			rv = -ENOMEM;
+			goto err_out;
+		}
+	}
+	umem = siw_umem_get(start, len);
+	if (IS_ERR(umem)) {
+		rv = PTR_ERR(umem);
+		siw_dbg(sdev, "[PD %d]: getting user memory failed: %d\n",
+			OBJ_ID(pd), rv);
+		umem = NULL;
+		goto err_out;
+	}
+	mr = siw_create_mr(sdev, umem, start, len, rights);
+	if (!mr) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (udata) {
+		rv = ib_copy_from_udata(&ureq, udata, sizeof(ureq));
+		if (rv)
+			goto err_out;
+
+		mr->base_mr.lkey |= ureq.stag_key;
+		mr->base_mr.rkey |= ureq.stag_key;
+		uresp.stag = mr->base_mr.lkey;
+
+		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		if (rv)
+			goto err_out;
+	}
+	mr->pd = pd;
+	siw_pd_get(pd);
+
+	mr->mem.stag_valid = 1;
+
+	return &mr->base_mr;
+
+err_out:
+	if (mr) {
+		siw_remove_obj(&sdev->idr_lock, &sdev->mem_idr, &mr->mem.hdr);
+		siw_mem_put(&mr->mem);
+		umem = NULL;
+	} else
+		atomic_dec(&sdev->num_mr);
+
+	if (umem)
+		siw_umem_release(umem);
+
+	return ERR_PTR(rv);
+}
+
+struct ib_mr *siw_alloc_mr(struct ib_pd *base_pd, enum ib_mr_type mr_type,
+			   u32 max_sge)
+{
+	struct siw_mr	*mr;
+	struct siw_pd	*pd = siw_pd_base2siw(base_pd);
+	struct siw_device *sdev = pd->hdr.sdev;
+	struct siw_pbl	*pbl = NULL;
+	int rv;
+
+	if (atomic_inc_return(&sdev->num_mr) > SIW_MAX_MR) {
+		siw_dbg(sdev, "[PD %d]: too many mr's\n", OBJ_ID(pd));
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (mr_type != IB_MR_TYPE_MEM_REG) {
+		siw_dbg(sdev, "[PD %d]: mr type %d unsupported\n",
+			OBJ_ID(pd), mr_type);
+		rv = -EOPNOTSUPP;
+		goto err_out;
+	}
+	if (max_sge > SIW_MAX_SGE_PBL) {
+		siw_dbg(sdev, "[PD %d]: too many sge's: %d\n",
+			OBJ_ID(pd), max_sge);
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	pbl = siw_pbl_alloc(max_sge);
+	if (IS_ERR(pbl)) {
+		rv = PTR_ERR(pbl);
+		siw_dbg(sdev, "[PD %d]: pbl allocation failed: %d\n",
+			OBJ_ID(pd), rv);
+		pbl = NULL;
+		goto err_out;
+	}
+	mr = siw_create_mr(sdev, pbl, 0, max_sge * PAGE_SIZE, 0);
+	if (!mr) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	mr->mem.is_pbl = 1;
+	mr->pd = pd;
+	siw_pd_get(pd);
+
+	siw_dbg(sdev, "[PD %d], [MEM %d]: success\n",
+		OBJ_ID(pd), OBJ_ID(&mr->mem));
+
+	return &mr->base_mr;
+
+err_out:
+	if (pbl)
+		siw_pbl_free(pbl);
+
+	siw_dbg(sdev, "[PD %d]: failed: %d\n", OBJ_ID(pd), rv);
+
+	atomic_dec(&sdev->num_mr);
+
+	return ERR_PTR(rv);
+}
+
+/* Just used to count number of pages being mapped */
+static int siw_set_pbl_page(struct ib_mr *base_mr, u64 buf_addr)
+{
+	return 0;
+}
+
+int siw_map_mr_sg(struct ib_mr *base_mr, struct scatterlist *sl, int num_sle,
+		  unsigned int *sg_off)
+{
+	struct scatterlist *slp;
+	struct siw_mr *mr = siw_mr_base2siw(base_mr);
+	struct siw_pbl *pbl = mr->pbl;
+	struct siw_pble *pble;
+	u64 pbl_size;
+	int i, rv;
+
+	if (!pbl) {
+		siw_dbg(mr->mem.hdr.sdev, "[MEM %d]: no pbl allocated\n",
+			OBJ_ID(&mr->mem));
+		return -EINVAL;
+	}
+	/* Dereference pbl only after the NULL check above */
+	pble = pbl->pbe;
+
+	if (pbl->max_buf < num_sle) {
+		siw_dbg(mr->mem.hdr.sdev, "[MEM %d]: too many sge's: %d > %d\n",
+			OBJ_ID(&mr->mem), num_sle, pbl->max_buf);
+		return -ENOMEM;
+	}
+
+	for_each_sg(sl, slp, num_sle, i) {
+		if (sg_dma_len(slp) == 0) {
+			siw_dbg(mr->mem.hdr.sdev, "[MEM %d]: empty sge\n",
+				OBJ_ID(&mr->mem));
+			return -EINVAL;
+		}
+		if (i == 0) {
+			pble->addr = sg_dma_address(slp);
+			pble->size = sg_dma_len(slp);
+			pble->pbl_off = 0;
+			pbl_size = pble->size;
+			pbl->num_buf = 1;
+
+			continue;
+		}
+		/* Merge PBL entries if adjacent */
+		if (pble->addr + pble->size == sg_dma_address(slp))
+			pble->size += sg_dma_len(slp);
+		else {
+			pble++;
+			pbl->num_buf++;
+			pble->addr = sg_dma_address(slp);
+			pble->size = sg_dma_len(slp);
+			pble->pbl_off = pbl_size;
+		}
+		pbl_size += sg_dma_len(slp);
+
+		siw_dbg(mr->mem.hdr.sdev,
+			"[MEM %d]: sge[%d], size %llu, addr %p, total %llu\n",
+			OBJ_ID(&mr->mem), i, pble->size, (void *)pble->addr,
+			pbl_size);
+	}
+	rv = ib_sg_to_pages(base_mr, sl, num_sle, sg_off, siw_set_pbl_page);
+	if (rv > 0) {
+		mr->mem.len = base_mr->length;
+		mr->mem.va = base_mr->iova;
+		siw_dbg(mr->mem.hdr.sdev,
+			"[MEM %d]: %llu byte, %u SLE into %u entries\n",
+			OBJ_ID(&mr->mem), mr->mem.len, num_sle, pbl->num_buf);
+	}
+	return rv;
+}
+
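+/*
+ * Usage sketch (illustrative only, not part of the driver): fast memory
+ * registration against this implementation. The RDMA core invokes the
+ * ->map_mr_sg() callback above via ib_map_mr_sg(); the registration then
+ * becomes effective once an IB_WR_REG_MR work request is posted, which
+ * siw_post_send() maps to SIW_OP_REG_MR. 'sgl' and 'nents' are
+ * placeholders for an already DMA-mapped scatterlist.
+ *
+ *	struct ib_mr *mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, nents);
+ *	struct ib_reg_wr reg_wr = {};
+ *	struct ib_send_wr *bad_wr;
+ *
+ *	ib_map_mr_sg(mr, sgl, nents, NULL, PAGE_SIZE);
+ *
+ *	reg_wr.wr.opcode = IB_WR_REG_MR;
+ *	reg_wr.mr        = mr;
+ *	reg_wr.key       = mr->rkey;
+ *	reg_wr.access    = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ;
+ *
+ *	ib_post_send(qp, &reg_wr.wr, &bad_wr);
+ */
+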
+/*
+ * siw_get_dma_mr()
+ *
+ * Create an (empty) DMA memory region with no umem attached.
+ */
+struct ib_mr *siw_get_dma_mr(struct ib_pd *base_pd, int rights)
+{
+	struct siw_mr *mr;
+	struct siw_pd *pd = siw_pd_base2siw(base_pd);
+	struct siw_device *sdev = pd->hdr.sdev;
+	int rv;
+
+	if (atomic_inc_return(&sdev->num_mr) > SIW_MAX_MR) {
+		siw_dbg(sdev, "[PD %d]: too many mr's\n", OBJ_ID(pd));
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	mr = siw_create_mr(sdev, NULL, 0, ULONG_MAX, rights);
+	if (!mr) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	mr->mem.stag_valid = 1;
+
+	mr->pd = pd;
+	siw_pd_get(pd);
+
+	siw_dbg(sdev, "[PD %d], [MEM %d]: success\n",
+		OBJ_ID(pd), OBJ_ID(&mr->mem));
+
+	return &mr->base_mr;
+
+err_out:
+	atomic_dec(&sdev->num_mr);
+
+	return ERR_PTR(rv);
+}
+
+/*
+ * siw_create_srq()
+ *
+ * Create Shared Receive Queue of attributes @init_attrs
+ * within protection domain given by @base_pd.
+ *
+ * @base_pd:	Base PD contained in siw PD.
+ * @init_attrs:	SRQ init attributes.
+ * @udata:	not used by siw.
+ */
+struct ib_srq *siw_create_srq(struct ib_pd *base_pd,
+			      struct ib_srq_init_attr *init_attrs,
+			      struct ib_udata *udata)
+{
+	struct siw_srq		*srq = NULL;
+	struct ib_srq_attr	*attrs = &init_attrs->attr;
+	struct siw_pd		*pd = siw_pd_base2siw(base_pd);
+	struct siw_device	*sdev = pd->hdr.sdev;
+
+	int kernel_verbs = base_pd->uobject ? 0 : 1;
+	int rv;
+
+	if (atomic_inc_return(&sdev->num_srq) > SIW_MAX_SRQ) {
+		siw_dbg(sdev, "too many srq's\n");
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (attrs->max_wr == 0 || attrs->max_wr > SIW_MAX_SRQ_WR ||
+	    attrs->max_sge > SIW_MAX_SGE || attrs->srq_limit > attrs->max_wr) {
+		rv = -EINVAL;
+		goto err_out;
+	}
+
+	srq = kzalloc(sizeof(*srq), GFP_KERNEL);
+	if (!srq) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+
+	srq->max_sge = attrs->max_sge;
+	srq->num_rqe = roundup_pow_of_two(attrs->max_wr);
+	atomic_set(&srq->space, srq->num_rqe);
+
+	srq->limit = attrs->srq_limit;
+	if (srq->limit)
+		srq->armed = 1;
+
+	if (kernel_verbs)
+		srq->recvq = vzalloc(srq->num_rqe * sizeof(struct siw_rqe));
+	else
+		srq->recvq = vmalloc_user(srq->num_rqe *
+					  sizeof(struct siw_rqe));
+
+	if (srq->recvq == NULL) {
+		rv = -ENOMEM;
+		goto err_out;
+	}
+	if (kernel_verbs) {
+		srq->kernel_verbs = 1;
+	} else if (udata) {
+		struct siw_uresp_create_srq uresp;
+		struct siw_ucontext *ctx;
+
+		memset(&uresp, 0, sizeof(uresp));
+		ctx = siw_ctx_base2siw(base_pd->uobject->context);
+
+		uresp.num_rqe = srq->num_rqe;
+		uresp.srq_key = siw_insert_uobj(ctx, srq->recvq,
+					srq->num_rqe * sizeof(struct siw_rqe));
+
+		if (uresp.srq_key > SIW_MAX_UOBJ_KEY)
+			siw_dbg(sdev, "preparing mmap srq failed\n");
+
+		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		if (rv)
+			goto err_out;
+	}
+	srq->pd	= pd;
+	siw_pd_get(pd);
+
+	spin_lock_init(&srq->lock);
+
+	siw_dbg(sdev, "[SRQ 0x%p]: success\n", srq);
+
+	return &srq->base_srq;
+
+err_out:
+	if (srq) {
+		if (srq->recvq)
+			vfree(srq->recvq);
+		kfree(srq);
+	}
+	atomic_dec(&sdev->num_srq);
+
+	return ERR_PTR(rv);
+}
+
+/*
+ * siw_modify_srq()
+ *
+ * Modify SRQ. The caller may resize SRQ and/or set/reset notification
+ * limit and (re)arm IB_EVENT_SRQ_LIMIT_REACHED notification.
+ *
+ * NOTE: it is unclear whether the RDMA core allows changing the MAX_SGE
+ * parameter; siw_modify_srq() does not check the attrs->max_sge param.
+ */
+int siw_modify_srq(struct ib_srq *base_srq, struct ib_srq_attr *attrs,
+		   enum ib_srq_attr_mask attr_mask, struct ib_udata *udata)
+{
+	struct siw_srq	*srq = siw_srq_base2siw(base_srq);
+	unsigned long	flags;
+	int rv = 0;
+
+	spin_lock_irqsave(&srq->lock, flags);
+
+	if (attr_mask & IB_SRQ_MAX_WR) {
+		/* resize request not yet supported */
+		rv = -EOPNOTSUPP;
+		goto out;
+	}
+	if (attr_mask & IB_SRQ_LIMIT) {
+		if (attrs->srq_limit) {
+			if (unlikely(attrs->srq_limit > srq->num_rqe)) {
+				rv = -EINVAL;
+				goto out;
+			}
+			srq->armed = 1;
+		} else
+			srq->armed = 0;
+
+		srq->limit = attrs->srq_limit;
+	}
+out:
+	spin_unlock_irqrestore(&srq->lock, flags);
+
+	return rv;
+}
+
+/*
+ * siw_query_srq()
+ *
+ * Query SRQ attributes.
+ */
+int siw_query_srq(struct ib_srq *base_srq, struct ib_srq_attr *attrs)
+{
+	struct siw_srq	*srq = siw_srq_base2siw(base_srq);
+	unsigned long	flags;
+
+	spin_lock_irqsave(&srq->lock, flags);
+
+	attrs->max_wr = srq->num_rqe;
+	attrs->max_sge = srq->max_sge;
+	attrs->srq_limit = srq->limit;
+
+	spin_unlock_irqrestore(&srq->lock, flags);
+
+	return 0;
+}
+
+/*
+ * siw_destroy_srq()
+ *
+ * Destroy SRQ.
+ * It is assumed that the SRQ is not referenced by any
+ * QP anymore - the code trusts the RDMA core environment to keep track
+ * of QP references.
+ */
+int siw_destroy_srq(struct ib_srq *base_srq)
+{
+	struct siw_srq		*srq = siw_srq_base2siw(base_srq);
+	struct siw_device	*sdev = srq->pd->hdr.sdev;
+
+	siw_pd_put(srq->pd);
+
+	vfree(srq->recvq);
+	kfree(srq);
+
+	atomic_dec(&sdev->num_srq);
+
+	return 0;
+}
+
+/*
+ * siw_post_srq_recv()
+ *
+ * Post a list of receive queue elements to SRQ.
+ * NOTE: The function does not check or lock a certain SRQ state
+ *       during the post operation. The code simply trusts the
+ *       RDMA core environment.
+ *
+ * @base_srq:	Base SRQ contained in siw SRQ
+ * @wr:		List of R-WR's
+ * @bad_wr:	Updated to failing WR if posting fails.
+ */
+int siw_post_srq_recv(struct ib_srq *base_srq, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr)
+{
+	struct siw_srq	*srq = siw_srq_base2siw(base_srq);
+	int rv = 0;
+
+	if (!srq->kernel_verbs) {
+		siw_dbg(srq->pd->hdr.sdev,
+			"[SRQ 0x%p]: no kernel post_recv for mapped srq\n",
+			srq);
+		rv = -EINVAL;
+		goto out;
+	}
+	while (wr) {
+		u32 idx = srq->rq_put % srq->num_rqe;
+		struct siw_rqe *rqe = &srq->recvq[idx];
+
+		if (rqe->flags) {
+			siw_dbg(srq->pd->hdr.sdev, "[SRQ 0x%p]: srq full\n",
+				srq);
+			rv = -ENOMEM;
+			break;
+		}
+		if (wr->num_sge > srq->max_sge) {
+			siw_dbg(srq->pd->hdr.sdev,
+				"[SRQ 0x%p]: too many sge's: %d\n",
+				srq, wr->num_sge);
+			rv = -EINVAL;
+			break;
+		}
+		rqe->id = wr->wr_id;
+		rqe->num_sge = wr->num_sge;
+		siw_copy_sgl(wr->sg_list, rqe->sge, wr->num_sge);
+
+		/* Make sure S-RQE is completely written before valid */
+		smp_wmb();
+
+		rqe->flags = SIW_WQE_VALID;
+
+		srq->rq_put++;
+		wr = wr->next;
+	}
+out:
+	if (unlikely(rv < 0)) {
+		siw_dbg(srq->pd->hdr.sdev, "[SRQ 0x%p]: error %d\n", srq, rv);
+		*bad_wr = wr;
+	}
+	return rv;
+}
diff --git a/drivers/infiniband/sw/siw/siw_verbs.h b/drivers/infiniband/sw/siw/siw_verbs.h
new file mode 100644
index 000000000000..829f603923cc
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_verbs.h
@@ -0,0 +1,119 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _SIW_VERBS_H
+#define _SIW_VERBS_H
+
+#include <linux/errno.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_cm.h"
+
+
+extern struct ib_ucontext *siw_alloc_ucontext(struct ib_device *ibdev,
+					      struct ib_udata *udata);
+extern int siw_dealloc_ucontext(struct ib_ucontext *ucontext);
+extern int siw_query_port(struct ib_device *ibdev, u8 port,
+			  struct ib_port_attr *attr);
+extern int siw_get_port_immutable(struct ib_device *ibdev, u8 port,
+				  struct ib_port_immutable *port_imm);
+extern int siw_query_device(struct ib_device *ibdev,
+			    struct ib_device_attr *attr,
+			    struct ib_udata *udata);
+extern struct ib_cq *siw_create_cq(struct ib_device *ibdev,
+				   const struct ib_cq_init_attr *attr,
+				   struct ib_ucontext *ucontext,
+				   struct ib_udata *udata);
+extern int siw_no_mad(struct ib_device *base_dev, int flags, u8 port,
+		      const struct ib_wc *wc, const struct ib_grh *grh,
+		      const struct ib_mad_hdr *in_mad, size_t in_mad_size,
+		      struct ib_mad_hdr *out_mad, size_t *out_mad_size,
+		      u16 *outmad_pkey_index);
+extern int siw_query_pkey(struct ib_device *ibdev, u8 port,
+			  u16 idx, u16 *pkey);
+extern int siw_query_gid(struct ib_device *ibdev, u8 port, int idx,
+			 union ib_gid *gid);
+extern struct ib_pd *siw_alloc_pd(struct ib_device *ibdev,
+				  struct ib_ucontext *ucontext,
+				  struct ib_udata *udata);
+extern int siw_dealloc_pd(struct ib_pd *pd);
+extern struct ib_qp *siw_create_qp(struct ib_pd *pd,
+				  struct ib_qp_init_attr *attr,
+				   struct ib_udata *udata);
+extern int siw_query_qp(struct ib_qp *ofa_qp, struct ib_qp_attr *qp_attr,
+			int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr);
+extern int siw_verbs_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+			       int attr_mask, struct ib_udata *udata);
+extern int siw_destroy_qp(struct ib_qp *ibqp);
+extern int siw_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+			 struct ib_send_wr **bad_wr);
+extern int siw_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+			    struct ib_recv_wr **bad_wr);
+extern int siw_destroy_cq(struct ib_cq *ibcq);
+extern int siw_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+extern int siw_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
+extern struct ib_mr *siw_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
+				     u64 rnic_va, int rights,
+				     struct ib_udata *udata);
+extern struct ib_mr *siw_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
+				  u32 max_sge);
+extern struct ib_mr *siw_get_dma_mr(struct ib_pd *ibpd, int rights);
+extern int siw_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sl,
+			 int num_sle, unsigned int *sg_off);
+extern int siw_dereg_mr(struct ib_mr *ibmr);
+extern struct ib_srq *siw_create_srq(struct ib_pd *ibpd,
+				     struct ib_srq_init_attr *attr,
+				     struct ib_udata *udata);
+extern int siw_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
+			  enum ib_srq_attr_mask mask, struct ib_udata *udata);
+extern int siw_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr);
+extern int siw_destroy_srq(struct ib_srq *ibsrq);
+extern int siw_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr,
+			     struct ib_recv_wr **bad_wr);
+extern int siw_mmap(struct ib_ucontext *ibctx, struct vm_area_struct *vma);
+
+extern const struct dma_map_ops siw_dma_generic_ops;
+
+#endif
diff --git a/include/uapi/rdma/siw_user.h b/include/uapi/rdma/siw_user.h
new file mode 100644
index 000000000000..85cdc3e2c83d
--- /dev/null
+++ b/include/uapi/rdma/siw_user.h
@@ -0,0 +1,216 @@
+/*
+ * Software iWARP device driver for Linux
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _SIW_USER_H
+#define _SIW_USER_H
+
+#include <linux/types.h>
+
+/* Common string matched by the user library to accept the device */
+#define SIW_NODE_DESC_COMMON	"Software iWARP stack"
+
+#define SIW_IBDEV_PREFIX "siw_"
+
+#define VERSION_ID_SOFTIWARP	2
+
+#define SIW_MAX_SGE		6
+#define SIW_MAX_UOBJ_KEY	0xffffff
+#define SIW_INVAL_UOBJ_KEY	(SIW_MAX_UOBJ_KEY + 1)
+
+struct siw_uresp_create_cq {
+	__u32	cq_id;
+	__u32	num_cqe;
+	__u32	cq_key;
+};
+
+struct siw_uresp_create_qp {
+	__u32	qp_id;
+	__u32	num_sqe;
+	__u32	num_rqe;
+	__u32	sq_key;
+	__u32	rq_key;
+};
+
+struct siw_ureq_reg_mr {
+	__u8	stag_key;
+	__u8	reserved[3];
+};
+
+struct siw_uresp_reg_mr {
+	__u32	stag;
+};
+
+struct siw_uresp_create_srq {
+	__u32	num_rqe;
+	__u32	srq_key;
+};
+
+struct siw_uresp_alloc_ctx {
+	__u32	dev_id;
+};
+
+enum siw_opcode {
+	SIW_OP_WRITE		= 0,
+	SIW_OP_READ		= 1,
+	SIW_OP_READ_LOCAL_INV	= 2,
+	SIW_OP_SEND		= 3,
+	SIW_OP_SEND_WITH_IMM	= 4,
+	SIW_OP_SEND_REMOTE_INV	= 5,
+
+	/* Unsupported */
+	SIW_OP_FETCH_AND_ADD	= 6,
+	SIW_OP_COMP_AND_SWAP	= 7,
+
+	SIW_OP_RECEIVE		= 8,
+	/* provider internal SQE */
+	SIW_OP_READ_RESPONSE	= 9,
+	/*
+	 * below opcodes valid for
+	 * in-kernel clients only
+	 */
+	SIW_OP_INVAL_STAG	= 10,
+	SIW_OP_REG_MR		= 11,
+	SIW_NUM_OPCODES		= 12
+};
+
+/* Keep it same as ibv_sge to allow for memcpy */
+struct siw_sge {
+	__u64	laddr;
+	__u32	length;
+	__u32	lkey;
+};
+
+/*
+ * Inline data are kept within the work request itself occupying
+ * the space of sge[1] .. sge[n]. Therefore, inline data cannot be
+ * supported if SIW_MAX_SGE is below 2 elements.
+ */
+#define SIW_MAX_INLINE	(sizeof(struct siw_sge) * (SIW_MAX_SGE - 1))
+
+#if SIW_MAX_SGE < 2
+#error "SIW_MAX_SGE must be at least 2"
+#endif
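+
+/*
+ * Example (using the values above): with SIW_MAX_SGE == 6 and
+ * sizeof(struct siw_sge) == 16 bytes (8 + 4 + 4), an SQE can carry
+ * up to SIW_MAX_INLINE == 16 * (6 - 1) == 80 bytes of inline data.
+ */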
+
+enum siw_wqe_flags {
+	SIW_WQE_VALID           = 1,
+	SIW_WQE_INLINE          = (1 << 1),
+	SIW_WQE_SIGNALLED       = (1 << 2),
+	SIW_WQE_SOLICITED       = (1 << 3),
+	SIW_WQE_READ_FENCE	= (1 << 4),
+	SIW_WQE_COMPLETED       = (1 << 5)
+};
+
+/* Send Queue Element */
+struct siw_sqe {
+	__u64	id;
+	__u16	flags;
+	__u8	num_sge;
+	/* Contains enum siw_opcode values */
+	__u8	opcode;
+	__u32	rkey;
+	union {
+		__u64	raddr;
+		__u64	base_mr;
+	};
+	union {
+		struct siw_sge	sge[SIW_MAX_SGE];
+		__u32	access;
+	};
+};
+
+/* Receive Queue Element */
+struct siw_rqe {
+	__u64	id;
+	__u16	flags;
+	__u8	num_sge;
+	/*
+	 * only used by kernel driver,
+	 * ignored if set by user
+	 */
+	__u8	opcode;
+	__u32	unused;
+	struct siw_sge	sge[SIW_MAX_SGE];
+};
+
+enum siw_notify_flags {
+	SIW_NOTIFY_NOT			= (0),
+	SIW_NOTIFY_SOLICITED		= (1 << 0),
+	SIW_NOTIFY_NEXT_COMPLETION	= (1 << 1),
+	SIW_NOTIFY_MISSED_EVENTS	= (1 << 2),
+	SIW_NOTIFY_ALL = SIW_NOTIFY_SOLICITED |
+			SIW_NOTIFY_NEXT_COMPLETION |
+			SIW_NOTIFY_MISSED_EVENTS
+};
+
+enum siw_wc_status {
+	SIW_WC_SUCCESS		= 0,
+	SIW_WC_LOC_LEN_ERR	= 1,
+	SIW_WC_LOC_PROT_ERR	= 2,
+	SIW_WC_LOC_QP_OP_ERR	= 3,
+	SIW_WC_WR_FLUSH_ERR	= 4,
+	SIW_WC_BAD_RESP_ERR	= 5,
+	SIW_WC_LOC_ACCESS_ERR	= 6,
+	SIW_WC_REM_ACCESS_ERR	= 7,
+	SIW_WC_REM_INV_REQ_ERR	= 8,
+	SIW_WC_GENERAL_ERR	= 9,
+	SIW_NUM_WC_STATUS	= 10
+};
+
+struct siw_cqe {
+	__u64	id;
+	__u8	flags;
+	__u8	opcode;
+	__u16	status;
+	__u32	bytes;
+	__u64	imm_data;
+	/* QP number or QP pointer */
+	union {
+		void	*qp;
+		__u64	qp_id;
+	};
+};
+
+/*
+ * Shared structure between user and kernel
+ * to control CQ arming.
+ */
+struct siw_cq_ctrl {
+	enum siw_notify_flags	notify;
+};
+
+#endif
-- 
2.13.6


* [PATCH v3 06/13] SoftiWarp connection management
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (4 preceding siblings ...)
  2018-01-14 22:35   ` [PATCH v3 05/13] SoftiWarp application interface Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
       [not found]     ` <20180114223603.19961-7-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
  2018-01-31 17:28     ` Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 07/13] SoftiWarp application buffer management Bernard Metzler
                     ` (8 subsequent siblings)
  14 siblings, 2 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_cm.c | 2184 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_cm.h |  156 +++
 2 files changed, 2340 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_cm.c
 create mode 100644 drivers/infiniband/sw/siw/siw_cm.h

diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c
new file mode 100644
index 000000000000..6e80ac002ddb
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_cm.c
@@ -0,0 +1,2184 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Fredy Neeser <nfd-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Greg Joyce <greg-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ * Copyright (c) 2017, Open Grid Computing, Inc.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/net.h>
+#include <linux/inetdevice.h>
+#include <linux/workqueue.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/tcp.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_cm.h"
+#include "siw_obj.h"
+
+/*
+ * Set to any combination of
+ * MPA_V2_RDMA_NO_RTR, MPA_V2_RDMA_READ_RTR, MPA_V2_RDMA_WRITE_RTR
+ */
+static __be16 rtr_type = MPA_V2_RDMA_READ_RTR|MPA_V2_RDMA_WRITE_RTR;
+static const bool relaxed_ird_negotiation = 1;
+
+/*
+ * siw_sock_nodelay() - Disable the Nagle algorithm and/or enable TCP quick ACKs
+ */
+static int siw_sock_nodelay(struct socket *sock)
+{
+	int val = 1, rv = 0;
+
+	if (siw_lowdelay)
+		rv = kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
+				       (char *)&val, sizeof(val));
+
+	if (!rv && tcp_quickack)
+		rv =  kernel_setsockopt(sock, SOL_TCP, TCP_QUICKACK,
+					(char *)&val, sizeof(val));
+	return rv;
+}
+
+static void siw_cm_llp_state_change(struct sock *s);
+static void siw_cm_llp_data_ready(struct sock *s);
+static void siw_cm_llp_write_space(struct sock *s);
+static void siw_cm_llp_error_report(struct sock *s);
+static int siw_cm_upcall(struct siw_cep *cep, enum iw_cm_event_type reason,
+			 int status);
+
+static void siw_sk_assign_cm_upcalls(struct sock *sk)
+{
+	write_lock_bh(&sk->sk_callback_lock);
+	sk->sk_state_change = siw_cm_llp_state_change;
+	sk->sk_data_ready   = siw_cm_llp_data_ready;
+	sk->sk_write_space  = siw_cm_llp_write_space;
+	sk->sk_error_report = siw_cm_llp_error_report;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void siw_sk_save_upcalls(struct sock *sk)
+{
+	struct siw_cep *cep = sk_to_cep(sk);
+
+	write_lock_bh(&sk->sk_callback_lock);
+	cep->sk_state_change = sk->sk_state_change;
+	cep->sk_data_ready   = sk->sk_data_ready;
+	cep->sk_write_space  = sk->sk_write_space;
+	cep->sk_error_report = sk->sk_error_report;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void siw_sk_restore_upcalls(struct sock *sk, struct siw_cep *cep)
+{
+	sk->sk_state_change	= cep->sk_state_change;
+	sk->sk_data_ready	= cep->sk_data_ready;
+	sk->sk_write_space	= cep->sk_write_space;
+	sk->sk_error_report	= cep->sk_error_report;
+	sk->sk_user_data	= NULL;
+}
+
+static void siw_qp_socket_assoc(struct siw_cep *cep, struct siw_qp *qp)
+{
+	struct socket *s = cep->llp.sock;
+	struct sock *sk = s->sk;
+
+	write_lock_bh(&sk->sk_callback_lock);
+
+	qp->attrs.llp_stream_handle = s;
+	sk->sk_data_ready = siw_qp_llp_data_ready;
+	sk->sk_write_space = siw_qp_llp_write_space;
+
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void siw_socket_disassoc(struct socket *s)
+{
+	struct sock	*sk = s->sk;
+	struct siw_cep	*cep;
+
+	if (sk) {
+		write_lock_bh(&sk->sk_callback_lock);
+		cep = sk_to_cep(sk);
+		if (cep) {
+			siw_sk_restore_upcalls(sk, cep);
+			siw_cep_put(cep);
+		} else
+			pr_warn("SIW: cannot restore sk callbacks: no ep\n");
+		write_unlock_bh(&sk->sk_callback_lock);
+	} else {
+		pr_warn("SIW: cannot restore sk callbacks: no sk\n");
+	}
+}
+
+static void siw_rtr_data_ready(struct sock *sk)
+{
+	struct siw_cep		*cep;
+	struct siw_qp		*qp = NULL;
+	read_descriptor_t	rd_desc;
+
+	read_lock(&sk->sk_callback_lock);
+
+	cep = sk_to_cep(sk);
+	if (!cep) {
+		WARN(1, "No connection endpoint\n");
+		goto out;
+	}
+	qp = sk_to_qp(sk);
+
+	memset(&rd_desc, 0, sizeof(rd_desc));
+	rd_desc.arg.data = qp;
+	rd_desc.count = 1;
+
+	tcp_read_sock(sk, &rd_desc, siw_tcp_rx_data);
+	/*
+	 * Check if first frame was successfully processed.
+	 * Signal connection full establishment if yes.
+	 * Failed data processing would have already scheduled
+	 * connection drop.
+	 */
+	if (!qp->rx_ctx.rx_suspend)
+		siw_cm_upcall(cep, IW_CM_EVENT_ESTABLISHED, 0);
+out:
+	read_unlock(&sk->sk_callback_lock);
+	if (qp)
+		siw_qp_socket_assoc(cep, qp);
+}
+
+static void siw_sk_assign_rtr_upcalls(struct siw_cep *cep)
+{
+	struct sock *sk = cep->llp.sock->sk;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	sk->sk_data_ready = siw_rtr_data_ready;
+	sk->sk_write_space = siw_qp_llp_write_space;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static inline int kernel_peername(struct socket *s, struct sockaddr_in *addr)
+{
+	int addr_len;
+
+	return s->ops->getname(s, (struct sockaddr *)addr, &addr_len, 1);
+}
+
+static inline int kernel_localname(struct socket *s, struct sockaddr_in *addr)
+{
+	int addr_len;
+
+	return s->ops->getname(s, (struct sockaddr *)addr, &addr_len, 0);
+}
+
+static void siw_cep_socket_assoc(struct siw_cep *cep, struct socket *s)
+{
+	cep->llp.sock = s;
+	siw_cep_get(cep);
+	s->sk->sk_user_data = cep;
+
+	siw_sk_save_upcalls(s->sk);
+	siw_sk_assign_cm_upcalls(s->sk);
+}
+
+static struct siw_cep *siw_cep_alloc(struct siw_device  *sdev)
+{
+	struct siw_cep *cep = kzalloc(sizeof(*cep), GFP_KERNEL);
+
+	if (cep) {
+		unsigned long flags;
+
+		INIT_LIST_HEAD(&cep->listenq);
+		INIT_LIST_HEAD(&cep->devq);
+		INIT_LIST_HEAD(&cep->work_freelist);
+
+		kref_init(&cep->ref);
+		cep->state = SIW_EPSTATE_IDLE;
+		init_waitqueue_head(&cep->waitq);
+		spin_lock_init(&cep->lock);
+		cep->sdev = sdev;
+		cep->enhanced_rdma_conn_est = false;
+
+		spin_lock_irqsave(&sdev->idr_lock, flags);
+		list_add_tail(&cep->devq, &sdev->cep_list);
+		spin_unlock_irqrestore(&sdev->idr_lock, flags);
+		atomic_inc(&sdev->num_cep);
+
+		siw_dbg_cep(cep, "new object\n");
+	}
+	return cep;
+}
+
+static void siw_cm_free_work(struct siw_cep *cep)
+{
+	struct list_head	*w, *tmp;
+	struct siw_cm_work	*work;
+
+	list_for_each_safe(w, tmp, &cep->work_freelist) {
+		work = list_entry(w, struct siw_cm_work, list);
+		list_del(&work->list);
+		kfree(work);
+	}
+}
+
+static void siw_cancel_mpatimer(struct siw_cep *cep)
+{
+	spin_lock_bh(&cep->lock);
+	if (cep->mpa_timer) {
+		if (cancel_delayed_work(&cep->mpa_timer->work)) {
+			siw_cep_put(cep);
+			kfree(cep->mpa_timer); /* not needed again */
+		}
+		cep->mpa_timer = NULL;
+	}
+	spin_unlock_bh(&cep->lock);
+}
+
+static void siw_put_work(struct siw_cm_work *work)
+{
+	INIT_LIST_HEAD(&work->list);
+	spin_lock_bh(&work->cep->lock);
+	list_add(&work->list, &work->cep->work_freelist);
+	spin_unlock_bh(&work->cep->lock);
+}
+
+static void siw_cep_set_inuse(struct siw_cep *cep)
+{
+	unsigned long flags;
+	int rv;
+retry:
+	siw_dbg_cep(cep, "current in_use %d\n", cep->in_use);
+
+	spin_lock_irqsave(&cep->lock, flags);
+
+	if (cep->in_use) {
+		spin_unlock_irqrestore(&cep->lock, flags);
+		rv = wait_event_interruptible(cep->waitq, !cep->in_use);
+		if (signal_pending(current))
+			flush_signals(current);
+		goto retry;
+	} else {
+		cep->in_use = 1;
+		spin_unlock_irqrestore(&cep->lock, flags);
+	}
+}
+
+static void siw_cep_set_free(struct siw_cep *cep)
+{
+	unsigned long flags;
+
+	siw_dbg_cep(cep, "current in_use %d\n", cep->in_use);
+
+	spin_lock_irqsave(&cep->lock, flags);
+	cep->in_use = 0;
+	spin_unlock_irqrestore(&cep->lock, flags);
+
+	wake_up(&cep->waitq);
+}
+
+static void __siw_cep_dealloc(struct kref *ref)
+{
+	struct siw_cep *cep = container_of(ref, struct siw_cep, ref);
+	struct siw_device *sdev = cep->sdev;
+	unsigned long flags;
+
+	siw_dbg_cep(cep, "free object\n");
+
+	WARN_ON(cep->listen_cep);
+
+	/* kfree(NULL) is safe */
+	kfree(cep->mpa.pdata);
+	spin_lock_bh(&cep->lock);
+	if (!list_empty(&cep->work_freelist))
+		siw_cm_free_work(cep);
+	spin_unlock_bh(&cep->lock);
+
+	spin_lock_irqsave(&sdev->idr_lock, flags);
+	list_del(&cep->devq);
+	spin_unlock_irqrestore(&sdev->idr_lock, flags);
+	atomic_dec(&sdev->num_cep);
+	kfree(cep);
+}
+
+static struct siw_cm_work *siw_get_work(struct siw_cep *cep)
+{
+	struct siw_cm_work	*work = NULL;
+
+	spin_lock_bh(&cep->lock);
+	if (!list_empty(&cep->work_freelist)) {
+		work = list_entry(cep->work_freelist.next, struct siw_cm_work,
+				  list);
+		list_del_init(&work->list);
+	}
+	spin_unlock_bh(&cep->lock);
+	return work;
+}
+
+static int siw_cm_alloc_work(struct siw_cep *cep, int num)
+{
+	struct siw_cm_work	*work;
+
+	while (num--) {
+		work = kmalloc(sizeof(*work), GFP_KERNEL);
+		if (!work) {
+			if (!(list_empty(&cep->work_freelist)))
+				siw_cm_free_work(cep);
+			return -ENOMEM;
+		}
+		work->cep = cep;
+		INIT_LIST_HEAD(&work->list);
+		list_add(&work->list, &cep->work_freelist);
+	}
+	return 0;
+}
+
+/*
+ * siw_cm_upcall()
+ *
+ * Upcall to IWCM to inform about async connection events
+ */
+static int siw_cm_upcall(struct siw_cep *cep, enum iw_cm_event_type reason,
+			 int status)
+{
+	struct iw_cm_event	event;
+	struct iw_cm_id		*cm_id;
+
+	memset(&event, 0, sizeof(event));
+	event.status = status;
+	event.event = reason;
+
+	if (reason == IW_CM_EVENT_CONNECT_REQUEST) {
+		event.provider_data = cep;
+		cm_id = cep->listen_cep->cm_id;
+	} else {
+		cm_id = cep->cm_id;
+	}
+	/* Signal private data and address information */
+	if (reason == IW_CM_EVENT_CONNECT_REQUEST ||
+	    reason == IW_CM_EVENT_CONNECT_REPLY) {
+		u16 pd_len = be16_to_cpu(cep->mpa.hdr.params.pd_len);
+
+		if (pd_len && cep->enhanced_rdma_conn_est)
+			pd_len -= sizeof(struct mpa_v2_data);
+
+		if (pd_len) {
+			/*
+			 * hand over MPA private data
+			 */
+			event.private_data_len = pd_len;
+			event.private_data = cep->mpa.pdata;
+			/* Hide MPA V2 IRD/ORD control */
+			if (cep->enhanced_rdma_conn_est)
+				event.private_data +=
+					sizeof(struct mpa_v2_data);
+		}
+		to_sockaddr_in(event.local_addr) = cep->llp.laddr;
+		to_sockaddr_in(event.remote_addr) = cep->llp.raddr;
+	}
+	/* Signal IRD and ORD */
+	if (reason == IW_CM_EVENT_ESTABLISHED ||
+	    reason == IW_CM_EVENT_CONNECT_REPLY) {
+		/* Signal negotiated IRD/ORD values we will use */
+		event.ird = cep->ird;
+		event.ord = cep->ord;
+	} else if (reason == IW_CM_EVENT_CONNECT_REQUEST) {
+		event.ird = cep->ord;
+		event.ord = cep->ird;
+	}
+	siw_dbg_cep(cep, "[QP %d]: id 0x%p, reason=%d, status=%d\n",
+		    cep->qp ? QP_ID(cep->qp) : -1, cm_id, reason, status);
+
+	return cm_id->event_handler(cm_id, &event);
+}
+
+/*
+ * siw_qp_cm_drop()
+ *
+ * Drops established LLP connection if present and not already
+ * scheduled for dropping. Called from user context, SQ workqueue
+ * or receive IRQ. Caller signals if socket can be immediately
+ * closed (basically, if not in IRQ).
+ */
+void siw_qp_cm_drop(struct siw_qp *qp, int schedule)
+{
+	struct siw_cep *cep = qp->cep;
+
+	qp->rx_ctx.rx_suspend = 1;
+	qp->tx_ctx.tx_suspend = 1;
+
+	if (!qp->cep)
+		return;
+
+	if (schedule) {
+		siw_cm_queue_work(cep, SIW_CM_WORK_CLOSE_LLP);
+	} else {
+		siw_cep_set_inuse(cep);
+
+		if (cep->state == SIW_EPSTATE_CLOSED) {
+			siw_dbg_cep(cep, "already closed\n");
+			goto out;
+		}
+		siw_dbg_cep(cep, "immediate close, state %d\n", cep->state);
+
+		if (qp->term_info.valid)
+			siw_send_terminate(qp);
+
+		if (cep->cm_id) {
+			switch (cep->state) {
+
+			case SIW_EPSTATE_AWAIT_MPAREP:
+				siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+					      -EINVAL);
+				break;
+
+			case SIW_EPSTATE_RDMA_MODE:
+				siw_cm_upcall(cep, IW_CM_EVENT_CLOSE, 0);
+
+				break;
+
+			case SIW_EPSTATE_IDLE:
+			case SIW_EPSTATE_LISTENING:
+			case SIW_EPSTATE_CONNECTING:
+			case SIW_EPSTATE_AWAIT_MPAREQ:
+			case SIW_EPSTATE_RECVD_MPAREQ:
+			case SIW_EPSTATE_CLOSED:
+			default:
+
+				break;
+			}
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+			siw_cep_put(cep);
+		}
+		cep->state = SIW_EPSTATE_CLOSED;
+
+		if (cep->llp.sock) {
+			siw_socket_disassoc(cep->llp.sock);
+			/*
+			 * Immediately close socket
+			 */
+			sock_release(cep->llp.sock);
+			cep->llp.sock = NULL;
+		}
+		if (cep->qp) {
+			cep->qp = NULL;
+			siw_qp_put(qp);
+		}
+out:
+		siw_cep_set_free(cep);
+	}
+}
+
+void siw_cep_put(struct siw_cep *cep)
+{
+	siw_dbg_cep(cep, "new refcount: %d\n", refcount_read(&cep->ref) - 1);
+
+	WARN_ON(refcount_read(&cep->ref) < 1);
+	kref_put(&cep->ref, __siw_cep_dealloc);
+}
+
+void siw_cep_get(struct siw_cep *cep)
+{
+	kref_get(&cep->ref);
+	siw_dbg_cep(cep, "new refcount: %d\n", refcount_read(&cep->ref));
+}
+
+static inline int ksock_recv(struct socket *sock, char *buf, size_t size,
+			     int flags)
+{
+	struct kvec iov = {buf, size};
+	struct msghdr msg = {.msg_name = NULL, .msg_flags = flags};
+
+	return kernel_recvmsg(sock, &msg, &iov, 1, size, flags);
+}
+
+/*
+ * Expects params->pd_len in host byte order
+ */
+static int siw_send_mpareqrep(struct siw_cep *cep, const void *pdata,
+			      u8 pd_len)
+{
+	struct socket	*s = cep->llp.sock;
+	struct mpa_rr	*rr = &cep->mpa.hdr;
+	struct kvec	iov[3];
+	struct msghdr	msg;
+	int		rv;
+	int		iovec_num = 0;
+	int		mpa_len;
+
+	memset(&msg, 0, sizeof(msg));
+
+	iov[iovec_num].iov_base = rr;
+	iov[iovec_num].iov_len = sizeof(*rr);
+	mpa_len = sizeof(*rr);
+
+	if (cep->enhanced_rdma_conn_est) {
+		iovec_num++;
+		iov[iovec_num].iov_base = &cep->mpa.v2_ctrl;
+		iov[iovec_num].iov_len = sizeof(cep->mpa.v2_ctrl);
+		mpa_len += sizeof(cep->mpa.v2_ctrl);
+	}
+	if (pd_len) {
+		iovec_num++;
+		iov[iovec_num].iov_base = (char *)pdata;
+		iov[iovec_num].iov_len = pd_len;
+		mpa_len += pd_len;
+	}
+	if (cep->enhanced_rdma_conn_est)
+		pd_len += sizeof(cep->mpa.v2_ctrl);
+
+	rr->params.pd_len = cpu_to_be16(pd_len);
+
+	rv = kernel_sendmsg(s, &msg, iov, iovec_num + 1, mpa_len);
+
+	return rv < 0 ? rv : 0;
+}
+
+/*
+ * Receive MPA Request/Reply header.
+ *
+ * Returns 0 if the complete MPA Request/Reply header, including
+ * any private data, was received. Returns -EAGAIN if the header
+ * was only partially received, or a negative error code otherwise.
+ *
+ * Context: May be called in process context only
+ */
+static int siw_recv_mpa_rr(struct siw_cep *cep)
+{
+	struct mpa_rr	*hdr = &cep->mpa.hdr;
+	struct socket	*s = cep->llp.sock;
+	u16		pd_len;
+	int		rcvd, to_rcv;
+
+	if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
+
+		rcvd = ksock_recv(s, (char *)hdr + cep->mpa.bytes_rcvd,
+				  sizeof(struct mpa_rr) -
+				  cep->mpa.bytes_rcvd, 0);
+
+		if (rcvd <= 0)
+			return -ECONNABORTED;
+
+		cep->mpa.bytes_rcvd += rcvd;
+
+		if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr))
+			return -EAGAIN;
+
+		if (be16_to_cpu(hdr->params.pd_len) > MPA_MAX_PRIVDATA)
+			return -EPROTO;
+	}
+	pd_len = be16_to_cpu(hdr->params.pd_len);
+
+	/*
+	 * At least the MPA Request/Reply header (frame not including
+	 * private data) has been received.
+	 * Receive (or continue receiving) any private data.
+	 */
+	to_rcv = pd_len - (cep->mpa.bytes_rcvd - sizeof(struct mpa_rr));
+
+	if (!to_rcv) {
+		/*
+		 * We must have hdr->params.pd_len == 0 and thus received a
+		 * complete MPA Request/Reply frame.
+		 * Check against peer protocol violation.
+		 */
+		u32 word;
+
+		rcvd = ksock_recv(s, (char *)&word, sizeof(word), MSG_DONTWAIT);
+		if (rcvd == -EAGAIN)
+			return 0;
+
+		if (rcvd == 0) {
+			siw_dbg_cep(cep, "peer EOF\n");
+			return -EPIPE;
+		}
+		if (rcvd < 0) {
+			siw_dbg_cep(cep, "error: %d\n", rcvd);
+			return rcvd;
+		}
+		siw_dbg_cep(cep, "peer sent extra data: %d\n", rcvd);
+
+		return -EPROTO;
+	}
+
+	/*
+	 * At this point, we must have hdr->params.pd_len != 0.
+	 * A private data buffer gets allocated if hdr->params.pd_len != 0.
+	 */
+	if (!cep->mpa.pdata) {
+		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
+		if (!cep->mpa.pdata)
+			return -ENOMEM;
+	}
+	rcvd = ksock_recv(s, cep->mpa.pdata + cep->mpa.bytes_rcvd
+			  - sizeof(struct mpa_rr), to_rcv + 4, MSG_DONTWAIT);
+
+	if (rcvd < 0)
+		return rcvd;
+
+	if (rcvd > to_rcv)
+		return -EPROTO;
+
+	cep->mpa.bytes_rcvd += rcvd;
+
+	if (to_rcv == rcvd) {
+		siw_dbg_cep(cep, "%d bytes private data received\n", pd_len);
+		return 0;
+	}
+	return -EAGAIN;
+}
+
+/*
+ * siw_proc_mpareq()
+ *
+ * Read MPA Request from socket and signal new connection to IWCM
+ * if success. Caller must hold lock on corresponding listening CEP.
+ */
+static int siw_proc_mpareq(struct siw_cep *cep)
+{
+	struct mpa_rr	*req;
+	int		version, rv;
+	u16		pd_len;
+
+	rv = siw_recv_mpa_rr(cep);
+	if (rv)
+		return rv;
+
+	req = &cep->mpa.hdr;
+
+	version = __mpa_rr_revision(req->params.bits);
+	pd_len = be16_to_cpu(req->params.pd_len);
+
+	if (version > MPA_REVISION_2)
+		/* allow for 0, 1, and 2 only */
+		return -EPROTO;
+
+	if (memcmp(req->key, MPA_KEY_REQ, 16))
+		return -EPROTO;
+
+	/* Prepare for sending MPA reply */
+	memcpy(req->key, MPA_KEY_REP, 16);
+
+	if (version == MPA_REVISION_2 &&
+	    (req->params.bits & MPA_RR_FLAG_ENHANCED)) {
+		/*
+		 * MPA version 2 must signal IRD/ORD values and P2P mode
+		 * in private data if header flag MPA_RR_FLAG_ENHANCED
+		 * is set.
+		 */
+		if (pd_len < sizeof(struct mpa_v2_data))
+			goto reject_conn;
+
+		cep->enhanced_rdma_conn_est = true;
+	}
+
+	/* MPA Markers: currently not supported. Marker TX to be added. */
+	if (req->params.bits & MPA_RR_FLAG_MARKERS)
+		goto reject_conn;
+
+	if (req->params.bits & MPA_RR_FLAG_CRC) {
+		/*
+		 * RFC 5044, page 27: CRC MUST be used if the peer requests it.
+		 * siw specific: reject a connection requesting CRC if local
+		 * CRC generation is off and 'mpa_crc_strict' is set.
+		 */
+		if (!mpa_crc_required && mpa_crc_strict)
+			goto reject_conn;
+
+		/* Enable CRC if requested by module parameter */
+		if (mpa_crc_required)
+			req->params.bits |= MPA_RR_FLAG_CRC;
+	}
+
+	if (cep->enhanced_rdma_conn_est) {
+		struct mpa_v2_data *v2 = (struct mpa_v2_data *)cep->mpa.pdata;
+
+		/*
+		 * Peer requested ORD becomes requested local IRD,
+		 * peer requested IRD becomes requested local ORD.
+		 * IRD and ORD get limited by global maximum values.
+		 */
+		cep->ord = ntohs(v2->ird) & MPA_IRD_ORD_MASK;
+		cep->ord = min(cep->ord, SIW_MAX_ORD_QP);
+		cep->ird = ntohs(v2->ord) & MPA_IRD_ORD_MASK;
+		cep->ird = min(cep->ird, SIW_MAX_IRD_QP);
+
+		/* May get overwritten by locally negotiated values */
+		cep->mpa.v2_ctrl.ird = htons(cep->ird);
+		cep->mpa.v2_ctrl.ord = htons(cep->ord);
+
+		/*
+		 * Support a peer-sent zero-length Write or Read to let
+		 * the local side enter RTS. Writes are preferred; Sends
+		 * would require pre-posting a Receive and are not
+		 * supported.
+		 * Propose a zero-length Write if neither Read nor Write
+		 * is indicated.
+		 */
+		if (v2->ird & MPA_V2_PEER_TO_PEER) {
+			cep->mpa.v2_ctrl.ird |= MPA_V2_PEER_TO_PEER;
+
+			if (v2->ord & MPA_V2_RDMA_WRITE_RTR)
+				cep->mpa.v2_ctrl.ord |= MPA_V2_RDMA_WRITE_RTR;
+			else if (v2->ord & MPA_V2_RDMA_READ_RTR)
+				cep->mpa.v2_ctrl.ord |= MPA_V2_RDMA_READ_RTR;
+			else
+				cep->mpa.v2_ctrl.ord |= MPA_V2_RDMA_WRITE_RTR;
+		}
+	}
+
+	cep->state = SIW_EPSTATE_RECVD_MPAREQ;
+
+	/* Keep reference until IWCM accepts/rejects */
+	siw_cep_get(cep);
+	rv = siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REQUEST, 0);
+	if (rv)
+		siw_cep_put(cep);
+
+	return rv;
+
+reject_conn:
+	siw_dbg_cep(cep, "reject: crc %d:%d:%d, m %d:%d\n",
+		    req->params.bits & MPA_RR_FLAG_CRC ? 1 : 0,
+		    mpa_crc_required, mpa_crc_strict,
+		    req->params.bits & MPA_RR_FLAG_MARKERS ? 1 : 0, 0);
+
+	req->params.bits &= ~MPA_RR_FLAG_MARKERS;
+	req->params.bits |= MPA_RR_FLAG_REJECT;
+
+	if (!mpa_crc_required && mpa_crc_strict)
+		req->params.bits &= ~MPA_RR_FLAG_CRC;
+
+	if (pd_len)
+		kfree(cep->mpa.pdata);
+
+	cep->mpa.pdata = NULL;
+
+	(void)siw_send_mpareqrep(cep, NULL, 0);
+
+	return -EOPNOTSUPP;
+}
+
+static int siw_proc_mpareply(struct siw_cep *cep)
+{
+	struct siw_qp_attrs	qp_attrs;
+	enum siw_qp_attr_mask	qp_attr_mask;
+	struct siw_qp		*qp = cep->qp;
+	struct mpa_rr		*rep;
+	int			rv;
+	u16			rep_ord;
+	u16			rep_ird;
+	bool			ird_insufficient = false;
+	enum mpa_v2_ctrl	mpa_p2p_mode = MPA_V2_RDMA_NO_RTR;
+
+	rv = siw_recv_mpa_rr(cep);
+	if (rv != -EAGAIN)
+		siw_cancel_mpatimer(cep);
+	if (rv)
+		goto out_err;
+
+	rep = &cep->mpa.hdr;
+
+	if (__mpa_rr_revision(rep->params.bits) > MPA_REVISION_2) {
+		/* allow for 0, 1,  and 2 only */
+		rv = -EPROTO;
+		goto out_err;
+	}
+	if (memcmp(rep->key, MPA_KEY_REP, 16)) {
+		siw_init_terminate(qp, TERM_ERROR_LAYER_LLP,
+				   LLP_ETYPE_MPA,
+				   LLP_ECODE_INVALID_REQ_RESP, 0);
+		siw_send_terminate(qp);
+		rv = -EPROTO;
+		goto out_err;
+	}
+	if (rep->params.bits & MPA_RR_FLAG_REJECT) {
+		siw_dbg_cep(cep, "got mpa reject\n");
+		(void)siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+				    -ECONNRESET);
+
+		return -ECONNRESET;
+	}
+	if ((rep->params.bits & MPA_RR_FLAG_MARKERS)
+		|| (mpa_crc_required && !(rep->params.bits & MPA_RR_FLAG_CRC))
+		|| (mpa_crc_strict && !mpa_crc_required
+			&& (rep->params.bits & MPA_RR_FLAG_CRC))) {
+
+		siw_dbg_cep(cep, "reply unsupp: crc %d:%d:%d, m %d:%d\n",
+			    rep->params.bits & MPA_RR_FLAG_CRC ? 1 : 0,
+			    mpa_crc_required, mpa_crc_strict,
+			    rep->params.bits & MPA_RR_FLAG_MARKERS ? 1 : 0, 0);
+
+		(void)siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+				    -ECONNREFUSED);
+		return -EINVAL;
+	}
+
+	if (cep->enhanced_rdma_conn_est) {
+		struct mpa_v2_data *v2;
+
+		if (__mpa_rr_revision(rep->params.bits) < MPA_REVISION_2 ||
+		    !(rep->params.bits & MPA_RR_FLAG_ENHANCED)) {
+			/*
+			 * Protocol failure: The responder MUST reply with
+			 * MPA version 2 and MUST set MPA_RR_FLAG_ENHANCED.
+			 */
+			siw_dbg_cep(cep, "mpa reply error: vers %d, enhcd %d\n",
+				__mpa_rr_revision(rep->params.bits),
+				rep->params.bits & MPA_RR_FLAG_ENHANCED ? 1:0);
+
+			(void)siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+					    -ECONNRESET);
+			return -EINVAL;
+		}
+		v2 = (struct mpa_v2_data *)cep->mpa.pdata;
+		rep_ird = ntohs(v2->ird) & MPA_IRD_ORD_MASK;
+		rep_ord = ntohs(v2->ord) & MPA_IRD_ORD_MASK;
+
+		if (cep->ird < rep_ord &&
+		    (relaxed_ird_negotiation == false ||
+		     rep_ord > cep->sdev->attrs.max_ird)) {
+			siw_dbg_cep(cep, "ird %d, rep_ord %d, max_ird %d\n",
+				    cep->ird, rep_ord,
+				    cep->sdev->attrs.max_ird);
+			ird_insufficient = true;
+		}
+		if (cep->ord > rep_ird && relaxed_ird_negotiation == false) {
+			siw_dbg_cep(cep, "ord %d, rep_ird %d\n",
+				    cep->ord, rep_ird);
+			ird_insufficient = true;
+		}
+		/*
+		 * Always report negotiated peer values to user,
+		 * even if IRD/ORD negotiation failed
+		 */
+		cep->ird = rep_ord;
+		cep->ord = rep_ird;
+
+		if (ird_insufficient) {
+			/*
+			 * If the initiator IRD is insufficient for the
+			 * responder ORD, send a TERM.
+			 */
+			siw_init_terminate(qp, TERM_ERROR_LAYER_LLP,
+					   LLP_ETYPE_MPA,
+					   LLP_ECODE_INSUFFICIENT_IRD, 0);
+			siw_send_terminate(qp);
+			rv = -ENOMEM;
+			goto out_err;
+		}
+
+		if (cep->mpa.v2_ctrl_req.ird & MPA_V2_PEER_TO_PEER)
+			mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord &
+				(MPA_V2_RDMA_WRITE_RTR | MPA_V2_RDMA_READ_RTR);
+
+		/*
+		 * Check if we requested P2P mode, and if peer agrees
+		 */
+		if (mpa_p2p_mode != MPA_V2_RDMA_NO_RTR) {
+			if ((mpa_p2p_mode & v2->ord) == 0) {
+				/*
+				 * We requested RTR mode(s), but the peer
+				 * did not pick any mode we support.
+				 */
+				siw_dbg_cep(cep,
+					"rtr mode:  req %2x, got %2x\n",
+					mpa_p2p_mode,
+					v2->ord & (MPA_V2_RDMA_WRITE_RTR |
+						   MPA_V2_RDMA_READ_RTR));
+
+				siw_init_terminate(qp, TERM_ERROR_LAYER_LLP,
+						   LLP_ETYPE_MPA,
+						   LLP_ECODE_NO_MATCHING_RTR,
+						   0);
+				siw_send_terminate(qp);
+				rv = -EPROTO;
+				goto out_err;
+			}
+			mpa_p2p_mode = v2->ord &
+				(MPA_V2_RDMA_WRITE_RTR | MPA_V2_RDMA_READ_RTR);
+		}
+	}
+
+	memset(&qp_attrs, 0, sizeof(qp_attrs));
+
+	if (rep->params.bits & MPA_RR_FLAG_CRC)
+		qp_attrs.flags = SIW_MPA_CRC;
+
+	qp_attrs.irq_size = cep->ird;
+	qp_attrs.orq_size = cep->ord;
+	qp_attrs.llp_stream_handle = cep->llp.sock;
+	qp_attrs.state = SIW_QP_STATE_RTS;
+
+	qp_attr_mask = SIW_QP_ATTR_STATE | SIW_QP_ATTR_LLP_HANDLE |
+		       SIW_QP_ATTR_ORD | SIW_QP_ATTR_IRD | SIW_QP_ATTR_MPA;
+
+	/* Move socket RX/TX under QP control */
+	down_write(&qp->state_lock);
+	if (qp->attrs.state > SIW_QP_STATE_RTR) {
+		rv = -EINVAL;
+		up_write(&qp->state_lock);
+		goto out_err;
+	}
+	rv = siw_qp_modify(qp, &qp_attrs, qp_attr_mask);
+
+	siw_qp_socket_assoc(cep, qp);
+
+	up_write(&qp->state_lock);
+
+	/* Send extra RDMA frame to trigger peer RTS if negotiated */
+	if (mpa_p2p_mode != MPA_V2_RDMA_NO_RTR) {
+		rv = siw_qp_mpa_rts(qp, mpa_p2p_mode);
+		if (rv)
+			goto out_err;
+	}
+
+	if (!rv) {
+		rv = siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, 0);
+		if (!rv)
+			cep->state = SIW_EPSTATE_RDMA_MODE;
+
+		return 0;
+	}
+
+out_err:
+	(void)siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, -EINVAL);
+
+	return rv;
+}
+
+/*
+ * siw_accept_newconn - accept an incoming pending connection
+ *
+ */
+static void siw_accept_newconn(struct siw_cep *cep)
+{
+	struct socket		*s = cep->llp.sock;
+	struct socket		*new_s = NULL;
+	struct siw_cep		*new_cep = NULL;
+	int			rv = 0; /* debug only. should disappear */
+
+	if (cep->state != SIW_EPSTATE_LISTENING)
+		goto error;
+
+	new_cep = siw_cep_alloc(cep->sdev);
+	if (!new_cep)
+		goto error;
+
+	/*
+	 * Allocate 4 work elements: enough for concurrent handling
+	 * of local + peer close events, MPA header processing and
+	 * the MPA timeout.
+	 */
+	if (siw_cm_alloc_work(new_cep, 4) != 0)
+		goto error;
+
+	/*
+	 * Copy saved socket callbacks from listening CEP
+	 * and assign new socket with new CEP
+	 */
+	new_cep->sk_state_change = cep->sk_state_change;
+	new_cep->sk_data_ready   = cep->sk_data_ready;
+	new_cep->sk_write_space  = cep->sk_write_space;
+	new_cep->sk_error_report = cep->sk_error_report;
+
+	rv = kernel_accept(s, &new_s, O_NONBLOCK);
+	if (rv != 0) {
+		/*
+		 * Connection already aborted by the peer?
+		 */
+		siw_dbg_cep(cep, "kernel_accept() error: %d\n", rv);
+		goto error;
+	}
+	new_cep->llp.sock = new_s;
+	siw_cep_get(new_cep);
+	new_s->sk->sk_user_data = new_cep;
+
+	siw_dbg_cep(cep, "listen socket 0x%p, new 0x%p\n", s, new_s);
+
+	rv = siw_sock_nodelay(new_s);
+	if (rv) {
+		siw_dbg_cep(cep, "siw_sock_nodelay() error: %d\n", rv);
+		goto error;
+	}
+
+	rv = kernel_peername(new_s, &new_cep->llp.raddr);
+	if (rv) {
+		siw_dbg_cep(cep, "kernel_peername() error: %d\n", rv);
+		goto error;
+	}
+	rv = kernel_localname(new_s, &new_cep->llp.laddr);
+	if (rv) {
+		siw_dbg_cep(cep, "kernel_localname() error: %d\n", rv);
+		goto error;
+	}
+
+	new_cep->state = SIW_EPSTATE_AWAIT_MPAREQ;
+
+	rv = siw_cm_queue_work(new_cep, SIW_CM_WORK_MPATIMEOUT);
+	if (rv)
+		goto error;
+	/*
+	 * See siw_proc_mpareq() etc. for the use of new_cep->listen_cep.
+	 */
+	new_cep->listen_cep = cep;
+	siw_cep_get(cep);
+
+	if (atomic_read(&new_s->sk->sk_rmem_alloc)) {
+		/*
+		 * MPA REQ already queued
+		 */
+		siw_dbg_cep(cep, "immediate mpa request\n");
+
+		siw_cep_set_inuse(new_cep);
+		rv = siw_proc_mpareq(new_cep);
+		siw_cep_set_free(new_cep);
+
+		if (rv != -EAGAIN) {
+			siw_cep_put(cep);
+			new_cep->listen_cep = NULL;
+			if (rv)
+				goto error;
+		}
+	}
+	return;
+
+error:
+	if (new_cep)
+		siw_cep_put(new_cep);
+
+	if (new_s) {
+		siw_socket_disassoc(new_s);
+		sock_release(new_s);
+		new_cep->llp.sock = NULL;
+	}
+	siw_dbg_cep(cep, "error %d\n", rv);
+}
+
+static void siw_cm_work_handler(struct work_struct *w)
+{
+	struct siw_cm_work	*work;
+	struct siw_cep		*cep;
+	int release_cep = 0, rv = 0;
+
+	work = container_of(w, struct siw_cm_work, work.work);
+	cep = work->cep;
+
+	siw_dbg_cep(cep, "[QP %d]: work type: %d, state %d\n",
+		    cep->qp ? QP_ID(cep->qp) : -1, work->type, cep->state);
+
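+	/* Take exclusive control of the CEP while handling this work item */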
+	siw_cep_set_inuse(cep);
+
+	switch (work->type) {
+
+	case SIW_CM_WORK_ACCEPT:
+
+		siw_accept_newconn(cep);
+		break;
+
+	case SIW_CM_WORK_READ_MPAHDR:
+
+		switch (cep->state) {
+
+		case SIW_EPSTATE_AWAIT_MPAREQ:
+
+			if (cep->listen_cep) {
+				siw_cep_set_inuse(cep->listen_cep);
+
+				if (cep->listen_cep->state ==
+				    SIW_EPSTATE_LISTENING)
+					rv = siw_proc_mpareq(cep);
+				else
+					rv = -EFAULT;
+
+				siw_cep_set_free(cep->listen_cep);
+
+				if (rv != -EAGAIN) {
+					siw_cep_put(cep->listen_cep);
+					cep->listen_cep = NULL;
+					if (rv)
+						siw_cep_put(cep);
+				}
+			}
+			break;
+
+		case SIW_EPSTATE_AWAIT_MPAREP:
+
+			rv = siw_proc_mpareply(cep);
+			break;
+
+		default:
+			/*
+			 * The CEP has already moved out of the MPA
+			 * handshake and any connection management is
+			 * already done. Silently ignore the MPA packet.
+			 */
+			if (cep->state == SIW_EPSTATE_RDMA_MODE) {
+				cep->llp.sock->sk->sk_data_ready(
+					cep->llp.sock->sk);
+				siw_dbg_cep(cep, "already in RDMA mode\n");
+			} else {
+				siw_dbg_cep(cep, "out of state: %d\n",
+					    cep->state);
+			}
+		}
+		if (rv && rv != -EAGAIN)
+			release_cep = 1;
+
+		break;
+
+	case SIW_CM_WORK_CLOSE_LLP:
+		/*
+		 * QP scheduled LLP close
+		 */
+		siw_dbg_cep(cep, "work close_llp, state=%d\n", cep->state);
+
+		if (cep->qp && cep->qp->term_info.valid)
+			siw_send_terminate(cep->qp);
+
+		if (cep->cm_id)
+			siw_cm_upcall(cep, IW_CM_EVENT_CLOSE, 0);
+
+		release_cep = 1;
+
+		break;
+
+	case SIW_CM_WORK_PEER_CLOSE:
+
+		siw_dbg_cep(cep, "work peer_close, state=%d\n", cep->state);
+
+		if (cep->cm_id) {
+			switch (cep->state) {
+
+			case SIW_EPSTATE_AWAIT_MPAREP:
+				/*
+				 * MPA reply not received, but connection drop
+				 */
+				siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+					      -ECONNRESET);
+				break;
+
+			case SIW_EPSTATE_RDMA_MODE:
+				/*
+				 * NOTE: IW_CM_EVENT_DISCONNECT is given just
+				 *       to transition IWCM into CLOSING.
+				 */
+				siw_cm_upcall(cep, IW_CM_EVENT_DISCONNECT, 0);
+				siw_cm_upcall(cep, IW_CM_EVENT_CLOSE, 0);
+
+				break;
+
+			default:
+				/*
+				 * For these states there is no connection
+				 * known to the IWCM.
+				 */
+				break;
+			}
+		} else {
+			switch (cep->state) {
+
+			case SIW_EPSTATE_RECVD_MPAREQ:
+				/*
+				 * Wait for the ulp/CM to call accept/reject
+				 */
+				siw_dbg_cep(cep,
+					    "mpa req recvd, wait for ULP\n");
+				break;
+
+			case SIW_EPSTATE_AWAIT_MPAREQ:
+				/*
+				 * Socket close before MPA request received.
+				 */
+				siw_dbg_cep(cep, "no mpareq: drop listener\n");
+				siw_cep_put(cep->listen_cep);
+				cep->listen_cep = NULL;
+
+				break;
+
+			default:
+				break;
+			}
+		}
+		release_cep = 1;
+
+		break;
+
+	case SIW_CM_WORK_MPATIMEOUT:
+
+		cep->mpa_timer = NULL;
+
+		if (cep->state == SIW_EPSTATE_AWAIT_MPAREP) {
+			/*
+			 * MPA request timed out:
+			 * Hide any partially received private data and signal
+			 * timeout
+			 */
+			cep->mpa.hdr.params.pd_len = 0;
+
+			if (cep->cm_id)
+				siw_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+					      -ETIMEDOUT);
+			release_cep = 1;
+
+		} else if (cep->state == SIW_EPSTATE_AWAIT_MPAREQ) {
+			/*
+			 * No MPA request received after peer TCP stream setup.
+			 */
+			if (cep->listen_cep) {
+				siw_cep_put(cep->listen_cep);
+				cep->listen_cep = NULL;
+			}
+			release_cep = 1;
+		}
+		break;
+
+	default:
+		WARN(1, "Undefined CM work type: %d\n", work->type);
+	}
+
+	if (release_cep) {
+		siw_dbg_cep(cep,
+			    "release: timer=%s, sock=0x%p, QP %d, id 0x%p\n",
+			    cep->mpa_timer ? "y" : "n", cep->llp.sock,
+			    cep->qp ? QP_ID(cep->qp) : -1, cep->cm_id);
+
+		siw_cancel_mpatimer(cep);
+
+		cep->state = SIW_EPSTATE_CLOSED;
+
+		if (cep->qp) {
+			struct siw_qp *qp = cep->qp;
+			/*
+			 * Serialize a potential race with application
+			 * closing the QP and calling siw_qp_cm_drop()
+			 */
+			siw_qp_get(qp);
+			siw_cep_set_free(cep);
+
+			siw_qp_llp_close(qp);
+			siw_qp_put(qp);
+
+			siw_cep_set_inuse(cep);
+			cep->qp = NULL;
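+			/* Drop the reference the CEP held on the QP */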
+			siw_qp_put(qp);
+		}
+		if (cep->llp.sock) {
+			siw_socket_disassoc(cep->llp.sock);
+			sock_release(cep->llp.sock);
+			cep->llp.sock = NULL;
+		}
+		if (cep->cm_id) {
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+			siw_cep_put(cep);
+		}
+	}
+
+	siw_cep_set_free(cep);
+	siw_dbg_cep(cep, "Exit: work type: %d\n", work->type);
+
+	siw_put_work(work);
+	siw_cep_put(cep);
+}
+
+static struct workqueue_struct *siw_cm_wq;
+
+int siw_cm_queue_work(struct siw_cep *cep, enum siw_work_type type)
+{
+	struct siw_cm_work *work = siw_get_work(cep);
+	unsigned long delay = 0;
+
+	if (!work) {
+		siw_dbg_cep(cep, "failed with no work available\n");
+		return -ENOMEM;
+	}
+	work->type = type;
+	work->cep = cep;
+
+	siw_cep_get(cep);
+
+	INIT_DELAYED_WORK(&work->work, siw_cm_work_handler);
+
+	if (type == SIW_CM_WORK_MPATIMEOUT) {
+		cep->mpa_timer = work;
+
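+		/*
+		 * The connection initiator (AWAIT_MPAREP) allows more
+		 * time for the MPA reply to arrive than the responder
+		 * allows for the MPA request.
+		 */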
+		if (cep->state == SIW_EPSTATE_AWAIT_MPAREP)
+			delay = MPAREQ_TIMEOUT;
+		else
+			delay = MPAREP_TIMEOUT;
+	}
+	siw_dbg_cep(cep, "[QP %d]: work type: %d, work 0x%p, timeout %lu\n",
+		    cep->qp ? QP_ID(cep->qp) : -1, type, work, delay);
+
+	queue_delayed_work(siw_cm_wq, &work->work, delay);
+
+	return 0;
+}
+
+static void siw_cm_llp_data_ready(struct sock *sk)
+{
+	struct siw_cep	*cep;
+
+	read_lock(&sk->sk_callback_lock);
+
+	cep = sk_to_cep(sk);
+	if (!cep) {
+		WARN_ON(1);
+		goto out;
+	}
+
+	siw_dbg_cep(cep, "state: %d\n", cep->state);
+
+	switch (cep->state) {
+
+	case SIW_EPSTATE_RDMA_MODE:
+		/* fall through */
+	case SIW_EPSTATE_LISTENING:
+
+		break;
+
+	case SIW_EPSTATE_AWAIT_MPAREQ:
+		/* fall through */
+	case SIW_EPSTATE_AWAIT_MPAREP:
+
+		siw_cm_queue_work(cep, SIW_CM_WORK_READ_MPAHDR);
+		break;
+
+	default:
+		siw_dbg_cep(cep, "unexpected data, state %d\n", cep->state);
+		break;
+	}
+out:
+	read_unlock(&sk->sk_callback_lock);
+}
+
+static void siw_cm_llp_write_space(struct sock *sk)
+{
+	struct siw_cep	*cep = sk_to_cep(sk);
+
+	if (cep)
+		siw_dbg_cep(cep, "state: %d\n", cep->state);
+}
+
+static void siw_cm_llp_error_report(struct sock *sk)
+{
+	struct siw_cep	*cep = sk_to_cep(sk);
+
+	if (cep) {
+		siw_dbg_cep(cep, "error %d, socket state: %d, cep state: %d\n",
+			    sk->sk_err, sk->sk_state, cep->state);
+		cep->sk_error = sk->sk_err;
+		cep->sk_error_report(sk);
+	}
+}
+
+static void siw_cm_llp_state_change(struct sock *sk)
+{
+	struct siw_cep	*cep;
+	struct socket	*s;
+	void (*orig_state_change)(struct sock *);
+
+	read_lock(&sk->sk_callback_lock);
+
+	cep = sk_to_cep(sk);
+	if (!cep) {
+		/* endpoint already disassociated */
+		read_unlock(&sk->sk_callback_lock);
+		return;
+	}
+	orig_state_change = cep->sk_state_change;
+
+	s = sk->sk_socket;
+
+	siw_dbg_cep(cep, "state: %d\n", cep->state);
+
+	switch (sk->sk_state) {
+
+	case TCP_ESTABLISHED:
+		/*
+		 * handle accepting socket as special case where only
+		 * new connection is possible
+		 */
+		siw_cm_queue_work(cep, SIW_CM_WORK_ACCEPT);
+
+		break;
+
+	case TCP_CLOSE:
+	case TCP_CLOSE_WAIT:
+
+		if (cep->qp)
+			cep->qp->tx_ctx.tx_suspend = 1;
+		siw_cm_queue_work(cep, SIW_CM_WORK_PEER_CLOSE);
+
+		break;
+
+	default:
+		siw_dbg_cep(cep, "unexpected socket state %d\n", sk->sk_state);
+	}
+	read_unlock(&sk->sk_callback_lock);
+	orig_state_change(sk);
+}
+
+static int kernel_bindconnect(struct socket *s,
+			      struct sockaddr *laddr, int laddrlen,
+			      struct sockaddr *raddr, int raddrlen, int flags)
+{
+	int rv, s_val = 1;
+
+	/*
+	 * Make address available again asap.
+	 */
+	rv = kernel_setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (char *)&s_val,
+				sizeof(s_val));
+	if (rv < 0)
+		return rv;
+
+	rv = s->ops->bind(s, laddr, laddrlen);
+	if (rv < 0)
+		return rv;
+
+	rv = s->ops->connect(s, raddr, raddrlen, flags);
+	if (rv < 0)
+		return rv;
+
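+	/*
+	 * Refresh laddr with the address the socket is actually bound
+	 * to, e.g. a source port chosen by the kernel.
+	 */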
+	return s->ops->getname(s, laddr, &s_val, 0);
+}
+
+int siw_connect(struct iw_cm_id *id, struct iw_cm_conn_param *params)
+{
+	struct siw_device *sdev = siw_dev_base2siw(id->device);
+	struct siw_qp *qp;
+	struct siw_cep *cep = NULL;
+	struct socket *s = NULL;
+	struct sockaddr	*laddr, *raddr;
+	bool p2p_mode = peer_to_peer;
+
+	u16 pd_len = params->private_data_len;
+	int version = mpa_version, rv;
+
+	if (pd_len > MPA_MAX_PRIVDATA)
+		return -EINVAL;
+
+	if (params->ird > sdev->attrs.max_ird ||
+	    params->ord > sdev->attrs.max_ord)
+		return -ENOMEM;
+
+	qp = siw_qp_id2obj(sdev, params->qpn);
+	if (!qp) {
+		WARN(1, "[QP %u] does not exist\n", params->qpn);
+		rv = -EINVAL;
+		goto error;
+	}
+
+	siw_dbg_qp(qp, "id 0x%p, pd_len %d, laddr 0x%x:%d, raddr 0x%x:%d\n",
+		   id, pd_len,
+		   ntohl(to_sockaddr_in(id->local_addr).sin_addr.s_addr),
+		   ntohs(to_sockaddr_in(id->local_addr).sin_port),
+		   ntohl(to_sockaddr_in(id->remote_addr).sin_addr.s_addr),
+		   ntohs(to_sockaddr_in(id->remote_addr).sin_port));
+
+	laddr = (struct sockaddr *)&id->local_addr;
+	raddr = (struct sockaddr *)&id->remote_addr;
+
+	rv = sock_create(AF_INET, SOCK_STREAM, IPPROTO_TCP, &s);
+	if (rv < 0)
+		goto error;
+
+	/*
+	 * NOTE: For simplification, connect() is called in blocking
+	 * mode. Might be reconsidered for async connection setup at
+	 * TCP level.
+	 */
+	rv = kernel_bindconnect(s, laddr, sizeof(*laddr), raddr,
+				sizeof(*raddr), 0);
+	if (rv != 0) {
+		siw_dbg_qp(qp, "id 0x%p: kernel_bindconnect: rv %d\n", id, rv);
+		goto error;
+	}
+	rv = siw_sock_nodelay(s);
+	if (rv != 0) {
+		siw_dbg_qp(qp, "siw_sock_nodelay() error: %d\n", rv);
+		goto error;
+	}
+	cep = siw_cep_alloc(sdev);
+	if (!cep) {
+		rv =  -ENOMEM;
+		goto error;
+	}
+	siw_cep_set_inuse(cep);
+
+	/* Associate QP with CEP */
+	siw_cep_get(cep);
+	qp->cep = cep;
+
+	/* siw_qp_get(qp) already done by QP lookup */
+	cep->qp = qp;
+
+	id->add_ref(id);
+	cep->cm_id = id;
+
+	/*
+	 * Allocate 4 work elements: enough for concurrent handling
+	 * of local + peer close events, MPA header processing and
+	 * the MPA timeout.
+	 */
+	rv = siw_cm_alloc_work(cep, 4);
+	if (rv != 0) {
+		rv = -ENOMEM;
+		goto error;
+	}
+	cep->ird = params->ird;
+	cep->ord = params->ord;
+
+	if (p2p_mode && cep->ord == 0)
+		cep->ord = 1;
+
+	cep->state = SIW_EPSTATE_CONNECTING;
+
+	rv = kernel_peername(s, &cep->llp.raddr);
+	if (rv)
+		goto error;
+
+	rv = kernel_localname(s, &cep->llp.laddr);
+	if (rv)
+		goto error;
+
+	/*
+	 * Associate CEP with socket
+	 */
+	siw_cep_socket_assoc(cep, s);
+
+	cep->state = SIW_EPSTATE_AWAIT_MPAREP;
+
+	/*
+	 * Set MPA Request bits: CRC if required, no MPA Markers,
+	 * MPA Rev. according to module parameter 'mpa_version', Key 'Request'.
+	 */
+	cep->mpa.hdr.params.bits = 0;
+	if (version > MPA_REVISION_2) {
+		pr_warn("Setting MPA version to %u\n", MPA_REVISION_2);
+		version = MPA_REVISION_2;
+		/* Adjust also module parameter */
+		mpa_version = MPA_REVISION_2;
+	}
+	__mpa_rr_set_revision(&cep->mpa.hdr.params.bits, version);
+
+	if (mpa_crc_required)
+		cep->mpa.hdr.params.bits |= MPA_RR_FLAG_CRC;
+
+	/*
+	 * If MPA version == 2:
+	 * o Include ORD and IRD.
+	 * o Indicate peer-to-peer mode, if required by module
+	 *   parameter 'peer_to_peer'.
+	 */
+	if (version == MPA_REVISION_2) {
+		cep->enhanced_rdma_conn_est = true;
+		cep->mpa.hdr.params.bits |= MPA_RR_FLAG_ENHANCED;
+
+		cep->mpa.v2_ctrl.ird = htons(cep->ird);
+		cep->mpa.v2_ctrl.ord = htons(cep->ord);
+
+		if (p2p_mode) {
+			cep->mpa.v2_ctrl.ird |= MPA_V2_PEER_TO_PEER;
+			cep->mpa.v2_ctrl.ord |= rtr_type;
+		}
+		/* Remember own P2P mode requested */
+		cep->mpa.v2_ctrl_req.ird = cep->mpa.v2_ctrl.ird;
+		cep->mpa.v2_ctrl_req.ord = cep->mpa.v2_ctrl.ord;
+	}
+
+	memcpy(cep->mpa.hdr.key, MPA_KEY_REQ, 16);
+
+	rv = siw_send_mpareqrep(cep, params->private_data, pd_len);
+	/*
+	 * Reset private data.
+	 */
+	cep->mpa.hdr.params.pd_len = 0;
+
+	if (rv >= 0) {
+		rv = siw_cm_queue_work(cep, SIW_CM_WORK_MPATIMEOUT);
+		if (!rv) {
+			siw_dbg_cep(cep, "id 0x%p, [QP %d]: exit\n",
+				    id, QP_ID(qp));
+			siw_cep_set_free(cep);
+			return 0;
+		}
+	}
+error:
+	siw_dbg_qp(qp, "failed: %d\n", rv);
+
+	if (cep) {
+		siw_socket_disassoc(s);
+		sock_release(s);
+		cep->llp.sock = NULL;
+
+		cep->qp = NULL;
+
+		cep->cm_id = NULL;
+		id->rem_ref(id);
+		siw_cep_put(cep);
+
+		qp->cep = NULL;
+		siw_cep_put(cep);
+
+		cep->state = SIW_EPSTATE_CLOSED;
+
+		siw_cep_set_free(cep);
+
+		siw_cep_put(cep);
+
+	} else if (s) {
+		sock_release(s);
+	}
+	siw_qp_put(qp);
+
+	return rv;
+}
+
+/*
+ * siw_accept - Let SoftiWARP accept an RDMA connection request
+ *
+ * @id:		New connection management id to be used for accepted
+ *		connection request
+ * @params:	Connection parameters provided by ULP for accepting connection
+ *
+ * Transition QP to RTS state, associate new CM id @id with accepted CEP
+ * and get prepared for TCP input by installing socket callbacks.
+ * Then send MPA Reply and generate the "connection established" event.
+ * Socket callbacks must be installed before sending MPA Reply, because
+ * the latter may cause a first RDMA message to arrive from the RDMA Initiator
+ * side very quickly, at which time the socket callbacks must be ready.
+ */
+int siw_accept(struct iw_cm_id *id, struct iw_cm_conn_param *params)
+{
+	struct siw_device	*sdev = siw_dev_base2siw(id->device);
+	struct siw_cep		*cep = (struct siw_cep *)id->provider_data;
+	struct siw_qp		*qp;
+	struct siw_qp_attrs	qp_attrs;
+	int rv, max_priv_data = MPA_MAX_PRIVDATA;
+	bool wait_for_peer_rts = false;
+
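+	/*
+	 * Release the extra CEP reference taken in siw_proc_mpareq()
+	 * when the connect request was passed up to the IWCM.
+	 */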
+	siw_cep_set_inuse(cep);
+	siw_cep_put(cep);
+
+	/* Free lingering inbound private data */
+	if (cep->mpa.hdr.params.pd_len) {
+		cep->mpa.hdr.params.pd_len = 0;
+		kfree(cep->mpa.pdata);
+		cep->mpa.pdata = NULL;
+	}
+	siw_cancel_mpatimer(cep);
+
+	if (cep->state != SIW_EPSTATE_RECVD_MPAREQ) {
+		siw_dbg_cep(cep, "id 0x%p: out of state\n", id);
+
+		siw_cep_set_free(cep);
+		siw_cep_put(cep);
+
+		return -ECONNRESET;
+	}
+
+	qp = siw_qp_id2obj(sdev, params->qpn);
+	if (!qp) {
+		WARN(1, "[QP %d] does not exist\n", params->qpn);
+		siw_cep_set_free(cep);
+		siw_cep_put(cep);
+
+		return -EINVAL;
+	}
+
+	down_write(&qp->state_lock);
+	if (qp->attrs.state > SIW_QP_STATE_RTR) {
+		rv = -EINVAL;
+		up_write(&qp->state_lock);
+		goto error;
+	}
+
+	siw_dbg_cep(cep, "id 0x%p\n", id);
+
+	if (params->ord > sdev->attrs.max_ord ||
+	    params->ird > sdev->attrs.max_ird) {
+		siw_dbg_cep(cep,
+			"id 0x%p, [QP %d]: ord %d (max %d), ird %d (max %d)\n",
+			id, QP_ID(qp),
+			params->ord, sdev->attrs.max_ord,
+			params->ird, sdev->attrs.max_ird);
+		rv = -EINVAL;
+		up_write(&qp->state_lock);
+		goto error;
+	}
+	if (cep->enhanced_rdma_conn_est)
+		max_priv_data -= sizeof(struct mpa_v2_data);
+
+	if (params->private_data_len > max_priv_data) {
+		siw_dbg_cep(cep,
+			"id 0x%p, [QP %d]: private data length: %d (max %d)\n",
+			id, QP_ID(qp),
+			params->private_data_len, max_priv_data);
+		rv =  -EINVAL;
+		up_write(&qp->state_lock);
+		goto error;
+	}
+
+	if (cep->enhanced_rdma_conn_est) {
+		if (params->ord > cep->ord) {
+			if (relaxed_ird_negotiation)
+				params->ord = cep->ord;
+			else {
+				cep->ird = params->ird;
+				cep->ord = params->ord;
+				rv =  -EINVAL;
+				up_write(&qp->state_lock);
+				goto error;
+			}
+		}
+		if (params->ird < cep->ird) {
+			if (relaxed_ird_negotiation &&
+			    cep->ird <= sdev->attrs.max_ird)
+				params->ird = cep->ird;
+			else {
+				rv = -ENOMEM;
+				up_write(&qp->state_lock);
+				goto error;
+			}
+		}
+		if (cep->mpa.v2_ctrl.ord &
+		    (MPA_V2_RDMA_WRITE_RTR | MPA_V2_RDMA_READ_RTR))
+			wait_for_peer_rts = true;
+		/*
+		 * Signal back negotiated IRD and ORD values
+		 */
+		cep->mpa.v2_ctrl.ord = htons(params->ord & MPA_IRD_ORD_MASK) |
+				(cep->mpa.v2_ctrl.ord & ~MPA_V2_MASK_IRD_ORD);
+		cep->mpa.v2_ctrl.ird = htons(params->ird & MPA_IRD_ORD_MASK) |
+				(cep->mpa.v2_ctrl.ird & ~MPA_V2_MASK_IRD_ORD);
+	}
+	cep->ird = params->ird;
+	cep->ord = params->ord;
+
+	cep->cm_id = id;
+	id->add_ref(id);
+
+	memset(&qp_attrs, 0, sizeof(qp_attrs));
+	qp_attrs.orq_size = cep->ord;
+	qp_attrs.irq_size = cep->ird;
+	qp_attrs.llp_stream_handle = cep->llp.sock;
+	if (cep->mpa.hdr.params.bits & MPA_RR_FLAG_CRC)
+		qp_attrs.flags = SIW_MPA_CRC;
+	qp_attrs.state = SIW_QP_STATE_RTS;
+
+	siw_dbg_cep(cep, "id 0x%p, [QP%d]: moving to rts\n", id, QP_ID(qp));
+
+	/* Associate QP with CEP */
+	siw_cep_get(cep);
+	qp->cep = cep;
+
+	/* siw_qp_get(qp) already done by QP lookup */
+	cep->qp = qp;
+
+	cep->state = SIW_EPSTATE_RDMA_MODE;
+
+	/* Move socket RX/TX under QP control */
+	rv = siw_qp_modify(qp, &qp_attrs, SIW_QP_ATTR_STATE|
+					  SIW_QP_ATTR_LLP_HANDLE|
+					  SIW_QP_ATTR_ORD|
+					  SIW_QP_ATTR_IRD|
+					  SIW_QP_ATTR_MPA);
+	up_write(&qp->state_lock);
+
+	if (rv)
+		goto error;
+
+	siw_dbg_cep(cep, "id 0x%p, [QP %d]: send mpa reply, %d byte pdata\n",
+		    id, QP_ID(qp), params->private_data_len);
+
+	rv = siw_send_mpareqrep(cep, params->private_data,
+				params->private_data_len);
+	if (rv != 0)
+		goto error;
+
+	if (wait_for_peer_rts) {
+		siw_sk_assign_rtr_upcalls(cep);
+	} else {
+		siw_qp_socket_assoc(cep, qp);
+		rv = siw_cm_upcall(cep, IW_CM_EVENT_ESTABLISHED, 0);
+		if (rv)
+			goto error;
+	}
+	siw_cep_set_free(cep);
+
+	return 0;
+error:
+	siw_socket_disassoc(cep->llp.sock);
+	sock_release(cep->llp.sock);
+	cep->llp.sock = NULL;
+
+	cep->state = SIW_EPSTATE_CLOSED;
+
+	if (cep->cm_id) {
+		cep->cm_id->rem_ref(id);
+		cep->cm_id = NULL;
+	}
+	if (qp->cep) {
+		siw_cep_put(cep);
+		qp->cep = NULL;
+	}
+	cep->qp = NULL;
+	siw_qp_put(qp);
+
+	siw_cep_set_free(cep);
+	siw_cep_put(cep);
+
+	return rv;
+}
+
+/*
+ * siw_reject()
+ *
+ * Local connection reject case. Send private data back to peer,
+ * close connection and dereference connection id.
+ */
+int siw_reject(struct iw_cm_id *id, const void *pdata, u8 pd_len)
+{
+	struct siw_cep *cep = (struct siw_cep *)id->provider_data;
+
+	siw_cep_set_inuse(cep);
+	siw_cep_put(cep);
+
+	siw_cancel_mpatimer(cep);
+
+	if (cep->state != SIW_EPSTATE_RECVD_MPAREQ) {
+
+		siw_dbg_cep(cep, "id 0x%p: out of state\n", id);
+
+		siw_cep_set_free(cep);
+		siw_cep_put(cep); /* put last reference */
+
+		return -ECONNRESET;
+	}
+	siw_dbg_cep(cep, "id 0x%p, cep->state %d, pd_len %d\n",
+		    id, cep->state, pd_len);
+
+	if (__mpa_rr_revision(cep->mpa.hdr.params.bits) >= MPA_REVISION_1) {
+		cep->mpa.hdr.params.bits |= MPA_RR_FLAG_REJECT; /* reject */
+		(void)siw_send_mpareqrep(cep, pdata, pd_len);
+	}
+	siw_socket_disassoc(cep->llp.sock);
+	sock_release(cep->llp.sock);
+	cep->llp.sock = NULL;
+
+	cep->state = SIW_EPSTATE_CLOSED;
+
+	siw_cep_set_free(cep);
+	siw_cep_put(cep);
+
+	return 0;
+}
+
+static int siw_listen_address(struct iw_cm_id *id, int backlog,
+			      struct sockaddr *laddr)
+{
+	struct socket *s;
+	struct siw_cep *cep = NULL;
+	struct siw_device *sdev = siw_dev_base2siw(id->device);
+	int rv = 0, s_val;
+
+	rv = sock_create(AF_INET, SOCK_STREAM, IPPROTO_TCP, &s);
+	if (rv < 0)
+		return rv;
+
+	/*
+	 * Probably to be removed later. Allows binding a
+	 * local port while still in TIME_WAIT from the last close.
+	 */
+	s_val = 1;
+	rv = kernel_setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (char *)&s_val,
+			       sizeof(s_val));
+	if (rv) {
+		siw_dbg(sdev, "id 0x%p: setsockopt error: %d\n", id, rv);
+		goto error;
+	}
+	rv = s->ops->bind(s, laddr, sizeof(*laddr));
+	if (rv != 0) {
+		siw_dbg(sdev, "id 0x%p: socket bind error: %d\n", id, rv);
+		goto error;
+	}
+	cep = siw_cep_alloc(sdev);
+	if (!cep) {
+		rv = -ENOMEM;
+		goto error;
+	}
+	siw_cep_socket_assoc(cep, s);
+
+	rv = siw_cm_alloc_work(cep, backlog);
+	if (rv != 0) {
+		siw_dbg(sdev, "id 0x%p: alloc_work error %d, backlog %d\n",
+			id, rv, backlog);
+		goto error;
+	}
+	rv = s->ops->listen(s, backlog);
+	if (rv != 0) {
+		siw_dbg(sdev, "id 0x%p: listen error %d\n", id, rv);
+		goto error;
+	}
+	memcpy(&cep->llp.laddr, &id->local_addr, sizeof(cep->llp.laddr));
+	memcpy(&cep->llp.raddr, &id->remote_addr, sizeof(cep->llp.raddr));
+
+	cep->cm_id = id;
+	id->add_ref(id);
+
+	/*
+	 * In case of a wildcard rdma_listen on a multi-homed device,
+	 * a listener's IWCM id is associated with more than one listening CEP.
+	 *
+	 * We currently use id->provider_data in three different ways:
+	 *
+	 * o For a listener's IWCM id, id->provider_data points to
+	 *   the list_head of the list of listening CEPs.
+	 *   Uses: siw_create_listen(), siw_destroy_listen()
+	 *
+	 * o For a passive-side IWCM id, id->provider_data points to
+	 *   the CEP itself. This is a consequence of
+	 *   - siw_cm_upcall() setting event.provider_data = cep and
+	 *   - the IWCM's cm_conn_req_handler() setting provider_data of the
+	 *     new passive-side IWCM id equal to event.provider_data
+	 *   Uses: siw_accept(), siw_reject()
+	 *
+	 * o For an active-side IWCM id, id->provider_data is not used at all.
+	 *
+	 */
+	if (!id->provider_data) {
+		id->provider_data = kmalloc(sizeof(struct list_head),
+					    GFP_KERNEL);
+		if (!id->provider_data) {
+			rv = -ENOMEM;
+			goto error;
+		}
+		INIT_LIST_HEAD((struct list_head *)id->provider_data);
+	}
+
+	siw_dbg(sdev, "id 0x%p: provider_data 0x%p, cep 0x%p\n",
+		id, id->provider_data, cep);
+
+	list_add_tail(&cep->listenq, (struct list_head *)id->provider_data);
+	cep->state = SIW_EPSTATE_LISTENING;
+
+	return 0;
+
+error:
+	siw_dbg(sdev, "failed: %d\n", rv);
+
+	if (cep) {
+		siw_cep_set_inuse(cep);
+
+		if (cep->cm_id) {
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+		}
+		cep->llp.sock = NULL;
+		siw_socket_disassoc(s);
+		cep->state = SIW_EPSTATE_CLOSED;
+
+		siw_cep_set_free(cep);
+		siw_cep_put(cep);
+	}
+	sock_release(s);
+
+	return rv;
+}
+
+static void siw_drop_listeners(struct iw_cm_id *id)
+{
+	struct list_head	*p, *tmp;
+
+	/*
+	 * In case of a wildcard rdma_listen on a multi-homed device,
+	 * a listener's IWCM id is associated with more than one listening CEP.
+	 */
+	list_for_each_safe(p, tmp, (struct list_head *)id->provider_data) {
+		struct siw_cep *cep = list_entry(p, struct siw_cep, listenq);
+
+		list_del(p);
+
+		siw_dbg_cep(cep, "id 0x%p: drop cep, state %d\n",
+			    id, cep->state);
+
+		siw_cep_set_inuse(cep);
+
+		if (cep->cm_id) {
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+		}
+		if (cep->llp.sock) {
+			siw_socket_disassoc(cep->llp.sock);
+			sock_release(cep->llp.sock);
+			cep->llp.sock = NULL;
+		}
+		cep->state = SIW_EPSTATE_CLOSED;
+		siw_cep_set_free(cep);
+		siw_cep_put(cep);
+	}
+}
+
+/*
+ * siw_create_listen - Create resources for a listener's IWCM ID @id
+ *
+ * Listens on the socket addresses id->local_addr and id->remote_addr.
+ *
+ * If the listener's @id provides a specific local IP address, at most one
+ * listening socket is created and associated with @id.
+ *
+ * If the listener's @id provides the wildcard (zero) local IP address,
+ * a separate listen is performed for each local IP address of the device
+ * by creating a listening socket and binding to that local IP address.
+ *
+ */
+int siw_create_listen(struct iw_cm_id *id, int backlog)
+{
+	struct ib_device *base_dev = id->device;
+	struct siw_device *sdev = siw_dev_base2siw(base_dev);
+	int rv = 0;
+
+	siw_dbg(sdev, "id 0x%p: backlog %d\n", id, backlog);
+
+	if (to_sockaddr_in(id->local_addr).sin_family == AF_INET) {
+		/* IPv4 */
+		struct sockaddr_in	laddr = to_sockaddr_in(id->local_addr);
+		u8			*l_ip, *r_ip;
+		struct in_device	*in_dev;
+
+		l_ip = (u8 *) &to_sockaddr_in(id->local_addr).sin_addr.s_addr;
+		r_ip = (u8 *) &to_sockaddr_in(id->remote_addr).sin_addr.s_addr;
+		siw_dbg(sdev,
+			"id 0x%p: laddr %d.%d.%d.%d:%d, raddr %d.%d.%d.%d:%d\n",
+			id, l_ip[0], l_ip[1], l_ip[2], l_ip[3],
+			ntohs(to_sockaddr_in(id->local_addr).sin_port),
+			r_ip[0], r_ip[1], r_ip[2], r_ip[3],
+			ntohs(to_sockaddr_in(id->remote_addr).sin_port));
+
+		in_dev = in_dev_get(sdev->netdev);
+		if (!in_dev) {
+			siw_dbg(sdev, "id 0x%p: netdev w/o in_device\n", id);
+			return -ENODEV;
+		}
+
+		for_ifa(in_dev) {
+			/*
+			 * Create a listening socket if id->local_addr
+			 * contains the wildcard IP address OR
+			 * the IP address of the interface.
+			 */
+			if (ipv4_is_zeronet(
+			    to_sockaddr_in(id->local_addr).sin_addr.s_addr) ||
+			    to_sockaddr_in(id->local_addr).sin_addr.s_addr ==
+			    ifa->ifa_address) {
+				laddr.sin_addr.s_addr = ifa->ifa_address;
+
+				l_ip = (u8 *) &laddr.sin_addr.s_addr;
+				siw_dbg(sdev,
+					"id 0x%p: bind: %d.%d.%d.%d:%d\n",
+					id, l_ip[0], l_ip[1], l_ip[2],
+					l_ip[3], ntohs(laddr.sin_port));
+
+				rv = siw_listen_address(id, backlog,
+						(struct sockaddr *)&laddr);
+				if (rv)
+					break;
+			}
+		}
+		endfor_ifa(in_dev);
+		in_dev_put(in_dev);
+
+		if (rv && id->provider_data)
+			siw_drop_listeners(id);
+
+	} else {
+		/* IPv6 */
+		rv = -EAFNOSUPPORT;
+		siw_dbg(sdev, "id 0x%p: TODO: IPv6 support\n", id);
+	}
+	if (!rv)
+		siw_dbg(sdev, "id 0x%p: success\n", id);
+
+	return rv;
+}
+
+int siw_destroy_listen(struct iw_cm_id *id)
+{
+	struct siw_device *sdev = siw_dev_base2siw(id->device);
+
+	siw_dbg(sdev, "id 0x%p\n", id);
+
+	if (!id->provider_data) {
+		siw_dbg(sdev, "id 0x%p: no cep(s)\n", id);
+		return 0;
+	}
+	siw_drop_listeners(id);
+	kfree(id->provider_data);
+	id->provider_data = NULL;
+
+	return 0;
+}
+
+int siw_cm_init(void)
+{
+	/*
+	 * Use a single-threaded workqueue to keep CM work strictly ordered
+	 */
+	siw_cm_wq = create_singlethread_workqueue("siw_cm_wq");
+	if (!siw_cm_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void siw_cm_exit(void)
+{
+	if (siw_cm_wq) {
+		flush_workqueue(siw_cm_wq);
+		destroy_workqueue(siw_cm_wq);
+	}
+}
diff --git a/drivers/infiniband/sw/siw/siw_cm.h b/drivers/infiniband/sw/siw/siw_cm.h
new file mode 100644
index 000000000000..36a8d15d793a
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_cm.h
@@ -0,0 +1,156 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Greg Joyce <greg-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ * Copyright (c) 2017, Open Grid Computing, Inc.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _SIW_CM_H
+#define _SIW_CM_H
+
+#include <net/sock.h>
+#include <linux/tcp.h>
+
+#include <rdma/iw_cm.h>
+
+
+enum siw_cep_state {
+	SIW_EPSTATE_IDLE = 1,
+	SIW_EPSTATE_LISTENING,
+	SIW_EPSTATE_CONNECTING,
+	SIW_EPSTATE_AWAIT_MPAREQ,
+	SIW_EPSTATE_RECVD_MPAREQ,
+	SIW_EPSTATE_AWAIT_MPAREP,
+	SIW_EPSTATE_RDMA_MODE,
+	SIW_EPSTATE_CLOSED
+};
+
+struct siw_mpa_info {
+	struct mpa_rr		hdr;	/* peer mpa hdr in host byte order */
+	struct mpa_v2_data	v2_ctrl;
+	struct mpa_v2_data	v2_ctrl_req;
+	char			*pdata;
+	int			bytes_rcvd;
+};
+
+struct siw_llp_info {
+	struct socket		*sock;
+	struct sockaddr_in	laddr;	/* redundant with socket info above */
+	struct sockaddr_in	raddr;	/* ditto, consider removal */
+	struct siw_sk_upcalls	sk_def_upcalls;
+};
+
+struct siw_device;
+
+struct siw_cep {
+	struct iw_cm_id		*cm_id;
+	struct siw_device	*sdev;
+
+	struct list_head	devq;
+	/*
+	 * The provider_data element of a listener IWCM ID
+	 * refers to a list of one or more listener CEPs
+	 */
+	struct list_head	listenq;
+	struct siw_cep		*listen_cep;
+	struct siw_qp		*qp;
+	spinlock_t		lock;
+	wait_queue_head_t	waitq;
+	struct kref		ref;
+	enum siw_cep_state	state;
+	short			in_use;
+	struct siw_cm_work	*mpa_timer;
+	struct list_head	work_freelist;
+	struct siw_llp_info	llp;
+	struct siw_mpa_info	mpa;
+	int			ord;
+	int			ird;
+	bool			enhanced_rdma_conn_est;
+	int			sk_error; /* not (yet) used XXX */
+
+	/* Saved upcalls of socket llp.sock */
+	void	(*sk_state_change)(struct sock *sk);
+	void	(*sk_data_ready)(struct sock *sk);
+	void	(*sk_write_space)(struct sock *sk);
+	void	(*sk_error_report)(struct sock *sk);
+};
+
+/*
+ * The connection initiator waits 10 seconds to receive an MPA
+ * reply after sending out an MPA request. The responder waits
+ * 5 seconds for an MPA request to arrive after a new TCP
+ * connection has been set up.
+ */
+#define MPAREQ_TIMEOUT	(HZ*10)
+#define MPAREP_TIMEOUT	(HZ*5)
+
+enum siw_work_type {
+	SIW_CM_WORK_ACCEPT = 1,
+	SIW_CM_WORK_READ_MPAHDR,
+	SIW_CM_WORK_CLOSE_LLP,		/* close socket */
+	SIW_CM_WORK_PEER_CLOSE,		/* socket indicated peer close */
+	SIW_CM_WORK_MPATIMEOUT
+};
+
+struct siw_cm_work {
+	struct delayed_work	work;
+	struct list_head	list;
+	enum siw_work_type	type;
+	struct siw_cep	*cep;
+};
+
+#define to_sockaddr_in(a) (*(struct sockaddr_in *)(&(a)))
+
+extern int siw_connect(struct iw_cm_id *id, struct iw_cm_conn_param *parm);
+extern int siw_accept(struct iw_cm_id *id, struct iw_cm_conn_param *param);
+extern int siw_reject(struct iw_cm_id *id, const void *data, u8 len);
+extern int siw_create_listen(struct iw_cm_id *id, int backlog);
+extern int siw_destroy_listen(struct iw_cm_id *id);
+
+extern void siw_cep_get(struct siw_cep *cep);
+extern void siw_cep_put(struct siw_cep *cep);
+extern int siw_cm_queue_work(struct siw_cep *cep, enum siw_work_type type);
+
+extern int siw_cm_init(void);
+extern void siw_cm_exit(void);
+
+/*
+ * TCP socket interface
+ */
+#define sk_to_qp(sk)	(((struct siw_cep *)((sk)->sk_user_data))->qp)
+#define sk_to_cep(sk)	((struct siw_cep *)((sk)->sk_user_data))
+
+#endif
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 07/13] SoftiWarp application buffer management
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (5 preceding siblings ...)
  2018-01-14 22:35   ` [PATCH v3 06/13] SoftiWarp connection management Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 08/13] SoftiWarp Queue Pair methods Bernard Metzler
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_mem.c | 243 ++++++++++++++++++++++++++++++++++++
 1 file changed, 243 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_mem.c

diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
new file mode 100644
index 000000000000..314a0d4caa41
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -0,0 +1,243 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Animesh Trivedi <atr-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/version.h>
+#include <linux/scatterlist.h>
+#include <linux/gfp.h>
+#include <rdma/ib_verbs.h>
+#include <linux/dma-mapping.h>
+#include <linux/slab.h>
+#include <linux/pid.h>
+#include <linux/sched/mm.h>
+
+#include "siw.h"
+#include "siw_debug.h"
+
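+/*
+ * Deferred pinned-pages accounting, run from a work item if the
+ * mm semaphore could not be taken during siw_umem_release().
+ */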
+static void siw_umem_update_stats(struct work_struct *work)
+{
+	struct siw_umem *umem = container_of(work, struct siw_umem, work);
+	struct mm_struct *mm_s = umem->mm_s;
+
+	down_write(&mm_s->mmap_sem);
+	mm_s->pinned_vm -= umem->num_pages;
+	up_write(&mm_s->mmap_sem);
+
+	mmput(mm_s);
+
+	kfree(umem->page_chunk);
+	kfree(umem);
+}
+
+static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages)
+{
+	struct page **p = chunk->p;
+
+	while (num_pages--) {
+		put_page(*p);
+		p++;
+	}
+}
+
+void siw_umem_release(struct siw_umem *umem)
+{
+	struct task_struct *task = get_pid_task(umem->pid, PIDTYPE_PID);
+	int i, num_pages = umem->num_pages;
+
+	for (i = 0; num_pages; i++) {
+		int to_free = min_t(int, PAGES_PER_CHUNK, num_pages);
+
+		siw_free_plist(&umem->page_chunk[i], to_free);
+		kfree(umem->page_chunk[i].p);
+		num_pages -= to_free;
+	}
+	put_pid(umem->pid);
+	if (task) {
+		struct mm_struct *mm_s = get_task_mm(task);
+
+		put_task_struct(task);
+		if (mm_s) {
+			if (down_write_trylock(&mm_s->mmap_sem)) {
+				mm_s->pinned_vm -= umem->num_pages;
+				up_write(&mm_s->mmap_sem);
+				mmput(mm_s);
+			} else {
+				/*
+				 * Schedule delayed accounting if
+				 * mm semaphore is not available
+				 */
+				INIT_WORK(&umem->work, siw_umem_update_stats);
+				umem->mm_s = mm_s;
+				schedule_work(&umem->work);
+
+				return;
+			}
+		}
+	}
+	kfree(umem->page_chunk);
+	kfree(umem);
+}
+
+void siw_pbl_free(struct siw_pbl *pbl)
+{
+	kfree(pbl);
+}
+
+/*
+ * Get physical address backed by PBL element. Address is referenced
+ * by linear byte offset into list of variably sized PB elements.
+ * Optionally, provide remaining len within current element, and
+ * current PBL index for later resume at same element.
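+ *
+ * Example: for a PB element covering list offsets [4K, 12K) (size 8K),
+ * an offset of 6K resolves to that element at internal offset 2K,
+ * with a remaining length of 6K reported back.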
+ */
+u64 siw_pbl_get_buffer(struct siw_pbl *pbl, u64 off, int *len, int *idx)
+{
+	int i = idx ? *idx : 0;
+
+	while (i < pbl->num_buf) {
+		struct siw_pble *pble = &pbl->pbe[i];
+
+		if (pble->pbl_off + pble->size > off) {
+			u64 pble_off = off - pble->pbl_off;
+
+			if (len)
+				*len = pble->size - pble_off;
+			if (idx)
+				*idx = i;
+
+			return pble->addr + pble_off;
+		}
+		i++;
+	}
+	if (len)
+		*len = 0;
+	return 0;
+}
+
+struct siw_pbl *siw_pbl_alloc(u32 num_buf)
+{
+	struct siw_pbl *pbl;
+	int buf_size = sizeof(*pbl);
+
+	if (num_buf == 0)
+		return ERR_PTR(-EINVAL);
+
+	buf_size += ((num_buf - 1) * sizeof(struct siw_pble));
+
+	pbl = kzalloc(buf_size, GFP_KERNEL);
+	if (!pbl)
+		return ERR_PTR(-ENOMEM);
+
+	pbl->max_buf = num_buf;
+
+	return pbl;
+}
+
+struct siw_umem *siw_umem_get(u64 start, u64 len)
+{
+	struct siw_umem *umem;
+	u64 first_page_va;
+	unsigned long mlock_limit;
+	int num_pages, num_chunks, i, rv = 0;
+
+	if (!can_do_mlock())
+		return ERR_PTR(-EPERM);
+
+	if (!len)
+		return ERR_PTR(-EINVAL);
+
+	first_page_va = start & PAGE_MASK;
+	num_pages = PAGE_ALIGN(start + len - first_page_va) >> PAGE_SHIFT;
+	num_chunks = (num_pages >> CHUNK_SHIFT) + 1;
+
+	umem = kzalloc(sizeof(*umem), GFP_KERNEL);
+	if (!umem)
+		return ERR_PTR(-ENOMEM);
+
+	umem->pid = get_task_pid(current, PIDTYPE_PID);
+
+	down_write(&current->mm->mmap_sem);
+
+	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	if (num_pages + current->mm->pinned_vm > mlock_limit) {
+		rv = -ENOMEM;
+		goto out;
+	}
+	umem->fp_addr = first_page_va;
+
+	umem->page_chunk = kcalloc(num_chunks, sizeof(struct siw_page_chunk),
+				   GFP_KERNEL);
+	if (!umem->page_chunk) {
+		rv = -ENOMEM;
+		goto out;
+	}
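+	/*
+	 * Pin the buffer in chunks of at most PAGES_PER_CHUNK pages.
+	 * get_user_pages() may return fewer pages than requested, so
+	 * keep calling it until the current chunk is fully populated.
+	 */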
+	for (i = 0; num_pages; i++) {
+		int got, nents = min_t(int, num_pages, PAGES_PER_CHUNK);
+
+		umem->page_chunk[i].p = kcalloc(nents, sizeof(struct page *),
+						GFP_KERNEL);
+		if (!umem->page_chunk[i].p) {
+			rv = -ENOMEM;
+			goto out;
+		}
+		got = 0;
+		while (nents) {
+			struct page **plist = &umem->page_chunk[i].p[got];
+
+			rv = get_user_pages(first_page_va, nents, FOLL_WRITE,
+					    plist, NULL);
+			if (rv < 0)
+				goto out;
+
+			umem->num_pages += rv;
+			current->mm->pinned_vm += rv;
+			first_page_va += rv * PAGE_SIZE;
+			nents -= rv;
+			got += rv;
+		}
+		num_pages -= got;
+	}
+out:
+	up_write(&current->mm->mmap_sem);
+
+	if (rv > 0)
+		return umem;
+
+	siw_umem_release(umem);
+
+	return ERR_PTR(rv);
+}
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 08/13] SoftiWarp Queue Pair methods
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (6 preceding siblings ...)
  2018-01-14 22:35   ` [PATCH v3 07/13] SoftiWarp application buffer management Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
  2018-01-14 22:35   ` [PATCH v3 09/13] SoftiWarp transmit path Bernard Metzler
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_qp.c | 1445 ++++++++++++++++++++++++++++++++++++
 1 file changed, 1445 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_qp.c

diff --git a/drivers/infiniband/sw/siw/siw_qp.c b/drivers/infiniband/sw/siw/siw_qp.c
new file mode 100644
index 000000000000..a80647b55f4b
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_qp.c
@@ -0,0 +1,1445 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Fredy Neeser <nfd-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/net.h>
+#include <linux/file.h>
+#include <linux/scatterlist.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+#include <asm/barrier.h>
+#include <net/sock.h>
+#include <net/tcp_states.h>
+#include <net/tcp.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+
+static char siw_qp_state_to_string[SIW_QP_STATE_COUNT][sizeof "TERMINATE"] = {
+	[SIW_QP_STATE_IDLE]		= "IDLE",
+	[SIW_QP_STATE_RTR]		= "RTR",
+	[SIW_QP_STATE_RTS]		= "RTS",
+	[SIW_QP_STATE_CLOSING]		= "CLOSING",
+	[SIW_QP_STATE_TERMINATE]	= "TERMINATE",
+	[SIW_QP_STATE_ERROR]		= "ERROR"
+};
+
+/*
+ * iWARP (RDMAP, DDP and MPA) parameters as well as Softiwarp settings on a
+ * per-RDMAP message basis. Please keep the order of initializers. The
+ * MPA length of each entry is initialized to the minimum packet size.
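+ *
+ * The pre-computed ddp_rdmap_ctrl word combines the DDP flags (tagged,
+ * last), the DDP and RDMAP version fields and the RDMAP opcode in
+ * network byte order.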
+ */
+struct iwarp_msg_info iwarp_pktinfo[RDMAP_TERMINATE + 1] = { {
+	/* RDMAP_RDMA_WRITE */
+	.hdr_len = sizeof(struct iwarp_rdma_write),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_rdma_write) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_TAGGED | DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_RDMA_WRITE),
+	.proc_data = siw_proc_write
+},
+{	/* RDMAP_RDMA_READ_REQ */
+	.hdr_len = sizeof(struct iwarp_rdma_rreq),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_rdma_rreq) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_RDMA_READ_REQ),
+	.proc_data = siw_proc_rreq
+},
+{	/* RDMAP_RDMA_READ_RESP */
+	.hdr_len = sizeof(struct iwarp_rdma_rresp),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_rdma_rresp) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_TAGGED | DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_RDMA_READ_RESP),
+	.proc_data = siw_proc_rresp
+},
+{	/* RDMAP_SEND */
+	.hdr_len = sizeof(struct iwarp_send),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_send) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_SEND),
+	.proc_data = siw_proc_send
+},
+{	/* RDMAP_SEND_INVAL */
+	.hdr_len = sizeof(struct iwarp_send_inv),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_send_inv) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_SEND_INVAL),
+	.proc_data = siw_proc_send
+},
+{	/* RDMAP_SEND_SE */
+	.hdr_len = sizeof(struct iwarp_send),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_send) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_SEND_SE),
+	.proc_data = siw_proc_send
+},
+{	/* RDMAP_SEND_SE_INVAL */
+	.hdr_len = sizeof(struct iwarp_send_inv),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_send_inv) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_SEND_SE_INVAL),
+	.proc_data = siw_proc_send
+},
+{	/* RDMAP_TERMINATE */
+	.hdr_len = sizeof(struct iwarp_terminate),
+	.ctrl.mpa_len = htons(sizeof(struct iwarp_terminate) - 2),
+	.ctrl.ddp_rdmap_ctrl = DDP_FLAG_LAST
+		| cpu_to_be16(DDP_VERSION << 8)
+		| cpu_to_be16(RDMAP_VERSION << 6)
+		| cpu_to_be16(RDMAP_TERMINATE),
+	.proc_data = siw_proc_terminate
+} };
+
+void siw_qp_llp_data_ready(struct sock *sk)
+{
+	struct siw_qp		*qp;
+
+	read_lock(&sk->sk_callback_lock);
+
+	if (unlikely(!sk->sk_user_data || !sk_to_qp(sk)))
+		goto done;
+
+	qp = sk_to_qp(sk);
+
+	if (likely(!qp->rx_ctx.rx_suspend &&
+		   down_read_trylock(&qp->state_lock))) {
+		read_descriptor_t rd_desc = {.arg.data = qp, .count = 1};
+
+		if (likely(qp->attrs.state == SIW_QP_STATE_RTS))
+			/*
+			 * Implements data receive operation during
+			 * socket callback. TCP gracefully catches
+			 * the case where there is nothing to receive
+			 * (not calling siw_tcp_rx_data() then).
+			 */
+			tcp_read_sock(sk, &rd_desc, siw_tcp_rx_data);
+
+		up_read(&qp->state_lock);
+	} else {
+		siw_dbg_qp(qp, "unable to rx, suspend: %d\n",
+			   qp->rx_ctx.rx_suspend);
+	}
+done:
+	read_unlock(&sk->sk_callback_lock);
+}
+
+void siw_qp_llp_close(struct siw_qp *qp)
+{
+	siw_dbg_qp(qp, "enter llp close, state = %s\n",
+		   siw_qp_state_to_string[qp->attrs.state]);
+
+	down_write(&qp->state_lock);
+
+	qp->rx_ctx.rx_suspend = 1;
+	qp->tx_ctx.tx_suspend = 1;
+	qp->attrs.llp_stream_handle = NULL;
+
+	switch (qp->attrs.state) {
+
+	case SIW_QP_STATE_RTS:
+	case SIW_QP_STATE_RTR:
+	case SIW_QP_STATE_IDLE:
+	case SIW_QP_STATE_TERMINATE:
+
+		qp->attrs.state = SIW_QP_STATE_ERROR;
+
+		break;
+	/*
+	 * SIW_QP_STATE_CLOSING:
+	 *
+	 * This is a forced close. Shall the QP be moved to
+	 * ERROR or IDLE?
+	 */
+	case SIW_QP_STATE_CLOSING:
+		if (tx_wqe(qp)->wr_status == SIW_WR_IDLE)
+			qp->attrs.state = SIW_QP_STATE_ERROR;
+		else
+			qp->attrs.state = SIW_QP_STATE_IDLE;
+
+		break;
+
+	default:
+		siw_dbg_qp(qp, "llp close: no state transition needed: %s\n",
+			   siw_qp_state_to_string[qp->attrs.state]);
+		break;
+	}
+	siw_sq_flush(qp);
+	siw_rq_flush(qp);
+
+	/*
+	 * Dereference closing CEP
+	 */
+	if (qp->cep) {
+		siw_cep_put(qp->cep);
+		qp->cep = NULL;
+	}
+
+	up_write(&qp->state_lock);
+
+	siw_dbg_qp(qp, "llp close exit: state %s\n",
+		   siw_qp_state_to_string[qp->attrs.state]);
+}
+
+/*
+ * socket callback routine informing about newly available send space.
+ * Function schedules SQ work for processing SQ items.
+ */
+void siw_qp_llp_write_space(struct sock *sk)
+{
+	struct siw_cep	*cep = sk_to_cep(sk);
+
+	cep->sk_write_space(sk);
+
+	if (!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
+		(void) siw_sq_start(cep->qp);
+}
+
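+/*
+ * Allocate the inbound (IRQ) and outbound (ORQ) RDMA READ queues
+ * according to the given IRD/ORD values; a minimum of one entry
+ * each is enforced.
+ */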
+static int siw_qp_readq_init(struct siw_qp *qp, int irq_size, int orq_size)
+{
+	if (!irq_size)
+		irq_size = 1;
+	if (!orq_size)
+		orq_size = 1;
+
+	qp->attrs.irq_size = irq_size;
+	qp->attrs.orq_size = orq_size;
+
+	qp->irq = vzalloc(irq_size * sizeof(struct siw_sqe));
+	if (!qp->irq) {
+		siw_dbg_qp(qp, "irq malloc for %d failed\n", irq_size);
+		qp->attrs.irq_size = 0;
+		return -ENOMEM;
+	}
+	qp->orq = vzalloc(orq_size * sizeof(struct siw_sqe));
+	if (!qp->orq) {
+		siw_dbg_qp(qp, "orq malloc for %d failed\n", orq_size);
+		qp->attrs.orq_size = 0;
+		qp->attrs.irq_size = 0;
+		vfree(qp->irq);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static int siw_qp_enable_crc(struct siw_qp *qp)
+{
+	struct siw_iwarp_rx *c_rx = &qp->rx_ctx;
+	struct siw_iwarp_tx *c_tx = &qp->tx_ctx;
+	int rv = 0;
+
+	if (siw_crypto_shash == NULL) {
+		rv = -ENOENT;
+		goto error;
+	}
+	c_tx->mpa_crc_hd = kzalloc(sizeof(struct shash_desc) +
+				   crypto_shash_descsize(siw_crypto_shash),
+				   GFP_KERNEL);
+	c_rx->mpa_crc_hd = kzalloc(sizeof(struct shash_desc) +
+				   crypto_shash_descsize(siw_crypto_shash),
+				   GFP_KERNEL);
+	if (!c_tx->mpa_crc_hd || !c_rx->mpa_crc_hd) {
+		rv = -ENOMEM;
+		goto error;
+	}
+	c_tx->mpa_crc_hd->tfm = siw_crypto_shash;
+	c_rx->mpa_crc_hd->tfm = siw_crypto_shash;
+
+	return 0;
+error:
+	siw_dbg_qp(qp, "failed loading crc32c, error %d\n", rv);
+
+	kfree(c_tx->mpa_crc_hd);
+	kfree(c_rx->mpa_crc_hd);
+
+	c_tx->mpa_crc_hd = c_rx->mpa_crc_hd = NULL;
+
+	return rv;
+}
+
+/*
+ * Send a non signalled READ or WRITE to peer side as negotiated
+ * with MPAv2 P2P setup protocol. The work request is only created
+ * as a current active WR and does not consume Send Queue space.
+ *
+ * Caller must hold QP state lock.
+ */
+int siw_qp_mpa_rts(struct siw_qp *qp, enum mpa_v2_ctrl ctrl)
+{
+	struct siw_wqe	*wqe = tx_wqe(qp);
+	unsigned long flags;
+	int rv = 0;
+
+	spin_lock_irqsave(&qp->sq_lock, flags);
+
+	if (unlikely(wqe->wr_status != SIW_WR_IDLE)) {
+		spin_unlock_irqrestore(&qp->sq_lock, flags);
+		return -EIO;
+	}
+	memset(wqe->mem, 0, sizeof(*wqe->mem) * SIW_MAX_SGE);
+
+	wqe->wr_status = SIW_WR_QUEUED;
+	wqe->sqe.flags = 0;
+	wqe->sqe.num_sge = 1;
+	wqe->sqe.sge[0].length = 0;
+	wqe->sqe.sge[0].laddr = 0;
+	wqe->sqe.sge[0].lkey = 0;
+	/*
+	 * While it must not be checked for inbound zero length
+	 * READ/WRITE, some HW may treat STag 0 specially.
+	 */
+	wqe->sqe.rkey = 1;
+	wqe->sqe.raddr = 0;
+	wqe->processed = 0;
+
+	if (ctrl & MPA_V2_RDMA_WRITE_RTR)
+		wqe->sqe.opcode = SIW_OP_WRITE;
+	else if (ctrl & MPA_V2_RDMA_READ_RTR) {
+		struct siw_sqe	*rreq;
+
+		wqe->sqe.opcode = SIW_OP_READ;
+
+		spin_lock(&qp->orq_lock);
+
+		rreq = orq_get_free(qp);
+		if (rreq) {
+			siw_read_to_orq(rreq, &wqe->sqe);
+			qp->orq_put++;
+		} else
+			rv = -EIO;
+
+		spin_unlock(&qp->orq_lock);
+	} else
+		rv = -EINVAL;
+
+	if (rv)
+		wqe->wr_status = SIW_WR_IDLE;
+
+	spin_unlock_irqrestore(&qp->sq_lock, flags);
+
+	if (!rv)
+		rv = siw_sq_start(qp);
+
+	return rv;
+}
+
+/*
+ * Map memory access error to DDP tagged error
+ */
+enum ddp_ecode siw_tagged_error(enum siw_access_state state)
+{
+	if (state == E_STAG_INVALID)
+		return DDP_ECODE_T_INVALID_STAG;
+	if (state == E_BASE_BOUNDS)
+		return DDP_ECODE_T_BASE_BOUNDS;
+	if (state == E_PD_MISMATCH)
+		return DDP_ECODE_T_STAG_NOT_ASSOC;
+	if (state == E_ACCESS_PERM)
+		/*
+		 * RFC 5041 (DDP) lacks an ecode for insufficient access
+		 * permissions. 'Invalid STag' seems to be the closest
+		 * match, though.
+		 */
+		return DDP_ECODE_T_INVALID_STAG;
+
+	WARN_ON(1);
+
+	return DDP_ECODE_T_INVALID_STAG;
+}
+
+/*
+ * Map memory access error to RDMAP protection error
+ */
+enum rdmap_ecode siw_rdmap_error(enum siw_access_state state)
+{
+	if (state == E_STAG_INVALID)
+		return RDMAP_ECODE_INVALID_STAG;
+	if (state == E_BASE_BOUNDS)
+		return RDMAP_ECODE_BASE_BOUNDS;
+	if (state == E_PD_MISMATCH)
+		return RDMAP_ECODE_STAG_NOT_ASSOC;
+	if (state == E_ACCESS_PERM)
+		return RDMAP_ECODE_ACCESS_RIGHTS;
+
+	return RDMAP_ECODE_UNSPECIFIED;
+}
+
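+/*
+ * Record the cause for a TERMINATE message to be sent. Only the first
+ * error is kept: once term_info.valid is set, later errors detected
+ * before siw_send_terminate() runs do not overwrite it.
+ */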
+void siw_init_terminate(struct siw_qp *qp, enum term_elayer layer,
+			u8 etype, u8 ecode, int in_tx)
+{
+	if (!qp->term_info.valid) {
+		memset(&qp->term_info, 0, sizeof(qp->term_info));
+		qp->term_info.layer = layer;
+		qp->term_info.etype = etype;
+		qp->term_info.ecode = ecode;
+		qp->term_info.in_tx = in_tx;
+		qp->term_info.valid = 1;
+	}
+}
+
+/*
+ * Send a TERMINATE message, as defined in RFCs 5040/5041/5044/6581.
+ * Sending TERMINATE messages is best effort - such a message
+ * can only be sent if the QP is still connected and does
+ * not have another outbound message in progress, i.e. the
+ * TERMINATE message must not interfere with an incomplete current
+ * transmit operation.
+ */
+void siw_send_terminate(struct siw_qp *qp)
+{
+	struct kvec		iov[3];
+	struct msghdr		msg = {.msg_flags = MSG_DONTWAIT|MSG_EOR};
+	struct iwarp_terminate	*term = NULL;
+	union iwarp_hdr		*err_hdr = NULL;
+	struct socket		*s = qp->attrs.llp_stream_handle;
+	struct siw_iwarp_rx	*rx_ctx = &qp->rx_ctx;
+	union iwarp_hdr		*rx_hdr = &rx_ctx->hdr;
+	u32 crc = 0;
+	int num_frags, len_terminate, rv;
+
+	if (!qp->term_info.valid)
+		return;
+
+	qp->term_info.valid = 0;
+
+	if (tx_wqe(qp)->wr_status == SIW_WR_INPROGRESS) {
+		siw_dbg_qp(qp, "cannot send TERMINATE: op %d in progress\n",
+			   tx_type(tx_wqe(qp)));
+		return;
+	}
+	if (!s && qp->cep)
+		/* QP not yet in RTS. Take socket from connection end point */
+		s = qp->cep->llp.sock;
+
+	if (!s) {
+		siw_dbg_qp(qp, "cannot send TERMINATE: not connected\n");
+		return;
+	}
+
+	term = kzalloc(sizeof(*term), GFP_KERNEL);
+	if (!term)
+		return;
+
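+	/*
+	 * TERMINATE is carried on its own untagged queue number with
+	 * MO 0 and a fixed MSN of 1, since at most one TERMINATE message
+	 * is expected per connection.
+	 */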
+	term->ddp_qn = cpu_to_be32(RDMAP_UNTAGGED_QN_TERMINATE);
+	term->ddp_mo = 0;
+	term->ddp_msn = cpu_to_be32(1);
+
+	iov[0].iov_base = term;
+	iov[0].iov_len = sizeof(*term);
+
+	if ((qp->term_info.layer == TERM_ERROR_LAYER_DDP) ||
+	    ((qp->term_info.layer == TERM_ERROR_LAYER_RDMAP) &&
+	     (qp->term_info.etype != RDMAP_ETYPE_CATASTROPHIC))) {
+		err_hdr = kzalloc(sizeof(*err_hdr), GFP_KERNEL);
+		if (!err_hdr) {
+			kfree(term);
+			return;
+		}
+	}
+	memcpy(&term->ctrl, &iwarp_pktinfo[RDMAP_TERMINATE].ctrl,
+	       sizeof(struct iwarp_ctrl));
+
+	__rdmap_term_set_layer(term, qp->term_info.layer);
+	__rdmap_term_set_etype(term, qp->term_info.etype);
+	__rdmap_term_set_ecode(term, qp->term_info.ecode);
+
+	switch (qp->term_info.layer) {
+
+	case TERM_ERROR_LAYER_RDMAP:
+		if (qp->term_info.etype == RDMAP_ETYPE_CATASTROPHIC)
+			/* No additional DDP/RDMAP header to be included */
+			break;
+
+		if (qp->term_info.etype == RDMAP_ETYPE_REMOTE_PROTECTION) {
+			/*
+			 * Complete RDMAP frame will get attached, and
+			 * DDP segment length is valid
+			 */
+			term->flag_m = 1;
+			term->flag_d = 1;
+			term->flag_r = 1;
+
+			if (qp->term_info.in_tx) {
+				struct iwarp_rdma_rreq *rreq;
+				struct siw_wqe *wqe = tx_wqe(qp);
+
+				/* Inbound RREQ error, detected during
+				 * RRESP creation. Take state from
+				 * current TX work queue element to
+				 * reconstruct the peer's RREQ.
+				 */
+				rreq = (struct iwarp_rdma_rreq *)err_hdr;
+
+				memcpy(&rreq->ctrl,
+				       &iwarp_pktinfo[RDMAP_RDMA_READ_REQ].ctrl,
+				       sizeof(struct iwarp_ctrl));
+
+				rreq->rsvd = 0;
+				rreq->ddp_qn =
+					htonl(RDMAP_UNTAGGED_QN_RDMA_READ);
+
+				/* Provide RREQ's MSN as kept aside */
+				rreq->ddp_msn = htonl(wqe->sqe.sge[0].length);
+
+				rreq->ddp_mo = htonl(wqe->processed);
+				rreq->sink_stag = htonl(wqe->sqe.rkey);
+				rreq->sink_to = cpu_to_be64(wqe->sqe.raddr);
+				rreq->read_size = htonl(wqe->sqe.sge[0].length);
+				rreq->source_stag = htonl(wqe->sqe.sge[0].lkey);
+				rreq->source_to =
+					cpu_to_be64(wqe->sqe.sge[0].laddr);
+
+				iov[1].iov_base = rreq;
+				iov[1].iov_len = sizeof(*rreq);
+
+				rx_hdr = (union iwarp_hdr *)rreq;
+			} else {
+				/* Take RDMAP/DDP information from
+				 * current (failed) inbound frame.
+				 */
+				iov[1].iov_base = rx_hdr;
+
+				if (__rdmap_opcode(&rx_hdr->ctrl) ==
+				    RDMAP_RDMA_READ_REQ)
+					iov[1].iov_len =
+						sizeof(struct iwarp_rdma_rreq);
+				else /* SEND type */
+					iov[1].iov_len =
+						sizeof(struct iwarp_send);
+			}
+		} else {
+			/* Do not report DDP hdr information if packet
+			 * layout is unknown
+			 */
+			if ((qp->term_info.ecode == RDMAP_ECODE_VERSION) ||
+			    (qp->term_info.ecode == RDMAP_ECODE_OPCODE))
+				break;
+
+			iov[1].iov_base = rx_hdr;
+
+			/* Only DDP frame will get attached */
+			if (rx_hdr->ctrl.ddp_rdmap_ctrl & DDP_FLAG_TAGGED)
+				iov[1].iov_len =
+					sizeof(struct iwarp_rdma_write);
+			else
+				iov[1].iov_len = sizeof(struct iwarp_send);
+
+			term->flag_m = 1;
+			term->flag_d = 1;
+		}
+		term->ctrl.mpa_len = cpu_to_be16(iov[1].iov_len);
+
+		break;
+
+	case TERM_ERROR_LAYER_DDP:
+		/* Report error encountered while DDP processing.
+		 * This can only happen as a result of inbound
+		 * DDP processing
+		 */
+
+		/* Do not report DDP hdr information if packet
+		 * layout is unknown
+		 */
+		if (((qp->term_info.etype == DDP_ETYPE_TAGGED_BUF) &&
+		     (qp->term_info.ecode == DDP_ECODE_T_VERSION)) ||
+		    ((qp->term_info.etype == DDP_ETYPE_UNTAGGED_BUF) &&
+		     (qp->term_info.ecode == DDP_ECODE_UT_VERSION)))
+			break;
+
+		iov[1].iov_base = rx_hdr;
+
+		if (rx_hdr->ctrl.ddp_rdmap_ctrl & DDP_FLAG_TAGGED)
+			iov[1].iov_len = sizeof(struct iwarp_ctrl_tagged);
+		else
+			iov[1].iov_len = sizeof(struct iwarp_ctrl_untagged);
+
+		term->flag_m = 1;
+		term->flag_d = 1;
+
+		break;
+
+	default:
+		break;
+
+	}
+	if (term->flag_m || term->flag_d || term->flag_r) {
+		iov[2].iov_base = &crc;
+		iov[2].iov_len = sizeof(crc);
+		len_terminate = sizeof(*term) + iov[1].iov_len + MPA_CRC_SIZE;
+		num_frags = 3;
+	} else {
+		iov[1].iov_base = &crc;
+		iov[1].iov_len = sizeof(crc);
+		len_terminate = sizeof(*term) + MPA_CRC_SIZE;
+		num_frags = 2;
+	}
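+	/*
+	 * The 4-byte CRC field is always part of the wire format; it
+	 * stays zero if MPA CRC was not negotiated for this connection.
+	 */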
+
+	/* Adjust DDP Segment Length parameter, if valid */
+	if (term->flag_m) {
+		u32 real_ddp_len = be16_to_cpu(rx_hdr->ctrl.mpa_len);
+		enum rdma_opcode op = __rdmap_opcode(&rx_hdr->ctrl);
+
+		real_ddp_len -= iwarp_pktinfo[op].hdr_len - MPA_HDR_SIZE;
+		rx_hdr->ctrl.mpa_len = cpu_to_be16(real_ddp_len);
+	}
+
+	term->ctrl.mpa_len = cpu_to_be16(len_terminate -
+					 (MPA_HDR_SIZE + MPA_CRC_SIZE));
+	if (qp->tx_ctx.mpa_crc_hd) {
+		crypto_shash_init(rx_ctx->mpa_crc_hd);
+		if (siw_crc_array(rx_ctx->mpa_crc_hd, (u8 *)iov[0].iov_base,
+				  iov[0].iov_len))
+			goto out;
+
+		if (num_frags == 3) {
+			if (siw_crc_array(rx_ctx->mpa_crc_hd,
+					  (u8 *)iov[1].iov_base,
+					  iov[1].iov_len))
+				goto out;
+		}
+		crypto_shash_final(rx_ctx->mpa_crc_hd, (u8 *)&crc);
+	}
+
+	rv = kernel_sendmsg(s, &msg, iov, num_frags, len_terminate);
+	siw_dbg_qp(qp,
+		   "sent TERM: %s, layer %d, type %d, code %d (%d bytes)\n",
+		   rv == len_terminate ? "success" : "failure",
+		   __rdmap_term_layer(term), __rdmap_term_etype(term),
+		   __rdmap_term_ecode(term), rv);
+out:
+	kfree(term);
+	kfree(err_hdr);
+}
+
+/*
+ * handle all attrs other than state
+ */
+static void siw_qp_modify_nonstate(struct siw_qp *qp,
+				   struct siw_qp_attrs *attrs,
+				   enum siw_qp_attr_mask mask)
+{
+	if (mask & SIW_QP_ATTR_ACCESS_FLAGS) {
+		if (attrs->flags & SIW_RDMA_BIND_ENABLED)
+			qp->attrs.flags |= SIW_RDMA_BIND_ENABLED;
+		else
+			qp->attrs.flags &= ~SIW_RDMA_BIND_ENABLED;
+
+		if (attrs->flags & SIW_RDMA_WRITE_ENABLED)
+			qp->attrs.flags |= SIW_RDMA_WRITE_ENABLED;
+		else
+			qp->attrs.flags &= ~SIW_RDMA_WRITE_ENABLED;
+
+		if (attrs->flags & SIW_RDMA_READ_ENABLED)
+			qp->attrs.flags |= SIW_RDMA_READ_ENABLED;
+		else
+			qp->attrs.flags &= ~SIW_RDMA_READ_ENABLED;
+	}
+}
+
+/*
+ * caller holds qp->state_lock
+ */
+int siw_qp_modify(struct siw_qp *qp, struct siw_qp_attrs *attrs,
+		  enum siw_qp_attr_mask mask)
+{
+	int	drop_conn = 0, rv = 0;
+
+	if (!mask)
+		return 0;
+
+	siw_dbg_qp(qp, "state: %s => %s\n",
+		   siw_qp_state_to_string[qp->attrs.state],
+		   siw_qp_state_to_string[attrs->state]);
+
+	if (mask != SIW_QP_ATTR_STATE)
+		siw_qp_modify_nonstate(qp, attrs, mask);
+
+	if (!(mask & SIW_QP_ATTR_STATE))
+		return 0;
+
+	switch (qp->attrs.state) {
+
+	case SIW_QP_STATE_IDLE:
+	case SIW_QP_STATE_RTR:
+
+		switch (attrs->state) {
+
+		case SIW_QP_STATE_RTS:
+
+			if (attrs->flags & SIW_MPA_CRC) {
+				rv = siw_qp_enable_crc(qp);
+				if (rv)
+					break;
+			}
+			if (!(mask & SIW_QP_ATTR_LLP_HANDLE)) {
+				siw_dbg_qp(qp, "no socket\n");
+				rv = -EINVAL;
+				break;
+			}
+			if (!(mask & SIW_QP_ATTR_MPA)) {
+				siw_dbg_qp(qp, "no MPA\n");
+				rv = -EINVAL;
+				break;
+			}
+			siw_dbg_qp(qp, "enter rts, peer 0x%08x, loc 0x%08x\n",
+				   qp->cep->llp.raddr.sin_addr.s_addr,
+				   qp->cep->llp.laddr.sin_addr.s_addr);
+			/*
+			 * Initialize iWARP TX state
+			 */
+			qp->tx_ctx.ddp_msn[RDMAP_UNTAGGED_QN_SEND] = 0;
+			qp->tx_ctx.ddp_msn[RDMAP_UNTAGGED_QN_RDMA_READ] = 0;
+			qp->tx_ctx.ddp_msn[RDMAP_UNTAGGED_QN_TERMINATE] = 0;
+
+			/*
+			 * Initialize iWARP RX state
+			 */
+			qp->rx_ctx.ddp_msn[RDMAP_UNTAGGED_QN_SEND] = 1;
+			qp->rx_ctx.ddp_msn[RDMAP_UNTAGGED_QN_RDMA_READ] = 1;
+			qp->rx_ctx.ddp_msn[RDMAP_UNTAGGED_QN_TERMINATE] = 1;
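+			/*
+			 * Note the asymmetry: TX MSNs start at zero and
+			 * get pre-incremented when a header is built, so
+			 * the first MSN on the wire is 1, which is what
+			 * the RX side is initialized to expect.
+			 */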
+
+			/*
+			 * init IRD free queue, caller has already checked
+			 * limits.
+			 */
+			rv = siw_qp_readq_init(qp, attrs->irq_size,
+					       attrs->orq_size);
+			if (rv)
+				break;
+
+			qp->attrs.llp_stream_handle = attrs->llp_stream_handle;
+			qp->attrs.state = SIW_QP_STATE_RTS;
+
+			break;
+
+		case SIW_QP_STATE_ERROR:
+			siw_rq_flush(qp);
+			qp->attrs.state = SIW_QP_STATE_ERROR;
+			if (qp->cep) {
+				siw_cep_put(qp->cep);
+				qp->cep = NULL;
+			}
+			break;
+
+		case SIW_QP_STATE_RTR:
+			/* ignore */
+			break;
+
+		default:
+			siw_dbg_qp(qp, "state transition undefined: %s => %s\n",
+				   siw_qp_state_to_string[qp->attrs.state],
+				   siw_qp_state_to_string[attrs->state]);
+			break;
+		}
+		break;
+
+	case SIW_QP_STATE_RTS:
+
+		switch (attrs->state) {
+
+		case SIW_QP_STATE_CLOSING:
+			/*
+			 * Verbs: move to IDLE if SQ and ORQ are empty.
+			 * Move to ERROR otherwise. But first of all we must
+			 * close the connection. So we keep CLOSING or ERROR
+			 * as a transient state, schedule connection drop work
+			 * and wait for the socket state change upcall to
+			 * come back closed.
+			 */
+			if (tx_wqe(qp)->wr_status == SIW_WR_IDLE) {
+				qp->attrs.state = SIW_QP_STATE_CLOSING;
+			} else {
+				qp->attrs.state = SIW_QP_STATE_ERROR;
+				siw_sq_flush(qp);
+			}
+			siw_rq_flush(qp);
+
+			drop_conn = 1;
+			break;
+
+		case SIW_QP_STATE_TERMINATE:
+			qp->attrs.state = SIW_QP_STATE_TERMINATE;
+
+			siw_init_terminate(qp, TERM_ERROR_LAYER_RDMAP,
+					   RDMAP_ETYPE_CATASTROPHIC,
+					   RDMAP_ECODE_UNSPECIFIED, 1);
+			drop_conn = 1;
+
+			break;
+
+		case SIW_QP_STATE_ERROR:
+			/*
+			 * This is an emergency close.
+			 *
+			 * Any in-progress transmit operation will get
+			 * cancelled.
+			 * This will likely result in a protocol failure,
+			 * if a TX operation is in transit. The caller
+			 * could unconditionally wait to give the current
+			 * operation a chance to complete.
+			 * In particular, how should a non-empty IRQ be handled?
+			 * The peer was asking for data transfer at a valid
+			 * point in time.
+			 */
+			siw_sq_flush(qp);
+			siw_rq_flush(qp);
+			qp->attrs.state = SIW_QP_STATE_ERROR;
+			drop_conn = 1;
+
+			break;
+
+		default:
+			siw_dbg_qp(qp, "state transition undefined: %s => %s\n",
+				   siw_qp_state_to_string[qp->attrs.state],
+				   siw_qp_state_to_string[attrs->state]);
+			break;
+		}
+		break;
+
+	case SIW_QP_STATE_TERMINATE:
+
+		switch (attrs->state) {
+
+		case SIW_QP_STATE_ERROR:
+			siw_rq_flush(qp);
+			qp->attrs.state = SIW_QP_STATE_ERROR;
+
+			if (tx_wqe(qp)->wr_status != SIW_WR_IDLE)
+				siw_sq_flush(qp);
+
+			break;
+
+		default:
+			siw_dbg_qp(qp, "state transition undefined: %s => %s\n",
+				   siw_qp_state_to_string[qp->attrs.state],
+				   siw_qp_state_to_string[attrs->state]);
+		}
+		break;
+
+	case SIW_QP_STATE_CLOSING:
+
+		switch (attrs->state) {
+
+		case SIW_QP_STATE_IDLE:
+			WARN_ON(tx_wqe(qp)->wr_status != SIW_WR_IDLE);
+			qp->attrs.state = SIW_QP_STATE_IDLE;
+
+			break;
+
+		case SIW_QP_STATE_CLOSING:
+			/*
+			 * The LLP may have already moved the QP to CLOSING
+			 * due to a graceful peer close initiation.
+			 */
+			break;
+
+		case SIW_QP_STATE_ERROR:
+			/*
+			 * QP was moved to CLOSING by LLP event
+			 * not yet seen by user.
+			 */
+			qp->attrs.state = SIW_QP_STATE_ERROR;
+
+			if (tx_wqe(qp)->wr_status != SIW_WR_IDLE)
+				siw_sq_flush(qp);
+
+			siw_rq_flush(qp);
+
+			break;
+
+		default:
+			siw_dbg_qp(qp, "state transition undefined: %s => %s\n",
+				   siw_qp_state_to_string[qp->attrs.state],
+				   siw_qp_state_to_string[attrs->state]);
+
+			return -ECONNABORTED;
+		}
+		break;
+
+	default:
+		siw_dbg_qp(qp, "noop: state %s\n",
+			   siw_qp_state_to_string[qp->attrs.state]);
+		break;
+	}
+	if (drop_conn)
+		siw_qp_cm_drop(qp, 0);
+
+	return rv;
+}
+
+struct ib_qp *siw_get_base_qp(struct ib_device *base_dev, int id)
+{
+	struct siw_qp *qp = siw_qp_id2obj(siw_dev_base2siw(base_dev), id);
+
+	if (qp) {
+		/*
+		 * siw_qp_id2obj() increments object reference count
+		 */
+		siw_qp_put(qp);
+		siw_dbg_qp(qp, "got base QP\n");
+
+		return &qp->base_qp;
+	}
+	return (struct ib_qp *)NULL;
+}
+
+/*
+ * siw_check_mem()
+ *
+ * Check protection domain, STAG state, access permissions and
+ * address range for memory object.
+ *
+ * @pd:		Protection Domain memory should belong to
+ * @mem:	memory to be checked
+ * @addr:	starting addr of mem
+ * @perms:	requested access permissions
+ * @len:	len of memory interval to be checked
+ *
+ */
+int siw_check_mem(struct siw_pd *pd, struct siw_mem *mem, u64 addr,
+		  enum siw_access_flags perms, int len)
+{
+	if (siw_mem2mr(mem)->pd != pd) {
+		siw_dbg(pd->hdr.sdev, "[PD %d]: pd mismatch\n", OBJ_ID(pd));
+		return -E_PD_MISMATCH;
+	}
+	if (!mem->stag_valid) {
+		siw_dbg(pd->hdr.sdev, "[PD %d]: stag 0x%08x invalid\n",
+			OBJ_ID(pd), OBJ_ID(mem));
+		return -E_STAG_INVALID;
+	}
+	/*
+	 * check access permissions
+	 */
+	if ((mem->perms & perms) < perms) {
+		siw_dbg(pd->hdr.sdev, "[PD %d]: permissions 0x%08x < 0x%08x\n",
+			OBJ_ID(pd), mem->perms, perms);
+		return -E_ACCESS_PERM;
+	}
+	/*
+	 * Check address interval: we relax the check to allow memory shrunk
+	 * from the start address _after_ placing or fetching len bytes.
+	 * TODO: this relaxation is probably overdone
+	 */
+	if (addr < mem->va || addr + len > mem->va + mem->len) {
+		siw_dbg(pd->hdr.sdev, "[PD %d]: MEM interval len %d\n",
+			OBJ_ID(pd), len);
+		siw_dbg(pd->hdr.sdev, "[0x%016llx, 0x%016llx) out of bounds\n",
+			(unsigned long long)addr,
+			(unsigned long long)(addr + len));
+		siw_dbg(pd->hdr.sdev, "[0x%016llx, 0x%016llx] LKey=0x%08x\n",
+			(unsigned long long)mem->va,
+			(unsigned long long)(mem->va + mem->len),
+			OBJ_ID(mem));
+
+		return -E_BASE_BOUNDS;
+	}
+	return E_ACCESS_OK;
+}
+
+/*
+ * siw_check_sge()
+ *
+ * Check SGE for access rights in given interval
+ *
+ * @pd:		Protection Domain memory should belong to
+ * @sge:	SGE to be checked
+ * @mem:	array of memory references
+ * @perms:	requested access permissions
+ * @off:	starting offset in SGE
+ * @len:	len of memory interval to be checked
+ *
+ * NOTE: Function references the SGE's memory object (mem->obj)
+ * if not yet done. The new reference is kept if the check went ok and
+ * released if it failed. If mem->obj is already valid, no new
+ * lookup is done and mem is not released if the check fails.
+ */
+int
+siw_check_sge(struct siw_pd *pd, struct siw_sge *sge,
+	      struct siw_mem *mem[], enum siw_access_flags perms,
+	      u32 off, int len)
+{
+	struct siw_device *sdev = pd->hdr.sdev;
+	int new_ref = 0, rv = E_ACCESS_OK;
+
+	if (len + off > sge->length) {
+		rv = -E_BASE_BOUNDS;
+		goto fail;
+	}
+	if (*mem == NULL) {
+		*mem = siw_mem_id2obj(sdev, sge->lkey >> 8);
+		if (*mem == NULL) {
+			rv = -E_STAG_INVALID;
+			goto fail;
+		}
+		new_ref = 1;
+	}
+
+	rv = siw_check_mem(pd, *mem, sge->laddr + off, perms, len);
+	if (rv)
+		goto fail;
+
+	return 0;
+
+fail:
+	if (new_ref) {
+		siw_mem_put(*mem);
+		*mem = NULL;
+	}
+	return rv;
+}
+
+void siw_read_to_orq(struct siw_sqe *rreq, struct siw_sqe *sqe)
+{
+	rreq->id = sqe->id;
+	rreq->opcode = sqe->opcode;
+	rreq->sge[0].laddr = sqe->sge[0].laddr;
+	rreq->sge[0].length = sqe->sge[0].length;
+	rreq->sge[0].lkey = sqe->sge[0].lkey;
+	rreq->sge[1].lkey = sqe->sge[1].lkey;
+	rreq->flags = sqe->flags | SIW_WQE_VALID;
+	rreq->num_sge = 1;
+}
+
+/*
+ * Must be called with SQ locked.
+ * To avoid complete SQ starvation by constant inbound READ requests,
+ * the active IRQ will not be served once qp->irq_burst reaches
+ * SIW_IRQ_MAXBURST_SQ_ACTIVE while the SQ has pending work.
+ */
+int siw_activate_tx(struct siw_qp *qp)
+{
+	struct siw_sqe	*irqe, *sqe;
+	struct siw_wqe	*wqe = tx_wqe(qp);
+	int rv = 1;
+
+	irqe = &qp->irq[qp->irq_get % qp->attrs.irq_size];
+
+	if (irqe->flags & SIW_WQE_VALID) {
+		sqe = sq_get_next(qp);
+		if (sqe && ++qp->irq_burst >= SIW_IRQ_MAXBURST_SQ_ACTIVE) {
+			qp->irq_burst = 0;
+			goto skip_irq;
+		}
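+		/*
+		 * The skipped IRQ element stays valid, so the deferred
+		 * READ response is picked up by a later activation once
+		 * the SQ had its turn.
+		 */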
+		memset(wqe->mem, 0, sizeof(*wqe->mem) * SIW_MAX_SGE);
+		wqe->wr_status = SIW_WR_QUEUED;
+
+		/* start READ RESPONSE */
+		wqe->sqe.opcode = SIW_OP_READ_RESPONSE;
+		wqe->sqe.flags = 0;
+		if (irqe->num_sge) {
+			wqe->sqe.num_sge = 1;
+			wqe->sqe.sge[0].length = irqe->sge[0].length;
+			wqe->sqe.sge[0].laddr = irqe->sge[0].laddr;
+			wqe->sqe.sge[0].lkey = irqe->sge[0].lkey;
+		} else {
+			wqe->sqe.num_sge = 0;
+		}
+
+		/* Retain original RREQ's message sequence number for
+		 * potential error reporting cases.
+		 */
+		wqe->sqe.sge[1].length = irqe->sge[1].length;
+
+		wqe->sqe.rkey = irqe->rkey;
+		wqe->sqe.raddr = irqe->raddr;
+
+		wqe->processed = 0;
+		qp->irq_get++;
+
+		/* mark current IRQ entry free */
+		smp_store_mb(irqe->flags, 0);
+
+		goto out;
+	}
+
+	sqe = sq_get_next(qp);
+	if (sqe) {
+skip_irq:
+		memset(wqe->mem, 0, sizeof(*wqe->mem) * SIW_MAX_SGE);
+		wqe->wr_status = SIW_WR_QUEUED;
+
+		/* First copy SQE to kernel private memory */
+		memcpy(&wqe->sqe, sqe, sizeof(*sqe));
+
+		if (wqe->sqe.opcode >= SIW_NUM_OPCODES) {
+			rv = -EINVAL;
+			goto out;
+		}
+		if (wqe->sqe.flags & SIW_WQE_INLINE) {
+			if (wqe->sqe.opcode != SIW_OP_SEND &&
+			    wqe->sqe.opcode != SIW_OP_WRITE) {
+				rv = -EINVAL;
+				goto out;
+			}
+			if (wqe->sqe.sge[0].length > SIW_MAX_INLINE) {
+				rv = -EINVAL;
+				goto out;
+			}
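+			/*
+			 * Inline data are carried within the SQE itself,
+			 * starting at the second SGE slot; point the
+			 * single SGE at that embedded payload.
+			 */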
+			wqe->sqe.sge[0].laddr = (u64)&wqe->sqe.sge[1];
+			wqe->sqe.sge[0].lkey = 0;
+			wqe->sqe.num_sge = 1;
+		}
+		if (wqe->sqe.flags & SIW_WQE_READ_FENCE) {
+			/* A READ cannot be fenced */
+			if (unlikely(wqe->sqe.opcode == SIW_OP_READ ||
+			    wqe->sqe.opcode == SIW_OP_READ_LOCAL_INV)) {
+				siw_dbg_qp(qp, "cannot fence read\n");
+				rv = -EINVAL;
+				goto out;
+			}
+			spin_lock(&qp->orq_lock);
+
+			if (!siw_orq_empty(qp)) {
+				qp->tx_ctx.orq_fence = 1;
+				rv = 0;
+			}
+			spin_unlock(&qp->orq_lock);
+
+		} else if (wqe->sqe.opcode == SIW_OP_READ ||
+			   wqe->sqe.opcode == SIW_OP_READ_LOCAL_INV) {
+			struct siw_sqe	*rreq;
+
+			wqe->sqe.num_sge = 1;
+
+			spin_lock(&qp->orq_lock);
+
+			rreq = orq_get_free(qp);
+			if (rreq) {
+				/*
+				 * Make an immediate copy in ORQ to be ready
+				 * to process loopback READ reply
+				 */
+				siw_read_to_orq(rreq, &wqe->sqe);
+				qp->orq_put++;
+			} else {
+				qp->tx_ctx.orq_fence = 1;
+				rv = 0;
+			}
+			spin_unlock(&qp->orq_lock);
+		}
+
+		/* Clear SQE, can be re-used by application */
+		smp_store_mb(sqe->flags, 0);
+		qp->sq_get++;
+	} else {
+		rv = 0;
+	}
+out:
+	if (unlikely(rv < 0)) {
+		siw_dbg_qp(qp, "error %d\n", rv);
+		wqe->wr_status = SIW_WR_IDLE;
+	}
+	return rv;
+}
+
+static void siw_cq_notify(struct siw_cq *cq, u32 flags)
+{
+	u32 cq_notify;
+
+	if (unlikely(!cq->base_cq.comp_handler))
+		return;
+
+	cq_notify = READ_ONCE(*cq->notify);
+
+	if ((cq_notify & SIW_NOTIFY_NEXT_COMPLETION) ||
+	    ((cq_notify & SIW_NOTIFY_SOLICITED) &&
+	     (flags & SIW_WQE_SOLICITED))) {
+		/* de-arm CQ */
+		smp_store_mb(*cq->notify, SIW_NOTIFY_NOT);
+		(*cq->base_cq.comp_handler)(&cq->base_cq,
+					    cq->base_cq.cq_context);
+	}
+}
+
+int siw_sqe_complete(struct siw_qp *qp, struct siw_sqe *sqe, u32 bytes,
+		     enum siw_wc_status status)
+{
+	struct siw_cq *cq = qp->scq;
+	int rv = 0;
+
+	if (cq) {
+		u32 sqe_flags = sqe->flags;
+		struct siw_cqe *cqe;
+		u32 idx;
+		unsigned long flags;
+
+		spin_lock_irqsave(&cq->lock, flags);
+
+		idx = cq->cq_put % cq->num_cqe;
+		cqe = &cq->queue[idx];
+
+		if (!cqe->flags) {
+			cqe->id = sqe->id;
+			cqe->opcode = sqe->opcode;
+			cqe->status = status;
+			cqe->imm_data = 0;
+			cqe->bytes = bytes;
+
+			if (cq->kernel_verbs) {
+				siw_qp_get(qp);
+				cqe->qp = qp;
+			} else {
+				cqe->qp_id = QP_ID(qp);
+			}
+			/* mark CQE valid for application */
+			smp_store_mb(cqe->flags, SIW_WQE_VALID);
+			/* recycle SQE */
+			smp_store_mb(sqe->flags, 0);
+
+			cq->cq_put++;
+			spin_unlock_irqrestore(&cq->lock, flags);
+			siw_cq_notify(cq, sqe_flags);
+		} else {
+			spin_unlock_irqrestore(&cq->lock, flags);
+			rv = -ENOMEM;
+			siw_cq_event(cq, IB_EVENT_CQ_ERR);
+		}
+	} else {
+		/* recycle SQE */
+		smp_store_mb(sqe->flags, 0);
+	}
+	return rv;
+}
+
+int siw_rqe_complete(struct siw_qp *qp, struct siw_rqe *rqe, u32 bytes,
+		     enum siw_wc_status status)
+{
+	struct siw_cq *cq = qp->rcq;
+	int rv = 0;
+
+	if (cq) {
+		struct siw_cqe *cqe;
+		u32 idx;
+		unsigned long flags;
+
+		spin_lock_irqsave(&cq->lock, flags);
+
+		idx = cq->cq_put % cq->num_cqe;
+		cqe = &cq->queue[idx];
+
+		if (!cqe->flags) {
+			cqe->id = rqe->id;
+			cqe->opcode = SIW_OP_RECEIVE;
+			cqe->status = status;
+			cqe->imm_data = 0;
+			cqe->bytes = bytes;
+
+			if (cq->kernel_verbs) {
+				siw_qp_get(qp);
+				cqe->qp = qp;
+			} else
+				cqe->qp_id = QP_ID(qp);
+
+			/* mark CQE valid for application */
+			smp_store_mb(cqe->flags, SIW_WQE_VALID);
+			/* recycle RQE */
+			smp_store_mb(rqe->flags, 0);
+
+			cq->cq_put++;
+			spin_unlock_irqrestore(&cq->lock, flags);
+			siw_cq_notify(cq, SIW_WQE_SIGNALLED);
+		} else {
+			spin_unlock_irqrestore(&cq->lock, flags);
+			rv = -ENOMEM;
+			siw_cq_event(cq, IB_EVENT_CQ_ERR);
+		}
+	} else {
+		/* recycle RQE */
+		smp_store_mb(rqe->flags, 0);
+	}
+	return rv;
+}
+
+/*
+ * siw_sq_flush()
+ *
+ * Flush SQ and ORQ entries to CQ.
+ *
+ * TODO: Add termination code for in-progress WQE.
+ * TODO: an in-progress WQE may have been partially
+ *       processed. It should be enforced, that transmission
+ *       of a started DDP segment must be completed if possible
+ *       by any chance.
+ *
+ * Must be called with QP state write lock held.
+ * Therefore, SQ and ORQ lock must not be taken.
+ */
+void siw_sq_flush(struct siw_qp *qp)
+{
+	struct siw_sqe	*sqe;
+	struct siw_wqe	*wqe = tx_wqe(qp);
+	int		async_event = 0;
+
+	siw_dbg_qp(qp, "enter\n");
+
+	/*
+	 * Start with completing any work currently on the ORQ
+	 */
+	for (;;) {
+		if (qp->attrs.orq_size == 0)
+			break;
+
+		sqe = &qp->orq[qp->orq_get % qp->attrs.orq_size];
+		if (!sqe->flags)
+			break;
+
+		if (siw_sqe_complete(qp, sqe, 0,
+				     SIW_WC_WR_FLUSH_ERR) != 0)
+			break;
+
+		qp->orq_get++;
+	}
+	/*
+	 * Flush an in-progress WQE if present
+	 */
+	if (wqe->wr_status != SIW_WR_IDLE) {
+		/*
+		 * TODO: Add iWARP Termination code
+		 */
+		siw_dbg_qp(qp, "flush current sqe, type %d, status %d\n",
+			   tx_type(wqe), wqe->wr_status);
+
+		siw_wqe_put_mem(wqe, tx_type(wqe));
+
+		if (tx_type(wqe) != SIW_OP_READ_RESPONSE &&
+			((tx_type(wqe) != SIW_OP_READ &&
+			  tx_type(wqe) != SIW_OP_READ_LOCAL_INV) ||
+			wqe->wr_status == SIW_WR_QUEUED))
+			/*
+			 * An in-progress Read Request is already in
+			 * the ORQ
+			 */
+			siw_sqe_complete(qp, &wqe->sqe, wqe->bytes,
+					 SIW_WC_WR_FLUSH_ERR);
+
+		wqe->wr_status = SIW_WR_IDLE;
+	}
+	/*
+	 * Flush the Send Queue
+	 */
+	while (qp->attrs.sq_size) {
+		sqe = &qp->sendq[qp->sq_get % qp->attrs.sq_size];
+		if (!sqe->flags)
+			break;
+
+		async_event = 1;
+		if (siw_sqe_complete(qp, sqe, 0, SIW_WC_WR_FLUSH_ERR) != 0)
+			/*
+			 * Shall IB_EVENT_SQ_DRAINED be suppressed if work
+			 * completion fails?
+			 */
+			break;
+
+		sqe->flags = 0;
+		qp->sq_get++;
+	}
+	if (async_event)
+		siw_qp_event(qp, IB_EVENT_SQ_DRAINED);
+}
+
+/*
+ * siw_rq_flush()
+ *
+ * Flush recv queue entries to CQ.
+ *
+ * Must be called with QP state write lock held.
+ * Therefore, RQ lock must not be taken.
+ */
+void siw_rq_flush(struct siw_qp *qp)
+{
+	struct siw_wqe		*wqe = rx_wqe(qp);
+
+	siw_dbg_qp(qp, "enter\n");
+
+	/*
+	 * Flush an in-progress WQE if present
+	 */
+	if (wqe->wr_status != SIW_WR_IDLE) {
+		siw_dbg_qp(qp, "flush current rqe, type %d, status %d\n",
+			   rx_type(wqe), wqe->wr_status);
+
+		siw_wqe_put_mem(wqe, rx_type(wqe));
+
+		if (rx_type(wqe) == SIW_OP_RECEIVE) {
+			siw_rqe_complete(qp, &wqe->rqe, wqe->bytes,
+					 SIW_WC_WR_FLUSH_ERR);
+		} else if (rx_type(wqe) != SIW_OP_READ &&
+			   rx_type(wqe) != SIW_OP_READ_RESPONSE &&
+			   rx_type(wqe) != SIW_OP_WRITE) {
+			siw_sqe_complete(qp, &wqe->sqe, 0, SIW_WC_WR_FLUSH_ERR);
+		}
+		wqe->wr_status = SIW_WR_IDLE;
+	}
+	/*
+	 * Flush the Receive Queue
+	 */
+	while (qp->attrs.rq_size) {
+		struct siw_rqe *rqe =
+			&qp->recvq[qp->rq_get % qp->attrs.rq_size];
+
+		if (!rqe->flags)
+			break;
+
+		if (siw_rqe_complete(qp, rqe, 0, SIW_WC_WR_FLUSH_ERR) != 0)
+			break;
+
+		rqe->flags = 0;
+		qp->rq_get++;
+	}
+}
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 09/13] SoftiWarp transmit path
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (7 preceding siblings ...)
  2018-01-14 22:35   ` [PATCH v3 08/13] SoftiWarp Queue Pair methods Bernard Metzler
@ 2018-01-14 22:35   ` Bernard Metzler
  2018-01-14 22:36   ` [PATCH v3 10/13] SoftiWarp receive path Bernard Metzler
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:35 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 1346 +++++++++++++++++++++++++++++++++
 1 file changed, 1346 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_qp_tx.c

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
new file mode 100644
index 000000000000..129bbb881a64
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -0,0 +1,1346 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/net.h>
+#include <linux/scatterlist.h>
+#include <linux/highmem.h>
+#include <linux/llist.h>
+#include <net/sock.h>
+#include <net/tcp_states.h>
+#include <net/tcp.h>
+#include <linux/cpu.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+
+#include <linux/kthread.h>
+
+static int siw_crc_txhdr(struct siw_iwarp_tx *ctx)
+{
+	crypto_shash_init(ctx->mpa_crc_hd);
+	return siw_crc_array(ctx->mpa_crc_hd, (u8 *)&ctx->pkt,
+			     ctx->ctrl_len);
+}
+
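+/*
+ * MAX_HDR_INLINE is the payload budget for short FPDUs where the data
+ * gets copied right behind the iWARP header: the size difference
+ * between the largest fixed header layout (struct siw_rreq_pkt) and
+ * struct iwarp_send, rounded down to a multiple of 8 bytes.
+ */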
+#define MAX_HDR_INLINE					\
+	(((uint32_t)(sizeof(struct siw_rreq_pkt) -	\
+	sizeof(struct iwarp_send))) & 0xF8)
+
+static struct page *siw_get_pblpage(struct siw_mr *mr, u64 addr, int *idx)
+{
+	struct siw_pbl *pbl = mr->pbl;
+	u64 offset = addr - mr->mem.va;
+	u64 paddr = siw_pbl_get_buffer(pbl, offset, NULL, idx);
+
+	if (paddr)
+		return virt_to_page(paddr);
+
+	return NULL;
+}
+
+/*
+ * Copy short payload at provided destination payload address
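+ *
+ * Returns the number of bytes copied (possibly zero), MAX_HDR_INLINE + 1
+ * if the payload does not qualify for header inlining, or a negative
+ * errno if a referenced user page cannot be resolved.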
+ */
+static int siw_try_1seg(struct siw_iwarp_tx *c_tx, u64 paddr)
+{
+	struct siw_wqe *wqe = &c_tx->wqe_active;
+	struct siw_sge *sge = &wqe->sqe.sge[0];
+	u32 bytes = sge->length;
+
+	if (bytes > MAX_HDR_INLINE || wqe->sqe.num_sge != 1)
+		return MAX_HDR_INLINE + 1;
+
+	if (!bytes)
+		return 0;
+
+	if (tx_flags(wqe) & SIW_WQE_INLINE)
+		memcpy((void *)paddr, &wqe->sqe.sge[1], bytes);
+	else {
+		struct siw_mr *mr = siw_mem2mr(wqe->mem[0]);
+
+		if (!mr->mem_obj) {
+			/* Kernel client using kva */
+			memcpy((void *)paddr, (void *)sge->laddr, bytes);
+		} else if (c_tx->in_syscall) {
+			if (copy_from_user((void *)paddr,
+					   (const void __user *)sge->laddr,
+					   bytes))
+				return -EFAULT;
+		} else {
+			unsigned int off = sge->laddr & ~PAGE_MASK;
+			struct page *p;
+			char *buffer;
+			int pbl_idx = 0;
+
+			if (!mr->mem.is_pbl)
+				p = siw_get_upage(mr->umem, sge->laddr);
+			else
+				p = siw_get_pblpage(mr, sge->laddr, &pbl_idx);
+
+			if (unlikely(!p))
+				return -EFAULT;
+
+			buffer = kmap_atomic(p);
+
+			if (likely(PAGE_SIZE - off >= bytes)) {
+				memcpy((void *)paddr, buffer + off, bytes);
+				kunmap_atomic(buffer);
+			} else {
+				/* First copy what still fits into this
+				 * page, then the remainder from the
+				 * following page.
+				 */
+				unsigned long part = PAGE_SIZE - off;
+
+				memcpy((void *)paddr, buffer + off, part);
+				kunmap_atomic(buffer);
+
+				if (!mr->mem.is_pbl)
+					p = siw_get_upage(mr->umem,
+							  sge->laddr + part);
+				else
+					p = siw_get_pblpage(mr,
+							    sge->laddr + part,
+							    &pbl_idx);
+				if (unlikely(!p))
+					return -EFAULT;
+
+				buffer = kmap_atomic(p);
+				memcpy((void *)(paddr + part),
+					buffer, bytes - part);
+				kunmap_atomic(buffer);
+			}
+		}
+	}
+	return (int)bytes;
+}
+
+#define PKT_FRAGMENTED 1
+#define PKT_COMPLETE 0
+
+/*
+ * siw_qp_prepare_tx()
+ *
+ * Prepare tx state for sending out one fpdu. Builds complete pkt
+ * if no user data or only immediate data are present.
+ *
+ * returns PKT_COMPLETE if complete pkt built, PKT_FRAGMENTED otherwise.
+ */
+static int siw_qp_prepare_tx(struct siw_iwarp_tx *c_tx)
+{
+	struct siw_wqe		*wqe = &c_tx->wqe_active;
+	char			*crc = NULL;
+	int			data = 0;
+
+	switch (tx_type(wqe)) {
+
+	case SIW_OP_READ:
+	case SIW_OP_READ_LOCAL_INV:
+		memcpy(&c_tx->pkt.ctrl,
+		       &iwarp_pktinfo[RDMAP_RDMA_READ_REQ].ctrl,
+		       sizeof(struct iwarp_ctrl));
+
+		c_tx->pkt.rreq.rsvd = 0;
+		c_tx->pkt.rreq.ddp_qn = htonl(RDMAP_UNTAGGED_QN_RDMA_READ);
+		c_tx->pkt.rreq.ddp_msn =
+			htonl(++c_tx->ddp_msn[RDMAP_UNTAGGED_QN_RDMA_READ]);
+		c_tx->pkt.rreq.ddp_mo = 0;
+		c_tx->pkt.rreq.sink_stag = htonl(wqe->sqe.sge[0].lkey);
+		c_tx->pkt.rreq.sink_to =
+			cpu_to_be64(wqe->sqe.sge[0].laddr); /* abs addr! */
+		c_tx->pkt.rreq.source_stag = htonl(wqe->sqe.rkey);
+		c_tx->pkt.rreq.source_to = cpu_to_be64(wqe->sqe.raddr);
+		c_tx->pkt.rreq.read_size = htonl(wqe->sqe.sge[0].length);
+
+		c_tx->ctrl_len = sizeof(struct iwarp_rdma_rreq);
+		crc = (char *)&c_tx->pkt.rreq_pkt.crc;
+		break;
+
+	case SIW_OP_SEND:
+		if (tx_flags(wqe) & SIW_WQE_SOLICITED)
+			memcpy(&c_tx->pkt.ctrl,
+			       &iwarp_pktinfo[RDMAP_SEND_SE].ctrl,
+			       sizeof(struct iwarp_ctrl));
+		else
+			memcpy(&c_tx->pkt.ctrl,
+			       &iwarp_pktinfo[RDMAP_SEND].ctrl,
+			       sizeof(struct iwarp_ctrl));
+
+		c_tx->pkt.send.ddp_qn = RDMAP_UNTAGGED_QN_SEND;
+		c_tx->pkt.send.ddp_msn =
+			htonl(++c_tx->ddp_msn[RDMAP_UNTAGGED_QN_SEND]);
+		c_tx->pkt.send.ddp_mo = 0;
+
+		c_tx->pkt.send_inv.inval_stag = 0;
+
+		c_tx->ctrl_len = sizeof(struct iwarp_send);
+
+		crc = (char *)&c_tx->pkt.send_pkt.crc;
+		data = siw_try_1seg(c_tx, (u64)crc);
+		break;
+
+	case SIW_OP_SEND_REMOTE_INV:
+		if (tx_flags(wqe) & SIW_WQE_SOLICITED)
+			memcpy(&c_tx->pkt.ctrl,
+			       &iwarp_pktinfo[RDMAP_SEND_SE_INVAL].ctrl,
+			       sizeof(struct iwarp_ctrl));
+		else
+			memcpy(&c_tx->pkt.ctrl,
+			       &iwarp_pktinfo[RDMAP_SEND_INVAL].ctrl,
+			       sizeof(struct iwarp_ctrl));
+
+		c_tx->pkt.send.ddp_qn = RDMAP_UNTAGGED_QN_SEND;
+		c_tx->pkt.send.ddp_msn =
+			htonl(++c_tx->ddp_msn[RDMAP_UNTAGGED_QN_SEND]);
+		c_tx->pkt.send.ddp_mo = 0;
+
+		c_tx->pkt.send_inv.inval_stag = cpu_to_be32(wqe->sqe.rkey);
+
+		c_tx->ctrl_len = sizeof(struct iwarp_send_inv);
+
+		crc = (char *)&c_tx->pkt.send_pkt.crc;
+		data = siw_try_1seg(c_tx, (u64)crc);
+		break;
+
+	case SIW_OP_WRITE:
+		memcpy(&c_tx->pkt.ctrl, &iwarp_pktinfo[RDMAP_RDMA_WRITE].ctrl,
+		       sizeof(struct iwarp_ctrl));
+
+		c_tx->pkt.rwrite.sink_stag = htonl(wqe->sqe.rkey);
+		c_tx->pkt.rwrite.sink_to = cpu_to_be64(wqe->sqe.raddr);
+		c_tx->ctrl_len = sizeof(struct iwarp_rdma_write);
+
+		crc = (char *)&c_tx->pkt.write_pkt.crc;
+		data = siw_try_1seg(c_tx, (u64)crc);
+		break;
+
+	case SIW_OP_READ_RESPONSE:
+		memcpy(&c_tx->pkt.ctrl,
+		       &iwarp_pktinfo[RDMAP_RDMA_READ_RESP].ctrl,
+		       sizeof(struct iwarp_ctrl));
+
+		/* NBO */
+		c_tx->pkt.rresp.sink_stag = cpu_to_be32(wqe->sqe.rkey);
+		c_tx->pkt.rresp.sink_to = cpu_to_be64(wqe->sqe.raddr);
+
+		c_tx->ctrl_len = sizeof(struct iwarp_rdma_rresp);
+
+		crc = (char *)&c_tx->pkt.write_pkt.crc;
+		data = siw_try_1seg(c_tx, (u64)crc);
+		break;
+
+	default:
+		siw_dbg_qp(tx_qp(c_tx), "stale wqe type %d\n", tx_type(wqe));
+		return -EOPNOTSUPP;
+	}
+	if (unlikely(data < 0))
+		return data;
+
+	c_tx->ctrl_sent = 0;
+
+	if (data <= MAX_HDR_INLINE) {
+		if (data) {
+			wqe->processed = data;
+
+			c_tx->pkt.ctrl.mpa_len =
+				htons(c_tx->ctrl_len + data - MPA_HDR_SIZE);
+
+			/* Add pad, if needed */
+			data += -(int)data & 0x3;
+			/* advance CRC location after payload */
+			crc += data;
+			c_tx->ctrl_len += data;
+
+			if (!(c_tx->pkt.ctrl.ddp_rdmap_ctrl & DDP_FLAG_TAGGED))
+				c_tx->pkt.c_untagged.ddp_mo = 0;
+			else
+				c_tx->pkt.c_tagged.ddp_to =
+				    cpu_to_be64(wqe->sqe.raddr);
+		}
+
+		*(u32 *)crc = 0;
+		/*
+		 * Do complete CRC if enabled and short packet
+		 */
+		if (c_tx->mpa_crc_hd) {
+			if (siw_crc_txhdr(c_tx) != 0)
+				return -EINVAL;
+			crypto_shash_final(c_tx->mpa_crc_hd, (u8 *)crc);
+		}
+		c_tx->ctrl_len += MPA_CRC_SIZE;
+
+		return PKT_COMPLETE;
+	}
+	c_tx->ctrl_len += MPA_CRC_SIZE;
+	c_tx->sge_idx = 0;
+	c_tx->sge_off = 0;
+	c_tx->pbl_idx = 0;
+
+	/*
+	 * Allow direct sending out of the user buffer if the WR is
+	 * non-signalled, the payload is over the threshold, and CRC is
+	 * not enabled.
+	 * Per RDMA verbs, the application should not change the send buffer
+	 * until the work has completed. In iWARP, work completion is only
+	 * local delivery to TCP. TCP may reuse the buffer for
+	 * retransmission. Changing unsent data also breaks the CRC,
+	 * if applied.
+	 */
+	if (zcopy_tx
+	    && wqe->bytes > SENDPAGE_THRESH
+	    && !(tx_flags(wqe) & SIW_WQE_SIGNALLED)
+	    && tx_type(wqe) != SIW_OP_READ
+	    && tx_type(wqe) != SIW_OP_READ_LOCAL_INV)
+		c_tx->use_sendpage = 1;
+	else
+		c_tx->use_sendpage = 0;
+
+	return PKT_FRAGMENTED;
+}
+
+/*
+ * Send out one complete control type FPDU, or header of FPDU carrying
+ * data. Used for fixed sized packets like Read.Requests or zero length
+ * SENDs, WRITEs, READ.Responses, or header only.
+ */
+static inline int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
+			      int flags)
+{
+	struct msghdr msg = {.msg_flags = flags};
+	struct kvec iov = {
+		.iov_base = (char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent,
+		.iov_len = c_tx->ctrl_len - c_tx->ctrl_sent};
+
+	int rv = kernel_sendmsg(s, &msg, &iov, 1,
+				c_tx->ctrl_len - c_tx->ctrl_sent);
+
+	if (rv >= 0) {
+		c_tx->ctrl_sent += rv;
+
+		if (c_tx->ctrl_sent == c_tx->ctrl_len) {
+			siw_dprint_hdr(&c_tx->pkt.hdr, TX_QPID(c_tx),
+					"HDR/CTRL sent");
+			rv = 0;
+		} else
+			rv = -EAGAIN;
+	}
+	return rv;
+}
+
+/*
+ * 0copy TCP transmit interface: Use do_tcp_sendpages.
+ *
+ * Using sendpage to push page by page appears to be less efficient
+ * than using sendmsg, even if data are copied.
+ *
+ * A general performance limitation might be the extra four-byte
+ * trailer checksum segment to be pushed after user data.
+ */
+static int siw_tcp_sendpages(struct socket *s, struct page **page,
+			     int offset, size_t size)
+{
+	struct sock *sk = s->sk;
+	int i = 0, rv = 0, sent = 0,
+	    flags = MSG_MORE|MSG_DONTWAIT|MSG_SENDPAGE_NOTLAST;
+
+	while (size) {
+		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
+
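+		/*
+		 * MSG_SENDPAGE_NOTLAST signals that more pages of this
+		 * FPDU will follow; drop it for the final chunk so TCP
+		 * may push the segment out.
+		 */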
+		if (size + offset <= PAGE_SIZE)
+			flags = MSG_MORE|MSG_DONTWAIT;
+
+		tcp_rate_check_app_limited(sk);
+try_page_again:
+		lock_sock(sk);
+		rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
+		release_sock(sk);
+
+		if (rv > 0) {
+			size -= rv;
+			sent += rv;
+			if (rv != bytes) {
+				offset += rv;
+				bytes -= rv;
+				goto try_page_again;
+			}
+			offset = 0;
+		} else {
+			if (rv == -EAGAIN || rv == 0)
+				break;
+			return rv;
+		}
+		i++;
+	}
+	return sent;
+}
+
+/*
+ * siw_0copy_tx()
+ *
+ * Pushes list of pages to TCP socket. If pages from multiple
+ * SGE's, all referenced pages of each SGE are pushed in one
+ * shot.
+ */
+static int siw_0copy_tx(struct socket *s, struct page **page,
+			struct siw_sge *sge, unsigned int offset,
+			unsigned int size)
+{
+	int i = 0, sent = 0, rv;
+	int sge_bytes = min(sge->length - offset, size);
+
+	offset  = (sge->laddr + offset) & ~PAGE_MASK;
+
+	while (sent != size) {
+		rv = siw_tcp_sendpages(s, &page[i], offset, sge_bytes);
+		if (rv >= 0) {
+			sent += rv;
+			if (size == sent || sge_bytes > rv)
+				break;
+
+			i += PAGE_ALIGN(sge_bytes + offset) >> PAGE_SHIFT;
+			sge++;
+			sge_bytes = min(sge->length, size - sent);
+			offset = sge->laddr & ~PAGE_MASK;
+		} else {
+			sent = rv;
+			break;
+		}
+	}
+	return sent;
+}
+
+#define MAX_TRAILER (MPA_CRC_SIZE + 4)
+
+static void siw_unmap_pages(struct page **pages, int hdr_len, int num_maps)
+{
+	if (hdr_len) {
+		++pages;
+		--num_maps;
+	}
+	while (num_maps-- > 0) {
+		kunmap(*pages);
+		pages++;
+	}
+}
+
+/*
+ * siw_tx_hdt() tries to push a complete packet to TCP where all
+ * packet fragments are referenced by the elements of one iovec.
+ * For the data portion, each involved page must be referenced by
+ * one extra element. All sge's data can be non-aligned to page
+ * boundaries. Two more elements are referencing iWARP header
+ * and trailer:
+ * MAX_ARRAY = 64KB/PAGE_SIZE + 1 + 2 * (SIW_MAX_SGE - 1) + HDR + TRL
+ */
+#define MAX_ARRAY ((0xffff / PAGE_SIZE) + 1 + (2 * (SIW_MAX_SGE - 1) + 2))
+
+/*
+ * Write out iov referencing hdr, data and trailer of current FPDU.
+ * Update transmit state dependent on write return status
+ */
+static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
+{
+	struct siw_wqe	*wqe = &c_tx->wqe_active;
+	struct siw_sge	*sge = &wqe->sqe.sge[c_tx->sge_idx];
+	struct siw_mem	*mem = wqe->mem[c_tx->sge_idx];
+	struct siw_mr	*mr = NULL;
+
+	struct kvec	iov[MAX_ARRAY];
+	struct page	*page_array[MAX_ARRAY];
+	struct msghdr	msg = {.msg_flags = MSG_DONTWAIT|MSG_EOR};
+
+	int		seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
+	unsigned int	data_len = c_tx->bytes_unsent,
+			hdr_len = 0,
+			trl_len = 0,
+			sge_off = c_tx->sge_off,
+			sge_idx = c_tx->sge_idx,
+			pbl_idx = c_tx->pbl_idx;
+
+	if (c_tx->state == SIW_SEND_HDR) {
+		if (c_tx->use_sendpage) {
+			rv = siw_tx_ctrl(c_tx, s, MSG_DONTWAIT|MSG_MORE);
+			if (rv)
+				goto done;
+
+			c_tx->state = SIW_SEND_DATA;
+		} else {
+			iov[0].iov_base =
+				(char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
+			iov[0].iov_len = hdr_len =
+				c_tx->ctrl_len - c_tx->ctrl_sent;
+			seg = 1;
+			siw_dprint_hdr(&c_tx->pkt.hdr, TX_QPID(c_tx),
+					"HDR to send");
+		}
+	}
+
+	wqe->processed += data_len;
+
+	while (data_len) { /* walk the list of SGE's */
+		unsigned int	sge_len = min(sge->length - sge_off, data_len);
+		unsigned int	fp_off = (sge->laddr + sge_off) & ~PAGE_MASK;
+
+		if (!(tx_flags(wqe) & SIW_WQE_INLINE)) {
+			mr = siw_mem2mr(mem);
+			if (!mr->mem_obj)
+				is_kva = 1;
+		} else
+			is_kva = 1;
+
+		if (is_kva && !c_tx->use_sendpage) {
+			/*
+			 * tx from kernel virtual address: either inline data
+			 * or memory region with assigned kernel buffer
+			 */
+			iov[seg].iov_base = (void *)(sge->laddr + sge_off);
+			iov[seg].iov_len = sge_len;
+
+			if (do_crc)
+				siw_crc_array(c_tx->mpa_crc_hd,
+					      iov[seg].iov_base, sge_len);
+			sge_off += sge_len;
+			data_len -= sge_len;
+			seg++;
+			goto sge_done;
+		}
+
+		while (sge_len) {
+			size_t plen = min((int)PAGE_SIZE - fp_off, sge_len);
+
+			if (!is_kva) {
+				struct page *p;
+
+				if (mr->mem.is_pbl)
+					p = siw_get_pblpage(mr,
+						sge->laddr + sge_off,
+						&pbl_idx);
+				else
+					p = siw_get_upage(mr->umem, sge->laddr
+							  + sge_off);
+				if (unlikely(!p)) {
+					if (hdr_len)
+						seg--;
+					if (!c_tx->use_sendpage && seg) {
+						siw_unmap_pages(page_array,
+								hdr_len, seg);
+					}
+					wqe->processed -= c_tx->bytes_unsent;
+					rv = -EFAULT;
+					goto done_crc;
+				}
+				page_array[seg] = p;
+
+				if (!c_tx->use_sendpage) {
+					iov[seg].iov_base = kmap(p) + fp_off;
+					iov[seg].iov_len = plen;
+				}
+				if (do_crc)
+					siw_crc_page(c_tx->mpa_crc_hd, p,
+						     fp_off, plen);
+			} else {
+				u64 pa = ((sge->laddr + sge_off) & PAGE_MASK);
+
+				page_array[seg] = virt_to_page(pa);
+				if (do_crc)
+					siw_crc_array(c_tx->mpa_crc_hd,
+						(void *)(sge->laddr + sge_off),
+						plen);
+			}
+
+			sge_len -= plen;
+			sge_off += plen;
+			data_len -= plen;
+			fp_off = 0;
+
+			if (++seg > (int)MAX_ARRAY) {
+				siw_dbg_qp(tx_qp(c_tx), "too many fragments\n");
+				if (!is_kva && !c_tx->use_sendpage) {
+					siw_unmap_pages(page_array, hdr_len,
+							seg - 1);
+				}
+				wqe->processed -= c_tx->bytes_unsent;
+				rv = -EMSGSIZE;
+				goto done_crc;
+			}
+		}
+sge_done:
+		/* Update SGE variables at end of SGE */
+		if (sge_off == sge->length &&
+		    (data_len != 0 || wqe->processed < wqe->bytes)) {
+			sge_idx++;
+			sge++;
+			mem++;
+			sge_off = 0;
+		}
+	}
+	/* trailer */
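+	/*
+	 * The trailer buffer provides four pad slots followed by the
+	 * 4-byte CRC; only the last c_tx->pad of the pad bytes go out
+	 * on the wire. When a partially sent trailer is resumed,
+	 * c_tx->ctrl_sent indexes into this combined buffer.
+	 */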
+	if (likely(c_tx->state != SIW_SEND_TRAILER)) {
+		iov[seg].iov_base = &c_tx->trailer.pad[4 - c_tx->pad];
+		iov[seg].iov_len = trl_len = MAX_TRAILER - (4 - c_tx->pad);
+	} else {
+		iov[seg].iov_base = &c_tx->trailer.pad[c_tx->ctrl_sent];
+		iov[seg].iov_len = trl_len = MAX_TRAILER - c_tx->ctrl_sent;
+	}
+
+	if (c_tx->pad) {
+		*(u32 *)c_tx->trailer.pad = 0;
+		if (do_crc)
+			siw_crc_array(c_tx->mpa_crc_hd,
+				      (u8 *)&c_tx->trailer.crc - c_tx->pad,
+				      c_tx->pad);
+	}
+	if (!c_tx->mpa_crc_hd)
+		c_tx->trailer.crc = 0;
+	else if (do_crc)
+		crypto_shash_final(c_tx->mpa_crc_hd,
+				   (u8 *)&c_tx->trailer.crc);
+
+	data_len = c_tx->bytes_unsent;
+
+	if (c_tx->use_sendpage) {
+		rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
+				  c_tx->sge_off, data_len);
+		if (rv == data_len) {
+			rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
+			if (rv > 0)
+				rv += data_len;
+			else
+				rv = data_len;
+		}
+	} else {
+		rv = kernel_sendmsg(s, &msg, iov, seg + 1,
+				    hdr_len + data_len + trl_len);
+		if (!is_kva)
+			siw_unmap_pages(page_array, hdr_len, seg);
+	}
+	if (rv < (int)hdr_len) {
+		/* Not even complete hdr pushed or negative rv */
+		wqe->processed -= data_len;
+		if (rv >= 0) {
+			c_tx->ctrl_sent += rv;
+			rv = -EAGAIN;
+		}
+		goto done_crc;
+	}
+	rv -= hdr_len;
+
+	if (rv >= (int)data_len) {
+		/* all user data pushed to TCP or no data to push */
+		if (data_len > 0 && wqe->processed < wqe->bytes) {
+			/* Save the current state for next tx */
+			c_tx->sge_idx = sge_idx;
+			c_tx->sge_off = sge_off;
+			c_tx->pbl_idx = pbl_idx;
+		}
+		rv -= data_len;
+
+		if (rv == trl_len) /* all pushed */
+			rv = 0;
+		else {
+			c_tx->state = SIW_SEND_TRAILER;
+			c_tx->ctrl_len = MAX_TRAILER;
+			c_tx->ctrl_sent = rv + 4 - c_tx->pad;
+			c_tx->bytes_unsent = 0;
+			rv = -EAGAIN;
+		}
+
+	} else if (data_len > 0) {
+		/* Maybe some user data pushed to TCP */
+		c_tx->state = SIW_SEND_DATA;
+		wqe->processed -= data_len - rv;
+
+		if (rv) {
+			/*
+			 * Some bytes out. Recompute tx state based
+			 * on old state and bytes pushed
+			 */
+			unsigned int sge_unsent;
+
+			c_tx->bytes_unsent -= rv;
+			sge = &wqe->sqe.sge[c_tx->sge_idx];
+			sge_unsent = sge->length - c_tx->sge_off;
+
+			while (sge_unsent <= rv) {
+				rv -= sge_unsent;
+				c_tx->sge_idx++;
+				c_tx->sge_off = 0;
+				sge++;
+				sge_unsent = sge->length;
+			}
+			c_tx->sge_off += rv;
+		}
+		rv = -EAGAIN;
+	}
+done_crc:
+	c_tx->do_crc = 0;
+done:
+	return rv;
+}
+
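+/*
+ * Compute the TCP payload budget used to size the next FPDU from the
+ * current MSS and, if GSO is active, the number of GSO segments
+ * (optionally capped by gso_seg_limit). The result is aligned down
+ * to a multiple of 8 bytes.
+ */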
+static inline void siw_update_tcpseg(struct siw_iwarp_tx *c_tx,
+				     struct socket *s)
+{
+	struct tcp_sock *tp = tcp_sk(s->sk);
+
+	if (tp->gso_segs) {
+		if (gso_seg_limit == 0)
+			c_tx->tcp_seglen = tp->mss_cache * tp->gso_segs;
+		else
+			c_tx->tcp_seglen = tp->mss_cache *
+				min_t(unsigned int, gso_seg_limit,
+				      tp->gso_segs);
+	} else {
+		c_tx->tcp_seglen = tp->mss_cache;
+	}
+	/* Loopback may give odd numbers */
+	c_tx->tcp_seglen &= 0xfffffff8;
+}
+
+/*
+ * siw_unseg_txlen()
+ *
+ * Compute complete tcp payload len if packet would not
+ * get fragmented
+ */
+static inline int siw_unseg_txlen(struct siw_iwarp_tx *c_tx)
+{
+	int pad = c_tx->bytes_unsent ? -c_tx->bytes_unsent & 0x3 : 0;
+
+	return c_tx->bytes_unsent + c_tx->ctrl_len + pad + MPA_CRC_SIZE;
+}
+
+/*
+ * siw_prepare_fpdu()
+ *
+ * Prepares transmit context to send out one FPDU if FPDU will contain
+ * user data and user data are not immediate data.
+ * Computes maximum FPDU length to fill up TCP MSS if possible.
+ *
+ * @qp:		QP from which to transmit
+ * @wqe:	Current WQE causing transmission
+ *
+ * TODO: Take into account real available sendspace on socket
+ *       to avoid header misalignment due to send pausing within
+ *       fpdu transmission
+ */
+static void siw_prepare_fpdu(struct siw_qp *qp, struct siw_wqe *wqe)
+{
+	struct siw_iwarp_tx	*c_tx  = &qp->tx_ctx;
+	int data_len;
+
+	c_tx->ctrl_len = iwarp_pktinfo[__rdmap_opcode(&c_tx->pkt.ctrl)].hdr_len;
+	c_tx->ctrl_sent = 0;
+
+	/*
+	 * Update target buffer offset if any
+	 */
+	if (!(c_tx->pkt.ctrl.ddp_rdmap_ctrl & DDP_FLAG_TAGGED))
+		/* Untagged message */
+		c_tx->pkt.c_untagged.ddp_mo = cpu_to_be32(wqe->processed);
+	else	/* Tagged message */
+		c_tx->pkt.c_tagged.ddp_to =
+		    cpu_to_be64(wqe->sqe.raddr + wqe->processed);
+
+	data_len = wqe->bytes - wqe->processed;
+	if (data_len + c_tx->ctrl_len + MPA_CRC_SIZE > c_tx->tcp_seglen) {
+		/* Trim DDP payload to fit into current TCP segment */
+		data_len = c_tx->tcp_seglen - (c_tx->ctrl_len + MPA_CRC_SIZE);
+		c_tx->pkt.ctrl.ddp_rdmap_ctrl &= ~DDP_FLAG_LAST;
+		c_tx->pad = 0;
+	} else {
+		c_tx->pkt.ctrl.ddp_rdmap_ctrl |= DDP_FLAG_LAST;
+		c_tx->pad = -data_len & 0x3;
+	}
+	c_tx->bytes_unsent = data_len;
+
+	c_tx->pkt.ctrl.mpa_len =
+		htons(c_tx->ctrl_len + data_len - MPA_HDR_SIZE);
+
+	/*
+	 * Init MPA CRC computation
+	 */
+	if (c_tx->mpa_crc_hd) {
+		siw_crc_txhdr(c_tx);
+		c_tx->do_crc = 1;
+	}
+}
+
+/*
+ * siw_check_sgl_tx()
+ *
+ * Check permissions for a list of SGE's (SGL).
+ * A successful check will have all memory referenced
+ * for transmission resolved and assigned to the WQE.
+ *
+ * @pd:		Protection Domain SGL should belong to
+ * @wqe:	WQE to be checked
+ * @perms:	requested access permissions
+ *
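+ * Returns the overall number of bytes referenced by the SGL on success,
+ * or a negative error code if any SGE fails the memory check.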
+ */
+
+static int siw_check_sgl_tx(struct siw_pd *pd, struct siw_wqe *wqe,
+			    enum siw_access_flags perms)
+{
+	struct siw_sge *sge = &wqe->sqe.sge[0];
+	struct siw_mem **mem = &wqe->mem[0];
+	int len = 0, num_sge = wqe->sqe.num_sge;
+
+	if (unlikely(num_sge > SIW_MAX_SGE))
+		return -EINVAL;
+
+	for (; num_sge; num_sge--) {
+		/*
+		 * rdma verbs: do not check stag for a zero length sge
+		 */
+		if (sge->length) {
+			int rv = siw_check_sge(pd, sge, mem, perms, 0,
+					       sge->length);
+			if (unlikely(rv != E_ACCESS_OK))
+				return rv;
+		}
+		len += sge->length;
+		sge++, mem++;
+	}
+	return len;
+}
+
+/*
+ * siw_qp_sq_proc_tx()
+ *
+ * Process one WQE which needs transmission on the wire.
+ */
+static int siw_qp_sq_proc_tx(struct siw_qp *qp, struct siw_wqe *wqe)
+{
+	struct siw_iwarp_tx	*c_tx = &qp->tx_ctx;
+	struct socket		*s = qp->attrs.llp_stream_handle;
+	int			rv = 0,
+				burst_len = qp->tx_ctx.burst;
+	enum rdmap_ecode	ecode = RDMAP_ECODE_CATASTROPHIC_STREAM;
+
+	if (unlikely(wqe->wr_status == SIW_WR_IDLE))
+		return 0;
+
+	if (!burst_len)
+		burst_len = SQ_USER_MAXBURST;
+
+	if (wqe->wr_status == SIW_WR_QUEUED) {
+		if (!(wqe->sqe.flags & SIW_WQE_INLINE)) {
+			if (tx_type(wqe) == SIW_OP_READ_RESPONSE)
+				wqe->sqe.num_sge = 1;
+
+			if (tx_type(wqe) != SIW_OP_READ &&
+			    tx_type(wqe) != SIW_OP_READ_LOCAL_INV) {
+				/*
+				 * Reference memory to be tx'd
+				 */
+				rv = siw_check_sgl_tx(qp->pd, wqe,
+						      SIW_MEM_LREAD);
+				if (rv < 0) {
+					if (tx_type(wqe) ==
+					    SIW_OP_READ_RESPONSE)
+						ecode = siw_rdmap_error(-rv);
+					rv = -EINVAL;
+					goto tx_error;
+				}
+				wqe->bytes = rv;
+			} else {
+				wqe->bytes = 0;
+			}
+		} else {
+			wqe->bytes = wqe->sqe.sge[0].length;
+			if (!qp->kernel_verbs) {
+				if (wqe->bytes > SIW_MAX_INLINE) {
+					rv = -EINVAL;
+					goto tx_error;
+				}
+				wqe->sqe.sge[0].laddr = (u64)&wqe->sqe.sge[1];
+			}
+		}
+		wqe->wr_status = SIW_WR_INPROGRESS;
+		wqe->processed = 0;
+
+		siw_update_tcpseg(c_tx, s);
+
+		rv = siw_qp_prepare_tx(c_tx);
+		if (rv == PKT_FRAGMENTED) {
+			c_tx->state = SIW_SEND_HDR;
+			siw_prepare_fpdu(qp, wqe);
+		} else if (rv == PKT_COMPLETE) {
+			c_tx->state = SIW_SEND_SHORT_FPDU;
+		} else {
+			goto tx_error;
+		}
+	}
+
+next_segment:
+	siw_dbg_qp(qp, "wr type %d, state %d, data %u, sent %u, id %llx\n",
+		   tx_type(wqe), wqe->wr_status, wqe->bytes,
+		   wqe->processed, wqe->sqe.id);
+
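+	/*
+	 * Burst credit exhausted: bail out with -EINPROGRESS, so SQ
+	 * processing gets rescheduled instead of monopolizing the
+	 * current context.
+	 */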
+	if (--burst_len == 0) {
+		rv = -EINPROGRESS;
+		goto tx_done;
+	}
+	if (c_tx->state == SIW_SEND_SHORT_FPDU) {
+		enum siw_opcode tx_type = tx_type(wqe);
+		unsigned int msg_flags;
+
+		if (siw_sq_empty(qp) || siw_lowdelay || burst_len < 2)
+			/*
+			 * End current TCP segment, if SQ runs empty,
+			 * or siw_lowdelay is set, or we bail out
+			 * soon due to no burst credit left.
+			 */
+			msg_flags = MSG_DONTWAIT;
+		else
+			msg_flags = MSG_DONTWAIT|MSG_MORE;
+
+		rv = siw_tx_ctrl(c_tx, s, msg_flags);
+
+		if (!rv && tx_type != SIW_OP_READ &&
+		    tx_type != SIW_OP_READ_LOCAL_INV)
+			wqe->processed = wqe->bytes;
+
+		goto tx_done;
+
+	} else {
+		rv = siw_tx_hdt(c_tx, s);
+	}
+	if (!rv) {
+		/*
+		 * One segment sent. Processing is completed if this was
+		 * the last segment; do the next segment otherwise.
+		 */
+		if (unlikely(c_tx->tx_suspend)) {
+			/*
+			 * Verbs, 6.4.: Try to stop sending after a full
+			 * DDP segment if the connection goes down
+			 * (== peer half-close).
+			 */
+			rv = -ECONNABORTED;
+			goto tx_done;
+		}
+		if (c_tx->pkt.ctrl.ddp_rdmap_ctrl & DDP_FLAG_LAST) {
+			siw_dbg_qp(qp, "WQE completed\n");
+			goto tx_done;
+		}
+		c_tx->state = SIW_SEND_HDR;
+
+		siw_update_tcpseg(c_tx, s);
+
+		siw_prepare_fpdu(qp, wqe);
+		goto next_segment;
+	}
+tx_done:
+	qp->tx_ctx.burst = burst_len;
+	return rv;
+
+tx_error:
+	if (ecode != RDMAP_ECODE_CATASTROPHIC_STREAM)
+		siw_init_terminate(qp, TERM_ERROR_LAYER_RDMAP,
+				   RDMAP_ETYPE_REMOTE_PROTECTION,
+				   ecode, 1);
+	else
+		siw_init_terminate(qp, TERM_ERROR_LAYER_RDMAP,
+				   RDMAP_ETYPE_CATASTROPHIC,
+				   RDMAP_ECODE_UNSPECIFIED, 1);
+	return rv;
+}
+
+static int siw_fastreg_mr(struct siw_pd *pd, struct siw_sqe *sqe)
+{
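+	/*
+	 * As elsewhere in siw, the lower 8 bits of an STag hold the
+	 * consumer key; the remaining bits index the memory object.
+	 */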
+	struct siw_mem *mem = siw_mem_id2obj(pd->hdr.sdev, sqe->rkey >> 8);
+	struct siw_mr *mr;
+	int rv = 0;
+
+	siw_dbg(pd->hdr.sdev, "enter. stag %u\n", sqe->rkey >> 8);
+
+	if (!mem) {
+		pr_warn("siw: fastreg: stag %u unknown\n", sqe->rkey >> 8);
+		return -EINVAL;
+	}
+	mr = siw_mem2mr(mem);
+	if (&mr->base_mr != (void *)sqe->base_mr) {
+		pr_warn("siw: fastreg: stag %u: unexpected mr\n",
+			sqe->rkey >> 8);
+		rv = -EINVAL;
+		goto out;
+	}
+	if (mr->pd != pd) {
+		pr_warn("siw: fastreg: pd mismatch\n");
+		rv = -EINVAL;
+		goto out;
+	}
+	if (mem->stag_valid) {
+		pr_warn("siw: fastreg: stag %u already valid\n",
+			sqe->rkey >> 8);
+		rv = -EINVAL;
+		goto out;
+	}
+	mem->perms = sqe->access;
+	mem->stag_valid = 1;
+
+	siw_dbg(pd->hdr.sdev, "stag %u now valid\n", sqe->rkey >> 8);
+out:
+	siw_mem_put(mem);
+	return rv;
+}
+
+static int siw_qp_sq_proc_local(struct siw_qp *qp, struct siw_wqe *wqe)
+{
+	int rv;
+
+	switch (tx_type(wqe)) {
+
+	case SIW_OP_REG_MR:
+		rv = siw_fastreg_mr(qp->pd, &wqe->sqe);
+		break;
+
+	case SIW_OP_INVAL_STAG:
+		rv = siw_invalidate_stag(qp->pd, wqe->sqe.rkey);
+		break;
+
+	default:
+		rv = -EINVAL;
+	}
+	return rv;
+}
+
+/*
+ * siw_qp_sq_process()
+ *
+ * Core TX path routine for RDMAP/DDP/MPA using a TCP kernel socket.
+ * Sends RDMAP payload for the current SQ WR @wqe of @qp in one or more
+ * MPA FPDUs, each containing a DDP segment.
+ *
+ * SQ processing may occur in user context as a result of posting
+ * new WQE's or from siw_sq_work_handler() context. Processing in
+ * user context is limited to non-kernel verbs users.
+ *
+ * SQ processing may get paused anytime, possibly in the middle of a WR
+ * or FPDU, if insufficient send space is available. SQ processing
+ * gets resumed from siw_sq_work_handler(), if send space becomes
+ * available again.
+ *
+ * Must be called with the QP state read-locked.
+ *
+ * TODO:
+ * To be solved more seriously: an outbound RREQ can be satisfied
+ * by the corresponding RRESP _before_ it gets assigned to the ORQ.
+ * This happens regularly in the RDMA READ via loopback case. Since both
+ * outbound RREQ and inbound RRESP can be handled by the same CPU,
+ * locking the ORQ is deadlock prone and thus not an option.
+ * Tentatively, the RREQ gets assigned to the ORQ _before_ being
+ * sent (and pulled back in case of send failure).
+ */
+int siw_qp_sq_process(struct siw_qp *qp)
+{
+	struct siw_wqe		*wqe = tx_wqe(qp);
+	enum siw_opcode		tx_type;
+	unsigned long		flags;
+	int			rv = 0;
+
+	siw_dbg_qp(qp, "enter for type %d\n", tx_type(wqe));
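+	/*
+	 * Only one context may run SQ processing of a QP at any time;
+	 * wait until a concurrent user has dropped tx_ctx.in_use.
+	 */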
+	wait_event(qp->tx_ctx.waitq, !atomic_read(&qp->tx_ctx.in_use));
+
+	if (atomic_inc_return(&qp->tx_ctx.in_use) > 1) {
+		pr_warn("SIW: [QP %d] already active\n", QP_ID(qp));
+		goto done;
+	}
+next_wqe:
+	/*
+	 * Stop QP processing if SQ state changed
+	 */
+	if (unlikely(qp->tx_ctx.tx_suspend)) {
+		siw_dbg_qp(qp, "tx suspended\n");
+		goto done;
+	}
+	tx_type = tx_type(wqe);
+
+	if (tx_type <= SIW_OP_READ_RESPONSE)
+		rv = siw_qp_sq_proc_tx(qp, wqe);
+	else
+		rv = siw_qp_sq_proc_local(qp, wqe);
+
+	if (!rv) {
+		/*
+		 * WQE processing done
+		 */
+		switch (tx_type) {
+
+		case SIW_OP_SEND:
+		case SIW_OP_SEND_REMOTE_INV:
+		case SIW_OP_WRITE:
+			siw_wqe_put_mem(wqe, tx_type);
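+			/* Fall through */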
+		case SIW_OP_INVAL_STAG:
+		case SIW_OP_REG_MR:
+			if (tx_flags(wqe) & SIW_WQE_SIGNALLED)
+				siw_sqe_complete(qp, &wqe->sqe, wqe->bytes,
+						 SIW_WC_SUCCESS);
+			break;
+
+		case SIW_OP_READ:
+		case SIW_OP_READ_LOCAL_INV:
+			/*
+			 * already enqueued to the ORQ
+			 */
+			break;
+
+		case SIW_OP_READ_RESPONSE:
+			siw_wqe_put_mem(wqe, tx_type);
+			break;
+
+		default:
+			WARN(1, "undefined WQE type %d\n", tx_type);
+			rv = -EINVAL;
+			goto done;
+		}
+
+		spin_lock_irqsave(&qp->sq_lock, flags);
+		wqe->wr_status = SIW_WR_IDLE;
+		rv = siw_activate_tx(qp);
+		spin_unlock_irqrestore(&qp->sq_lock, flags);
+
+		if (unlikely(rv <= 0))
+			goto done;
+
+		goto next_wqe;
+
+	} else if (rv == -EAGAIN) {
+		siw_dbg_qp(qp, "sq paused: hd/tr %d of %d, data %d\n",
+			   qp->tx_ctx.ctrl_sent, qp->tx_ctx.ctrl_len,
+			   qp->tx_ctx.bytes_unsent);
+		rv = 0;
+		goto done;
+	} else if (rv == -EINPROGRESS) {
+		rv = siw_sq_start(qp);
+		goto done;
+	} else {
+		/*
+		 * WQE processing failed.
+		 * Verbs 8.3.2:
+		 * o It turns any WQE into a signalled WQE.
+		 * o Local catastrophic error must be surfaced
+		 * o QP must be moved into Terminate state: done by code
+		 *   doing socket state change processing
+		 *
+		 * o TODO: Termination message must be sent.
+		 * o TODO: Implement more precise work completion errors,
+		 *         see enum ib_wc_status in ib_verbs.h
+		 */
+		siw_dbg_qp(qp, "wqe type %d processing failed: %d\n",
+			   tx_type(wqe), rv);
+
+		spin_lock_irqsave(&qp->sq_lock, flags);
+		/*
+		 * RREQ may have already been completed by inbound RRESP!
+		 */
+		if (tx_type == SIW_OP_READ ||
+		    tx_type == SIW_OP_READ_LOCAL_INV) {
+			/* Cleanup pending entry in ORQ */
+			qp->orq_put--;
+			qp->orq[qp->orq_put % qp->attrs.orq_size].flags = 0;
+		}
+		spin_unlock_irqrestore(&qp->sq_lock, flags);
+		/*
+		 * immediately suspends further TX processing
+		 */
+		if (!qp->tx_ctx.tx_suspend)
+			siw_qp_cm_drop(qp, 0);
+
+		switch (tx_type) {
+
+		case SIW_OP_SEND:
+		case SIW_OP_SEND_REMOTE_INV:
+		case SIW_OP_SEND_WITH_IMM:
+		case SIW_OP_WRITE:
+		case SIW_OP_READ:
+		case SIW_OP_READ_LOCAL_INV:
+			siw_wqe_put_mem(wqe, tx_type);
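+			/* Fall through */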
+		case SIW_OP_INVAL_STAG:
+		case SIW_OP_REG_MR:
+			siw_sqe_complete(qp, &wqe->sqe, wqe->bytes,
+					 SIW_WC_LOC_QP_OP_ERR);
+
+			siw_qp_event(qp, IB_EVENT_QP_FATAL);
+
+			break;
+
+		case SIW_OP_READ_RESPONSE:
+			siw_dbg_qp(qp, "proc. read.response failed: %d\n", rv);
+
+			siw_qp_event(qp, IB_EVENT_QP_REQ_ERR);
+
+			siw_wqe_put_mem(wqe, SIW_OP_READ_RESPONSE);
+
+			break;
+
+		default:
+			WARN(1, "undefined WQE type %d\n", tx_type);
+			rv = -EINVAL;
+		}
+		wqe->wr_status = SIW_WR_IDLE;
+	}
+done:
+	atomic_dec(&qp->tx_ctx.in_use);
+	wake_up(&qp->tx_ctx.waitq);
+
+	return rv;
+}
+
+static void siw_sq_resume(struct siw_qp *qp)
+{
+
+	if (down_read_trylock(&qp->state_lock)) {
+		if (likely(qp->attrs.state == SIW_QP_STATE_RTS &&
+			!qp->tx_ctx.tx_suspend)) {
+
+			int rv = siw_qp_sq_process(qp);
+
+			up_read(&qp->state_lock);
+
+			if (unlikely(rv < 0)) {
+				siw_dbg_qp(qp, "SQ task failed: err %d\n", rv);
+
+				if (!qp->tx_ctx.tx_suspend)
+					siw_qp_cm_drop(qp, 0);
+			}
+		} else {
+			up_read(&qp->state_lock);
+		}
+	} else {
+		siw_dbg_qp(qp, "Resume SQ while QP locked\n");
+	}
+	siw_qp_put(qp);
+}
+
+struct tx_task_t {
+	struct llist_head active;
+	wait_queue_head_t waiting;
+};
+
+static DEFINE_PER_CPU(struct tx_task_t, tx_task_g);
+
+void siw_stop_tx_thread(int nr_cpu)
+{
+	kthread_stop(siw_tx_thread[nr_cpu]);
+	wake_up(&per_cpu(tx_task_g, nr_cpu).waiting);
+}
+
+int siw_run_sq(void *data)
+{
+	const int nr_cpu = (unsigned int)(long)data;
+	struct llist_node *active;
+	struct siw_qp *qp;
+	struct tx_task_t *tx_task = &per_cpu(tx_task_g, nr_cpu);
+
+	init_llist_head(&tx_task->active);
+	init_waitqueue_head(&tx_task->waiting);
+
+	pr_info("Started siw TX thread on CPU %u\n", nr_cpu);
+
+	while (1) {
+		struct llist_node *fifo_list = NULL;
+
+		wait_event_interruptible(tx_task->waiting,
+					 !llist_empty(&tx_task->active) ||
+					 kthread_should_stop());
+
+		if (kthread_should_stop())
+			break;
+
+		active = llist_del_all(&tx_task->active);
+		/*
+		 * llist_del_all returns a list with newest entry first.
+		 * Re-order list for fairness among QP's.
+		 */
+		while (active) {
+			struct llist_node *tmp = active;
+
+			active = llist_next(active);
+			tmp->next = fifo_list;
+			fifo_list = tmp;
+		}
+		while (fifo_list) {
+			qp = container_of(fifo_list, struct siw_qp, tx_list);
+			fifo_list = llist_next(fifo_list);
+			qp->tx_list.next = NULL;
+
+			siw_sq_resume(qp);
+		}
+	}
+	active = llist_del_all(&tx_task->active);
+	if (active != NULL) {
+		llist_for_each_entry(qp, active, tx_list) {
+			qp->tx_list.next = NULL;
+			siw_sq_resume(qp);
+		}
+	}
+	pr_info("Stopped siw TX thread on CPU %u\n", nr_cpu);
+
+	return 0;
+}
+
+int siw_sq_start(struct siw_qp *qp)
+{
+	if (tx_wqe(qp)->wr_status == SIW_WR_IDLE)
+		return 0;
+
+	siw_dbg_qp(qp, "enter\n");
+
+	if (unlikely(!cpu_online(qp->tx_cpu))) {
+		siw_put_tx_cpu(qp->tx_cpu);
+		qp->tx_cpu = siw_get_tx_cpu(qp->hdr.sdev);
+		if (qp->tx_cpu < 0) {
+			pr_warn("siw: no tx cpu available\n");
+
+			return -EIO;
+		}
+	}
+	siw_qp_get(qp);
+
+	llist_add(&qp->tx_list, &per_cpu(tx_task_g, qp->tx_cpu).active);
+
+	wake_up(&per_cpu(tx_task_g, qp->tx_cpu).waiting);
+
+	return 0;
+}
-- 
2.13.6


* [PATCH v3 10/13] SoftiWarp receive path
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (8 preceding siblings ...)
  2018-01-14 22:35   ` [PATCH v3 09/13] SoftiWarp transmit path Bernard Metzler
@ 2018-01-14 22:36   ` Bernard Metzler
  2018-01-14 22:36   ` [PATCH v3 11/13] SoftiWarp Completion Queue methods Bernard Metzler
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_qp_rx.c | 1531 +++++++++++++++++++++++++++++++++
 1 file changed, 1531 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_qp_rx.c

diff --git a/drivers/infiniband/sw/siw/siw_qp_rx.c b/drivers/infiniband/sw/siw/siw_qp_rx.c
new file mode 100644
index 000000000000..9af148a23624
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_qp_rx.c
@@ -0,0 +1,1531 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Fredy Neeser <nfd-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/net.h>
+#include <linux/scatterlist.h>
+#include <linux/highmem.h>
+#include <net/sock.h>
+#include <net/tcp_states.h>
+#include <net/tcp.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+
+/*
+ * ----------------------------
+ * DDP reassembly for Softiwarp
+ * ----------------------------
+ * For the ordering of transmitted DDP segments, the relevant iWARP ordering
+ * rules are as follows:
+ *
+ * - RDMAP (RFC 5040): Section 7.5, Rule 17:
+ *   "RDMA Read Response Message processing at the Remote Peer (reading
+ *    the specified Tagged Buffer) MUST be started only after the RDMA
+ *    Read Request Message has been Delivered by the DDP layer (thus,
+ *    all previous RDMA Messages have been properly submitted for
+ *    ordered Placement)."
+ *
+ * - DDP (RFC 5041): Section 5.3:
+ *   "At the Data Source, DDP:
+ *    o MUST transmit DDP Messages in the order they were submitted to
+ *      the DDP layer,
+ *    o SHOULD transmit DDP Segments within a DDP Message in increasing
+ *      MO order for Untagged DDP Messages, and in increasing TO order
+ *      for Tagged DDP Messages."
+ *
+ * Combining these rules implies that, although RDMAP does not provide
+ * ordering between operations that are generated from the two ends of an
+ * RDMAP stream, DDP *must not* transmit an RDMA Read Response Message before
+ * it has finished transmitting SQ operations that were already submitted
+ * to the DDP layer. It follows that an iWARP transmitter must fully
+ * serialize RDMAP messages belonging to the same QP.
+ *
+ * Given that a TCP socket receives DDP segments in peer transmit order,
+ * we obtain the following ordering of received DDP segments:
+ *
+ * (i)  the received DDP segments of RDMAP messages for the same QP
+ *      cannot be interleaved
+ * (ii) the received DDP segments of a single RDMAP message *should*
+ *      arrive in order.
+ *
+ * The Softiwarp transmitter obeys rule #2 in DDP Section 5.3.
+ * With this property, the "should" becomes a "must" in (ii) above,
+ * which simplifies DDP reassembly considerably.
+ * The Softiwarp receiver currently relies on this property
+ * and reports an error if DDP segments of the same RDMAP message
+ * do not arrive in sequence.
+ */
+
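+/*
+ * Illustrative helper (assumed name, not referenced by the code below):
+ * relying on the in-order arrival of DDP segments per RDMAP message, the
+ * expected MO of an untagged DDP segment is simply the number of payload
+ * bytes already placed for the current message. siw_send_check_ntoh()
+ * applies the same comparison against wqe->processed.
+ */
+static inline bool siw_ddp_mo_expected(u32 rcvd_mo, u32 bytes_placed)
+{
+	return rcvd_mo == bytes_placed;
+}
+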
+static inline int siw_crc_rxhdr(struct siw_iwarp_rx *ctx)
+{
+	crypto_shash_init(ctx->mpa_crc_hd);
+
+	return siw_crc_array(ctx->mpa_crc_hd, (u8 *)&ctx->hdr,
+			     ctx->fpdu_part_rcvd);
+}
+
+/*
+ * siw_rx_umem()
+ *
+ * Receive data of @len into target referenced by @umem at @dest_addr.
+ * This function does not check if the umem covers the memory range
+ * requested by @dest_addr and @len.
+ *
+ * @rctx:	Receive Context
+ * @umem:	siw representation of target memory
+ * @dest_addr:	user virtual address where data should be placed
+ * @len:	number of bytes to place into target memory
+ */
+static int siw_rx_umem(struct siw_iwarp_rx *rctx, struct siw_umem *umem,
+		       u64 dest_addr, int len)
+{
+	void	*dest;
+	int	pg_off = dest_addr & ~PAGE_MASK,
+		copied = 0,
+		bytes,
+		rv;
+
+	while (len) {
+		struct page *p = siw_get_upage(umem, dest_addr);
+
+		if (unlikely(!p)) {
+			pr_warn("siw: %s: [QP %d]: bogus addr: %p, %p\n",
+				__func__, RX_QPID(rctx),
+				(void *)dest_addr, (void *)umem->fp_addr);
+			/* siw internal error */
+			rctx->skb_copied += copied;
+			rctx->skb_new -= copied;
+			copied = -EFAULT;
+
+			goto out;
+		}
+
+		bytes  = min(len, (int)PAGE_SIZE - pg_off);
+		dest = kmap_atomic(p);
+
+		siw_dbg_qp(rx_qp(rctx), "page %p, bytes=%u\n", p, bytes);
+
+		rv = skb_copy_bits(rctx->skb, rctx->skb_offset, dest + pg_off,
+				   bytes);
+		if (likely(!rv)) {
+			if (rctx->mpa_crc_hd)
+				rv = siw_crc_page(rctx->mpa_crc_hd, p, pg_off,
+						  bytes);
+
+			rctx->skb_offset += bytes;
+			copied += bytes;
+			len -= bytes;
+			dest_addr += bytes;
+			pg_off = 0;
+		}
+		kunmap_atomic(dest);
+
+		if (unlikely(rv)) {
+			rctx->skb_copied += copied;
+			rctx->skb_new -= copied;
+			copied = -EFAULT;
+
+			pr_warn("siw: [QP %d]: %s, len %d, page %p, rv %d\n",
+				RX_QPID(rctx), __func__, len, p, rv);
+
+			goto out;
+		}
+	}
+	/*
+	 * store chunk position for resume
+	 */
+	rctx->skb_copied += copied;
+	rctx->skb_new -= copied;
+out:
+	return copied;
+}
+
+static inline int siw_rx_kva(struct siw_iwarp_rx *rctx, void *kva, int len)
+{
+	int rv;
+
+	siw_dbg_qp(rx_qp(rctx), "kva %p, bytes=%u\n", kva, len);
+
+	rv = skb_copy_bits(rctx->skb, rctx->skb_offset, kva, len);
+	if (likely(!rv)) {
+		rctx->skb_offset += len;
+		rctx->skb_copied += len;
+		rctx->skb_new -= len;
+		if (rctx->mpa_crc_hd) {
+			rv = siw_crc_array(rctx->mpa_crc_hd, kva, len);
+			if (rv)
+				goto error;
+		}
+		return len;
+	}
+	pr_warn("siw: [QP %d]: %s, len %d, kva %p, rv %d\n",
+		RX_QPID(rctx), __func__, len, kva, rv);
+error:
+	return rv;
+}
+
+static int siw_rx_pbl(struct siw_iwarp_rx *rctx, struct siw_mr *mr,
+		      u64 addr, int len)
+{
+	struct siw_pbl *pbl = mr->pbl;
+	u64 offset = addr - mr->mem.va;
+	int copied = 0;
+
+	while (len) {
+		int bytes;
+		u64 buf_addr = siw_pbl_get_buffer(pbl, offset, &bytes,
+						  &rctx->pbl_idx);
+		if (!buf_addr)
+			break;
+
+		bytes = min(bytes, len);
+		if (siw_rx_kva(rctx, (void *)buf_addr, bytes) == bytes) {
+			copied += bytes;
+			offset += bytes;
+			len -= bytes;
+		} else
+			break;
+	}
+	return copied;
+}
+
+/*
+ * siw_rresp_check_ntoh()
+ *
+ * Check incoming RRESP fragment header against expected
+ * header values and update expected values for potential next
+ * fragment.
+ *
+ * NOTE: This function must be called only if an RRESP DDP segment
+ *       starts but not for fragmented consecutive pieces of an
+ *       already started DDP segment.
+ */
+static inline int siw_rresp_check_ntoh(struct siw_iwarp_rx *rctx)
+{
+	struct iwarp_rdma_rresp	*rresp = &rctx->hdr.rresp;
+	struct siw_wqe		*wqe = &rctx->wqe_active;
+	enum ddp_ecode		ecode;
+
+	u32 sink_stag = be32_to_cpu(rresp->sink_stag);
+	u64 sink_to   = be64_to_cpu(rresp->sink_to);
+
+	if (rctx->first_ddp_seg) {
+		rctx->ddp_stag = wqe->sqe.sge[0].lkey;
+		rctx->ddp_to = wqe->sqe.sge[0].laddr;
+		rctx->pbl_idx = 0;
+	}
+	/* The checks below extend beyond the semantics of DDP, and
+	 * into RDMAP:
+	 * We check if the read response matches exactly the
+	 * read request which was sent to the remote peer to
+	 * trigger this read response. RFC 5040/5041 do not
+	 * always have a proper error code for the detected
+	 * error cases. We choose 'base or bounds error' for
+	 * cases where the inbound STag is valid, but offset
+	 * or length do not match our response receive state.
+	 */
+	if (unlikely(rctx->ddp_stag != sink_stag)) {
+		pr_warn("siw: [QP %d]: rresp stag: %08x != %08x\n",
+			RX_QPID(rctx), sink_stag, rctx->ddp_stag);
+		ecode = DDP_ECODE_T_INVALID_STAG;
+		goto error;
+	}
+	if (unlikely(rctx->ddp_to != sink_to)) {
+		pr_warn("siw: [QP %d]: rresp off: %016llx != %016llx\n",
+			RX_QPID(rctx), (unsigned long long)sink_to,
+			(unsigned long long)rctx->ddp_to);
+		ecode = DDP_ECODE_T_BASE_BOUNDS;
+		goto error;
+	}
+	if (unlikely(!rctx->more_ddp_segs &&
+		     (wqe->processed + rctx->fpdu_part_rem != wqe->bytes))) {
+		pr_warn("siw: [QP %d]: rresp len: %d != %d\n",
+			RX_QPID(rctx), wqe->processed + rctx->fpdu_part_rem,
+			wqe->bytes);
+		ecode = DDP_ECODE_T_BASE_BOUNDS;
+		goto error;
+	}
+	return 0;
+error:
+	siw_init_terminate(rx_qp(rctx), TERM_ERROR_LAYER_DDP,
+			   DDP_ETYPE_TAGGED_BUF, ecode, 0);
+	return -EINVAL;
+}
+
+/*
+ * siw_write_check_ntoh()
+ *
+ * Check incoming WRITE fragment header against expected
+ * header values and update expected values for potential next
+ * fragment
+ *
+ * NOTE: This function must be called only if a WRITE DDP segment
+ *       starts but not for fragmented consecutive pieces of an
+ *       already started DDP segment.
+ */
+static inline int siw_write_check_ntoh(struct siw_iwarp_rx *rctx)
+{
+	struct iwarp_rdma_write	*write = &rctx->hdr.rwrite;
+	enum ddp_ecode ecode;
+
+	u32 sink_stag	= be32_to_cpu(write->sink_stag);
+	u64 sink_to	= be64_to_cpu(write->sink_to);
+
+	if (rctx->first_ddp_seg) {
+		rctx->ddp_stag = sink_stag;
+		rctx->ddp_to = sink_to;
+		rctx->pbl_idx = 0;
+	} else {
+		if (unlikely(rctx->ddp_stag != sink_stag)) {
+			pr_warn("siw: [QP %d]: write stag: %08x != %08x\n",
+				RX_QPID(rctx), sink_stag, rctx->ddp_stag);
+			ecode = DDP_ECODE_T_INVALID_STAG;
+			goto error;
+		}
+		if (unlikely(rctx->ddp_to != sink_to)) {
+			pr_warn("siw: [QP %d]: write off: %016llx != %016llx\n",
+				RX_QPID(rctx), (unsigned long long)sink_to,
+				(unsigned long long)rctx->ddp_to);
+			ecode = DDP_ECODE_T_BASE_BOUNDS;
+			goto error;
+		}
+	}
+	return 0;
+error:
+	siw_init_terminate(rx_qp(rctx), TERM_ERROR_LAYER_DDP,
+			   DDP_ETYPE_TAGGED_BUF, ecode, 0);
+	return -EINVAL;
+}
+
+/*
+ * siw_send_check_ntoh()
+ *
+ * Check incoming SEND fragment header against expected
+ * header values and update expected MSN if no next
+ * fragment expected
+ *
+ * NOTE: This function must be called only if a SEND DDP segment
+ *       starts but not for fragmented consecutive pieces of an
+ *       already started DDP segment.
+ */
+static inline int siw_send_check_ntoh(struct siw_iwarp_rx *rctx)
+{
+	struct iwarp_send_inv	*send = &rctx->hdr.send_inv;
+	struct siw_wqe		*wqe = &rctx->wqe_active;
+	enum ddp_ecode		ecode;
+
+	u32 ddp_msn = be32_to_cpu(send->ddp_msn);
+	u32 ddp_mo  = be32_to_cpu(send->ddp_mo);
+	u32 ddp_qn  = be32_to_cpu(send->ddp_qn);
+
+	if (unlikely(ddp_qn != RDMAP_UNTAGGED_QN_SEND)) {
+		pr_warn("siw: [QP %d]: invalid ddp qn %d for send\n",
+			RX_QPID(rctx), ddp_qn);
+		ecode = DDP_ECODE_UT_INVALID_QN;
+		goto error;
+	}
+	if (unlikely(ddp_msn != rctx->ddp_msn[RDMAP_UNTAGGED_QN_SEND])) {
+		pr_warn("siw: [QP %d]: send msn: %u != %u\n",
+			RX_QPID(rctx), ddp_msn,
+			rctx->ddp_msn[RDMAP_UNTAGGED_QN_SEND]);
+		ecode = DDP_ECODE_UT_INVALID_MSN_RANGE;
+		goto error;
+	}
+	if (unlikely(ddp_mo != wqe->processed)) {
+		pr_warn("siw: [QP %d], send mo: %u != %u\n",
+			RX_QPID(rctx), ddp_mo, wqe->processed);
+		ecode = DDP_ECODE_UT_INVALID_MO;
+		goto error;
+	}
+	if (rctx->first_ddp_seg) {
+		/* initialize user memory write position */
+		rctx->sge_idx = 0;
+		rctx->sge_off = 0;
+		rctx->pbl_idx = 0;
+
+		/* only valid for SEND_INV and SEND_SE_INV operations */
+		rctx->inval_stag = be32_to_cpu(send->inval_stag);
+	}
+	if (unlikely(wqe->bytes < wqe->processed + rctx->fpdu_part_rem)) {
+		siw_dbg_qp(rx_qp(rctx), "receive space short: %d - %d < %d\n",
+			   wqe->bytes, wqe->processed, rctx->fpdu_part_rem);
+		wqe->wc_status = SIW_WC_LOC_LEN_ERR;
+		ecode = DDP_ECODE_UT_INVALID_MSN_NOBUF;
+		goto error;
+	}
+	return 0;
+error:
+	siw_init_terminate(rx_qp(rctx), TERM_ERROR_LAYER_DDP,
+			   DDP_ETYPE_UNTAGGED_BUF, ecode, 0);
+	return -EINVAL;
+}
+
+static struct siw_wqe *siw_rqe_get(struct siw_qp *qp)
+{
+	struct siw_rqe *rqe;
+	struct siw_srq *srq = qp->srq;
+	struct siw_wqe *wqe = NULL;
+	bool is_srq = false;
+	unsigned long flags;
+
+	if (srq) {
+		is_srq = true;
+		spin_lock_irqsave(&srq->lock, flags);
+		rqe = &srq->recvq[srq->rq_get % srq->num_rqe];
+	} else {
+		rqe = &qp->recvq[qp->rq_get % qp->attrs.rq_size];
+	}
+	if (likely(rqe->flags == SIW_WQE_VALID)) {
+		int num_sge = rqe->num_sge;
+
+		if (likely(num_sge <= SIW_MAX_SGE)) {
+			int i = 0;
+
+			wqe = rx_wqe(qp);
+			rx_type(wqe) = SIW_OP_RECEIVE;
+			wqe->wr_status = SIW_WR_INPROGRESS;
+			wqe->bytes = 0;
+			wqe->processed = 0;
+
+			wqe->rqe.id = rqe->id;
+			wqe->rqe.num_sge = num_sge;
+
+			while (i < num_sge) {
+				wqe->rqe.sge[i].laddr = rqe->sge[i].laddr;
+				wqe->rqe.sge[i].lkey = rqe->sge[i].lkey;
+				wqe->rqe.sge[i].length = rqe->sge[i].length;
+				wqe->bytes += wqe->rqe.sge[i].length;
+				wqe->mem[i] = NULL;
+				i++;
+			}
+			/* can be re-used by the application */
+			smp_store_mb(rqe->flags, 0);
+		} else {
+			siw_dbg_qp(qp, "too many sge's: %d\n", rqe->num_sge);
+			goto out_unlock;
+		}
+		if (!is_srq) {
+			qp->rq_get++;
+		} else {
+			if (srq->armed) {
+				/* Test SRQ limit */
+				u32 off = (srq->rq_get + srq->limit) %
+					  srq->num_rqe;
+				struct siw_rqe *rqe2 = &srq->recvq[off];
+
+				if (!(rqe2->flags & SIW_WQE_VALID)) {
+					srq->armed = 0;
+					siw_srq_event(srq,
+						IB_EVENT_SRQ_LIMIT_REACHED);
+				}
+			}
+			srq->rq_get++;
+		}
+	}
+out_unlock:
+	if (is_srq)
+		spin_unlock_irqrestore(&srq->lock, flags);
+
+	return wqe;
+}
+
+/*
+ * siw_proc_send:
+ *
+ * Process one incoming SEND and place data into memory referenced by
+ * receive wqe.
+ *
+ * Function supports partially received sends (suspending/resuming
+ * current receive wqe processing)
+ *
+ * return value:
+ *	0:       reached the end of a DDP segment
+ *	-EAGAIN: to be called again to finish the DDP segment
+ */
+int siw_proc_send(struct siw_qp *qp, struct siw_iwarp_rx *rctx)
+{
+	struct siw_wqe		*wqe;
+	struct siw_sge		*sge;
+	u32			data_bytes,	/* all data bytes available */
+				rcvd_bytes;	/* sum of data bytes rcvd */
+	int rv = 0;
+
+	if (rctx->first_ddp_seg) {
+		wqe = siw_rqe_get(qp);
+		if (unlikely(!wqe)) {
+			siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+					   DDP_ETYPE_UNTAGGED_BUF,
+					   DDP_ECODE_UT_INVALID_MSN_NOBUF, 0);
+			return -ENOENT;
+		}
+	} else {
+		wqe = rx_wqe(qp);
+	}
+	if (rctx->state == SIW_GET_DATA_START) {
+		rv = siw_send_check_ntoh(rctx);
+		if (unlikely(rv)) {
+			siw_qp_event(qp, IB_EVENT_QP_FATAL);
+			return rv;
+		}
+		if (!rctx->fpdu_part_rem) /* zero length SEND */
+			return 0;
+	}
+	data_bytes = min(rctx->fpdu_part_rem, rctx->skb_new);
+	rcvd_bytes = 0;
+
+	/* A zero length SEND will skip the loop below */
+	while (data_bytes) {
+		struct siw_pd *pd;
+		struct siw_mr *mr;
+		struct siw_mem **mem;
+		u32 sge_bytes;	/* data bytes avail for SGE */
+
+		sge = &wqe->rqe.sge[rctx->sge_idx];
+
+		if (!sge->length) {
+			/* just skip empty sge's */
+			rctx->sge_idx++;
+			rctx->sge_off = 0;
+			rctx->pbl_idx = 0;
+			continue;
+		}
+		sge_bytes = min(data_bytes, sge->length - rctx->sge_off);
+		mem = &wqe->mem[rctx->sge_idx];
+
+		/*
+		 * check with QP's PD if no SRQ present, SRQ's PD otherwise
+		 */
+		pd = qp->srq == NULL ? qp->pd : qp->srq->pd;
+
+		rv = siw_check_sge(pd, sge, mem, SIW_MEM_LWRITE, rctx->sge_off,
+				   sge_bytes);
+		if (unlikely(rv)) {
+			siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+					   DDP_ETYPE_CATASTROPHIC,
+					   DDP_ECODE_CATASTROPHIC, 0);
+
+			siw_qp_event(qp, IB_EVENT_QP_ACCESS_ERR);
+			break;
+		}
+		mr = siw_mem2mr(*mem);
+		if (mr->mem_obj == NULL)
+			rv = siw_rx_kva(rctx,
+					(void *)(sge->laddr + rctx->sge_off),
+					sge_bytes);
+		else if (!mr->mem.is_pbl)
+			rv = siw_rx_umem(rctx, mr->umem,
+					 sge->laddr + rctx->sge_off, sge_bytes);
+		else
+			rv = siw_rx_pbl(rctx, mr,
+					sge->laddr + rctx->sge_off, sge_bytes);
+
+		if (unlikely(rv != sge_bytes)) {
+			wqe->processed += rcvd_bytes;
+
+			siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+					   DDP_ETYPE_CATASTROPHIC,
+					   DDP_ECODE_CATASTROPHIC, 0);
+			return -EINVAL;
+		}
+		rctx->sge_off += rv;
+
+		if (rctx->sge_off == sge->length) {
+			rctx->sge_idx++;
+			rctx->sge_off = 0;
+			rctx->pbl_idx = 0;
+		}
+		data_bytes -= rv;
+		rcvd_bytes += rv;
+
+		rctx->fpdu_part_rem -= rv;
+		rctx->fpdu_part_rcvd += rv;
+	}
+	wqe->processed += rcvd_bytes;
+
+	if (!rctx->fpdu_part_rem)
+		return 0;
+
+	return (rv < 0) ? rv : -EAGAIN;
+}
+
+/*
+ * siw_proc_write:
+ *
+ * Place incoming WRITE after referencing and checking target buffer
+ *
+ * Function supports partially received WRITEs (suspending/resuming
+ * current receive processing)
+ *
+ * return value:
+ *	0:       reached the end of a DDP segment
+ *	-EAGAIN: to be called again to finish the DDP segment
+ */
+int siw_proc_write(struct siw_qp *qp, struct siw_iwarp_rx *rctx)
+{
+	struct siw_device	*dev = qp->hdr.sdev;
+	struct siw_mem		*mem;
+	struct siw_mr		*mr;
+	int			bytes,
+				rv;
+
+	if (rctx->state == SIW_GET_DATA_START) {
+
+		if (!rctx->fpdu_part_rem) /* zero length WRITE */
+			return 0;
+
+		rv = siw_write_check_ntoh(rctx);
+		if (unlikely(rv)) {
+			siw_qp_event(qp, IB_EVENT_QP_FATAL);
+			return rv;
+		}
+	}
+	bytes = min(rctx->fpdu_part_rem, rctx->skb_new);
+
+	if (rctx->first_ddp_seg) {
+		rx_mem(qp) = siw_mem_id2obj(dev, rctx->ddp_stag >> 8);
+		rx_wqe(qp)->wr_status = SIW_WR_INPROGRESS;
+		rx_type(rx_wqe(qp)) = SIW_OP_WRITE;
+	}
+	if (unlikely(!rx_mem(qp))) {
+		siw_dbg_qp(qp, "sink stag not found/invalid, stag 0x%08x\n",
+			   rctx->ddp_stag);
+
+		siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+				   DDP_ETYPE_TAGGED_BUF,
+				   DDP_ECODE_T_INVALID_STAG, 0);
+		return -EINVAL;
+	}
+	mem = rx_mem(qp);
+	/*
+	 * Rtag not checked against mem's tag again because
+	 * hdr check guarantees same tag as before if fragmented
+	 */
+	rv = siw_check_mem(qp->pd, mem, rctx->ddp_to + rctx->fpdu_part_rcvd,
+			   SIW_MEM_RWRITE, bytes);
+	if (unlikely(rv)) {
+		siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+				   DDP_ETYPE_TAGGED_BUF,
+				   siw_tagged_error(-rv), 0);
+
+		siw_qp_event(qp, IB_EVENT_QP_ACCESS_ERR);
+
+		return -EINVAL;
+	}
+
+	mr = siw_mem2mr(mem);
+	if (mr->mem_obj == NULL)
+		rv = siw_rx_kva(rctx,
+				(void *)(rctx->ddp_to + rctx->fpdu_part_rcvd),
+				bytes);
+	else if (!mr->mem.is_pbl)
+		rv = siw_rx_umem(rctx, mr->umem,
+				 rctx->ddp_to + rctx->fpdu_part_rcvd, bytes);
+	else
+		rv = siw_rx_pbl(rctx, mr,
+				rctx->ddp_to + rctx->fpdu_part_rcvd, bytes);
+
+	if (unlikely(rv != bytes)) {
+		siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+				   DDP_ETYPE_CATASTROPHIC,
+				   DDP_ECODE_CATASTROPHIC, 0);
+		return -EINVAL;
+	}
+	rctx->fpdu_part_rem -= rv;
+	rctx->fpdu_part_rcvd += rv;
+
+	if (!rctx->fpdu_part_rem) {
+		rctx->ddp_to += rctx->fpdu_part_rcvd;
+		return 0;
+	}
+	return -EAGAIN;
+}
+
+/*
+ * Inbound RREQ's cannot carry user data.
+ */
+int siw_proc_rreq(struct siw_qp *qp, struct siw_iwarp_rx *rctx)
+{
+	if (!rctx->fpdu_part_rem)
+		return 0;
+
+	pr_warn("siw: [QP %d]: rreq with mpa len %d\n", QP_ID(qp),
+		be16_to_cpu(rctx->hdr.ctrl.mpa_len));
+
+	return -EPROTO;
+}
+
+/*
+ * siw_init_rresp:
+ *
+ * Process inbound RDMA READ REQ. Produce a pseudo READ RESPONSE WQE.
+ * Put it at the tail of the IRQ, if there is another WQE currently in
+ * transmit processing. If not, make it the current WQE to be processed
+ * and schedule transmit processing.
+ *
+ * Can be called from softirq context and from process
+ * context (RREAD socket loopback case!)
+ *
+ * return value:
+ *	0:      success,
+ *		failure code otherwise
+ */
+
+static int siw_init_rresp(struct siw_qp *qp, struct siw_iwarp_rx *rctx)
+{
+	struct siw_wqe *tx_work = tx_wqe(qp);
+	struct siw_sqe *resp;
+
+	uint64_t	raddr	= be64_to_cpu(rctx->hdr.rreq.sink_to),
+			laddr	= be64_to_cpu(rctx->hdr.rreq.source_to);
+	uint32_t	length	= be32_to_cpu(rctx->hdr.rreq.read_size),
+			lkey	= be32_to_cpu(rctx->hdr.rreq.source_stag),
+			rkey	= be32_to_cpu(rctx->hdr.rreq.sink_stag),
+			msn	= be32_to_cpu(rctx->hdr.rreq.ddp_msn);
+
+	int run_sq = 1, rv = 0;
+	unsigned long flags;
+
+	if (unlikely(msn != rctx->ddp_msn[RDMAP_UNTAGGED_QN_RDMA_READ])) {
+		siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+				   DDP_ETYPE_UNTAGGED_BUF,
+				   DDP_ECODE_UT_INVALID_MSN_RANGE, 0);
+		return -EPROTO;
+	}
+	spin_lock_irqsave(&qp->sq_lock, flags);
+
+	if (tx_work->wr_status == SIW_WR_IDLE) {
+		/*
+		 * immediately schedule READ response w/o
+		 * consuming IRQ entry: IRQ must be empty.
+		 */
+		tx_work->processed = 0;
+		tx_work->mem[0] = NULL;
+		tx_work->wr_status = SIW_WR_QUEUED;
+		resp = &tx_work->sqe;
+	} else {
+		resp = irq_alloc_free(qp);
+		run_sq = 0;
+	}
+	if (likely(resp)) {
+		resp->opcode = SIW_OP_READ_RESPONSE;
+
+		resp->sge[0].length = length;
+		resp->sge[0].laddr = laddr;
+		resp->sge[0].lkey = lkey;
+
+		/* Keep aside message sequence number for potential
+		 * error reporting during Read Response generation.
+		 */
+		resp->sge[1].length = msn;
+
+		resp->raddr = raddr;
+		resp->rkey = rkey;
+		resp->num_sge = length ? 1 : 0;
+
+		/* RRESP now valid as current TX wqe or placed into IRQ */
+		smp_store_mb(resp->flags, SIW_WQE_VALID);
+	} else {
+		pr_warn("siw: [QP %d]: irq %d exceeded %d\n",
+			QP_ID(qp), qp->irq_put % qp->attrs.irq_size,
+			qp->attrs.irq_size);
+
+		siw_init_terminate(qp, TERM_ERROR_LAYER_RDMAP,
+				   RDMAP_ETYPE_REMOTE_OPERATION,
+				   RDMAP_ECODE_CATASTROPHIC_STREAM, 0);
+		rv = -EPROTO;
+	}
+
+	spin_unlock_irqrestore(&qp->sq_lock, flags);
+
+	if (run_sq)
+		rv = siw_sq_start(qp);
+
+	return rv;
+}
+
+/*
+ * Only called at start of Read.Response processing.
+ * Transfer pending Read from tip of ORQ into current rx wqe,
+ * but keep ORQ entry valid until Read.Response processing done.
+ * No Queue locking needed.
+ */
+static int siw_orqe_start_rx(struct siw_qp *qp)
+{
+	struct siw_sqe *orqe;
+	struct siw_wqe *wqe = NULL;
+
+	/* make sure ORQ indices are current */
+	smp_mb();
+
+	orqe = orq_get_current(qp);
+	if (READ_ONCE(orqe->flags) & SIW_WQE_VALID) {
+		wqe = rx_wqe(qp);
+		wqe->sqe.id = orqe->id;
+		wqe->sqe.opcode = orqe->opcode;
+		wqe->sqe.sge[0].laddr = orqe->sge[0].laddr;
+		wqe->sqe.sge[0].lkey = orqe->sge[0].lkey;
+		wqe->sqe.sge[0].length = orqe->sge[0].length;
+		wqe->sqe.flags = orqe->flags;
+		wqe->sqe.num_sge = 1;
+		wqe->bytes = orqe->sge[0].length;
+		wqe->processed = 0;
+		wqe->mem[0] = NULL;
+		/* make sure WQE is completely written before valid */
+		smp_wmb();
+		wqe->wr_status = SIW_WR_INPROGRESS;
+
+		return 0;
+	}
+	return -EPROTO;
+}
+
+/*
+ * siw_proc_rresp:
+ *
+ * Place incoming RRESP data into memory referenced by RREQ WQE
+ * which is at the tip of the ORQ
+ *
+ * Function supports partially received RRESP's (suspending/resuming
+ * current receive processing)
+ */
+int siw_proc_rresp(struct siw_qp *qp, struct siw_iwarp_rx *rctx)
+{
+	struct siw_wqe	*wqe = rx_wqe(qp);
+	struct siw_mem	**mem;
+	struct siw_sge	*sge;
+	struct siw_mr	*mr;
+	int		bytes, rv;
+
+	if (rctx->first_ddp_seg) {
+		if (unlikely(wqe->wr_status != SIW_WR_IDLE)) {
+			pr_warn("siw: [QP %d]: proc rresp: status %d, op %d\n",
+				QP_ID(qp), wqe->wr_status, wqe->sqe.opcode);
+			rv = -EPROTO;
+			goto error_term;
+		}
+		/*
+		 * fetch pending RREQ from orq
+		 */
+		rv = siw_orqe_start_rx(qp);
+		if (rv) {
+			pr_warn("siw: [QP %d]: orq empty at idx %d\n",
+				QP_ID(qp), qp->orq_get % qp->attrs.orq_size);
+			goto error_term;
+		}
+		rv = siw_rresp_check_ntoh(rctx);
+		if (unlikely(rv)) {
+			siw_qp_event(qp, IB_EVENT_QP_FATAL);
+			return rv;
+		}
+	} else {
+		if (unlikely(wqe->wr_status != SIW_WR_INPROGRESS)) {
+			pr_warn("siw: [QP %d]: resume rresp: status %d\n",
+				QP_ID(qp), wqe->wr_status);
+			rv = -EPROTO;
+			goto error_term;
+		}
+	}
+	if (!rctx->fpdu_part_rem) /* zero length RRESPONSE */
+		return 0;
+
+	sge = wqe->sqe.sge; /* there is only one */
+	mem = &wqe->mem[0];
+
+	if (!(*mem)) {
+		/*
+		 * check target memory; the memory object is resolved
+		 * on the first fragment only
+		 */
+		rv = siw_check_sge(qp->pd, sge, mem, SIW_MEM_LWRITE, 0,
+				   wqe->bytes);
+		if (rv) {
+			siw_dbg_qp(qp, "target mem check: %d\n", rv);
+			wqe->wc_status = SIW_WC_LOC_PROT_ERR;
+
+			siw_init_terminate(qp, TERM_ERROR_LAYER_DDP,
+					   DDP_ETYPE_TAGGED_BUF,
+					   siw_tagged_error(-rv), 0);
+
+			siw_qp_event(qp, IB_EVENT_QP_ACCESS_ERR);
+
+			return -EINVAL;
+		}
+	}
+	bytes = min(rctx->fpdu_part_rem, rctx->skb_new);
+
+	mr = siw_mem2mr(*mem);
+	if (mr->mem_obj == NULL)
+		rv = siw_rx_kva(rctx, (void *)(sge->laddr + wqe->processed),
+				bytes);
+	else if (!mr->mem.is_pbl)
+		rv = siw_rx_umem(rctx, mr->umem, sge->laddr + wqe->processed,
+				 bytes);
+	else
+		rv = siw_rx_pbl(rctx, mr, sge->laddr + wqe->processed,
+				 bytes);
+	if (rv != bytes) {
+		wqe->wc_status = SIW_WC_GENERAL_ERR;
+		rv = -EINVAL;
+		goto error_term;
+	}
+	rctx->fpdu_part_rem -= rv;
+	rctx->fpdu_part_rcvd += rv;
+	wqe->processed += rv;
+
+	if (!rctx->fpdu_part_rem) {
+		rctx->ddp_to += rctx->fpdu_part_rcvd;
+		return 0;
+	}
+	return -EAGAIN;
+
+error_term:
+	siw_init_terminate(qp, TERM_ERROR_LAYER_DDP, DDP_ETYPE_CATASTROPHIC,
+			   DDP_ECODE_CATASTROPHIC, 0);
+	return rv;
+}
+
+int siw_proc_terminate(struct siw_qp *qp, struct siw_iwarp_rx *rctx)
+{
+	struct sk_buff		*skb = rctx->skb;
+	struct iwarp_terminate	*term = &rctx->hdr.terminate;
+	union iwarp_hdr		term_info;
+	u8			*infop = (u8 *)&term_info;
+	enum rdma_opcode	op;
+	u16			to_copy = sizeof(struct iwarp_ctrl);
+
+	pr_info("siw: [QP %d]: got TERMINATE. layer %d, type %d, code %d\n",
+		QP_ID(qp), __rdmap_term_layer(term), __rdmap_term_etype(term),
+		__rdmap_term_ecode(term));
+
+	if (be32_to_cpu(term->ddp_qn) != RDMAP_UNTAGGED_QN_TERMINATE ||
+	    be32_to_cpu(term->ddp_msn) !=
+			qp->rx_ctx.ddp_msn[RDMAP_UNTAGGED_QN_TERMINATE] ||
+	    be32_to_cpu(term->ddp_mo) != 0) {
+		pr_warn("siw: [QP %d]: received malformed TERM\n", QP_ID(qp));
+		pr_warn("     [QN x%08x, MSN x%08x, MO x%08x]\n",
+			be32_to_cpu(term->ddp_qn), be32_to_cpu(term->ddp_msn),
+			be32_to_cpu(term->ddp_mo));
+		return -ECONNRESET;
+	}
+	/*
+	 * Receive remaining pieces of TERM if indicated
+	 */
+	if (!term->flag_m)
+		return -ECONNRESET;
+
+	/* Do not bother to reassemble a network fragmented
+	 * TERM message
+	 */
+	if (rctx->skb_new < sizeof(struct iwarp_ctrl_tagged))
+		return -ECONNRESET;
+
+	memset(infop, 0, sizeof(term_info));
+
+	skb_copy_bits(skb, rctx->skb_offset, infop, to_copy);
+
+	op = __rdmap_opcode(&term_info.ctrl);
+	if (op >= RDMAP_TERMINATE)
+		goto out;
+
+	infop += to_copy;
+	rctx->skb_offset += to_copy;
+	rctx->skb_new -= to_copy;
+	rctx->skb_copied += to_copy;
+	rctx->fpdu_part_rcvd += to_copy;
+	rctx->fpdu_part_rem -= to_copy;
+
+	to_copy = iwarp_pktinfo[op].hdr_len - to_copy;
+
+	/* Again, no network fragmented TERM's */
+	if (to_copy + MPA_CRC_SIZE > rctx->skb_new)
+		return -ECONNRESET;
+
+	skb_copy_bits(skb, rctx->skb_offset, infop, to_copy);
+
+	if (term->flag_m) {
+		/* Adjust len to packet hdr print function */
+		u32 mpa_len = be16_to_cpu(term_info.ctrl.mpa_len);
+
+		mpa_len += iwarp_pktinfo[op].hdr_len - MPA_HDR_SIZE;
+		term_info.ctrl.mpa_len = cpu_to_be16(mpa_len);
+	}
+	if (term->flag_r) {
+		if (term->flag_m)
+			siw_print_hdr(&term_info, QP_ID(qp),
+			      "TERMINATE reports RDMAP HDR (len valid): ");
+		else
+			siw_print_hdr(&term_info, QP_ID(qp),
+			      "TERMINATE reports RDMAP HDR (len invalid): ");
+	} else if (term->flag_d) {
+		if (term->flag_m)
+			siw_print_hdr(&term_info, QP_ID(qp),
+			      "TERMINATE reports DDP HDR (len valid): ");
+		else
+			siw_print_hdr(&term_info, QP_ID(qp),
+			      "TERMINATE reports DDP HDR (len invalid): ");
+	}
+out:
+	rctx->skb_new -= to_copy;
+	rctx->skb_offset += to_copy;
+	rctx->skb_copied += to_copy;
+	rctx->fpdu_part_rcvd += to_copy;
+	rctx->fpdu_part_rem -= to_copy;
+
+	return -ECONNRESET;
+}
+
+static int siw_get_trailer(struct siw_qp *qp, struct siw_iwarp_rx *rctx)
+{
+	struct sk_buff	*skb = rctx->skb;
+	u8		*tbuf = (u8 *)&rctx->trailer.crc - rctx->pad;
+	int		avail;
+
+	avail = min(rctx->skb_new, rctx->fpdu_part_rem);
+
+	siw_dbg_qp(qp, "expected %d, available %d, pad %d, skb_new %d\n",
+		   rctx->fpdu_part_rem, avail, rctx->pad, rctx->skb_new);
+
+	skb_copy_bits(skb, rctx->skb_offset,
+		      tbuf + rctx->fpdu_part_rcvd, avail);
+
+	rctx->fpdu_part_rcvd += avail;
+	rctx->fpdu_part_rem -= avail;
+
+	rctx->skb_new -= avail;
+	rctx->skb_offset += avail;
+	rctx->skb_copied += avail;
+
+	if (!rctx->fpdu_part_rem) {
+		__be32	crc_in, crc_own = 0;
+		/*
+		 * check crc if required
+		 */
+		if (!rctx->mpa_crc_hd)
+			return 0;
+
+		if (rctx->pad && siw_crc_array(rctx->mpa_crc_hd,
+					       tbuf, rctx->pad) != 0)
+			return -EINVAL;
+
+		crypto_shash_final(rctx->mpa_crc_hd, (u8 *)&crc_own);
+
+		/*
+		 * CRC32 is computed, transmitted and received directly in NBO,
+		 * so there's never a reason to convert byte order.
+		 */
+		crc_in = rctx->trailer.crc;
+
+		if (unlikely(crc_in != crc_own)) {
+			pr_warn("SIW: crc error. in: %08x, computed %08x\n",
+				crc_in, crc_own);
+
+			siw_init_terminate(qp, TERM_ERROR_LAYER_LLP,
+					   LLP_ETYPE_MPA,
+					   LLP_ECODE_RECEIVED_CRC, 0);
+			return -EINVAL;
+		}
+		return 0;
+	}
+	return -EAGAIN;
+}
+
+#define MIN_DDP_HDR sizeof(struct iwarp_ctrl_tagged)
+
+static int siw_get_hdr(struct siw_iwarp_rx *rctx)
+{
+	struct sk_buff		*skb = rctx->skb;
+	struct iwarp_ctrl	*c_hdr = &rctx->hdr.ctrl;
+	u8			opcode;
+
+	int bytes;
+
+	if (rctx->fpdu_part_rcvd < MIN_DDP_HDR) {
+		/*
+		 * copy a minimum sized (tagged) DDP frame control part
+		 */
+		bytes = min_t(int, rctx->skb_new, MIN_DDP_HDR
+				- rctx->fpdu_part_rcvd);
+
+		skb_copy_bits(skb, rctx->skb_offset,
+			      (char *)c_hdr + rctx->fpdu_part_rcvd, bytes);
+
+		rctx->fpdu_part_rcvd += bytes;
+
+		rctx->skb_new -= bytes;
+		rctx->skb_offset += bytes;
+		rctx->skb_copied += bytes;
+
+		if (rctx->fpdu_part_rcvd < MIN_DDP_HDR)
+			return -EAGAIN;
+
+		if (unlikely(__ddp_version(c_hdr) != DDP_VERSION)) {
+			enum ddp_etype etype;
+			enum ddp_ecode ecode;
+
+			pr_warn("SIW: received ddp version unsupported %d\n",
+				__ddp_version(c_hdr));
+
+			if (c_hdr->ddp_rdmap_ctrl & DDP_FLAG_TAGGED) {
+				etype = DDP_ETYPE_TAGGED_BUF;
+				ecode = DDP_ECODE_T_VERSION;
+			} else {
+				etype = DDP_ETYPE_UNTAGGED_BUF;
+				ecode = DDP_ECODE_UT_VERSION;
+			}
+			siw_init_terminate(rx_qp(rctx), TERM_ERROR_LAYER_DDP,
+					   etype, ecode, 0);
+			return -EINVAL;
+		}
+		if (unlikely(__rdmap_version(c_hdr) != RDMAP_VERSION)) {
+			pr_warn("SIW: received rdmap version unsupported %d\n",
+				__rdmap_version(c_hdr));
+
+			siw_init_terminate(rx_qp(rctx), TERM_ERROR_LAYER_RDMAP,
+					   RDMAP_ETYPE_REMOTE_OPERATION,
+					   RDMAP_ECODE_VERSION, 0);
+			return -EINVAL;
+		}
+		opcode = __rdmap_opcode(c_hdr);
+
+		if (opcode > RDMAP_TERMINATE) {
+			pr_warn("SIW: received unknown packet type %d\n",
+				opcode);
+
+			siw_init_terminate(rx_qp(rctx), TERM_ERROR_LAYER_RDMAP,
+					   RDMAP_ETYPE_REMOTE_OPERATION,
+					   RDMAP_ECODE_OPCODE, 0);
+			return -EINVAL;
+		}
+		siw_dbg_qp(rx_qp(rctx), "new header, opcode %d\n", opcode);
+	} else {
+		opcode = __rdmap_opcode(c_hdr);
+	}
+
+	/*
+	 * Figure out the length of the current header: the variable
+	 * length of an iWARP header may force us to copy the header
+	 * information in two steps. Only for tagged DDP messages is
+	 * the header already completely received at this point.
+	 */
+	if (iwarp_pktinfo[opcode].hdr_len > sizeof(struct iwarp_ctrl_tagged)) {
+		bytes = iwarp_pktinfo[opcode].hdr_len - MIN_DDP_HDR;
+
+		if (rctx->skb_new < bytes)
+			return -EAGAIN;
+
+		skb_copy_bits(skb, rctx->skb_offset,
+			      (char *)c_hdr + rctx->fpdu_part_rcvd, bytes);
+
+		rctx->fpdu_part_rcvd += bytes;
+
+		rctx->skb_new -= bytes;
+		rctx->skb_offset += bytes;
+		rctx->skb_copied += bytes;
+	}
+	/*
+	 * DDP/RDMAP header receive completed. Check if the current
+	 * DDP segment starts a new RDMAP message or continues a previously
+	 * started RDMAP message.
+	 *
+	 * Note well from the comments on DDP reassembly:
+	 * - Support for unordered reception of DDP segments
+	 *   (or FPDUs) from different RDMAP messages is not needed.
+	 * - Unordered reception of DDP segments of the same
+	 *   RDMAP message is not supported. It is probably not
+	 *   needed with most peers.
+	 */
+	siw_dprint_hdr(&rctx->hdr, RX_QPID(rctx), "HDR received");
+
+	if (rctx->more_ddp_segs) {
+		rctx->first_ddp_seg = 0;
+		if (rctx->prev_rdmap_opcode != opcode) {
+			pr_warn("SIW: packet intersection: %d : %d\n",
+				rctx->prev_rdmap_opcode, opcode);
+			return -EPROTO;
+		}
+	} else {
+		rctx->prev_rdmap_opcode = opcode;
+		rctx->first_ddp_seg = 1;
+	}
+	rctx->more_ddp_segs =
+		c_hdr->ddp_rdmap_ctrl & DDP_FLAG_LAST ? 0 : 1;
+
+	return 0;
+}
+
+static inline int siw_fpdu_payload_len(struct siw_iwarp_rx *rctx)
+{
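+	/*
+	 * The MPA length field does not cover the 2 byte MPA header
+	 * itself, so MPA_HDR_SIZE is added back before subtracting the
+	 * header bytes already consumed (fpdu_part_rcvd).
+	 */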
+	return be16_to_cpu(rctx->hdr.ctrl.mpa_len) - rctx->fpdu_part_rcvd
+		+ MPA_HDR_SIZE;
+}
+
+static inline int siw_fpdu_trailer_len(struct siw_iwarp_rx *rctx)
+{
+	int mpa_len = be16_to_cpu(rctx->hdr.ctrl.mpa_len) + MPA_HDR_SIZE;
+
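+	/*
+	 * (-mpa_len & 0x3) yields the 0..3 pad bytes needed to align
+	 * the FPDU to a 4 byte boundary, e.g. mpa_len == 43 results
+	 * in 1 pad byte ahead of the CRC.
+	 */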
+	return MPA_CRC_SIZE + (-mpa_len & 0x3);
+}
+
+static int siw_check_tx_fence(struct siw_qp *qp)
+{
+	struct siw_wqe *tx_waiting = tx_wqe(qp);
+	struct siw_sqe *rreq;
+	int resume_tx = 0, rv = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&qp->orq_lock, flags);
+
+	rreq = orq_get_current(qp);
+
+	/* free current orq entry */
+	smp_store_mb(rreq->flags, 0);
+
+	if (qp->tx_ctx.orq_fence) {
+		if (unlikely(tx_waiting->wr_status != SIW_WR_QUEUED)) {
+			pr_warn("SIW: [QP %d]: fence resume: bad status %d\n",
+				QP_ID(qp), tx_waiting->wr_status);
+			rv = -EPROTO;
+			goto out;
+		}
+		/* resume SQ processing */
+		if (tx_waiting->sqe.opcode == SIW_OP_READ ||
+		    tx_waiting->sqe.opcode == SIW_OP_READ_LOCAL_INV) {
+
+			rreq = orq_get_tail(qp);
+			if (unlikely(!rreq)) {
+				pr_warn("SIW: [QP %d]: no orqe\n", QP_ID(qp));
+				rv = -EPROTO;
+				goto out;
+			}
+			siw_read_to_orq(rreq, &tx_waiting->sqe);
+
+			qp->orq_put++;
+			qp->tx_ctx.orq_fence = 0;
+			resume_tx = 1;
+
+		} else if (siw_orq_empty(qp)) {
+			qp->tx_ctx.orq_fence = 0;
+			resume_tx = 1;
+		} else {
+			pr_warn("SIW: [QP %d]: fence resume: orq idx: %d:%d\n",
+				QP_ID(qp), qp->orq_get, qp->orq_put);
+			rv = -EPROTO;
+		}
+	}
+	qp->orq_get++;
+out:
+	spin_unlock_irqrestore(&qp->orq_lock, flags);
+
+	if (resume_tx)
+		rv = siw_sq_start(qp);
+
+	return rv;
+}
+
+/*
+ * siw_rdmap_complete()
+ *
+ * Complete processing of an RDMA message after receiving all
+ * DDP segments, or abort processing after encountering an error case.
+ *
+ *   o SENDs + RRESPs need completion processing,
+ *   o RREQs need READ RESPONSE initialization,
+ *   o WRITEs need memory dereferencing
+ *
+ * TODO: Failed WRITEs need local error to be surfaced.
+ */
+static inline int
+siw_rdmap_complete(struct siw_qp *qp, int error)
+{
+	struct siw_iwarp_rx	*rctx = &qp->rx_ctx;
+	struct siw_wqe		*wqe = rx_wqe(qp);
+	enum siw_wc_status	wc_status = wqe->wc_status;
+
+	u8 opcode = __rdmap_opcode(&rctx->hdr.ctrl);
+	int rv = 0;
+
+	switch (opcode) {
+
+	case RDMAP_SEND_SE:
+	case RDMAP_SEND_SE_INVAL:
+		wqe->rqe.flags |= SIW_WQE_SOLICITED;
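+		/* Fall through */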
+	case RDMAP_SEND:
+	case RDMAP_SEND_INVAL:
+		if (wqe->wr_status == SIW_WR_IDLE)
+			break;
+
+		rctx->ddp_msn[RDMAP_UNTAGGED_QN_SEND]++;
+
+		if (error != 0 && wc_status == SIW_WC_SUCCESS)
+			wc_status = SIW_WC_GENERAL_ERR;
+		/*
+		 * Handle STag invalidation request
+		 */
+		if (wc_status == SIW_WC_SUCCESS &&
+		    (opcode == RDMAP_SEND_INVAL ||
+		     opcode == RDMAP_SEND_SE_INVAL)) {
+			rv = siw_invalidate_stag(qp->pd, rctx->inval_stag);
+			if (rv) {
+				siw_init_terminate(qp,
+					TERM_ERROR_LAYER_RDMAP,
+					rv == -EACCES ?
+						RDMAP_ETYPE_REMOTE_PROTECTION :
+						RDMAP_ETYPE_REMOTE_OPERATION,
+					RDMAP_ECODE_CANNOT_INVALIDATE, 0);
+
+				wc_status = SIW_WC_REM_INV_REQ_ERR;
+			}
+		}
+		rv = siw_rqe_complete(qp, &wqe->rqe, wqe->processed,
+				      wc_status);
+		siw_wqe_put_mem(wqe, SIW_OP_RECEIVE);
+
+		break;
+
+	case RDMAP_RDMA_READ_RESP:
+		if (wqe->wr_status == SIW_WR_IDLE)
+			break;
+
+		if (error != 0) {
+			if  (rctx->state == SIW_GET_HDR || error == -ENODATA)
+				/*  eventual RREQ in ORQ left untouched */
+				break;
+
+			if (wc_status == SIW_WC_SUCCESS)
+				wc_status = SIW_WC_GENERAL_ERR;
+		} else if (qp->kernel_verbs &&
+			   rx_type(wqe) == SIW_OP_READ_LOCAL_INV) {
+			/*
+			 * Handle any STag invalidation request
+			 */
+			rv = siw_invalidate_stag(qp->pd, wqe->sqe.sge[0].lkey);
+			if (rv) {
+				siw_init_terminate(qp, TERM_ERROR_LAYER_RDMAP,
+						   RDMAP_ETYPE_CATASTROPHIC,
+						   RDMAP_ECODE_UNSPECIFIED, 0);
+
+				if (wc_status == SIW_WC_SUCCESS) {
+					wc_status = SIW_WC_GENERAL_ERR;
+					error = rv;
+				}
+			}
+		}
+		/*
+		 * All errors turn the wqe into signalled.
+		 */
+		if ((wqe->sqe.flags & SIW_WQE_SIGNALLED) || error != 0)
+			rv = siw_sqe_complete(qp, &wqe->sqe, wqe->processed,
+					      wc_status);
+		siw_wqe_put_mem(wqe, SIW_OP_READ);
+
+		if (!error)
+			rv = siw_check_tx_fence(qp);
+		break;
+
+	case RDMAP_RDMA_READ_REQ:
+		if (!error) {
+			rv = siw_init_rresp(qp, rctx);
+			rctx->ddp_msn[RDMAP_UNTAGGED_QN_RDMA_READ]++;
+		}
+		break;
+
+	case RDMAP_RDMA_WRITE:
+		if (wqe->wr_status == SIW_WR_IDLE)
+			break;
+
+		/*
+		 * Free References from memory object if
+		 * attached to receive context (inbound WRITE).
+		 * While a zero-length WRITE is allowed,
+		 * no memory reference got created.
+		 */
+		if (rx_mem(qp)) {
+			siw_mem_put(rx_mem(qp));
+			rx_mem(qp) = NULL;
+		}
+		break;
+
+	default:
+		break;
+	}
+	wqe->wr_status = SIW_WR_IDLE;
+
+	return rv;
+}
+
+/*
+ * siw_tcp_rx_data()
+ *
+ * Main routine to consume inbound TCP payload
+ *
+ * @rd_desc:	read descriptor
+ * @skb:	socket buffer
+ * @off:	offset in skb
+ * @len:	skb->len - offset : payload in skb
+ */
+int siw_tcp_rx_data(read_descriptor_t *rd_desc, struct sk_buff *skb,
+		    unsigned int off, size_t len)
+{
+	struct siw_qp		*qp = rd_desc->arg.data;
+	struct siw_iwarp_rx	*rctx = &qp->rx_ctx;
+	int			rv;
+
+	rctx->skb = skb;
+	rctx->skb_new = skb->len - off;
+	rctx->skb_offset = off;
+	rctx->skb_copied = 0;
+
+	siw_dbg_qp(qp, "new data, len %d\n", rctx->skb_new);
+
+	while (rctx->skb_new) {
+		int run_completion = 1;
+
+		if (unlikely(rctx->rx_suspend)) {
+			/* Do not process any more data */
+			rctx->skb_copied += rctx->skb_new;
+			break;
+		}
+		switch (rctx->state) {
+
+		case SIW_GET_HDR:
+			rv = siw_get_hdr(rctx);
+			if (!rv) {
+				if (rctx->mpa_crc_hd &&
+				    siw_crc_rxhdr(rctx) != 0) {
+					rv = -EINVAL;
+					break;
+				}
+				rctx->fpdu_part_rem =
+					siw_fpdu_payload_len(rctx);
+
+				if (rctx->fpdu_part_rem)
+					rctx->pad = -rctx->fpdu_part_rem & 0x3;
+				else
+					rctx->pad = 0;
+
+				rctx->state = SIW_GET_DATA_START;
+				rctx->fpdu_part_rcvd = 0;
+			}
+			break;
+
+		case SIW_GET_DATA_MORE:
+			/*
+			 * Another data fragment of the same DDP segment.
+			 * Setting first_ddp_seg = 0 avoids repeating
+			 * initializations that shall occur only once per
+			 * DDP segment.
+			 */
+			rctx->first_ddp_seg = 0;
+			/* Fall through */
+
+		case SIW_GET_DATA_START:
+			/*
+			 * Headers will be checked by the opcode-specific
+			 * data receive function below.
+			 */
+			rv = siw_rx_data(qp, rctx);
+			if (!rv) {
+				rctx->fpdu_part_rem =
+					siw_fpdu_trailer_len(rctx);
+				rctx->fpdu_part_rcvd = 0;
+				rctx->state = SIW_GET_TRAILER;
+			} else {
+				if (unlikely(rv == -ECONNRESET))
+					run_completion = 0;
+				else
+					rctx->state = SIW_GET_DATA_MORE;
+			}
+			break;
+
+		case SIW_GET_TRAILER:
+			/*
+			 * read CRC + any padding
+			 */
+			rv = siw_get_trailer(qp, rctx);
+			if (likely(!rv)) {
+				/*
+				 * FPDU completed.
+				 * complete RDMAP message if last fragment
+				 */
+				rctx->state = SIW_GET_HDR;
+				rctx->fpdu_part_rcvd = 0;
+
+				if (!(rctx->hdr.ctrl.ddp_rdmap_ctrl
+					& DDP_FLAG_LAST))
+					/* more frags */
+					break;
+
+				rv = siw_rdmap_complete(qp, 0);
+				run_completion = 0;
+			}
+			break;
+
+		default:
+			pr_warn("QP[%d]: RX out of state\n", QP_ID(qp));
+			rv = -EPROTO;
+			run_completion = 0;
+		}
+
+		if (unlikely(rv != 0 && rv != -EAGAIN)) {
+			if (rctx->state > SIW_GET_HDR && run_completion)
+				siw_rdmap_complete(qp, rv);
+
+			siw_dbg_qp(qp, "rx error %d, rx state %d\n",
+				   rv, rctx->state);
+
+			siw_qp_cm_drop(qp, 1);
+
+			break;
+		}
+		if (rv) {
+			siw_dbg_qp(qp, "fpdu fragment, state %d, missing %d\n",
+				   rctx->state, rctx->fpdu_part_rem);
+			break;
+		}
+	}
+	return rctx->skb_copied;
+}
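+
+/*
+ * Minimal sketch (assumed helper name, for illustration only): the
+ * connection code is expected to push inbound TCP payload into
+ * siw_tcp_rx_data() from the socket's data ready path via
+ * tcp_read_sock(), with the QP attached as the read descriptor
+ * argument.
+ */
+static inline void siw_qp_rx_from_socket(struct siw_qp *qp, struct sock *sk)
+{
+	read_descriptor_t rd_desc = { .arg.data = qp, .count = 1 };
+
+	tcp_read_sock(sk, &rd_desc, siw_tcp_rx_data);
+}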
-- 
2.13.6


* [PATCH v3 11/13] SoftiWarp Completion Queue methods
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (9 preceding siblings ...)
  2018-01-14 22:36   ` [PATCH v3 10/13] SoftiWarp receive path Bernard Metzler
@ 2018-01-14 22:36   ` Bernard Metzler
  2018-01-14 22:36   ` [PATCH v3 12/13] SoftiWarp debugging code Bernard Metzler
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_cq.c | 150 +++++++++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_cq.c

diff --git a/drivers/infiniband/sw/siw/siw_cq.c b/drivers/infiniband/sw/siw/siw_cq.c
new file mode 100644
index 000000000000..d47d6d03b318
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_cq.c
@@ -0,0 +1,150 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/list.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_obj.h"
+#include "siw_cm.h"
+
+static int siw_wc_op_siw2ib[SIW_NUM_OPCODES] = {
+	[SIW_OP_WRITE]		= IB_WC_RDMA_WRITE,
+	[SIW_OP_SEND]		= IB_WC_SEND,
+	[SIW_OP_SEND_WITH_IMM]	= IB_WC_SEND,
+	[SIW_OP_READ]		= IB_WC_RDMA_READ,
+	[SIW_OP_READ_LOCAL_INV]	= IB_WC_RDMA_READ,
+	[SIW_OP_COMP_AND_SWAP]	= IB_WC_COMP_SWAP,
+	[SIW_OP_FETCH_AND_ADD]	= IB_WC_FETCH_ADD,
+	[SIW_OP_INVAL_STAG]	= IB_WC_LOCAL_INV,
+	[SIW_OP_REG_MR]		= IB_WC_REG_MR,
+	[SIW_OP_RECEIVE]	= IB_WC_RECV,
+	[SIW_OP_READ_RESPONSE]	= -1 /* not used */
+};
+
+static struct {
+	enum siw_wc_status siw;
+	enum ib_wc_status ib;
+} map_cqe_status[SIW_NUM_WC_STATUS] = {
+	{SIW_WC_SUCCESS,		IB_WC_SUCCESS},
+	{SIW_WC_LOC_LEN_ERR,		IB_WC_LOC_LEN_ERR},
+	{SIW_WC_LOC_PROT_ERR,		IB_WC_LOC_PROT_ERR},
+	{SIW_WC_LOC_QP_OP_ERR,		IB_WC_LOC_QP_OP_ERR},
+	{SIW_WC_WR_FLUSH_ERR,		IB_WC_WR_FLUSH_ERR},
+	{SIW_WC_BAD_RESP_ERR,		IB_WC_BAD_RESP_ERR},
+	{SIW_WC_LOC_ACCESS_ERR,		IB_WC_LOC_ACCESS_ERR},
+	{SIW_WC_REM_ACCESS_ERR,		IB_WC_REM_ACCESS_ERR},
+	{SIW_WC_REM_INV_REQ_ERR,	IB_WC_REM_INV_REQ_ERR},
+	{SIW_WC_GENERAL_ERR,		IB_WC_GENERAL_ERR}
+};
+
+/*
+ * translate wc into ib syntax
+ */
+static void siw_wc_siw2ib(struct siw_cqe *cqe, struct ib_wc *wc)
+{
+	memset(wc, 0, sizeof(*wc));
+
+	wc->wr_id = cqe->id;
+	wc->status = map_cqe_status[cqe->status].ib;
+	wc->byte_len = cqe->bytes;
+	wc->qp = &((struct siw_qp *)cqe->qp)->base_qp;
+
+	wc->opcode = siw_wc_op_siw2ib[cqe->opcode];
+}
+
+/*
+ * Reap one CQE from the CQ.
+ */
+int siw_reap_cqe(struct siw_cq *cq, struct ib_wc *wc)
+{
+	struct siw_cqe *cqe;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cq->lock, flags);
+
+	cqe = &cq->queue[cq->cq_get % cq->num_cqe];
+	if (cqe->flags & SIW_WQE_VALID) {
+		siw_wc_siw2ib(cqe, wc);
+
+		if (cq->kernel_verbs) {
+			siw_dbg_qp(((struct siw_qp *)cqe->qp),
+				   "[CQ %d]: reap wqe type %d, idx %d\n",
+				   OBJ_ID(cq), cqe->opcode,
+				   cq->cq_get % cq->num_cqe);
+			siw_qp_put(cqe->qp);
+		}
+		cqe->flags = 0;
+		cq->cq_get++;
+
+		/* Make cqe state visible */
+		smp_wmb();
+
+		spin_unlock_irqrestore(&cq->lock, flags);
+
+		return 1;
+	}
+	spin_unlock_irqrestore(&cq->lock, flags);
+
+	return 0;
+}
+
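+/*
+ * Minimal sketch (assumed helper name, for illustration only): a
+ * poll_cq style consumer is expected to call siw_reap_cqe() until it
+ * returns 0 or the caller provided array of work completions is full.
+ */
+static inline int siw_poll_cq_sketch(struct siw_cq *cq, int num_entries,
+				     struct ib_wc *wc)
+{
+	int polled = 0;
+
+	while (polled < num_entries && siw_reap_cqe(cq, wc + polled))
+		polled++;
+
+	return polled;
+}
+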
+/*
+ * siw_cq_flush()
+ *
+ * Flush all CQ elements.
+ */
+void siw_cq_flush(struct siw_cq *cq)
+{
+	struct ib_wc wc;
+
+	int got, total = 0;
+
+	siw_dbg(cq->hdr.sdev, "[CQ %d]: enter\n", OBJ_ID(cq));
+
+	do {
+		got = siw_reap_cqe(cq, &wc);
+		total += got;
+	} while (got > 0);
+}
-- 
2.13.6


* [PATCH v3 12/13] SoftiWarp debugging code
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (10 preceding siblings ...)
  2018-01-14 22:36   ` [PATCH v3 11/13] SoftiWarp Completion Queue methods Bernard Metzler
@ 2018-01-14 22:36   ` Bernard Metzler
  2018-01-14 22:36   ` [PATCH v3 13/13] Add SoftiWarp to kernel build environment Bernard Metzler
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/sw/siw/siw_debug.c | 463 ++++++++++++++++++++++++++++++++++
 drivers/infiniband/sw/siw/siw_debug.h |  87 +++++++
 2 files changed, 550 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/siw_debug.c
 create mode 100644 drivers/infiniband/sw/siw/siw_debug.h

diff --git a/drivers/infiniband/sw/siw/siw_debug.c b/drivers/infiniband/sw/siw/siw_debug.c
new file mode 100644
index 000000000000..7cad480ea7a6
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_debug.c
@@ -0,0 +1,463 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Fredy Neeser <nfd-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <net/tcp.h>
+#include <linux/list.h>
+#include <linux/debugfs.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "siw.h"
+#include "siw_cm.h"
+#include "siw_obj.h"
+
+#define FDENTRY(f) (f->f_path.dentry)
+
+static struct dentry *siw_debugfs;
+
+static ssize_t siw_show_qps(struct file *f, char __user *buf, size_t space,
+			    loff_t *ppos)
+{
+	struct siw_device *sdev = FDENTRY(f)->d_inode->i_private;
+	struct list_head *pos, *tmp;
+	char *kbuf = NULL;
+	int len = 0, n, num_qp;
+
+	if (*ppos)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	num_qp = atomic_read(&sdev->num_qp);
+	if (!num_qp)
+		goto out;
+
+	len = snprintf(kbuf, space, "%s: %d QPs\n", sdev->base_dev.name,
+		       num_qp);
+	if (len > space) {
+		len = space;
+		goto out;
+	}
+	space -= len;
+	n = snprintf(kbuf + len, space,
+		     "%-15s%-6s%-6s%-5s%-5s%-5s%-5s%-5s%-20s%s\n",
+		     "QP-ID", "State", "Ref's", "SQ", "RQ", "IRQ", "ORQ",
+		     "s/r", "Sock", "CEP");
+
+	if (n > space) {
+		len += space;
+		goto out;
+	}
+	len += n;
+	space -= n;
+
+	list_for_each_safe(pos, tmp, &sdev->qp_list) {
+		struct siw_qp *qp = list_entry(pos, struct siw_qp, devq);
+
+		n = snprintf(kbuf + len, space,
+			"%-15d%-6d%-6d%-5d%-5d%-5d%-5d%d/%-3d0x%-18p0x%-18p\n",
+			     QP_ID(qp),
+			     qp->attrs.state,
+			     refcount_read(&qp->hdr.ref),
+			     qp->attrs.sq_size,
+			     qp->attrs.rq_size,
+			     qp->attrs.irq_size,
+			     qp->attrs.orq_size,
+			     tx_wqe(qp) ? 1 : 0,
+			     rx_wqe(qp) ? 1 : 0,
+			     qp->attrs.llp_stream_handle,
+			     qp->cep);
+		if (n < space) {
+			len += n;
+			space -= n;
+		} else {
+			len += space;
+			break;
+		}
+	}
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+
+	return len;
+};
+
+static ssize_t siw_show_mrs(struct file *f, char __user *buf, size_t space,
+			    loff_t *ppos)
+{
+	struct siw_device *sdev = FDENTRY(f)->d_inode->i_private;
+	struct list_head *pos, *tmp;
+	char *kbuf = NULL;
+	int len = 0, n, num_mr;
+
+	if (*ppos)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	num_mr = atomic_read(&sdev->num_mr);
+	if (!num_mr)
+		goto out;
+
+	len = snprintf(kbuf, space, "%s: %d MRs\n", sdev->base_dev.name,
+		       num_mr);
+	if (len > space) {
+		len = space;
+		goto out;
+	}
+	space -= len;
+
+	n = snprintf(kbuf + len, space,
+		     "%-15s%-18s%-8s%-8s%-22s%-8s%-9s\n",
+		     "MEM-ID", "PD", "STag", "Type", "size", "Ref's", "State");
+
+	if (n > space) {
+		len += space;
+		goto out;
+	}
+	len += n;
+	space -= n;
+
+	list_for_each_safe(pos, tmp, &sdev->mr_list) {
+		struct siw_mr *mr = list_entry(pos, struct siw_mr, devq);
+
+		n = snprintf(kbuf + len, space,
+			     "%-15d%-18p0x%-8x%-8s0x%-20llx%-8d%-9s\n",
+			     OBJ_ID(&mr->mem), mr->pd, mr->mem.hdr.id,
+			     mr->mem_obj ? mr->mem.is_pbl ? "PBL":"UMEM":"KVA",
+			     mr->mem.len,
+			     refcount_read(&mr->mem.hdr.ref),
+			     mr->mem.stag_valid ? "valid" : "invalid");
+		if (n < space) {
+			len += n;
+			space -= n;
+		} else {
+			len += space;
+			break;
+		}
+	}
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+
+	return len;
+}
+
+static ssize_t siw_show_ceps(struct file *f, char __user *buf, size_t space,
+			     loff_t *ppos)
+{
+	struct siw_device *sdev = FDENTRY(f)->d_inode->i_private;
+	struct list_head *pos, *tmp;
+	char *kbuf = NULL;
+	int len = 0, n, num_cep;
+
+	if (*ppos)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	num_cep = atomic_read(&sdev->num_cep);
+	if (!num_cep)
+		goto out;
+
+	len = snprintf(kbuf, space, "%s: %d CEPs\n", sdev->base_dev.name,
+		       num_cep);
+	if (len > space) {
+		len = space;
+		goto out;
+	}
+	space -= len;
+
+	n = snprintf(kbuf + len, space,
+		     "%-20s%-6s%-6s%-9s%-5s%-3s%-4s%-21s%-9s\n",
+		     "CEP", "State", "Ref's", "QP-ID", "LQ", "LC", "U", "Sock",
+		     "CM-ID");
+
+	if (n > space) {
+		len += space;
+		goto out;
+	}
+	len += n;
+	space -= n;
+
+	list_for_each_safe(pos, tmp, &sdev->cep_list) {
+		struct siw_cep *cep = list_entry(pos, struct siw_cep, devq);
+
+		n = snprintf(kbuf + len, space,
+			     "0x%-18p%-6d%-6d%-9d%-5s%-3s%-4d0x%-18p 0x%-16p\n",
+			     cep, cep->state,
+			     refcount_read(&cep->ref),
+			     cep->qp ? QP_ID(cep->qp) : -1,
+			     list_empty(&cep->listenq) ? "n" : "y",
+			     cep->listen_cep ? "y" : "n",
+			     cep->in_use,
+			     cep->llp.sock,
+			     cep->cm_id);
+		if (n < space) {
+			len += n;
+			space -= n;
+		} else {
+			len += space;
+			break;
+		}
+	}
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+
+	return len;
+}
+
+static ssize_t siw_show_stats(struct file *f, char __user *buf, size_t space,
+			      loff_t *ppos)
+{
+	struct siw_device *sdev = FDENTRY(f)->d_inode->i_private;
+	char *kbuf = NULL;
+	int len = 0;
+
+	if (*ppos)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	len =  snprintf(kbuf, space, "Allocated SIW Objects:\n"
+		"Device %s (%s):\t"
+		"%s: %d, %s: %d, %s: %d, %s: %d, %s: %d, %s: %d, %s: %d\n",
+		sdev->base_dev.name,
+		sdev->netdev->flags & IFF_UP ? "IFF_UP" : "IFF_DOWN",
+		"CXs", atomic_read(&sdev->num_ctx),
+		"PDs", atomic_read(&sdev->num_pd),
+		"QPs", atomic_read(&sdev->num_qp),
+		"CQs", atomic_read(&sdev->num_cq),
+		"SRQs", atomic_read(&sdev->num_srq),
+		"MRs", atomic_read(&sdev->num_mr),
+		"CEPs", atomic_read(&sdev->num_cep));
+	if (len > space)
+		len = space;
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+	return len;
+}
+
+static const struct file_operations siw_qp_debug_fops = {
+	.owner	= THIS_MODULE,
+	.read	= siw_show_qps
+};
+
+static const struct file_operations siw_mr_debug_fops = {
+	.owner	= THIS_MODULE,
+	.read	= siw_show_mrs
+};
+
+static const struct file_operations siw_cep_debug_fops = {
+	.owner	= THIS_MODULE,
+	.read	= siw_show_ceps
+};
+
+static const struct file_operations siw_stats_debug_fops = {
+	.owner	= THIS_MODULE,
+	.read	= siw_show_stats
+};
+
+void siw_debugfs_add_device(struct siw_device *sdev)
+{
+	struct dentry	*entry;
+
+	if (!siw_debugfs)
+		return;
+
+	sdev->debugfs = debugfs_create_dir(sdev->base_dev.name, siw_debugfs);
+	if (sdev->debugfs) {
+		entry = debugfs_create_file("qp", 0400, sdev->debugfs,
+					    (void *)sdev, &siw_qp_debug_fops);
+		if (!entry)
+			siw_dbg(sdev, "could not create 'qp' entry\n");
+
+		entry = debugfs_create_file("cep", 0400, sdev->debugfs,
+					    (void *)sdev, &siw_cep_debug_fops);
+		if (!entry)
+			siw_dbg(sdev, "could not create 'cep' entry\n");
+
+		entry = debugfs_create_file("mr", 0400, sdev->debugfs,
+					    (void *)sdev, &siw_mr_debug_fops);
+		if (!entry)
+			siw_dbg(sdev, "could not create 'mr' entry\n");
+
+		entry = debugfs_create_file("stats", 0400, sdev->debugfs,
+					    (void *)sdev,
+					    &siw_stats_debug_fops);
+		if (!entry)
+			siw_dbg(sdev, "could not create 'stats' entry\n");
+	}
+}
+
+void siw_debugfs_del_device(struct siw_device *sdev)
+{
+	debugfs_remove_recursive(sdev->debugfs);
+	sdev->debugfs = NULL;
+}
+
+void siw_debug_init(void)
+{
+	siw_debugfs = debugfs_create_dir("siw", NULL);
+
+	if (!siw_debugfs || siw_debugfs == ERR_PTR(-ENODEV)) {
+		pr_warn("SIW: could not init debugfs\n");
+		siw_debugfs = NULL;
+	}
+}
+
+void siw_debugfs_delete(void)
+{
+	debugfs_remove_recursive(siw_debugfs);
+	siw_debugfs = NULL;
+}
+
+void siw_print_hdr(union iwarp_hdr *hdr, int qp_id, char *msg)
+{
+	enum rdma_opcode op = __rdmap_opcode(&hdr->ctrl);
+	u16 mpa_len = be16_to_cpu(hdr->ctrl.mpa_len);
+
+	switch (op) {
+
+	case RDMAP_RDMA_WRITE:
+		pr_info("siw: [QP %d]: %s(WRITE, DDP len %d): %08x %016llx\n",
+			qp_id, msg, ddp_data_len(op, mpa_len),
+			hdr->rwrite.sink_stag, hdr->rwrite.sink_to);
+		break;
+
+	case RDMAP_RDMA_READ_REQ:
+		pr_info("siw: [QP %d]: %s(RREQ, DDP len %d): %08x %08x %08x %08x %016llx %08x %08x %016llx\n",
+			qp_id, msg,
+			ddp_data_len(op, mpa_len),
+			be32_to_cpu(hdr->rreq.ddp_qn),
+			be32_to_cpu(hdr->rreq.ddp_msn),
+			be32_to_cpu(hdr->rreq.ddp_mo),
+			be32_to_cpu(hdr->rreq.sink_stag),
+			be64_to_cpu(hdr->rreq.sink_to),
+			be32_to_cpu(hdr->rreq.read_size),
+			be32_to_cpu(hdr->rreq.source_stag),
+			be64_to_cpu(hdr->rreq.source_to));
+
+		break;
+
+	case RDMAP_RDMA_READ_RESP:
+		pr_info("siw: [QP %d]: %s(RRESP, DDP len %d): %08x %016llx\n",
+			qp_id, msg, ddp_data_len(op, mpa_len),
+			be32_to_cpu(hdr->rresp.sink_stag),
+			be64_to_cpu(hdr->rresp.sink_to));
+		break;
+
+	case RDMAP_SEND:
+		pr_info("siw: [QP %d]: %s(SEND, DDP len %d): %08x %08x %08x\n",
+			qp_id, msg, ddp_data_len(op, mpa_len),
+			be32_to_cpu(hdr->send.ddp_qn),
+			be32_to_cpu(hdr->send.ddp_msn),
+			be32_to_cpu(hdr->send.ddp_mo));
+		break;
+
+	case RDMAP_SEND_INVAL:
+		pr_info("siw: [QP %d]: %s(S_INV, DDP len %d): %08x %08x %08x %08x\n",
+			qp_id, msg, ddp_data_len(op, mpa_len),
+			be32_to_cpu(hdr->send_inv.inval_stag),
+			be32_to_cpu(hdr->send_inv.ddp_qn),
+			be32_to_cpu(hdr->send_inv.ddp_msn),
+			be32_to_cpu(hdr->send_inv.ddp_mo));
+		break;
+
+	case RDMAP_SEND_SE:
+		pr_info("siw: [QP %d]: %s(S_SE, DDP len %d): %08x %08x %08x\n",
+			qp_id, msg, ddp_data_len(op, mpa_len),
+			be32_to_cpu(hdr->send.ddp_qn),
+			be32_to_cpu(hdr->send.ddp_msn),
+			be32_to_cpu(hdr->send.ddp_mo));
+		break;
+
+	case RDMAP_SEND_SE_INVAL:
+		pr_info("siw: [QP %d]: %s(S_SE_INV, DDP len %d): %08x %08x %08x %08x\n",
+			qp_id, msg, ddp_data_len(op, mpa_len),
+			be32_to_cpu(hdr->send_inv.inval_stag),
+			be32_to_cpu(hdr->send_inv.ddp_qn),
+			be32_to_cpu(hdr->send_inv.ddp_msn),
+			be32_to_cpu(hdr->send_inv.ddp_mo));
+		break;
+
+	case RDMAP_TERMINATE:
+		pr_info("siw: [QP %d]: %s(TERM, DDP len %d):\n", qp_id, msg,
+			ddp_data_len(op, mpa_len));
+		break;
+
+	default:
+		pr_info("siw: [QP %d]: %s (undefined opcode %d)\n", qp_id, msg,
+			op);
+		break;
+	}
+}
+
+char ib_qp_state_to_string[IB_QPS_ERR+1][sizeof "RESET"] = {
+	[IB_QPS_RESET]	= "RESET",
+	[IB_QPS_INIT]	= "INIT",
+	[IB_QPS_RTR]	= "RTR",
+	[IB_QPS_RTS]	= "RTS",
+	[IB_QPS_SQD]	= "SQD",
+	[IB_QPS_SQE]	= "SQE",
+	[IB_QPS_ERR]	= "ERR"
+};
diff --git a/drivers/infiniband/sw/siw/siw_debug.h b/drivers/infiniband/sw/siw/siw_debug.h
new file mode 100644
index 000000000000..b7804e826e67
--- /dev/null
+++ b/drivers/infiniband/sw/siw/siw_debug.h
@@ -0,0 +1,87 @@
+/*
+ * Software iWARP device driver
+ *
+ * Authors: Fredy Neeser <nfd-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *          Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
+ *
+ * Copyright (c) 2008-2017, IBM Corporation
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * BSD license below:
+ *
+ *   Redistribution and use in source and binary forms, with or
+ *   without modification, are permitted provided that the following
+ *   conditions are met:
+ *
+ *   - Redistributions of source code must retain the above copyright notice,
+ *     this list of conditions and the following disclaimer.
+ *
+ *   - Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in the
+ *     documentation and/or other materials provided with the distribution.
+ *
+ *   - Neither the name of IBM nor the names of its contributors may be
+ *     used to endorse or promote products derived from this software without
+ *     specific prior written permission.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _SIW_DEBUG_H
+#define _SIW_DEBUG_H
+
+#include <linux/uaccess.h>
+#include <linux/hardirq.h>	/* in_interrupt() */
+
+struct siw_device;
+struct siw_iwarp_rx;
+union iwarp_hdr;
+
+extern void siw_debug_init(void);
+extern void siw_debugfs_add_device(struct siw_device *dev);
+extern void siw_debugfs_del_device(struct siw_device *dev);
+extern void siw_debugfs_delete(void);
+
+extern void siw_print_hdr(union iwarp_hdr *hdr, int id, char *msg);
+extern void siw_print_qp_attr_mask(enum ib_qp_attr_mask mask, char *msg);
+extern char ib_qp_state_to_string[IB_QPS_ERR+1][sizeof "RESET"];
+
+#ifndef refcount_read
+#define refcount_read(x)	atomic_read(&(x)->refs)
+#endif
+
+#define siw_dbg(ddev, fmt, ...) \
+	dev_dbg(&(ddev)->base_dev.dev, "cpu%2d %s: " fmt, smp_processor_id(),\
+			__func__, ##__VA_ARGS__)
+
+#define siw_dbg_qp(qp, fmt, ...) \
+	siw_dbg(qp->hdr.sdev, "[QP %d]: " fmt, QP_ID(qp), ##__VA_ARGS__)
+
+#define siw_dbg_cep(cep, fmt, ...) \
+	siw_dbg(cep->sdev, "[CEP 0x%p]: " fmt, cep, ##__VA_ARGS__)
+
+#define siw_dbg_obj(obj, fmt, ...) \
+	siw_dbg(obj->hdr.sdev, "[OBJ ID %d]: " fmt, obj->hdr.id, ##__VA_ARGS__)
+
+#ifdef DEBUG
+
+#define siw_dprint_hdr(h, i, m)	siw_print_hdr(h, i, m)
+
+#else
+
+#define dprint(dbgcat, fmt, args...)	do { } while (0)
+#define siw_dprint_hdr(h, i, m)	do { } while (0)
+
+#endif
+
+#endif
-- 
2.13.6


* [PATCH v3 13/13] Add SoftiWarp to kernel build environment
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (11 preceding siblings ...)
  2018-01-14 22:36   ` [PATCH v3 12/13] SoftiWarp debugging code Bernard Metzler
@ 2018-01-14 22:36   ` Bernard Metzler
  2018-01-17 16:07   ` [PATCH v3 00/13] Request for Comments on SoftiWarp Steve Wise
  2018-01-23 16:31   ` Steve Wise
  14 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-14 22:36 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Bernard Metzler

Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
---
 drivers/infiniband/Kconfig         |  1 +
 drivers/infiniband/sw/Makefile     |  1 +
 drivers/infiniband/sw/siw/Kconfig  | 18 ++++++++++++++++++
 drivers/infiniband/sw/siw/Makefile | 15 +++++++++++++++
 4 files changed, 35 insertions(+)
 create mode 100644 drivers/infiniband/sw/siw/Kconfig
 create mode 100644 drivers/infiniband/sw/siw/Makefile

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index cbf186522016..58e9c7e1b9d7 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -97,6 +97,7 @@ source "drivers/infiniband/ulp/isert/Kconfig"
 source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
+source "drivers/infiniband/sw/siw/Kconfig"
 
 source "drivers/infiniband/hw/hfi1/Kconfig"
 
diff --git a/drivers/infiniband/sw/Makefile b/drivers/infiniband/sw/Makefile
index 8b095b27db87..d37610fcbbc7 100644
--- a/drivers/infiniband/sw/Makefile
+++ b/drivers/infiniband/sw/Makefile
@@ -1,2 +1,3 @@
 obj-$(CONFIG_INFINIBAND_RDMAVT)		+= rdmavt/
 obj-$(CONFIG_RDMA_RXE)			+= rxe/
+obj-$(CONFIG_RDMA_SIW)			+= siw/
diff --git a/drivers/infiniband/sw/siw/Kconfig b/drivers/infiniband/sw/siw/Kconfig
new file mode 100644
index 000000000000..482a0e992bc8
--- /dev/null
+++ b/drivers/infiniband/sw/siw/Kconfig
@@ -0,0 +1,18 @@
+config RDMA_SIW
+	tristate "Software RDMA over TCP/IP (iWARP) driver"
+	depends on INET && INFINIBAND
+	depends on CRYPTO_CRC32
+	---help---
+	This driver implements the iWARP RDMA transport over
+	the Linux TCP/IP network stack. It enables a system with a
+	standard Ethernet adapter to interoperate with an iWARP
+	adapter or with another system running the SIW driver.
+	(See also RXE which is a similar software driver for RoCE.)
+
+	The driver interfaces with the Linux RDMA stack and
+	implements both a kernel and user space RDMA verbs API.
+	The user space verbs API requires a support
+	library named libsiw which is loaded by the generic user
+	space verbs API, libibverbs. To implement RDMA over
+	TCP/IP, the driver further interfaces with the Linux
+	in-kernel TCP socket layer.
diff --git a/drivers/infiniband/sw/siw/Makefile b/drivers/infiniband/sw/siw/Makefile
new file mode 100644
index 000000000000..20f31c9e827b
--- /dev/null
+++ b/drivers/infiniband/sw/siw/Makefile
@@ -0,0 +1,15 @@
+obj-$(CONFIG_RDMA_SIW) += siw.o
+
+siw-y := \
+	siw_main.o \
+	siw_cm.o \
+	siw_verbs.o \
+	siw_obj.o \
+	siw_qp.o \
+	siw_qp_tx.o \
+	siw_qp_rx.o \
+	siw_cq.o \
+	siw_cm.o \
+	siw_debug.o \
+	siw_ae.o \
+	siw_mem.o
-- 
2.13.6


* RE: [PATCH v3 00/13] Request for Comments on SoftiWarp
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (12 preceding siblings ...)
  2018-01-14 22:36   ` [PATCH v3 13/13] Add SoftiWarp to kernel build environment Bernard Metzler
@ 2018-01-17 16:07   ` Steve Wise
  2018-01-18  7:29     ` Leon Romanovsky
       [not found]     ` <20180118072958.GW13639-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  2018-01-23 16:31   ` Steve Wise
  14 siblings, 2 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-17 16:07 UTC (permalink / raw)
  To: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hey Bernard,

> 
> This patch set contributes a new version of the SoftiWarp
> driver, as originally introduced Oct 6th, 2017.
> 


Two things I think will help with review:

1) provide a github linux kernel repo with this series added.  That allows
reviewers to easily use both patches and a full kernel tree for reviewing.

2) provide the userlib patches (and a github repo) that works with the
latest upstream rdma-core.  That will allow folks to test user-space rdma as
part of their review.

I'll review the individual patches in detail asap.

Thanks!

Steve.



* Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
  2018-01-17 16:07   ` [PATCH v3 00/13] Request for Comments on SoftiWarp Steve Wise
@ 2018-01-18  7:29     ` Leon Romanovsky
       [not found]     ` <20180118072958.GW13639-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  1 sibling, 0 replies; 44+ messages in thread
From: Leon Romanovsky @ 2018-01-18  7:29 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA


On Wed, Jan 17, 2018 at 10:07:22AM -0600, Steve Wise wrote:
> Hey Bernard,
>
> >
> > This patch set contributes a new version of the SoftiWarp
> > driver, as originally introduced Oct 6th, 2017.
> >
>
>
> Two things I think will help with review:
>
> 1) provide a github linux kernel repo with this series added.  That allows
> reviewers to easily use both patches and a full kernel tree for reviewing.
>
> 2) provide the userlib patches (and a github repo) that works with the
> latest upstream rdma-core.  That will allow folks to test use-side rdma as
> part of their review.

And please don't call it "PATCH" in subject, but "RFC".
The "PATCH" is misleading.

>
> I'll review the individual patches in detail asap.

I would like to review too to learn iWARP a little bit closer, but most
probably won't have time to do it till -rc1.

>
> Thanks!
>
> Steve.
>
>

* Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
       [not found]     ` <20180118072958.GW13639-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2018-01-18 17:03       ` Bernard Metzler
       [not found]         ` <OFC86A1EC5.60F00E9A-ON00258219.005DA6A4-00258219.005DBBB9-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
  0 siblings, 1 reply; 44+ messages in thread
From: Bernard Metzler @ 2018-01-18 17:03 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA


-----Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: -----

>To: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>From: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>Date: 01/18/2018 08:30AM
>Cc: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
>linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>Subject: Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
>
>On Wed, Jan 17, 2018 at 10:07:22AM -0600, Steve Wise wrote:
>> Hey Bernard,
>>
>> >
>> > This patch set contributes a new version of the SoftiWarp
>> > driver, as originally introduced Oct 6th, 2017.
>> >
>>
>>
>> Two things I think will help with review:
>>
>> 1) provide a github linux kernel repo with this series added.  That
>allows
>> reviewers to easily use both patches and a full kernel tree for
>reviewing.
>>
>> 2) provide the userlib patches (and a github repo) that works with
>the
>> latest upstream rdma-core.  That will allow folks to test user-space
>rdma as
>> part of their review.
>
>And please don't call it "PATCH" in subject, but "RFC".
>The "PATCH" is misleading.
>

Makes sense. I will make sure I remember.

>>
>> I'll review the individual patches in detail asap.
>
>I would like to review too to learn iWARP a little bit closer, but
>most
>probably won't have time to do it till -rc1.
>

Sure!

Let me set up a github thing which mirrors the most recent
kernel sources + the siw RFC ;) attached. Shall I just take
https://github.com/torvalds/linux ?

Thank you!
Bernard.


* RE: [PATCH v3 00/13] Request for Comments on SoftiWarp
       [not found]         ` <OFC86A1EC5.60F00E9A-ON00258219.005DA6A4-00258219.005DBBB9-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
@ 2018-01-18 21:52           ` Steve Wise
  0 siblings, 0 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-18 21:52 UTC (permalink / raw)
  To: 'Bernard Metzler', 'Leon Romanovsky'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

> >I would like to review too to learn iWARP a little bit closer, but
> >most
> >probably won't have time to do it till -rc1.
> >
> 
> Sure!
> 
> Let me set up a github thing which mirrors the most recent
> kernel sources + the siw RFC ;) attached. Shall I just take
> https://github.com/torvalds/linux ?

I think you could use this.  It seems to be up to date with his kernel.org git repo.  And it will be easier to fork from the github.com repo...




* RE: [PATCH v3 00/13] Request for Comments on SoftiWarp
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
                     ` (13 preceding siblings ...)
  2018-01-17 16:07   ` [PATCH v3 00/13] Request for Comments on SoftiWarp Steve Wise
@ 2018-01-23 16:31   ` Steve Wise
  2018-01-23 16:44     ` Jason Gunthorpe
  14 siblings, 1 reply; 44+ messages in thread
From: Steve Wise @ 2018-01-23 16:31 UTC (permalink / raw)
  To: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> This patch set contributes a new version of the SoftiWarp
> driver, as originally introduced Oct 6th, 2017.
> 
> Thank you all for the commments on my last submissiom. It
> was vey helpful and encouraging. I hereby re-send a SoftiWarp
> patch after taking all comments into account, I hope. Without
> introducing SoftiWarp (siw) again, let me fast forward
> to the changes made to the last submission:
> 
> 
> 1. Removed all kernel module parameters, as requested.
>    I am not completely happy yet with what came out of
>    that, since parameters like the name of interfaces
>    siw shall attach to,

Can configfs be used for this? I don't remember if that was discussed during
v2 review?  OR, the new rdma command that is part of iproute2 can be
extended.  It would use netlink messages to do the configuration.  Perhaps
Leon can comment on that when he gets time.  It could be extended to support
both rxe and siw.
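
A minimal sketch of what the driver side of such a hook could look like,
independent of whether configfs or a netlink-driven rdma command ends up
calling it.  siw_dev_from_netdev() exists in the posted series;
siw_newlink() and siw_device_create() are made-up names for illustration:

#include <linux/netdevice.h>
#include <linux/if_arp.h>
#include <net/net_namespace.h>

static int siw_newlink(const char *ifname)
{
	struct net_device *netdev;
	int rv;

	netdev = dev_get_by_name(&init_net, ifname);
	if (!netdev)
		return -ENODEV;

	if (netdev->type != ARPHRD_ETHER && netdev->type != ARPHRD_LOOPBACK) {
		rv = -EINVAL;
		goto err_out;
	}
	if (siw_dev_from_netdev(netdev)) {
		rv = -EEXIST;		/* already attached */
		goto err_out;
	}
	/* assumed allocator/registration helper; it would keep the
	 * netdev reference taken above until device removal
	 */
	rv = siw_device_create(netdev);
	if (rv)
		goto err_out;

	return 0;

err_out:
	dev_put(netdev);
	return rv;
}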

> or if we want to use TCP_NODELAY
>    for low ping-pong delay, or disable it for higher
>    message rates for multiple outstanding operations,
>    were quite handy. I understand I have to find a better
>    solution for that. Ideally, we would be able to
>    distinguish between global siw driver control (like
>    interfaces it attaches to), and per-QP parameters
>    (somehow similar to socket options) which allow for
>    example to tune delay, setting CRC on/off, or control
>    GSO usage.
> 

Per connection settings could be done by extending the rdma_cm API.  See
rdma_set_option() in librdmacm from rdma-core, and kernel rdmacm API
functions like rdma_set_service_type() and rdma_set_reuseaddr().  We really
need the TCP options to be tweakable for iwarp cm_ids...
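
For the TCP side, the provider end of such a per-connection option is
small.  A sketch, assuming the rdma_cm/verbs plumbing to reach it gets
sorted out: struct siw_qp and attrs.llp_stream_handle are from the posted
driver, siw_qp_set_nodelay() itself is hypothetical.

#include <linux/net.h>
#include <net/tcp.h>

static int siw_qp_set_nodelay(struct siw_qp *qp, bool on)
{
	struct socket *s = qp->attrs.llp_stream_handle;
	int val = on ? 1 : 0;

	if (!s)
		return -ENOTCONN;

	return kernel_setsockopt(s, SOL_TCP, TCP_NODELAY,
				 (char *)&val, sizeof(val));
}

The same shape would work for TCP_QUICKACK.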


Steve.


* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]     ` <20180114223603.19961-4-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
@ 2018-01-23 16:33       ` Steve Wise
  2018-01-23 16:43         ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Steve Wise @ 2018-01-23 16:33 UTC (permalink / raw)
  To: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Jason Gunthorpe

> Subject: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA
> subsystem
> 
> Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> ---
>  drivers/infiniband/sw/siw/siw_main.c | 816
> +++++++++++++++++++++++++++++++++++
>  1 file changed, 816 insertions(+)
>  create mode 100644 drivers/infiniband/sw/siw/siw_main.c
> 
> diff --git a/drivers/infiniband/sw/siw/siw_main.c
> b/drivers/infiniband/sw/siw/siw_main.c
> new file mode 100644
> index 000000000000..1b7fc58d4eb9
> --- /dev/null
> +++ b/drivers/infiniband/sw/siw/siw_main.c
> @@ -0,0 +1,816 @@
> +/*
> + * Software iWARP device driver
> + *
> + * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> + *
> + * Copyright (c) 2008-2017, IBM Corporation
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * BSD license below:
> + *
> + *   Redistribution and use in source and binary forms, with or
> + *   without modification, are permitted provided that the following
> + *   conditions are met:
> + *
> + *   - Redistributions of source code must retain the above copyright
notice,
> + *     this list of conditions and the following disclaimer.
> + *
> + *   - Redistributions in binary form must reproduce the above copyright
> + *     notice, this list of conditions and the following disclaimer in
the
> + *     documentation and/or other materials provided with the
distribution.
> + *
> + *   - Neither the name of IBM nor the names of its contributors may be
> + *     used to endorse or promote products derived from this software
without
> + *     specific prior written permission.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
> OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
> HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
> IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/errno.h>
> +#include <linux/netdevice.h>
> +#include <linux/inetdevice.h>
> +#include <net/net_namespace.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if_arp.h>
> +#include <linux/list.h>
> +#include <linux/kernel.h>
> +#include <linux/dma-mapping.h>
> +
> +#include <rdma/ib_verbs.h>
> +#include <rdma/ib_smi.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +#include "siw.h"
> +#include "siw_obj.h"
> +#include "siw_cm.h"
> +#include "siw_verbs.h"
> +#include <linux/kthread.h>
> +
> +MODULE_AUTHOR("Bernard Metzler");
> +MODULE_DESCRIPTION("Software iWARP Driver");
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION("0.2");
> +
> +/* transmit from user buffer, if possible */
> +const bool zcopy_tx;
> +
> +/* Restrict usage of GSO, if hardware peer iwarp is unable to process
> + * large packets. gso_seg_limit = 1 lets siw send only packets up to
> + * one real MTU in size, but severely limits maximum bandwidth.
> + * gso_seg_limit = 0 makes use of GSO (and more than doubles throughput
> + * for large transfers).
> + */
> +const int gso_seg_limit;
> +

The GSO configuration needs to default to enable interoperation with all
vendors (and comply with the RFCs).  So make it 1 please.

Jason, would configfs be a reasonable way to allow tweaking these globals?

> +/* Attach siw also with loopback devices */
> +const bool loopback_enabled = true;
> +

I think I asked this before.  Why have a knob to enable/disable loopback?

> +/* We try to negotiate CRC on, if true */
> +const bool mpa_crc_required;
> +
> +/* MPA CRC on/off enforced */
> +const bool mpa_crc_strict;
> +
> +/* Set TCP_NODELAY, and push messages asap */
> +const bool siw_lowdelay = true;
> +/* Set TCP_QUICKACK */
> +const bool tcp_quickack;
> +
> +/* Select MPA version to be used during connection setup */
> +u_char mpa_version = MPA_REVISION_2;
> +
> +/* Selects MPA P2P mode (additional handshake during connection
> + * setup, if true
> + */
> +const bool peer_to_peer;
> +
> +static LIST_HEAD(siw_devlist);
> +
> +struct task_struct *siw_tx_thread[NR_CPUS];
> +struct crypto_shash *siw_crypto_shash;
> +
> +static ssize_t show_sw_version(struct device *dev,
> +			       struct device_attribute *attr, char *buf)
> +{
> +	struct siw_device *sdev = container_of(dev, struct siw_device,
> +					       base_dev.dev);
> +
> +	return sprintf(buf, "%x\n", sdev->attrs.version);
> +}
> +
> +static DEVICE_ATTR(sw_version, 0444, show_sw_version, NULL);
> +
> +static struct device_attribute *siw_dev_attributes[] = {
> +	&dev_attr_sw_version
> +};
> +
> +static int siw_modify_port(struct ib_device *base_dev, u8 port, int mask,
> +			   struct ib_port_modify *props)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int siw_device_register(struct siw_device *sdev)
> +{
> +	struct ib_device *base_dev = &sdev->base_dev;
> +	int rv, i;
> +	static int dev_id = 1;
> +
> +	rv = ib_register_device(base_dev, NULL);
> +	if (rv) {
> +		pr_warn("siw: %s: registration error %d\n",
> +			base_dev->name, rv);
> +		return rv;
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(siw_dev_attributes); ++i) {
> +		rv = device_create_file(&base_dev->dev,
siw_dev_attributes[i]);
> +		if (rv) {
> +			pr_warn("siw: %s: create file error: rv=%d\n",
> +				base_dev->name, rv);
> +			ib_unregister_device(base_dev);
> +			return rv;
> +		}
> +	}
> +	siw_debugfs_add_device(sdev);
> +
> +	sdev->attrs.vendor_part_id = dev_id++;
> +
> +	siw_dbg(sdev, "HWaddr=%02x.%02x.%02x.%02x.%02x.%02x\n",
> +		*(u8 *)sdev->netdev->dev_addr,
> +		*((u8 *)sdev->netdev->dev_addr + 1),
> +		*((u8 *)sdev->netdev->dev_addr + 2),
> +		*((u8 *)sdev->netdev->dev_addr + 3),
> +		*((u8 *)sdev->netdev->dev_addr + 4),
> +		*((u8 *)sdev->netdev->dev_addr + 5));
> +
> +	sdev->is_registered = 1;
> +
> +	return 0;
> +}
> +
> +static void siw_device_deregister(struct siw_device *sdev)
> +{
> +	int i;
> +
> +	siw_debugfs_del_device(sdev);
> +
> +	if (sdev->is_registered) {
> +
> +		siw_dbg(sdev, "deregister\n");
> +
> +		for (i = 0; i < ARRAY_SIZE(siw_dev_attributes); ++i)
> +			device_remove_file(&sdev->base_dev.dev,
> +					   siw_dev_attributes[i]);
> +
> +		ib_unregister_device(&sdev->base_dev);
> +	}
> +	if (atomic_read(&sdev->num_ctx) || atomic_read(&sdev->num_srq) ||
> +	    atomic_read(&sdev->num_mr) || atomic_read(&sdev->num_cep) ||
> +	    atomic_read(&sdev->num_qp) || atomic_read(&sdev->num_cq) ||
> +	    atomic_read(&sdev->num_pd)) {
> +		pr_warn("siw at %s: orphaned resources!\n",
> +			sdev->netdev->name);
> +		pr_warn("           CTX %d, SRQ %d, QP %d, CQ %d, MEM %d,
CEP
> %d, PD %d\n",
> +			atomic_read(&sdev->num_ctx),
> +			atomic_read(&sdev->num_srq),
> +			atomic_read(&sdev->num_qp),
> +			atomic_read(&sdev->num_cq),
> +			atomic_read(&sdev->num_mr),
> +			atomic_read(&sdev->num_cep),
> +			atomic_read(&sdev->num_pd));
> +	}
> +
> +	while (!list_empty(&sdev->cep_list)) {
> +		struct siw_cep *cep = list_entry(sdev->cep_list.next,
> +						 struct siw_cep, devq);
> +		list_del(&cep->devq);
> +		pr_warn("siw: at %s: free orphaned CEP 0x%p, state %d\n",
> +			sdev->base_dev.name, cep, cep->state);
> +		kfree(cep);
> +	}
> +	sdev->is_registered = 0;
> +}
> +
> +static void siw_device_destroy(struct siw_device *sdev)
> +{
> +	siw_dbg(sdev, "destroy device\n");
> +	siw_idr_release(sdev);
> +
> +	kfree(sdev->base_dev.iwcm);
> +	dev_put(sdev->netdev);
> +
> +	ib_dealloc_device(&sdev->base_dev);
> +}
> +
> +static struct siw_device *siw_dev_from_netdev(struct net_device *dev)
> +{
> +	if (!list_empty(&siw_devlist)) {
> +		struct list_head *pos;
> +
> +		list_for_each(pos, &siw_devlist) {
> +			struct siw_device *sdev =
> +				list_entry(pos, struct siw_device, list);
> +			if (sdev->netdev == dev)
> +				return sdev;
> +		}
> +	}
> +	return NULL;
> +}
> +
> +static int siw_create_tx_threads(void)
> +{
> +	int cpu, rv, assigned = 0;
> +
> +	for_each_online_cpu(cpu) {
> +		/* Skip HT cores */
> +		if (cpu % cpumask_weight(topology_sibling_cpumask(cpu))) {
> +			siw_tx_thread[cpu] = NULL;
> +			continue;
> +		}
> +		siw_tx_thread[cpu] = kthread_create(siw_run_sq,
> +						   (unsigned long
*)(long)cpu,
> +						   "siw_tx/%d", cpu);
> +		if (IS_ERR(siw_tx_thread[cpu])) {
> +			rv = PTR_ERR(siw_tx_thread[cpu]);
> +			siw_tx_thread[cpu] = NULL;
> +			pr_info("Creating TX thread for CPU %d failed",
cpu);
> +			continue;
> +		}
> +		kthread_bind(siw_tx_thread[cpu], cpu);
> +
> +		wake_up_process(siw_tx_thread[cpu]);
> +		assigned++;
> +	}
> +	return assigned;
> +}
> +

I know in v2 review, you discussed the TX threads.  And you mentioned you
had tried workq threads [1], but the introduced lots of delay.  Have you
re-looked at the workq implementation?  If your analysis is several years
old, workq threads might provide what you need nowadays...

[1] https://www.spinics.net/lists/linux-rdma/msg55646.html
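
For reference, a sketch of what the workqueue variant might look like
nowadays.  struct siw_qp and siw_qp_put() appear in the posted code,
siw_qp_get() is assumed to be their refcount counterpart, and the work
item plus handler are made up:

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/smp.h>

static struct workqueue_struct *siw_tx_wq;

struct siw_tx_work {
	struct work_struct	work;
	struct siw_qp		*qp;
};

static void siw_sq_work_handler(struct work_struct *work)
{
	struct siw_tx_work *tx = container_of(work, struct siw_tx_work, work);

	/* SQ processing for tx->qp would go here, doing what one
	 * iteration of siw_run_sq() does for a queued QP today.
	 */
	siw_qp_put(tx->qp);
	kfree(tx);
}

static int siw_sq_queue_work(struct siw_qp *qp)
{
	struct siw_tx_work *tx = kzalloc(sizeof(*tx), GFP_KERNEL);

	if (!tx)
		return -ENOMEM;

	siw_qp_get(qp);
	tx->qp = qp;
	INIT_WORK(&tx->work, siw_sq_work_handler);

	/* stay CPU/NUMA local, similar to the current kthread binding */
	queue_work_on(raw_smp_processor_id(), siw_tx_wq, &tx->work);
	return 0;
}

static int siw_create_tx_wq(void)
{
	siw_tx_wq = alloc_workqueue("siw_tx", WQ_HIGHPRI | WQ_CPU_INTENSIVE, 0);

	return siw_tx_wq ? 0 : -ENOMEM;
}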

 
Steve.


* Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
  2018-01-23 16:33       ` Steve Wise
@ 2018-01-23 16:43         ` Jason Gunthorpe
       [not found]           ` <20180123164316.GC30670-uk2M96/98Pc@public.gmane.org>
  2018-01-23 17:23           ` Bernard Metzler
  0 siblings, 2 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2018-01-23 16:43 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Jan 23, 2018 at 10:33:48AM -0600, Steve Wise wrote:

> > +/* Restrict usage of GSO, if hardware peer iwarp is unable to process
> > + * large packets. gso_seg_limit = 1 lets siw send only packets up to
> > + * one real MTU in size, but severely limits maximum bandwidth.
> > + * gso_seg_limit = 0 makes use of GSO (and more than doubles throughput
> > + * for large transfers).
> > + */
> > +const int gso_seg_limit;
> > +
> 
> The GSO configuration needs to default to enable interoperation with all
> vendors (and comply with the RFCs).  So make it 1 please.

The thing we call GSO in the netstack should be totally transparent on
the wire, so at the very least this needs some additional elaboration.

> Jason, would configfs be a reasonable way to allow tweaking these globals?

Why would we ever even bother to support a mode that is non-conformant
on the wire? Just remove it..

Jason

* Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
  2018-01-23 16:31   ` Steve Wise
@ 2018-01-23 16:44     ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2018-01-23 16:44 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Jan 23, 2018 at 10:31:24AM -0600, Steve Wise wrote:
> > 
> > This patch set contributes a new version of the SoftiWarp
> > driver, as originally introduced Oct 6th, 2017.
> > 
> > Thank you all for the commments on my last submissiom. It
> > was vey helpful and encouraging. I hereby re-send a SoftiWarp
> > patch after taking all comments into account, I hope. Without
> > introducing SoftiWarp (siw) again, let me fast forward
> > to the changes made to the last submission:
> > 
> > 
> > 1. Removed all kernel module parameters, as requested.
> >    I am not completely happy yet with what came out of
> >    that, since parameters like the name of interfaces
> >    siw shall attach to,
> 
> Can configfs be used for this? I don't remember if that was discussed during
> v2 review?  OR, the new rdma command that is part of iproute2 can be
> extended.  It would use netlink messages to do the configuration.  Perhaps
> Leon can comment on that when he gets time.  It could be extended to support
> both rxe and siw.

rdmatool command over rnetlink, supporting both rxe and siw.

The child interface provided by iproute is not really suitable..

Jason

* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]           ` <20180123164316.GC30670-uk2M96/98Pc@public.gmane.org>
@ 2018-01-23 16:58             ` Steve Wise
  2018-01-23 17:05               ` Jason Gunthorpe
  2018-01-23 17:21             ` Bernard Metzler
  1 sibling, 1 reply; 44+ messages in thread
From: Steve Wise @ 2018-01-23 16:58 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> On Tue, Jan 23, 2018 at 10:33:48AM -0600, Steve Wise wrote:
> 
> > > +/* Restrict usage of GSO, if hardware peer iwarp is unable to process
> > > + * large packets. gso_seg_limit = 1 lets siw send only packets up to
> > > + * one real MTU in size, but severely limits maximum bandwidth.
> > > + * gso_seg_limit = 0 makes use of GSO (and more than doubles
throughput
> > > + * for large transfers).
> > > + */
> > > +const int gso_seg_limit;
> > > +
> >
> > The GSO configuration needs to default to enable interoperation with all
> > vendors (and comply with the RFCs).  So make it 1 please.
> 
> The thing we call GSO in the netstack should be totally transparent on
> the wire, so at the very least this needs some additional elaboration.
> 

From: https://tools.ietf.org/html/rfc5041#section-5.2

"At the Data Source, the DDP layer MUST segment the data contained in
   a ULP message into a series of DDP Segments, where each DDP Segment
   contains a DDP Header and ULP Payload, and MUST be no larger than the
   MULPDU value Advertised by the LLP."

Where MULPDU is the maximum ULP PDU that will fit in the TCP MSS...
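
As a rough illustration of the cap under discussion (the 2-byte MPA
length field and 4-byte CRC are per RFC 5044, markers and padding
ignored; this is not how the posted code computes it):

#include <linux/types.h>

static u32 siw_mulpdu(u32 tcp_mss)
{
	/* e.g. MSS 1460 -> at most 1454 bytes of DDP segment
	 * (DDP/RDMAP headers + payload) per FPDU, if every FPDU
	 * must fit into a single TCP segment; with gso_seg_limit = 0
	 * an FPDU may instead grow towards the 64KB GSO size.
	 */
	return tcp_mss - 2 /* MPA length field */ - 4 /* MPA CRC */;
}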

> > Jason, would configfs be a reasonable way to allow tweaking these
globals?
> 
> Why would we ever even bother to support a mode that is non-conformant
> on the wire? Just remove it..

For soft iwarp, the throughput is greatly increased with allowing these
iWARP protocol PDUs to span many TCP segments.  So there could be an
argument for leaving it as a knob for soft iwarp <-> soft iwarp
configurations.  But IMO it needs to default to RFC compliance and
interoperability with hw implementations.

Steve.



* Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
  2018-01-23 16:58             ` Steve Wise
@ 2018-01-23 17:05               ` Jason Gunthorpe
       [not found]                 ` <20180123170517.GE30670-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2018-01-23 17:05 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Jan 23, 2018 at 10:58:01AM -0600, Steve Wise wrote:

> From: https://tools.ietf.org/html/rfc5041#section-5.2
> 
> "At the Data Source, the DDP layer MUST segment the data contained in
>    a ULP message into a series of DDP Segments, where each DDP Segment
>    contains a DDP Header and ULP Payload, and MUST be no larger than the
>    MULPDU value Advertised by the LLP."
> 
> Where MULPDU is the maximum ULP PDU that will fit in the TCP MSS...

But exceeding the MULPDU has nothing to do with the netstack GSO
function.. right? GSO is entirely a local node optimization that
should not be detectable on the wire.

> > > Jason, would configfs be a reasonable way to allow tweaking these
> globals?
> > 
> > Why would we ever even bother to support a mode that is non-conformant
> > on the wire? Just remove it..
> 
> For soft iwarp, the throughput is greatly increased with allowing these
> iWARP protocol PDUs to span many TCP segments.  So there could be an
> argument for leaving it as a knob for soft iwarp <-> soft iwarp
> configurations.  But IMO it needs to default to RFC compliance and
> interoperability with hw implementations.

IMHO the purpose of things like rxe and siw is not maximum raw
performance (certainly not during the initial kernel acceptance phase), so
it should not include any non-conformant cruft..

Jason

* RE: [PATCH v3 05/13] SoftiWarp application interface
       [not found]     ` <20180114223603.19961-6-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
@ 2018-01-23 17:18       ` Steve Wise
  0 siblings, 0 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-23 17:18 UTC (permalink / raw)
  To: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> ---
>  drivers/infiniband/sw/siw/siw_ae.c    |  120 +++
>  drivers/infiniband/sw/siw/siw_verbs.c | 1876
> +++++++++++++++++++++++++++++++++
>  drivers/infiniband/sw/siw/siw_verbs.h |  119 +++
>  include/uapi/rdma/siw_user.h          |  216 ++++
>  4 files changed, 2331 insertions(+)
>  create mode 100644 drivers/infiniband/sw/siw/siw_ae.c
>  create mode 100644 drivers/infiniband/sw/siw/siw_verbs.c
>  create mode 100644 drivers/infiniband/sw/siw/siw_verbs.h
>  create mode 100644 include/uapi/rdma/siw_user.h
> 
> diff --git a/drivers/infiniband/sw/siw/siw_ae.c
> b/drivers/infiniband/sw/siw/siw_ae.c
> new file mode 100644
> index 000000000000..eac55ce7f56d
> --- /dev/null
> +++ b/drivers/infiniband/sw/siw/siw_ae.c
> @@ -0,0 +1,120 @@
> +/*
> + * Software iWARP device driver
> + *
> + * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> + *
> + * Copyright (c) 2008-2017, IBM Corporation
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * BSD license below:
> + *
> + *   Redistribution and use in source and binary forms, with or
> + *   without modification, are permitted provided that the following
> + *   conditions are met:
> + *
> + *   - Redistributions of source code must retain the above copyright
notice,
> + *     this list of conditions and the following disclaimer.
> + *
> + *   - Redistributions in binary form must reproduce the above copyright
> + *     notice, this list of conditions and the following disclaimer in
the
> + *     documentation and/or other materials provided with the
distribution.
> + *
> + *   - Neither the name of IBM nor the names of its contributors may be
> + *     used to endorse or promote products derived from this software
without
> + *     specific prior written permission.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
> OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
> HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
> IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/net.h>
> +#include <linux/scatterlist.h>
> +#include <linux/highmem.h>
> +#include <net/sock.h>
> +#include <net/tcp_states.h>
> +#include <net/tcp.h>
> +
> +#include <rdma/iw_cm.h>
> +#include <rdma/ib_verbs.h>
> +#include <rdma/ib_smi.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +#include "siw.h"
> +#include "siw_obj.h"
> +#include "siw_cm.h"
> +
> +void siw_qp_event(struct siw_qp *qp, enum ib_event_type etype)
> +{
> +	struct ib_event event;
> +	struct ib_qp	*base_qp = &qp->base_qp;
> +
> +	/*
> +	 * Do not report asynchronous errors on QP which gets
> +	 * destroyed via verbs interface (siw_destroy_qp())
> +	 */
> +	if (qp->attrs.flags & SIW_QP_IN_DESTROY)
> +		return;
> +
> +	event.event = etype;
> +	event.device = base_qp->device;
> +	event.element.qp = base_qp;
> +
> +	if (base_qp->event_handler) {
> +		siw_dbg_qp(qp, "reporting event %d\n", etype);
> +		(*base_qp->event_handler)(&event, base_qp->qp_context);
> +	}
> +}
> +
> +void siw_cq_event(struct siw_cq *cq, enum ib_event_type etype)
> +{
> +	struct ib_event event;
> +	struct ib_cq	*base_cq = &cq->base_cq;
> +
> +	event.event = etype;
> +	event.device = base_cq->device;
> +	event.element.cq = base_cq;
> +
> +	if (base_cq->event_handler) {
> +		siw_dbg(cq->hdr.sdev, "reporting CQ event %d\n", etype);
> +		(*base_cq->event_handler)(&event, base_cq->cq_context);
> +	}
> +}
> +

The rdma provider must ensure that event upcalls are serialized per object.
Is this being done somewhere?  See "Callbacks" in
Documentation/infiniband/core_locking.txt.
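
One possible way to give that guarantee, if it does not already hold by
construction, sketched for the QP case; the event_lock member is
hypothetical and not part of the posted struct siw_qp:

#include <linux/spinlock.h>

static void siw_qp_event_serialized(struct siw_qp *qp, struct ib_event *event)
{
	struct ib_qp *base_qp = &qp->base_qp;
	unsigned long flags;

	/* assumed: spinlock_t event_lock added to struct siw_qp,
	 * taken around the consumer upcall so events for one QP
	 * are never delivered concurrently
	 */
	spin_lock_irqsave(&qp->event_lock, flags);
	if (base_qp->event_handler)
		(*base_qp->event_handler)(event, base_qp->qp_context);
	spin_unlock_irqrestore(&qp->event_lock, flags);
}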



> +void siw_srq_event(struct siw_srq *srq, enum ib_event_type etype)
> +{
> +	struct ib_event event;
> +	struct ib_srq	*base_srq = &srq->base_srq;
> +
> +	event.event = etype;
> +	event.device = base_srq->device;
> +	event.element.srq = base_srq;
> +
> +	if (base_srq->event_handler) {
> +		siw_dbg(srq->pd->hdr.sdev, "reporting SRQ event %d\n",
> etype);
> +		(*base_srq->event_handler)(&event, base_srq->srq_context);
> +	}
> +}
> +
> +void siw_port_event(struct siw_device *sdev, u8 port, enum ib_event_type
> etype)
> +{
> +	struct ib_event event;
> +
> +	event.event = etype;
> +	event.device = &sdev->base_dev;
> +	event.element.port_num = port;
> +
> +	siw_dbg(sdev, "reporting port event %d\n", etype);
> +
> +	ib_dispatch_event(&event);
> +}
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c
> b/drivers/infiniband/sw/siw/siw_verbs.c
> new file mode 100644
> index 000000000000..667b004e5c6d
> --- /dev/null
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -0,0 +1,1876 @@
> +/*
> + * Software iWARP device driver
> + *
> + * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> + *
> + * Copyright (c) 2008-2017, IBM Corporation
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * BSD license below:
> + *
> + *   Redistribution and use in source and binary forms, with or
> + *   without modification, are permitted provided that the following
> + *   conditions are met:
> + *
> + *   - Redistributions of source code must retain the above copyright
notice,
> + *     this list of conditions and the following disclaimer.
> + *
> + *   - Redistributions in binary form must reproduce the above copyright
> + *     notice, this list of conditions and the following disclaimer in
the
> + *     documentation and/or other materials provided with the
distribution.
> + *
> + *   - Neither the name of IBM nor the names of its contributors may be
> + *     used to endorse or promote products derived from this software
without
> + *     specific prior written permission.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
> OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
> HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
> IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vmalloc.h>
> +
> +#include <rdma/iw_cm.h>
> +#include <rdma/ib_verbs.h>
> +#include <rdma/ib_smi.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +#include "siw.h"
> +#include "siw_verbs.h"
> +#include "siw_obj.h"
> +#include "siw_cm.h"
> +
> +static int ib_qp_state_to_siw_qp_state[IB_QPS_ERR+1] = {
> +	[IB_QPS_RESET]	= SIW_QP_STATE_IDLE,
> +	[IB_QPS_INIT]	= SIW_QP_STATE_IDLE,
> +	[IB_QPS_RTR]	= SIW_QP_STATE_RTR,
> +	[IB_QPS_RTS]	= SIW_QP_STATE_RTS,
> +	[IB_QPS_SQD]	= SIW_QP_STATE_CLOSING,
> +	[IB_QPS_SQE]	= SIW_QP_STATE_TERMINATE,
> +	[IB_QPS_ERR]	= SIW_QP_STATE_ERROR
> +};
> +
> +static struct siw_pd *siw_pd_base2siw(struct ib_pd *base_pd)
> +{
> +	return container_of(base_pd, struct siw_pd, base_pd);
> +}
> +
> +static struct siw_ucontext *siw_ctx_base2siw(struct ib_ucontext
*base_ctx)
> +{
> +	return container_of(base_ctx, struct siw_ucontext, ib_ucontext);
> +}
> +
> +static struct siw_cq *siw_cq_base2siw(struct ib_cq *base_cq)
> +{
> +	return container_of(base_cq, struct siw_cq, base_cq);
> +}
> +
> +static struct siw_srq *siw_srq_base2siw(struct ib_srq *base_srq)
> +{
> +	return container_of(base_srq, struct siw_srq, base_srq);
> +}
> +
> +static u32 siw_insert_uobj(struct siw_ucontext *uctx, void *vaddr, u32
size)
> +{
> +	struct siw_uobj *uobj;
> +	u32	key;
> +
> +	uobj = kzalloc(sizeof(*uobj), GFP_KERNEL);
> +	if (!uobj)
> +		return SIW_INVAL_UOBJ_KEY;
> +
> +	size = PAGE_ALIGN(size);
> +
> +	spin_lock(&uctx->uobj_lock);
> +
> +	if (list_empty(&uctx->uobj_list))
> +		uctx->uobj_key = 0;
> +
> +	key = uctx->uobj_key;
> +	if (key > SIW_MAX_UOBJ_KEY) {
> +		spin_unlock(&uctx->uobj_lock);
> +		kfree(uobj);
> +		return SIW_INVAL_UOBJ_KEY;
> +	}
> +	uobj->key = key;
> +	uobj->size = size;
> +	uobj->addr = vaddr;
> +
> +	uctx->uobj_key += size; /* advance for next object */
> +
> +	list_add_tail(&uobj->list, &uctx->uobj_list);
> +
> +	spin_unlock(&uctx->uobj_lock);
> +
> +	return key;
> +}
> +
> +static struct siw_uobj *siw_remove_uobj(struct siw_ucontext *uctx, u32
key,
> +					u32 size)
> +{
> +	struct list_head *pos, *nxt;
> +
> +	spin_lock(&uctx->uobj_lock);
> +
> +	list_for_each_safe(pos, nxt, &uctx->uobj_list) {
> +		struct siw_uobj *uobj = list_entry(pos, struct siw_uobj,
list);
> +
> +		if (uobj->key == key && uobj->size == size) {
> +			list_del(&uobj->list);
> +			spin_unlock(&uctx->uobj_lock);
> +			return uobj;
> +		}
> +	}
> +	spin_unlock(&uctx->uobj_lock);
> +
> +	return NULL;
> +}
> +
> +int siw_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma)
> +{
> +	struct siw_ucontext     *uctx = siw_ctx_base2siw(ctx);
> +	struct siw_uobj         *uobj;
> +	u32     key = vma->vm_pgoff << PAGE_SHIFT;
> +	int     size = vma->vm_end - vma->vm_start;
> +	int     rv = -EINVAL;
> +
> +	/*
> +	 * Must be page aligned
> +	 */
> +	if (vma->vm_start & (PAGE_SIZE - 1)) {
> +		pr_warn("map not page aligned\n");
> +		goto out;
> +	}
> +
> +	uobj = siw_remove_uobj(uctx, key, size);
> +	if (!uobj) {
> +		pr_warn("mmap lookup failed: %u, %d\n", key, size);
> +		goto out;
> +	}
> +	rv = remap_vmalloc_range(vma, uobj->addr, 0);
> +	if (rv)
> +		pr_warn("remap_vmalloc_range failed: %u, %d\n", key, size);
> +
> +	kfree(uobj);
> +out:
> +	return rv;
> +}
> +
> +struct ib_ucontext *siw_alloc_ucontext(struct ib_device *base_dev,
> +				       struct ib_udata *udata)
> +{
> +	struct siw_ucontext *ctx = NULL;
> +	struct siw_device *sdev = siw_dev_base2siw(base_dev);
> +	int rv;
> +
> +	if (atomic_inc_return(&sdev->num_ctx) > SIW_MAX_CONTEXT) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	spin_lock_init(&ctx->uobj_lock);
> +	INIT_LIST_HEAD(&ctx->uobj_list);
> +	ctx->uobj_key = 0;
> +
> +	ctx->sdev = sdev;
> +	if (udata) {
> +		struct siw_uresp_alloc_ctx uresp;
> +
> +		memset(&uresp, 0, sizeof(uresp));
> +		uresp.dev_id = sdev->attrs.vendor_part_id;
> +
> +		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> +		if (rv)
> +			goto err_out;
> +	}
> +	siw_dbg(sdev, "success. now %d context(s)\n",
> +		atomic_read(&sdev->num_ctx));
> +
> +	return &ctx->ib_ucontext;
> +
> +err_out:
> +	kfree(ctx);
> +
> +	atomic_dec(&sdev->num_ctx);
> +	siw_dbg(sdev, "failure %d. now %d context(s)\n",
> +		rv, atomic_read(&sdev->num_ctx));
> +
> +	return ERR_PTR(rv);
> +}
> +
> +int siw_dealloc_ucontext(struct ib_ucontext *base_ctx)
> +{
> +	struct siw_ucontext *ctx = siw_ctx_base2siw(base_ctx);
> +
> +	atomic_dec(&ctx->sdev->num_ctx);
> +	kfree(ctx);
> +	return 0;
> +}
> +
> +int siw_query_device(struct ib_device *base_dev, struct ib_device_attr *attr,
> +		     struct ib_udata *unused)
> +{
> +	struct siw_device *sdev = siw_dev_base2siw(base_dev);
> +	/*
> +	 * A process context is needed to report avail memory resources.
> +	 */
> +	if (in_interrupt())
> +		return -EINVAL;
> +
> +	memset(attr, 0, sizeof(*attr));
> +
> +	attr->max_mr_size = rlimit(RLIMIT_MEMLOCK); /* per process */
> +	attr->vendor_id = sdev->attrs.vendor_id;
> +	attr->vendor_part_id = sdev->attrs.vendor_part_id;
> +	attr->max_qp = sdev->attrs.max_qp;
> +	attr->max_qp_wr = sdev->attrs.max_qp_wr;
> +
> +	attr->max_qp_rd_atom = sdev->attrs.max_ord;
> +	attr->max_qp_init_rd_atom = sdev->attrs.max_ird;
> +	attr->max_res_rd_atom = sdev->attrs.max_qp * sdev->attrs.max_ird;
> +	attr->device_cap_flags = sdev->attrs.cap_flags;
> +	attr->max_sge = sdev->attrs.max_sge;
> +	attr->max_sge_rd = sdev->attrs.max_sge_rd;
> +	attr->max_cq = sdev->attrs.max_cq;
> +	attr->max_cqe = sdev->attrs.max_cqe;
> +	attr->max_mr = sdev->attrs.max_mr;
> +	attr->max_pd = sdev->attrs.max_pd;
> +	attr->max_mw = sdev->attrs.max_mw;
> +	attr->max_fmr = sdev->attrs.max_fmr;
> +	attr->max_srq = sdev->attrs.max_srq;
> +	attr->max_srq_wr = sdev->attrs.max_srq_wr;
> +	attr->max_srq_sge = sdev->attrs.max_srq_sge;
> +	attr->max_fast_reg_page_list_len = SIW_MAX_SGE_PBL;
> +	attr->page_size_cap = PAGE_SIZE;
> +	/* Revisit if RFC 7306 gets supported */
> +	attr->atomic_cap = 0;
> +
> +	memcpy(&attr->sys_image_guid, sdev->netdev->dev_addr, 6);
> +
> +	return 0;
> +}
> +
> +/*
> + * Approximate translation of real MTU for IB.
> + */
> +static inline enum ib_mtu siw_mtu_net2base(unsigned short mtu)
> +{
> +	if (mtu >= 4096)
> +		return IB_MTU_4096;
> +	if (mtu >= 2048)
> +		return IB_MTU_2048;
> +	if (mtu >= 1024)
> +		return IB_MTU_1024;
> +	if (mtu >= 512)
> +		return IB_MTU_512;
> +	if (mtu >= 256)
> +		return IB_MTU_256;
> +	return IB_MTU_4096;
> +}
> +
> +int siw_query_port(struct ib_device *base_dev, u8 port,
> +		     struct ib_port_attr *attr)
> +{
> +	struct siw_device *sdev = siw_dev_base2siw(base_dev);
> +
> +	memset(attr, 0, sizeof(*attr));
> +
> +	attr->state = sdev->state;
> +	attr->max_mtu = siw_mtu_net2base(sdev->netdev->mtu);
> +	attr->active_mtu = attr->max_mtu;
> +	attr->gid_tbl_len = 1;
> +	attr->port_cap_flags = IB_PORT_CM_SUP;
> +	attr->port_cap_flags |= IB_PORT_DEVICE_MGMT_SUP;
> +	attr->max_msg_sz = -1;
> +	attr->pkey_tbl_len = 1;
> +	attr->active_width = 2;
> +	attr->active_speed = 2;
> +	attr->phys_state = sdev->state == IB_PORT_ACTIVE ? 5 : 3;
> +	/*
> +	 * All zero
> +	 *
> +	 * attr->lid = 0;
> +	 * attr->bad_pkey_cntr = 0;
> +	 * attr->qkey_viol_cntr = 0;
> +	 * attr->sm_lid = 0;
> +	 * attr->lmc = 0;
> +	 * attr->max_vl_num = 0;
> +	 * attr->sm_sl = 0;
> +	 * attr->subnet_timeout = 0;
> +	 * attr->init_type_repy = 0;
> +	 */
> +	return 0;
> +}
> +
> +int siw_get_port_immutable(struct ib_device *base_dev, u8 port,
> +			   struct ib_port_immutable *port_immutable)
> +{
> +	struct ib_port_attr attr;
> +	int rv = siw_query_port(base_dev, port, &attr);
> +
> +	if (rv)
> +		return rv;
> +
> +	port_immutable->pkey_tbl_len = attr.pkey_tbl_len;
> +	port_immutable->gid_tbl_len = attr.gid_tbl_len;
> +	port_immutable->core_cap_flags = RDMA_CORE_PORT_IWARP;
> +
> +	return 0;
> +}
> +
> +int siw_query_pkey(struct ib_device *base_dev, u8 port, u16 idx, u16 *pkey)
> +{
> +	/* Report the default pkey */
> +	*pkey = 0xffff;
> +	return 0;
> +}
> +
> +int siw_query_gid(struct ib_device *base_dev, u8 port, int idx,
> +		   union ib_gid *gid)
> +{
> +	struct siw_device *sdev = siw_dev_base2siw(base_dev);
> +
> +	/* subnet_prefix == interface_id == 0; */
> +	memset(gid, 0, sizeof(*gid));
> +	memcpy(&gid->raw[0], sdev->netdev->dev_addr, 6);
> +
> +	return 0;
> +}
> +
> +struct ib_pd *siw_alloc_pd(struct ib_device *base_dev,
> +			   struct ib_ucontext *context, struct ib_udata *udata)
> +{
> +	struct siw_pd *pd = NULL;
> +	struct siw_device *sdev  = siw_dev_base2siw(base_dev);
> +	int rv;
> +
> +	if (atomic_inc_return(&sdev->num_pd) > SIW_MAX_PD) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	pd = kmalloc(sizeof(*pd), GFP_KERNEL);
> +	if (!pd) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	rv = siw_pd_add(sdev, pd);
> +	if (rv) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (context) {
> +		if (ib_copy_to_udata(udata, &pd->hdr.id, sizeof(pd->hdr.id))) {
> +			rv = -EFAULT;
> +			goto err_out_idr;
> +		}
> +	}
> +	siw_dbg(sdev, "success. now %d pd(s)\n",
> +		atomic_read(&sdev->num_pd));
> +	return &pd->base_pd;
> +
> +err_out_idr:
> +	siw_remove_obj(&sdev->idr_lock, &sdev->pd_idr, &pd->hdr);
> +err_out:
> +	kfree(pd);
> +	atomic_dec(&sdev->num_pd);
> +	siw_dbg(sdev, "failure %d. now %d pd(s)\n",
> +		rv, atomic_read(&sdev->num_pd));
> +
> +	return ERR_PTR(rv);
> +}
> +
> +int siw_dealloc_pd(struct ib_pd *base_pd)
> +{
> +	struct siw_pd *pd = siw_pd_base2siw(base_pd);
> +	struct siw_device *sdev = siw_dev_base2siw(base_pd->device);
> +
> +	siw_remove_obj(&sdev->idr_lock, &sdev->pd_idr, &pd->hdr);
> +	siw_pd_put(pd);
> +
> +	return 0;
> +}
> +
> +void siw_qp_get_ref(struct ib_qp *base_qp)
> +{
> +	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
> +
> +	siw_dbg_qp(qp, "get user reference\n");
> +	siw_qp_get(qp);
> +}
> +
> +void siw_qp_put_ref(struct ib_qp *base_qp)
> +{
> +	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
> +
> +	siw_dbg_qp(qp, "put user reference\n");
> +	siw_qp_put(qp);
> +}
> +
> +int siw_no_mad(struct ib_device *base_dev, int flags, u8 port,
> +	       const struct ib_wc *wc, const struct ib_grh *grh,
> +	       const struct ib_mad_hdr *in_mad, size_t in_mad_size,
> +	       struct ib_mad_hdr *out_mad, size_t *out_mad_size,
> +	       u16 *outmad_pkey_index)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +/*
> + * siw_create_qp()
> + *
> + * Create QP of requested size on given device.
> + *
> + * @base_pd:	Base PD contained in siw PD
> + * @attrs:	Initial QP attributes.
> + * @udata:	used to provide QP ID, SQ and RQ size back to user.
> + */
> +
> +struct ib_qp *siw_create_qp(struct ib_pd *base_pd,
> +			    struct ib_qp_init_attr *attrs,
> +			    struct ib_udata *udata)
> +{
> +	struct siw_qp		*qp = NULL;
> +	struct siw_pd		*pd = siw_pd_base2siw(base_pd);
> +	struct ib_device	 *base_dev = base_pd->device;
> +	struct siw_device	*sdev = siw_dev_base2siw(base_dev);
> +	struct siw_cq		*scq = NULL, *rcq = NULL;
> +
> +	unsigned long flags;
> +	int num_sqe, num_rqe, rv = 0;
> +
> +	siw_dbg(sdev, "create new qp\n");
> +
> +	if (atomic_inc_return(&sdev->num_qp) > SIW_MAX_QP) {
> +		siw_dbg(sdev, "too many qp's\n");
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (attrs->qp_type != IB_QPT_RC) {
> +		siw_dbg(sdev, "only rc qp's supported\n");
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +	if ((attrs->cap.max_send_wr > SIW_MAX_QP_WR) ||
> +	    (attrs->cap.max_recv_wr > SIW_MAX_QP_WR) ||
> +	    (attrs->cap.max_send_sge > SIW_MAX_SGE)  ||
> +	    (attrs->cap.max_recv_sge > SIW_MAX_SGE)) {
> +		siw_dbg(sdev, "qp size error\n");
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +	if (attrs->cap.max_inline_data > SIW_MAX_INLINE) {
> +		siw_dbg(sdev, "max inline send: %d > %d\n",
> +			attrs->cap.max_inline_data, (int)SIW_MAX_INLINE);
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +	/*
> +	 * NOTE: we allow for zero element SQ and RQ WQE's SGL's
> +	 * but not for a QP unable to hold any WQE (SQ + RQ)
> +	 */
> +	if (attrs->cap.max_send_wr + attrs->cap.max_recv_wr == 0) {
> +		siw_dbg(sdev, "qp must have send or receive queue\n");
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	scq = siw_cq_id2obj(sdev, ((struct siw_cq *)attrs->send_cq)->hdr.id);
> +	rcq = siw_cq_id2obj(sdev, ((struct siw_cq *)attrs->recv_cq)->hdr.id);
> +
> +	if (!scq || (!rcq && !attrs->srq)) {
> +		siw_dbg(sdev, "send cq or receive cq invalid\n");
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +	qp = kzalloc(sizeof(*qp), GFP_KERNEL);
> +	if (!qp) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +
> +	init_rwsem(&qp->state_lock);
> +	spin_lock_init(&qp->sq_lock);
> +	spin_lock_init(&qp->rq_lock);
> +	spin_lock_init(&qp->orq_lock);
> +
> +	init_waitqueue_head(&qp->tx_ctx.waitq);
> +
> +	if (!base_pd->uobject)
> +		qp->kernel_verbs = 1;
> +
> +	rv = siw_qp_add(sdev, qp);
> +	if (rv)
> +		goto err_out;
> +
> +	num_sqe = roundup_pow_of_two(attrs->cap.max_send_wr);
> +	num_rqe = roundup_pow_of_two(attrs->cap.max_recv_wr);
> +
> +	if (qp->kernel_verbs)
> +		qp->sendq = vzalloc(num_sqe * sizeof(struct siw_sqe));
> +	else
> +		qp->sendq = vmalloc_user(num_sqe * sizeof(struct siw_sqe));
> +
> +	if (qp->sendq == NULL) {
> +		siw_dbg_qp(qp, "send queue size %d alloc failed\n", num_sqe);
> +		rv = -ENOMEM;
> +		goto err_out_idr;
> +	}
> +	if (attrs->sq_sig_type != IB_SIGNAL_REQ_WR) {
> +		if (attrs->sq_sig_type == IB_SIGNAL_ALL_WR)
> +			qp->attrs.flags |= SIW_SIGNAL_ALL_WR;
> +		else {
> +			rv = -EINVAL;
> +			goto err_out_idr;
> +		}
> +	}
> +	qp->pd  = pd;
> +	qp->scq = scq;
> +	qp->rcq = rcq;
> +
> +	if (attrs->srq) {
> +		/*
> +		 * SRQ support.
> +		 * Verbs 6.3.7: ignore RQ size, if SRQ present
> +		 * Verbs 6.3.5: do not check PD of SRQ against PD of QP
> +		 */
> +		qp->srq = siw_srq_base2siw(attrs->srq);
> +		qp->attrs.rq_size = 0;
> +		siw_dbg_qp(qp, "[SRQ 0x%p] attached\n", qp->srq);
> +	} else if (num_rqe) {
> +		if (qp->kernel_verbs)
> +			qp->recvq = vzalloc(num_rqe * sizeof(struct siw_rqe));
> +		else
> +			qp->recvq = vmalloc_user(num_rqe *
> +						 sizeof(struct siw_rqe));
> +
> +		if (qp->recvq == NULL) {
> +			siw_dbg_qp(qp, "recv queue size %d alloc failed\n",
> +				   num_rqe);
> +			rv = -ENOMEM;
> +			goto err_out_idr;
> +		}
> +
> +		qp->attrs.rq_size = num_rqe;
> +	}
> +	qp->attrs.sq_size = num_sqe;
> +	qp->attrs.sq_max_sges = attrs->cap.max_send_sge;
> +	/*
> +	 * ofed has no max_send_sge_rdmawrite
> +	 */
> +	qp->attrs.sq_max_sges_rdmaw = attrs->cap.max_send_sge;
> +	qp->attrs.rq_max_sges = attrs->cap.max_recv_sge;
> +
> +	qp->attrs.state = SIW_QP_STATE_IDLE;
> +
> +	if (udata) {
> +		struct siw_uresp_create_qp uresp;
> +		struct siw_ucontext *ctx;
> +
> +		memset(&uresp, 0, sizeof(uresp));
> +		ctx = siw_ctx_base2siw(base_pd->uobject->context);
> +
> +		uresp.sq_key = uresp.rq_key = SIW_INVAL_UOBJ_KEY;
> +		uresp.num_sqe = num_sqe;
> +		uresp.num_rqe = num_rqe;
> +		uresp.qp_id = QP_ID(qp);
> +
> +		if (qp->sendq) {
> +			uresp.sq_key = siw_insert_uobj(ctx, qp->sendq,
> +					num_sqe * sizeof(struct siw_sqe));
> +			if (uresp.sq_key > SIW_MAX_UOBJ_KEY)
> +				siw_dbg_qp(qp, "preparing mmap sq failed\n");
> +		}
> +		if (qp->recvq) {
> +			uresp.rq_key = siw_insert_uobj(ctx, qp->recvq,
> +					num_rqe * sizeof(struct siw_rqe));
> +			if (uresp.rq_key > SIW_MAX_UOBJ_KEY)
> +				siw_dbg_qp(qp, "preparing mmap rq failed\n");
> +		}
> +		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> +		if (rv)
> +			goto err_out_idr;
> +	}
> +	qp->tx_cpu = siw_get_tx_cpu(sdev);
> +	if (qp->tx_cpu < 0) {
> +		rv = -EINVAL;
> +		goto err_out_idr;
> +	}
> +	atomic_set(&qp->tx_ctx.in_use, 0);
> +
> +	qp->base_qp.qp_num = QP_ID(qp);
> +
> +	siw_pd_get(pd);
> +
> +	INIT_LIST_HEAD(&qp->devq);
> +	spin_lock_irqsave(&sdev->idr_lock, flags);
> +	list_add_tail(&qp->devq, &sdev->qp_list);
> +	spin_unlock_irqrestore(&sdev->idr_lock, flags);
> +
> +	return &qp->base_qp;
> +
> +err_out_idr:
> +	siw_remove_obj(&sdev->idr_lock, &sdev->qp_idr, &qp->hdr);
> +err_out:
> +	if (scq)
> +		siw_cq_put(scq);
> +	if (rcq)
> +		siw_cq_put(rcq);
> +
> +	if (qp) {
> +		if (qp->sendq)
> +			vfree(qp->sendq);
> +		if (qp->recvq)
> +			vfree(qp->recvq);
> +		kfree(qp);
> +	}
> +	atomic_dec(&sdev->num_qp);
> +
> +	return ERR_PTR(rv);
> +}
> +
> +/*
> + * Minimum siw_query_qp() verb interface.
> + *
> + * @qp_attr_mask is not used but all available information is provided
> + */
> +int siw_query_qp(struct ib_qp *base_qp, struct ib_qp_attr *qp_attr,
> +		 int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr)
> +{
> +	struct siw_qp *qp;
> +	struct siw_device *sdev;
> +
> +	if (base_qp && qp_attr && qp_init_attr) {
> +		qp = siw_qp_base2siw(base_qp);
> +		sdev = siw_dev_base2siw(base_qp->device);
> +	} else
> +		return -EINVAL;
> +
> +	qp_attr->cap.max_inline_data = SIW_MAX_INLINE;
> +	qp_attr->cap.max_send_wr = qp->attrs.sq_size;
> +	qp_attr->cap.max_send_sge = qp->attrs.sq_max_sges;
> +	qp_attr->cap.max_recv_wr = qp->attrs.rq_size;
> +	qp_attr->cap.max_recv_sge = qp->attrs.rq_max_sges;
> +	qp_attr->path_mtu = siw_mtu_net2base(sdev->netdev->mtu);
> +	qp_attr->max_rd_atomic = qp->attrs.irq_size;
> +	qp_attr->max_dest_rd_atomic = qp->attrs.orq_size;
> +
> +	qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE |
> +			IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ;
> +
> +	qp_init_attr->qp_type = base_qp->qp_type;
> +	qp_init_attr->send_cq = base_qp->send_cq;
> +	qp_init_attr->recv_cq = base_qp->recv_cq;
> +	qp_init_attr->srq = base_qp->srq;
> +
> +	qp_init_attr->cap = qp_attr->cap;
> +
> +	return 0;
> +}
> +
> +int siw_verbs_modify_qp(struct ib_qp *base_qp, struct ib_qp_attr *attr,
> +			int attr_mask, struct ib_udata *udata)
> +{
> +	struct siw_qp_attrs	new_attrs;
> +	enum siw_qp_attr_mask	siw_attr_mask = 0;
> +	struct siw_qp		*qp = siw_qp_base2siw(base_qp);
> +	int			rv = 0;
> +
> +	if (!attr_mask)
> +		return 0;
> +
> +	memset(&new_attrs, 0, sizeof(new_attrs));
> +
> +	if (attr_mask & IB_QP_ACCESS_FLAGS) {
> +
> +		siw_attr_mask = SIW_QP_ATTR_ACCESS_FLAGS;
> +
> +		if (attr->qp_access_flags & IB_ACCESS_REMOTE_READ)
> +			new_attrs.flags |= SIW_RDMA_READ_ENABLED;
> +		if (attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE)
> +			new_attrs.flags |= SIW_RDMA_WRITE_ENABLED;
> +		if (attr->qp_access_flags & IB_ACCESS_MW_BIND)
> +			new_attrs.flags |= SIW_RDMA_BIND_ENABLED;
> +	}
> +	if (attr_mask & IB_QP_STATE) {
> +		siw_dbg_qp(qp, "desired ib qp state: %s\n",
> +			   ib_qp_state_to_string[attr->qp_state]);
> +
> +		new_attrs.state = ib_qp_state_to_siw_qp_state[attr->qp_state];
> +
> +		if (new_attrs.state > SIW_QP_STATE_RTS)
> +			qp->tx_ctx.tx_suspend = 1;
> +
> +		siw_attr_mask |= SIW_QP_ATTR_STATE;
> +	}
> +	if (!attr_mask)
> +		goto out;
> +
> +	down_write(&qp->state_lock);
> +
> +	rv = siw_qp_modify(qp, &new_attrs, siw_attr_mask);
> +
> +	up_write(&qp->state_lock);
> +out:
> +	return rv;
> +}
> +
> +int siw_destroy_qp(struct ib_qp *base_qp)
> +{
> +	struct siw_qp		*qp = siw_qp_base2siw(base_qp);
> +	struct siw_qp_attrs	qp_attrs;
> +
> +	siw_dbg_qp(qp, "state %d, cep 0x%p\n", qp->attrs.state, qp->cep);
> +
> +	/*
> +	 * Mark QP as in process of destruction to prevent from
> +	 * any async callbacks to RDMA core
> +	 */
> +	qp->attrs.flags |= SIW_QP_IN_DESTROY;
> +	qp->rx_ctx.rx_suspend = 1;
> +
> +	down_write(&qp->state_lock);
> +
> +	qp_attrs.state = SIW_QP_STATE_ERROR;
> +	(void)siw_qp_modify(qp, &qp_attrs, SIW_QP_ATTR_STATE);
> +
> +	if (qp->cep) {
> +		siw_cep_put(qp->cep);
> +		qp->cep = NULL;
> +	}
> +
> +	up_write(&qp->state_lock);
> +
> +	kfree(qp->rx_ctx.mpa_crc_hd);
> +	kfree(qp->tx_ctx.mpa_crc_hd);
> +
> +	/* Drop references */
> +	siw_cq_put(qp->scq);
> +	siw_cq_put(qp->rcq);
> +	siw_pd_put(qp->pd);
> +	qp->scq = qp->rcq = NULL;
> +
> +	siw_qp_put(qp);
> +
> +	return 0;
> +}
> +
> +/*
> + * siw_copy_sgl()
> + *
> + * Copy SGL from RDMA core representation to local
> + * representation.
> + */
> +static inline void siw_copy_sgl(struct ib_sge *sge, struct siw_sge *siw_sge,
> +			       int num_sge)
> +{
> +	while (num_sge--) {
> +		siw_sge->laddr = sge->addr;
> +		siw_sge->length = sge->length;
> +		siw_sge->lkey = sge->lkey;
> +
> +		siw_sge++; sge++;
> +	}
> +}
> +
> +/*
> + * siw_copy_inline_sgl()
> + *
> + * Prepare sgl of inlined data for sending. For userland callers, the
> + * function checks whether the given buffer addresses and lengths are
> + * within process context bounds.
> + * Data from all provided sge's are copied together into the wqe,
> + * referenced by a single sge.
> + */
> +static int siw_copy_inline_sgl(struct ib_send_wr *core_wr, struct siw_sqe *sqe)
> +{
> +	struct ib_sge	*core_sge = core_wr->sg_list;
> +	void		*kbuf	 = &sqe->sge[1];
> +	int		num_sge	 = core_wr->num_sge,
> +			bytes	 = 0;
> +
> +	sqe->sge[0].laddr = (u64)kbuf;
> +	sqe->sge[0].lkey = 0;
> +
> +	while (num_sge--) {
> +		if (!core_sge->length) {
> +			core_sge++;
> +			continue;
> +		}
> +		bytes += core_sge->length;
> +		if (bytes > SIW_MAX_INLINE) {
> +			bytes = -EINVAL;
> +			break;
> +		}
> +		memcpy(kbuf, (void *)(uintptr_t)core_sge->addr,
> +		       core_sge->length);
> +
> +		kbuf += core_sge->length;
> +		core_sge++;
> +	}
> +	sqe->sge[0].length = bytes > 0 ? bytes : 0;
> +	sqe->num_sge = bytes > 0 ? 1 : 0;
> +
> +	return bytes;
> +}
> +
> +/*
> + * siw_post_send()
> + *
> + * Post a list of S-WR's to a SQ.
> + *
> + * @base_qp:	Base QP contained in siw QP
> + * @wr:		Null terminated list of user WR's
> + * @bad_wr:	Points to failing WR in case of synchronous failure.
> + */
> +int siw_post_send(struct ib_qp *base_qp, struct ib_send_wr *wr,
> +		  struct ib_send_wr **bad_wr)
> +{
> +	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
> +	struct siw_wqe	*wqe = tx_wqe(qp);
> +
> +	unsigned long flags;
> +	int rv = 0;
> +
> +	siw_dbg_qp(qp, "state %d\n", qp->attrs.state);
> +
> +	/*
> +	 * Try to acquire QP state lock. Must be non-blocking
> +	 * to accommodate kernel clients needs.
> +	 */
> +	if (!down_read_trylock(&qp->state_lock)) {
> +		*bad_wr = wr;
> +		return -ENOTCONN;
> +	}
> +

Under what conditions does down_read_trylock() return 0?  It seems like a
kernel ULP that has multiple threads posting to the SQ might get an error
here due to lock contention?
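
For illustration, the case in question sketched as a hypothetical
interleaving (the functions are the ones from this patch, the scenario
itself is made up):

/* thread A: connection management path takes the lock for write */
down_write(&qp->state_lock);
rv = siw_qp_modify(qp, &new_attrs, siw_attr_mask);
up_write(&qp->state_lock);

/* thread B: data path, racing with thread A. The trylock fails while
 * the rwsem is write-held, so the post is rejected with -ENOTCONN
 * although the QP may well still be in RTS once the writer releases
 * the lock.
 */
if (!down_read_trylock(&qp->state_lock)) {
	*bad_wr = wr;
	return -ENOTCONN;
}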

> +	if (unlikely(qp->attrs.state != SIW_QP_STATE_RTS)) {
> +		up_read(&qp->state_lock);
> +		*bad_wr = wr;
> +		return -ENOTCONN;
> +	}

NOTCONN is inappropriate here too.  Also, I relaxed iw_cxgb4 to support
ib_drain_qp().  Basically, if the kernel QP is in ERROR, a post will be
completed via a CQE inserted with status IB_WC_WR_FLUSH_ERR.  See how
ib_drain_qp() works for details.  The reason I bring this up is that both
the iser and nvmf ULPs use ib_drain_sq/rq/qp(), so you might want to
support them.  There is an alternate solution though: ib_drain_sq/rq()
allows the low-level driver to provide its own drain implementation.  See
those functions for details.
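
To make that concrete, a minimal sketch of how posts could be flushed
instead of rejected once the QP is in ERROR, so that ib_drain_qp() can
make progress; siw_sqe_complete() and its signature are an assumption
here, not something taken from this patch:

/* Sketch only: complete each posted WR right away with a flush status
 * instead of failing the post while the QP is in ERROR.
 */
static int siw_flush_posted_wrs(struct siw_qp *qp, struct ib_send_wr *wr,
				struct ib_send_wr **bad_wr)
{
	int rv = 0;

	while (wr) {
		struct siw_sqe sqe = {};

		sqe.id = wr->wr_id;
		sqe.opcode = SIW_OP_WRITE; /* opcode is not inspected for a flush */
		rv = siw_sqe_complete(qp, &sqe, 0, SIW_WC_WR_FLUSH_ERR);
		if (rv) {
			*bad_wr = wr;
			break;
		}
		wr = wr->next;
	}
	return rv;
}

With something like that in place, the state check above could
special-case SIW_QP_STATE_ERROR for kernel clients rather than
returning -ENOTCONN.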


> +	if (wr && !qp->kernel_verbs) {
> +		siw_dbg_qp(qp, "wr must be empty for user mapped sq\n");
> +		up_read(&qp->state_lock);
> +		*bad_wr = wr;
> +		return -EINVAL;
> +	}
> +
> +	spin_lock_irqsave(&qp->sq_lock, flags);
> +
> +	while (wr) {
> +		u32 idx = qp->sq_put % qp->attrs.sq_size;
> +		struct siw_sqe *sqe = &qp->sendq[idx];
> +
> +		if (sqe->flags) {
> +			siw_dbg_qp(qp, "sq full\n");
> +			rv = -ENOMEM;
> +			break;
> +		}
> +		if (wr->num_sge > qp->attrs.sq_max_sges) {
> +			siw_dbg_qp(qp, "too many sge's: %d\n", wr->num_sge);
> +			rv = -EINVAL;
> +			break;
> +		}
> +		sqe->id = wr->wr_id;
> +
> +		if ((wr->send_flags & IB_SEND_SIGNALED) ||
> +		    (qp->attrs.flags & SIW_SIGNAL_ALL_WR))
> +			sqe->flags |= SIW_WQE_SIGNALLED;
> +
> +		if (wr->send_flags & IB_SEND_FENCE)
> +			sqe->flags |= SIW_WQE_READ_FENCE;
> +
> +		switch (wr->opcode) {
> +
> +		case IB_WR_SEND:
> +		case IB_WR_SEND_WITH_INV:
> +			if (wr->send_flags & IB_SEND_SOLICITED)
> +				sqe->flags |= SIW_WQE_SOLICITED;
> +
> +			if (!(wr->send_flags & IB_SEND_INLINE)) {
> +				siw_copy_sgl(wr->sg_list, sqe->sge,
> +					     wr->num_sge);
> +				sqe->num_sge = wr->num_sge;
> +			} else {
> +				rv = siw_copy_inline_sgl(wr, sqe);
> +				if (rv <= 0) {
> +					rv = -EINVAL;
> +					break;
> +				}
> +				sqe->flags |= SIW_WQE_INLINE;
> +				sqe->num_sge = 1;
> +			}
> +			if (wr->opcode == IB_WR_SEND)
> +				sqe->opcode = SIW_OP_SEND;
> +			else {
> +				sqe->opcode = SIW_OP_SEND_REMOTE_INV;
> +				sqe->rkey = wr->ex.invalidate_rkey;
> +			}
> +			break;
> +
> +		case IB_WR_RDMA_READ_WITH_INV:
> +		case IB_WR_RDMA_READ:
> +			/*
> +			 * OFED WR restricts RREAD sink to SGL containing
> +			 * 1 SGE only. We could relax to SGL with multiple
> +			 * elements referring to the SAME ltag or even sending
> +			 * a private per-rreq tag referring to a checked
> +			 * local sgl with MULTIPLE ltag's. Would be easy
> +			 * to do...
> +			 */
> +			if (unlikely(wr->num_sge != 1)) {
> +				rv = -EINVAL;
> +				break;
> +			}
> +			siw_copy_sgl(wr->sg_list, &sqe->sge[0], 1);
> +			/*
> +			 * NOTE: zero length RREAD is allowed!
> +			 */
> +			sqe->raddr	= rdma_wr(wr)->remote_addr;
> +			sqe->rkey	= rdma_wr(wr)->rkey;
> +			sqe->num_sge	= 1;
> +
> +			if (wr->opcode == IB_WR_RDMA_READ)
> +				sqe->opcode = SIW_OP_READ;
> +			else
> +				sqe->opcode = SIW_OP_READ_LOCAL_INV;
> +			break;
> +
> +		case IB_WR_RDMA_WRITE:
> +			if (!(wr->send_flags & IB_SEND_INLINE)) {
> +				siw_copy_sgl(wr->sg_list, &sqe->sge[0],
> +					     wr->num_sge);
> +				sqe->num_sge = wr->num_sge;
> +			} else {
> +				rv = siw_copy_inline_sgl(wr, sqe);
> +				if (unlikely(rv < 0)) {
> +					rv = -EINVAL;
> +					break;
> +				}
> +				sqe->flags |= SIW_WQE_INLINE;
> +				sqe->num_sge = 1;
> +			}
> +			sqe->raddr	= rdma_wr(wr)->remote_addr;
> +			sqe->rkey	= rdma_wr(wr)->rkey;
> +			sqe->opcode	= SIW_OP_WRITE;
> +
> +			break;
> +
> +		case IB_WR_REG_MR:
> +			sqe->base_mr = (uint64_t)reg_wr(wr)->mr;
> +			sqe->rkey = reg_wr(wr)->key;
> +			sqe->access = SIW_MEM_LREAD;
> +			if (reg_wr(wr)->access & IB_ACCESS_LOCAL_WRITE)
> +				sqe->access |= SIW_MEM_LWRITE;
> +			if (reg_wr(wr)->access & IB_ACCESS_REMOTE_WRITE)
> +				sqe->access |= SIW_MEM_RWRITE;
> +			if (reg_wr(wr)->access & IB_ACCESS_REMOTE_READ)
> +				sqe->access |= SIW_MEM_RREAD;
> +			sqe->opcode = SIW_OP_REG_MR;
> +
> +			break;
> +
> +		case IB_WR_LOCAL_INV:
> +			sqe->rkey = wr->ex.invalidate_rkey;
> +			sqe->opcode = SIW_OP_INVAL_STAG;
> +
> +			break;
> +
> +		default:
> +			siw_dbg_qp(qp, "ib wr type %d unsupported\n",
> +				   wr->opcode);
> +			rv = -EINVAL;
> +			break;
> +		}
> +		siw_dbg_qp(qp, "opcode %d, flags 0x%x\n",
> +			   sqe->opcode, sqe->flags);
> +
> +		if (unlikely(rv < 0))
> +			break;
> +
> +		/* make SQE valid only after it is completely written */
> +		smp_wmb();
> +		sqe->flags |= SIW_WQE_VALID;
> +
> +		qp->sq_put++;
> +		wr = wr->next;
> +	}
> +
> +	/*
> +	 * Send directly if SQ processing is not in progress.
> +	 * Eventual immediate errors (rv < 0) do not affect the involved
> +	 * RI resources (Verbs, 8.3.1) and thus do not prevent SQ
> +	 * processing, if new work is already pending. But rv must be passed
> +	 * to caller.
> +	 */
> +	if (wqe->wr_status != SIW_WR_IDLE) {
> +		spin_unlock_irqrestore(&qp->sq_lock, flags);
> +		goto skip_direct_sending;
> +	}
> +	rv = siw_activate_tx(qp);
> +	spin_unlock_irqrestore(&qp->sq_lock, flags);
> +
> +	if (rv <= 0)
> +		goto skip_direct_sending;
> +
> +	if (qp->kernel_verbs) {
> +		rv = siw_sq_start(qp);
> +	} else {
> +		qp->tx_ctx.in_syscall = 1;
> +
> +		if (siw_qp_sq_process(qp) != 0 && !(qp->tx_ctx.tx_suspend))
> +			siw_qp_cm_drop(qp, 0);
> +
> +		qp->tx_ctx.in_syscall = 0;
> +	}
> +skip_direct_sending:
> +
> +	up_read(&qp->state_lock);
> +
> +	if (rv >= 0)
> +		return 0;
> +	/*
> +	 * Immediate error
> +	 */
> +	siw_dbg_qp(qp, "error %d\n", rv);
> +
> +	*bad_wr = wr;
> +	return rv;
> +}
> +
> +/*
> + * siw_post_receive()
> + *
> + * Post a list of R-WR's to a RQ.
> + *
> + * @base_qp:	Base QP contained in siw QP
> + * @wr:		Null terminated list of user WR's
> + * @bad_wr:	Points to failing WR in case of synchronous failure.
> + */
> +int siw_post_receive(struct ib_qp *base_qp, struct ib_recv_wr *wr,
> +		     struct ib_recv_wr **bad_wr)
> +{
> +	struct siw_qp	*qp = siw_qp_base2siw(base_qp);
> +	int rv = 0;
> +
> +	siw_dbg_qp(qp, "state %d\n", qp->attrs.state);
> +
> +	if (qp->srq) {
> +		*bad_wr = wr;
> +		return -EOPNOTSUPP; /* what else from errno.h? */
> +	}
> +	/*
> +	 * Try to acquire QP state lock. Must be non-blocking
> +	 * to accommodate kernel clients needs.
> +	 */
> +	if (!down_read_trylock(&qp->state_lock)) {
> +		*bad_wr = wr;
> +		return -ENOTCONN;
> +	}

Same comment here on trylock...


> +	if (!qp->kernel_verbs) {
> +		siw_dbg_qp(qp, "no kernel post_recv for user mapped sq\n");
> +		up_read(&qp->state_lock);
> +		*bad_wr = wr;
> +		return -EINVAL;
> +	}
> +	if (qp->attrs.state > SIW_QP_STATE_RTS) {
> +		up_read(&qp->state_lock);
> +		*bad_wr = wr;
> +		return -EINVAL;
> +	}
> +	while (wr) {
> +		u32 idx = qp->rq_put % qp->attrs.rq_size;
> +		struct siw_rqe *rqe = &qp->recvq[idx];
> +
> +		if (rqe->flags) {
> +			siw_dbg_qp(qp, "rq full\n");
> +			rv = -ENOMEM;
> +			break;
> +		}
> +		if (wr->num_sge > qp->attrs.rq_max_sges) {
> +			siw_dbg_qp(qp, "too many sge's: %d\n", wr->num_sge);
> +			rv = -EINVAL;
> +			break;
> +		}
> +		rqe->id = wr->wr_id;
> +		rqe->num_sge = wr->num_sge;
> +		siw_copy_sgl(wr->sg_list, rqe->sge, wr->num_sge);
> +
> +		/* make sure RQE is completely written before valid */
> +		smp_wmb();
> +
> +		rqe->flags = SIW_WQE_VALID;
> +
> +		qp->rq_put++;
> +		wr = wr->next;
> +	}
> +	if (rv < 0) {
> +		siw_dbg_qp(qp, "error %d\n", rv);
> +		*bad_wr = wr;
> +	}
> +	up_read(&qp->state_lock);
> +
> +	return rv > 0 ? 0 : rv;
> +}
> +
> +int siw_destroy_cq(struct ib_cq *base_cq)
> +{
> +	struct siw_cq		*cq  = siw_cq_base2siw(base_cq);
> +	struct ib_device	*base_dev = base_cq->device;
> +	struct siw_device	*sdev = siw_dev_base2siw(base_dev);
> +
> +	siw_cq_flush(cq);
> +
> +	siw_remove_obj(&sdev->idr_lock, &sdev->cq_idr, &cq->hdr);
> +	siw_cq_put(cq);
> +
> +	return 0;
> +}
> +
> +/*
> + * siw_create_cq()
> + *
> + * Create CQ of requested size on given device.
> + *
> + * @base_dev:	RDMA device contained in siw device
> + * @size:	maximum number of CQE's allowed.
> + * @ib_context: user context.
> + * @udata:	used to provide CQ ID back to user.
> + */
> +
> +struct ib_cq *siw_create_cq(struct ib_device *base_dev,
> +			    const struct ib_cq_init_attr *attr,
> +			    struct ib_ucontext *ib_context,
> +			    struct ib_udata *udata)
> +{
> +	struct siw_cq			*cq = NULL;
> +	struct siw_device		*sdev = siw_dev_base2siw(base_dev);
> +	struct siw_uresp_create_cq	uresp;
> +	int rv, size = attr->cqe;
> +
> +	if (!base_dev) {
> +		rv = -ENODEV;
> +		goto err_out;
> +	}
> +	if (atomic_inc_return(&sdev->num_cq) > SIW_MAX_CQ) {
> +		siw_dbg(sdev, "too many cq's\n");
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (size < 1 || size > SIW_MAX_CQE) {
> +		siw_dbg(sdev, "cq size error: %d\n", size);
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +	cq = kzalloc(sizeof(*cq), GFP_KERNEL);
> +	if (!cq) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	size = roundup_pow_of_two(size);
> +	cq->base_cq.cqe = size;
> +	cq->num_cqe = size;
> +
> +	if (!ib_context) {
> +		cq->kernel_verbs = 1;
> +		cq->queue = vzalloc(size * sizeof(struct siw_cqe)
> +				+ sizeof(struct siw_cq_ctrl));
> +	} else
> +		cq->queue = vmalloc_user(size * sizeof(struct siw_cqe)
> +				+ sizeof(struct siw_cq_ctrl));
> +
> +	if (cq->queue == NULL) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +
> +	rv = siw_cq_add(sdev, cq);
> +	if (rv)
> +		goto err_out;
> +
> +	spin_lock_init(&cq->lock);
> +
> +	cq->notify = &((struct siw_cq_ctrl *)&cq->queue[size])->notify;
> +
> +	if (!cq->kernel_verbs) {
> +		struct siw_ucontext *ctx = siw_ctx_base2siw(ib_context);
> +
> +		uresp.cq_key = siw_insert_uobj(ctx, cq->queue,
> +					size * sizeof(struct siw_cqe) +
> +					sizeof(struct siw_cq_ctrl));
> +
> +		if (uresp.cq_key > SIW_MAX_UOBJ_KEY)
> +			siw_dbg(sdev, "[CQ %d]: preparing mmap cq failed\n",
> +				OBJ_ID(cq));
> +
> +		uresp.cq_id = OBJ_ID(cq);
> +		uresp.num_cqe = size;
> +
> +		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> +		if (rv)
> +			goto err_out_idr;
> +	}
> +	return &cq->base_cq;
> +
> +err_out_idr:
> +	siw_remove_obj(&sdev->idr_lock, &sdev->cq_idr, &cq->hdr);
> +err_out:
> +	siw_dbg(sdev, "cq creation failed: %d", rv);
> +
> +	if (cq && cq->queue)
> +		vfree(cq->queue);
> +
> +	kfree(cq);
> +	atomic_dec(&sdev->num_cq);
> +
> +	return ERR_PTR(rv);
> +}
> +
> +/*
> + * siw_poll_cq()
> + *
> + * Reap CQ entries if available and copy work completion status into
> + * array of WC's provided by caller. Returns number of reaped CQE's.
> + *
> + * @base_cq:	Base CQ contained in siw CQ.
> + * @num_cqe:	Maximum number of CQE's to reap.
> + * @wc:		Array of work completions to be filled by siw.
> + */
> +int siw_poll_cq(struct ib_cq *base_cq, int num_cqe, struct ib_wc *wc)
> +{
> +	struct siw_cq		*cq  = siw_cq_base2siw(base_cq);
> +	int			i;
> +
> +	for (i = 0; i < num_cqe; i++) {
> +		if (!(siw_reap_cqe(cq, wc)))
> +			break;
> +		wc++;
> +	}
> +	return i;
> +}
> +
> +/*
> + * siw_req_notify_cq()
> + *
> + * Request notification for new CQE's added to that CQ.
> + * Defined flags:
> + * o SIW_CQ_NOTIFY_SOLICITED lets siw trigger a notification
> + *   event if a WQE with notification flag set enters the CQ
> + * o SIW_CQ_NOTIFY_NEXT_COMP lets siw trigger a notification
> + *   event if a WQE enters the CQ.
> + * o IB_CQ_REPORT_MISSED_EVENTS: return value will provide the
> + *   number of not reaped CQE's regardless of its notification
> + *   type and current or new CQ notification settings.
> + *
> + * @base_cq:	Base CQ contained in siw CQ.
> + * @flags:	Requested notification flags.
> + */
> +int siw_req_notify_cq(struct ib_cq *base_cq, enum ib_cq_notify_flags flags)
> +{
> +	struct siw_cq	 *cq  = siw_cq_base2siw(base_cq);
> +
> +	siw_dbg(cq->hdr.sdev, "[CQ %d]: flags: 0x%8x\n", OBJ_ID(cq), flags);
> +
> +	if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED)
> +		/* CQ event for next solicited completion */
> +		smp_store_mb(*cq->notify, SIW_NOTIFY_SOLICITED);
> +	else
> +		/* CQ event for any signalled completion */
> +		smp_store_mb(*cq->notify, SIW_NOTIFY_ALL);
> +
> +	if (flags & IB_CQ_REPORT_MISSED_EVENTS)
> +		return cq->cq_put - cq->cq_get;
> +
> +	return 0;
> +}
> +
> +/*
> + * siw_dereg_mr()
> + *
> + * Release Memory Region.
> + *
> + * TODO: Update function if Memory Windows are supported by siw:
> + *       Is OFED core checking for MW dependencies for the current
> + *       MR before calling MR deregistration?
> + *
> + * @base_mr:     Base MR contained in siw MR.
> + */
> +int siw_dereg_mr(struct ib_mr *base_mr)
> +{
> +	struct siw_mr *mr;
> +	struct siw_device *sdev = siw_dev_base2siw(base_mr->device);
> +
> +	mr = siw_mr_base2siw(base_mr);
> +
> +	siw_dbg(sdev, "[MEM %d]: deregister mr, #ref's %d\n",
> +		mr->mem.hdr.id, refcount_read(&mr->mem.hdr.ref));
> +
> +	mr->mem.stag_valid = 0;
> +
> +	siw_remove_obj(&sdev->idr_lock, &sdev->mem_idr, &mr->mem.hdr);
> +	siw_mem_put(&mr->mem);
> +
> +	return 0;
> +}
> +
> +static struct siw_mr *siw_create_mr(struct siw_device *sdev, void *mem_obj,
> +				    u64 start, u64 len, int rights)
> +{
> +	struct siw_mr *mr = kzalloc(sizeof(*mr), GFP_KERNEL);
> +	unsigned long flags;
> +
> +	if (!mr)
> +		return NULL;
> +
> +	mr->mem.stag_valid = 0;
> +
> +	if (siw_mem_add(sdev, &mr->mem) < 0) {
> +		kfree(mr);
> +		return NULL;
> +	}
> +	siw_dbg(sdev, "[MEM %d]: new mr, object 0x%p\n",
> +		mr->mem.hdr.id, mem_obj);
> +
> +	mr->base_mr.lkey = mr->base_mr.rkey = mr->mem.hdr.id << 8;
> +
> +	mr->mem.va  = start;
> +	mr->mem.len = len;
> +	mr->mem.mr  = NULL;
> +	mr->mem.perms = SIW_MEM_LREAD | /* not selectable in RDMA core */
> +		(rights & IB_ACCESS_REMOTE_READ  ? SIW_MEM_RREAD  : 0) |
> +		(rights & IB_ACCESS_LOCAL_WRITE  ? SIW_MEM_LWRITE : 0) |
> +		(rights & IB_ACCESS_REMOTE_WRITE ? SIW_MEM_RWRITE : 0);
> +
> +	mr->mem_obj = mem_obj;
> +
> +	INIT_LIST_HEAD(&mr->devq);
> +	spin_lock_irqsave(&sdev->idr_lock, flags);
> +	list_add_tail(&mr->devq, &sdev->mr_list);
> +	spin_unlock_irqrestore(&sdev->idr_lock, flags);
> +
> +	return mr;
> +}
> +
> +/*
> + * siw_reg_user_mr()
> + *
> + * Register Memory Region.
> + *
> + * @base_pd:	Base PD contained in siw PD.
> + * @start:	starting address of MR (virtual address)
> + * @len:	len of MR
> + * @rnic_va:	not used by siw
> + * @rights:	MR access rights
> + * @udata:	user buffer to communicate STag and Key.
> + */
> +struct ib_mr *siw_reg_user_mr(struct ib_pd *base_pd, u64 start, u64 len,
> +			      u64 rnic_va, int rights, struct ib_udata *udata)
> +{
> +	struct siw_mr		*mr = NULL;
> +	struct siw_pd		*pd = siw_pd_base2siw(base_pd);
> +	struct siw_umem		*umem = NULL;
> +	struct siw_ureq_reg_mr	ureq;
> +	struct siw_uresp_reg_mr	uresp;
> +	struct siw_device	*sdev = pd->hdr.sdev;
> +
> +	unsigned long mem_limit = rlimit(RLIMIT_MEMLOCK);
> +	int rv;
> +
> +	siw_dbg(sdev, "[PD %d]: start: 0x%016llx, va: 0x%016llx, len: %llu\n",
> +		OBJ_ID(pd), (unsigned long long)start,
> +		(unsigned long long)rnic_va, (unsigned long long)len);
> +
> +	if (atomic_inc_return(&sdev->num_mr) > SIW_MAX_MR) {
> +		siw_dbg(sdev, "[PD %d]: too many mr's\n", OBJ_ID(pd));
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (!len) {
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +	if (mem_limit != RLIM_INFINITY) {
> +		unsigned long num_pages =
> +			(PAGE_ALIGN(len + (start & ~PAGE_MASK))) >> PAGE_SHIFT;
> +		mem_limit >>= PAGE_SHIFT;
> +
> +		if (num_pages > mem_limit - current->mm->locked_vm) {
> +			siw_dbg(sdev,
> +				"[PD %d]: pages req %lu, max %lu, lock %lu\n",
> +				OBJ_ID(pd), num_pages, mem_limit,
> +				current->mm->locked_vm);
> +			rv = -ENOMEM;
> +			goto err_out;
> +		}
> +	}
> +	umem = siw_umem_get(start, len);
> +	if (IS_ERR(umem)) {
> +		rv = PTR_ERR(umem);
> +		siw_dbg(sdev, "[PD %d]: getting user memory failed: %d\n",
> +			OBJ_ID(pd), rv);
> +		umem = NULL;
> +		goto err_out;
> +	}
> +	mr = siw_create_mr(sdev, umem, start, len, rights);
> +	if (!mr) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (udata) {
> +		rv = ib_copy_from_udata(&ureq, udata, sizeof(ureq));
> +		if (rv)
> +			goto err_out;
> +
> +		mr->base_mr.lkey |= ureq.stag_key;
> +		mr->base_mr.rkey |= ureq.stag_key;
> +		uresp.stag = mr->base_mr.lkey;
> +
> +		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> +		if (rv)
> +			goto err_out;
> +	}
> +	mr->pd = pd;
> +	siw_pd_get(pd);
> +
> +	mr->mem.stag_valid = 1;
> +
> +	return &mr->base_mr;
> +
> +err_out:
> +	if (mr) {
> +		siw_remove_obj(&sdev->idr_lock, &sdev->mem_idr, &mr->mem.hdr);
> +		siw_mem_put(&mr->mem);
> +		umem = NULL;
> +	} else
> +		atomic_dec(&sdev->num_mr);
> +
> +	if (umem)
> +		siw_umem_release(umem);
> +
> +	return ERR_PTR(rv);
> +}
> +
> +struct ib_mr *siw_alloc_mr(struct ib_pd *base_pd, enum ib_mr_type mr_type,
> +			   u32 max_sge)
> +{
> +	struct siw_mr	*mr;
> +	struct siw_pd	*pd = siw_pd_base2siw(base_pd);
> +	struct siw_device *sdev = pd->hdr.sdev;
> +	struct siw_pbl	*pbl = NULL;
> +	int rv;
> +
> +	if (atomic_inc_return(&sdev->num_mr) > SIW_MAX_MR) {
> +		siw_dbg(sdev, "[PD %d]: too many mr's\n", OBJ_ID(pd));
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (mr_type != IB_MR_TYPE_MEM_REG) {
> +		siw_dbg(sdev, "[PD %d]: mr type %d unsupported\n",
> +			OBJ_ID(pd), mr_type);
> +		rv = -EOPNOTSUPP;
> +		goto err_out;
> +	}
> +	if (max_sge > SIW_MAX_SGE_PBL) {
> +		siw_dbg(sdev, "[PD %d]: too many sge's: %d\n",
> +			OBJ_ID(pd), max_sge);
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	pbl = siw_pbl_alloc(max_sge);
> +	if (IS_ERR(pbl)) {
> +		rv = PTR_ERR(pbl);
> +		siw_dbg(sdev, "[PD %d]: pbl allocation failed: %d\n",
> +			OBJ_ID(pd), rv);
> +		pbl = NULL;
> +		goto err_out;
> +	}
> +	mr = siw_create_mr(sdev, pbl, 0, max_sge * PAGE_SIZE, 0);
> +	if (!mr) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	mr->mem.is_pbl = 1;
> +	mr->pd = pd;
> +	siw_pd_get(pd);
> +
> +	siw_dbg(sdev, "[PD %d], [MEM %d]: success\n",
> +		OBJ_ID(pd), OBJ_ID(&mr->mem));
> +
> +	return &mr->base_mr;
> +
> +err_out:
> +	if (pbl)
> +		siw_pbl_free(pbl);
> +
> +	siw_dbg(sdev, "[PD %d]: failed: %d\n", OBJ_ID(pd), rv);
> +
> +	atomic_dec(&sdev->num_mr);
> +
> +	return ERR_PTR(rv);
> +}
> +
> +/* Just used to count number of pages being mapped */
> +static int siw_set_pbl_page(struct ib_mr *base_mr, u64 buf_addr)
> +{
> +	return 0;
> +}
> +
> +int siw_map_mr_sg(struct ib_mr *base_mr, struct scatterlist *sl, int num_sle,
> +		  unsigned int *sg_off)
> +{
> +	struct scatterlist *slp;
> +	struct siw_mr *mr = siw_mr_base2siw(base_mr);
> +	struct siw_pbl *pbl = mr->pbl;
> +	struct siw_pble *pble = pbl->pbe;
> +	u64 pbl_size;
> +	int i, rv;
> +
> +	if (!pbl) {
> +		siw_dbg(mr->mem.hdr.sdev, "[MEM %d]: no pbl allocated\n",
> +			OBJ_ID(&mr->mem));
> +		return -EINVAL;
> +	}
> +	if (pbl->max_buf < num_sle) {
> +		siw_dbg(mr->mem.hdr.sdev, "[MEM %d]: too many sge's: %d>%d\n",
> +			OBJ_ID(&mr->mem), mr->pbl->max_buf, num_sle);
> +		return -ENOMEM;
> +	}
> +
> +	for_each_sg(sl, slp, num_sle, i) {
> +		if (sg_dma_len(slp) == 0) {
> +			siw_dbg(mr->mem.hdr.sdev, "[MEM %d]: empty sge\n",
> +				OBJ_ID(&mr->mem));
> +			return -EINVAL;
> +		}
> +		if (i == 0) {
> +			pble->addr = sg_dma_address(slp);
> +			pble->size = sg_dma_len(slp);
> +			pble->pbl_off = 0;
> +			pbl_size = pble->size;
> +			pbl->num_buf = 1;
> +
> +			continue;
> +		}
> +		/* Merge PBL entries if adjacent */
> +		if (pble->addr + pble->size == sg_dma_address(slp))
> +			pble->size += sg_dma_len(slp);
> +		else {
> +			pble++;
> +			pbl->num_buf++;
> +			pble->addr = sg_dma_address(slp);
> +			pble->size = sg_dma_len(slp);
> +			pble->pbl_off = pbl_size;
> +		}
> +		pbl_size += sg_dma_len(slp);
> +
> +		siw_dbg(mr->mem.hdr.sdev,
> +			"[MEM %d]: sge[%d], size %llu, addr %p, total %llu\n",
> +			OBJ_ID(&mr->mem), i, pble->size, (void *)pble->addr,
> +			pbl_size);
> +	}
> +	rv = ib_sg_to_pages(base_mr, sl, num_sle, sg_off, siw_set_pbl_page);
> +	if (rv > 0) {
> +		mr->mem.len = base_mr->length;
> +		mr->mem.va = base_mr->iova;
> +		siw_dbg(mr->mem.hdr.sdev,
> +			"[MEM %d]: %llu byte, %u SLE into %u entries\n",
> +			OBJ_ID(&mr->mem), mr->mem.len, num_sle, pbl->num_buf);
> +	}
> +	return rv;
> +}
> +
> +/*
> + * siw_get_dma_mr()
> + *
> + * Create an (empty) DMA memory region with no umem attached.
> + */
> +struct ib_mr *siw_get_dma_mr(struct ib_pd *base_pd, int rights)
> +{
> +	struct siw_mr *mr;
> +	struct siw_pd *pd = siw_pd_base2siw(base_pd);
> +	struct siw_device *sdev = pd->hdr.sdev;
> +	int rv;
> +
> +	if (atomic_inc_return(&sdev->num_mr) > SIW_MAX_MR) {
> +		siw_dbg(sdev, "[PD %d]: too many mr's\n", OBJ_ID(pd));
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	mr = siw_create_mr(sdev, NULL, 0, ULONG_MAX, rights);
> +	if (!mr) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	mr->mem.stag_valid = 1;
> +
> +	mr->pd = pd;
> +	siw_pd_get(pd);
> +
> +	siw_dbg(sdev, "[PD %d], [MEM %d]: success\n",
> +		OBJ_ID(pd), OBJ_ID(&mr->mem));
> +
> +	return &mr->base_mr;
> +
> +err_out:
> +	atomic_dec(&sdev->num_mr);
> +
> +	return ERR_PTR(rv);
> +}
> +
> +/*
> + * siw_create_srq()
> + *
> + * Create Shared Receive Queue of attributes @init_attrs
> + * within protection domain given by @base_pd.
> + *
> + * @base_pd:	Base PD contained in siw PD.
> + * @init_attrs:	SRQ init attributes.
> + * @udata:	not used by siw.
> + */
> +struct ib_srq *siw_create_srq(struct ib_pd *base_pd,
> +			      struct ib_srq_init_attr *init_attrs,
> +			      struct ib_udata *udata)
> +{
> +	struct siw_srq		*srq = NULL;
> +	struct ib_srq_attr	*attrs = &init_attrs->attr;
> +	struct siw_pd		*pd = siw_pd_base2siw(base_pd);
> +	struct siw_device	*sdev = pd->hdr.sdev;
> +
> +	int kernel_verbs = base_pd->uobject ? 0 : 1;
> +	int rv;
> +
> +	if (atomic_inc_return(&sdev->num_srq) > SIW_MAX_SRQ) {
> +		siw_dbg(sdev, "too many srq's\n");
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (attrs->max_wr == 0 || attrs->max_wr > SIW_MAX_SRQ_WR ||
> +	    attrs->max_sge > SIW_MAX_SGE || attrs->srq_limit > attrs->max_wr) {
> +		rv = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	srq = kzalloc(sizeof(*srq), GFP_KERNEL);
> +	if (!srq) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +
> +	srq->max_sge = attrs->max_sge;
> +	srq->num_rqe = roundup_pow_of_two(attrs->max_wr);
> +	atomic_set(&srq->space, srq->num_rqe);
> +
> +	srq->limit = attrs->srq_limit;
> +	if (srq->limit)
> +		srq->armed = 1;
> +
> +	if (kernel_verbs)
> +		srq->recvq = vzalloc(srq->num_rqe * sizeof(struct siw_rqe));
> +	else
> +		srq->recvq = vmalloc_user(srq->num_rqe *
> +					  sizeof(struct siw_rqe));
> +
> +	if (srq->recvq == NULL) {
> +		rv = -ENOMEM;
> +		goto err_out;
> +	}
> +	if (kernel_verbs) {
> +		srq->kernel_verbs = 1;
> +	} else if (udata) {
> +		struct siw_uresp_create_srq uresp;
> +		struct siw_ucontext *ctx;
> +
> +		memset(&uresp, 0, sizeof(uresp));
> +		ctx = siw_ctx_base2siw(base_pd->uobject->context);
> +
> +		uresp.num_rqe = srq->num_rqe;
> +		uresp.srq_key = siw_insert_uobj(ctx, srq->recvq,
> +					srq->num_rqe * sizeof(struct siw_rqe));
> +
> +		if (uresp.srq_key > SIW_MAX_UOBJ_KEY)
> +			siw_dbg(sdev, "preparing mmap srq failed\n");
> +
> +		rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> +		if (rv)
> +			goto err_out;
> +	}
> +	srq->pd	= pd;
> +	siw_pd_get(pd);
> +
> +	spin_lock_init(&srq->lock);
> +
> +	siw_dbg(sdev, "[SRQ 0x%p]: success\n", srq);
> +
> +	return &srq->base_srq;
> +
> +err_out:
> +	if (srq) {
> +		if (srq->recvq)
> +			vfree(srq->recvq);
> +		kfree(srq);
> +	}
> +	atomic_dec(&sdev->num_srq);
> +
> +	return ERR_PTR(rv);
> +}
> +
> +/*
> + * siw_modify_srq()
> + *
> + * Modify SRQ. The caller may resize SRQ and/or set/reset notification
> + * limit and (re)arm IB_EVENT_SRQ_LIMIT_REACHED notification.
> + *
> + * NOTE: it is unclear if RDMA core allows for changing the MAX_SGE
> + * parameter. siw_modify_srq() does not check the attrs->max_sge param.
> + */
> +int siw_modify_srq(struct ib_srq *base_srq, struct ib_srq_attr *attrs,
> +		   enum ib_srq_attr_mask attr_mask, struct ib_udata *udata)
> +{
> +	struct siw_srq	*srq = siw_srq_base2siw(base_srq);
> +	unsigned long	flags;
> +	int rv = 0;
> +
> +	spin_lock_irqsave(&srq->lock, flags);
> +
> +	if (attr_mask & IB_SRQ_MAX_WR) {
> +		/* resize request not yet supported */
> +		rv = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if (attr_mask & IB_SRQ_LIMIT) {
> +		if (attrs->srq_limit) {
> +			if (unlikely(attrs->srq_limit > srq->num_rqe)) {
> +				rv = -EINVAL;
> +				goto out;
> +			}
> +			srq->armed = 1;
> +		} else
> +			srq->armed = 0;
> +
> +		srq->limit = attrs->srq_limit;
> +	}
> +out:
> +	spin_unlock_irqrestore(&srq->lock, flags);
> +
> +	return rv;
> +}
> +
> +/*
> + * siw_query_srq()
> + *
> + * Query SRQ attributes.
> + */
> +int siw_query_srq(struct ib_srq *base_srq, struct ib_srq_attr *attrs)
> +{
> +	struct siw_srq	*srq = siw_srq_base2siw(base_srq);
> +	unsigned long	flags;
> +
> +	spin_lock_irqsave(&srq->lock, flags);
> +
> +	attrs->max_wr = srq->num_rqe;
> +	attrs->max_sge = srq->max_sge;
> +	attrs->srq_limit = srq->limit;
> +
> +	spin_unlock_irqrestore(&srq->lock, flags);
> +
> +	return 0;
> +}
> +
> +/*
> + * siw_destroy_srq()
> + *
> + * Destroy SRQ.
> + * It is assumed that the SRQ is not referenced by any
> + * QP anymore - the code trusts the RDMA core environment to keep track
> + * of QP references.
> + */
> +int siw_destroy_srq(struct ib_srq *base_srq)
> +{
> +	struct siw_srq		*srq = siw_srq_base2siw(base_srq);
> +	struct siw_device	*sdev = srq->pd->hdr.sdev;
> +
> +	siw_pd_put(srq->pd);
> +
> +	vfree(srq->recvq);
> +	kfree(srq);
> +
> +	atomic_dec(&sdev->num_srq);
> +
> +	return 0;
> +}
> +
> +/*
> + * siw_post_srq_recv()
> + *
> + * Post a list of receive queue elements to SRQ.
> + * NOTE: The function does not check or lock a certain SRQ state
> + *       during the post operation. The code simply trusts the
> + *       RDMA core environment.
> + *
> + * @base_srq:	Base SRQ contained in siw SRQ
> + * @wr:		List of R-WR's
> + * @bad_wr:	Updated to failing WR if posting fails.
> + */
> +int siw_post_srq_recv(struct ib_srq *base_srq, struct ib_recv_wr *wr,
> +		      struct ib_recv_wr **bad_wr)
> +{
> +	struct siw_srq	*srq = siw_srq_base2siw(base_srq);
> +	int rv = 0;
> +
> +	if (!srq->kernel_verbs) {
> +		siw_dbg(srq->pd->hdr.sdev,
> +			"[SRQ 0x%p]: no kernel post_recv for mapped srq\n",
> +			srq);
> +		rv = -EINVAL;
> +		goto out;
> +	}
> +	while (wr) {
> +		u32 idx = srq->rq_put % srq->num_rqe;
> +		struct siw_rqe *rqe = &srq->recvq[idx];
> +
> +		if (rqe->flags) {
> +			siw_dbg(srq->pd->hdr.sdev, "[SRQ 0x%p]: srq full\n",
> +				srq);
> +			rv = -ENOMEM;
> +			break;
> +		}
> +		if (wr->num_sge > srq->max_sge) {
> +			siw_dbg(srq->pd->hdr.sdev,
> +				"[SRQ 0x%p]: too many sge's: %d\n",
> +				srq, wr->num_sge);
> +			rv = -EINVAL;
> +			break;
> +		}
> +		rqe->id = wr->wr_id;
> +		rqe->num_sge = wr->num_sge;
> +		siw_copy_sgl(wr->sg_list, rqe->sge, wr->num_sge);
> +
> +		/* Make sure S-RQE is completely written before valid */
> +		smp_wmb();
> +
> +		rqe->flags = SIW_WQE_VALID;
> +
> +		srq->rq_put++;
> +		wr = wr->next;
> +	}
> +out:
> +	if (unlikely(rv < 0)) {
> +		siw_dbg(srq->pd->hdr.sdev, "[SRQ 0x%p]: error %d\n", srq, rv);
> +		*bad_wr = wr;
> +	}
> +	return rv;
> +}
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.h b/drivers/infiniband/sw/siw/siw_verbs.h
> new file mode 100644
> index 000000000000..829f603923cc
> --- /dev/null
> +++ b/drivers/infiniband/sw/siw/siw_verbs.h
> @@ -0,0 +1,119 @@
> +/*
> + * Software iWARP device driver
> + *
> + * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> + *
> + * Copyright (c) 2008-2017, IBM Corporation
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * BSD license below:
> + *
> + *   Redistribution and use in source and binary forms, with or
> + *   without modification, are permitted provided that the following
> + *   conditions are met:
> + *
> + *   - Redistributions of source code must retain the above copyright notice,
> + *     this list of conditions and the following disclaimer.
> + *
> + *   - Redistributions in binary form must reproduce the above copyright
> + *     notice, this list of conditions and the following disclaimer in the
> + *     documentation and/or other materials provided with the distribution.
> + *
> + *   - Neither the name of IBM nor the names of its contributors may be
> + *     used to endorse or promote products derived from this software without
> + *     specific prior written permission.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifndef _SIW_VERBS_H
> +#define _SIW_VERBS_H
> +
> +#include <linux/errno.h>
> +
> +#include <rdma/iw_cm.h>
> +#include <rdma/ib_verbs.h>
> +#include <rdma/ib_smi.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +#include "siw.h"
> +#include "siw_cm.h"
> +
> +
> +extern struct ib_ucontext *siw_alloc_ucontext(struct ib_device *ibdev,
> +					      struct ib_udata *udata);
> +extern int siw_dealloc_ucontext(struct ib_ucontext *ucontext);
> +extern int siw_query_port(struct ib_device *ibdev, u8 port,
> +			  struct ib_port_attr *attr);
> +extern int siw_get_port_immutable(struct ib_device *ibdev, u8 port,
> +				  struct ib_port_immutable *port_imm);
> +extern int siw_query_device(struct ib_device *ibdev,
> +			    struct ib_device_attr *attr,
> +			    struct ib_udata *udata);
> +extern struct ib_cq *siw_create_cq(struct ib_device *ibdev,
> +				   const struct ib_cq_init_attr *attr,
> +				   struct ib_ucontext *ucontext,
> +				   struct ib_udata *udata);
> +extern int siw_no_mad(struct ib_device *base_dev, int flags, u8 port,
> +		      const struct ib_wc *wc, const struct ib_grh *grh,
> +		      const struct ib_mad_hdr *in_mad, size_t in_mad_size,
> +		      struct ib_mad_hdr *out_mad, size_t *out_mad_size,
> +		      u16 *outmad_pkey_index);
> +extern int siw_query_port(struct ib_device *ibdev, u8 port,
> +			  struct ib_port_attr *attr);
> +extern int siw_query_pkey(struct ib_device *ibdev, u8 port,
> +			  u16 idx, u16 *pkey);
> +extern int siw_query_gid(struct ib_device *ibdev, u8 port, int idx,
> +			 union ib_gid *gid);
> +extern struct ib_pd *siw_alloc_pd(struct ib_device *ibdev,
> +				  struct ib_ucontext *ucontext,
> +				  struct ib_udata *udata);
> +extern int siw_dealloc_pd(struct ib_pd *pd);
> +extern struct ib_qp *siw_create_qp(struct ib_pd *pd,
> +				  struct ib_qp_init_attr *attr,
> +				   struct ib_udata *udata);
> +extern int siw_query_qp(struct ib_qp *ofa_qp, struct ib_qp_attr *qp_attr,
> +			int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr);
> +extern int siw_verbs_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
> +			       int attr_mask, struct ib_udata *udata);
> +extern int siw_destroy_qp(struct ib_qp *ibqp);
> +extern int siw_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> +			 struct ib_send_wr **bad_wr);
> +extern int siw_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
> +			    struct ib_recv_wr **bad_wr);
> +extern int siw_destroy_cq(struct ib_cq *ibcq);
> +extern int siw_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
> +extern int siw_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
> +extern struct ib_mr *siw_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
> +				     u64 rnic_va, int rights,
> +				     struct ib_udata *udata);
> +extern struct ib_mr *siw_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
> +				  u32 max_sge);
> +extern struct ib_mr *siw_get_dma_mr(struct ib_pd *ibpd, int rights);
> +extern int siw_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sl,
> +			 int num_sle, unsigned int *sg_off);
> +extern int siw_dereg_mr(struct ib_mr *ibmr);
> +extern struct ib_srq *siw_create_srq(struct ib_pd *ibpd,
> +				     struct ib_srq_init_attr *attr,
> +				     struct ib_udata *udata);
> +extern int siw_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
> +			  enum ib_srq_attr_mask mask, struct ib_udata *udata);
> +extern int siw_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr);
> +extern int siw_destroy_srq(struct ib_srq *ibsrq);
> +extern int siw_post_srq_recv(struct ib_srq *ibsrq, struct ib_recv_wr *wr,
> +			     struct ib_recv_wr **bad_wr);
> +extern int siw_mmap(struct ib_ucontext *ibctx, struct vm_area_struct *vma);
> +
> +extern const struct dma_map_ops siw_dma_generic_ops;
> +
> +#endif
> diff --git a/include/uapi/rdma/siw_user.h b/include/uapi/rdma/siw_user.h
> new file mode 100644
> index 000000000000..85cdc3e2c83d
> --- /dev/null
> +++ b/include/uapi/rdma/siw_user.h
> @@ -0,0 +1,216 @@
> +/*
> + * Software iWARP device driver for Linux
> + *
> + * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> + *
> + * Copyright (c) 2008-2017, IBM Corporation
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * BSD license below:
> + *
> + *   Redistribution and use in source and binary forms, with or
> + *   without modification, are permitted provided that the following
> + *   conditions are met:
> + *
> + *   - Redistributions of source code must retain the above copyright notice,
> + *     this list of conditions and the following disclaimer.
> + *
> + *   - Redistributions in binary form must reproduce the above copyright
> + *     notice, this list of conditions and the following disclaimer in the
> + *     documentation and/or other materials provided with the distribution.
> + *
> + *   - Neither the name of IBM nor the names of its contributors may be
> + *     used to endorse or promote products derived from this software without
> + *     specific prior written permission.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifndef _SIW_USER_H
> +#define _SIW_USER_H
> +
> +#include <linux/types.h>
> +
> +/* Common string that is matched to accept the device by the user library */
> +#define SIW_NODE_DESC_COMMON	"Software iWARP stack"
> +
> +#define SIW_IBDEV_PREFIX "siw_"
> +
> +#define VERSION_ID_SOFTIWARP	2
> +
> +#define SIW_MAX_SGE		6
> +#define SIW_MAX_UOBJ_KEY	0xffffff
> +#define SIW_INVAL_UOBJ_KEY	(SIW_MAX_UOBJ_KEY + 1)
> +
> +struct siw_uresp_create_cq {
> +	__u32	cq_id;
> +	__u32	num_cqe;
> +	__u32	cq_key;
> +};
> +
> +struct siw_uresp_create_qp {
> +	__u32	qp_id;
> +	__u32	num_sqe;
> +	__u32	num_rqe;
> +	__u32	sq_key;
> +	__u32	rq_key;
> +};
> +
> +struct siw_ureq_reg_mr {
> +	__u8	stag_key;
> +	__u8	reserved[3];
> +};
> +
> +struct siw_uresp_reg_mr {
> +	__u32	stag;
> +};
> +
> +struct siw_uresp_create_srq {
> +	__u32	num_rqe;
> +	__u32	srq_key;
> +};
> +
> +struct siw_uresp_alloc_ctx {
> +	__u32	dev_id;
> +};
> +
> +enum siw_opcode {
> +	SIW_OP_WRITE		= 0,
> +	SIW_OP_READ		= 1,
> +	SIW_OP_READ_LOCAL_INV	= 2,
> +	SIW_OP_SEND		= 3,
> +	SIW_OP_SEND_WITH_IMM	= 4,
> +	SIW_OP_SEND_REMOTE_INV	= 5,
> +
> +	/* Unsupported */
> +	SIW_OP_FETCH_AND_ADD	= 6,
> +	SIW_OP_COMP_AND_SWAP	= 7,
> +
> +	SIW_OP_RECEIVE		= 8,
> +	/* provider internal SQE */
> +	SIW_OP_READ_RESPONSE	= 9,
> +	/*
> +	 * below opcodes valid for
> +	 * in-kernel clients only
> +	 */
> +	SIW_OP_INVAL_STAG	= 10,
> +	SIW_OP_REG_MR		= 11,
> +	SIW_NUM_OPCODES		= 12
> +};
> +
> +/* Keep it same as ibv_sge to allow for memcpy */
> +struct siw_sge {
> +	__u64	laddr;
> +	__u32	length;
> +	__u32	lkey;
> +};
> +
> +/*
> + * Inline data are kept within the work request itself occupying
> + * the space of sge[1] .. sge[n]. Therefore, inline data cannot be
> + * supported if SIW_MAX_SGE is below 2 elements.
> + */
> +#define SIW_MAX_INLINE	(sizeof(struct siw_sge) * (SIW_MAX_SGE - 1))
> +
> +#if SIW_MAX_SGE < 2
> +#error "SIW_MAX_SGE must be at least 2"
> +#endif

This could be done with BUILD_BUG_ON().
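
Since siw_user.h is a uapi header, BUILD_BUG_ON() cannot go into the
header itself; a sketch of how the check could look in kernel-only code
instead (the function name and placement are assumptions):

#include <linux/build_bug.h>

/* Compile-time guard replacing the #if/#error above; could be called
 * from (or folded into) the driver's module init path.
 */
static inline void siw_check_uapi_limits(void)
{
	BUILD_BUG_ON(SIW_MAX_SGE < 2);
	BUILD_BUG_ON(SIW_MAX_INLINE < sizeof(struct siw_sge));
}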

> +
> +enum siw_wqe_flags {
> +	SIW_WQE_VALID           = 1,
> +	SIW_WQE_INLINE          = (1 << 1),
> +	SIW_WQE_SIGNALLED       = (1 << 2),
> +	SIW_WQE_SOLICITED       = (1 << 3),
> +	SIW_WQE_READ_FENCE	= (1 << 4),
> +	SIW_WQE_COMPLETED       = (1 << 5)
> +};
> +
> +/* Send Queue Element */
> +struct siw_sqe {
> +	__u64	id;
> +	__u16	flags;
> +	__u8	num_sge;
> +	/* Contains enum siw_opcode values */
> +	__u8	opcode;
> +	__u32	rkey;
> +	union {
> +		__u64	raddr;
> +		__u64	base_mr;
> +	};
> +	union {
> +		struct siw_sge	sge[SIW_MAX_SGE];
> +		__u32	access;
> +	};
> +};
> +
> +/* Receive Queue Element */
> +struct siw_rqe {
> +	__u64	id;
> +	__u16	flags;
> +	__u8	num_sge;
> +	/*
> +	 * only used by kernel driver,
> +	 * ignored if set by user
> +	 */
> +	__u8	opcode;
> +	__u32	unused;
> +	struct siw_sge	sge[SIW_MAX_SGE];
> +};
> +
> +enum siw_notify_flags {
> +	SIW_NOTIFY_NOT			= (0),
> +	SIW_NOTIFY_SOLICITED		= (1 << 0),
> +	SIW_NOTIFY_NEXT_COMPLETION	= (1 << 1),
> +	SIW_NOTIFY_MISSED_EVENTS	= (1 << 2),
> +	SIW_NOTIFY_ALL = SIW_NOTIFY_SOLICITED |
> +			SIW_NOTIFY_NEXT_COMPLETION |
> +			SIW_NOTIFY_MISSED_EVENTS
> +};
> +
> +enum siw_wc_status {
> +	SIW_WC_SUCCESS		= 0,
> +	SIW_WC_LOC_LEN_ERR	= 1,
> +	SIW_WC_LOC_PROT_ERR	= 2,
> +	SIW_WC_LOC_QP_OP_ERR	= 3,
> +	SIW_WC_WR_FLUSH_ERR	= 4,
> +	SIW_WC_BAD_RESP_ERR	= 5,
> +	SIW_WC_LOC_ACCESS_ERR	= 6,
> +	SIW_WC_REM_ACCESS_ERR	= 7,
> +	SIW_WC_REM_INV_REQ_ERR	= 8,
> +	SIW_WC_GENERAL_ERR	= 9,
> +	SIW_NUM_WC_STATUS	= 10
> +};
> +
> +struct siw_cqe {
> +	__u64	id;
> +	__u8	flags;
> +	__u8	opcode;
> +	__u16	status;
> +	__u32	bytes;
> +	__u64	imm_data;
> +	/* QP number or QP pointer */
> +	union {
> +		void	*qp;
> +		__u64	qp_id;
> +	};
> +};
> +
> +/*
> + * Shared structure between user and kernel
> + * to control CQ arming.
> + */
> +struct siw_cq_ctrl {
> +	enum siw_notify_flags	notify;
> +};
> +
> +#endif
> --
> 2.13.6
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]           ` <20180123164316.GC30670-uk2M96/98Pc@public.gmane.org>
  2018-01-23 16:58             ` Steve Wise
@ 2018-01-23 17:21             ` Bernard Metzler
  1 sibling, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-23 17:21 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA


-----Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote: -----

>To: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>From: Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org>
>Date: 01/23/2018 05:43PM
>Cc: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
>linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>Subject: Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network
>and RDMA subsystem
>
>On Tue, Jan 23, 2018 at 10:33:48AM -0600, Steve Wise wrote:
>
>> > +/* Restrict usage of GSO, if hardware peer iwarp is unable to process
>> > + * large packets. gso_seg_limit = 1 lets siw send only packets up to
>> > + * one real MTU in size, but severely limits maximum bandwidth.
>> > + * gso_seg_limit = 0 makes use of GSO (and more than doubles throughput
>> > + * for large transfers).
>> > + */
>> > +const int gso_seg_limit;
>> > +
>> 
>> The GSO configuration needs to default to enable interoperation
>with all
>> vendors (and comply with the RFCs).  So make it 1 please.
>
>The thing we call GSO in the netstack should be totally transparent
>on
>the wire, so at the very least this needs some additional
>elaboration.
>
>> Jason, would configfs be a reasonable way to allow tweaking these
>globals?
>
>Why would we ever even bother to support a mode that is
>non-conformant
>on the wire? Just remove it..
>

siw as a software RDMA provider benefits from GSO a lot.
Building 8 frames with 8 headers and trailers brings max
throughput down to some 3.5 GB/s. Using GSO and shipping
64K at a time, we see 7.8 GB/s or so. That is what TCP on
the socket API has to offer.

Strictly speaking, siw reads the current segment size from TCP
and uses whatever it gets. Older Chelsio adapters were able
to process large packets; the latest HW apparently is not.

I was thinking of leaving the frame size at one MTU by default,
but checking whether the peer (another siw?) can handle more.
There are sufficient spare bits in the MPA req/rep header to
negotiate that, and hardware shall ignore those bits ;)
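
To make that trade-off concrete, a rough sketch of how the transmit path
could pick the FPDU payload size (siw_max_fpdu_payload() and MPA_HDR_SPACE
are made-up names for this example; mss_cache and sk_gso_max_size are the
actual kernel socket fields):

	#include <net/sock.h>
	#include <net/tcp.h>

	#define MPA_HDR_SPACE	8	/* MPA header/CRC overhead, illustrative value */

	static u32 siw_max_fpdu_payload(struct sock *sk, bool peer_takes_large_frames)
	{
		u32 limit = tcp_sk(sk)->mss_cache;	/* one wire MSS: interoperable default */

		if (peer_takes_large_frames)		/* e.g. negotiated via spare MPA bits */
			limit = min_t(u32, sk->sk_gso_max_size, 64 * 1024);

		return limit - MPA_HDR_SPACE;
	}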

siw can never fully comply with what a hardware iwarp
implementation wants to see - it sits on top of a TCP stream.
Under load, we see lots of frame fragmentation somewhere in
the middle: sometimes just the header gets glued after the
trailer of the previous frame in one TCP segment, sometimes
the wire frame breaks in the middle of the data part, sometimes
all data of a DDP frame are there but the trailer checksum is
still missing... the TCP send window is something siw cannot
influence.

Thanks,
Bernard.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
  2018-01-23 16:43         ` Jason Gunthorpe
       [not found]           ` <20180123164316.GC30670-uk2M96/98Pc@public.gmane.org>
@ 2018-01-23 17:23           ` Bernard Metzler
       [not found]             ` <OF34749881.4216E578-ON0025821E.005F897C-0025821E.005F8983-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
  1 sibling, 1 reply; 44+ messages in thread
From: Bernard Metzler @ 2018-01-23 17:23 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Jason Gunthorpe', linux-rdma-u79uwXL29TY76Z2rM5mHXA



---
Bernard Metzler, PhD
Tech. Leader High Performance I/O, Principal Research Staff
IBM Zurich Research Laboratory
Saeumerstrasse 4
CH-8803 Rueschlikon, Switzerland
+41 44 724 8605

-----"Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote: -----

>To: "'Jason Gunthorpe'" <jgg-uk2M96/98Pc@public.gmane.org>
>From: "Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>Date: 01/23/2018 05:58PM
>Cc: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
><linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>Subject: RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network
>and RDMA subsystem
>
>> 
>> On Tue, Jan 23, 2018 at 10:33:48AM -0600, Steve Wise wrote:
>> 
>> > > +/* Restrict usage of GSO, if hardware peer iwarp is unable to process
>> > > + * large packets. gso_seg_limit = 1 lets siw send only packets up to
>> > > + * one real MTU in size, but severely limits maximum bandwidth.
>> > > + * gso_seg_limit = 0 makes use of GSO (and more than doubles throughput
>> > > + * for large transfers).
>> > > + */
>> > > +const int gso_seg_limit;
>> > > +
>> >
>> > The GSO configuration needs to default to enable interoperation
>with all
>> > vendors (and comply with the RFCs).  So make it 1 please.
>> 
>> The thing we call GSO in the netstack should be totally transparent
>on
>> the wire, so at the very least this needs some additional
>elaboration.
>> 
>
>From: https://tools.ietf.org/html/rfc5041#section-5.2
>
>"At the Data Source, the DDP layer MUST segment the data contained in
>   a ULP message into a series of DDP Segments, where each DDP
>Segment
>   contains a DDP Header and ULP Payload, and MUST be no larger than
>the
>   MULPDU value Advertised by the LLP."

TCP advertises 64k on the kernel socket ;)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]                 ` <20180123170517.GE30670-uk2M96/98Pc@public.gmane.org>
@ 2018-01-23 17:24                   ` Steve Wise
  2018-01-23 17:28                     ` Jason Gunthorpe
       [not found]                     ` <20180123172830.GF30670-uk2M96/98Pc@public.gmane.org>
  0 siblings, 2 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-23 17:24 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> On Tue, Jan 23, 2018 at 10:58:01AM -0600, Steve Wise wrote:
> 
> > From: https://tools.ietf.org/html/rfc5041#section-5.2
> >
> > "At the Data Source, the DDP layer MUST segment the data contained in
> >    a ULP message into a series of DDP Segments, where each DDP Segment
> >    contains a DDP Header and ULP Payload, and MUST be no larger than the
> >    MULPDU value Advertised by the LLP."
> >
> > Where MULDPDU is the maximum ULP PDU that will fit in the TCP MSS...
> 
> But exceeding the MULPDU has nothing to do with the netstack GSO
> function.. right? GSO is entirely a local node optimization that
> should not be detectable on the wire.

It is not detectable by TCP on the wire; however, the iWARP protocols,
which impose message boundaries among other things, require that an iWARP
PDU fits in a single TCP segment.  Since softiwarp is building the iWARP
PDU, if it builds one based on a 64K GSO-advertised MSS, then the resulting
wire packets will have many TCP segments all containing parts of a single
iWARP PDU, which violates the spec I quoted.
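
(For a rough illustration, assuming a standard 1500-byte Ethernet MTU and
thus a wire MSS of about 1460 bytes: a single 64K FPDU would then be spread
over roughly 45 TCP segments, only the first of which starts with an MPA
header.)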

> 
> > > > Jason, would configfs be a reasonable way to allow tweaking these
> > globals?
> > >
> > > Why would we ever even bother to support a mode that is non-conformant
> > > on the wire? Just remove it..
> >
> > For soft iwarp, the throughput is greatly increased with allowing these
> > iWARP protocol PDUs to span many TCP segments.  So there could be an
> > argument for leaving it as a knob for soft iwarp <-> soft iwarp
> > configurations.  But IMO it needs to default to RFC compliance and
> > interoperability with hw implementations.
> 
> IMHO the purpose of things like rxe and siw is not maximum raw
> performance (certainly not during the initial kernel accept phase), 

I agree.

> so
> it should not include any non-conformant cruft..
> 

IMO leaving it as a non-default-enabled knob is ok in this case.  

Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
  2018-01-23 17:24                   ` Steve Wise
@ 2018-01-23 17:28                     ` Jason Gunthorpe
       [not found]                     ` <20180123172830.GF30670-uk2M96/98Pc@public.gmane.org>
  1 sibling, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2018-01-23 17:28 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Jan 23, 2018 at 11:24:24AM -0600, Steve Wise wrote:
> > 
> > On Tue, Jan 23, 2018 at 10:58:01AM -0600, Steve Wise wrote:
> > 
> > > From: https://tools.ietf.org/html/rfc5041#section-5.2
> > >
> > > "At the Data Source, the DDP layer MUST segment the data contained in
> > >    a ULP message into a series of DDP Segments, where each DDP Segment
> > >    contains a DDP Header and ULP Payload, and MUST be no larger than the
> > >    MULPDU value Advertised by the LLP."
> > >
> > > Where MULDPDU is the maximum ULP PDU that will fit in the TCP MSS...
> > 
> > But exceeding the MULPDU has nothing to do with the netstack GSO
> > function.. right? GSO is entirely a local node optimization that
> > should not be detectable on the wire.
> 
> It is not detectable by TCP on the wire, however the iWARP protocols that
> impose message boundaries, among other things, require that the iWARP PDU
> fits in a single TCP segment.  Since softiwarp is building the iwarp PDU, if
> it builds one based on a 64K GSO advertised MSS, then the resulting wire
> packets will have man TCP segments all containing parts of a single iWARP
> PDU, which violates the spec I quoted.

But that still has nothing to do with GSO, can't you GSO up to MULPDU?

Isn't the issue here more that, as Bernard says, siw is totally broken
since it can't control the TCP layer segmentation boundaries? :(

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 00/13] Request for Comments on SoftiWarp
  2018-01-14 22:35 [PATCH v3 00/13] Request for Comments on SoftiWarp Bernard Metzler
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
@ 2018-01-23 17:31 ` Bernard Metzler
  2018-02-02 14:37 ` Bernard Metzler
  2 siblings, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-23 17:31 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA



-----"Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote: -----

>To: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
><linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>From: "Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>Date: 01/17/2018 05:07PM
>Subject: RE: [PATCH v3 00/13] Request for Comments on SoftiWarp
>
>Hey Bernard,
>
>> 
>> This patch set contributes a new version of the SoftiWarp
>> driver, as originally introduced Oct 6th, 2017.
>> 
>
>
>Two things I think will help with review:
>
>1) provide a github linux kernel repo with this series added.  That
>allows
>reviewers to easily use both patches and a full kernel tree for
>reviewing.
>

I have cloned https://github.com/torvalds/linux,
patched it and pushed it to
https://github.com/zrlio/softiwarp-for-linux-rdma


I will make another clone available for the patched linux-rdma user
level part of it.


>2) provide the userlib patches (and a github repo) that works with
>the
>latest upstream rdma-core.  That will allow folks to test use-side
>rdma as
>part of their review.
>
>I'll review the individual patches in detail asap.
>
>Thanks!
>
>Steve.
>
>
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]                     ` <20180123172830.GF30670-uk2M96/98Pc@public.gmane.org>
@ 2018-01-23 17:39                       ` Bernard Metzler
  2018-01-23 17:42                       ` Steve Wise
  1 sibling, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-23 17:39 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA


-----Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote: -----

>To: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>From: Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org>
>Date: 01/23/2018 06:28PM
>Cc: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
>linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>Subject: Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network
>and RDMA subsystem
>
>On Tue, Jan 23, 2018 at 11:24:24AM -0600, Steve Wise wrote:
>> > 
>> > On Tue, Jan 23, 2018 at 10:58:01AM -0600, Steve Wise wrote:
>> > 
>> > > From: https://tools.ietf.org/html/rfc5041#section-5.2
>> > >
>> > > "At the Data Source, the DDP layer MUST segment the data
>contained in
>> > >    a ULP message into a series of DDP Segments, where each DDP
>Segment
>> > >    contains a DDP Header and ULP Payload, and MUST be no larger
>than the
>> > >    MULPDU value Advertised by the LLP."
>> > >
>> > > Where MULDPDU is the maximum ULP PDU that will fit in the TCP
>MSS...
>> > 
>> > But exceeding the MULPDU has nothing to do with the netstack GSO
>> > function.. right? GSO is entirely a local node optimization that
>> > should not be detectable on the wire.
>> 
>> It is not detectable by TCP on the wire, however the iWARP
>protocols that
>> impose message boundaries, among other things, require that the
>iWARP PDU
>> fits in a single TCP segment.  Since softiwarp is building the
>iwarp PDU, if
>> it builds one based on a 64K GSO advertised MSS, then the resulting
>wire
>> packets will have man TCP segments all containing parts of a single
>iWARP
>> PDU, which violates the spec I quoted.
>
>But that still has nothing to do with GSO, can't you GSO up to
>MULPDU?
>
>Isn't the issue here more that, as Bernard says, siw is totally
>broken
>since it can't control the TCP layer segmentation boundaries? :(
>

Totally broken? Hmm.

Not integrating siw with the kernel TCP code makes it impossible
to avoid these things. Can kernel iSCSI avoid fragmentation and
mis-alignment? No.
I wanted to avoid tweaking the kernel TCP code, since that would
make siw unacceptable.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]                     ` <20180123172830.GF30670-uk2M96/98Pc@public.gmane.org>
  2018-01-23 17:39                       ` Bernard Metzler
@ 2018-01-23 17:42                       ` Steve Wise
  2018-01-23 17:47                         ` Jason Gunthorpe
  1 sibling, 1 reply; 44+ messages in thread
From: Steve Wise @ 2018-01-23 17:42 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> On Tue, Jan 23, 2018 at 11:24:24AM -0600, Steve Wise wrote:
> > >
> > > On Tue, Jan 23, 2018 at 10:58:01AM -0600, Steve Wise wrote:
> > >
> > > > From: https://tools.ietf.org/html/rfc5041#section-5.2
> > > >
> > > > "At the Data Source, the DDP layer MUST segment the data contained
in
> > > >    a ULP message into a series of DDP Segments, where each DDP
Segment
> > > >    contains a DDP Header and ULP Payload, and MUST be no larger than
the
> > > >    MULPDU value Advertised by the LLP."
> > > >
> > > > Where MULDPDU is the maximum ULP PDU that will fit in the TCP MSS...
> > >
> > > But exceeding the MULPDU has nothing to do with the netstack GSO
> > > function.. right? GSO is entirely a local node optimization that
> > > should not be detectable on the wire.
> >
> > It is not detectable by TCP on the wire, however the iWARP protocols
that
> > impose message boundaries, among other things, require that the iWARP
PDU
> > fits in a single TCP segment.  Since softiwarp is building the iwarp
PDU, if
> > it builds one based on a 64K GSO advertised MSS, then the resulting wire
> > packets will have man TCP segments all containing parts of a single
iWARP
> > PDU, which violates the spec I quoted.
> 
> But that still has nothing to do with GSO, can't you GSO up to MULPDU?
> 

I don't understand your question.

> Isn't the issue here more that, as Bernard says, siw is totally broken
> since it can't control the TCP layer segmentation boundaries? :(
> 

Since the iwarp protocols run on top of TCP/IP, there is always the case
that some middle box resegments TCP segments differently, so a good iWARP HW
implementation should deal with funny alignments, partial iWARP PDUs
arriving, etc.   But the RFCs, as I read them, want implementations to try
"really hard" to avoid spanning an iWARP PDU across many TCP segments.  And
I think siw should do the same, by default.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]             ` <OF34749881.4216E578-ON0025821E.005F897C-0025821E.005F8983-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
@ 2018-01-23 17:43               ` Steve Wise
  0 siblings, 0 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-23 17:43 UTC (permalink / raw)
  To: 'Bernard Metzler'
  Cc: 'Jason Gunthorpe', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> >"At the Data Source, the DDP layer MUST segment the data contained in
> >   a ULP message into a series of DDP Segments, where each DDP
> >Segment
> >   contains a DDP Header and ULP Payload, and MUST be no larger than
> >the
> >   MULPDU value Advertised by the LLP."
> 
> TCP advertises 64k on the kernel socket ;)

Yes, but it also advertises the actual wire MSS, right?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
  2018-01-23 17:42                       ` Steve Wise
@ 2018-01-23 17:47                         ` Jason Gunthorpe
       [not found]                           ` <20180123174758.GH30670-uk2M96/98Pc@public.gmane.org>
  2018-01-23 18:21                           ` Bernard Metzler
  0 siblings, 2 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2018-01-23 17:47 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Jan 23, 2018 at 11:42:20AM -0600, Steve Wise wrote:

> Since the iwarp protocols run on top of TCP/IP, there is always the case
> that some middle box resegments tcp segments differently, so a good iwarp HW
> implementation should deal with funny alignments, partial iWARP PDUs
> arriving, etc.   But the RFCs, as I read them, want implementations to try
> "really hard" to avoid spanning an iWARP PDU across many TCP segments.  And
> I think siw should do the same, by default.

But Bernard just said siw doesn't interoperate in certain cases
because of this - so that sounds like more than 'try really hard' ??

Or is that an overstatement and it just makes the rx side slower if
segmentation is not optimal?

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
       [not found]                           ` <20180123174758.GH30670-uk2M96/98Pc@public.gmane.org>
@ 2018-01-23 17:55                             ` Steve Wise
  0 siblings, 0 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-23 17:55 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> On Tue, Jan 23, 2018 at 11:42:20AM -0600, Steve Wise wrote:
> 
> > Since the iwarp protocols run on top of TCP/IP, there is always the case
> > that some middle box resegments tcp segments differently, so a good
iwarp
> HW
> > implementation should deal with funny alignments, partial iWARP PDUs
> > arriving, etc.   But the RFCs, as I read them, want implementations to
try
> > "really hard" to avoid spanning an iWARP PDU across many TCP segments.
> And
> > I think siw should do the same, by default.
> 
> But Bernard just said siw doesn't interoperate in certain cases
> because of this - so that sounds like more than 'try really hard' ??
> 
> Or is that an overstatement and it just makes the rx side slower if
> segmentation is not optimal?

Creating 64K iWARP PDUs causes interoperability problems.  If siw builds
iWARP PDUs that fit within the TCP wire MSS, then these problems are
avoided, and siw becomes more spec compliant.  If the iWARP PDUs are built
this way, then nothing the TCP stack does will cause problems other than
slowing things down.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 05/13] SoftiWarp application interface
  2018-01-14 22:35   ` [PATCH v3 05/13] SoftiWarp application interface Bernard Metzler
       [not found]     ` <20180114223603.19961-6-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
@ 2018-01-23 18:07     ` Bernard Metzler
       [not found]       ` <OF7390EC15.6C2741F2-ON0025821E.0062D222-0025821E.00639402-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
  1 sibling, 1 reply; 44+ messages in thread
From: Bernard Metzler @ 2018-01-23 18:07 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA



-----"Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote: -----

>To: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
><linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>From: "Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>Date: 01/23/2018 06:19PM
>Subject: RE: [PATCH v3 05/13] SoftiWarp application interface
>
>> 
>> Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
>> ---
>>  drivers/infiniband/sw/siw/siw_ae.c    |  120 +++
>>  drivers/infiniband/sw/siw/siw_verbs.c | 1876
>> +++++++++++++++++++++++++++++++++
>>  drivers/infiniband/sw/siw/siw_verbs.h |  119 +++
>>  include/uapi/rdma/siw_user.h          |  216 ++++
>>  4 files changed, 2331 insertions(+)
>>  create mode 100644 drivers/infiniband/sw/siw/siw_ae.c
>>  create mode 100644 drivers/infiniband/sw/siw/siw_verbs.c
>>  create mode 100644 drivers/infiniband/sw/siw/siw_verbs.h
>>  create mode 100644 include/uapi/rdma/siw_user.h
>> 
>> diff --git a/drivers/infiniband/sw/siw/siw_ae.c
>> b/drivers/infiniband/sw/siw/siw_ae.c
>> new file mode 100644
>> index 000000000000..eac55ce7f56d
>> --- /dev/null
>> +++ b/drivers/infiniband/sw/siw/siw_ae.c
>> @@ -0,0 +1,120 @@
>> +/*
>> + * Software iWARP device driver
>> + *
>> + * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
>> + *
>> + * Copyright (c) 2008-2017, IBM Corporation
>> + *
>> + * This software is available to you under a choice of one of two
>> + * licenses. You may choose to be licensed under the terms of the
>GNU
>> + * General Public License (GPL) Version 2, available from the file
>> + * COPYING in the main directory of this source tree, or the
>> + * BSD license below:
>> + *
>> + *   Redistribution and use in source and binary forms, with or
>> + *   without modification, are permitted provided that the
>following
>> + *   conditions are met:
>> + *
>> + *   - Redistributions of source code must retain the above
>copyright
>notice,
>> + *     this list of conditions and the following disclaimer.
>> + *
>> + *   - Redistributions in binary form must reproduce the above
>copyright
>> + *     notice, this list of conditions and the following
>disclaimer in
>the
>> + *     documentation and/or other materials provided with the
>distribution.
>> + *
>> + *   - Neither the name of IBM nor the names of its contributors
>may be
>> + *     used to endorse or promote products derived from this
>software
>without
>> + *     specific prior written permission.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
>> OF
>> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
>> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
>> HOLDERS
>> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
>AN
>> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
>> IN
>> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> THE
>> + * SOFTWARE.
>> + */
>> +
>> +#include <linux/errno.h>
>> +#include <linux/types.h>
>> +#include <linux/net.h>
>> +#include <linux/scatterlist.h>
>> +#include <linux/highmem.h>
>> +#include <net/sock.h>
>> +#include <net/tcp_states.h>
>> +#include <net/tcp.h>
>> +
>> +#include <rdma/iw_cm.h>
>> +#include <rdma/ib_verbs.h>
>> +#include <rdma/ib_smi.h>
>> +#include <rdma/ib_user_verbs.h>
>> +
>> +#include "siw.h"
>> +#include "siw_obj.h"
>> +#include "siw_cm.h"
>> +
>> +void siw_qp_event(struct siw_qp *qp, enum ib_event_type etype)
>> +{
>> +	struct ib_event event;
>> +	struct ib_qp	*base_qp = &qp->base_qp;
>> +
>> +	/*
>> +	 * Do not report asynchronous errors on QP which gets
>> +	 * destroyed via verbs interface (siw_destroy_qp())
>> +	 */
>> +	if (qp->attrs.flags & SIW_QP_IN_DESTROY)
>> +		return;
>> +
>> +	event.event = etype;
>> +	event.device = base_qp->device;
>> +	event.element.qp = base_qp;
>> +
>> +	if (base_qp->event_handler) {
>> +		siw_dbg_qp(qp, "reporting event %d\n", etype);
>> +		(*base_qp->event_handler)(&event, base_qp->qp_context);
>> +	}
>> +}
>> +
>> +void siw_cq_event(struct siw_cq *cq, enum ib_event_type etype)
>> +{
>> +	struct ib_event event;
>> +	struct ib_cq	*base_cq = &cq->base_cq;
>> +
>> +	event.event = etype;
>> +	event.device = base_cq->device;
>> +	event.element.cq = base_cq;
>> +
>> +	if (base_cq->event_handler) {
>> +		siw_dbg(cq->hdr.sdev, "reporting CQ event %d\n", etype);
>> +		(*base_cq->event_handler)(&event, base_cq->cq_context);
>> +	}
>> +}
>> +
>
>The rdma provider must ensure that event upcalls are serialized per
>object.
>Is this being done somewhere?  See "Callbacks" in
>Documentation/infiniband/core_locking.txt.
>

That says that if multiple QPs etc. contribute to a CQ, only one
CQ event handler is allowed to run? That is not there yet; I would
have to put a lock around it...
>
>
>> +void siw_srq_event(struct siw_srq *srq, enum ib_event_type etype)
>> +{
>> +	struct ib_event event;
>> +	struct ib_srq	*base_srq = &srq->base_srq;
>> +
>> +	event.event = etype;
>> +	event.device = base_srq->device;
>> +	event.element.srq = base_srq;
>> +
>> +	if (base_srq->event_handler) {
>> +		siw_dbg(srq->pd->hdr.sdev, "reporting SRQ event %d\n",
>> etype);
>> +		(*base_srq->event_handler)(&event, base_srq->srq_context);
>> +	}
>> +}
>> +

...


>> +	/*
>> +	 * Try to acquire QP state lock. Must be non-blocking
>> +	 * to accommodate kernel clients needs.
>> +	 */
>> +	if (!down_read_trylock(&qp->state_lock)) {
>> +		*bad_wr = wr;
>> +		return -ENOTCONN;
>> +	}
>> +
>
>Under what conditions does down_read_trylock() return 0?  This seems
>like a
>kernel ULP that has multiple threads posting to the sq might get an
>error
>due to lock contention?  

This lock is only taken for writing when the QP is transitioned into
another state. If we get here, we assume the QP to be in RTS state.
If the lock is not available, someone else is just moving the QP out
of RTS, and we should not do much here.

We use trylock() since we cannot sleep in some of our contexts...

>
>> +	if (unlikely(qp->attrs.state != SIW_QP_STATE_RTS)) {
>> +		up_read(&qp->state_lock);
>> +		*bad_wr = wr;
>> +		return -ENOTCONN;
>> +	}
>
>NOTCONN is inappropriate here too.  Also, I relaxed iw_cxgb4 to
>support
>ib_drain_qp().  Basically if the kernel QP is in ERROR,  a post will
>be
>completed via a CQE inserted with status IB_WC_WR_FLUSH_ERR.  See how
>ib_drain_qp() works for details.  The reason I bring this up, is that
>both
>iser and nvmf ULPs use ib_drain_sq/rq/qp() so you might want to
>support
>them.  There is an alternate solution though.  ib_drain_sq/rq()
>allows the
>low level driver to have its own drain implementation.  See those
>functions
>for details.
>

Ok, will look into it.

siw has methods to drain the send and receive queues
(siw_sq_flush()/siw_rq_flush()).

If this is a requirement and an immediate error is not appropriate,
we could complete the failing work request to the CQ for later draining.
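
As a sketch only - assuming siw_sq_flush()/siw_rq_flush() complete all
pending work requests with IB_WC_WR_FLUSH_ERR, and with siw_qp_base2siw()
being a made-up conversion helper - hooking them into the core drain path
could look like:

	static void siw_drain_sq(struct ib_qp *base_qp)
	{
		siw_sq_flush(siw_qp_base2siw(base_qp));
	}

	static void siw_drain_rq(struct ib_qp *base_qp)
	{
		siw_rq_flush(siw_qp_base2siw(base_qp));
	}

	/* at device registration time: */
	base_dev->drain_sq = siw_drain_sq;
	base_dev->drain_rq = siw_drain_rq;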

>
>> +	if (wr && !qp->kernel_verbs) {
>> +		siw_dbg_qp(qp, "wr must be empty for user mapped sq\n");
>> +		up_read(&qp->state_lock);
>> +		*bad_wr = wr;
>> +		return -EINVAL;
>> +	}

...


>> +#define SIW_MAX_INLINE	(sizeof(struct siw_sge) * (SIW_MAX_SGE -
>1))
>> +
>> +#if SIW_MAX_SGE < 2
>> +#error "SIW_MAX_SGE must be at least 2"
>> +#endif
>
>This could be done with BUILD_BUG_ON().
>

Right. will do!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 05/13] SoftiWarp application interface
       [not found]       ` <OF7390EC15.6C2741F2-ON0025821E.0062D222-0025821E.00639402-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
@ 2018-01-23 18:12         ` Steve Wise
  0 siblings, 0 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-23 18:12 UTC (permalink / raw)
  To: 'Bernard Metzler'; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

> >The rdma provider must ensure that event upcalls are serialized per
> >object.
> >Is this being done somewhere?  See "Callbacks" in
> >Documentation/infiniband/core_locking.txt.
> >
> 
> That says, if multiple QP's etc. contribute to a CQ, only one
> CQ event handler is allowed to run? That's not there, I would have
> to make a lock around...

Correct.  Basically, for a given CQ object, the driver can only have a single outstanding call to the ULP's event handler for that CQ at any point in time.  
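
For illustration, one way to get that for the CQ case, assuming a per-CQ
spinlock is added (cq->event_lock below is not in the posted patch):

	void siw_cq_event(struct siw_cq *cq, enum ib_event_type etype)
	{
		struct ib_event event;
		struct ib_cq *base_cq = &cq->base_cq;
		unsigned long flags;

		event.event = etype;
		event.device = base_cq->device;
		event.element.cq = base_cq;

		/* at most one upcall in flight per CQ */
		spin_lock_irqsave(&cq->event_lock, flags);
		if (base_cq->event_handler)
			(*base_cq->event_handler)(&event, base_cq->cq_context);
		spin_unlock_irqrestore(&cq->event_lock, flags);
	}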

...

> >> +	/*
> >> +	 * Try to acquire QP state lock. Must be non-blocking
> >> +	 * to accommodate kernel clients needs.
> >> +	 */
> >> +	if (!down_read_trylock(&qp->state_lock)) {
> >> +		*bad_wr = wr;
> >> +		return -ENOTCONN;
> >> +	}
> >> +
> >
> >Under what conditions does down_read_trylock() return 0?  This seems
> >like a
> >kernel ULP that has multiple threads posting to the sq might get an
> >error
> >due to lock contention?
> 
> This lock is only taken if the QP is to be transitioned into
> another state. If we are here, we assume to be in RTS state.
> If the lock is not available, someone else just moves the QP out
> of RTS and we should not do much here.
> 

Ah, I see.

> We do trylock() since we cannot sleep in some of our contexts...
> 

Right.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem
  2018-01-23 17:47                         ` Jason Gunthorpe
       [not found]                           ` <20180123174758.GH30670-uk2M96/98Pc@public.gmane.org>
@ 2018-01-23 18:21                           ` Bernard Metzler
  1 sibling, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-23 18:21 UTC (permalink / raw)
  To: Steve Wise; +Cc: 'Jason Gunthorpe', linux-rdma-u79uwXL29TY76Z2rM5mHXA


-----"Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote: -----

>To: "'Jason Gunthorpe'" <jgg-uk2M96/98Pc@public.gmane.org>
>From: "Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>Date: 01/23/2018 06:55PM
>Cc: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
><linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>Subject: RE: [PATCH v3 03/13] Attach/detach SoftiWarp to/from network
>and RDMA subsystem
>
>> 
>> On Tue, Jan 23, 2018 at 11:42:20AM -0600, Steve Wise wrote:
>> 
>> > Since the iwarp protocols run on top of TCP/IP, there is always
>the case
>> > that some middle box resegments tcp segments differently, so a
>good
>iwarp
>> HW
>> > implementation should deal with funny alignments, partial iWARP
>PDUs
>> > arriving, etc.   But the RFCs, as I read them, want
>implementations to
>try
>> > "really hard" to avoid spanning an iWARP PDU across many TCP
>segments.
>> And
>> > I think siw should do the same, by default.
>> 
>> But Bernard just said siw doesn't interoperate in certain cases
>> because of this - so that sounds like more than 'try really hard'
>??
>> 
>> Or is that an overstatement and it just makes the rx side slower if
>> segmentation is not optimal?
>
>Creating 64K iWARP PDUs causes interoperability problems.  If siw
>builds
>iWARP PDUs that fit within the TCP wire MSS, then these problems are
>avoided, and siw becomes more spec compliant.  If the iWARP PDUs are
>built
>this way, then nothing the tcp stack does will cause problems other
>than
>slow down things.
>
Right. It's just the slowdown that hurts me. But it may strengthen
the case for buying real iWARP HW for good performance ;)

See, I came at this from the other side - how to make the best use
of kernel services, given that I cannot completely control the wire
shape of the frames I push anyway. But I understand that some HW
cannot deal with it.

The same is true for small messages. If siw puts one 8-byte payload
packet into one TCP frame, we end up with 350k IOPS. If we allow
pulling multiple of them into one frame when the send queue has more
work to do, it goes much higher, and we avoid much extra delay.
I recently removed that optimization, just to avoid discussion of it.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 06/13] SoftiWarp connection management
       [not found]     ` <20180114223603.19961-7-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
@ 2018-01-30 21:27       ` Steve Wise
  0 siblings, 0 replies; 44+ messages in thread
From: Steve Wise @ 2018-01-30 21:27 UTC (permalink / raw)
  To: 'Bernard Metzler', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> ---
>  drivers/infiniband/sw/siw/siw_cm.c | 2184
> ++++++++++++++++++++++++++++++++++++
>  drivers/infiniband/sw/siw/siw_cm.h |  156 +++
>  2 files changed, 2340 insertions(+)
>  create mode 100644 drivers/infiniband/sw/siw/siw_cm.c
>  create mode 100644 drivers/infiniband/sw/siw/siw_cm.h

<snip>

> diff --git a/drivers/infiniband/sw/siw/siw_cm.h
> b/drivers/infiniband/sw/siw/siw_cm.h
> new file mode 100644
> index 000000000000..36a8d15d793a
> --- /dev/null
> +++ b/drivers/infiniband/sw/siw/siw_cm.h
> @@ -0,0 +1,156 @@
> +/*
> + * Software iWARP device driver
> + *
> + * Authors: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
> + *          Greg Joyce <greg-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
> + *
> + * Copyright (c) 2008-2017, IBM Corporation
> + * Copyright (c) 2017, Open Grid Computing, Inc.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * BSD license below:
> + *
> + *   Redistribution and use in source and binary forms, with or
> + *   without modification, are permitted provided that the following
> + *   conditions are met:
> + *
> + *   - Redistributions of source code must retain the above copyright
notice,
> + *     this list of conditions and the following disclaimer.
> + *
> + *   - Redistributions in binary form must reproduce the above copyright
> + *     notice, this list of conditions and the following disclaimer in
the
> + *     documentation and/or other materials provided with the
distribution.
> + *
> + *   - Neither the name of IBM nor the names of its contributors may be
> + *     used to endorse or promote products derived from this software
without
> + *     specific prior written permission.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
> OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
> HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
> IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> THE
> + * SOFTWARE.
> + */
> +
> +#ifndef _SIW_CM_H
> +#define _SIW_CM_H
> +
> +#include <net/sock.h>
> +#include <linux/tcp.h>
> +
> +#include <rdma/iw_cm.h>
> +
> +
> +enum siw_cep_state {
> +	SIW_EPSTATE_IDLE = 1,
> +	SIW_EPSTATE_LISTENING,
> +	SIW_EPSTATE_CONNECTING,
> +	SIW_EPSTATE_AWAIT_MPAREQ,
> +	SIW_EPSTATE_RECVD_MPAREQ,
> +	SIW_EPSTATE_AWAIT_MPAREP,
> +	SIW_EPSTATE_RDMA_MODE,
> +	SIW_EPSTATE_CLOSED
> +};
> +
> +struct siw_mpa_info {
> +	struct mpa_rr		hdr;	/* peer mpa hdr in host byte order */
> +	struct mpa_v2_data	v2_ctrl;
> +	struct mpa_v2_data	v2_ctrl_req;
> +	char			*pdata;
> +	int			bytes_rcvd;
> +};
> +
> +struct siw_llp_info {
> +	struct socket		*sock;
> +	struct sockaddr_in	laddr;	/* redundant with socket info above */
> +	struct sockaddr_in	raddr;	/* ditto, consider removal */
> +	struct siw_sk_upcalls	sk_def_upcalls;
> +};
> +
> +struct siw_device;
> +
> +struct siw_cep {
> +	struct iw_cm_id		*cm_id;
> +	struct siw_device	*sdev;
> +
> +	struct list_head	devq;
> +	/*
> +	 * The provider_data element of a listener IWCM ID
> +	 * refers to a list of one or more listener CEPs
> +	 */
> +	struct list_head	listenq;
> +	struct siw_cep		*listen_cep;
> +	struct siw_qp		*qp;
> +	spinlock_t		lock;
> +	wait_queue_head_t	waitq;
> +	struct kref		ref;
> +	enum siw_cep_state	state;
> +	short			in_use;
> +	struct siw_cm_work	*mpa_timer;
> +	struct list_head	work_freelist;
> +	struct siw_llp_info	llp;
> +	struct siw_mpa_info	mpa;
> +	int			ord;
> +	int			ird;
> +	bool			enhanced_rdma_conn_est;
> +	int			sk_error; /* not (yet) used XXX */
> +
> +	/* Saved upcalls of socket llp.sock */
> +	void	(*sk_state_change)(struct sock *sk);
> +	void	(*sk_data_ready)(struct sock *sk);
> +	void	(*sk_write_space)(struct sock *sk);
> +	void	(*sk_error_report)(struct sock *sk);
> +};
> +
> +/*
> + * Connection initiator waits 10 seconds to receive an
> + * MPA reply after sending out MPA request. Responder waits for
> + * 5 seconds for MPA request to arrive if new TCP connection
> + * was set up.
> + */
> +#define MPAREQ_TIMEOUT	(HZ*10)
> +#define MPAREP_TIMEOUT	(HZ*5)
> +

These timeout times seem pretty low for a WAN link or even a brief network
down event during MPA connection negotiation.  Any particular reason for 10
and 5?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 06/13] SoftiWarp connection management
  2018-01-14 22:35   ` [PATCH v3 06/13] SoftiWarp connection management Bernard Metzler
       [not found]     ` <20180114223603.19961-7-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
@ 2018-01-31 17:28     ` Bernard Metzler
  1 sibling, 0 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-01-31 17:28 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA


-----"Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote: -----

>To: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
><linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>From: "Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>Date: 01/30/2018 10:27PM
>Subject: RE: [PATCH v3 06/13] SoftiWarp connection management
>
>> 
>> Signed-off-by: Bernard Metzler <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
>> ---
>>  drivers/infiniband/sw/siw/siw_cm.c | 2184
>> ++++++++++++++++++++++++++++++++++++
>>  drivers/infiniband/sw/siw/siw_cm.h |  156 +++
>>  2 files changed, 2340 insertions(+)
>>  create mode 100644 drivers/infiniband/sw/siw/siw_cm.c
>>  create mode 100644 drivers/infiniband/sw/siw/siw_cm.h
>
><snip>
>
>> diff --git a/drivers/infiniband/sw/siw/siw_cm.h


<snip>

>> +
>> +/*
>> + * Connection initiator waits 10 seconds to receive an
>> + * MPA reply after sending out MPA request. Responder waits for
>> + * 5 seconds for MPA request to arrive if new TCP connection
>> + * was set up.
>> + */
>> +#define MPAREQ_TIMEOUT	(HZ*10)
>> +#define MPAREP_TIMEOUT	(HZ*5)
>> +
>
>These timeout times seem pretty low for a WAN link or even a brief
>network
>down event during MPA negotiation connection.   Any particular reason
>for 10
>and 5?
>
>
>
I picked those rather small numbers since they cover basically
only the latency after the TCP connection went up; the timer gets
started only after the TCP connection was established. At the client
side, we wait 5 seconds after sending the MPA request to get the
reply. At the server side, we wait 10 seconds for the MPA request,
after we have accepted a new TCP connection.

Maybe I shall make that longer? But the only real delay we have to
deal with is a server side application which thinks too long about
whether it shall 'accept' or 'reject' (maybe that is very artificial,
but I had to deal with cases of an application which never rejected/
accepted).

thanks
Bernard.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v3 00/13] Request for Comments on SoftiWarp
  2018-01-14 22:35 [PATCH v3 00/13] Request for Comments on SoftiWarp Bernard Metzler
       [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
  2018-01-23 17:31 ` Bernard Metzler
@ 2018-02-02 14:37 ` Bernard Metzler
       [not found]   ` <OFD1018BEE.35589194-ON00258228.00505F2B-00258228.00505F31-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
       [not found]   ` <20180202185640.GC9080-uk2M96/98Pc@public.gmane.org>
  2 siblings, 2 replies; 44+ messages in thread
From: Bernard Metzler @ 2018-02-02 14:37 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA


-----"Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote: -----

>To: "'Bernard Metzler'" <bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>,
><linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
>From: "Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>Date: 01/17/2018 05:07PM
>Subject: RE: [PATCH v3 00/13] Request for Comments on SoftiWarp
>
>Hey Bernard,
>
>> 
>> This patch set contributes a new version of the SoftiWarp
>> driver, as originally introduced Oct 6th, 2017.
>> 
>
>
>Two things I think will help with review:
>
>1) provide a github linux kernel repo with this series added.  That
>allows
>reviewers to easily use both patches and a full kernel tree for
>reviewing.
>

as I wrote earlier, please see

git@github.com:zrlio/softiwarp-for-linux-rdma.git


>2) provide the userlib patches (and a github repo) that works with
>the
>latest upstream rdma-core.  That will allow folks to test use-side
>rdma as
>part of their review.
>

I added libsiw to the current (as of today) rdma-core bundle
as found at https://github.com/linux-rdma/rdma-core.git, see

git@github.com:zrlio/softiwarp-user-for-linux-rdma.git

It enables user level clients for the siw driver. Let me
know if we need a patch for user land as well for now.
I don't want to spam the list.

Thanks!
Bernard.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
       [not found]   ` <OFD1018BEE.35589194-ON00258228.00505F2B-00258228.00505F31-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
@ 2018-02-02 18:56     ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2018-02-02 18:56 UTC (permalink / raw)
  To: Bernard Metzler; +Cc: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Feb 02, 2018 at 02:37:52PM +0000, Bernard Metzler wrote:

> I added libsiw to the current (as of today) rmda-core bundle
> as found at https://github.com/linux-rdma/rdma-core.git, see

Can you use the standard OFA license please? IBM is an OFA member and
is beholden to the membership agreement requiring the common dual
license.

> It enables user level clients for the siw driver. Let me
> know if we need a patch for user land as well for now.
> I don't want to spam the list.

We will eventually..

+static const struct verbs_match_ent rnic_table[] = {
+	VERBS_NAME_MATCH("siw", NULL),
+	{},
+};

I really don't want to add more VERBS_NAME_MATCH, but I don't have an
alternative ready right now :(

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
       [not found]   ` <20180202185640.GC9080-uk2M96/98Pc@public.gmane.org>
@ 2018-02-04 20:08     ` Bernard Metzler
       [not found]       ` <OFEF72EEE6.7E5C3FA5-ON0025822A.005B18E4-0025822A.006EA4BB-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
  0 siblings, 1 reply; 44+ messages in thread
From: Bernard Metzler @ 2018-02-04 20:08 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA



-----Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote: -----

>To: Bernard Metzler <BMT-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
>From: Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org>
>Date: 02/02/2018 07:57PM
>Cc: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>,
>linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>Subject: Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
>
>On Fri, Feb 02, 2018 at 02:37:52PM +0000, Bernard Metzler wrote:
>
>> I added libsiw to the current (as of today) rdma-core bundle
>> as found at https://github.com/linux-rdma/rdma-core.git, see
>
>Can you use the standard OFA license please? IBM is an OFA member and
>is beholden to the membership agreement requiring the common dual
>license.
>

Sorry to see that. What exactly is wrong with the current (dual) license
statement from IBM? I checked several other user libs. All come with
some flavor of a dual BSD/GPLv2 license statement like this one,
as far as I can see.

>> It enables user level clients for the siw driver. Let me
>> know if we need a patch for user land as well for now.
>> I don't want to spam the list.
>
>We will eventually..
>
>+static const struct verbs_match_ent rnic_table[] = {
>+	VERBS_NAME_MATCH("siw", NULL),
>+	{},
>+};
>
>I really don't want to add more VERBS_NAME_MATCH, but I don't have an
>alternative ready right now :(
>

Hmm, I blindly followed what others did. This is different from
what libsiw had until now (doing its own sysfs entry reading and
pattern matching). I hope I did it right?
In any case, I hope this current approach allows for co-existence
of user libs from different vendors if heterogeneous hardware is
in use.

Thanks very much!
Bernard.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 00/13] Request for Comments on SoftiWarp
       [not found]       ` <OFEF72EEE6.7E5C3FA5-ON0025822A.005B18E4-0025822A.006EA4BB-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
@ 2018-02-04 20:39         ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2018-02-04 20:39 UTC (permalink / raw)
  To: Bernard Metzler; +Cc: Steve Wise, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sun, Feb 04, 2018 at 08:08:31PM +0000, Bernard Metzler wrote:

> >Can you use the standard OFA license please? IBM is an OFA member and
> >is beholden to the membership agreement requiring the common dual
> >license.
> >
> 
> sorry to see that. What's exactly wrong with the current (dual) license 
> statement from IBM? I checked several other user libs. All come with
> some flavor of a dual BSD/GPLv2 license statement as this one,
> as far as I can see.

The common BSD license does not have the advertising clause. It is not
'some flavour'; the OpenFabrics agreement specifies their specific
flavor of BSD license - see COPYING.BSD_MIT.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2018-02-04 20:39 UTC | newest]

Thread overview: 44+ messages
2018-01-14 22:35 [PATCH v3 00/13] Request for Comments on SoftiWarp Bernard Metzler
     [not found] ` <20180114223603.19961-1-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
2018-01-14 22:35   ` [PATCH v3 01/13] iWARP wire packet format definition Bernard Metzler
2018-01-14 22:35   ` [PATCH v3 02/13] Main SoftiWarp include file Bernard Metzler
2018-01-14 22:35   ` [PATCH v3 03/13] Attach/detach SoftiWarp to/from network and RDMA subsystem Bernard Metzler
     [not found]     ` <20180114223603.19961-4-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
2018-01-23 16:33       ` Steve Wise
2018-01-23 16:43         ` Jason Gunthorpe
     [not found]           ` <20180123164316.GC30670-uk2M96/98Pc@public.gmane.org>
2018-01-23 16:58             ` Steve Wise
2018-01-23 17:05               ` Jason Gunthorpe
     [not found]                 ` <20180123170517.GE30670-uk2M96/98Pc@public.gmane.org>
2018-01-23 17:24                   ` Steve Wise
2018-01-23 17:28                     ` Jason Gunthorpe
     [not found]                     ` <20180123172830.GF30670-uk2M96/98Pc@public.gmane.org>
2018-01-23 17:39                       ` Bernard Metzler
2018-01-23 17:42                       ` Steve Wise
2018-01-23 17:47                         ` Jason Gunthorpe
     [not found]                           ` <20180123174758.GH30670-uk2M96/98Pc@public.gmane.org>
2018-01-23 17:55                             ` Steve Wise
2018-01-23 18:21                           ` Bernard Metzler
2018-01-23 17:21             ` Bernard Metzler
2018-01-23 17:23           ` Bernard Metzler
     [not found]             ` <OF34749881.4216E578-ON0025821E.005F897C-0025821E.005F8983-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
2018-01-23 17:43               ` Steve Wise
2018-01-14 22:35   ` [PATCH v3 04/13] SoftiWarp object management Bernard Metzler
2018-01-14 22:35   ` [PATCH v3 05/13] SoftiWarp application interface Bernard Metzler
     [not found]     ` <20180114223603.19961-6-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
2018-01-23 17:18       ` Steve Wise
2018-01-23 18:07     ` Bernard Metzler
     [not found]       ` <OF7390EC15.6C2741F2-ON0025821E.0062D222-0025821E.00639402-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
2018-01-23 18:12         ` Steve Wise
2018-01-14 22:35   ` [PATCH v3 06/13] SoftiWarp connection management Bernard Metzler
     [not found]     ` <20180114223603.19961-7-bmt-OA+xvbQnYDHMbYB6QlFGEg@public.gmane.org>
2018-01-30 21:27       ` Steve Wise
2018-01-31 17:28     ` Bernard Metzler
2018-01-14 22:35   ` [PATCH v3 07/13] SoftiWarp application buffer management Bernard Metzler
2018-01-14 22:35   ` [PATCH v3 08/13] SoftiWarp Queue Pair methods Bernard Metzler
2018-01-14 22:35   ` [PATCH v3 09/13] SoftiWarp transmit path Bernard Metzler
2018-01-14 22:36   ` [PATCH v3 10/13] SoftiWarp receive path Bernard Metzler
2018-01-14 22:36   ` [PATCH v3 11/13] SoftiWarp Completion Queue methods Bernard Metzler
2018-01-14 22:36   ` [PATCH v3 12/13] SoftiWarp debugging code Bernard Metzler
2018-01-14 22:36   ` [PATCH v3 13/13] Add SoftiWarp to kernel build environment Bernard Metzler
2018-01-17 16:07   ` [PATCH v3 00/13] Request for Comments on SoftiWarp Steve Wise
2018-01-18  7:29     ` Leon Romanovsky
     [not found]     ` <20180118072958.GW13639-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2018-01-18 17:03       ` Bernard Metzler
     [not found]         ` <OFC86A1EC5.60F00E9A-ON00258219.005DA6A4-00258219.005DBBB9-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
2018-01-18 21:52           ` Steve Wise
2018-01-23 16:31   ` Steve Wise
2018-01-23 16:44     ` Jason Gunthorpe
2018-01-23 17:31 ` Bernard Metzler
2018-02-02 14:37 ` Bernard Metzler
     [not found]   ` <OFD1018BEE.35589194-ON00258228.00505F2B-00258228.00505F31-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
2018-02-02 18:56     ` Jason Gunthorpe
     [not found]   ` <20180202185640.GC9080-uk2M96/98Pc@public.gmane.org>
2018-02-04 20:08     ` Bernard Metzler
     [not found]       ` <OFEF72EEE6.7E5C3FA5-ON0025822A.005B18E4-0025822A.006EA4BB-8eTO7WVQ4XIsd+ienQ86orlN3bxYEBpz@public.gmane.org>
2018-02-04 20:39         ` Jason Gunthorpe
